
  • How to tell whether a read is Read 1 (forward) or Read 2 (reverse) from FASTQ contents

    According to the FASTQ format description on Wikipedia (https://en.wikipedia.org/wiki/FASTQ_format), sequence identifiers take the following forms:

    Case A. Standard Illumina format
    Read identifier: @HWUSI-EAS100R:6:73:941:1973#0/1
    /1 indicates R1, i.e. the forward read, and
    /2 indicates R2, i.e. the reverse read.

    Case B. Illumina with Casava 1.8
    Read identifier: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
    In 1:Y:18:ATCACG, the leading 1: indicates R1.
    In 2:Y:18:ATCACG, the leading 2: indicates R2.

    Case C. NCBI Sequence Read Archive (SRA) FASTQ format
    Read identifier: @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
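
The three cases above can be sketched as a small header check. This is only an illustration, not an official parser: the function name `read_number` and the regular expressions are assumptions based on the header layouts quoted above, and it returns None when the header carries no read number (as in Case C).

```python
import re
from typing import Optional

# Case A: older Illumina headers end in "/1" or "/2".
_SUFFIX = re.compile(r"/([12])$")
# Case B: Casava 1.8 headers carry a second, space-separated field
# that starts with the read number, e.g. "1:Y:18:ATCACG".
_CASAVA18 = re.compile(r"([12]):[YN]:\d+")


def read_number(header: str) -> Optional[int]:
    """Return 1 or 2 if the header line encodes the mate, else None."""
    tokens = header.strip().split()
    if not tokens:
        return None
    # Case B: look at the second whitespace-separated field.
    if len(tokens) >= 2:
        m = _CASAVA18.match(tokens[1])
        if m:
            return int(m.group(1))
    # Case A: look for a trailing /1 or /2 on the first field.
    m = _SUFFIX.search(tokens[0])
    if m:
        return int(m.group(1))
    # Case C (SRA-style defline) and anything else: no read number present.
    return None


if __name__ == "__main__":
    for h in (
        "@HWUSI-EAS100R:6:73:941:1973#0/1",
        "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG",
        "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    ):
        print(h, "->", read_number(h))
```

For the SRA-style header the function returns None, which is exactly the problem described in this post.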

    Here are the first four lines from each file of a paired-end data set:

    ==> SRR1583191_1.fastq <==
    @SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    NATCCAGTAGCCTCCTCCCCATCATCTCCCATTTCTTCTACAGGGGGACTCCCCCAGGTCTGGTAGCCCAAAGCTGCTGCTACAGCCGCCATGGGGGGGTG
    +SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    #1=DDFFFHHGHGIIIIIIIBFHCHIIIIIEHIIGIIGIIIIHIIIIGIIIIIIIIGHCHFEFFFCEEECBBCCCCCCCCCCCCCCCCBB9@ACABBCB09

    ==> SRR1583191_2.fastq <==
    @SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    TCCTGTTCTCCCTGCTTGGAGTCTTGGTTGCCTGTGGAAATATCAGGCATGTGAATGGGAAGGCAGGAGTAGACAGTGAATGTGGCCTACTTGATTTGAGG
    +SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    CCCFFFFFGHHGHJJJJIICGFGHHGGHIIIIIGFCG9CGHEHIIJJJHIGHJIIIJJIHIIIJIJJIHCEEHCEFEF3@C@CCCDBDCDDDDCCCDDDDD

    From the Case C identifier, it is not clear which substring of the read identifier can be used to distinguish R1 from R2.
    I looked through paired-end files from SRA but could not find any R1 or R2 marker.

    I would like to know how to get R1/R2 information from the FASTQ file contents. Apart from these three cases, I would also like to know whether other FASTQ read identifier formats contain substrings that provide R1/R2 information.
    Last edited by vaibhavvsk; 12-23-2015, 03:26 AM.
    Vaibhav Kulkarni

  • #2
    If you use the

    -F | --origfmt Defline contains only original sequence name.

    option when extracting the FASTQ files from SRA, you can potentially recover the original Illumina FASTQ header.



    • #3
      None of the information in this string:

      SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101

      can be used as an identifier for R1 vs R2. The fields are things like the instrument serial number, flow cell ID, lane number, tile number and X/Y coordinates of the cluster.

      GenoMax's suggestion to recover the original header would be the best option to get the data you're looking for.
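
To make the point about the fields concrete, here is a rough sketch that splits that string into its parts. The names `ClusterId` and `parse_cluster_id` are made up for illustration, and the seven-field layout (with the second field taken to be the run number) is an assumption based on the usual Illumina naming convention rather than anything SRA guarantees.

```python
from typing import NamedTuple


class ClusterId(NamedTuple):
    """Fields of an Illumina-style read name (assumed seven-field layout)."""
    instrument: str
    run_number: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_cluster_id(name: str) -> ClusterId:
    """Split 'instrument:run:flowcell:lane:tile:x:y [extra]' into fields."""
    # Drop anything after the first whitespace, such as 'length=101'.
    core = name.split()[0]
    parts = core.split(":")
    if len(parts) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(parts)}")
    inst, run, flowcell, lane, tile, x, y = parts
    return ClusterId(inst, int(run), flowcell, int(lane), int(tile), int(x), int(y))


if __name__ == "__main__":
    # The same string appears in both SRR1583191_1.fastq and SRR1583191_2.fastq.
    mate1 = parse_cluster_id("SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101")
    mate2 = parse_cluster_id("SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101")
    print(mate1)
    print("identical for both mates:", mate1 == mate2)
```

Every field describes where the cluster sat on the flow cell, so both mates of a pair parse to the same tuple and none of the fields says which read it is.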



      • #4
        Originally posted by GenoMax
        If you use

        -F | --origfmt Defline contains only original sequence name.

        option when extracting the fastq files from SRA you would potentially recover original Illumina fastq header.

        Hey GenoMax, it worked for me. Thanks to Jessica_L too!
        Vaibhav Kulkarni
