  • How to know whether it's Read1 (Forward) or Read2 (Reverse) from fastq contents

    As per the FASTQ file description on Wikipedia (https://en.wikipedia.org/wiki/FASTQ_format), the sequence identifier formats are:
    Case A. Standard Illumina Format
    Read Identifier : @HWUSI-EAS100R:6:73:941:1973#0/1
    /1 indicates it is R1, i.e. Forward Read, and
    /2 indicates it is R2, i.e. Reverse Read

    Case B. Illumina with Casava 1.8
    Read Identifier : @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
    1:Y:18:ATCACG, i.e. the substring 1: indicates it is R1
    2:Y:18:ATCACG, i.e. the substring 2: indicates it is R2

    Case C. NCBI Sequence Read Archive (SRA) fastq format
    Read Identifier : @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

    Here are the first 4 lines from each file of a paired-end data set:

    ==> SRR1583191_1.fastq <==
    @SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    NATCCAGTAGCCTCCTCCCCATCATCTCCCATTTCTTCTACAGGGGGACTCCCCCAGGTCTGGTAGCCCAAAGCTGCTGCTACAGCCGCCATGGGGGGGTG
    +SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    #1=DDFFFHHGHGIIIIIIIBFHCHIIIIIEHIIGIIGIIIIHIIIIGIIIIIIIIGHCHFEFFFCEEECBBCCCCCCCCCCCCCCCCBB9@ACABBCB09

    ==> SRR1583191_2.fastq <==
    @SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    TCCTGTTCTCCCTGCTTGGAGTCTTGGTTGCCTGTGGAAATATCAGGCATGTGAATGGGAAGGCAGGAGTAGACAGTGAATGTGGCCTACTTGATTTGAGG
    +SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    CCCFFFFFGHHGHJJJJIICGFGHHGGHIIIIIGFCG9CGHEHIIJJJHIGHJIIIJJIHIIIJIJJIHCEEHCEFEF3@C@CCCDBDCDDDDCCCDDDDD

    From the Case C identifier, it's not clear which substring of the Read Identifier can be used to distinguish R1 from R2.
    I looked through paired-end files from SRA but could not find any R1 or R2 identifier.

    I would like to know how to get the R1/R2 information from the fastq file contents. Apart from these three cases, I would also like to know whether other fastq read identifier formats contain substrings that provide R1/R2 information.
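
For reference, Cases A and B above could be checked with something like the following. This is only a rough sketch: the function name `mate_number` and the two regular expressions are illustrative assumptions, not part of any tool, and they cover only the two header styles described above. For an SRA-style header (Case C) it returns None, since that header carries no mate information.

```python
import re

# Case A: older Illumina headers end in "/1" or "/2".
_SLASH_SUFFIX = re.compile(r"/([12])$")
# Case B: Casava 1.8 headers carry "<read>:<filtered Y/N>:<control number>:<index>"
# after the first space.
_CASAVA_COMMENT = re.compile(r"^([12]):[YN]:\d+(?::|$)")


def mate_number(header):
    """Return 1 or 2 if the header encodes the mate, otherwise None."""
    name, _, comment = header.strip().partition(" ")
    match = _SLASH_SUFFIX.search(name)
    if match:
        return int(match.group(1))
    match = _CASAVA_COMMENT.match(comment)
    if match:
        return int(match.group(1))
    return None


for line in (
    "@HWUSI-EAS100R:6:73:941:1973#0/1",
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG",
    "@SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101",
):
    print(line, "->", mate_number(line))
```
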
    Last edited by vaibhavvsk; 12-23-2015, 03:26 AM.
    Vaibhav Kulkarni

  • #2
    If you use the

    -F | --origfmt Defline contains only original sequence name.

    option when extracting the fastq files from SRA, you could potentially recover the original Illumina fastq header.
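
As a rough sketch of how that option might be used, the snippet below builds a `fastq-dump` command line from Python. The helper name `build_fastq_dump_command` is hypothetical, and combining `--origfmt` with `--split-files` and `--outdir` is an assumption about a typical paired-end extraction. Whether the original header actually comes back depends on what was submitted to SRA.

```python
def build_fastq_dump_command(accession, outdir="."):
    """Build (but do not run) a fastq-dump call that keeps the original deflines."""
    return [
        "fastq-dump",
        "--split-files",  # write mates to <accession>_1.fastq and <accession>_2.fastq
        "--origfmt",      # same as -F: defline contains only the original sequence name
        "--outdir", outdir,
        accession,
    ]


cmd = build_fastq_dump_command("SRR1583191")
print(" ".join(cmd))
# To actually run it (requires the SRA Toolkit on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

With `--split-files`, the file suffix (_1 or _2) still tells you which mate a read belongs to, even if the recovered header does not.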


    • #3
      None of the information in this string:

      SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101

      can be used as an identifier for R1 vs R2. The fields are things like the instrument serial number, flow cell ID, lane number, tile number and X/Y coordinates of the cluster.

      GenoMax's suggestion to recover the original header would be the best option to get the data you're looking for.
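
To illustrate why that string cannot separate the mates, here is a minimal sketch that splits it into named fields. The field names follow the Casava 1.8 layout described on the Wikipedia page linked in the first post, and `parse_illumina_name` is a hypothetical helper. Both mates of a pair come from the same cluster, so they parse to identical values.

```python
from collections import namedtuple

# Field order assumed from the Casava 1.8 identifier layout.
IlluminaName = namedtuple("IlluminaName", "instrument run flowcell lane tile x y")


def parse_illumina_name(name):
    """Split 'instrument:run:flowcell:lane:tile:x:y' into named fields."""
    parts = name.split(":")
    if len(parts) != 7:
        raise ValueError("expected 7 colon-separated fields, got %d" % len(parts))
    instrument, run, flowcell, lane, tile, x, y = parts
    return IlluminaName(instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y))


# The name is the same in SRR1583191_1.fastq and SRR1583191_2.fastq.
r1 = parse_illumina_name("SN7001163:87:C1ME6ACXX:1:1101:1176:2038")
r2 = parse_illumina_name("SN7001163:87:C1ME6ACXX:1:1101:1176:2038")
print(r1)
print(r1 == r2)
```
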


      • #4
        Originally posted by GenoMax
        If you use

        -F | --origfmt Defline contains only original sequence name.

        option when extracting the fastq files from SRA you would potentially recover original Illumina fastq header.

        Hey GenoMax, it worked for me. Thanks Jessica_L too!
        Vaibhav Kulkarni
