  • How to tell whether a read is Read 1 (forward) or Read 2 (reverse) from FASTQ contents

    According to the FASTQ format description on Wikipedia (https://en.wikipedia.org/wiki/FASTQ_format), Illumina sequence identifiers take the following forms:

    Case A. Standard Illumina format
    Read identifier: @HWUSI-EAS100R:6:73:941:1973#0/1
    /1 indicates R1, i.e. the forward read, and
    /2 indicates R2, i.e. the reverse read.

    Case B. Illumina with Casava 1.8
    Read identifier: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
    In 1:Y:18:ATCACG, the leading 1: indicates R1.
    In 2:Y:18:ATCACG, the leading 2: indicates R2.

    Case C. NCBI Sequence Read Archive (SRA) FASTQ format
    Read identifier: @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

    Here are the first four lines from each file of a paired-end data set:

    ==> SRR1583191_1.fastq <==
    @SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    NATCCAGTAGCCTCCTCCCCATCATCTCCCATTTCTTCTACAGGGGGACTCCCCCAGGTCTGGTAGCCCAAAGCTGCTGCTACAGCCGCCATGGGGGGGTG
    +SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    #1=DDFFFHHGHGIIIIIIIBFHCHIIIIIEHIIGIIGIIIIHIIIIGIIIIIIIIGHCHFEFFFCEEECBBCCCCCCCCCCCCCCCCBB9@ACABBCB09

    ==> SRR1583191_2.fastq <==
    @SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    TCCTGTTCTCCCTGCTTGGAGTCTTGGTTGCCTGTGGAAATATCAGGCATGTGAATGGGAAGGCAGGAGTAGACAGTGAATGTGGCCTACTTGATTTGAGG
    +SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
    CCCFFFFFGHHGHJJJJIICGFGHHGGHIIIIIGFCG9CGHEHIIJJJHIGHJIIIJJIHIIIJIJJIHCEEHCEFEF3@C@CCCDBDCDDDDCCCDDDDD

    In Case C, it is not clear which substring of the read identifier can be used to distinguish R1 from R2.
    I looked through paired-end files from SRA but could not find any R1 or R2 marker in the identifiers.

    I would like to know how to get the R1/R2 information from the FASTQ file contents. Apart from the three cases above, I would also like to know whether other FASTQ read identifier formats contain substrings that provide R1/R2 information.
    Last edited by vaibhavvsk; 12-23-2015, 03:26 AM.
    Vaibhav Kulkarni
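
The three cases in the question can be checked with a short script. This is a minimal sketch, not a definitive parser: the function name `read_number` and both regular expressions are illustrative assumptions based only on the example identifiers quoted above, and other header dialects may need additional patterns.

```python
import re
from typing import Optional

# Case A (older Illumina): the first token ends in /1 or /2.
OLD_ILLUMINA = re.compile(r"/([12])$")
# Case B (Casava 1.8): a second token of the form <read>:<Y|N>:<control>:<index>.
CASAVA_18 = re.compile(r"^\S+\s+([12]):[YN]:\d+:\S*$")


def read_number(header: str) -> Optional[int]:
    """Return 1 or 2 if the header encodes the mate, otherwise None."""
    header = header.strip()
    if not header:
        return None
    match = CASAVA_18.match(header)
    if match:
        return int(match.group(1))
    # Only inspect the first whitespace-separated token for the /1 or /2 suffix.
    match = OLD_ILLUMINA.search(header.split()[0])
    if match:
        return int(match.group(1))
    # Case C (SRA-style defline): no mate information in the header itself.
    return None
```

For the SRA-style identifiers shown above, the function returns `None`, which matches the observation that the header alone does not say whether a read is R1 or R2; in that case only the file name suffix (`_1` or `_2`) carries the information.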

  • #2
    If you use the

    -F | --origfmt Defline contains only original sequence name.

    option when extracting the fastq files from SRA, you could potentially recover the original Illumina fastq header.
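
As a rough illustration of the suggestion above, the extraction could be scripted as follows. This sketch assumes the SRA Toolkit is installed and `fastq-dump` is on the PATH; the helper names are hypothetical, and `--split-files` is an added assumption (it writes the two mates to separate `_1` and `_2` files) that is not part of the original suggestion.

```python
import subprocess
from typing import List


def build_fastq_dump_command(accession: str) -> List[str]:
    """Build a fastq-dump call that keeps only the original sequence name."""
    return [
        "fastq-dump",
        "--split-files",  # assumption: write mates to <accession>_1.fastq and <accession>_2.fastq
        "--origfmt",      # long form of -F: defline contains only the original sequence name
        accession,
    ]


def extract(accession: str) -> None:
    """Run fastq-dump; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_fastq_dump_command(accession), check=True)


# Example (not executed here): extract("SRR1583191")
```

Whether the recovered name actually contains a /1, /2, 1: or 2: marker depends on what the submitter uploaded, so the result is worth checking on the first few reads.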



    • #3
      None of the information in this string:

      SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101

      can be used as an identifier for R1 vs R2. The fields are things like the instrument serial number, flow cell ID, lane number, tile number and X/Y coordinates of the cluster.

      GenoMax's suggestion to recover the original header is the best way to get the information you're looking for.
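
The field layout described above can be made concrete with a small parser. This is a sketch under the assumption that the original read name follows the seven-field Casava 1.8 layout (instrument, run number, flow cell ID, lane, tile, X, Y); the class and function names are hypothetical.

```python
from typing import NamedTuple


class IlluminaName(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_sra_defline(defline: str) -> IlluminaName:
    """Extract the Illumina cluster fields from an SRA-style defline."""
    # Drop the leading '@' (or '+') and look for the token with seven colon-separated fields.
    for token in defline.lstrip("@+").split():
        parts = token.split(":")
        if len(parts) == 7:
            instrument, run, flowcell, lane, tile, x, y = parts
            return IlluminaName(
                instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y)
            )
    raise ValueError(f"no seven-field Illumina name found in: {defline!r}")
```

None of the seven fields encodes the mate: both `SRR1583191_1.fastq` and `SRR1583191_2.fastq` parse to the same tuple for a given cluster, which is why the header cannot distinguish R1 from R2.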



      • #4
        Originally posted by GenoMax View Post
        If you use the

        -F | --origfmt Defline contains only original sequence name.

        option when extracting the fastq files from SRA, you could potentially recover the original Illumina fastq header.

        Hey GenoMax, it worked for me. Thanks to Jessica_L too!
        Vaibhav Kulkarni
