
  • RNA-Seq sequence counts fastq and BAM files

    Hi:
    I am a bit confused about the total number of reads obtained from an RNA-Seq run.

    In the case of a paired-end run, should both the R1 and R2 reads in the FASTQ files be counted to get the total number of reads?

    Example:

    Sample-S was run on two lanes in a 2x100 PE read configuration.

    Sample-S - R1 in lane 1 - 30 million reads (as per the FASTQ file)
    Sample-S - R2 in lane 1 - 30 million reads

    Sample-S - R1 in lane 2 - 35 million reads
    Sample-S - R2 in lane 2 - 35 million reads


    If I want to know the total number of reads I sequenced for Sample-S, is it 130 million reads or 65 million reads?

    Question 2:
    The total reads reported by samtools flagstat is close to double the read count of a single FASTQ file. Why is this? Is it because in a paired-end BAM file both the R1 and R2 reads are mapped and counted?
    In that case, should one count both the R1 and R2 reads in the FASTQ files?

    Appreciate your help.

    Adrian

  • #2
    1) I would say "65 million read pairs", rather than "130 million reads", since the latter is always a bit ambiguous.
    2) samtools flagstat counts both reads in a pair, so you should expect to see double. The reason for this is to allow people to mix paired and single-end reads and still get meaningful metrics. Note also that there are separate metrics printed specifically for read1 and read2 in a pair (these numbers should more closely match the numbers from the FASTQ files).
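To make the arithmetic concrete, here is a minimal Python sketch using the lane counts from the question. The `count_fastq_records` helper and the `summarize` function are hypothetical names introduced only for illustration, and the helper assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import gzip

def count_fastq_records(path):
    """Count reads in a FASTQ file, assuming exactly 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4

# Per-lane read counts taken from the question (one FASTQ record = one read).
lanes = {
    "lane1": {"R1": 30_000_000, "R2": 30_000_000},
    "lane2": {"R1": 35_000_000, "R2": 35_000_000},
}

def summarize(lanes):
    """Return (read_pairs, individual_reads) summed over all lanes."""
    pairs = 0
    reads = 0
    for name, counts in lanes.items():
        # In a paired-end run every R1 read should have an R2 mate.
        if counts["R1"] != counts["R2"]:
            raise ValueError(f"{name}: R1 and R2 counts differ")
        pairs += counts["R1"]
        reads += counts["R1"] + counts["R2"]
    return pairs, reads

read_pairs, individual_reads = summarize(lanes)
print(f"{read_pairs:,} read pairs = {individual_reads:,} individual reads")
# prints "65,000,000 read pairs = 130,000,000 individual reads"
```

The second figure is the one to compare against the flagstat total, while the first corresponds to the count of a single R1 (or R2) FASTQ file summed over lanes.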



    • #3
      CASAVA reports the stats as "X million reads", though in reality that is X/2 million read pairs per sample per lane for a paired-end run.

      In a similar vein, one terabase of sequence from a HiSeq 2500 counts the output of *two* flowcells from one instrument.



      • #4
        Originally posted by adrian
        Hi:
        In the case of a paired-end run, should both the R1 and R2 reads in the FASTQ files be counted to get the total number of reads?
        When you make the PE library there is one 'insert', or sequence, between two primers, say R1 and R2. Each primer can hybridise to the flowcell. One end (R1) hybridises first. The read 1 sequence is then 100 bp into the insert from that direction. With PE sequencing the R2 end is then hybridised to the flowcell, and the read 2 sequence is 100 bp into the insert from the other direction. Thus it is really only a single sequence with an unknown stretch in between (often, confusingly, called an insert), and so should be counted as such, as dpryan says.

        As a toy example, if you had a PE library sequenced with 10 bp reads, with an original insert of 30 bp:

        R1: ACTGACTGAC----------ACTGACTGAC :R2

        A read pair like this is also more likely to align to a single position in the transcriptome than a single read is, which is why it is a good sequencing strategy.
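The toy example can be sketched in Python as follows. The fragment sequence is hypothetical (the unread middle is filled with `N`), the function names are made up for illustration, and the sketch assumes the usual Illumina convention in which read 2 is reported from the opposite strand, so it appears as the reverse complement of the far end of the fragment rather than as drawn in the diagram.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def paired_end_reads(fragment, read_length):
    """Return (read1, read2) for one fragment: one pair, two reads."""
    read1 = fragment[:read_length]
    # Read 2 starts from the other end and is read on the opposite strand.
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2

# Hypothetical 30 bp fragment: 10 bp read, 10 bp unread, 10 bp read.
fragment = "ACTGACTGAC" + "N" * 10 + "ACTGACTGAC"
r1, r2 = paired_end_reads(fragment, read_length=10)
unread = len(fragment) - 2 * 10

print("R1:", r1)          # prints "R1: ACTGACTGAC"
print("R2:", r2)          # prints "R2: GTCAGTCAGT"
print("unread bp:", unread)  # prints "unread bp: 10"
```

Both reads come from the same fragment, which is why the pair is counted as one sequenced molecule even though it produces two FASTQ records.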



        • #5
          Thanks for replies. I got it.
