Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • tophat bam shows more reads than the library

    Hi there,

    I've run tophat (version 2.0.4 and encode gtf) to a paired-end 50bp RNA-Seq data . The fastq files shows the total number of reads is 36,497,412, after tophat run, I used samtools to check the bam file, I got:

    samtools flagstat accepted_hits.bam
    38,502,265 in total
    0 QC failure
    0 duplicates
    38502265 mapped (100.00%)
    38502265 paired in sequencing
    19645891 read1
    18856374 read2
    17808682 properly paired (46.25%)
    32180420 with itself and mate mapped
    6321845 singletons (16.42%)
    4384082 with mate mapped to a different chr
    301920 with mate mapped to a different chr (mapQ>=5)

    My questions are:

    1. Why is the total number of reads (38,502,265) from bam file is larger than my original library size (36,497,41)?

    2. The flagstat shows that there is no unmapped reads. In this case, what are the proper ways to calculate the percentage of unmapped reads?

    Thanks very much. Any suggestions are welcome.

  • #2
    But the number 36,497,412 is relative to R1 and R2 fastq file together?

    Comment


    • #3
      yes, the 36,497,412 is the total number of my raw reads.

      Comment


      • #4
        TopHat allows reads to map to more than one place in the genome (multihits). This can occur for reads that overlap with elements that repeat in the transcriptome. This means that those reads get counted twice since each multiread alignment is a separate line in the bam file.

        Tophat also does not keep reads that don't align in the bam file, so you'd have to go back and count the number of reads in the original fastq file (basically the number of lines in the fastq file divided by 4) to get the original read count including the reads that do not align and look at the difference between that and the bam file. You can use samtools view -F flag to output a bam file without the multireads. In this case the you'd use -F 256 to return a bam file without aligned reads that are not the primary alignment (see http://picard.sourceforge.net/explain-flags.html for more explanation of how to set the -f/-F flag to get the types of reads you want).

        Edit: Also, by default, tophat will give back up to 20 alignments for the same read. You can change this option with -g/--max-multihits <int> on the command line as noted in the manual.
        Last edited by jb2; 02-07-2013, 01:29 PM.

        Comment


        • #5
          Thanks a lot, jb2.

          Comment


          • #6
            I am currently stuck with a similar problem, maybe someone here can help.
            Ultimately I am trying to determine the number of reads that were mapped by TopHat.
            STEP 1
            First I used "samtools flagstat" on accepted_hits.bam:
            76865299 + 0 in total (QC-passed reads + QC-failed reads)
            0 + 0 duplicates
            76865299 + 0 mapped (100.00%:-nan%)
            0 + 0 paired in sequencing
            0 + 0 read1
            0 + 0 read2
            0 + 0 properly paired (-nan%:-nan%)
            0 + 0 with itself and mate mapped
            0 + 0 singletons (-nan%:-nan%)
            0 + 0 with mate mapped to a different chr
            0 + 0 with mate mapped to a different chr (mapQ>=5)
            As you can see, I have unpaired reads.
            If I understand jb2 correctly, samtools spuriously tells me 100% were mapped because it can't see any of the reads that TopHat threw out. While it looks kind of suspicious that someone would write code to tell you (some%) when it can't ever count unmapped reads, I can believe that's what's happening here. The problem is that when I run "wc -l" on the original fastq file and divide by four, I get 70,247,418. This is significantly less than 76,865,299. As far as I can tell, this may be because TopHat aligns some reads multiple times, so samtools is giving me a number that includes duplicates and increasing my percentage of mapped reads beyond 100%.

            STEP 2
            To get around this hurdle and determine the number of unique reads aligned, I used a script from this thread discussing techniques for counting reads:
            samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
            Which output: 63,702,278
            Great! So it looks like I have 70,247,418 reads; 63,702,278 of which were mapped with TopHat, some more than once; giving me 76,865,299 total mapped reads (including duplicates).

            STEP 3 - TopHat log files
            Here's what "bowtie.left_kept_reads" tells me:
            70221973 reads; of these:

            70221973 (100.00%) were unpaired; of these:

            11079060 (15.78%) aligned 0 times

            44347197 (63.15%) aligned exactly 1 time

            14795716 (21.07%) aligned >1 times

            84.22% overall alignment rate
            Do you guys think the tophat log outputs correspond well enough with the numbers from STEP 2?
            Why does samtools flagstat pretend to give a mapping percentage if it can't ever count the unmapped reads?
            Anyone know a better way to count reads?

            I appreciate anyone who can help, since this was a longer post than I first intended.
            Thanks!

            Comment


            • #7
              Do you guys think the tophat log outputs correspond well enough with the numbers from STEP 2?
              Why does samtools flagstat pretend to give a mapping percentage if it can't ever count the unmapped reads?
              Anyone know a better way to count reads?

              I appreciate anyone who can help, since this was a longer post than I first intended.
              Thanks!
              Of course samtools can count unmapped reads. It does so for me everyday. It counts every .bam entry with the 4 flagged.

              But it can't count what isn't there. It's Bowtie that is throwing them out. (in contrast, when bwa makes a .sam file, it leaves unmapped reads in) You can set bowtie to write the unmapped reads to another file.

              Comment


              • #8
                Small correction, it is TopHat that is throwing them out. I am pretty sure that Bowtie (when run by itself) leaves the unmapped reads in the output.

                Comment


                • #9
                  Actually unmapped hits are written separately under unmapped.bam. You can run samtools -flagstat on that and get the number of unmapped reads (QC pass + QC fail)

                  Comment


                  • #10
                    Originally posted by Siva View Post
                    Actually unmapped hits are written separately under unmapped.bam. You can run samtools -flagstat on that and get the number of unmapped reads (QC pass + QC fail)
                    Keep in mind that the QC pass/fail numbers only mean something if there is software in your pipeline somewhere that checks for QC and sets/changes the flag in the .bam appropriately. SAMTools flagstat is just reading the flags, it's not setting them. I don't know that bowtie/tophat will attempt to set or change that flag at all.

                    Comment


                    • #11
                      Originally posted by swbarnes2 View Post
                      Keep in mind that the QC pass/fail numbers only mean something if there is software in your pipeline somewhere that checks for QC and sets/changes the flag in the .bam appropriately. SAMTools flagstat is just reading the flags, it's not setting them. I don't know that bowtie/tophat will attempt to set or change that flag at all.
                      Thanks! I do understand your point about QC.

                      I have some tophat2 generated BAM files (50 base paired end RNA seq). I ran tophat with all default options. I want to get the number of uniquely mapped reads. I get a feeling that it isn't all that straight forward to get them. I am walking you through my steps and would like to get comments.

                      On the original accepted_hits.bam file I ran samtools -flagstat and got the following
                      Code:
                      145949292 + 0 in total (QC-passed reads + QC-failed reads)
                      0 + 0 duplicates
                      145949292 + 0 mapped (100.00%:-nan%)
                      145949292 + 0 paired in sequencing
                      73274066 + 0 read1
                      72675226 + 0 read2
                      114124162 + 0 properly paired (78.19%:-nan%)
                      138690450 + 0 with itself and mate mapped
                      7258842 + 0 singletons (4.97%:-nan%)
                      13016262 + 0 with mate mapped to a different chr
                      417680 + 0 with mate mapped to a different chr (mapQ>=5)
                      In order to get uniquely mapped reads, I think in in tophat (bowtie2)....the tag that signifies (highest probability) of being unique is "NH:i:1". So I then grepped for only lines with "NH:i:1" tag and got a sam file. I then made a bam file (accepted_hits_NH1.bam) and performed samtools flagstat and got the following:

                      Code:
                      108602832 + 0 in total (QC-passed reads + QC-failed reads)
                      0 + 0 duplicates
                      108602832 + 0 mapped (100.00%:-nan%)
                      108602832 + 0 paired in sequencing
                      54503639 + 0 read1
                      54099193 + 0 read2
                      95543890 + 0 properly paired (87.98%:-nan%)
                      104454694 + 0 with itself and mate mapped
                      4148138 + 0 singletons (3.82%:-nan%)
                      1058650 + 0 with mate mapped to a different chr
                      417680 + 0 with mate mapped to a different chr (mapQ>=5)
                      Is it then safe to assume that the number of uniquely mapped reads are the sum of properly paired and singletons from my NH:i:1 only bam file. Thus
                      No. of properly paired reads (95543890) + singletons (4148138) = 99692028.

                      By the way...is it better if I re -run tophat with
                      --report secondary alignments
                      ON so that I can see the difference between the AS and XS (secondary alignment) and use that as a measure of uniqueness?
                      Last edited by Siva; 03-12-2013, 11:05 AM. Reason: Adding info

                      Comment


                      • #12
                        Try using:

                        Code:
                        cut -f 1 <samfile> | sort | uniq -u | wc -l
                        Using the --report secondary alignments will force tophat2 to report a lower quality secondary alignment if one exists. The choice to use it depends I guess on how stringent you want to be in defining "uniquely mapped". It will lower a great deal the number of uniquely mapped reads and I would not typically use it.

                        Comment


                        • #13
                          Originally posted by chadn737 View Post
                          Try using:

                          Code:
                          cut -f 1 <samfile> | sort | uniq -u | wc -l
                          Hi Chad...
                          This will give a set of unique read IDs but not always uniquely mapped (aligned reads). I think it depends (as you point out) on how we ask tophat to align.

                          In default mode tophat reports best aligned reads with varying AS scores.The higher the AS score, the more uniquely mapped it is. We can ask tophat to report up to 20 extra alignments using --report-secondary-alignments. The optional tags AS (primary alignment score) and XS (Extra alignment score) can be used to report the read as uniquely mapped or not.

                          cheers
                          Siva
                          Last edited by Siva; 03-13-2013, 09:04 AM. Reason: typos

                          Comment


                          • #14
                            You're right Siva. That code goes off of the read IDs, so it will tell you the number of unique IDs in that file.

                            With the older versions of Tophat, where all unique reads simply had a score of 255, and multimapped reads were either 0,1,2,3 then there was no way to really distinguish this on the basis of mapping scores in the sam file.

                            So ultimately it depends on how stringently you want to define "unique" and what your purposes for that is.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Recent Advances in Sequencing Analysis Tools
                              by seqadmin


                              The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                              05-06-2024, 07:48 AM
                            • seqadmin
                              Essential Discoveries and Tools in Epitranscriptomics
                              by seqadmin




                              The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                              04-22-2024, 07:01 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Today, 06:35 AM
                            0 responses
                            10 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 02:46 PM
                            0 responses
                            16 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 05-07-2024, 06:57 AM
                            0 responses
                            14 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 05-06-2024, 07:17 AM
                            0 responses
                            18 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X