Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat-accepted_hits.bam file shows more read with samtools flagstat

    Hi i m using tophat-cufflink pipeline for RNA seq data analysis. i have ~26 million high quality paired end reads each pair 1 and pair2. i mapped it on reference genome with tophat and output file accepted_hits.bam generated. then i cheked it with samtools and took stats with samtools flagstat it show the result mentioned below:

    81357362 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    81357362 + 0 mapped (100.00%:nan%)
    81357362 + 0 paired in sequencing
    40660320 + 0 read1
    40697042 + 0 read2
    74893468 + 0 properly paired (92.05%:nan%)
    79296710 + 0 with itself and mate mapped
    2060652 + 0 singletons (2.53%:nan%)
    4103348 + 0 with mate mapped to a different chr
    78008 + 0 with mate mapped to a different chr (mapQ>=5)

    its shows more number of reads in read 1 and read2 and then properly paired 92.5% in compare of reads used in mapping. can anyone guide me or tell me how is it possible or why this is happened?? i m stucked with this problem when mapping with tophat. CAN ANYBODY TELL ME THE REASON? How can i know proper mapping percentage? THANKS.

  • #2
    Couple of points here:

    1) Tophat puts mapped reads and unmapped reads in different files. Therefore the accepted_hits.bam file only contains the mapped reads. This is the reason why it says 100% mapped (which is not true).

    2) You could have reads that are multimapped and tophat reports the multimapped reads as well. THerefore, the number of mapped read may be higher than they really are. When a read is multimapped, tophat will mark 1 read to be primary and other reads to be secondary. Samtools flagstat does not take multimapping into consideration. Therefore, to get the read number of primary mapped reads, you can just use a command like below:
    Code:
    samtools -F 256 accepted_hits.bam|wc -l
    Once you have this number, you can just divide by the total number of reads (x100) and you will be percent mapped.

    Look here for more bitwise flags: http://picard.sourceforge.net/explain-flags.html

    3) Read 1 and read 2 are just showing all the reads that are 1st in pair and 2nd in pair, respectively. Therefore the numbers will differ since 1 read in the pair could've mapped when the other one did not. The "properly paired" number is when both reads in pair are mapped, which excludes all the reads that mapped but it's mate did not.

    Comment


    • #3
      Hi sagarc88

      Which one Tophat or Tophat2 does have the ability multimapper as you mention? I tried using the samtools -F 256, it gave me 0. Does it mean my bam file doesn't have multimapped reads?

      Originally posted by sagarc88 View Post
      Couple of points here:

      1) Tophat puts mapped reads and unmapped reads in different files. Therefore the accepted_hits.bam file only contains the mapped reads. This is the reason why it says 100% mapped (which is not true).

      2) You could have reads that are multimapped and tophat reports the multimapped reads as well. THerefore, the number of mapped read may be higher than they really are. When a read is multimapped, tophat will mark 1 read to be primary and other reads to be secondary. Samtools flagstat does not take multimapping into consideration. Therefore, to get the read number of primary mapped reads, you can just use a command like below:
      Code:
      samtools -F 256 accepted_hits.bam|wc -l
      Once you have this number, you can just divide by the total number of reads (x100) and you will be percent mapped.

      Look here for more bitwise flags: http://picard.sourceforge.net/explain-flags.html

      3) Read 1 and read 2 are just showing all the reads that are 1st in pair and 2nd in pair, respectively. Therefore the numbers will differ since 1 read in the pair could've mapped when the other one did not. The "properly paired" number is when both reads in pair are mapped, which excludes all the reads that mapped but it's mate did not.

      Comment


      • #4
        correction to above:

        Code:
        samtools view -F 256 accepted_hits.bam | wc -l
        My two cents -

        If you're using Tophat2 then, by default, multi-mapped reads are not reported. That means accepted_hits.bam contains only primary alignments with one alignment per read. Most of them are paired.

        You can run samtools flagstat on the unmapped BAM file that's in the Tophat output folder. Take the total reads in that file (first row) and add it to the total reads in the flagstats you already ran (first row). That's your total reads - now you can compute percent aligned.

        samtools flagstat shows you a few things. The first line shows you how many total reads are in the file. the "read1" and "read2" rows show you how many first and second mates are in there. If you add "read1" and "read2" you should find the total number from the first row. tophat's output will always show that the number mapped (3rd row) is equal to the number in the first row. the fourth row shows you how many of the aligned reads came from pairs - sort of redundant but I guess if you had a mixture of paired and unpaired data in a single BAM this value would be useful. the "properly paired" row gives you the number of reads that aligned as pairs properly which is defined by the aligner. Most likely this means they are oriented properly and their insert size is correct as far as the aligner can tell. the "with itself and mate mapped" includes the proper pairings as well as those that aren't proper including those with mates mapped to different references (the second to last row) and those that have outlier insert sizes. if you add the "singletons" row to the "with itself and mate mapped" row you'll have the total mapped value from the third row.

        So the only question here is why this BAM file claims to have nearly 2x the number of pairs as you say you have in the data. If your fastq files are gzipped, try this:

        Code:
        gunzip -c <reads_file> | wc -l
        if not gzipped...
        Code:
        wc -l <reads_file>
        where <reads_file> is the file name for either of the two mate files. divide the output of that command by 4 - this is your number of pairs.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #5
          so I was wrong - I just did a Tophat run and it does include secondary alignments in accepted_hits.bam. It's a little confusing because Tophat has an option '--report-secondary-alignments' which, as it reads, seems like it is meant to ENABLE this behavior. Maybe they made a mistake.

          So what you want to do is this...

          Code:
          samtools view -bF 0x100 accepted_hits.bam > primary_hits.bam
          samtools merge merged.bam primary_hits.bam unmapped.bam
          samtools sort -n merged.bam merged.nsort
          samtools fixmate merged.nsort.bam final.nsort.bam
          samtools flagstat final.nsort.bam
          final.nsort.bam will contain all of the original reads, aligned an unaligned, without any secondary alignments. When you run flagstats on that you will get the actual % mapped in the third line of the output. Also the 'read1' and 'read2' lines should reflect the number of pairs you said you should have (~26M).

          before you use that 'final.nsort.bam' for other analysis (like cufflinks or something) you'll need to coordinate sort it

          Code:
          samtools sort final.nsort.bam final.sorted
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          Comment


          • #6
            From the manual:
            -g/--max-multihits <int> Instructs TopHat to allow up to this many alignments to the reference for a given read, and choose the alignments based on their alignment scores if there are more than this number. The default is 20 for read mapping. Unless you use --report-secondary-alignments, TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments. In case of using --report-secondary-alignments, TopHat will try to report alignments up to this option value, and TopHat may randomly output some of the alignments with the same score to meet this number.
            --report-secondary-alignments By default TopHat reports best or primary alignments based on alignment scores (AS). Use this option if you want to output additional or secondary alignments (up to 20 alignments will be reported this way, this limit can be changed by using the -g/--max-multihits option above).


            I understand that with the default options tophat will report up to 20 multimapped alignments for a given read.

            Comment


            • #7
              Hi,
              I aligned through bwa mem fastq files generated from ion torrent pgm in order to analyze data with breakdancer to detect structural variants.
              However samtools flagstat gives these results:
              329861 + 0 in total (QC-passed reads + QC-failed reads)
              0 + 0 duplicates
              308567 + 0 mapped (93.54%:-nan%)
              0 + 0 paired in sequencing
              0 + 0 read1
              0 + 0 read2
              0 + 0 properly paired (-nan%:-nan%)
              0 + 0 with itself and mate mapped
              0 + 0 singletons (-nan%:-nan%)
              0 + 0 with mate mapped to a different chr
              0 + 0 with mate mapped to a different chr (mapQ>=5)

              Data are paired-end. Why did I obtain these results?

              Comment


              • #8
                Originally posted by simobioinfo View Post

                Data are paired-end. Why did I obtain these results?
                The simple answer is that they are not paired. How did you run 'bwa mem'?

                Comment


                • #9
                  The data are paired for sure.
                  I ran bwa mem
                  bwa mem ../../hg19.fa Campione_2.fastq > Campione_2.sam
                  Then through
                  samtools view -bS campione_2.sam > campione_2.bam
                  and then samtools flagstat.

                  I think that the aligner could be the problem. What do you think about?

                  Comment


                  • #10
                    The aligner is doing everything correctly. "bwa mem ../../hg19.fa Campione_2.fastq > Campione_2.sam" is how you would align single end reads. You meant to run "bwa mem ../../hg19.fa Campione_1.fastq Campione_2.fastq > Campione.sam". This assumes that the data is paired-end to begin with.

                    Comment


                    • #11
                      Then I have to separate my fastq files into two files... and how?

                      Comment


                      • #12
                        If they're paired end reads then they're normally already in two files. Why do you believe that you have a paired-end dataset?

                        Comment


                        • #13
                          Because the experiment is performed as paired end

                          Comment


                          • #14
                            Presumably the file is interleaved then (show the first 8 lines). Tophat can't use files like that. Either find a program to deinterleave them (e.g., seqtk) or use a different aligner. BBMap is the only aligner that I know of that might be able to handle that.

                            Next time, ask your sequencing provider to give you normal fastq files, not interleaved ones.

                            Comment


                            • #15
                              Well, some people like interleaved files. Not myself but I do believe that Brian Bushnell is a proponent of them. Speaking of which I think that his reformat.sh program will split apart an interleaved file like so:

                              reformat.sh in=interleave.fastq out=R1.fastq out2=R2.fastq

                              As I said I do not use interleaved files very often so the above may not work but it is worth a try.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X