SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
Tophat: merging accepted_hits.bam and unmapped.bam offspring RNA Sequencing 36 08-13-2015 03:08 AM
Samtools flagstat on Tophat generated .bam file rg_gis Bioinformatics 6 02-27-2015 09:35 PM
tophat bam shows more reads than the library arrchi Bioinformatics 13 03-13-2013 10:31 AM
tophat problem: no accepted_hits.bam generated RNAer Bioinformatics 0 07-19-2011 12:18 PM
Split accepted_hits.bam file after Tophat run? hong_sunwoo Bioinformatics 6 10-18-2010 12:06 AM

Reply
 
Thread Tools
Old 03-12-2013, 08:39 AM   #1
mehtaaditya
Junior Member
 
Location: India

Join Date: Mar 2013
Posts: 9
Default Tophat-accepted_hits.bam file shows more read with samtools flagstat

Hi i m using tophat-cufflink pipeline for RNA seq data analysis. i have ~26 million high quality paired end reads each pair 1 and pair2. i mapped it on reference genome with tophat and output file accepted_hits.bam generated. then i cheked it with samtools and took stats with samtools flagstat it show the result mentioned below:

81357362 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
81357362 + 0 mapped (100.00%:nan%)
81357362 + 0 paired in sequencing
40660320 + 0 read1
40697042 + 0 read2
74893468 + 0 properly paired (92.05%:nan%)
79296710 + 0 with itself and mate mapped
2060652 + 0 singletons (2.53%:nan%)
4103348 + 0 with mate mapped to a different chr
78008 + 0 with mate mapped to a different chr (mapQ>=5)

its shows more number of reads in read 1 and read2 and then properly paired 92.5% in compare of reads used in mapping. can anyone guide me or tell me how is it possible or why this is happened?? i m stucked with this problem when mapping with tophat. CAN ANYBODY TELL ME THE REASON? How can i know proper mapping percentage? THANKS.
mehtaaditya is offline   Reply With Quote
Old 03-12-2013, 09:06 AM   #2
sagarc88
Junior Member
 
Location: East Coast

Join Date: Jul 2012
Posts: 2
Thumbs up

Couple of points here:

1) Tophat puts mapped reads and unmapped reads in different files. Therefore the accepted_hits.bam file only contains the mapped reads. This is the reason why it says 100% mapped (which is not true).

2) You could have reads that are multimapped and tophat reports the multimapped reads as well. THerefore, the number of mapped read may be higher than they really are. When a read is multimapped, tophat will mark 1 read to be primary and other reads to be secondary. Samtools flagstat does not take multimapping into consideration. Therefore, to get the read number of primary mapped reads, you can just use a command like below:
Code:
samtools -F 256 accepted_hits.bam|wc -l
Once you have this number, you can just divide by the total number of reads (x100) and you will be percent mapped.

Look here for more bitwise flags: http://picard.sourceforge.net/explain-flags.html

3) Read 1 and read 2 are just showing all the reads that are 1st in pair and 2nd in pair, respectively. Therefore the numbers will differ since 1 read in the pair could've mapped when the other one did not. The "properly paired" number is when both reads in pair are mapped, which excludes all the reads that mapped but it's mate did not.
sagarc88 is offline   Reply With Quote
Old 05-27-2013, 05:10 AM   #3
wanfahmi
Member
 
Location: North Sea

Join Date: Apr 2008
Posts: 34
Default

Hi sagarc88

Which one Tophat or Tophat2 does have the ability multimapper as you mention? I tried using the samtools -F 256, it gave me 0. Does it mean my bam file doesn't have multimapped reads?

Quote:
Originally Posted by sagarc88 View Post
Couple of points here:

1) Tophat puts mapped reads and unmapped reads in different files. Therefore the accepted_hits.bam file only contains the mapped reads. This is the reason why it says 100% mapped (which is not true).

2) You could have reads that are multimapped and tophat reports the multimapped reads as well. THerefore, the number of mapped read may be higher than they really are. When a read is multimapped, tophat will mark 1 read to be primary and other reads to be secondary. Samtools flagstat does not take multimapping into consideration. Therefore, to get the read number of primary mapped reads, you can just use a command like below:
Code:
samtools -F 256 accepted_hits.bam|wc -l
Once you have this number, you can just divide by the total number of reads (x100) and you will be percent mapped.

Look here for more bitwise flags: http://picard.sourceforge.net/explain-flags.html

3) Read 1 and read 2 are just showing all the reads that are 1st in pair and 2nd in pair, respectively. Therefore the numbers will differ since 1 read in the pair could've mapped when the other one did not. The "properly paired" number is when both reads in pair are mapped, which excludes all the reads that mapped but it's mate did not.
wanfahmi is offline   Reply With Quote
Old 05-29-2013, 05:38 PM   #4
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

correction to above:

Code:
samtools view -F 256 accepted_hits.bam | wc -l
My two cents -

If you're using Tophat2 then, by default, multi-mapped reads are not reported. That means accepted_hits.bam contains only primary alignments with one alignment per read. Most of them are paired.

You can run samtools flagstat on the unmapped BAM file that's in the Tophat output folder. Take the total reads in that file (first row) and add it to the total reads in the flagstats you already ran (first row). That's your total reads - now you can compute percent aligned.

samtools flagstat shows you a few things. The first line shows you how many total reads are in the file. the "read1" and "read2" rows show you how many first and second mates are in there. If you add "read1" and "read2" you should find the total number from the first row. tophat's output will always show that the number mapped (3rd row) is equal to the number in the first row. the fourth row shows you how many of the aligned reads came from pairs - sort of redundant but I guess if you had a mixture of paired and unpaired data in a single BAM this value would be useful. the "properly paired" row gives you the number of reads that aligned as pairs properly which is defined by the aligner. Most likely this means they are oriented properly and their insert size is correct as far as the aligner can tell. the "with itself and mate mapped" includes the proper pairings as well as those that aren't proper including those with mates mapped to different references (the second to last row) and those that have outlier insert sizes. if you add the "singletons" row to the "with itself and mate mapped" row you'll have the total mapped value from the third row.

So the only question here is why this BAM file claims to have nearly 2x the number of pairs as you say you have in the data. If your fastq files are gzipped, try this:

Code:
gunzip -c <reads_file> | wc -l
if not gzipped...
Code:
wc -l <reads_file>
where <reads_file> is the file name for either of the two mate files. divide the output of that command by 4 - this is your number of pairs.
sdriscoll is offline   Reply With Quote
Old 05-30-2013, 04:34 PM   #5
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

so I was wrong - I just did a Tophat run and it does include secondary alignments in accepted_hits.bam. It's a little confusing because Tophat has an option '--report-secondary-alignments' which, as it reads, seems like it is meant to ENABLE this behavior. Maybe they made a mistake.

So what you want to do is this...

Code:
samtools view -bF 0x100 accepted_hits.bam > primary_hits.bam
samtools merge merged.bam primary_hits.bam unmapped.bam
samtools sort -n merged.bam merged.nsort
samtools fixmate merged.nsort.bam final.nsort.bam
samtools flagstat final.nsort.bam
final.nsort.bam will contain all of the original reads, aligned an unaligned, without any secondary alignments. When you run flagstats on that you will get the actual % mapped in the third line of the output. Also the 'read1' and 'read2' lines should reflect the number of pairs you said you should have (~26M).

before you use that 'final.nsort.bam' for other analysis (like cufflinks or something) you'll need to coordinate sort it

Code:
samtools sort final.nsort.bam final.sorted
sdriscoll is offline   Reply With Quote
Old 05-31-2013, 01:41 AM   #6
EGrassi
Member
 
Location: Turin, Italy

Join Date: Oct 2010
Posts: 66
Default

From the manual:
-g/--max-multihits <int> Instructs TopHat to allow up to this many alignments to the reference for a given read, and choose the alignments based on their alignment scores if there are more than this number. The default is 20 for read mapping. Unless you use --report-secondary-alignments, TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments. In case of using --report-secondary-alignments, TopHat will try to report alignments up to this option value, and TopHat may randomly output some of the alignments with the same score to meet this number.
--report-secondary-alignments By default TopHat reports best or primary alignments based on alignment scores (AS). Use this option if you want to output additional or secondary alignments (up to 20 alignments will be reported this way, this limit can be changed by using the -g/--max-multihits option above).


I understand that with the default options tophat will report up to 20 multimapped alignments for a given read.
EGrassi is offline   Reply With Quote
Old 09-16-2015, 06:10 AM   #7
simobioinfo
Member
 
Location: italy

Join Date: Aug 2014
Posts: 38
Default

Hi,
I aligned through bwa mem fastq files generated from ion torrent pgm in order to analyze data with breakdancer to detect structural variants.
However samtools flagstat gives these results:
329861 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
308567 + 0 mapped (93.54%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Data are paired-end. Why did I obtain these results?
simobioinfo is offline   Reply With Quote
Old 09-16-2015, 07:21 AM   #8
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by simobioinfo View Post

Data are paired-end. Why did I obtain these results?
The simple answer is that they are not paired. How did you run 'bwa mem'?
westerman is offline   Reply With Quote
Old 09-22-2015, 12:40 AM   #9
simobioinfo
Member
 
Location: italy

Join Date: Aug 2014
Posts: 38
Default

The data are paired for sure.
I ran bwa mem
bwa mem ../../hg19.fa Campione_2.fastq > Campione_2.sam
Then through
samtools view -bS campione_2.sam > campione_2.bam
and then samtools flagstat.

I think that the aligner could be the problem. What do you think about?
simobioinfo is offline   Reply With Quote
Old 09-22-2015, 01:30 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The aligner is doing everything correctly. "bwa mem ../../hg19.fa Campione_2.fastq > Campione_2.sam" is how you would align single end reads. You meant to run "bwa mem ../../hg19.fa Campione_1.fastq Campione_2.fastq > Campione.sam". This assumes that the data is paired-end to begin with.
dpryan is offline   Reply With Quote
Old 09-22-2015, 01:32 AM   #11
simobioinfo
Member
 
Location: italy

Join Date: Aug 2014
Posts: 38
Default

Then I have to separate my fastq files into two files... and how?
simobioinfo is offline   Reply With Quote
Old 09-22-2015, 01:38 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

If they're paired end reads then they're normally already in two files. Why do you believe that you have a paired-end dataset?
dpryan is offline   Reply With Quote
Old 09-22-2015, 01:39 AM   #13
simobioinfo
Member
 
Location: italy

Join Date: Aug 2014
Posts: 38
Default

Because the experiment is performed as paired end
simobioinfo is offline   Reply With Quote
Old 09-22-2015, 01:44 AM   #14
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Presumably the file is interleaved then (show the first 8 lines). Tophat can't use files like that. Either find a program to deinterleave them (e.g., seqtk) or use a different aligner. BBMap is the only aligner that I know of that might be able to handle that.

Next time, ask your sequencing provider to give you normal fastq files, not interleaved ones.
dpryan is offline   Reply With Quote
Old 09-22-2015, 10:20 AM   #15
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Well, some people like interleaved files. Not myself but I do believe that Brian Bushnell is a proponent of them. Speaking of which I think that his reformat.sh program will split apart an interleaved file like so:

reformat.sh in=interleave.fastq out=R1.fastq out2=R2.fastq

As I said I do not use interleaved files very often so the above may not work but it is worth a try.
westerman is offline   Reply With Quote
Old 09-22-2015, 10:37 AM   #16
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Yeah, some people dislike chocolate too...there's no accounting for taste :P
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:04 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO