Seqanswers Leaderboard Ad

**sagarc88** · 03-12-2013, 09:06 AM

Couple of points here:

1) Tophat puts mapped reads and unmapped reads in different files. Therefore the accepted_hits.bam file only contains the mapped reads. This is the reason why it says 100% mapped (which is not true).

2) You could have reads that are multimapped and tophat reports the multimapped reads as well. THerefore, the number of mapped read may be higher than they really are. When a read is multimapped, tophat will mark 1 read to be primary and other reads to be secondary. Samtools flagstat does not take multimapping into consideration. Therefore, to get the read number of primary mapped reads, you can just use a command like below:

Code:

samtools -F 256 accepted_hits.bam|wc -l

Once you have this number, you can just divide by the total number of reads (x100) and you will be percent mapped.

Look here for more bitwise flags: http://picard.sourceforge.net/explain-flags.html

3) Read 1 and read 2 are just showing all the reads that are 1st in pair and 2nd in pair, respectively. Therefore the numbers will differ since 1 read in the pair could've mapped when the other one did not. The "properly paired" number is when both reads in pair are mapped, which excludes all the reads that mapped but it's mate did not.

**wanfahmi** · 05-27-2013, 05:10 AM

Hi sagarc88

Which one Tophat or Tophat2 does have the ability multimapper as you mention? I tried using the samtools -F 256, it gave me 0. Does it mean my bam file doesn't have multimapped reads?

Originally posted by sagarc88 View Post

Couple of points here:

1) Tophat puts mapped reads and unmapped reads in different files. Therefore the accepted_hits.bam file only contains the mapped reads. This is the reason why it says 100% mapped (which is not true).

2) You could have reads that are multimapped and tophat reports the multimapped reads as well. THerefore, the number of mapped read may be higher than they really are. When a read is multimapped, tophat will mark 1 read to be primary and other reads to be secondary. Samtools flagstat does not take multimapping into consideration. Therefore, to get the read number of primary mapped reads, you can just use a command like below:

Code:

samtools -F 256 accepted_hits.bam|wc -l

Once you have this number, you can just divide by the total number of reads (x100) and you will be percent mapped.

Look here for more bitwise flags: http://picard.sourceforge.net/explain-flags.html

3) Read 1 and read 2 are just showing all the reads that are 1st in pair and 2nd in pair, respectively. Therefore the numbers will differ since 1 read in the pair could've mapped when the other one did not. The "properly paired" number is when both reads in pair are mapped, which excludes all the reads that mapped but it's mate did not.

**sdriscoll** · 05-29-2013, 05:38 PM

correction to above:

Code:

samtools view -F 256 accepted_hits.bam | wc -l

My two cents -

If you're using Tophat2 then, by default, multi-mapped reads are not reported. That means accepted_hits.bam contains only primary alignments with one alignment per read. Most of them are paired.

You can run samtools flagstat on the unmapped BAM file that's in the Tophat output folder. Take the total reads in that file (first row) and add it to the total reads in the flagstats you already ran (first row). That's your total reads - now you can compute percent aligned.

samtools flagstat shows you a few things. The first line shows you how many total reads are in the file. the "read1" and "read2" rows show you how many first and second mates are in there. If you add "read1" and "read2" you should find the total number from the first row. tophat's output will always show that the number mapped (3rd row) is equal to the number in the first row. the fourth row shows you how many of the aligned reads came from pairs - sort of redundant but I guess if you had a mixture of paired and unpaired data in a single BAM this value would be useful. the "properly paired" row gives you the number of reads that aligned as pairs properly which is defined by the aligner. Most likely this means they are oriented properly and their insert size is correct as far as the aligner can tell. the "with itself and mate mapped" includes the proper pairings as well as those that aren't proper including those with mates mapped to different references (the second to last row) and those that have outlier insert sizes. if you add the "singletons" row to the "with itself and mate mapped" row you'll have the total mapped value from the third row.

So the only question here is why this BAM file claims to have nearly 2x the number of pairs as you say you have in the data. If your fastq files are gzipped, try this:

Code:

gunzip -c <reads_file> | wc -l

if not gzipped...

Code:

wc -l <reads_file>

where <reads_file> is the file name for either of the two mate files. divide the output of that command by 4 - this is your number of pairs.

**sdriscoll** · 05-30-2013, 04:34 PM

so I was wrong - I just did a Tophat run and it does include secondary alignments in accepted_hits.bam. It's a little confusing because Tophat has an option '--report-secondary-alignments' which, as it reads, seems like it is meant to ENABLE this behavior. Maybe they made a mistake.

So what you want to do is this...

Code:

samtools view -bF 0x100 accepted_hits.bam > primary_hits.bam
samtools merge merged.bam primary_hits.bam unmapped.bam
samtools sort -n merged.bam merged.nsort
samtools fixmate merged.nsort.bam final.nsort.bam
samtools flagstat final.nsort.bam

final.nsort.bam will contain all of the original reads, aligned an unaligned, without any secondary alignments. When you run flagstats on that you will get the actual % mapped in the third line of the output. Also the 'read1' and 'read2' lines should reflect the number of pairs you said you should have (~26M).

before you use that 'final.nsort.bam' for other analysis (like cufflinks or something) you'll need to coordinate sort it

Code:

samtools sort final.nsort.bam final.sorted

**EGrassi** · 05-31-2013, 01:41 AM

From the manual:
-g/--max-multihits <int> Instructs TopHat to allow up to this many alignments to the reference for a given read, and choose the alignments based on their alignment scores if there are more than this number. The default is 20 for read mapping. Unless you use --report-secondary-alignments, TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments. In case of using --report-secondary-alignments, TopHat will try to report alignments up to this option value, and TopHat may randomly output some of the alignments with the same score to meet this number.
--report-secondary-alignments By default TopHat reports best or primary alignments based on alignment scores (AS). Use this option if you want to output additional or secondary alignments (up to 20 alignments will be reported this way, this limit can be changed by using the -g/--max-multihits option above).

I understand that with the default options tophat will report up to 20 multimapped alignments for a given read.

**simobioinfo** · 09-16-2015, 06:10 AM

Hi,
I aligned through bwa mem fastq files generated from ion torrent pgm in order to analyze data with breakdancer to detect structural variants.
However samtools flagstat gives these results:
329861 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
308567 + 0 mapped (93.54%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Data are paired-end. Why did I obtain these results?

**westerman** · 09-16-2015, 07:21 AM

Originally posted by simobioinfo View Post

Data are paired-end. Why did I obtain these results?

The simple answer is that they are not paired. How did you run 'bwa mem'?

**simobioinfo** · 09-22-2015, 12:40 AM

The data are paired for sure.
I ran bwa mem
bwa mem ../../hg19.fa Campione_2.fastq > Campione_2.sam
Then through
samtools view -bS campione_2.sam > campione_2.bam
and then samtools flagstat.

I think that the aligner could be the problem. What do you think about?

**dpryan** · 09-22-2015, 01:30 AM

The aligner is doing everything correctly. "bwa mem ../../hg19.fa Campione_2.fastq > Campione_2.sam" is how you would align single end reads. You meant to run "bwa mem ../../hg19.fa Campione_1.fastq Campione_2.fastq > Campione.sam". This assumes that the data is paired-end to begin with.

**simobioinfo** · 09-22-2015, 01:32 AM

Then I have to separate my fastq files into two files... and how?

**dpryan** · 09-22-2015, 01:38 AM

If they're paired end reads then they're normally already in two files. Why do you believe that you have a paired-end dataset?

**simobioinfo** · 09-22-2015, 01:39 AM

Because the experiment is performed as paired end

**dpryan** · 09-22-2015, 01:44 AM

Presumably the file is interleaved then (show the first 8 lines). Tophat can't use files like that. Either find a program to deinterleave them (e.g., seqtk) or use a different aligner. BBMap is the only aligner that I know of that might be able to handle that.

Next time, ask your sequencing provider to give you normal fastq files, not interleaved ones.

**westerman** · 09-22-2015, 10:20 AM

Well, some people like interleaved files. Not myself but I do believe that Brian Bushnell is a proponent of them. Speaking of which I think that his reformat.sh program will split apart an interleaved file like so:

reformat.sh in=interleave.fastq out=R1.fastq out2=R2.fastq

As I said I do not use interleaved files very often so the above may not work but it is worth a try.

Topics	Statistics	Last Post
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, Yesterday, 06:35 AM	0 responses 14 views 0 likes	Last Post by seqadmin Yesterday, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 19 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 18 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM

Seqanswers Leaderboard Ad

Announcement

Tophat-accepted_hits.bam file shows more read with samtools flagstat

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News