Seqanswers Leaderboard Ad

**Trudy** · 02-07-2013, 08:51 AM

But the number 36,497,412 is relative to R1 and R2 fastq file together?

**arrchi** · 02-07-2013, 10:03 AM

yes, the 36,497,412 is the total number of my raw reads.

**jb2** · 02-07-2013, 10:16 AM

TopHat allows reads to map to more than one place in the genome (multihits). This can occur for reads that overlap with elements that repeat in the transcriptome. This means that those reads get counted twice since each multiread alignment is a separate line in the bam file.

Tophat also does not keep reads that don't align in the bam file, so you'd have to go back and count the number of reads in the original fastq file (basically the number of lines in the fastq file divided by 4) to get the original read count including the reads that do not align and look at the difference between that and the bam file. You can use samtools view -F flag to output a bam file without the multireads. In this case the you'd use -F 256 to return a bam file without aligned reads that are not the primary alignment (see http://picard.sourceforge.net/explain-flags.html for more explanation of how to set the -f/-F flag to get the types of reads you want).

Edit: Also, by default, tophat will give back up to 20 alignments for the same read. You can change this option with -g/--max-multihits <int> on the command line as noted in the manual.

**arrchi** · 02-07-2013, 11:00 AM

Thanks a lot, jb2.

**SEQtheTruth** · 02-26-2013, 01:22 PM

I am currently stuck with a similar problem, maybe someone here can help.
Ultimately I am trying to determine the number of reads that were mapped by TopHat.
STEP 1
First I used "samtools flagstat" on accepted_hits.bam:

76865299 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
76865299 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

As you can see, I have unpaired reads.
If I understand jb2 correctly, samtools spuriously tells me 100% were mapped because it can't see any of the reads that TopHat threw out. While it looks kind of suspicious that someone would write code to tell you (some%) when it can't ever count unmapped reads, I can believe that's what's happening here. The problem is that when I run "wc -l" on the original fastq file and divide by four, I get 70,247,418. This is significantly less than 76,865,299. As far as I can tell, this may be because TopHat aligns some reads multiple times, so samtools is giving me a number that includes duplicates and increasing my percentage of mapped reads beyond 100%.

STEP 2
To get around this hurdle and determine the number of unique reads aligned, I used a script from this thread discussing techniques for counting reads:

samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

Which output: 63,702,278
Great! So it looks like I have 70,247,418 reads; 63,702,278 of which were mapped with TopHat, some more than once; giving me 76,865,299 total mapped reads (including duplicates).

STEP 3 - TopHat log files
Here's what "bowtie.left_kept_reads" tells me:

70221973 reads; of these:

70221973 (100.00%) were unpaired; of these:

11079060 (15.78%) aligned 0 times

44347197 (63.15%) aligned exactly 1 time

14795716 (21.07%) aligned >1 times

84.22% overall alignment rate

Do you guys think the tophat log outputs correspond well enough with the numbers from STEP 2?
Why does samtools flagstat pretend to give a mapping percentage if it can't ever count the unmapped reads?
Anyone know a better way to count reads?

I appreciate anyone who can help, since this was a longer post than I first intended.
Thanks!

**swbarnes2** · 02-26-2013, 03:42 PM

Do you guys think the tophat log outputs correspond well enough with the numbers from STEP 2?
Why does samtools flagstat pretend to give a mapping percentage if it can't ever count the unmapped reads?
Anyone know a better way to count reads?

I appreciate anyone who can help, since this was a longer post than I first intended.
Thanks!

Of course samtools can count unmapped reads. It does so for me everyday. It counts every .bam entry with the 4 flagged.

But it can't count what isn't there. It's Bowtie that is throwing them out. (in contrast, when bwa makes a .sam file, it leaves unmapped reads in) You can set bowtie to write the unmapped reads to another file.

**kopi-o** · 02-27-2013, 02:16 AM

Small correction, it is TopHat that is throwing them out. I am pretty sure that Bowtie (when run by itself) leaves the unmapped reads in the output.

**Siva** · 03-11-2013, 01:32 PM

Actually unmapped hits are written separately under unmapped.bam. You can run samtools -flagstat on that and get the number of unmapped reads (QC pass + QC fail)

**swbarnes2** · 03-11-2013, 03:14 PM

Originally posted by Siva View Post

Actually unmapped hits are written separately under unmapped.bam. You can run samtools -flagstat on that and get the number of unmapped reads (QC pass + QC fail)

Keep in mind that the QC pass/fail numbers only mean something if there is software in your pipeline somewhere that checks for QC and sets/changes the flag in the .bam appropriately. SAMTools flagstat is just reading the flags, it's not setting them. I don't know that bowtie/tophat will attempt to set or change that flag at all.

**Siva** · 03-12-2013, 10:59 AM

Originally posted by swbarnes2 View Post

Keep in mind that the QC pass/fail numbers only mean something if there is software in your pipeline somewhere that checks for QC and sets/changes the flag in the .bam appropriately. SAMTools flagstat is just reading the flags, it's not setting them. I don't know that bowtie/tophat will attempt to set or change that flag at all.

Thanks! I do understand your point about QC.

I have some tophat2 generated BAM files (50 base paired end RNA seq). I ran tophat with all default options. I want to get the number of uniquely mapped reads. I get a feeling that it isn't all that straight forward to get them. I am walking you through my steps and would like to get comments.

On the original accepted_hits.bam file I ran samtools -flagstat and got the following

Code:

145949292 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
145949292 + 0 mapped (100.00%:-nan%)
145949292 + 0 paired in sequencing
73274066 + 0 read1
72675226 + 0 read2
114124162 + 0 properly paired (78.19%:-nan%)
138690450 + 0 with itself and mate mapped
7258842 + 0 singletons (4.97%:-nan%)
13016262 + 0 with mate mapped to a different chr
417680 + 0 with mate mapped to a different chr (mapQ>=5)

In order to get uniquely mapped reads, I think in in tophat (bowtie2)....the tag that signifies (highest probability) of being unique is "NH:i:1". So I then grepped for only lines with "NH:i:1" tag and got a sam file. I then made a bam file (accepted_hits_NH1.bam) and performed samtools flagstat and got the following:

Code:

108602832 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
108602832 + 0 mapped (100.00%:-nan%)
108602832 + 0 paired in sequencing
54503639 + 0 read1
54099193 + 0 read2
95543890 + 0 properly paired (87.98%:-nan%)
104454694 + 0 with itself and mate mapped
4148138 + 0 singletons (3.82%:-nan%)
1058650 + 0 with mate mapped to a different chr
417680 + 0 with mate mapped to a different chr (mapQ>=5)

Is it then safe to assume that the number of uniquely mapped reads are the sum of properly paired and singletons from my NH:i:1 only bam file. Thus
No. of properly paired reads (95543890) + singletons (4148138) = 99692028.

By the way...is it better if I re -run tophat with

--report secondary alignments

ON so that I can see the difference between the AS and XS (secondary alignment) and use that as a measure of uniqueness?

**chadn737** · 03-13-2013, 07:38 AM

Try using:

Code:

cut -f 1 <samfile> | sort | uniq -u | wc -l

Using the --report secondary alignments will force tophat2 to report a lower quality secondary alignment if one exists. The choice to use it depends I guess on how stringent you want to be in defining "uniquely mapped". It will lower a great deal the number of uniquely mapped reads and I would not typically use it.

**Siva** · 03-13-2013, 08:44 AM

Originally posted by chadn737 View Post

Try using:

Code:

cut -f 1 <samfile> | sort | uniq -u | wc -l

Hi Chad...
This will give a set of unique read IDs but not always uniquely mapped (aligned reads). I think it depends (as you point out) on how we ask tophat to align.

In default mode tophat reports best aligned reads with varying AS scores.The higher the AS score, the more uniquely mapped it is. We can ask tophat to report up to 20 extra alignments using --report-secondary-alignments. The optional tags AS (primary alignment score) and XS (Extra alignment score) can be used to report the read as uniquely mapped or not.

cheers
Siva

**chadn737** · 03-13-2013, 10:31 AM

You're right Siva. That code goes off of the read IDs, so it will tell you the number of unique IDs in that file.

With the older versions of Tophat, where all unique reads simply had a score of 255, and multimapped reads were either 0,1,2,3 then there was no way to really distinguish this on the basis of mapping scores in the sam file.

So ultimately it depends on how stringently you want to define "unique" and what your purposes for that is.

Topics	Statistics	Last Post
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, Today, 06:35 AM	0 responses 10 views 0 likes	Last Post by seqadmin Today, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, Yesterday, 02:46 PM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 14 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 18 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM

Seqanswers Leaderboard Ad

Announcement

tophat bam shows more reads than the library

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News