
View Full Version : TopHat and uniquely aligned reads


bgibb
07-29-2010, 10:30 PM
Hello, I was wondering if anyone has tried running TopHat with the option max-multihits=1. From the user manual, I would expect only uniquely-mapped reads to appear in the accepted_hits.sam file; however, I can usually find several reads appearing more than once. My understanding is that when max-multihits is set to 1, TopHat is supposed to discard multiply-mapped reads and not consider them further, e.g. when mapping to junctions. Is my understanding correct? If so, any thoughts on why multi-maps are reported in the alignment?

I am using TopHat-1.0.14 with Bowtie-0.12.5.

Thanks.

Bill

apratap
07-30-2010, 06:42 AM
Hi bgibb

I would expect the same after setting max-multihits=1, but I have not tried it, as we were interested in multi-reads. As for finding the unique hits, all the reads in the SAM file with mapping quality = 255 are unique. I don't know the reason why, but I got this info from Cole, the developer of this package.
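
For what it's worth, a minimal sketch of that MAPQ-255 filter over plain SAM text (assuming the BAM has been converted with samtools view; the function name is mine, and whether 255 really marks unique hits rests on Cole's comment above, not on the SAM spec):

```python
def unique_by_mapq(sam_lines, mapq=255):
    """Yield header lines plus alignment records whose MAPQ equals `mapq`."""
    for line in sam_lines:
        if line.startswith("@"):        # pass SAM header lines through
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) == mapq:      # column 5 (0-based index 4) is MAPQ
            yield line

sam = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*",
]
kept = [l.split("\t")[0] for l in unique_by_mapq(sam)]
print(kept)  # ['@HD', 'r1']
```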

-A

brentp
07-30-2010, 07:46 AM
@bgibb I posted a message about this earlier: you can't use --max-multihits, you have to use -g 1 (where -g is the short flag for --max-multihits), because there's a bug in the option handling in TopHat.

bgibb
08-01-2010, 09:23 PM
Thanks for the helpful replies.

Regarding mapping quality 255, according to the SAM spec, 255 denotes “mapping quality is not available.” I think TopHat needs at least one 2nd-best hit in order to calculate mapping quality, so the 255 would make sense for uniquely mapped reads... assuming that TopHat follows the MAQ definition of mapping quality [Li et al, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” p. 1856].

mbom777
01-24-2011, 10:09 AM
@bgibb I recently discovered reads with multiple hits in my unique-mapped tophat output (10 out of 4,480,145). I determined that the reads are mapping to the same location twice, i.e. there is an alternate mapping for those reads due to indels.

I also have a question about the mapping quality values. Even for non-uniquely-mapped reads I am seeing mostly 255.

dnusol
01-26-2011, 03:50 AM
Hi, sorry for deviating a bit from the main question, but why would the default value for this flag be set to 40? Wouldn't you want to discard multiple hits and use the OP's option -g 1? I am starting to use TopHat and am trying to set the best flags for my analysis.

Thanks,

Dave

DavidMatthewsBristol
03-04-2011, 11:31 AM
Hi,
I've been using Galaxy to analyse RNA-seq data from mRNA isolated from HeLa cells. Like others I have the problem of reads that are multihits. I have put up a workflow for analysis on Galaxy that involves the following steps:
1. Run the data using tophat and allow up to 40 maps per read (default)
2. Use a samtools feature to get rid of all mappings that are not mate paired and in a proper pair.
3. Count up how many times an individual read is in the sam file and remove all read pairs that are not mapped to a unique site, putting them in a separate "multihits" file.
4. Keep all the uniquely mapped proper mate paired hits in a unique hits sam file.
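
The counting in steps 3 and 4 might be sketched like this (a minimal Python sketch over SAM text records, single-end for simplicity; in David's paired-end case a "unique" name would occur exactly twice, once per mate, hence the `expected` parameter — the helper itself is mine, not part of the Galaxy workflow):

```python
from collections import Counter

def split_unique_multi(sam_records, expected=1):
    """Step 3/4 sketch: count how often each read name occurs and split
    records into unique hits vs. a separate "multihits" pile.
    For proper paired-end data, pass expected=2 (one record per mate)."""
    names = Counter(r.split("\t")[0] for r in sam_records)
    unique, multi = [], []
    for r in sam_records:
        (unique if names[r.split("\t")[0]] == expected else multi).append(r)
    return unique, multi

records = [
    "r1\t0\tchr1\t100\t50\t50M",   # r1 maps once -> unique
    "r2\t0\tchr1\t200\t3\t50M",    # r2 maps twice -> multihit
    "r2\t256\tchr2\t300\t3\t50M",
]
uniq, multi = split_unique_multi(records)
print(len(uniq), len(multi))  # 1 2
```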

This approach generates more unique hits than asking TopHat to throw out reads that do not uniquely map (this may have changed with the latest TopHat release - I haven't checked yet). I think (and the TopHat guys may correct me on this) this is because TopHat may be removing reads where one end is not uniquely mapped but the other is (and therefore only makes sense with one of the mates).
However, whatever TopHat does (now or in the future), this approach does have the advantage of telling you where and how big the multihit problem is. My dataset has, for example, 18 million unique proper-paired reads, 1.3 million that map to two places, a few hundred thousand that map to 3 places, and so on down the line.
One problem with multihits is that we may be overestimating some genes by including multihits, or conversely underestimating some genes by excluding them. This "Bristol" workflow allows us to at least know if a gene is prone to multihits.

I think this approach is useful but I may have missed something or be behind the curve!! Who knows but I thought it might be a useful workflow to start a discussion about what to do with multihit reads.

Cheers
David

vyellapa
07-02-2012, 10:07 AM
[quoting DavidMatthewsBristol's workflow post above]

I'm curious how you did step 3
-Count up how many times an individual read is in the sam file and remove all read pairs that are not mapped to a unique site, putting them in a separate "multihits" file.

I'm trying to create a file with only unique reads (no multiple hits), but am also trying to keep the pair with the highest mapping score.

Thank you,
Teja

sqcrft
11-29-2012, 04:35 AM
I don't think so.

The option will report only one mapping result out of all of them. It doesn't mean you get uniquely-mapped reads; you just get a single reported mapping per read.

"If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments."
http://tophat.cbcb.umd.edu/manual.html




[quoting bgibb's original question]

carmeyeii
05-16-2013, 12:19 PM
Yes, I think sqcrft is right. It does not discard multiple-mapping reads, but only reports one alignment for those out of all their alignments.

I think there used to be an option -k that would allow one to discard reads that exceeded x alignments -- whatever happened to that? I only see -g in the TopHat 2 manual, no reporting options like before...

CPCantalapiedra
10-01-2013, 11:04 PM
OMG, this topic is really lasting through time. I really encourage the next time traveller to take a look at this: http://www.biostars.org/p/82559/.

On the other hand, the -g 1 parameter is well defined in the TopHat manual nowadays. You should not expect to get rid of the multiple-mapping reads; you will just have (in theory) one reported mapping per read. However, this is not happening (to me, at least) in TopHat 2.0.9, which is really confusing.

To actually get rid of multiple mappings, note that the -g option has a different behaviour when used with -M, which does a pre-mapping to the genome in order to filter multiple mappings. Here -g really is a threshold, and reads with more than one mapping (-g 1) will be filtered and will not be used in the following steps (transcriptome mapping, genome mapping, segment mapping and so on). However, even so, one can find some multiple mappings in the accepted_hits.bam (with just one alignment reported, in theory, if using the -g 1 option). I guess these multiple mappings (a few, in my case) are the ones generated when mapping the reads or segments to the newly tested junctions.

I really would like to have control over multiple and single mapping reads, but it is a tough task with the current TopHat versions.

Now that I am here, I will ask whether anyone knows exactly which SAM tags TopHat uses to annotate multiple-mapping reads? Thank you

sphil
10-02-2013, 12:17 AM
Hey,

there is an ongoing discussion almost everywhere. I normally use samtools to extract the unique reads from the results using:

samtools view -bq 1 file.bam > unique.bam

http://www.biostars.org/p/56246/

CPCantalapiedra
10-02-2013, 08:27 PM
It also seems that TopHat uses the NH:i tag instead of bowtie2's XS:i. I didn't know that. So maybe your answer could be made more general than relying on either tag?
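
For reference, a quick sketch of reading the NH:i tag from a SAM record (NH:i:N gives the number of reported alignments for that read; the optional tags start after the 11 mandatory fields; the helper name is mine):

```python
def nh_count(sam_line):
    """Return the value of the NH:i tag, or None if the tag is absent."""
    for tag in sam_line.rstrip("\n").split("\t")[11:]:  # optional tags
        if tag.startswith("NH:i:"):
            return int(tag[5:])
    return None

rec = "r1\t0\tchr1\t100\t50\t50M\t*\t0\t0\tACGT\tIIII\tNH:i:3\tAS:i:-2"
print(nh_count(rec))  # 3
```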

CPCantalapiedra
10-02-2013, 09:45 PM
[quoting sphil's samtools suggestion above]

I tried to count the MAPQ + NH:i pairs of values in my accepted_hits.bam:

35182251 50 1

So all my BAM records are unique and NH:i:1... of course not matching the align_summary file:

Left reads:
Input: 26017654
Mapped: 17425180 (67.0% of input)
of these: 885971 ( 5.1%) have multiple alignments (19321824 have >1)
Right reads:
Input: 26343552
Mapped: 17592344 (66.8% of input)
of these: 993327 ( 5.6%) have multiple alignments (19103282 have >1)
66.9% overall read alignment rate.

Aligned pairs: 13950250
of these: 188496 ( 1.4%) have multiple alignments
and: 505713 ( 3.6%) are discordant alignments
51.7% concordant pair alignment rate.
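
The MAPQ + NH:i tally above could be reproduced along these lines (a sketch over SAM text, not the exact command I ran):

```python
from collections import Counter

def mapq_nh_pairs(sam_lines):
    """Count (MAPQ, NH) value pairs across SAM records, producing a
    tally like the `35182251 50 1` line above."""
    pairs = Counter()
    for line in sam_lines:
        if line.startswith("@"):   # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # pull the NH:i value out of the optional tags, or "NA" if missing
        nh = next((t[5:] for t in fields[11:] if t.startswith("NH:i:")), "NA")
        pairs[(fields[4], nh)] += 1
    return pairs

sam = [
    "r1\t0\tchr1\t100\t50\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tchr1\t200\t50\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r3\t0\tchr2\t300\t1\t50M\t*\t0\t0\t*\t*\tNH:i:2",
]
for (mapq, nh), n in sorted(mapq_nh_pairs(sam).items()):
    print(n, mapq, nh)
```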