SEQanswers





Old 07-29-2010, 11:30 PM   #1
bgibb
Junior Member
 
Location: San Francisco, CA

Join Date: Jul 2010
Posts: 7
Default TopHat and uniquely aligned reads

Hello, I was wondering if anyone has tried running TopHat with the option max-multihits=1. From the user manual, I would expect only uniquely-mapped reads to appear in the accepted_hits.sam file; however, I can usually find several reads appearing more than once. My understanding is that when max-multihits is set to 1, TopHat is supposed to discard multiply-mapped reads and not consider them further, e.g. when mapping to junctions. Is my understanding correct? If so, any thoughts on why multi-maps are reported in the alignment?

I am using TopHat-1.0.14 with Bowtie-0.12.5.

Thanks.

Bill
Old 07-30-2010, 07:42 AM   #2
apratap
Member
 
Location: Bay Area

Join Date: Jan 2009
Posts: 58
Default

Hi bgibb

I would expect the same after setting max-multihits=1, but I have not tried it, as we were interested in multi-mapping reads. As for finding the unique hits: all the reads in the SAM file with mapping quality = 255 are unique. I don't know the reason why, but I got this info from Cole, the developer of this package.
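If that holds, a quick filter on the MAPQ column should pull those out. This is only a rough sketch based on the 255 rule above (not an official TopHat recipe), using the accepted_hits.sam mentioned in the first post:

# Keep SAM header lines plus alignments whose MAPQ (column 5) is 255,
# which, per the tip above, marks the uniquely mapped reads.
awk '$1 ~ /^@/ || $5 == 255' accepted_hits.sam > unique_mapq255.sam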

-A
Old 07-30-2010, 08:46 AM   #3
brentp
Member
 
Location: salt lake city, UT

Join Date: Apr 2010
Posts: 72
Default

@bgibb I posted a message about this a bit earlier: you can't use --max-multihits, you have to use -g 1 (where -g is the short flag for --max-multihits), because there's a bug in the option handling in TopHat.
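For example, a run with the short flag would look something like this (just a sketch; the index name and read files are placeholders, everything else as you'd normally run it):

tophat -g 1 -o tophat_out genome_index reads_1.fastq reads_2.fastq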
Old 08-01-2010, 10:23 PM   #4
bgibb
Junior Member
 
Location: San Francisco, CA

Join Date: Jul 2010
Posts: 7
Default

Thanks for the helpful replies.

Regarding mapping quality 255, according to the SAM spec, 255 denotes “mapping quality is not available.” I think TopHat needs at least one 2nd-best hit in order to calculate mapping quality, so the 255 would make sense for uniquely mapped reads... assuming that TopHat follows the MAQ definition of mapping quality [Li et al, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” p. 1856].
Old 01-24-2011, 11:09 AM   #5
mbom777
Junior Member
 
Location: Tucson, AZ

Join Date: Oct 2010
Posts: 4
Default

@bgibb I recently discovered reads with multiple hits in my uniquely-mapped TopHat output (10 out of 4,480,145). I determined that those reads map to the same location twice, i.e. there is an alternate mapping for them due to indels.

I also have a question about the mapping quality values. Even for non-uniquely-mapped reads I am seeing mostly 255.
Old 01-26-2011, 04:50 AM   #6
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Hi, sorry for deviating a bit from the main question, but why would the default value for this flag be set to 40? Wouldn't you want to discard multiple hits and use the OP's option, -g 1? I am starting to use TopHat and I am trying to set the best flags for my analysis.

Thanks,

Dave
Old 03-04-2011, 12:31 PM   #7
DavidMatthewsBristol
Junior Member
 
Location: Bristol, UK

Join Date: Aug 2010
Posts: 7
Default Hope this helps....

Hi,
I've been using Galaxy to analyse RNA-seq data from mRNA isolated from HeLa cells. Like others I have the problem of reads that are multihits. I have put up a workflow for analysis on Galaxy that involves the following steps (a rough command-line sketch of steps 2-4 is given below):
1. Run the data through TopHat and allow up to 40 mappings per read (the default).
2. Use a samtools feature to get rid of all mappings that are not mate paired and in a proper pair.
3. Count up how many times an individual read appears in the SAM file and remove all read pairs that are not mapped to a unique site, putting them in a separate "multihits" file.
4. Keep all the uniquely mapped, properly paired hits in a unique-hits SAM file.
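Outside Galaxy, steps 2-4 can be approximated on the command line roughly as follows. This is only a sketch of the idea, not the exact Galaxy workflow; the filenames are placeholders:

# Step 2: keep only properly paired alignments (SAM flag 0x2 set).
samtools view -h -f 2 accepted_hits.bam > proper_pairs.sam
# Step 3: count how many alignment lines each read name has.
grep -v '^@' proper_pairs.sam | cut -f1 | sort | uniq -c > read_counts.txt
# A pair mapped to a single site contributes exactly two lines; anything more is a multihit.
awk '$1 == 2 {print $2}' read_counts.txt > unique_read_names.txt
awk '$1 >  2 {print $2}' read_counts.txt > multihit_read_names.txt
# Step 4: keep the uniquely mapped proper pairs (plus the SAM header).
awk 'NR==FNR {keep[$1]; next} /^@/ || ($1 in keep)' unique_read_names.txt proper_pairs.sam > unique_hits.sam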

This approach generates more unique hits than asking TopHat to throw out reads that do not map uniquely (this may have changed with the latest TopHat release; I haven't checked yet). I think (and the TopHat guys may correct me on this) this is because TopHat may be removing read pairs where one end is not uniquely mapped but the other is (and therefore only makes sense with one of the mates).
However, whatever TopHat does (now or in the future), this approach does have the advantage of telling you where and how big the multihit problem is. My dataset has, for example, 18 million unique properly paired reads, 1.3 million that map to two places, a few hundred thousand that map to 3 places, and so on down the line.
One problem with multihits is that we may be overestimating some genes by including multihits, or conversely underestimating some genes by excluding them. This "Bristol" workflow allows us to at least know whether a gene is prone to multihits.

I think this approach is useful, but I may have missed something or be behind the curve! Who knows, but I thought it might be a useful workflow to start a discussion about what to do with multihit reads.

Cheers
David
Old 07-02-2012, 11:07 AM   #8
vyellapa
Member
 
Location: phoenix

Join Date: Oct 2011
Posts: 59
Default

@DavidMatthewsBristol I'm curious how you did step 3:
- Count up how many times an individual read is in the SAM file and remove all read pairs that are not mapped to a unique site, putting them in a separate "multihits" file.

I'm trying to create a file with only unique reads (no multiple hits), but I'm also trying to keep the pair with the highest mapping score.

Thank you,
Teja
Old 11-29-2012, 05:35 AM   #9
sqcrft
Member
 
Location: boston

Join Date: May 2012
Posts: 29
Default

@bgibb I don't think so.

The option only reports one mapping result out of all of them. It doesn't mean you get uniquely-mapped reads; you just get one reported alignment per read.

"If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments."
http://tophat.cbcb.umd.edu/manual.html
Old 05-16-2013, 01:19 PM   #10
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Yes, I think sqcrft is right. It does not discard multiple-mapping reads; it only reports one alignment for each of them, out of all their alignments.

I think there used to be an option -k that would allow one to discard reads that exceeded x alignments -- whatever happened to that? I only see -g in the TopHat 2 manual, no reporting options like before...
Old 10-02-2013, 12:04 AM   #11
CPCantalapiedra
Member
 
Location: Zaragoza (Spain)

Join Date: Sep 2011
Posts: 38
Default

OMG, this topic is really lasting through time. I really encourage the next time traveller to take a look at this: http://www.biostars.org/p/82559/.

On the other hand, the -g 1 parameter is well defined in the TopHat manual nowadays. You should not expect it to get rid of multiple-mapping reads; you will just have (in theory) one reported mapping per read. However, this is not happening (to me, at least) in TopHat 2.0.9, which is really confusing.

If you want to get rid of multiple mappings, note that the -g option has a different behaviour when used with -M, which does a pre-mapping to the genome in order to filter multiple mappings. There, -g really is a threshold, and reads with more than one mapping (-g 1) will be filtered and will not be used in the following steps (transcriptome mapping, genome mapping, segment mapping and so on). However, even so, one can find some multiple mappings in the accepted_hits.bam (which in theory reports just one alignment per read when using -g 1). I guess these multiple mappings (a few, in my case) are the ones generated when reads or segments are mapped to the newly tested junctions.
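As an example, a pre-filtering run along those lines might look like this (just a sketch; the index and read file names are placeholders, and exact behaviour may differ between TopHat 2 releases):

tophat2 -M -g 1 -o tophat_out genome_index reads_1.fastq reads_2.fastq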

I would really like to have control over multiple and single mapping reads, but it is a tough task with the current TopHat versions.

Now that I am here, I will ask: does anyone know exactly which SAM tags TopHat uses to annotate multiple-mapping reads? Thank you
Old 10-02-2013, 01:17 AM   #12
sphil
Senior Member
 
Location: Stuttgart, Germany

Join Date: Apr 2010
Posts: 192
Default

Hey,

there is an ongoing discussion almost everywhere. I normally use samtools to extract the unique reads from the results using:

samtools view -bq 1 file.bam > unique.bam

http://www.biostars.org/p/56246/
Old 10-02-2013, 09:27 PM   #13
CPCantalapiedra
Member
 
Location: Zaragoza (Spain)

Join Date: Sep 2011
Posts: 38
Default

It also seems that TopHat uses the NH:i tag rather than Bowtie2's XS:i. I didn't know that. So maybe your answer is more general than relying on either of those tags?
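For anyone who would rather filter on the tag directly, something along these lines should keep only alignments whose NH tag reports a single hit (a rough, untested sketch; it assumes accepted_hits.bam is the TopHat output and samtools is on the PATH):

samtools view -h accepted_hits.bam \
  | awk '/^@/ || /NH:i:1([^0-9]|$)/' \
  | samtools view -bS - > unique_nh.bam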
Old 10-02-2013, 10:45 PM   #14
CPCantalapiedra
Member
 
Location: Zaragoza (Spain)

Join Date: Sep 2011
Posts: 38
Default

@sphil I tried to count the MAPQ + NH:i pairs of values in my accepted_hits.bam (count, MAPQ, NH):

35182251 50 1

So all my BAM records are unique and NH:i:1... of course not matching the align_summary file:

Left reads:
Input: 26017654
Mapped: 17425180 (67.0% of input)
of these: 885971 ( 5.1%) have multiple alignments (19321824 have >1)
Right reads:
Input: 26343552
Mapped: 17592344 (66.8% of input)
of these: 993327 ( 5.6%) have multiple alignments (19103282 have >1)
66.9% overall read alignment rate.

Aligned pairs: 13950250
of these: 188496 ( 1.4%) have multiple alignments
and: 505713 ( 3.6%) are discordant alignments
51.7% concordant pair alignment rate.
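For reference, a tally like the one above can be produced along these lines (a sketch, assuming samtools is available; not necessarily how the numbers above were generated):

samtools view accepted_hits.bam \
  | awk '{nh="NA"; for(i=12;i<=NF;i++) if($i ~ /^NH:i:/){split($i,a,":"); nh=a[3]} print $5, nh}' \
  | sort | uniq -c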