Unconfigured Ad

**kmcarr** · 11-20-2012, 05:47 AM

First, use the right tool for the job. Bowtie (1 or 2) is not appropriate for mapping RNASeq reads to a genome reference. Bowtie does not account for RNA splicing, it expects reads to map contiguously to the reference. You should be using TopHat, which is really a wrapper around Bowtie but it permits splitting up of RNASeq reads so that they can be aligned to a genomic reference.

Now to your question look a the TopHat manual and study the "-g/--max-multihits <int>" option; it does what you want.

**hoardin** · 11-20-2012, 06:48 AM

Thank you for your response!

Just out of curiosity, what would be the difference in the mapping using tophat vs just bowtie(1/2)?

**kmcarr** · 11-20-2012, 07:07 AM

Originally posted by hoardin View Post

Just out of curiosity, what would be the difference in the mapping using tophat vs just bowtie(1/2)?

Bowtie expects reads to map contiguously to the reference, allowing only for SNPs or small indels. An RNASeq read from a processed transcript may contain sequence from multiple exons separated by hundreds or thousands of base pairs in genomic coordinate space. Bowtie would fail to align these reads. TopHat uses a number of methods to identify read segments originating from different exons, split the reads and map the segments to their originating exon.

**hoardin** · 11-20-2012, 07:14 AM

Thanks again.

I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?

**kmcarr** · 11-20-2012, 07:29 AM

Originally posted by hoardin View Post

I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?

Yes, TopHat will successfully map more RNASeq reads than Bowtie. The magnitude of that difference will vary depending on your experiment. The question is what fraction of your reads include a splice junction in a position which would cause Bowtie to fail (spices near the end of a read may be mappable). This is going to be determined by variables such as the length of your RNASeq reads and the average length of exons in the organism you are studying, more specifically the average length of exons in the transcripts represented in your RNA sample.

**westerman** · 11-20-2012, 07:36 AM

I'll edit this post and say that it is basically a "ditto" of KMCarr's post above ...

Originally posted by hoardin View Post

Thanks again.

I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?

Percentage will depend on your organism. Multiexonic and alternative splicing percentages and the sizes of exons (it matter if your reads are bigger than the average exon size) vary between the kingdoms and the species.

In general Tophat would report more successes.

Let's see, you are mapping to human. As a wild guess -- and I do not deal with human very much -- I'd say that you are probably missing 50% of your potential mapped reads.

**CraigJ** · 11-20-2012, 12:40 PM

Originally posted by kmcarr View Post

Now to your question look a the TopHat manual and study the "-g/--max-multihits <int>" option; it does what you want.

I do not believe this is what hoardin wanted. This option will still report the reads that align to the genome in multiple places - it will just report the <int> number of alignments. So if hoardin uses -g 1, the reads that align 2+ times to the genome will still be reported, but only the top 1 alignment will be reported.

What hoardin was asking for, I believe, is for bowtie/tophat to report only the alignments that map uniquely to the genome (map to only 1 place on the genome), and exclude all the alignments for reads that map to multiple places on the genome.

While bowtie2 cannot be told to only report uniquely aligned reads, there is a way to filter only uniquely mapped reads from the SAM output.

Bowtie2 uses the SAM optional field XS: to report the alignment score for the second best alignment for this read. Here is the description from the Bowtie2 manual:

XS:i:<N>
Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read.

As you can see from the bold text, this field is only present on reads that map multiple times - and therefore, uniquely mapped reads will not have this field present. Therefore you can filter out alignments that contain this field to get only the uniquely mapped reads.

I assume tophat will report the same way, since it uses bowtie2 to do the mapping, though I have not actually checked if this is true.

Credit goes to MGineste for coming up with this method to filter for only uniquely mapped reads. See this seqanswers thread for more information/the original proposal of this filter and some ways to implement the filter: http://seqanswers.com/forums/showthread.php?t=23389

**hoardin** · 11-20-2012, 01:30 PM

Originally posted by CraigJ View Post

What hoardin was asking for, I believe, is for bowtie/tophat to report only the alignments that map uniquely to the genome (map to only 1 place on the genome), and exclude all the alignments for reads that map to multiple places on the genome.

While bowtie2 cannot be told to only report uniquely aligned reads, there is a way to filter only uniquely mapped reads from the SAM output.

Bowtie2 uses the SAM optional field XS: to report the alignment score for the second best alignment for this read. Here is the description from the Bowtie2 manual

Yes this was what I was looking for. I found a more roundabout way to do it. using -k 2, bowtie2 would show two entries for reads that align to multiple locations. Thus, I just removed all entires that weren't unique.

Thanks again everyone for the help.

**vbiaudet** · 11-23-2012, 06:34 AM

Tophat/Bowtie2 mapping

Originally posted by kmcarr View Post

Yes, TopHat will successfully map more RNASeq reads than Bowtie. The magnitude of that difference will vary depending on your experiment. The question is what fraction of your reads include a splice junction in a position which would cause Bowtie to fail (spices near the end of a read may be mappable). This is going to be determined by variables such as the length of your RNASeq reads and the average length of exons in the organism you are studying, more specifically the average length of exons in the transcripts represented in your RNA sample.

I have a question between a mapping with TopHat (against the genome) OR Bowtie2 (against transcripts). If we consider that the goal of a study is to count reads by transcripts, to make differential analyses, and that the model genes are good, do you think is useful to run topHat?

I say that because I consider that the mapping of reads coming from RNAseq was not very good to splicing map against genome but today perhaps topHat2 is better?

Thanks, vb

**Mamta** · 04-06-2013, 12:07 PM

Bowtie2 uniquely mapping reads

Originally posted by hoardin View Post

Yes this was what I was looking for. I found a more roundabout way to do it. using -k 2, bowtie2 would show two entries for reads that align to multiple locations. Thus, I just removed all entires that weren't unique.

Thanks again everyone for the help.

I am dealing with something similar, Would like to know how did you remove those reads?.

**shi** · 04-06-2013, 02:02 PM

If you are worried about the multi-mapping reads, why dont you choose an aligner that can report a unique mapping location for each read at the first place? There are many aligners that report a single location for each read (and has the option of reporting no locations if the read can be best mapped to more than one location).

For example, you may have a look at the Subread package (http://subread.sourceforge.net). It performs particularly well in RNA-seq data because it does not require end-to-end alignments. It uses the largest mappable region in the read to determine its mapping location, although it tries its best to get the end-to-end alignment for the read if possible. It also includes a Subjunc program that performs full alignment for junction reads spanning multiple exons (ie. mapping locations of each read base will be determined). By default, both Subread and Subjunc return at most 1 mapping location for each read.

Cheers,
Wei

**vbiaudet** · 04-10-2013, 05:18 AM

remove perfect multiple hits

Originally posted by Mamta View Post

I am dealing with something similar, Would like to know how did you remove those reads?.

yes, you didn't need to keep the second hit just know its score.
You just need to run mapper with option the best hit and after see the XS:i:score (the score of the second hit), if its' the same than AS:i:score then you have a perfect repeat.

bye, vb

**yingzhang** · 05-24-2013, 10:28 AM

For Tophat/1.0 + version, tophat outputed a MAPQ value of 255 to indicate the mapping is unique or not. So for bam file, you can ran "samtools view -q 255" to extract all uniquely mapped reads.

Unfortunately, Tophat/2.0+ version changed this rule. So I use the following command to extract all reads that are uniquely mapped:
samtools view accepted_hits.bam | grep -v "HI:i:"

I don't think the sam tag XS will work. I also had the impression that XS is the score for the next hit, but when I checked the real ouput, I got something like this "XS:A:-", which is definitely not a score but a character. So I suspect Tophat uses XS differently. Finally, after reading the doc carefully, I found filtering on "NH" or "HI" tag is much more useful. NH:i:1 means the reads just mapped once, in other words, uniquely mapped. If NH:i:2, then the read will appear twice in the bam file, with one using HI:i:0, the other one using HI:i:1 tag. So I decided to filter bam output on NH and HI tag.

Anyway, just my 2 cents.

**sdriscoll** · 05-25-2013, 01:50 PM

Tophat only produces a few MAPQ values since they use a rounded -10log10(uncertainty) calculation where the uncertainty are values like 1/2 for a read that could align 2 times, 2/3 for a read than can align 3 times and so on. If you plug 1/2 into that forumla and round it you get 3. So if you filter the tophat alignments for anything with a MAPQ greater than 3 you've extracted all of the unique alignments. STAR does the same thing. Bowtie2 uses a MAPQ scale that correlates with "uniqueness" of alignment. I'm not sure how it works exactly, though. Also they define uniqueness as an alignment that "has a much higher alignment score than all the other possible alignments".

When you're dealing with looking for reads that align "uniquely" take into consideration that you've also got to put some additional restrictions in place. For example you might say "reads that align uniquely with up to 4 mismatches" or those with no mismatches in the seed but up to 2 in the remaining read bases. With gapped aligners it gets more complicated because then you're talking about a mixture of mismatches and indels so you might move to the edit distance concept which combines the two counts. Not to say that there aren't a lot of unique transcript sequence - obviously there is - but base call errors and divergence in your sample's genome from the reference you're aligning to can cause some confusion for the aligners and kind of obscure the concept of "unique alignment". Also length of reads plays into this. A 25 bp read has a much higher probability of multi-aligning verses a 100 bp read. So in the end you're full condition will include some threshold in edit distance at whatever read length you're using. You might try arbitrarily picking the edit distance cutoff and then trying a few other alternatives to see what sort of impact in the final passing alignment count there is...to help you figure out if your cutoff is producing a reasonably stable result.

Topics	Statistics	Last Post
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 25 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 30 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 39 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 62 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM

Unconfigured Ad

Bowtie2: Uniquely mapped reads

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News