SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Remove reads which are not uniquely mapped hanleng Bioinformatics 9 08-25-2015 05:04 AM
Uniquely mapped reads or multiple mapped reads? taozuo RNA Sequencing 0 06-29-2012 05:50 AM
How to extract uniquely mapped reads? wisense RNA Sequencing 0 05-16-2012 12:04 AM
not uniquely mapped reads unidodo RNA Sequencing 2 04-22-2011 02:07 PM
cufflinks and non-uniquely mapped reads clariet Bioinformatics 1 05-08-2010 11:13 AM

Reply
 
Thread Tools
Old 11-19-2012, 12:51 PM   #1
hoardin
Junior Member
 
Location: Illinois

Join Date: Nov 2012
Posts: 6
Default Bowtie2: Uniquely mapped reads

Hi.

I've been searching everywhere for this, and I can't find a clear answer.

I am mapping RNAseq reads to the human genome using Bowtie2. I am only looking for reads that align ONLY ONCE to the genome. By default Bowtie2 reports all the reads that are aligned. Is there an option so I only have unique mapping reads in my sam file?

Any help would be appreciated. Thanks!
hoardin is offline   Reply With Quote
Old 11-20-2012, 04:47 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,165
Default

First, use the right tool for the job. Bowtie (1 or 2) is not appropriate for mapping RNASeq reads to a genome reference. Bowtie does not account for RNA splicing, it expects reads to map contiguously to the reference. You should be using TopHat, which is really a wrapper around Bowtie but it permits splitting up of RNASeq reads so that they can be aligned to a genomic reference.

Now to your question look a the TopHat manual and study the "-g/--max-multihits <int>" option; it does what you want.
kmcarr is offline   Reply With Quote
Old 11-20-2012, 05:48 AM   #3
hoardin
Junior Member
 
Location: Illinois

Join Date: Nov 2012
Posts: 6
Default

Thank you for your response!

Just out of curiosity, what would be the difference in the mapping using tophat vs just bowtie(1/2)?
hoardin is offline   Reply With Quote
Old 11-20-2012, 06:07 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,165
Default

Quote:
Originally Posted by hoardin View Post
Just out of curiosity, what would be the difference in the mapping using tophat vs just bowtie(1/2)?
Bowtie expects reads to map contiguously to the reference, allowing only for SNPs or small indels. An RNASeq read from a processed transcript may contain sequence from multiple exons separated by hundreds or thousands of base pairs in genomic coordinate space. Bowtie would fail to align these reads. TopHat uses a number of methods to identify read segments originating from different exons, split the reads and map the segments to their originating exon.
kmcarr is offline   Reply With Quote
Old 11-20-2012, 06:14 AM   #5
hoardin
Junior Member
 
Location: Illinois

Join Date: Nov 2012
Posts: 6
Default

Thanks again.

I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?
hoardin is offline   Reply With Quote
Old 11-20-2012, 06:29 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,165
Default

Quote:
Originally Posted by hoardin View Post
I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?
Yes, TopHat will successfully map more RNASeq reads than Bowtie. The magnitude of that difference will vary depending on your experiment. The question is what fraction of your reads include a splice junction in a position which would cause Bowtie to fail (spices near the end of a read may be mappable). This is going to be determined by variables such as the length of your RNASeq reads and the average length of exons in the organism you are studying, more specifically the average length of exons in the transcripts represented in your RNA sample.
kmcarr is offline   Reply With Quote
Old 11-20-2012, 06:36 AM   #7
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

I'll edit this post and say that it is basically a "ditto" of KMCarr's post above ...
Quote:
Originally Posted by hoardin View Post
Thanks again.

I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?
Percentage will depend on your organism. Multiexonic and alternative splicing percentages and the sizes of exons (it matter if your reads are bigger than the average exon size) vary between the kingdoms and the species.

In general Tophat would report more successes.

Let's see, you are mapping to human. As a wild guess -- and I do not deal with human very much -- I'd say that you are probably missing 50% of your potential mapped reads.

Last edited by westerman; 11-20-2012 at 06:37 AM. Reason: Acknowledge KMCarr
westerman is offline   Reply With Quote
Old 11-20-2012, 11:40 AM   #8
CraigJ
Junior Member
 
Location: Indiana University, Indiana, USA

Join Date: Jul 2012
Posts: 8
Default

Quote:
Originally Posted by kmcarr View Post
Now to your question look a the TopHat manual and study the "-g/--max-multihits <int>" option; it does what you want.
I do not believe this is what hoardin wanted. This option will still report the reads that align to the genome in multiple places - it will just report the <int> number of alignments. So if hoardin uses -g 1, the reads that align 2+ times to the genome will still be reported, but only the top 1 alignment will be reported.

What hoardin was asking for, I believe, is for bowtie/tophat to report only the alignments that map uniquely to the genome (map to only 1 place on the genome), and exclude all the alignments for reads that map to multiple places on the genome.

While bowtie2 cannot be told to only report uniquely aligned reads, there is a way to filter only uniquely mapped reads from the SAM output.

Bowtie2 uses the SAM optional field XS: to report the alignment score for the second best alignment for this read. Here is the description from the Bowtie2 manual:
Quote:
XS:i:<N>
Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read.
As you can see from the bold text, this field is only present on reads that map multiple times - and therefore, uniquely mapped reads will not have this field present. Therefore you can filter out alignments that contain this field to get only the uniquely mapped reads.

I assume tophat will report the same way, since it uses bowtie2 to do the mapping, though I have not actually checked if this is true.

Credit goes to MGineste for coming up with this method to filter for only uniquely mapped reads. See this seqanswers thread for more information/the original proposal of this filter and some ways to implement the filter: http://seqanswers.com/forums/showthread.php?t=23389
CraigJ is offline   Reply With Quote
Old 11-20-2012, 12:30 PM   #9
hoardin
Junior Member
 
Location: Illinois

Join Date: Nov 2012
Posts: 6
Default

Quote:
Originally Posted by CraigJ View Post

What hoardin was asking for, I believe, is for bowtie/tophat to report only the alignments that map uniquely to the genome (map to only 1 place on the genome), and exclude all the alignments for reads that map to multiple places on the genome.

While bowtie2 cannot be told to only report uniquely aligned reads, there is a way to filter only uniquely mapped reads from the SAM output.

Bowtie2 uses the SAM optional field XS: to report the alignment score for the second best alignment for this read. Here is the description from the Bowtie2 manual
Yes this was what I was looking for. I found a more roundabout way to do it. using -k 2, bowtie2 would show two entries for reads that align to multiple locations. Thus, I just removed all entires that weren't unique.

Thanks again everyone for the help.
hoardin is offline   Reply With Quote
Old 11-23-2012, 05:34 AM   #10
vbiaudet
Member
 
Location: Paris

Join Date: Apr 2011
Posts: 13
Default Tophat/Bowtie2 mapping

Quote:
Originally Posted by kmcarr View Post
Yes, TopHat will successfully map more RNASeq reads than Bowtie. The magnitude of that difference will vary depending on your experiment. The question is what fraction of your reads include a splice junction in a position which would cause Bowtie to fail (spices near the end of a read may be mappable). This is going to be determined by variables such as the length of your RNASeq reads and the average length of exons in the organism you are studying, more specifically the average length of exons in the transcripts represented in your RNA sample.
I have a question between a mapping with TopHat (against the genome) OR Bowtie2 (against transcripts). If we consider that the goal of a study is to count reads by transcripts, to make differential analyses, and that the model genes are good, do you think is useful to run topHat?

I say that because I consider that the mapping of reads coming from RNAseq was not very good to splicing map against genome but today perhaps topHat2 is better?

Thanks, vb
vbiaudet is offline   Reply With Quote
Old 04-06-2013, 12:07 PM   #11
Mamta
Junior Member
 
Location: US

Join Date: Sep 2012
Posts: 2
Default Bowtie2 uniquely mapping reads

Quote:
Originally Posted by hoardin View Post
Yes this was what I was looking for. I found a more roundabout way to do it. using -k 2, bowtie2 would show two entries for reads that align to multiple locations. Thus, I just removed all entires that weren't unique.

Thanks again everyone for the help.

I am dealing with something similar, Would like to know how did you remove those reads?.
Mamta is offline   Reply With Quote
Old 04-06-2013, 02:02 PM   #12
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

If you are worried about the multi-mapping reads, why dont you choose an aligner that can report a unique mapping location for each read at the first place? There are many aligners that report a single location for each read (and has the option of reporting no locations if the read can be best mapped to more than one location).

For example, you may have a look at the Subread package (http://subread.sourceforge.net). It performs particularly well in RNA-seq data because it does not require end-to-end alignments. It uses the largest mappable region in the read to determine its mapping location, although it tries its best to get the end-to-end alignment for the read if possible. It also includes a Subjunc program that performs full alignment for junction reads spanning multiple exons (ie. mapping locations of each read base will be determined). By default, both Subread and Subjunc return at most 1 mapping location for each read.

Cheers,
Wei
shi is offline   Reply With Quote
Old 04-10-2013, 05:18 AM   #13
vbiaudet
Member
 
Location: Paris

Join Date: Apr 2011
Posts: 13
Default remove perfect multiple hits

Quote:
Originally Posted by Mamta View Post
I am dealing with something similar, Would like to know how did you remove those reads?.
yes, you didn't need to keep the second hit just know its score.
You just need to run mapper with option the best hit and after see the XS:i:score (the score of the second hit), if its' the same than AS:i:score then you have a perfect repeat.

bye, vb
vbiaudet is offline   Reply With Quote
Old 05-24-2013, 10:28 AM   #14
yingzhang
Junior Member
 
Location: Minneapolis

Join Date: Feb 2012
Posts: 9
Default

For Tophat/1.0 + version, tophat outputed a MAPQ value of 255 to indicate the mapping is unique or not. So for bam file, you can ran "samtools view -q 255" to extract all uniquely mapped reads.

Unfortunately, Tophat/2.0+ version changed this rule. So I use the following command to extract all reads that are uniquely mapped:
samtools view accepted_hits.bam | grep -v "HI:i:"

I don't think the sam tag XS will work. I also had the impression that XS is the score for the next hit, but when I checked the real ouput, I got something like this "XS:A:-", which is definitely not a score but a character. So I suspect Tophat uses XS differently. Finally, after reading the doc carefully, I found filtering on "NH" or "HI" tag is much more useful. NH:i:1 means the reads just mapped once, in other words, uniquely mapped. If NH:i:2, then the read will appear twice in the bam file, with one using HI:i:0, the other one using HI:i:1 tag. So I decided to filter bam output on NH and HI tag.

Anyway, just my 2 cents.

Last edited by yingzhang; 05-24-2013 at 10:36 AM.
yingzhang is offline   Reply With Quote
Old 05-25-2013, 01:50 PM   #15
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Tophat only produces a few MAPQ values since they use a rounded -10log10(uncertainty) calculation where the uncertainty are values like 1/2 for a read that could align 2 times, 2/3 for a read than can align 3 times and so on. If you plug 1/2 into that forumla and round it you get 3. So if you filter the tophat alignments for anything with a MAPQ greater than 3 you've extracted all of the unique alignments. STAR does the same thing. Bowtie2 uses a MAPQ scale that correlates with "uniqueness" of alignment. I'm not sure how it works exactly, though. Also they define uniqueness as an alignment that "has a much higher alignment score than all the other possible alignments".

When you're dealing with looking for reads that align "uniquely" take into consideration that you've also got to put some additional restrictions in place. For example you might say "reads that align uniquely with up to 4 mismatches" or those with no mismatches in the seed but up to 2 in the remaining read bases. With gapped aligners it gets more complicated because then you're talking about a mixture of mismatches and indels so you might move to the edit distance concept which combines the two counts. Not to say that there aren't a lot of unique transcript sequence - obviously there is - but base call errors and divergence in your sample's genome from the reference you're aligning to can cause some confusion for the aligners and kind of obscure the concept of "unique alignment". Also length of reads plays into this. A 25 bp read has a much higher probability of multi-aligning verses a 100 bp read. So in the end you're full condition will include some threshold in edit distance at whatever read length you're using. You might try arbitrarily picking the edit distance cutoff and then trying a few other alternatives to see what sort of impact in the final passing alignment count there is...to help you figure out if your cutoff is producing a reasonably stable result.
sdriscoll is offline   Reply With Quote
Old 05-25-2013, 02:01 PM   #16
Zapages
Member
 
Location: NJ

Join Date: Oct 2012
Posts: 94
Default

Also I would suggest looking into Geneious Map Reader, it allows for 1 reads: http://www.geneious.com/assets/docum...ReadMapper.pdf
Zapages is offline   Reply With Quote
Old 06-03-2013, 02:43 PM   #17
slengyel
Junior Member
 
Location: Philadelphia, Pa

Join Date: Dec 2012
Posts: 7
Default Bowtie2 segv

Greetings,

I've been having trouble find the proper post, but I'll see what everyone thinks here. I've been trying to run Bowtie2 on a set of Illumina reads. The command I used after running bowtie-build (There seem to have been no issues building the index) is :

bowtie2 -a --no-mixed -p 16 --phred 33 -q -x <prefix> -1 <1.fastq> -2 <2.fastq> -S <.sam>

I specify the paths to each fastq file and for the same output.

The error I'm recieving is the following:

bowtie2-align dided with the signal 11 (SEGV) (core dumped)

When compiling, I used make BITS=64 to pass the -m64 option to g++, so I don't think it is a compiler issue. The server is has 256gb of RAM, and 51.9gb of swap memory, so I don't think memory is an issue.

Any and all help is appreciated.

Best,
SL
slengyel is offline   Reply With Quote
Old 06-03-2013, 04:58 PM   #18
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

is that a typeo in your command? '--phred 33' instead of '--phred33' ?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-03-2013, 05:08 PM   #19
slengyel
Junior Member
 
Location: Philadelphia, Pa

Join Date: Dec 2012
Posts: 7
Default

Sorry, yes it is. I accidentally put a space in when posting to the thread.
slengyel is offline   Reply With Quote
Old 07-28-2015, 01:36 AM   #20
harijay
Junior Member
 
Location: Cambridge, MA

Join Date: Feb 2015
Posts: 4
Default

I am seeing the error :
"bowtie2-align dided with the signal 11 (SEGV) (core dumped)"

a lot on an Ubuntu 14.04 machine with 98GB of RAM in a routine bowtie2 (2.2.5 version) run. I can get this error almost reproducibly .

When I switched OS to SUSE Linux. The same error never occurs.

Any ideas how to troubleshoot this issue. It seems to be due to the application messing up the memory management
harijay is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:10 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO