Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • hoardin
    Junior Member
    • Nov 2012
    • 6

    Bowtie2: Uniquely mapped reads

    Hi.

    I've been searching everywhere for this, and I can't find a clear answer.

    I am mapping RNAseq reads to the human genome using Bowtie2. I am only looking for reads that align ONLY ONCE to the genome. By default Bowtie2 reports all the reads that are aligned. Is there an option so I only have unique mapping reads in my sam file?

    Any help would be appreciated. Thanks!
  • kmcarr
    Senior Member
    • May 2008
    • 1181

    #2
    First, use the right tool for the job. Bowtie (1 or 2) is not appropriate for mapping RNASeq reads to a genome reference. Bowtie does not account for RNA splicing, it expects reads to map contiguously to the reference. You should be using TopHat, which is really a wrapper around Bowtie but it permits splitting up of RNASeq reads so that they can be aligned to a genomic reference.

    Now to your question look a the TopHat manual and study the "-g/--max-multihits <int>" option; it does what you want.

    Comment

    • hoardin
      Junior Member
      • Nov 2012
      • 6

      #3
      Thank you for your response!

      Just out of curiosity, what would be the difference in the mapping using tophat vs just bowtie(1/2)?

      Comment

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Originally posted by hoardin View Post
        Just out of curiosity, what would be the difference in the mapping using tophat vs just bowtie(1/2)?
        Bowtie expects reads to map contiguously to the reference, allowing only for SNPs or small indels. An RNASeq read from a processed transcript may contain sequence from multiple exons separated by hundreds or thousands of base pairs in genomic coordinate space. Bowtie would fail to align these reads. TopHat uses a number of methods to identify read segments originating from different exons, split the reads and map the segments to their originating exon.

        Comment

        • hoardin
          Junior Member
          • Nov 2012
          • 6

          #5
          Thanks again.

          I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?

          Comment

          • kmcarr
            Senior Member
            • May 2008
            • 1181

            #6
            Originally posted by hoardin View Post
            I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?
            Yes, TopHat will successfully map more RNASeq reads than Bowtie. The magnitude of that difference will vary depending on your experiment. The question is what fraction of your reads include a splice junction in a position which would cause Bowtie to fail (spices near the end of a read may be mappable). This is going to be determined by variables such as the length of your RNASeq reads and the average length of exons in the organism you are studying, more specifically the average length of exons in the transcripts represented in your RNA sample.

            Comment

            • westerman
              Rick Westerman
              • Jun 2008
              • 1104

              #7
              I'll edit this post and say that it is basically a "ditto" of KMCarr's post above ...
              Originally posted by hoardin View Post
              Thanks again.

              I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?
              Percentage will depend on your organism. Multiexonic and alternative splicing percentages and the sizes of exons (it matter if your reads are bigger than the average exon size) vary between the kingdoms and the species.

              In general Tophat would report more successes.

              Let's see, you are mapping to human. As a wild guess -- and I do not deal with human very much -- I'd say that you are probably missing 50% of your potential mapped reads.
              Last edited by westerman; 11-20-2012, 07:37 AM. Reason: Acknowledge KMCarr

              Comment

              • CraigJ
                Junior Member
                • Jul 2012
                • 8

                #8
                Originally posted by kmcarr View Post
                Now to your question look a the TopHat manual and study the "-g/--max-multihits <int>" option; it does what you want.
                I do not believe this is what hoardin wanted. This option will still report the reads that align to the genome in multiple places - it will just report the <int> number of alignments. So if hoardin uses -g 1, the reads that align 2+ times to the genome will still be reported, but only the top 1 alignment will be reported.

                What hoardin was asking for, I believe, is for bowtie/tophat to report only the alignments that map uniquely to the genome (map to only 1 place on the genome), and exclude all the alignments for reads that map to multiple places on the genome.

                While bowtie2 cannot be told to only report uniquely aligned reads, there is a way to filter only uniquely mapped reads from the SAM output.

                Bowtie2 uses the SAM optional field XS: to report the alignment score for the second best alignment for this read. Here is the description from the Bowtie2 manual:
                XS:i:<N>
                Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read.
                As you can see from the bold text, this field is only present on reads that map multiple times - and therefore, uniquely mapped reads will not have this field present. Therefore you can filter out alignments that contain this field to get only the uniquely mapped reads.

                I assume tophat will report the same way, since it uses bowtie2 to do the mapping, though I have not actually checked if this is true.

                Credit goes to MGineste for coming up with this method to filter for only uniquely mapped reads. See this seqanswers thread for more information/the original proposal of this filter and some ways to implement the filter: http://seqanswers.com/forums/showthread.php?t=23389

                Comment

                • hoardin
                  Junior Member
                  • Nov 2012
                  • 6

                  #9
                  Originally posted by CraigJ View Post

                  What hoardin was asking for, I believe, is for bowtie/tophat to report only the alignments that map uniquely to the genome (map to only 1 place on the genome), and exclude all the alignments for reads that map to multiple places on the genome.

                  While bowtie2 cannot be told to only report uniquely aligned reads, there is a way to filter only uniquely mapped reads from the SAM output.

                  Bowtie2 uses the SAM optional field XS: to report the alignment score for the second best alignment for this read. Here is the description from the Bowtie2 manual
                  Yes this was what I was looking for. I found a more roundabout way to do it. using -k 2, bowtie2 would show two entries for reads that align to multiple locations. Thus, I just removed all entires that weren't unique.

                  Thanks again everyone for the help.

                  Comment

                  • vbiaudet
                    Member
                    • Apr 2011
                    • 14

                    #10
                    Tophat/Bowtie2 mapping

                    Originally posted by kmcarr View Post
                    Yes, TopHat will successfully map more RNASeq reads than Bowtie. The magnitude of that difference will vary depending on your experiment. The question is what fraction of your reads include a splice junction in a position which would cause Bowtie to fail (spices near the end of a read may be mappable). This is going to be determined by variables such as the length of your RNASeq reads and the average length of exons in the organism you are studying, more specifically the average length of exons in the transcripts represented in your RNA sample.
                    I have a question between a mapping with TopHat (against the genome) OR Bowtie2 (against transcripts). If we consider that the goal of a study is to count reads by transcripts, to make differential analyses, and that the model genes are good, do you think is useful to run topHat?

                    I say that because I consider that the mapping of reads coming from RNAseq was not very good to splicing map against genome but today perhaps topHat2 is better?

                    Thanks, vb

                    Comment

                    • Mamta
                      Junior Member
                      • Sep 2012
                      • 2

                      #11
                      Bowtie2 uniquely mapping reads

                      Originally posted by hoardin View Post
                      Yes this was what I was looking for. I found a more roundabout way to do it. using -k 2, bowtie2 would show two entries for reads that align to multiple locations. Thus, I just removed all entires that weren't unique.

                      Thanks again everyone for the help.

                      I am dealing with something similar, Would like to know how did you remove those reads?.

                      Comment

                      • shi
                        Wei Shi
                        • Feb 2010
                        • 236

                        #12
                        If you are worried about the multi-mapping reads, why dont you choose an aligner that can report a unique mapping location for each read at the first place? There are many aligners that report a single location for each read (and has the option of reporting no locations if the read can be best mapped to more than one location).

                        For example, you may have a look at the Subread package (http://subread.sourceforge.net). It performs particularly well in RNA-seq data because it does not require end-to-end alignments. It uses the largest mappable region in the read to determine its mapping location, although it tries its best to get the end-to-end alignment for the read if possible. It also includes a Subjunc program that performs full alignment for junction reads spanning multiple exons (ie. mapping locations of each read base will be determined). By default, both Subread and Subjunc return at most 1 mapping location for each read.

                        Cheers,
                        Wei

                        Comment

                        • vbiaudet
                          Member
                          • Apr 2011
                          • 14

                          #13
                          remove perfect multiple hits

                          Originally posted by Mamta View Post
                          I am dealing with something similar, Would like to know how did you remove those reads?.
                          yes, you didn't need to keep the second hit just know its score.
                          You just need to run mapper with option the best hit and after see the XS:i:score (the score of the second hit), if its' the same than AS:i:score then you have a perfect repeat.

                          bye, vb

                          Comment

                          • yingzhang
                            Junior Member
                            • Feb 2012
                            • 9

                            #14
                            For Tophat/1.0 + version, tophat outputed a MAPQ value of 255 to indicate the mapping is unique or not. So for bam file, you can ran "samtools view -q 255" to extract all uniquely mapped reads.

                            Unfortunately, Tophat/2.0+ version changed this rule. So I use the following command to extract all reads that are uniquely mapped:
                            samtools view accepted_hits.bam | grep -v "HI:i:"

                            I don't think the sam tag XS will work. I also had the impression that XS is the score for the next hit, but when I checked the real ouput, I got something like this "XS:A:-", which is definitely not a score but a character. So I suspect Tophat uses XS differently. Finally, after reading the doc carefully, I found filtering on "NH" or "HI" tag is much more useful. NH:i:1 means the reads just mapped once, in other words, uniquely mapped. If NH:i:2, then the read will appear twice in the bam file, with one using HI:i:0, the other one using HI:i:1 tag. So I decided to filter bam output on NH and HI tag.

                            Anyway, just my 2 cents.
                            Last edited by yingzhang; 05-24-2013, 10:36 AM.

                            Comment

                            • sdriscoll
                              I like code
                              • Sep 2009
                              • 436

                              #15
                              Tophat only produces a few MAPQ values since they use a rounded -10log10(uncertainty) calculation where the uncertainty are values like 1/2 for a read that could align 2 times, 2/3 for a read than can align 3 times and so on. If you plug 1/2 into that forumla and round it you get 3. So if you filter the tophat alignments for anything with a MAPQ greater than 3 you've extracted all of the unique alignments. STAR does the same thing. Bowtie2 uses a MAPQ scale that correlates with "uniqueness" of alignment. I'm not sure how it works exactly, though. Also they define uniqueness as an alignment that "has a much higher alignment score than all the other possible alignments".

                              When you're dealing with looking for reads that align "uniquely" take into consideration that you've also got to put some additional restrictions in place. For example you might say "reads that align uniquely with up to 4 mismatches" or those with no mismatches in the seed but up to 2 in the remaining read bases. With gapped aligners it gets more complicated because then you're talking about a mixture of mismatches and indels so you might move to the edit distance concept which combines the two counts. Not to say that there aren't a lot of unique transcript sequence - obviously there is - but base call errors and divergence in your sample's genome from the reference you're aligning to can cause some confusion for the aligners and kind of obscure the concept of "unique alignment". Also length of reads plays into this. A 25 bp read has a much higher probability of multi-aligning verses a 100 bp read. So in the end you're full condition will include some threshold in edit distance at whatever read length you're using. You might try arbitrarily picking the edit distance cutoff and then trying a few other alternatives to see what sort of impact in the final passing alignment count there is...to help you figure out if your cutoff is producing a reasonably stable result.
                              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                              Salk Institute for Biological Studies, La Jolla, CA, USA */

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              25 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              30 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              39 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              62 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...