Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bowtie2: Uniquely mapped reads

    Hi.

    I've been searching everywhere for this, and I can't find a clear answer.

    I am mapping RNAseq reads to the human genome using Bowtie2. I am only looking for reads that align ONLY ONCE to the genome. By default Bowtie2 reports all the reads that are aligned. Is there an option so I only have unique mapping reads in my sam file?

    Any help would be appreciated. Thanks!

  • #2
    First, use the right tool for the job. Bowtie (1 or 2) is not appropriate for mapping RNASeq reads to a genome reference. Bowtie does not account for RNA splicing, it expects reads to map contiguously to the reference. You should be using TopHat, which is really a wrapper around Bowtie but it permits splitting up of RNASeq reads so that they can be aligned to a genomic reference.

    Now to your question look a the TopHat manual and study the "-g/--max-multihits <int>" option; it does what you want.

    Comment


    • #3
      Thank you for your response!

      Just out of curiosity, what would be the difference in the mapping using tophat vs just bowtie(1/2)?

      Comment


      • #4
        Originally posted by hoardin View Post
        Just out of curiosity, what would be the difference in the mapping using tophat vs just bowtie(1/2)?
        Bowtie expects reads to map contiguously to the reference, allowing only for SNPs or small indels. An RNASeq read from a processed transcript may contain sequence from multiple exons separated by hundreds or thousands of base pairs in genomic coordinate space. Bowtie would fail to align these reads. TopHat uses a number of methods to identify read segments originating from different exons, split the reads and map the segments to their originating exon.

        Comment


        • #5
          Thanks again.

          I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?

          Comment


          • #6
            Originally posted by hoardin View Post
            I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?
            Yes, TopHat will successfully map more RNASeq reads than Bowtie. The magnitude of that difference will vary depending on your experiment. The question is what fraction of your reads include a splice junction in a position which would cause Bowtie to fail (spices near the end of a read may be mappable). This is going to be determined by variables such as the length of your RNASeq reads and the average length of exons in the organism you are studying, more specifically the average length of exons in the transcripts represented in your RNA sample.

            Comment


            • #7
              I'll edit this post and say that it is basically a "ditto" of KMCarr's post above ...
              Originally posted by hoardin View Post
              Thanks again.

              I was wondering percentage wise, what the difference in mapping successfully be? so TopHat would actually report more successes?
              Percentage will depend on your organism. Multiexonic and alternative splicing percentages and the sizes of exons (it matter if your reads are bigger than the average exon size) vary between the kingdoms and the species.

              In general Tophat would report more successes.

              Let's see, you are mapping to human. As a wild guess -- and I do not deal with human very much -- I'd say that you are probably missing 50% of your potential mapped reads.
              Last edited by westerman; 11-20-2012, 07:37 AM. Reason: Acknowledge KMCarr

              Comment


              • #8
                Originally posted by kmcarr View Post
                Now to your question look a the TopHat manual and study the "-g/--max-multihits <int>" option; it does what you want.
                I do not believe this is what hoardin wanted. This option will still report the reads that align to the genome in multiple places - it will just report the <int> number of alignments. So if hoardin uses -g 1, the reads that align 2+ times to the genome will still be reported, but only the top 1 alignment will be reported.

                What hoardin was asking for, I believe, is for bowtie/tophat to report only the alignments that map uniquely to the genome (map to only 1 place on the genome), and exclude all the alignments for reads that map to multiple places on the genome.

                While bowtie2 cannot be told to only report uniquely aligned reads, there is a way to filter only uniquely mapped reads from the SAM output.

                Bowtie2 uses the SAM optional field XS: to report the alignment score for the second best alignment for this read. Here is the description from the Bowtie2 manual:
                XS:i:<N>
                Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read.
                As you can see from the bold text, this field is only present on reads that map multiple times - and therefore, uniquely mapped reads will not have this field present. Therefore you can filter out alignments that contain this field to get only the uniquely mapped reads.

                I assume tophat will report the same way, since it uses bowtie2 to do the mapping, though I have not actually checked if this is true.

                Credit goes to MGineste for coming up with this method to filter for only uniquely mapped reads. See this seqanswers thread for more information/the original proposal of this filter and some ways to implement the filter: http://seqanswers.com/forums/showthread.php?t=23389

                Comment


                • #9
                  Originally posted by CraigJ View Post

                  What hoardin was asking for, I believe, is for bowtie/tophat to report only the alignments that map uniquely to the genome (map to only 1 place on the genome), and exclude all the alignments for reads that map to multiple places on the genome.

                  While bowtie2 cannot be told to only report uniquely aligned reads, there is a way to filter only uniquely mapped reads from the SAM output.

                  Bowtie2 uses the SAM optional field XS: to report the alignment score for the second best alignment for this read. Here is the description from the Bowtie2 manual
                  Yes this was what I was looking for. I found a more roundabout way to do it. using -k 2, bowtie2 would show two entries for reads that align to multiple locations. Thus, I just removed all entires that weren't unique.

                  Thanks again everyone for the help.

                  Comment


                  • #10
                    Tophat/Bowtie2 mapping

                    Originally posted by kmcarr View Post
                    Yes, TopHat will successfully map more RNASeq reads than Bowtie. The magnitude of that difference will vary depending on your experiment. The question is what fraction of your reads include a splice junction in a position which would cause Bowtie to fail (spices near the end of a read may be mappable). This is going to be determined by variables such as the length of your RNASeq reads and the average length of exons in the organism you are studying, more specifically the average length of exons in the transcripts represented in your RNA sample.
                    I have a question between a mapping with TopHat (against the genome) OR Bowtie2 (against transcripts). If we consider that the goal of a study is to count reads by transcripts, to make differential analyses, and that the model genes are good, do you think is useful to run topHat?

                    I say that because I consider that the mapping of reads coming from RNAseq was not very good to splicing map against genome but today perhaps topHat2 is better?

                    Thanks, vb

                    Comment


                    • #11
                      Bowtie2 uniquely mapping reads

                      Originally posted by hoardin View Post
                      Yes this was what I was looking for. I found a more roundabout way to do it. using -k 2, bowtie2 would show two entries for reads that align to multiple locations. Thus, I just removed all entires that weren't unique.

                      Thanks again everyone for the help.

                      I am dealing with something similar, Would like to know how did you remove those reads?.

                      Comment


                      • #12
                        If you are worried about the multi-mapping reads, why dont you choose an aligner that can report a unique mapping location for each read at the first place? There are many aligners that report a single location for each read (and has the option of reporting no locations if the read can be best mapped to more than one location).

                        For example, you may have a look at the Subread package (http://subread.sourceforge.net). It performs particularly well in RNA-seq data because it does not require end-to-end alignments. It uses the largest mappable region in the read to determine its mapping location, although it tries its best to get the end-to-end alignment for the read if possible. It also includes a Subjunc program that performs full alignment for junction reads spanning multiple exons (ie. mapping locations of each read base will be determined). By default, both Subread and Subjunc return at most 1 mapping location for each read.

                        Cheers,
                        Wei

                        Comment


                        • #13
                          remove perfect multiple hits

                          Originally posted by Mamta View Post
                          I am dealing with something similar, Would like to know how did you remove those reads?.
                          yes, you didn't need to keep the second hit just know its score.
                          You just need to run mapper with option the best hit and after see the XS:i:score (the score of the second hit), if its' the same than AS:i:score then you have a perfect repeat.

                          bye, vb

                          Comment


                          • #14
                            For Tophat/1.0 + version, tophat outputed a MAPQ value of 255 to indicate the mapping is unique or not. So for bam file, you can ran "samtools view -q 255" to extract all uniquely mapped reads.

                            Unfortunately, Tophat/2.0+ version changed this rule. So I use the following command to extract all reads that are uniquely mapped:
                            samtools view accepted_hits.bam | grep -v "HI:i:"

                            I don't think the sam tag XS will work. I also had the impression that XS is the score for the next hit, but when I checked the real ouput, I got something like this "XS:A:-", which is definitely not a score but a character. So I suspect Tophat uses XS differently. Finally, after reading the doc carefully, I found filtering on "NH" or "HI" tag is much more useful. NH:i:1 means the reads just mapped once, in other words, uniquely mapped. If NH:i:2, then the read will appear twice in the bam file, with one using HI:i:0, the other one using HI:i:1 tag. So I decided to filter bam output on NH and HI tag.

                            Anyway, just my 2 cents.
                            Last edited by yingzhang; 05-24-2013, 10:36 AM.

                            Comment


                            • #15
                              Tophat only produces a few MAPQ values since they use a rounded -10log10(uncertainty) calculation where the uncertainty are values like 1/2 for a read that could align 2 times, 2/3 for a read than can align 3 times and so on. If you plug 1/2 into that forumla and round it you get 3. So if you filter the tophat alignments for anything with a MAPQ greater than 3 you've extracted all of the unique alignments. STAR does the same thing. Bowtie2 uses a MAPQ scale that correlates with "uniqueness" of alignment. I'm not sure how it works exactly, though. Also they define uniqueness as an alignment that "has a much higher alignment score than all the other possible alignments".

                              When you're dealing with looking for reads that align "uniquely" take into consideration that you've also got to put some additional restrictions in place. For example you might say "reads that align uniquely with up to 4 mismatches" or those with no mismatches in the seed but up to 2 in the remaining read bases. With gapped aligners it gets more complicated because then you're talking about a mixture of mismatches and indels so you might move to the edit distance concept which combines the two counts. Not to say that there aren't a lot of unique transcript sequence - obviously there is - but base call errors and divergence in your sample's genome from the reference you're aligning to can cause some confusion for the aligners and kind of obscure the concept of "unique alignment". Also length of reads plays into this. A 25 bp read has a much higher probability of multi-aligning verses a 100 bp read. So in the end you're full condition will include some threshold in edit distance at whatever read length you're using. You might try arbitrarily picking the edit distance cutoff and then trying a few other alternatives to see what sort of impact in the final passing alignment count there is...to help you figure out if your cutoff is producing a reasonably stable result.
                              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                              Salk Institute for Biological Studies, La Jolla, CA, USA */

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X