Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • TopHat and uniquely aligned reads

    Hello, I was wondering if anyone has tried running TopHat with the option max-multihits=1. From the user manual, I would expect only uniquely-mapped reads to appear in the accepted_hits.sam file, however I can usually find several reads appearing more than once. My understanding is that when max-multihits is set to 1, TopHat is supposed to discard mulitply-mapped reads and not consider them further, e.g. when mapping to junctions. Is my understanding correct? If so, any thoughts on why multi-maps are reported in the alignment?

    I am using TopHat-1.0.14 with Bowtie-0.12.5.

    Thanks.

    Bill

  • #2
    Hi bgibb

    I would expect the same after setting max-multihits=1 but I have not tried it as we were interested in multi reads. As far as finding the uniqe hits, all the reads in the sam file with mapping quality = 255 are uniques. I dont know the reason why but I got this info from Cole, the developer of this package.

    -A

    Comment


    • #3
      @bgibb i posted a message about this but earlier, you cant use --max-multihits, you have to use -g 1 (where -g is the short flag for --max-multihits) because there's a bug in the option handling in tophat.

      Comment


      • #4
        Thanks for the helpful replies.

        Regarding mapping quality 255, according to the SAM spec, 255 denotes “mapping quality is not available.” I think TopHat needs at least one 2nd-best hit in order to calculate mapping quality, so the 255 would make sense for uniquely mapped reads... assuming that TopHat follows the MAQ definition of mapping quality [Li et al, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” p. 1856].

        Comment


        • #5
          @bgibb I recently discovered reads with multiple hits in my unique-mapped tophat output (10 out of 4,480,145). I determined that the reads are mapping to the same location twice, i.e. there is an alternate mapping for those reads due to indels.

          I also have a question about the mapping quality values. Even for non-unique-mapped I am seeing mostly 255.

          Comment


          • #6
            Hi, sorry for deviating a bit from the main question, but why would the default value for this flag would be set to 40? wouldn't you want to discard multiple hits and use the op's option -g 1? I am starting to use Tophat and I am trying to set the best flags for my analysis.

            Thanks,

            Dave

            Comment


            • #7
              Hope this helps....

              Hi,
              I've been using Galaxy to analyse rna seq data from mRNA isolated from Hela cells. Like others I have the problem of reads that are multihits. I have put up a workflow for analysis on Galaxy that involves the following steps:
              1. Run the data using tophat and allow up to 40 maps per read (default)
              2. Use a samtools feature to get rid of all mappings that are not mate paired and in a proper pair.
              3. Count up how many times an individual read is in the sam file and remove all read pairs that are not mapped to a unique site, putting them in a separate "multihits" file.
              4. Keep all the uniquely mapped proper mate paired hits in a unique hits sam file.

              This approach generates more unique hits than asking tophat to throw out reads that do not uniquely map (this may have changed with the latest tophat release - I haven't checked yet). I think (and the tophat guys may correct me on this) this is because tophat may be removing reads where one end is not uniquely mapped but the other is (and therefore only makes sense with one of the mates).
              However, whatever tophat does (now or in the future) this approach does have the advantage of telling you where and how big the multihit problem is. My datset has, for example, 18 million unique proper paired reads, 1.3 million that map to two places, a few hundered thousand that map to 3 places and so on down the line.
              One problem with multihits is that we may be overestimating some genes by including multihits or conversely underestimating some genes by excluding them. This "Bristol" workflow allows us to at least know if a gene has a problem of being prone to multihits.

              I think this approach is useful but I may have missed something or be behind the curve!! Who knows but I thought it might be a useful workflow to start a discussion about what to do with multihit reads.

              Cheers
              David

              Comment


              • #8
                Originally posted by DavidMatthewsBristol View Post
                Hi,
                I've been using Galaxy to analyse rna seq data from mRNA isolated from Hela cells. Like others I have the problem of reads that are multihits. I have put up a workflow for analysis on Galaxy that involves the following steps:
                1. Run the data using tophat and allow up to 40 maps per read (default)
                2. Use a samtools feature to get rid of all mappings that are not mate paired and in a proper pair.
                3. Count up how many times an individual read is in the sam file and remove all read pairs that are not mapped to a unique site, putting them in a separate "multihits" file.
                4. Keep all the uniquely mapped proper mate paired hits in a unique hits sam file.

                This approach generates more unique hits than asking tophat to throw out reads that do not uniquely map (this may have changed with the latest tophat release - I haven't checked yet). I think (and the tophat guys may correct me on this) this is because tophat may be removing reads where one end is not uniquely mapped but the other is (and therefore only makes sense with one of the mates).
                However, whatever tophat does (now or in the future) this approach does have the advantage of telling you where and how big the multihit problem is. My datset has, for example, 18 million unique proper paired reads, 1.3 million that map to two places, a few hundered thousand that map to 3 places and so on down the line.
                One problem with multihits is that we may be overestimating some genes by including multihits or conversely underestimating some genes by excluding them. This "Bristol" workflow allows us to at least know if a gene has a problem of being prone to multihits.

                I think this approach is useful but I may have missed something or be behind the curve!! Who knows but I thought it might be a useful workflow to start a discussion about what to do with multihit reads.

                Cheers
                David
                I'm curious how you did step 3
                -Count up how many times an individual read is in the sam file and remove all read pairs that are not mapped to a unique site, putting them in a separate "multihits" file.

                Im trying to create a file with only having unique reads(no multiple hits) but am also trying to keep the pair with highest mapping score.

                Thank you,
                Teja

                Comment


                • #9
                  I don't think so.

                  The option will report only one mapping result out of all. It doesn't mean you get uniquely-mapped reads, you only get only one mapping result.

                  "If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments."





                  Originally posted by bgibb View Post
                  Hello, I was wondering if anyone has tried running TopHat with the option max-multihits=1. From the user manual, I would expect only uniquely-mapped reads to appear in the accepted_hits.sam file, however I can usually find several reads appearing more than once. My understanding is that when max-multihits is set to 1, TopHat is supposed to discard mulitply-mapped reads and not consider them further, e.g. when mapping to junctions. Is my understanding correct? If so, any thoughts on why multi-maps are reported in the alignment?

                  I am using TopHat-1.0.14 with Bowtie-0.12.5.

                  Thanks.

                  Bill

                  Comment


                  • #10
                    Yes, I think sqcrft is right. It does not discard multiple-mapping reads, but only reports one alignment for those out all all their alignments.

                    I think there used to be an option -k that would allow one to discard reads that topped x alignments -- whatever happened to that? I only see -g in the tophat 2 manual, no reporting options like before...

                    Comment


                    • #11
                      OMG this topic is really lasting throw time. I really encourage the next time traveller to take a look at this http://www.biostars.org/p/82559/.

                      In the other hand, the -g 1 parameter is well defined in the TopHat manual nowadays. You should not expect to get rid of the multiple mapping reads. Just you will have (in theory) one mapping for read. However, this is not happening (to me, at least) in TopHat 2.0.9, what is really confusing.

                      In order to get rid of multiple mappings, just point out that the -g option has a different behaviour when used with -M, which is used to do a premapping to genome, in order to filter multiple mappings. Here the -g is reallly a threshold and reads with more than one mapping (-g 1) will be filtered and will not be used in the following steps (transcriptome mapping, genome mapping, segment mapping and so on). However, even being so, one could find some multiple mappings in the accepted_hits.bam (with just one alignment reported in theory if using the -g 1 option). I guess these multiple mappings (a few, in my case) are the multiple mappings generated when mapping the reads or segments to the new junctions tested.

                      I really would like to have control over multiple and single mapping reads, but it is a tough task with the current TopHat versions.

                      Now that I am here, I will ask if any one knows exactly which tags in SAM format uses TopHat to annotate multiple mapping reads? Thank you

                      Comment


                      • #12
                        Hey,

                        there is an ongoing discussion almost everywhere. I normally use samtools to extract the unique reads from the results using:

                        samtools view -bq 1 file.bam > unique.bam

                        http://www.biostars.org/p/56246/

                        Comment


                        • #13
                          It seems too that TopHat uses the NH:i tag instead of the bowtie2 XS:i. I didn't know that. So maybe your answer could be more general than using any of the tags?

                          Comment


                          • #14
                            Originally posted by sphil View Post
                            Hey,

                            there is an ongoing discussion almost everywhere. I normally use samtools to extract the unique reads from the results using:

                            samtools view -bq 1 file.bam > unique.bam

                            http://www.biostars.org/p/56246/
                            I tried to count the MAPQ + NH:i pairs of values in my accepted_hits.bam:

                            35182251 50 1

                            So all my BAM records are unique and NH:i:1... of course not maching the align_summary file:

                            Left reads:
                            Input: 26017654
                            Mapped: 17425180 (67.0% of input)
                            of these: 885971 ( 5.1%) have multiple alignments (19321824 have >1)
                            Right reads:
                            Input: 26343552
                            Mapped: 17592344 (66.8% of input)
                            of these: 993327 ( 5.6%) have multiple alignments (19103282 have >1)
                            66.9% overall read alignment rate.

                            Aligned pairs: 13950250
                            of these: 188496 ( 1.4%) have multiple alignments
                            and: 505713 ( 3.6%) are discordant alignments
                            51.7% concordant pair alignment rate.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            8 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            8 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            49 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            66 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X