Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat 1.4.1 in -G mode

    Dear all,

    I am trying to map reads from an Illumina RNA-seq run to the human genome.
    I am using the command:
    ~/bin/tophat-1.4.1.Linux_x86_64/tophat --solexa1.3-quals --library-type fr-unstranded --bowtie-n -g 1 -N 2 -n 2 -G ~/data/GTF/ --transcriptome-index ~/data/GTF/transcriptome ~/data/indexes/genome data.fastq.gz

    Basically I am mapping to the transcriptome first, allowing up to 2 mismatches, but not putting any limit on the number of hits (-x option). This is because the GTF can contain several transcripts from the same gene that have overlapping coordinates. But a read mapping to many transcript of the same gene should have only one mapping location on the genome coordinates, this is why I used the option "-g 1" (I don't want reads mapping to many genes in the transcriptome).

    However I am not sure what tophat is exactly doing... Is it actually doing what I described? I get the impression that it might be aligning reads non uniquely to the transcriptome, without considering the -g option, and later only use the -g restriction when mapping the fragments of unmapped reads to discover new junctions.
    Do you have any experience with this mode?

    Thanks a lot for your feedback
    Julien

  • #2
    The answer is Yes.

    the -x option is what's controlling alignments to the transcriptome and it's default is 60 hits so yes, it is non-uniquely aligning reads to the transcriptome. the -g option only comes into play when tophat is aligning full length reads directly to the genome.

    So by default all of these things happen:

    alignment to the transcriptome @ 60 max alignments per read
    alignment of unaligned full length reads to the genome @ 20 max alignments per read
    segment mapping (to find junctions) @ 40 max alignments per segment
    [assembly of junctions, thinking, etc...]
    more segment mapping @ 40 max alignments per segment
    [assembly of final alignments, reporting or output tracks]

    I guess the problem is if you specify -x 1 then that runs bowtie with -m 1 -k 1 which means reads aligning to more than one location are thrown out. That means any reads aligning into genes with multiple overlapping regions between transcripts would get tossed. It's worth experimenting to see if that really happens because that seems like an unfortunate oversight. They'd need to include a post-alignment check that sorts out alignments that aligned to several transcripts but all in the same genomic location. It should keep those and throw out those that did not all align in the same genomic location.

    I have no idea if it does that or not. You'd think they would have considered that since their focus has always been on making tools capable of dealing with genes with several splice variants.

    Logically there doesn't seem to be any way around this when using the -G option (unless they did something like I just described). Maybe we can get a post from Cole or someone involved in its development to explain. For now it seems that to properly get unique alignments you'd have to go without -G.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #3
      Originally posted by sdriscoll View Post
      I guess the problem is if you specify -x 1 then that runs bowtie with -m 1 -k 1 which means reads aligning to more than one location are thrown out. That means any reads aligning into genes with multiple overlapping regions between transcripts would get tossed. It's worth experimenting to see if that really happens because that seems like an unfortunate oversight.
      Yes indeed I think this is what happens: I have just checked a run output with -x 1 and the mapping rate is only 16%. Most reads are thrown away because they map multiple locations.

      Originally posted by sdriscoll View Post
      They'd need to include a post-alignment check that sorts out alignments that aligned to several transcripts but all in the same genomic location. It should keep those and throw out those that did not all align in the same genomic location.
      I thought that the -g 1 was taking care of that, but I agree with you that it would be useful to get some feedback from the authors to be clear on this.

      Maybe a way to deal with this would be to feed tophat with a GTF including all non-redundant exons for each gene (and the union of the exons coordinates for overlapping exons of different transcripts), but this would probably interfere with splice junction discovery...

      Thanks
      Julien

      Comment


      • #4
        Julien, your understanding is correct, and just as sdriscoll explained, the -x option is directly translated into bowtie's -m and -k options for the transcriptome mapping step. That's why the default value for this option is much higher than the one for -g -- exactly because we want to account for mapping on the same genomic location due to alternative splicing or other possible transcript redundancy in the given annotation file. I agree that this also makes the -x option of very limited use, as there is little sense in changing/lowering its value, and setting it to 1 is actually a mistake - unless one is sure that there there are no alternative splicing transcripts in the annotation data given..
        They'd need to include a post-alignment check that sorts out alignments that aligned to several transcripts but all in the same genomic location.
        This actually happens [sort of] in the map2gtf module as it converts the transcriptome mappings to unique genomic coordinates, it only keeps the "unique" list of mappings for a read in terms of genomic coordinates. But indeed tophat does not check the -x threshold against the number of distinct genome mapping locations for each read mapped to the transcriptome. However the -g value will be applied eventually to these read mappings as well, in the final stage (tophat_reports) because at that stage the -g limit is applied to all reads based on the number of their distinct genomic mapping locations (even if those were obtained by transcriptome mapping originally, this does not matter at that final stage). So Julien is correct to use the -g 1 option to filter out all the non-unique read mappings by their final genomic coordinates, without interfering with the transcriptome mapping process. I guess what is missing here is a true equivalent of the -g option for the transcriptome search, to limit the number of *distinct* transcriptome mappings separately from the final genome mappings, because clearly the -x option is not doing that.

        Comment


        • #5
          Thanks Geo for your clarifications that the -g 1 option is eventually applied to all the reads. I guess that if a user wants more tuning on the transcriptome mapping step, it might be preferable to use Bowtie directly...
          And thanks for your work on tophat!
          Best,
          Julien

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          12 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          51 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          68 views
          0 likes
          Last Post seqadmin  
          Working...
          X