  • TopHat internals (splitting longer reads)

    Hello folks,

    In the Cufflinks paper it is stated that TopHat splits reads of 75 bp or longer:

    TopHat version 1.0.7 and later splits a read 75 bp or longer in three or more segments of approximately equal size (25 bp) and maps them independently.
    But I thought that longer reads are better, because they are "more unique" and therefore more likely to align to only a single position.
    Why is it useful for TopHat to split longer reads? I could imagine that it helps to find splice sites in some way, but I thought that is done in the other steps.

    Also, does TopHat keep track of which segments came from the same read, so that it can infer a splice junction if they do not align contiguously (but in the right order and separated by roughly an intron length)?

    What happens when the segments align to scattered places across the genome with no coherence? Then the result would be biased. I think this is more likely to happen with segments of only 25 bp.

    I didn't find any information on this, so maybe someone can help. The original TopHat paper does not mention this feature.

    Thanks in advance,
    Oliver

  • #2
    I think the TopHat approach is OUTDATED by now. It is much better to use dedicated split-read aligners (like SoapSplice), which map each read across splice junctions directly, rather than a "post-mortem" heuristic approach that tries to find spliced reads.

    We have been using SoapSplice and seeing much better results, especially when exons are short and split-read mapping makes a big difference.
    --------------------------------------
    Elia Stupka
    Co-Director and Head of Unit
    Center for Translational Genomics and Bioinformatics
    San Raffaele Scientific Institute
    Via Olgettina 58
    20132 Milano
    Italy
    ---------------------------------------



    • #3
      Hello eslondon,

      thanks for the input. It didn't answer my question, though, but I will try the program. What do you use for assembly (is there a counterpart to Cufflinks)?

      I have to admit that the first results I got with TopHat are pretty satisfying. I'll check what SoapSplice delivers. I used TopHat because it seems to be very common. Any other opinions?
      Last edited by ocs; 08-10-2011, 11:58 PM. Reason: SoapSplit -> SoapSplice



      • #4
        Will try to answer your questions properly

        > But I thought that having longer reads is better because it is more likely that they will align to only a single position because they are "more unique".
        > Why is it useful that TopHat splits longer reads? I could imagine that its better to find splice sites in some way but I thought that is done in the other steps.

        TopHat splits longer reads because of the issue that I mentioned: longer reads have a higher chance of being unique in the genome, but also a higher chance of spanning a splice junction. Since its underlying aligner (Bowtie) does not handle splice junctions itself, TopHat has to resort to splitting reads and mapping them to "potential" splice sites.
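To put rough numbers on the "higher chance of aligning on a splice junction" point, here is a small Python sketch. It is only an illustration under simple assumptions: one read starts at every possible position of a spliced transcript, and the exon lengths and the function name `fraction_spanning_junction` are made up for the example.

```python
def fraction_spanning_junction(exon_lengths, read_len):
    """Fraction of read start positions on a spliced transcript whose read
    crosses at least one exon-exon junction (uniform starts assumed)."""
    transcript_len = sum(exon_lengths)
    n_starts = transcript_len - read_len + 1
    if n_starts <= 0:
        raise ValueError("read is longer than the transcript")

    # Junction offsets in transcript coordinates (first base of the next exon).
    junctions = []
    offset = 0
    for length in exon_lengths[:-1]:
        offset += length
        junctions.append(offset)

    # A read [s, s + read_len) crosses junction j if it covers bases j-1 and j.
    spanning = sum(
        1
        for s in range(n_starts)
        if any(s < j < s + read_len for j in junctions)
    )
    return spanning / n_starts


# Hypothetical transcript with exons of 200, 150 and 300 bp.
exons = [200, 150, 300]
for read_len in (25, 75):
    frac = fraction_spanning_junction(exons, read_len)
    print(f"{read_len} bp reads: {frac:.1%} span a junction")
```

For this made-up transcript, 48 of 626 start positions span a junction with 25 bp reads, against 148 of 576 with 75 bp reads, so the longer reads are roughly three times as likely to need a spliced alignment.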

        Bear in mind that TopHat deals specifically with reads that did not align well with the first step (Bowtie), i.e. reads that are not mapping linearly to the genome, and are thus likely to be spliced.
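A toy sketch of the segment idea, in Python. This is not TopHat's code: exact string matching stands in for Bowtie, the function names and the `max_intron` cutoff are assumptions for illustration, and the sketch only detects a junction that falls exactly on a segment boundary.

```python
import random


def split_read(read, segment_len=25):
    """Split a read into segments of roughly equal size, close to segment_len."""
    n_segments = max(1, len(read) // segment_len)
    base, extra = divmod(len(read), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        size = base + (1 if i < extra else 0)
        segments.append(read[start:start + size])
        start += size
    return segments


def map_segment(genome, segment):
    """Toy stand-in for an ungapped aligner: first exact match, or None."""
    pos = genome.find(segment)
    return pos if pos != -1 else None


def infer_junctions(segments, positions, max_intron=10000):
    """Report (donor_end, acceptor_start) for neighbouring segments that map
    in order but are separated by a gap of plausible intron length."""
    junctions = []
    for i in range(len(segments) - 1):
        a, b = positions[i], positions[i + 1]
        if a is None or b is None:
            continue  # a junction inside a segment is not handled here
        end_a = a + len(segments[i])
        if 0 < b - end_a <= max_intron:
            junctions.append((end_a, b))
    return junctions


# Toy genome: exon1 (60 bp) + intron (100 bp) + exon2 (40 bp), random bases.
rng = random.Random(0)
exon1 = "".join(rng.choices("ACGT", k=60))
intron = "".join(rng.choices("ACGT", k=100))
exon2 = "".join(rng.choices("ACGT", k=40))
genome = exon1 + intron + exon2

# A 75 bp spliced read: last 50 bp of exon1 joined to the first 25 bp of exon2.
read = exon1[-50:] + exon2[:25]
segments = split_read(read)
positions = [map_segment(genome, s) for s in segments]
junctions = infer_junctions(segments, positions)
print(positions, junctions)
```

The full 75 bp read has no contiguous match in the toy genome, but its three 25 bp segments map at 10, 35 and 160. The first two are contiguous and the third lies 100 bp further on, so the sketch reports one candidate junction from 60 to 160. Because the segments keep their order within the read, scattered hits with no coherence would simply produce no junction.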

        By the way I am not a developer of SoapSplice, and I also used TopHat and Cufflinks until recently, but got very frustrated with these issues.

        SoapSplice is just an aligner; after the alignment you can run whichever tool you like to estimate genes, expression values, etc.

        Any aligner/tool that gives you a "triangular" coverage profile (in the wiggle plot) on each exon is missing reads close to the splice junctions.
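The "triangular" shape is easy to reproduce with a toy model. In this Python sketch (the function name and the numbers are assumptions, not taken from any aligner) one read starts at every position of a transcript, and an aligner that cannot place junction-spanning reads simply drops them.

```python
def exon_coverage(exon_len, read_len, flank, allow_spliced):
    """Per-base coverage over one internal exon.

    The transcript is flank + exon + flank bases long, where each flank
    stands for the neighbouring exon. One read starts at every position.
    With allow_spliced=False, reads crossing either exon boundary are lost.
    """
    transcript_len = flank + exon_len + flank
    exon_start, exon_end = flank, flank + exon_len
    coverage = [0] * exon_len
    for start in range(transcript_len - read_len + 1):
        end = start + read_len
        crosses = start < exon_start < end or start < exon_end < end
        if crosses and not allow_spliced:
            continue
        # Count only the part of the read that lies inside the exon.
        for pos in range(max(start, exon_start), min(end, exon_end)):
            coverage[pos - exon_start] += 1
    return coverage


unspliced = exon_coverage(exon_len=10, read_len=5, flank=10, allow_spliced=False)
spliced = exon_coverage(exon_len=10, read_len=5, flank=10, allow_spliced=True)
print("without junction reads:", unspliced)
print("with junction reads:   ", spliced)
```

Without the junction-spanning reads the exon coverage comes out as 1, 2, 3, 4, 5, 5, 4, 3, 2, 1, which is the triangle. With them it is a flat 5 across the whole exon. The shorter the exon relative to the read length, the more of it sits on the slopes.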

        best

        Elia



        • #5
          Hello eslondon,

          thanks for your quick reply. I had a look at the SoapSplice paper and it seems to me that the approach is pretty much the same as TopHat's, so I can't see that the TopHat approach is out of date. But in an overview SoapSplice scored best (call rate) among all RNA-Seq aligners, so it's worth a look :-)



          • #6
            Hi ocs,

            Not quite, they are very different tools. As you will see in the paper, SOAPSplice has (amongst other things) a real splice aligner, i.e. an aligner (like PALmapper) which aligns a read across junctions. Bowtie cannot perform gapped alignment, so TopHat is not a "split aligner"; it just simulates a split alignment.

            We have compared TopHat and SoapSplice on the same data set, with several paired-end RNA-Seq samples, and the results are very different: SoapSplice gave much more uniform coverage of all exons and lovely mapping of split reads.

            Have fun!

            Elia
