  • Trinity transcript naming

    I don't know if I am missing something obvious but I can't find an explanation anywhere as to how Trinity RNA-seq names the transcripts it derives.

    I've noticed that when I annotate with BLAST+, most of the transcripts that start with the same number (comp#_) annotate as the same gene (isoforms?). They all have different ending numbers (_seq#), and some of them have different numbers in the middle (_c#_). For example: comp2335_c1_seq1, comp2335_c1_seq2, comp2335_c1_seq3.

    Are these names derived sequentially by inchworm (comp#) chrysalis (c#) then butterfly (seq#)?

    Can anyone verify whether these would indeed be considered isoforms of a single gene within a comp# group, or is there something else at work here?

    Thanks,
    -Mike

  • #2
    You're almost correct. The names are indeed based on the algorithm, but it's not quite as simple as the inchworm number, then the chrysalis number, then the butterfly number.

    This is the basic algorithm of Trinity, assuming that you've already built a de Bruijn assembly graph (which is identical to a (k+1)-mer count):
    1. Eagerly extract contigs from the de Bruijn graph. These contigs may or may not have any relationship to "real" transcripts; they're just whatever long contiguous paths happen to be found in the graph.
    2. Find reads which justify clustering/joining ("welding" in Trinity terminology) these contigs together. A set of contigs which are believed to belong together is called a "component".
    3. Align reads to components. For each read, decide which component it's most likely to belong to.
    4. For each component, treat the reads which map to that component as a separate assembly problem. This involves constructing a new (smaller) de Bruijn graph from only those reads which belong to a component.
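Step 1 above can be sketched roughly as follows. This is a simplified illustration of greedy contig extraction from a k-mer table, not Trinity's actual code: it ignores reverse complements, error pruning and tie-breaking rules, and the function names (`count_kmers`, `best_extension`, `extract_contigs`) are made up for this example. It assumes k >= 2.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def best_extension(counts, candidates):
    """Return the most abundant candidate k-mer still in the table, or None."""
    present = [c for c in candidates if c in counts]
    return max(present, key=lambda c: counts[c]) if present else None

def extract_contigs(reads, k):
    """Greedily pull contigs out of the k-mer table until it is empty."""
    counts = count_kmers(reads, k)
    contigs = []
    while counts:
        # Seed with the most abundant unused k-mer.
        seed = max(counts, key=counts.get)
        del counts[seed]
        contig = seed
        # Extend to the right by the best k-mer overlapping the last k-1 bases.
        while True:
            nxt = best_extension(counts, [contig[-(k - 1):] + b for b in "ACGT"])
            if nxt is None:
                break
            del counts[nxt]
            contig += nxt[-1]
        # Extend to the left in the same way.
        while True:
            prv = best_extension(counts, [b + contig[:k - 1] for b in "ACGT"])
            if prv is None:
                break
            del counts[prv]
            contig = prv[0] + contig
        contigs.append(contig)
    return contigs
```

Because each k-mer is consumed once, a contig is simply whatever path the greedy walk happened to follow, which is why it need not correspond to a real transcript.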

    The output of Inchworm is the set of contigs. The output of Chrysalis is the set of components, plus the reads which are called as belonging to those components. The output of Butterfly is the called transcripts for each component.

    When Butterfly rebuilds a graph for each component and does cleanup, it sometimes finds that the resulting graph is disconnected. This is usually because the initial contigs discovered by Inchworm were not "real" contigs. Each connected component in the rebuilt graph is called a "subcomponent".
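To make "connected component in the rebuilt graph" concrete, here is a minimal sketch of splitting a graph into its connected pieces with a breadth-first search. It is a generic graph routine for illustration only; the node labels and the `connected_components` function are assumptions of this example and say nothing about how Butterfly represents its graph internally.

```python
from collections import defaultdict, deque

def connected_components(nodes, edges):
    """Split an undirected graph into connected components (BFS)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Everything reachable from `start` forms one component.
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(members))
    return components
```

If a component's rebuilt graph had nodes a to e with edges a-b, b-c and d-e, this would return two pieces, which in the naming scheme would correspond to two subcomponents of the same component.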

    So the "comp" is the component, "c" is the subcomponent, and "seq" is the extracted sequence from the subcomponent.
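Given that naming scheme, transcript names can be split into their three parts and grouped by subcomponent. The sketch below assumes names of the form shown in the question (comp#_c#_seq#) and ignores anything after the name on a FASTA header line; `parse_name` and `group_by_subcomponent` are hypothetical helpers, not part of Trinity.

```python
import re
from collections import defaultdict

# Matches the leading "comp<N>_c<N>_seq<N>" part of a transcript name.
NAME_RE = re.compile(r"^(comp\d+)_(c\d+)_(seq\d+)")

def parse_name(name):
    """Return (component, subcomponent, sequence) for one transcript name."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected transcript name: {name!r}")
    return match.groups()

def group_by_subcomponent(names):
    """Group transcript names by their component + subcomponent prefix."""
    groups = defaultdict(list)
    for name in names:
        comp, sub, _seq = parse_name(name)
        groups[f"{comp}_{sub}"].append(name)
    return dict(groups)
```

For the names in the question, all three transcripts land in the single group comp2335_c1. Grouping at this level gives candidate isoform sets, not confirmed genes.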

    Trinity does not reason in terms of genes, loci and alternative splicing events. It solves a graph problem, though of course the heuristics are tuned to the needs of biology. So while it's highly likely that all the isoforms of a given gene belong to the same subcomponent, you shouldn't assume that a subcomponent is a gene.
    Last edited by Pseudonym; 04-15-2012, 05:05 PM.
    sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});



    • #3
      Nice explanation, Pseudonym...
