  • tophat .junc file

    Hi All

    I'm trying to use tophat with the --GFF argument so as to get RPKM data for some yeast experiments. My question is that the .junc file produced by tophat seems not to be consistent with the exon data supplied in the GFF file. For example, when the GFF specifies


    Scchr01 SGD gene 87287 87753 . + . ID=YAL030W
    Scchr01 SGD mRNA 87287 87753 . + . ID=YAL030WmRNA;Parent=YAL030W
    Scchr01 SGD exon 87287 87388 . + 0 ID=YAL030Wexon1;Parent=YAL030WmRNA
    Scchr01 SGD exon 87502 87753 . + 0 ID=YAL030Wexon2;Parent=YAL030WmRNA

    the .junc file specifies

    Scchr01 87387 87501 +

    The position 87387 appears incorrect if it is supposed to indicate the first base of the intron (as 87501 appears to indicate the last position of the intron) or even the last base of the exon. Am I misinterpreting this, or is there a problem here?

    Thanks for your help

  • #2
    Originally posted by Mark
    I have no idea. I'm trying to install tophat, but errors are occurring during installation. Maybe I will run into the same problem you have in the near future. I'm also hoping someone can fix it. ^ ^


    • #3
      Don't know if you got this sorted out, but from what I have seen in my runs with Tophat, it isn't SUPER accurate when it comes to positions. Output tends to vary a little. What I see from your post is that the junction specified in your .junc file is a junction between those two exons (lines 3 and 4). I'm not surprised that Tophat has it a click or two off. I have sequencing from several lanes, and when I compare the junction.bed files in UCSC's browser I can easily see that a junction found in one lane is the same as that found in another lane. However, if I look at the numbers in the junction.bed files, the start and end points of those junctions are not equal. They are sometimes up to 10 positions off from each other.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */


      • #4
        A splice junction identified in two different runs may look slightly different in the bed file. The reason for this is not due to alignment accuracy, it's actually a feature of the output format.

        Each bed record in junctions.bed contains two blocks, one on the left side of the intron and one on the right side. The length of these blocks is determined by looking at all the alignments that span the junction, and measuring how far the left and right "overhangs" extend for each read. That is, suppose a read spans a junction in such a way that the first 20 bp of the read fall on the left exon, and the last 55 bp fall on the right exon (for a 75 bp read). If there is only one alignment spanning this intron, then the bed record for it will have a first block of 20 bp and a second block of 55 bp, and the distance between them in the genomic coordinate space will be the length of the intron.

        If there are multiple alignments across the junction, then each block is as big as the biggest overhang from any read, on each side. Does this make sense?

        Thus, since the number of reads spanning a given junction will naturally vary from run to run, as will how they fall across it, the length of the blocks will vary. However, the actual intron coordinates reflected by a given bed record should be consistent from run to run, at least as long as there are any alignments at all spanning that intron.
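To make the block-length rule concrete, here is a minimal sketch of how the two blocks could be derived from read overhangs. It is an illustration of the description above, not TopHat's actual code: the function name `junction_blocks`, the intron coordinates (1000 to 1500), and the overhang values are all made up, and the intron is assumed to be a 0-based, half-open interval.

```python
def junction_blocks(intron_start, intron_end, overhangs):
    """Return (chrom_start, chrom_end, (left_block, right_block)).

    intron_start, intron_end: hypothetical 0-based, half-open intron interval.
    overhangs: one (left_bp, right_bp) pair per read spanning the junction.
    """
    # Each block is as long as the largest overhang seen on that side.
    left = max(l for l, _ in overhangs)
    right = max(r for _, r in overhangs)
    return intron_start - left, intron_end + right, (left, right)


# Run 1: a single 75 bp read, 20 bp on the left exon and 55 bp on the right.
run1 = junction_blocks(1000, 1500, [(20, 55)])

# Run 2: the same intron, but more reads with different overhangs.
run2 = junction_blocks(1000, 1500, [(20, 55), (40, 35), (10, 65)])

print(run1)  # (980, 1555, (20, 55))
print(run2)  # (960, 1565, (40, 65))
```

The record's outer start and end differ between the two runs, yet the gap between the blocks (the intron) is 1000 to 1500 in both.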

        It's straightforward to extract the actual intron coordinates from the bed records after a run, and in the upcoming version of TopHat (1.0.11), I provide a script to do so.
        Last edited by Cole Trapnell; 09-23-2009, 08:32 PM.
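Until that script is available, the intron coordinates can be recovered by hand. The following is a sketch only, not the script mentioned above; it assumes junctions.bed follows the standard 12-column BED layout (block sizes in column 11, block starts relative to the record start in column 12). The function name `intron_from_bed` and both sample records are invented for illustration.

```python
def intron_from_bed(line):
    """Return (chrom, intron_start, intron_end, strand) for one BED12 record.

    The intron is reported as a 0-based, half-open interval: it begins right
    after the first block and ends where the second block begins.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = fields[0], int(fields[1]), fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    intron_start = chrom_start + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return chrom, intron_start, intron_end, strand


# Two made-up records for the same junction, as two runs might report it.
rec_run1 = "\t".join(["chr1", "980", "1555", "JUNC00000001", "1", "+",
                      "980", "1555", "255,0,0", "2", "20,55", "0,520"])
rec_run2 = "\t".join(["chr1", "960", "1565", "JUNC00000001", "3", "+",
                      "960", "1565", "255,0,0", "2", "40,65", "0,540"])

print(intron_from_bed(rec_run1))  # ('chr1', 1000, 1500, '+')
print(intron_from_bed(rec_run2))  # ('chr1', 1000, 1500, '+')
```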


        • #5
          I should have posted a reply to Mark's earlier question as well. The .juncs file format is zero-based (as opposed to the 1-based GTF file), and the left coordinate marks the rightmost base of the *left* exon. The right coordinate in each line marks the leftmost base of the *right* exon. Think of it as "each line says concatenate the right base to the left base, leaving out everything in between".
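Applied to Mark's example, this convention reproduces the .junc line exactly. A small sketch of the arithmetic, assuming exon coordinates are 1-based and inclusive as in GFF (the helper name `juncs_from_exons` is made up):

```python
def juncs_from_exons(exons):
    """Convert sorted 1-based inclusive exon (start, end) pairs to
    zero-based .juncs coordinate pairs, one per adjacent exon pair."""
    juncs = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # Left value: rightmost base of the left exon, shifted to 0-based.
        # Right value: leftmost base of the right exon, shifted to 0-based.
        juncs.append((left_end - 1, right_start - 1))
    return juncs


# Exons of YAL030W from the GFF in the first post.
exons = [(87287, 87388), (87502, 87753)]
print(juncs_from_exons(exons))  # [(87387, 87501)]
```

So 87387 is the last base of exon 1 (87388 in GFF numbering) and 87501 is the first base of exon 2 (87502 in GFF numbering).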


          • #6
            Thanks Cole. Your responses are very helpful in understanding the outputs. I'm actually a programmer working for a lab and they have charged me with learning how to use Tophat and Bowtie. From what you wrote here it sounds like if I were to compare intron coordinates between two runs in the .bed files I should be able to filter out matching junctions and reveal junctions from one run that did not show up in another.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */
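One way such a comparison could be sketched is with plain set operations on the intron coordinates. This assumes standard 12-column BED records and that any header line starts with "track"; the function name `intron_keys` and the sample records are hypothetical.

```python
def intron_keys(bed_lines):
    """Collect (chrom, intron_start, intron_end, strand) from BED12 lines."""
    keys = set()
    for line in bed_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith(("track", "#")):
            continue
        f = line.rstrip("\n").split("\t")
        start = int(f[1])
        sizes = [int(x) for x in f[10].rstrip(",").split(",")]
        starts = [int(x) for x in f[11].rstrip(",").split(",")]
        # The intron lies between the end of block 1 and the start of block 2.
        keys.add((f[0], start + sizes[0], start + starts[1], f[5]))
    return keys


# Made-up records: run A has two junctions, run B has only the first one
# (with different block lengths, so the raw BED start/end values differ).
run_a = [
    "track name=junctions",
    "\t".join(["chr1", "980", "1555", "J1", "1", "+",
               "980", "1555", "255,0,0", "2", "20,55", "0,520"]),
    "\t".join(["chr1", "2970", "3425", "J2", "2", "-",
               "2970", "3425", "255,0,0", "2", "30,25", "0,430"]),
]
run_b = [
    "\t".join(["chr1", "960", "1565", "J1", "3", "+",
               "960", "1565", "255,0,0", "2", "40,65", "0,540"]),
]

shared = intron_keys(run_a) & intron_keys(run_b)
only_in_a = intron_keys(run_a) - intron_keys(run_b)
print(shared)     # {('chr1', 1000, 1500, '+')}
print(only_in_a)  # {('chr1', 3000, 3400, '-')}
```

With real output, the lists would be read from each run's junctions.bed file instead of being written inline.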
