  • Cuffcompare stats: High sensitivity and Low specificity....... what does it mean?

    Hi,

    I am using cuffcompare from the Cufflinks suite to compare the transcriptome assemblies from STAR/Cufflinks and TopHat/Cufflinks against the reference annotation. While the assemblies from the STAR and TopHat alignments seem quite comparable in numbers, the specificities reported for both assemblies seem alarmingly low.

    Is it OK to have such low specificity? How good are these assemblies?

    The cuffcmp.stats output is as follows:
    ##########################################################

    #= Summary for dataset: SRR594419_STAR_filtered_transcripts.gtf :
    # Query mRNAs : 103797 in 85255 loci (40428 multi-exon transcripts)
    # (10592 multi-transcript loci, ~1.2 transcripts per locus)
    # Reference mRNAs : 29129 in 26270 loci (23160 multi-exon)
    # Corresponding super-loci: 24738
    #--------------------|   Sn   |  Sp   |  fSn |  fSp
            Base level:      99.9    35.6     -       -
            Exon level:      99.3    66.6   100.0    68.5
          Intron level:      99.3    86.0   100.0    87.4
    Intron chain level:      95.3    54.6   100.0    63.6
      Transcript level:      90.0    25.3    89.9    25.2
           Locus level:      96.7    29.5    99.9    30.4

    Matching intron chains: 22068
    Matching loci: 25390

    Missed exons: 37/210468 ( 0.0%)
    Novel exons: 80952/313777 ( 25.8%)
    Missed introns: 1182/183787 ( 0.6%)
    Novel introns: 14048/212202 ( 6.6%)
    Missed loci: 0/26270 ( 0.0%)
    Novel loci: 46321/85255 ( 54.3%)

    #= Summary for dataset: SRR594419_tophat_transcripts.gtf :
    # Query mRNAs : 104746 in 87334 loci (38090 multi-exon transcripts)
    # (10015 multi-transcript loci, ~1.2 transcripts per locus)
    # Reference mRNAs : 29129 in 26270 loci (23160 multi-exon)
    # Corresponding super-loci: 25059
    #--------------------|   Sn   |  Sp   |  fSn |  fSp
            Base level:      99.9    36.1     -       -
            Exon level:      99.3    68.3   100.0    69.0
          Intron level:      99.3    88.9    99.7    89.3
    Intron chain level:      95.4    58.0   100.0    66.0
      Transcript level:      89.3    24.8    89.0    24.8
           Locus level:      96.7    28.9    99.8    29.7

    Matching intron chains: 22098
    Matching loci: 25414

    Missed exons: 72/210468 ( 0.0%)
    Novel exons: 78071/306064 ( 25.5%)
    Missed introns: 1197/183787 ( 0.7%)
    Novel introns: 10768/205343 ( 5.2%)
    Missed loci: 19/26270 ( 0.1%)
    Novel loci: 48238/87334 ( 55.2%)

    Total union super-loci across all input datasets: 92143
    (11373 multi-transcript, ~1.5 transcripts per locus)
    ################################################################

  • #2
    I have the same problem. Did you find an answer?

    • #3
      I am also curious about this. Running cuffcompare on my cuffmerge output results in these numbers:

      Code:
      #     Query mRNAs :  865356 in  787440 loci  (97791 multi-exon transcripts)
      #            (16955 multi-transcript loci, ~1.1 transcripts per locus)
      # Reference mRNAs :   95598 in   36914 loci  (82214 multi-exon)
      # Super-loci w/ reference transcripts:    33985
      #--------------------|   Sn   |  Sp   |  fSn |  fSp
              Base level:      99.6     8.5     -       -
              Exon level:     110.6    35.2   100.0    36.2
            Intron level:      99.2    96.9   100.0    98.8
      Intron chain level:      80.3    67.5   100.0   100.0
        Transcript level:      74.7     8.3    70.2     7.8
             Locus level:      99.1     4.6    99.6     4.6
      
           Matching intron chains:   66045
                    Matching loci:   36587
      
                Missed exons:    1293/351192  (  0.4%)
                 Novel exons:  755700/1102464 ( 68.5%)
              Missed introns:    1755/243253  (  0.7%)
               Novel introns:    1588/249173  (  0.6%)
                 Missed loci:     157/36914   (  0.4%)
                  Novel loci:  747780/787440  ( 95.0%)
      The reference used was the Ensembl mouse annotation from iGenomes. The options used for cuffcompare were the following:

      Code:
      ~/cufflinks-2.2.0.Linux_x86_64/cuffcompare -s ~/igenomes/Mus_musculus/Ensembl/NCBIM37/Sequence/Bowtie2Index/genome.fa -r ~/igenomes/Mus_musculus/Ensembl/NCBIM37/Annotation/Genes/genes.gtf -p Ensembl ~/cuffmerge/merged.gtf
      In addition, I got the following class codes:

      Code:
      grep -v "gene_name" Ensembl.combined.gtf | awk '{print $18}' | sort | uniq -c
      
       739578 "u";
      grep "gene_name" Ensembl.combined.gtf | awk '{print $22}' | sort | uniq -c
      
       555263 "=";
       380367 "j";
         2684 "o";
        12920 "x";
      739578 novel transfrags seems a bit much to me.
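
One caveat on the tallies above: picking the class code by field position (`$18`, `$22`) depends on how many attributes each line carries, and counting lines counts GTF records rather than transcripts. Below is a minimal sketch of a tally keyed on the attribute names instead, counting each `transcript_id` once. It assumes the combined GTF carries `transcript_id` and `class_code` attributes in the ninth, tab-separated column and holds one record per exon; the function name `count_class_codes` is made up for illustration.

```shell
# count_class_codes FILE
# Tally cuffcompare class codes per transcript rather than per GTF line.
# Assumption: column 9 holds attributes such as
#   transcript_id "TCONS_00000001"; class_code "u";
count_class_codes() {
    awk -F'\t' '
    {
        tid = ""; code = ""
        # Pull out the quoted values by attribute name, not by position.
        if (match($9, /transcript_id "[^"]*"/))
            tid = substr($9, RSTART + 15, RLENGTH - 16)
        if (match($9, /class_code "[^"]*"/))
            code = substr($9, RSTART + 12, RLENGTH - 13)
        # Count each transcript only the first time it is seen.
        if (tid != "" && code != "" && !(tid in seen)) {
            seen[tid] = 1
            n[code]++
        }
    }
    END { for (c in n) print n[c], c }' "$1" | sort -k1,1nr
}
```

Calling `count_class_codes Ensembl.combined.gtf` would print one `count code` pair per line, largest first. If multi-exon transcripts are common, these per-transcript counts will be lower than the per-line counts.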

      • #4
        Cuffcompare: Low specificity of transcript assembly

        Hi,
        The Cuffcompare section of the manual at http://cufflinks.cbcb.umd.edu/manual.html states the following:

        " Cuffcompare produces the following output files:
        1) <outprefix>.stats

        Cuffcompare reports various statistics related to the "accuracy" of the transcripts in each sample when compared to the reference annotation data. The typical gene finding measures of "sensitivity" and "specificity" (as defined in Burset, M., Guigó, R. : Evaluation of gene structure prediction programs (1996) Genomics, 34 (3), pp. 353-367. doi: 10.1006/geno.1996.0298) are calculated at various levels (nucleotide, exon, intron, transcript, gene) for each input file and reported in this file."

        As highlighted in Figure 1 of the 1996 reference (attached), it appears that cuffcompare treats assembled exons that match the annotation GTF as true positives and any novel transcript or exon as a false positive when calculating sensitivity and specificity. This explains the low specificity measures for a whole-transcriptome assembly, which may contain a large number of novel transcripts.
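
As a rough check of this reading, the intron chain level figures in the reports above can be reproduced from the printed counts. Burset and Guigó define sensitivity as TP/(TP+FN) and specificity as TP/(TP+FP), so "specificity" here is what is elsewhere called precision. The sketch below assumes that "Matching intron chains" is the true-positive count, that the reference multi-exon count is TP+FN, and that the query multi-exon count is TP+FP. The helper name `sn_sp` is made up for illustration.

```shell
# sn_sp MATCHING N_REFERENCE N_QUERY
# Prints "Sn Sp" as percentages, using Sn = TP/(TP+FN) and Sp = TP/(TP+FP).
# Assumption: N_REFERENCE = TP+FN and N_QUERY = TP+FP.
sn_sp() {
    awk -v m="$1" -v r="$2" -v q="$3" \
        'BEGIN { printf "%.1f %.1f\n", 100 * m / r, 100 * m / q }'
}

sn_sp 22068 23160 40428   # STAR assembly:   prints "95.3 54.6"
sn_sp 22098 23160 38090   # TopHat assembly: prints "95.4 58.0"
sn_sp 66045 82214 97791   # post #3:         prints "80.3 67.5"
```

These match the reported intron chain Sn and Sp in all three reports, so every assembled intron chain that is absent from the annotation lowers Sp, whether it is an artifact or a genuinely novel isoform.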

        It seems that we can ignore the specificity measure for assemblies from whole RNA samples. However, FPKM filters might be an effective way to increase specificity.
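
A minimal sketch of such an FPKM filter, assuming the assembly GTF carries an `FPKM "<value>"` attribute on each record in the ninth, tab-separated column (as Cufflinks `transcripts.gtf` files do). The function name `filter_by_fpkm` and the cutoff of 1 in the usage note are illustrative assumptions, not recommended values.

```shell
# filter_by_fpkm FILE MIN_FPKM
# Keep only GTF records whose FPKM attribute is at least MIN_FPKM.
# Assumption: every record of interest has an attribute like  FPKM "12.34";
# records without one are dropped.
filter_by_fpkm() {
    awk -F'\t' -v min="$2" '
    {
        if (match($9, /FPKM "[^"]*"/)) {
            v = substr($9, RSTART + 6, RLENGTH - 7)
            # Force a numeric comparison on both sides.
            if (v + 0 >= min + 0) print
        }
    }' "$1"
}
```

For example, `filter_by_fpkm transcripts.gtf 1 > transcripts.fpkm1.gtf` would write the surviving records to a new file, which can then be passed to cuffcompare to see how Sp changes. Any cutoff also risks discarding real low-abundance transcripts, so Sn should be checked alongside Sp.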