  • cuffcompare not matching reference genome

    I was just using cufflinks and cuffcompare for the first time. When I ran cuffcompare with the -r option and the Ensembl human genome gtf file, I got zero matches between my sequences and Ensembl. I did notice that my data used a prefix of 'chr' for the chromosome names, so I edited the ensembl gtf file to match that, but I still got no matches.

    I also noticed some discrepancies between the cufflinks output and what it is supposed to be; perhaps these are the cause of the lack of reference matches. Here are the discrepancies that I noticed:

    genes.expr:
    The header has 8 columns, but the output only has 6. The last column is a real number, and I'm guessing it's the RPKM value, but the last three column headers are bundle_fraction, density, and RPKM, so it could be any of those three.

    transcripts.expr:
    The header has 13 columns, but the output has 14. The last column contains an integer value. The other columns look like they contain the correct data type based on the column name, so I'm ignoring that 14th column.

    For the cuffcompare output, the transcripts.refmap files are empty except for the column header due to the lack of matches. The transcripts.tmap file has 10 columns in the header, but 12 columns in the data. In this case, I can tell that the missing column header values are conf_low and conf_hi which should be in between RPKM and cov.

    The version of the software I'm using is:
    cufflinks-0.7.0.OSX_x86_64

    Thanks for your help.

    Gene

  • #2
    Sorry about the header errors. transcripts.expr should be

    trans_id\tbundle_id\tchr\tleft\tright\ttotal_score\tavg_read_score\tRPKM\tFMI\tfrac\tconf_low\tconf_hi\tcoverage\tlength\n


    and genes.expr should be

    gene_id\tbundle_id\tchr\tleft\tright\tRPKM\n

    If you'd like to fix them yourself locally, you can edit the file assemble.cpp, in the function assemble_hits.cpp, around line 2900 or so. You can simply replace the existing format strings with the ones above.
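If rebuilding from source is not convenient, the corrected column names can also be applied at read time. The following is a minimal sketch, assuming the .expr files are tab-separated with a single header line; the function name `read_expr` and the demo row are hypothetical and only serve as illustration.

```python
import csv
import io

# Corrected column names, copied from the header strings quoted above.
TRANSCRIPTS_EXPR_COLUMNS = [
    "trans_id", "bundle_id", "chr", "left", "right", "total_score",
    "avg_read_score", "RPKM", "FMI", "frac", "conf_low", "conf_hi",
    "coverage", "length",
]
GENES_EXPR_COLUMNS = ["gene_id", "bundle_id", "chr", "left", "right", "RPKM"]


def read_expr(handle, columns):
    """Yield one dict per data row, ignoring the file's own (faulty) header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # discard the header line written by the tool
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(row)}: {row}"
            )
        yield dict(zip(columns, row))


# Hypothetical genes.expr content; the values are invented for illustration.
demo = io.StringIO(
    "gene_id\tbundle_id\tchr\tleft\tright\tbundle_fraction\tdensity\tRPKM\n"
    "CUFF.1\t1\tchr1\t12706\t12765\t0.74\n"
)
rows = list(read_expr(demo, GENES_EXPR_COLUMNS))
print(rows[0]["RPKM"])
```

For a real file, pass an open file handle instead of the `io.StringIO` demo, with `TRANSCRIPTS_EXPR_COLUMNS` for transcripts.expr.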

    As for why you didn't get any matches - this is likely to be a disagreement between your SAM and GTF inputs - can you please double check that the chromosome names match between the two files, and that you don't have stray whitespace, etc. between the tabs separating the fields? If you are certain the names are the same, please email me ([email protected]) a small snippet of the SAM and GTF (say, a gene's worth), and I'll take a look. However, please be patient with me, as I am out of town this week working on something else, so it might be a few days/early next week before I get back to you.
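The name check described above can be automated. This is only a sketch, assuming both inputs are GTF files with nine tab-separated fields per record (for example the reference GTF and the transcripts.gtf that Cufflinks derives from the SAM); the function names are hypothetical.

```python
def gtf_seqnames(lines):
    """Collect chromosome names (column 1) and flag suspicious GTF lines."""
    names, problems = set(), []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != 9:
            # Often a sign that spaces were used where tabs are expected.
            problems.append(
                (number, f"{len(fields)} tab-separated fields, expected 9")
            )
            continue
        if fields[0] != fields[0].strip():
            problems.append((number, "whitespace around chromosome name"))
        names.add(fields[0].strip())
    return names, problems


def compare_seqnames(reference_lines, query_lines):
    """Report which chromosome names are shared between two GTF inputs."""
    ref_names, ref_problems = gtf_seqnames(reference_lines)
    qry_names, qry_problems = gtf_seqnames(query_lines)
    return {
        "shared": ref_names & qry_names,
        "only_in_reference": ref_names - qry_names,
        "only_in_query": qry_names - ref_names,
        "problems": ref_problems + qry_problems,
    }
```

Calling `compare_seqnames(open("reference.gtf"), open("transcripts.gtf"))` (file names are placeholders) and finding an empty `shared` set would point to a naming mismatch rather than a parsing bug.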

    • #3
      Thanks for the quick response.

      The .sam files were produced by tophat, so they should be in the correct format. The only change I made was to split up my input fastq files before the tophat alignment, because otherwise tophat ran out of memory. After the tophat alignment was done, I merged and re-sorted the tophat accepted_hits.sam files.

      You ask about the .sam files, but those are not specified as input to cuffcompare, only to cufflinks. Is cuffcompare expecting to find .sam files somewhere? Maybe that is the problem. If not, I'll go ahead and send you some sample data.
      Last edited by genec; 10-20-2009, 02:02 PM. Reason: grammar

      • #4
        Originally posted by genec View Post
        Thanks for the quick response.

        You ask about the .sam files, but those are not specified as input to cuffcompare, only to cufflinks. Is cuffcompare expecting to find .sam files somewhere? Maybe that is the problem. If not, I'll go ahead and send you some sample data.
        Ah sorry for being unclear - The Cufflinks GTFs just carry over the chromosome names present in the SAM, which is why I asked about them. I just wanted input data that I could use to fully reproduce the problem, including assembling with Cufflinks myself, in case there's some problem with Cufflinks' handling of the chromosome names/parsing.

        • #5
          same problem

          Hi,

          Was this problem resolved? I have a similar issue: I had 2 input files and the current Ensembl reference (Homo_sapiens.GRCh37.56.gtf), but no references mapped and I only got 1 output file, stdout.tracking. I also noticed that the conf_hi and conf_lo values are empty in the cufflinks output; however, I cannot see anything wrong with the tophat run. I've listed a few of the outputs below:

          #= Summary for dataset: /home/karenp/cufflinks_Melanoma_1205/transcripts.gtf :
          # Total mRNAs : 28662 in 26717 loci (28112 multi-exon)
          # Reference mRNAs : 137021 in 46375 loci (116441 multi-exon)
          # Corresponding super-loci: 0
          #--------------------| Sn | Sp | fSn | fSp
          Base level: 0.0 0.0 - -
          Exon level: 0.0 0.0 0.0 0.0
          Intron level: 0.0 0.0 0.0 0.0
          Intron chain level: 0.0 0.0 0.0 0.0
          Transcript level: 0.0 0.0 0.0 0.0
          Locus level: 0.0 0.0 0.0 0.0
          Missed exons: 436629/436629 (100.0%)
          Wrong exons: 84089/84089 (100.0%)
          Missed introns: 297912/297912 (100.0%)
          Wrong introns: 55433/55433 (100.0%)
          Missed loci: 0/46375 ( 0.0%)
          Wrong loci: 0/26717 ( 0.0%)



          head stdout.tracking
          - - q1:CUFF.76|CUFF.76.0|100|1.166125|0.000000|0.000000|uniq -
          - - q1:CUFF.121|CUFF.121.0|100|1.106966|0.000000|0.000000|uniq -
          - - q1:CUFF.127|CUFF.127.0|100|25.672007|0.000000|0.000000|uniq -
          - - q1:CUFF.193|CUFF.193.0|100|6.199007|0.000000|0.000000 q2:CUFF.157|CUFF.157.0|100|6.610995|0.000000|0.000000
          - - q1:CUFF.202|CUFF.202.0|100|2.866208|0.000000|0.000000 q2:CUFF.163|CUFF.163.0|100|3.347445|0.000000|0.000000


          head transcripts.gtf

          chr1 Cufflinks transcript 12706 12765 1000 . . gene_id "CUFF.1"; transcript_id "CUFF.1.0"; RPKM "0.7404369560"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.333333";
          chr1 Cufflinks exon 12706 12765 1000 . . gene_id "CUFF.1"; transcript_id "CUFF.1.0"; exon_number "1"; RPKM "0.7404369560"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.333333";
          chr1 Cufflinks transcript 16858 17740 1000 . . gene_id "CUFF.4"; transcript_id "CUFF.4.0"; RPKM "1.7083154380"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "3.076229";

          Thanks,

          Karen

          • #6
            This issue has been fixed in our repository, but we're still working on some other features, which is why we haven't issued a release. We hope to have v0.7.1 out within a week or two.

            • #7
              I got the same result .. waiting for an update.

              #= Summary for dataset: transcripts.gtf :
              # Total mRNAs : 22914 in 20436 loci (22143 multi-exon)
              # Reference mRNAs : 61726 in 36120 loci (46229 multi-exon)
              # Corresponding super-loci: 0
              #--------------------| Sn | Sp | fSn | fSp
              Base level: 0.0 0.0 - -
              Exon level: 0.0 0.0 0.0 0.0
              Intron level: 0.0 0.0 0.0 0.0
              Intron chain level: 0.0 0.0 0.0 0.0
              Transcript level: 0.0 0.0 0.0 0.0
              Locus level: 0.0 0.0 0.0 0.0
              Missed exons: 271295/271295 (100.0%)
              Wrong exons: 73373/73373 (100.0%)
              Missed introns: 221765/221765 (100.0%)
              Wrong introns: 51268/51268 (100.0%)
              Missed loci: 0/36120 ( 0.0%)
              Wrong loci: 0/20436 ( 0.0%)
              --
              bioinfosm

              • #8
                Same issue - the update will be most helpful.

                • #9
                  Tweaks

                  Originally posted by Cole Trapnell View Post
                  This issue has been fixed in our repository, but we're still working on some other features, which is why we haven't issued a release. We hope to have v0.7.1 out within a week or two.
                  I've also run into this same result. Cole, is this just a matter of input file formatting? Are there simple tweaks we can do to overcome the problem before your next update is available?

                  Many thanks, Roye

                  • #10
                    Hello again,

                    I've found that removing all the "NT_#####" (supercontig) lines from the reference GTF file and adding the prefix "chr" to all the other (chromosomal) lines gets rid of the all-zeros results. Cole, I'm not sure if this is what you intend to do in your patch, or if this just serves as a stop-gap measure. To others also stuck on this, it's at least worth a try.
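The workaround described here could be scripted roughly as follows. This is a sketch under the assumptions that the reference GTF is tab-separated, that supercontig records start with "NT_", and that comment lines start with "#"; the function name `add_chr_prefix` is hypothetical.

```python
def add_chr_prefix(lines, prefix="chr", skip_prefix="NT_"):
    """Drop supercontig records and add a prefix to the remaining names."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line  # pass comments and blank lines through unchanged
            continue
        seqname, sep, rest = line.partition("\t")
        if seqname.startswith(skip_prefix):
            continue  # remove supercontig (NT_#####) records
        if not seqname.startswith(prefix):
            seqname = prefix + seqname
        yield seqname + sep + rest
```

To convert a file, write the generator's output to a new GTF, for example `out.writelines(add_chr_prefix(src))` with `src` and `out` being open file handles. One caveat: Ensembl names the mitochondrial sequence "MT" while UCSC-style references use "chrM", so that record would become "chrMT" and may need separate handling.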

                    r

                    • #11
                      same issue with v0.8.1

                      I'm new to cufflinks and am running into this same issue. From this thread it looks like this issue was resolved in a previous version of the software, but I'm using version 0.8.1.
                      I'm using a GTF file from Ensembl and have converted the chromosome and first base fields to match the cufflinks format.

                      Any thoughts?

                      Thanks!

                      • #12
                        Hi,

                        I am not sure if I am having the same problem. Any help will be greatly appreciated.

                        I ran cuffcompare in the following manner and this was my output.

                        HTML Code:
                        /cuffcompare -r Homo_sapiens.GRCh37.58.gtf transcripts.gtf
                        Can anyone tell me what I am doing wrong? Is the GTF file in the wrong format? (I downloaded it from the cufflinks website.) I should also state that during the alignment process, I used the hg18 reference genome.

                        HTML Code:
                        ref_gene_id     ref_id  class_code      cuff_gene_id    cuff_id FMI     FPKM    FPKM_conf_lo    FPKM_conf_hi    cov     len     major_iso_id
                        -               -               u       CUFF.1  CUFF.1.1        100     0.230129        0.000000        0.544984        0.564535        141     CUFF.1.1
                        -               -               u       CUFF.3  CUFF.3.1        100     1.007442        0.013988        2.000896        2.601054        62      CUFF.3.1
                        -               -               u       CUFF.5  CUFF.5.1        100     0.313054        0.000000        0.649479        1.030819        168     CUFF.5.1
                        -               -               u       CUFF.7  CUFF.7.1        100     0.217497        0.000000        0.550683        0.482221        119     CUFF.7.1
                        -               -               u       CUFF.9  CUFF.9.1        100     0.130129        0.000000        0.351696        0.217099        161     CUFF.9.1
                        Last edited by zorph; 06-08-2010, 09:59 AM.
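Every row in the listing above carries class code "u", which cuffcompare assigns to transcripts it treats as unknown or intergenic, that is, with no overlapping reference transcript. A quick tally over the whole .tmap file shows whether that holds for all transcripts. The sketch below assumes a tab-separated file whose header includes a `class_code` column, as in the listing; the function name and the shortened demo input are hypothetical. Note also that hg18 and GRCh37 are different assemblies with different coordinates, so aligning to one and comparing against an annotation of the other could by itself produce few or no matches.

```python
import csv
import io
from collections import Counter


def class_code_counts(handle):
    """Count how often each cuffcompare class code occurs in a .tmap file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["class_code"] for row in reader)


# Shortened demo input: only the first five columns of the listing above.
demo = io.StringIO(
    "ref_gene_id\tref_id\tclass_code\tcuff_gene_id\tcuff_id\n"
    "-\t-\tu\tCUFF.1\tCUFF.1.1\n"
    "-\t-\tu\tCUFF.3\tCUFF.3.1\n"
)
counts = class_code_counts(demo)
print(counts.most_common())
```

If the tally contains nothing but "u", the comparison found no reference overlap at all, which fits a chromosome naming or assembly mismatch better than a problem with individual transcripts.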
