Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cuffcompare output

    (Many new threads from me here, but I'm still new to RNA-seq and all the analysis, so I'm hoping that's alright!)

    My current goal is trying to get FPKM values for a test data set I have, and if I have understood the instructions and papers Cufflinks (after Tophat) is the way to do that. I've taken the "accepted_hits.bam" from Tophat and run that on Cufflinks (with no special options), and then taken the "transcripts.gtf" and run it through Cuffcompare with a reference annotation (hg19: genome.gtf, taken from iGenomes).

    This is just a single biological replicate of paired-end reads (i.e. two files, *_1.fq and *_2.fq), and I just want the FPKM values for the genes in that replicate. So, first of all, is the workflow I described above correct for this? Next, looking at the cuffcmp.stats:

    Code:
    # Cuffcompare v2.2.1 | Command line was:
    #cuffcompare -p 2 -r /Users/erikfasterius/rna_seq/bowtie_indexes//hg19.gtf transcripts.gtf
    #
    
    #= Summary for dataset: transcripts.gtf :
    #     Query mRNAs :    1099 in    1070 loci  (708 multi-exon transcripts)
    #            (27 multi-transcript loci, ~1.0 transcripts per locus)
    # Reference mRNAs :   49570 in   26031 loci  (45413 multi-exon)
    # Super-loci w/ reference transcripts:      857
    #--------------------|   Sn   |  Sp   |  fSn |  fSp  
            Base level: 	  1.6	 93.0	  - 	  - 
            Exon level: 	  1.3	 67.4	  1.3	 71.3
          Intron level: 	  1.6	 99.0	  1.6	 99.5
    Intron chain level: 	  0.3	 19.1	  0.3	 20.5
      Transcript level: 	  0.0	  0.0	  0.0	  0.0
           Locus level: 	  0.5	 12.3	  0.5	 12.5
    
         Matching intron chains:     135
                  Matching loci:     132
    
              Missed exons:  243810/248560	( 98.1%)
               Novel exons:      97/4701	(  2.1%)
            Missed introns:  224610/228634	( 98.2%)
             Novel introns:       0/3623	(  0.0%)
               Missed loci:   25103/26031	( 96.4%)
                Novel loci:      87/1070	(  8.1%)
    
     Total union super-loci across all input datasets: 945
    ... it seems that the analysis was somehow unsuccessful? I'm not really sure how to interpret this data. I can also look into cuffcmp.transcripts.gtf.tmap:

    Code:
    head cuffcmp.transcripts.gtf.tmap
    ref_gene_id	ref_id	class_code	cuff_gene_id	cuff_id	FMI	FPKM	FPKM_conf_lo	FPKM_conf_hi	cov	len	major_iso_id	ref_match_len
    GNB1	NM_002074	c	CUFF.1	CUFF.1.1	100	228.270145	132.599651	323.940640	4.059528	1359	CUFF.1.1	3194
    RPL22	NM_000983	c	CUFF.2	CUFF.2.1	100	231.819459	103.317386	360.321532	3.710384	876	CUFF.2.1	2085
    PARK7	NM_001123377	=	CUFF.3	CUFF.3.1	100	496.677254	301.998070	691.356439	8.342134	832	CUFF.3.1	903
    ENO1	NM_001428	c	CUFF.4	CUFF.4.1	100	844.654267	645.987522	1043.321012	14.260660	1250	CUFF.4.1	2187
    TARDBP	NM_007375	c	CUFF.5	CUFF.5.1	100	169.652613	109.552539	229.752686	2.928344	2357	CUFF.5.1	4217
    UQCRHL	NM_001089591	c	CUFF.6	CUFF.6.1	100	1198.040186	632.905455	1763.174917	13.612039	385	CUFF.6.1	538
    -	-	u	CUFF.7	CUFF.7.1	100	325.223511	212.645155	437.801867	5.269625	1402	CUFF.7.1	-
    SDHB	NM_003000	c	CUFF.8	CUFF.8.1	100	313.861645	139.762529	487.960760	5.058395	685	CUFF.8.1	1145
    CDC42	NM_001039802	c	CUFF.9	CUFF.9.1	100	230.544754	121.904615	339.184893	3.416650	1126	CUFF.9.1	2294
    ... and that seems (to me) to give me the FKPM values I'm looking for. Is this so? If so, what does the cuffcmp.stats file mean? If not, how is it related to cuffcmp.stats, and how can I fix it? I found this thread (http://seqanswers.com/forums/showthread.php?t=3972) that seems related, but I don't really understand it. It seems to say I should change some things in the reference genomes, but as far as I can tell the one I'm using is formatted as described in the thread...

    Thanks in advance!
    Last edited by ErikFas; 06-24-2014, 01:21 AM.

  • #2
    cuffcompare is not meant to calculate transcript abundances, it is used to compare your assembled transcripts against the reference.
    cuffcompare produces a new gtf, which you can use to run cuffquant and cuffnorm, this way you will get normalized FPKM values for your samples.

    Comment


    • #3
      Oh, so the workflow to obtain FPKM would be Tophat -> Cufflinks -> Cuffcompare -> Cuffquant -> Cuffnorm?

      I tried running the above for my sample, but Cuffnorm says that it requires at least two SAM files... How does this work, since I only have the one replicate? I'm using the following command:

      Code:
      cuffnorm -p 2 -o cuffnorm_out $BOWTIE2_INDEXES/hg19.gtf cuffquant_out/abundances.cxb
      ... and this is the error:
      Code:
      Error: cuffdiff requires at least 2 SAM files

      Comment


      • #4
        ErikFas: See if this paper helps clarify things. http://www.nature.com/nprot/journal/....2012.016.html

        It is a protocols paper going over the steps of a common RNAseq analysis using the tuxedo suite of programs.

        Comment


        • #5
          Thanks, that's the paper I'm currently using to try to understand the workflow. Problem is, that they always have more than one replicate/condition in that paper, and I don't understand how to do it in my case, with the one replicate for the one condition. Is this not something that you're supposed to be able to do, finding the RNA expression levels in just one replicate?

          Comment


          • #6
            You could but be aware that that analysis is not going to produce statistically meaningful results. There are many threads that discuss this on SeqAnswers. Here is a sample:

            Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

            Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)

            Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)

            Comment


            • #7
              No, I understand that. Maybe I'm just confused with what I am expecting. See, the analysis I'm currently doing is to learn how to do future analsyses. I already have partial results for my current project (we're waiting on another RNA-seq experiment to be run), which consists of finding RNA expression levels in three different colorectal cancer cell lines.

              I have some preliminary results that were given to me by a bioinformatician. These consists of an Excel-file containing FPKM values for two of the three cell lines. Each cell line was in triplicate, so three columns with FPKM values for each cell lines, plus one column where he had taken the mean of the three other columns, for the statistically "real" FPKM for each gene.

              I was under the assumption that this is how one would arrive at the FPKM for a given gene: calculate the FPKM for each replicate (on its own) and then combine them. Hence, I thought that this would be possible to do with the Tuxedo package, and I could practice with just one replicate. But this is not so?

              Comment


              • #8
                Sure you can. The FPKM values will be in genes.fpkm_tracking (and isoforms.fpkm_tracking) after you run Cufflinks. If you are working on hg19 (which you seem to be) it would probably be easier to run cufflinks with -G hg19.gtf, but you can also run without it to perform an assembly and then run cuffcompare to compare the assembly against hg19.gtf. In both cases you should get the fpkm_tracking files, but in the latter case you also need to check which assembled transcript (CUFF.123 and so on) corresponds to which annotated transcript/gene. Cuffnorm and CuffDiff normalize across replicates which is by definition not doable without replicates.

                Comment


                • #9
                  For future reference, you'll get more reliable results if you use cuffmerge and then cuffdiff, since the resulting FPKMs can be made from more robustly normalized values (the more recent versions of cuffdiff use the DESeq library normalization method, I believe). This produces much more meaningful results than using FPKM values from cufflinks, due to all of the inherent problems with FPKM.

                  Comment


                  • #10
                    Ok, I think the picture is getting clearer! If I'm only interested in FPKM values, I can just grab the info from Cufflinks in genes.fpkm_tracking (possible with a single replicate), but I can also get the same FPKM values from Cuffdiff (needs at least two replicates).

                    I tried both of the methods, and I get slightly different FPKM values. Some of them are quite close, whereas others vary quite a bit (like +/- 25%, ish). If I understand you correctly, dpryan, the values generated with Cuffdiff (as describe in the protocol paper, i.e. including Cuffmerge before) should be the ones that are "more correct"?

                    In a related problem, while I was checking the gene_exp.diff output from Cuffdiff, I don't get any FPKM columns. Reading the documentation, there's supposed to be two FPKM columns, one for each (in this case two) replicate/condition. I did get some errors from Cuffmerge, though, but I pushed on with the intention of seeing if it worked anyway, which it kind of it. I did get the genes.fpkm_tracking, but not the FPKM columns in gene_exp.diff. Are these related?

                    Cuffmerge error:

                    Code:
                    Warning: couldn't find fasta record for 'chr17_ctg5_hap1'!
                    Warning: couldn't find fasta record for 'chr17_gl000205_random'!
                    Warning: couldn't find fasta record for 'chr19_gl000209_random'!
                    Warning: couldn't find fasta record for 'chr1_gl000191_random'!
                    Warning: couldn't find fasta record for 'chr1_gl000192_random'!
                    Warning: couldn't find fasta record for 'chr4_ctg9_hap1'!
                    Warning: couldn't find fasta record for 'chr4_gl000193_random'!
                    Warning: couldn't find fasta record for 'chr4_gl000194_random'!
                    Warning: couldn't find fasta record for 'chr6_apd_hap1'!
                    Warning: couldn't find fasta record for 'chr6_cox_hap2'!
                    Warning: couldn't find fasta record for 'chr6_dbb_hap3'!
                    Warning: couldn't find fasta record for 'chr6_mann_hap4'!
                    Warning: couldn't find fasta record for 'chr6_mcf_hap5'!
                    Warning: couldn't find fasta record for 'chr6_qbl_hap6'!
                    Warning: couldn't find fasta record for 'chr6_ssto_hap7'!
                    Warning: couldn't find fasta record for 'chr7_gl000195_random'!
                    Warning: couldn't find fasta record for 'chrUn_gl000211'!
                    Warning: couldn't find fasta record for 'chrUn_gl000212'!
                    Warning: couldn't find fasta record for 'chrUn_gl000213'!
                    Warning: couldn't find fasta record for 'chrUn_gl000215'!
                    Warning: couldn't find fasta record for 'chrUn_gl000218'!
                    Warning: couldn't find fasta record for 'chrUn_gl000219'!
                    Warning: couldn't find fasta record for 'chrUn_gl000220'!
                    Warning: couldn't find fasta record for 'chrUn_gl000222'!
                    Warning: couldn't find fasta record for 'chrUn_gl000223'!
                    Warning: couldn't find fasta record for 'chrUn_gl000227'!
                    Warning: couldn't find fasta record for 'chrUn_gl000228'!

                    Comment


                    • #11
                      cuffmerge "errors" are actually warnings. Some of the ID's are referring to alternative haplotypes for MHC region that are not represented in the reference sequence (http://vega.sanger.ac.uk/info/data/M...o_sapiens.html). chrUn_* are areas that can't be reliably assigned to a chromosome though they are part of human genome (http://genome.ucsc.edu/FAQ/FAQdownloads.html#download11).

                      Comment


                      • #12
                        Thanks a lot for the help, everybody! Now on to try to learn DESeq...

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 03-27-2024, 06:37 PM
                        0 responses
                        12 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-27-2024, 06:07 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        52 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        68 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X