SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
cufflinks-1.0.3 produces very high FPKM values when compared to cufflinks-0.9.3. Why? pinki999 Bioinformatics 5 06-09-2012 06:48 AM
Cufflinks cufflinks v1.0.3 - segmentation fault bias correction chrNT annotations adrian Bioinformatics 0 06-08-2011 01:28 PM

Reply
 
Thread Tools
Old 06-08-2012, 08:10 AM   #1
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default Is cufflinks fundamentally flawed?

I have been running tophat/cufflinks/cuffdiff on fungal and human RNAseq data. Some of the FPKM values seemed high so I decided to look at the alignment files (accepted_hits.bam) to count numbers of reads hitting selected genes. I am unable to come up with anything near the values produced in the cuffdiff output. For example one cufflinks locus (XLOC) had a reported FPKM of 421 yet there were zero reads mapping in the corresponding genomic region.

A related issue is that some of the reported loci span genomic regions well beyond the borders of transcripts defined in the supplied .gtf file. However, inspection of the alignment file reveals no reads that support the extension of the transcript.

Am I missing something here? Specifically, does cufflinks use information other than that contained in the accepted_hits.bam file to calculate FPKM and define its (XLOC) loci?
drdna is offline   Reply With Quote
Old 06-08-2012, 06:42 PM   #2
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default

I'm currently following up on this by generating a control dataset containing known transcript abundances. Stay tuned...
drdna is offline   Reply With Quote
Old 06-09-2012, 11:33 AM   #3
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default Cufflinks IS flawed

So, using artificially-generated control datasets, I find that cufflinks is flawed in two ways:

First, it's FPKM values are inflated. Problem is the magnitude of inflation varies from gene-to-gene - there is no consistency in the error.

Second the "locus" interval defined in the cuffdiff output is often just plain wrong. In many instances, the reported "locus" frequently spans multiple transcripts and intergenic regions, even though the dataset contains reads from only one transcript. In other words, neither the .gtf file, nor the input sequence data support expansion of the "locus" to cover multiple genes.
drdna is offline   Reply With Quote
Old 06-09-2012, 08:04 PM   #4
Portah
Member
 
Location: Cincinnati

Join Date: Jan 2012
Posts: 12
Default

Hi, I've met almost the same problem in addition in gtf file for mm9 from UCSC annotation I have:

chr10 unknown exon 80640798 80640979 . + . gene_id "Eef2"; gene_name "Eef2"; p_id "P7224"; transcript_id "NM_007907"; tss_id "TSS5168";
chr10 unknown CDS 80641426 80641637 . + 2 gene_id "Eef2"; gene_name "Eef2"; p_id "P7224"; transcript_id "NM_007907"; tss_id "TSS5168";
chr10 unknown exon 80641426 80641637 . + . gene_id "Eef2"; gene_name "Eef2"; p_id "P7224"; transcript_id "NM_007907"; tss_id "TSS5168";
chr10 unknown exon 80641706 80641758 . + . gene_id "Snord37"; gene_name "Snord37"; transcript_id "NR_028549"; tss_id "TSS16143";
chr10 unknown CDS 80641826 80642004 . + 0 gene_id "Eef2"; gene_name "Eef2"; p_id "P7224"; transcript_id "NM_007907"; tss_id "TSS5168";
chr10 unknown exon 80641826 80642004 . + . gene_id "Eef2"; gene_name "Eef2"; p_id "P7224"; transcript_id "NM_007907"; tss_id "TSS5168";
chr10 unknown CDS 80642091 80642196 . + 1 gene_id "Eef2"; gene_name "Eef2"; p_id "P7224"; transcript_id "NM_007907"; tss_id "TSS5168";
chr10 unknown exon 80642091 80642196 . + . gene_id "Eef2"; gene_name "Eef2"; p_id "P7224"; transcript_id "NM_007907"; tss_id "TSS5168";
chr10 unknown CDS 80642289 80642541 . + 0 gene_id "Eef2"; gene_name "Eef2"; p_id "P7224"; transcript_id "NM_007907"; tss_id "TSS5168";

Snord37 gene inside Eef2 and length of the Snord37 gene is just 52 but in cuffdiff output I've got:
Snord37 Snord37 Snord37 chr10:80639375-80645254 Control IL33 OK 0 8173.79 1.79769e+308 1.79769e+308 0.0786496 0.428305 no

locus size is 5879. Also cufflinks found 8173.79 FPKM in bam file for the Snord37 but there just 2 reads.

I have a couple other examples. I've tested it on 1.2.1, 1.3.0, 2.0.0 versions of cufflinks the result is the same.
Portah is offline   Reply With Quote
Old 06-10-2012, 05:22 AM   #5
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default

I'm glad to hear to someone else can verify my suspicions. I have contacted the tophat cufflink support site about this but I do not expect them to reply because they ignored a previous question I submitted about a month ago.
drdna is offline   Reply With Quote
Old 06-11-2012, 12:59 AM   #6
NicoBxl
not just another member
 
Location: Belgium

Join Date: Aug 2010
Posts: 263
Default

Is there other tool like cufflinks ? to compare the results.
NicoBxl is offline   Reply With Quote
Old 06-11-2012, 04:02 AM   #7
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 300
Default

I haven't tried cufflinks but have heard others complaining at conferences.

I have been impressed with edgeR and use that in production here.
colindaven is offline   Reply With Quote
Old 06-11-2012, 04:04 AM   #8
NicoBxl
not just another member
 
Location: Belgium

Join Date: Aug 2010
Posts: 263
Default

Quote:
Originally Posted by colindaven View Post
I haven't tried cufflinks but have heard others complaining at conferences.

I have been impressed with edgeR and use that in production here.

yes edgeR and DESeq work pretty well. But is there a tool to perform a reference-based transcriptome assembly (like cufflinks)
NicoBxl is offline   Reply With Quote
Old 06-11-2012, 04:18 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 2,291
Default

Have you looked at MapSplice? http://www.netlab.uky.edu/p/bioinfo/MapSplice

Last edited by GenoMax; 06-11-2012 at 04:21 AM.
GenoMax is online now   Reply With Quote
Old 06-11-2012, 04:26 AM   #10
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

Quote:
Originally Posted by NicoBxl View Post
yes edgeR and DESeq work pretty well. But is there a tool to perform a reference-based transcriptome assembly (like cufflinks)
Have you tried Scripture?
http://www.broadinstitute.org/software/scripture/
pbluescript is offline   Reply With Quote
Old 06-11-2012, 05:42 AM   #11
rboettcher
Member
 
Location: Berlin

Join Date: Oct 2010
Posts: 71
Default

Hi all,
I just finished my first cufflinks run on RNAseq data and I also encountered results that make me doubt the validity of cufflinks' and cuffdiff's output.

Therefore I'm also considering to switch my analysis pipeline and rerun the analysis. However, during an Agilent seminar last week it was mentioned that Scripture would be an alternative which is heavy weight and requires serious computational ressources in order to perform the assembly. So my question is: does anybody already have experiences with Scripture and if so could you give recommendations towards the machine specifications needed?

Best regards
rboettcher is offline   Reply With Quote
Old 06-11-2012, 04:44 PM   #12
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default

MapSplice is good for gene structure analysis but doesn't do differential expression analysis.
drdna is offline   Reply With Quote
Old 06-12-2012, 12:40 AM   #13
NicoBxl
not just another member
 
Location: Belgium

Join Date: Aug 2010
Posts: 263
Default

Quote:
Originally Posted by pbluescript View Post

Yes but I have several problems to run it. I'll open a new thread now with my problems.

edit > here's the thread for scripture : http://seqanswers.com/forums/showthr...5998#post75998

Last edited by NicoBxl; 06-12-2012 at 12:47 AM.
NicoBxl is offline   Reply With Quote
Old 06-12-2012, 12:44 PM   #14
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default

Upon quick inspection, it appears to me that Scripture simply assembles transcripts but does not quantify and compare expression levels. Is that the case?

Last edited by drdna; 06-12-2012 at 12:49 PM.
drdna is offline   Reply With Quote
Old 06-12-2012, 01:39 PM   #15
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Quote:
Originally Posted by NicoBxl View Post
yes edgeR and DESeq work pretty well. But is there a tool to perform a reference-based transcriptome assembly (like cufflinks)
One alternative would be to run cufflinks and then use the transcripts.gtf or combined.gtf from cuffcompare as your input for something like HTSeq-count. That will give you a list of transcripts with raw reads which can then be used in either DESeq or EdgeR.

This approach would avoid any potential problems with Cufflinks quantification/differential expression while giving the advantage of a reference-based transcriptome assembly.
chadn737 is offline   Reply With Quote
Old 06-12-2012, 01:49 PM   #16
Portah
Member
 
Location: Cincinnati

Join Date: Jan 2012
Posts: 12
Default

After a lot of digging about wrong FPKMs and cufflink in the forum and documentation. I tried to check cds_exp.diff and was surprised that FPKMs there and gene list (after infinity filtering +-1.79E+308) are near expected values. Maybe we incorrectly interpret how cufflinks split reads between intersect regions which are a lot in GTF file (CDS, exons, stop-codons...) ?
Portah is offline   Reply With Quote
Old 06-12-2012, 01:51 PM   #17
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default

If you follow this thread, you will see that there is a problem with this approach because cufflinks/cuffmerge produces erroneous .gtf files which contains instances where multiple transcripts are merged into one (despite the lack of any evidence to support such mergings).
drdna is offline   Reply With Quote
Old 06-12-2012, 01:54 PM   #18
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default

To Portah. Good find, I'm going to go check the cds_exp.diff file for my runs and see if it makes more sense. Regardless, there is still an issue with transcript/reference annotation merging.
drdna is offline   Reply With Quote
Old 06-12-2012, 02:06 PM   #19
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Quote:
Originally Posted by drdna View Post
If you follow this thread, you will see that there is a problem with this approach because cufflinks/cuffmerge produces erroneous .gtf files which contains instances where multiple transcripts are merged into one (despite the lack of any evidence to support such mergings).
My bad, I thought the primary concern was the FPKM calculation, which I have never trusted and why I have always stuck with count based methods of differential expression. But if there is also a problem with merging the transcripts then that seems far more fundamental even.

I'm curious though, how are you guys running cufflinks? I'm assuming you are using the -g/--GTF-guide argument? Or does this problem persist even if you give it the -G/--GTF argument and tell it not to look for novel transcripts and stick to the supplied GTF file?
chadn737 is offline   Reply With Quote
Old 06-12-2012, 02:12 PM   #20
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 45
Default

chadn737, Yes and yes. I have been running cuffmerge using a reference gtf and the --no-novel-juncs flag.

So if you are using a count-based method of DE analysis, do you align your reads with gene sequences, as opposed to a genome assembly? I'd be interested in hearing a little bit more about your approach.
drdna is offline   Reply With Quote
Reply

Tags
cuffdiff, cufflinks

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:25 AM.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.