SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
Inconsistency between cuffdiff 1.1.0 and cuffdiff 1.0.2 tleonardi Bioinformatics 3 11-16-2011 09:23 AM
cuffdiff q-value fangquan Bioinformatics 1 09-19-2011 07:08 PM
CuffDiff q-value jdanderson Bioinformatics 0 08-01-2011 11:51 AM
What comes after cuffdiff? shurjo Bioinformatics 1 06-05-2011 06:36 AM
cuffdiff-0.9.1 Balat Bioinformatics 15 11-21-2010 04:30 AM

Reply
 
Thread Tools
Old 01-02-2011, 05:57 PM   #1
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default CuffDiff w/ novel transcriptome

Fairly new to RNA-Seq analysis here. We have assembled novel transcriptome using 454 and Illumina reads via Newbler2.5. Now we wish to use cufflinks package (in particular cuffdiff) to measure expression changes. It seems the cufflinks package is geared towards aligning reads to genomic reference, where cufflinks will output the transcript.gtf file after aligning reads to a genome via tophat. If our dataset is purely mRNA, do I still need to run cufflinks to get the transcript.gtf file, which is required as input to Cuffdiff?
blindtiger454 is offline   Reply With Quote
Old 01-02-2011, 09:57 PM   #2
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

Cuffdiff is only meant for RNA-Seq analysis where a reference genome is available. You can probably make it work for you by altering your alignment file to give your transcripts "dummy" genomic coordinates. Alternatively, I recommend you look at RSEM (http://deweylab.biostat.wisc.edu/rsem/) which is meant to estimate abundances purely from transcript sequences.
adarob is offline   Reply With Quote
Old 01-02-2011, 10:00 PM   #3
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

I just re-read your question, and it's not clear whether or not a reference genome is available. If it is, you can map your reads to the reference using your assembled transcriptome (GTF) as an input to TopHat. You can then use Cuffdiff with your GTF without first running Cufflinks, if you do not wish to use its assembly function. Alternatively, you can use the Cufflinks assembler and Cuffcompare to combine the Cufflinks assembly with your own.

Let me know if you have any further questions!
adarob is offline   Reply With Quote
Old 01-04-2011, 09:01 PM   #4
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

We do not have a reference genome. So if I wanted to use Cuffdiff, would I have to manually create a "transcript.gtf" file for my assembled transcripts, just creating an entry in the GTF file for each assembled contig/isotig, and adding in fake or irrelevant genome coordinates?
blindtiger454 is offline   Reply With Quote
Old 01-05-2011, 10:31 AM   #5
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

Yes, but it will probably be somewhat involved. You will need to build your GTF so that isoforms correctly overlap in the genome coordinates. You will then need to adjust your SAM alignments so that they are converted from transcriptome coordinates to your new "genomic" coordinates. If you decide to go this route, please let me know how it goes! I can also help if you get stuck along the way.
adarob is offline   Reply With Quote
Old 01-05-2011, 12:27 PM   #6
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

Do you have any other ideas? Or should I look into RSEM or bioconductor packages?
blindtiger454 is offline   Reply With Quote
Old 01-05-2011, 09:26 PM   #7
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

I would look into RSEM if I were you.
adarob is offline   Reply With Quote
Old 01-28-2011, 03:10 PM   #8
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

Could I just use the "-G" option in cufflinks and use a reference annotation based on my assembled transcriptome? Also, can I directly use the transcript.gtf produced from cufflinks as cuffdiff input? I read that cuffdiff requires tss and p_id's, but it looks like cufflinks does not produce these IDs in the outputted gtf file.
blindtiger454 is offline   Reply With Quote
Old 01-28-2011, 04:29 PM   #9
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default p_id issue

I think Cole or Adam may like to add a few sentences in the mannual about getting p-id, I have used GTF file with CDS record but no p_id in the out put in most recent version of Cufflinks. It is quite a problem and not straight forward otherwise It is a great tool.
honey is offline   Reply With Quote
Old 01-28-2011, 09:21 PM   #10
peromhc
Senior Member
 
Location: Durham, NH

Join Date: Sep 2009
Posts: 108
Default

I have been testing an approach where I use Tophat to align reads back to contigs (treating the like many many small chromosomes in essence) and then using Cufflinks-compare-diff to look at differential expression.. This is in a novel transcriptome i.e. no reference. Seems to work OK enough-- although there may be some issue with the statistical detection of differneces that I have yet to explore.
peromhc is offline   Reply With Quote
Old 01-30-2011, 08:30 AM   #11
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

Thats definitely of interest to us- I'd really like feedback on how well this works and any suggestions you may have on optimizing such a strategy.
Lior (tophat.cufflinks@gmail.com)
lpachter is offline   Reply With Quote
Old 02-02-2011, 01:44 PM   #12
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

I realize there are other programs such as RSEM that easily handle novel transcriptomes, however the CuffDiff package produces more detailed output. We used Newbler to assemble the transcriptome, and output for this program is in the form of isogroups, isotigs, and contigs, where an isogroup is considered a gene, isotig is transcript, and contigs are exons. Ideally this would be true for all the output, but is not the case. Many isotigs within a isogroup are simply assembly goofs, retrotransposon events, or indel genes from sister chromosomes.
An idea above suggested treating each gene as is if it was on its own separate chromosome. It should make no difference to the program if it thinks there are 10 or 1000000 chromosomes. This might make it easier to create a fake transcript.gtf file, where gene_id's would be the isogroup number, and transcript_id's are the isotig names. Then possibly treating the whole isotig as a CDS region?? With published results showing how Newbler2.5 is such a great assembler, there has to be someone who has assembled novel transcriptome via Newbler and mapped reads through TopHat/Cufflinks? Anyone??? lol
blindtiger454 is offline   Reply With Quote
Old 02-02-2011, 03:17 PM   #13
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

I apologize for giving you inaccurate information earlier, but it turns out that my idea for making a pseudo-genome will not work unless your transcriptome is aligned so that you know which transcripts overlap and where. Otherwise, you will need multi-read support, which Cufflinks currently lacks (although it is coming).

Short of knowing these overlaps, RSEM is your best option (to my knowledge).
adarob is offline   Reply With Quote
Old 02-02-2011, 03:46 PM   #14
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default Allel specific expression

Can we use Cufflink/ Cuffdiff to find the allel specific expression? Any suggestion please.
honey is offline   Reply With Quote
Old 02-02-2011, 11:32 PM   #15
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

I forgot to mention that our transcriptome is plant. That means polyploid and littered with retrotransposons/transposons. The RSEM paper said 52% of reads in maize were multi-reads. I'm not sure if even RSEM can handle the amount of multi-read alignments a novel plant transcriptome will produce. We are still brainstorming ways to shortcut and streamline this pipeline. Will removing any transcripts & reads that map to retrotransposons affect the statistics of RSEM or CuffLinks? I have been favoring CuffLinks because of better documentation and output. The RSEM google group doesn't have any posts yet. Has anyone used RSEM for novel transcriptome?
blindtiger454 is offline   Reply With Quote
Old 02-03-2011, 08:39 PM   #16
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

From what I've seen in the RSEM documentation/output examples, I will have to manually apply calculations to normalize triplicate samples, and to determine expression change between samples. I may need help in that area.
I still may try to create a fake transcripts.gtf file for cufflinks. A Newbler assembly using just 454 reads usually creates isogroups(genes) where ~70% are single transcript. These can easily be entered into the transcript.gtf file using fake genomic coordinates. And I suspect that around 30% of multi-transcript isogroups are retrotransposons. The rest of the isogroups I can throwout if they are irrelevant genes (such as housekeeping), or I can align them to best match rice(close relative where genome has been sequenced) transcripts to determine where exons/overlaps potentially are. Does this sound rational?

Last edited by blindtiger454; 02-03-2011 at 08:42 PM.
blindtiger454 is offline   Reply With Quote
Old 02-04-2011, 11:48 AM   #17
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

I am also somewhat confused how transcript overlaps will affect the program. I've seen instances where a transcript will use half an exon, and another transcript will use the full exon. Then there might be instances where a reverse strand transcript overlaps transcript on other strand, and they share a coding region (I'm sure it's extremely rare, maybe in cases of paralogs). I'm not sure how it affects the statistics, possibly regarding sequence/exon length & number, and instances where a read will map to more than one transcript. I read somewhere that many programs will just throw out reads that map to more than one gene/transcript or if it can't resolve where to map. RSEM tries to resolve this.
For now, I am interested in gene expression differences, not transcript. Once candidate genes that show expression change are singled out, a fine tuned pipeline can be devised to catch changes in isoform expression among these genes. I cringe saying this though. I can think of many circumstances where, say in switchgrass, one isoform will turn off and another will increase expression during drought, where the only difference between the two is one small exon. I'm not sure this would be detected at the gene and/or isoform level in cuffdiff without overlapping transcript coordinates, correct??
blindtiger454 is offline   Reply With Quote
Old 02-08-2011, 12:50 PM   #18
lahoman
Member
 
Location: Houston

Join Date: Jan 2011
Posts: 12
Default

Hi, Adarob,

After I use tophat to map Human RNA-Seq to the genome, then cufflinks for the transcript analysis, I checked the file of transcripts.expr. There are 263,506 transcripts. That's a lot. Do I need to filter the results based on FPKM? What criteria should I use? 1.0 or 0.5? I have no idea about it.

Thanks,

Lahoman
lahoman is offline   Reply With Quote
Old 03-02-2011, 04:25 AM   #19
whfwind
Junior Member
 
Location: shanghai

Join Date: Jan 2011
Posts: 8
Default

I believe you can not filter the result base on the FPKM just by 1.0, or 0.5, in the cufflink paer, they said the FPKM >=15 is considered as moderately abundant transcripts. So I guess if you want to extract abundant transcripts you can use 15 or more, but there is another exception, some transcripts is low expression but is meaningful for Human
whfwind is offline   Reply With Quote
Old 05-30-2011, 04:33 AM   #20
papori
Senior Member
 
Location: berd

Join Date: Dec 2010
Posts: 181
Default

Quote:
Originally Posted by adarob View Post
Cuffdiff is only meant for RNA-Seq analysis where a reference genome is available. You can probably make it work for you by altering your alignment file to give your transcripts "dummy" genomic coordinates. Alternatively, I recommend you look at RSEM (http://deweylab.biostat.wisc.edu/rsem/) which is meant to estimate abundances purely from transcript sequences.
Hey adarob,
i look into RSEM, but i didn't find the option that checking differential expression between samples without reference.
can you give an example?

thanks
papori is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:41 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO