SEQanswers

Old 10-16-2011, 09:35 AM   #1
apadr007
Member
 
Location: washington DC

Join Date: Oct 2011
Posts: 21
Default Cufflinks and Cuffcompare extended CDS?

Hello,

I have just finished running Cufflinks on my dataset and would like to calculate how far the coding regions of the Cufflinks transcript models extend past the known annotated reference models. This would shed light on the 5' and 3' UTRs and help refine the existing annotation. I ran cuffcompare on my dataset, but as far as I can tell its class codes do not cover this.

Does anyone have any ideas, comments, or suggestions on how to quantify this?

Thanks,
Old 10-16-2011, 07:02 PM   #2
jiyan
Junior Member
 
Location: shanghai

Join Date: Aug 2011
Posts: 6
Smile

Hello, apadr007
Among the class codes in cuffcompare's tracking files, it seems to me that both code "o - Generic exonic overlap with a reference transcript" and code "x - Exonic overlap with reference on the opposite strand" can help you pick out the transcripts of interest.
My question is: why don't cuffcompare's current class codes meet your needs? Thank you, and I hope for your reply.
Old 10-17-2011, 05:41 AM   #3
apadr007
Member
 
Location: washington DC

Join Date: Oct 2011
Posts: 21
Default

I had thought the generic overlap with a reference transcript only referred to single exon overlaps. Thanks, it seems I can use these class codes for my analysis.
Old 11-16-2011, 09:47 PM   #4
lixiangru
Junior Member
 
Location: China

Join Date: Jan 2011
Posts: 7
Default

Quote:
Originally Posted by jiyan
Hello, apadr007
Among the class codes in cuffcompare's tracking files, it seems to me that both code "o - Generic exonic overlap with a reference transcript" and code "x - Exonic overlap with reference on the opposite strand" can help you pick out the transcripts of interest.
My question is: why don't cuffcompare's current class codes meet your needs? Thank you, and I hope for your reply.

I have checked my results. It seems that some of the transcripts with code "o" are contained within reference transcripts, so code "o" alone does not work for finding extended CDS when predicting UTRs.
Old 10-15-2012, 10:39 PM   #5
upendra_35
Senior Member
 
Location: USA

Join Date: Apr 2010
Posts: 102
Default

I want to revive this thread, as I am basically doing what apadr007 planned to do some time ago. I also observed that some class code "o" transcripts are contained within reference transcripts (the same observation as lixiangru). But why were those transcripts classified as "o" rather than "c"?

I also want to see how the reference transcriptome annotation compares to the Cufflinks transcript models. For that, is it right to use class code "o" only?

Also, apadr007, have you finished your analysis? If so, could you share how you calculated how far the coding regions of the Cufflinks transcript models extend past the known annotated reference models? Any help would be appreciated. Thanks.

Last edited by upendra_35; 10-15-2012 at 10:47 PM.
Old 10-16-2012, 05:07 AM   #6
apadr007
Member
 
Location: washington DC

Join Date: Oct 2011
Posts: 21
Default

Hi upendra_35,

These transcripts get these general classifications based on criteria set by the developers, so take them with a grain of salt. Although the logic behind the class codes is sound, it is not all-encompassing. I have experimentally validated several "polymerase run-on fragments" and several other classes that cuffcompare considers false, and I have picked them up with RT-PCR. I would use the class codes more as a guide for your analysis. They are useful for determining whether you have detected a variant of a previously known gene or whether what you picked up is totally novel with respect to the reference genome.

To detect coding regions that extend past the known reference annotation, first extract from your cuffcompare output the reference genes you detected (by accession number). Place these accession numbers in a file, called input.txt for example. Then get the .gff3 file for your reference annotation and run this in a Unix shell.

Code:
while read -r A B; do grep -F -- "$A" reference-genome.gff3; done < input.txt > output.gff3
This generates a .gff3 containing only the accession numbers you detected with cuffcompare. (You would do this if you detected only a portion of your reference genes rather than all of them, for example in a tissue-specific transcriptome analysis, or if you do not trust the coverage mapping to the reference annotation. It lets you select only what you place in input.txt.) From here, remove the extra features in the .gff3, such as coding DNA sequences, keeping only the gene lines, like this:

Code:
awk -F'\t' '$3 == "gene"' output.gff3 > output.gff3.gene
Then reformat this .gff3 file into a .bed file. You can take a look at the BED format here (http://genome.ucsc.edu/FAQ/FAQformat.html#format1). You can do this in Excel if you are not familiar with awk or Perl; it's just a matter of rearranging the columns and adding the orientation of your transcripts as either + or -.
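If you would rather script the conversion than use a spreadsheet, here is a minimal sketch. It assumes the gene name is stored in an "ID=" attribute in column 9 of your .gff3 (check your own file), and the function name gff3_genes_to_bed is just something I made up for illustration. The one thing to watch is that GFF3 coordinates are 1-based while BED start coordinates are 0-based, so the start has to be shifted by one.

```shell
# Convert GFF3 "gene" features to 6-column BED.
# GFF3 coordinates are 1-based and inclusive; BED starts are 0-based,
# so the start is shifted down by one and the end is kept as-is.
# Assumption: the gene name is stored in an "ID=" attribute in column 9.
gff3_genes_to_bed() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        /^#/ { next }                      # skip comment and header lines
        $3 == "gene" {
            name = "."                     # placeholder if no ID= is found
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) name = substr(attrs[i], 4)
            print $1, $4 - 1, $5, name, 0, $7
        }' "$1"
}

# Usage: gff3_genes_to_bed output.gff3 > output.bed
```

The score column is filled with 0 because nothing in the .gff3 maps onto it here.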

From here you can use BEDTools (http://code.google.com/p/bedtools/#BEDTools_Summary) to analyze the amount of coverage from your BAM files within a certain distance upstream (i.e. 5') or downstream (i.e. 3') of your reference genes. Alternatively, you can do this at the transcript level as well, but in my experience people normally use read coverage.
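To make the upstream/downstream idea concrete, here is a rough sketch that builds strand-aware flanking windows from a 6-column BED file with plain awk. The function name flank_windows and the 500 bp default are my own illustrative choices, not anything prescribed by BEDTools. The point it shows is that on the minus strand the 5' side is to the right of the gene, so the labels have to be swapped.

```shell
# Print strand-aware flanking windows for each gene in a 6-column BED file.
# Arguments: BED file, window size in bp (500 is only an illustrative default).
# On the + strand the 5-prime side lies before the start; on the - strand
# it lies after the end, so the labels are swapped for minus-strand genes.
# Note: the right-hand window is not clipped at the chromosome end.
flank_windows() {
    awk -F'\t' -v w="${2:-500}" 'BEGIN { OFS = "\t" }
        {
            left  = ($2 - w < 0) ? 0 : $2 - w        # clamp at position 0
            lname = ($6 == "-") ? "3prime" : "5prime"
            rname = ($6 == "-") ? "5prime" : "3prime"
            if ($2 > 0) print $1, left, $2, $4 "_" lname, 0, $6
            print $1, $3, $3 + w, $4 "_" rname, 0, $6
        }' "$1"
}

# Usage: flank_windows output.bed 500 > flanks.bed
```

The resulting windows can then be handed to BEDTools together with your BAM file to get read counts per window, for example with its coverage subcommand. Check the option order for your installed version, as it changed between releases.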

And as far as comparing your models to the reference genome, I would use all of the ones that have an associated accession number, except for "p" which can be up to 2kb in front of a reference gene.


cheers,
AP
Old 10-16-2012, 06:35 AM   #7
upendra_35
Senior Member
 
Location: USA

Join Date: Apr 2010
Posts: 102
Default

Thanks a lot, AP. By accession number, do you mean class codes? I have the class codes below from my analysis, and as I understand it, I should take all the class codes except "p" for validating the reference genome annotation (I have a GFF file from the reference genome). Right?

9982 =
4230 c
1 class_code
2522 e
122 i
11275 j
4591 o
304 p
39 s
1882 u
317 x
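
For anyone wanting to produce a tally like the one above, here is a minimal sketch. It assumes the class codes sit in one column of a tab-delimited cuffcompare output file that has a header row (the "1 class_code" row above is that header being counted). I believe class_code is the third column in the .tmap files, but treat the column number as an assumption and check your own file. The function name count_class_codes is made up for illustration.

```shell
# Count how many rows carry each class code.
# $1 = tab-delimited file with a header row
# $2 = column holding the class code (3 is an assumed default)
count_class_codes() {
    tail -n +2 "$1" |                 # drop the header row
        cut -f "${2:-3}" |            # keep only the class code column
        LC_ALL=C sort | uniq -c |     # tally identical codes
        awk '{ print $2 "\t" $1 }'    # print as: code <TAB> count
}

# Usage: count_class_codes my_sample.tmap 3
```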

The main objective of this analysis is not to build a new annotation altogether but to take the existing annotation and improve it further using RNA-seq. So we thought of using only class "u" (novel transcripts) and class "o" (novel exons/corrected transcripts compared to the reference?), and after validating some of these, we plan to add these transcripts to the existing annotation. What do you think of the overall strategy? Also, for validating "u" and "o" I was planning to use qPCR and RT-PCR, respectively. Does that make sense?
Old 10-16-2012, 06:50 AM   #8
apadr007
Member
 
Location: washington DC

Join Date: Oct 2011
Posts: 21
Default

When I say accession number I mean the name of the reference genes in your organism. This is found in the file you downloaded from whatever database you used.

No, you just get the accession numbers from what your transcripts matched. If you have a transcript with an "o" or "j", then it will have an associated accession number from the reference annotation, and that is the accession number you use. These accession numbers are reported in the cuffcompare.tracking output file. Obviously the "u" transcripts have no associated accession number, so they would not be used for the analysis you want to do.
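
As a rough sketch of pulling those accession numbers out of the tracking file: the code below assumes the file is tab-delimited with no header, that column 3 holds "ref_gene_id|ref_transcript_id" (or "-" when there is no match), and that column 4 holds the class code. That is how I understand the layout, but verify it against your own file. The function name ref_ids_for_codes is hypothetical.

```shell
# Collect reference gene IDs for transfrags with the chosen class codes.
# Assumed layout of cuffcompare.tracking (tab-delimited, no header):
#   col 3 = "ref_gene_id|ref_transcript_id" (or "-" when there is none)
#   col 4 = class code
# $1 = tracking file, $2 = class codes to keep, e.g. "oj"
ref_ids_for_codes() {
    awk -F'\t' -v codes="$2" '
        $4 != "" && index(codes, $4) > 0 && $3 != "-" {
            split($3, ref, "[|]")     # ref[1] is the reference gene ID
            print ref[1]
        }' "$1" | LC_ALL=C sort -u    # one line per gene, no duplicates
}

# Usage: ref_ids_for_codes cuffcompare.tracking oj > input.txt
```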

Designing primers just to test whether transcripts supported by read alignments are real will not tell you very much, in my opinion, since this has already been confirmed and you can simply cite a paper that has done it (check out http://www.biomedcentral.com/1471-2164/12/587/). You can, however, test them to determine an FPKM cutoff for your data, i.e. the minimum you would consider "real".

If you want to test your models, you can design primers that span splice junctions, run a PCR, and then send the products for sequencing (traditional sequencing). This will tell you whether your novel junctions are real and will assess the overall accuracy of your isoforms.
Old 10-16-2012, 09:30 AM   #9
upendra_35
Senior Member
 
Location: USA

Join Date: Apr 2010
Posts: 102
Default

Thanks so much, AP. Very useful comments and suggestions. I have been struggling for the last few days with how to deal with the cuffcompare class codes, so your comments were a relief. The problem is that there is not much information about this anywhere (not even on their website).

So to sum up:

Class code "u" transcripts don't need to be validated, because this has already been validated elsewhere, unless someone wants to determine an FPKM cutoff. (By the way, I have been using an FPKM cutoff of 1 to filter the novel transcripts. Is that good, or too relaxed?)

Class code "o" transcripts do need to be validated, by designing primers across the splice junctions and sequencing.

Thanks again AP.

Last edited by upendra_35; 10-16-2012 at 09:33 AM.
Old 04-09-2013, 10:04 PM   #10
sivasubramani
Member
 
Location: India

Join Date: Apr 2011
Posts: 14
Default

Hey all,

I am looking for transcripts that are extended or clipped at either the 5' or 3' end. Could anyone help me with this? Which class code would be helpful for identifying these transcripts?
Old 04-09-2013, 11:39 PM   #11
upendra_35
Senior Member
 
Location: USA

Join Date: Apr 2010
Posts: 102
Default

My best guess would be class code "o"; those are the transcripts you would be looking at.
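
The class code alone does not say which end is extended or clipped, so one option is to compare coordinates directly. Here is a minimal sketch, assuming you have already built a tab-delimited table (hypothetical, not a cuffcompare output) with the columns transfrag ID, reference ID, strand, transfrag start, transfrag end, reference start, reference end. It prints the difference at the 5' end and at the 3' end in bp, where a positive number means the model extends past the reference and a negative number means it is clipped. Note that this compares transcript ends, not CDS boundaries.

```shell
# Report how far each transcript model extends past (positive) or falls
# short of (negative) its reference at the 5-prime and 3-prime ends.
# Input columns (tab-delimited, hypothetical table you build yourself):
#   1 transfrag ID, 2 reference ID, 3 strand,
#   4 transfrag start, 5 transfrag end, 6 reference start, 7 reference end
# Output columns: transfrag ID, reference ID, 5-prime diff, 3-prime diff
end_differences() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        {
            left  = $6 - $4      # > 0: model starts before the reference
            right = $5 - $7      # > 0: model ends after the reference
            # On the minus strand the right-hand end is the 5-prime end.
            if ($3 == "-") print $1, $2, right, left
            else           print $1, $2, left, right
        }' "$1"
}

# Usage: end_differences pairs.tsv > end_differences.tsv
```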