
**gpertea** · 06-19-2010, 06:43 AM

I am afraid I don't understand what you are looking at there. Are 'u' and '.' the only class codes you got in the .tracking file? Because that would rather suggest that the chromosome names do not match at all between the reference annotation and the input sample files (the chromosome IDs in the first column should follow the same naming convention, e.g. it shouldn't be 'chr1' in the GTF with the transfrags and '1' in the reference annotation file).
However in that case I don't get this statement:

All of the other categories are consistent across the board.

What other categories? In the attached table picture I only see 'u' and '.' codes listed, and a summary 'Total transfrags' row, which makes me think there are no other categories / class codes. Sorry for being a little confused here about your questions.

Also, I suppose that in order to compare the class code distribution in different samples, you were in fact looking at the .tmap files, not the .tracking file as you said, is that right? You mentioned you ran cuffcompare with 4 GTF files as input: was that one cuffcompare run with all 4 files at once, or was it 4 separate runs?
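
The chromosome-naming check described above can be sketched in a few lines. This is only a minimal illustration, assuming tab-separated GTF input with the sequence ID in the first column; the function names `seqnames` and `naming_report` and the toy records are hypothetical, not part of cuffcompare.

```python
import io

def seqnames(gtf_lines):
    """Collect the distinct values of column 1 (the sequence/chromosome ID)."""
    names = set()
    for line in gtf_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names

def naming_report(reference_lines, sample_lines):
    """Compare the sequence IDs used by a reference GTF and a sample GTF."""
    ref = seqnames(reference_lines)
    smp = seqnames(sample_lines)
    return {
        "shared": ref & smp,
        "only_reference": ref - smp,
        "only_sample": smp - ref,
    }

# Toy data (hypothetical): the reference uses '1', the sample uses 'chr1'.
reference = io.StringIO(
    '1\tref\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
)
sample = io.StringIO(
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";\n'
)
report = naming_report(reference, sample)
print(report)
```

An empty `shared` set would point to the kind of naming mismatch described above; with real files, pass open file handles instead of the `io.StringIO` objects.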

**zorph** · 06-21-2010, 05:21 AM

Sorry about the confusion earlier.

I see more class codes than just the "u" and the ".". To limit confusion, I had previously just posted a table with the relevant numbers, which clearly backfired.

I have attached my entire table (with the number of transfrags in each category) to this posting. Hopefully this makes things a little clearer.

As far as my comment of "all of the other categories are consistent across the board," as you can see from my attached table, the samples have very similar "numbers/%" transfrags in all other categories besides the "." and the "u" category.

I ran cuffcompare once with all 4 GTF files as input, in addition to using a reference.

My run line was the following:

Code:

cuffcompare -r reference.gtf ./sample1.gtf ./sample2.gtf (etc.)

I calculated all of these numbers using the tracking file because, if I understood the manual correctly, the tracking file "matches transcripts up between samples" and lists "each transcript structure that is present in one or more input GTF files", so all transcripts should be present in this file. Or am I interpreting this incorrectly?

I hope this posting makes my earlier message much easier to understand. Thanks

Also, in case this helps to diagnose the problem-- here is the pipeline that I used to treat my samples:
Ga2->tophat->cufflinks (w/ accepted_hits.sam)->cuffcompare (as stated earlier with the gtf file from cufflinks)

Attached Files

table.jpg (88.1 KB, 95 views)

**gpertea** · 06-21-2010, 12:42 PM

I calculated all of these numbers using the tracking file because, if I understood the manual correctly, the tracking file "matches transcripts up between samples" and lists "each transcript structure that is present in one or more input GTF files"

That's correct, though I think things get a little fuzzy when single-exon transfrags are considered, because in that case there is no "structure" to look at and transfrags may get merged into a single line in that file if they just overlap each other very well (though not perfectly).

However, I think trying to get such per-sample stats from the tracking file is not a good idea because of the ambiguity of the '.' class code, which simply has no meaning when applied to an individual transfrag in a sample. Instead, you should use the .tmap files, which are generated for each of the input files and provide an independent transfrag classification for each sample. As you probably saw in the manual, the '.' code is used in the .tracking file whenever transfrags found to be "structurally equivalent" across samples (and thus likely to come from the same transcript) do not have the same classification code when considered individually (i.e. as shown in the .tmap file). That is, say we have a transfrag t1 in sample 1 that has code 'u' when compared to the reference transcripts, and it has an "equivalent structure" (but see the caveat above for single-exon transfrags) with a transfrag t2 found in sample 2. Now say t2 is classified as 'p' because it extends a bit closer to a known transcript. This combination will end up shown as '.' in the tracking file, and it doesn't make sense to classify the transfrags in both samples as having the '.' code in a table like yours. By the looks of it, I suppose samples 1 and 2 had a lot of these "equivalent" transfrags with mixed individual codes (one of them being 'u') that got reported in the tracking file as '.'.
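
The rule just described for the '.' code can be illustrated with a small sketch. This is not cuffcompare's actual code; `tracking_code` is a hypothetical helper that only restates the documented behaviour, and using `None` for a sample in which the transfrag is absent is an assumption made for the example.

```python
def tracking_code(per_sample_codes):
    """Return the class code a .tracking row would show under the rule above.

    per_sample_codes holds the individual (per-sample, .tmap-style) class
    codes of structurally equivalent transfrags; None marks a sample in
    which the transfrag is absent (an assumption of this sketch).
    """
    codes = {code for code in per_sample_codes if code is not None}
    if not codes:
        raise ValueError("at least one sample must contain the transfrag")
    # One shared code is reported as-is; mixed codes collapse to '.'.
    return codes.pop() if len(codes) == 1 else "."

# t1 is 'u' in sample 1 and its equivalent t2 is 'p' in sample 2.
print(tracking_code(["u", "p"]))
# Both samples agree, so the shared code is kept.
print(tracking_code(["u", "u"]))
```

The first call shows why a '.' row says nothing about either sample individually: the 'u' and the 'p' are both hidden behind it.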

In all fairness, this still looks like a strange distribution, so I suppose it is also possible that there is some inconsistency somewhere in the initialization of the class codes for the .tracking file, such that some transfrags with the 'u' class code (which is the default code) end up being reported in some cases as '.' instead. I'll check my code to see if/how that could happen. But again, if you want a meaningful distribution of transfrag categories in each individual sample, I would advise using the .tmap files instead of the .tracking file, because the '.' category doesn't tell you anything about the actual classification of transfrags in a single sample (in your case, it looks like this category actually "stole" almost all the 'u' transfrags in samples 1 and 2).
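
For the per-sample tally suggested above, here is a minimal sketch that counts class codes in one .tmap file. It assumes a tab-separated file with a header row containing a `class_code` column; the other column names and the rows in the toy data are made up for illustration.

```python
import csv
import io
from collections import Counter

def class_code_counts(tmap_lines):
    """Count transfrags per class code in one .tmap file."""
    reader = csv.DictReader(tmap_lines, delimiter="\t")
    return Counter(row["class_code"] for row in reader)

# Toy .tmap content (hypothetical rows; only class_code is relied on).
toy_tmap = io.StringIO(
    "ref_gene_id\tref_id\tclass_code\tcuff_id\n"
    "g1\tt1\t=\tCUFF.1.1\n"
    "-\t-\tu\tCUFF.2.1\n"
    "-\t-\tu\tCUFF.3.1\n"
)
counts = class_code_counts(toy_tmap)
print(counts)
```

Running this once per input sample (one .tmap file each) gives a table of class codes that never contains the '.' category.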

**zorph** · 06-22-2010, 08:42 AM

Thank you so much! That explanation will definitely help me in my analysis

Also, recreating my table using the individual .tmap files allowed me to see that the number of transfrags in each class was consistent across all samples.

elisa*_* · 06-23-2010, 08:30 AM

Hi gpertea, I have a question about the format of the .tracking file. The Cufflinks manual says there are 6 fields for each sample transcript, as follows:

qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>

However, when I run Cufflinks 0.8.2, I get 8 fields for each transcript, for example:

q1:CUFF.54652|CUFF.54652.1|100|10.306160|3.018604|17.593716|1.929078|141

Could you tell me what are the two extra fields? Thanks a lot!

**gpertea** · 06-23-2010, 09:14 AM

Indeed, the manual hasn't yet been updated to reflect the fact that two extra fields were added there, so the format is now like this:

Code:

qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>|<cov>|<len>

..where the added fields are:

<cov>: the estimated average depth of read coverage across the transfrag
<len>: the length of the transfrag
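
As an illustration of the 8-field layout above, here is a minimal parsing sketch. The `SampleEntry` type and `parse_sample_field` function are hypothetical names, and the sketch assumes every entry carries exactly the eight fields listed; anything else makes the unpacking raise a `ValueError`.

```python
from typing import NamedTuple

class SampleEntry(NamedTuple):
    sample: str          # the qJ prefix, e.g. "q1"
    gene_id: str
    transcript_id: str
    fmi: float
    fpkm: float
    conf_lo: float
    conf_hi: float
    cov: float           # estimated average depth of read coverage
    length: int          # length of the transfrag

def parse_sample_field(field):
    """Split one 'qJ:...|...' sample entry of a .tracking row into typed fields."""
    sample, _, rest = field.partition(":")
    gene_id, transcript_id, fmi, fpkm, lo, hi, cov, length = rest.split("|")
    return SampleEntry(sample, gene_id, transcript_id, float(fmi), float(fpkm),
                       float(lo), float(hi), float(cov), int(length))

# The example entry quoted earlier in the thread.
entry = parse_sample_field(
    "q1:CUFF.54652|CUFF.54652.1|100|10.306160|3.018604|17.593716|1.929078|141"
)
print(entry.transcript_id, entry.cov, entry.length)
```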

elisa*_* · 06-23-2010, 09:40 AM

gpertea, thank you for your prompt reply!

**GeneSeeker** · 06-29-2010, 10:07 AM

visualization of Cuffcompare class codes

Hi,
Some of the class code descriptions are a little difficult to interpret.

I have made a visualization of the transfrags that fall into each class, based on my interpretation, and attached it to this post.

Am I interpreting the descriptions correctly? Thanks in advance.

Attached Files

Cuffcompare_output.jpg (90.6 KB, 229 views)

**thinkRNA** · 06-29-2010, 03:08 PM

Originally posted by gpertea

Indeed, the manual hasn't yet been updated to reflect the fact that two extra fields were added there, so the format is now like this:

Code:

qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>|<cov>|<len>

..where the added fields are:

<cov>: the estimated average depth of read coverage across the transfrag
<len>: the length of the transfrag

Hi gpertea, how is coverage calculated in this file? Can you please tell me the formula used? For the "estimated average depth of read coverage", how do you determine which transcript it refers to, or is this local to the transfrag? I am very confused. Also, is a transfrag the same as the "fragments" in FPKM? How can I determine how many million reads were mapped for each experiment, and how are multireads handled in this case?

Finally, for single-end reads, what makes a fragment? I understand the definition of fragments for paired-end reads.

One last thing: sometimes cufflinks will show disconnected parts of a transcript even though there are reads across the entire gene. Why is this so? Could it be because coverage is too low in other parts of the transcript?

Thanks so much, and I really hope that you will reply to my queries because my data is not making sense to me.

**brdido** · 10-04-2011, 11:03 AM

(=) Is "perfect match" or "Complete match of intron chain"

Hi guys,

We're trying hard to understand the definition of "intron chain" in the class code description for "=".

As you guys stated in your tables "Complete match of intron chain" means "perfect match of a transcript", is that it?

thanks in advance (should i study more english? :P )

**wenhuang** · 10-04-2011, 11:07 AM

I think that "intron chain" essentially means ignoring the 5' end of the first exon and the 3' end of the last exon. In other words, you get all the introns recovered, which does not necessarily mean that you get all the ends recovered.

Originally posted by brdido

Hy guys,

We're trying hard to understand the definition of "intron chain" in the class code description for "=".

As you guys stated in your tables "Complete match of intron chain" means "perfect match of a transcript", is that it?

thanks in advance (should i study more english? :P )

**gpertea** · 10-04-2011, 11:14 AM

wenhuang is correct. The intron coordinates must all match, which means that all the internal exons also match, and only the start coordinate of the first exon and the end coordinate of the last exon are allowed to differ from those of the reference transcript.
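
The definition above can be made concrete with a short sketch. It is an illustration only, not cuffcompare's implementation: exons are assumed to be 1-based inclusive `(start, end)` pairs on the same sequence and strand, and treating single-exon transfrags (which have no intron chain) as non-matching is a choice made for this example.

```python
def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exon coordinates."""
    exons = sorted(exons)
    return [(exons[i][1] + 1, exons[i + 1][0] - 1) for i in range(len(exons) - 1)]

def same_intron_chain(transfrag_exons, reference_exons):
    """True when every intron coordinate matches between the two transcripts.

    Only the start of the first exon and the end of the last exon may differ.
    """
    chain = intron_chain(transfrag_exons)
    return bool(chain) and chain == intron_chain(reference_exons)

reference = [(100, 200), (300, 400), (500, 600)]
# Same introns, but shorter at the 5' end and longer at the 3' end.
transfrag = [(150, 200), (300, 400), (500, 650)]
# The first exon ends 10 bases later, so the first intron differs.
shifted = [(100, 210), (300, 400), (500, 600)]

print(same_intron_chain(transfrag, reference))
print(same_intron_chain(shifted, reference))
```

Because the introns are fixed, the internal exon `(300, 400)` is necessarily identical in any matching transfrag, which is the point made above.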

**brdido** · 10-04-2011, 11:18 AM

ok! i think i got it!
Thanks!

**jtrivino** · 01-20-2012, 07:44 AM

Hi all

In a transcriptomics analysis with cufflinks and cuffcompare, I want to filter out noise, i.e. eliminate "suspicious" transfrags. I think the class code could be a good parameter for this selection. Which class codes are best for filtering transfrags?

Thanks!!!

**upendra_35** · 09-19-2012, 06:14 PM

cuffcompare

Hi gpertea,

I have a couple of questions regarding cufflinks/cuffcompare:

1. I found strange results when I compared two runs: cufflinks with annotation followed by cuffcompare, and cufflinks without annotation followed by cuffcompare. Here are the results:

#with annotation:
[upendra_35@vm142-17 Denovo_stuff]$ cut -f3 cufflinks_out/cuffcompare_out.transcripts.gtf.tmap |sort|uniq -c
41003 =
7 c
1 class_code

# Without annotation:
[upendra_35@vm142-17 Denovo_stuff]$ cut -f3 cufflinks_out_no_annot/cuffcompare_out_no_annot.transcripts.gtf.tmap |sort|uniq -c
11935 =
6397 c
1 class_code
5014 e
562 i
16519 j
7226 o
1844 p
51 s
8169 u
624 x

Why do we get essentially only one class ('=') when cufflinks is run with the annotation? I have already checked the annotation file and the transcripts.gtf file, and the chromosome names match. I believe the result from cufflinks without annotation is the more realistic one, right?

2. The result above is based on 3 lanes of Illumina data. Do you think we can increase the percentages of the interesting classes (o and u) if we include more data?

Thanks in advance......
