SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
QR Codes and Plate Maps rfrichholt Bioinformatics 1 12-21-2011 04:46 PM
GATK and IUB codes donorio.demeo Bioinformatics 1 11-03-2011 07:34 AM
gsMapper error codes spacedninja 454 Pyrosequencing 2 08-02-2011 03:47 AM
Cuffcompare Class Codes jdanderson Bioinformatics 1 08-01-2011 12:39 PM
454 gsMapper error codes spacedninja General 2 06-29-2011 10:21 AM

Reply
 
Thread Tools
Old 01-04-2010, 01:51 AM   #1
leshkowitz
Junior Member
 
Location: Israel

Join Date: Jun 2008
Posts: 3
Default cuffcompare class codes

I ran cuffcompare with the -r option and got class codes which are not written in the manual.
Code:
cut -f2 cuffcompare.tracking | sort | uniq -c
3852	=
171	-
11919	.
159666	c
18619	e
2	i
11263	j
12492	o
53	p
I don't know what the "-" and the "." classes stand for. On the other hand I don't have the "u" class.

Thanks
leshkowitz is offline   Reply With Quote
Old 01-05-2010, 08:30 AM   #2
gpertea
Member
 
Location: Maryland, US

Join Date: Jan 2010
Posts: 21
Default

Yes, I am afraid that in the last versions of cuffcompare the 'u' class code is no longer used in the tracking file as it's basically replaced by '-' and '.'
  • '-' should appear only when there are no reference transcripts to be associated with the transcript(s) on that line (so it's really the 'u' case)
  • '.' is used when the relationship to the reference transcript is not the same (not consistent) among the multiple cufflinks transcripts on that line.
Though it seems to me that sometimes '.' is also used as a last resort when no other code has been assigned (i.e. just like '-'). Sorry for the inconsistency, I'll have this fixed in a later version.

For now I think it's safe to assume that if there is no reference transcript on that line and the code is '-' or '.', the actual class code is in fact 'u' there, for all intents and purposes. But if there is a reference and the code is '.' it means that the transcripts on that line have different relationships to the reference transcript (and one can check the .tmap output file to find how exactly each of the transcripts on that tracking line relates to the reference)
gpertea is offline   Reply With Quote
Old 01-07-2010, 09:00 AM   #3
jebe
Junior Member
 
Location: New Mexico

Join Date: Nov 2009
Posts: 9
Default cuffcompare class codes

What does the class code p stand for? A reference to it is also not in the manual.
jebe is offline   Reply With Quote
Old 01-08-2010, 09:04 AM   #4
gpertea
Member
 
Location: Maryland, US

Join Date: Jan 2010
Posts: 21
Default

That is the code for "polymerase run" - it's supposed to signal that the relative positioning of the transcripts to the reference transcript suggests a potential polymerase read-through downstream of the 3' UTRs of the reference. Currently this is crudely implemented to have the 'p' code assigned whenever the cufflinks transcripts start in the range of 2Kb downstream of a reference.
gpertea is offline   Reply With Quote
Old 01-22-2010, 07:05 AM   #5
aherbert@bu.edu
Junior Member
 
Location: Boston

Join Date: Jan 2010
Posts: 2
Default

and what does 'o' stand for ? Thanks
aherbert@bu.edu is offline   Reply With Quote
Old 01-22-2010, 10:45 AM   #6
gpertea
Member
 
Location: Maryland, US

Join Date: Jan 2010
Posts: 21
Default

The 'o' code stands for "other overlap" - that is an exonic overlap with the reference transcript that doesn't fall in any other, "more interesting" overlap categories - e.g. no splice sites match ('j' class), no containment ('c' code) etc. These 'o' codes could be assigned, for example, to assembled single-exon fragments that happen to overlap one of the terminal exons of a reference transcript (but not enough to make it "contained").
gpertea is offline   Reply With Quote
Old 03-09-2010, 07:22 PM   #7
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

May I know where you got this information? I can't seem to find it.

Thanks!
Haneko is offline   Reply With Quote
Old 03-09-2010, 08:13 PM   #8
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by Haneko View Post
May I know where you got this information? I can't seem to find it.

Thanks!
gpertea is Geo Pertea. He wrote cuffcompare, and he and I came up with these classification codes and the rules by which they are assigned to fragments while we were designing the program together.
Cole Trapnell is offline   Reply With Quote
Old 03-09-2010, 09:38 PM   #9
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

Oh ok. Thanks!
Haneko is offline   Reply With Quote
Old 03-10-2010, 05:42 AM   #10
gpertea
Member
 
Location: Maryland, US

Join Date: Jan 2010
Posts: 21
Default

Since this thread was started the Cufflinks manual has been updated so all the codes should be listed there now, and the Cuffcompare code was also updated to fix the little inconsistencies discussed above.
gpertea is offline   Reply With Quote
Old 05-26-2010, 08:16 AM   #11
fennan
Member
 
Location: Germany

Join Date: Apr 2010
Posts: 19
Default

I found the following issue when deailing with cufflinks/cuffcompare:

I have an experiment with 7 lanes. I am studying the effect of combining or not these lanes in the analyisis. I am getting some strange results in the class codes provided by cuffcompare:

1)
What does the "e" class code mean? The manual says: "A single exon transcript overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment." However I checked one of these "e" transcripts and it seems to be overlapping the intron only 4 bp...

2)
* When I use a single lane in the study, these are the class codes (the results are quite similar for every lane):

Class_code Number_reads Counts %
c 80589 35546855.389316 24.105
e 18586 34152995.119766 23.16
i 39500 4951663.761766 3.358
j 27 76931.419867 0.052
o 21448 60651025.9952 41.128
p 6114 1563040.590947 1.06
u 15133 6021136.2572 4.083
= 396 4504079.989821 3.054

* However when I use the 7 lanes:

Class_code Number_reads Counts %
c 64197 88839653.484614 6.38
e 99716 962234656.969969 69.103
i 411761 73484397.951014 5.277
j 9 25164.917023 0.002
o 21828 134378950.510602 9.65
p 21268 8650851.800035 0.621
u 115917 39115195.697265 2.809
= 1690 85729215.60263 6.157

Is this normal? I have tried using two lanes and the percentage of "e" would be around 44%...

Thanks in advance for any help
fennan is offline   Reply With Quote
Old 05-26-2010, 01:33 PM   #12
gpertea
Member
 
Location: Maryland, US

Join Date: Jan 2010
Posts: 21
Default

@fennan:

Re: 1)
You are right, I just looked at the code and it appears that the 10bp check had been inadvertently commented out -- sorry about that, I am going to revert it. There was also an attempt to extend that code definition to include any observed unspliced introns, I have to check with Cole for the status of this modification and I'll update the manual if that's the case.

Re: 2)
I am not sure which of the output files of cuffcompare you used to generate those tables showing class code frequencies and number of "reads" (?). If you can explain how you got those numbers I will look into why they don't add up.
gpertea is offline   Reply With Quote
Old 05-27-2010, 03:54 AM   #13
fennan
Member
 
Location: Germany

Join Date: Apr 2010
Posts: 19
Default

@gpertea

Thank you very much for your quick reply. Please, let me know when you get to fix this little bug.

Quote:
I am not sure which of the output files of cuffcompare you used to generate those tables showing class code frequencies and number of "reads" (?). If you can explain how you got those numbers I will look into why they don't add up.
I used the .tmap file. "Number of reads" is indeed a bad name. It is the number of "transfrags" that I find in the tmap file with a given class code. The counts are computed (in an estimated way) by multiplying the coverage and the length of the transfrag. I was trying to reproduce the information provided in the Table 2 of the Supplementary Material of the cufflinks paper. Am I doing something wrong?

Thanks again for your help
fennan is offline   Reply With Quote
Old 05-28-2010, 01:06 PM   #14
gpertea
Member
 
Location: Maryland, US

Join Date: Jan 2010
Posts: 21
Default

fennan,
I cannot provide a full cufflinks release now but you can get the updated cuffcompare source code with this quick fix, here:
ftp://ftp.cbcb.umd.edu/pub/software/...re.0853.tar.gz
After unpacking simply change to that directory and run 'make' to compile the updated cuffcompare binary for your platform, it should work if you have a recent gcc/g++ installed.

About the second problem with the counts, I have no idea where the problem is as I am still not sure how you actually get the counts -- are you using something like this to count the transfrags for various class codes in a .tmap file ?
Code:
cut -f3 60hr.tmap | sort | uniq -c
Because in this case the counts for a given sample should be consistent between cuffcompare runs if the same reference annotation was used for all runs -- no matter how many other input files (samples) are given to cuffcompare.

But perhaps I misunderstood you -- I assumed that you assembled each lane separately with cufflinks and when you said "I used 7 lanes" I assumed you meant that you ran cuffcompare with all the 7 GTF files as input. And then you summed up the class code counts in all 7 resulting .tmap files.
gpertea is offline   Reply With Quote
Old 06-18-2010, 10:59 AM   #15
zorph
Member
 
Location: FL

Join Date: May 2010
Posts: 40
Default

I recently ran cuffcompare with all 4 GTF files as input and this is some of my outcome (from my tracking file). Each sample has the raw number and % of each category. I am confused about my output:

--my first two samples have no intergenic reads where my other two samples have ample amount (sample 1:2, sample 2: 0 sample 3: 680K, sample 5: 688K)

--when looking at the transfrags that fall into the random category ".", the first two samples have much more in that category than they do in the last two samples.

--when you add the two columns together, they seem to equal to the same amount of reads.

All the samples are the same type of tissue, but different stages of a disease.

Is there something wrong here?

All of the other categories are consistent across the board. Any explanation of why, in two samples ~680k reads were labeled category "u" in two samples and 0 in the other two samples would be greatly appreciated. Thank you.
Attached Images
File Type: jpg table.jpg (19.6 KB, 65 views)

Last edited by zorph; 06-18-2010 at 11:03 AM.
zorph is offline   Reply With Quote
Old 06-19-2010, 07:43 AM   #16
gpertea
Member
 
Location: Maryland, US

Join Date: Jan 2010
Posts: 21
Default

I am afraid I don't understand what you are looking at there. Are 'u' and '.' the only class codes you got in the .tracking file? Because that would rather suggest that the chromosome names do not match at all between the reference annotation and the input sample files (the chromosome IDs in the first column should follow the same naming convention, e.g. it shouldn't be 'chr1' in the GTF with the transfrags and '1' in the reference annotation file).
However in that case I don't get this statement:
Quote:
All of the other categories are consistent across the board.
What other categories? In the attached table picture I only see 'u' and '.' codes listed, and a summary 'Total transfrags' row, which makes me think there are no other categories / class codes. Sorry for being a little confused here about your questions.

Also, I suppose in order to compare the class code distribution in different samples, you were in fact looking at the .tmap files, not the .tracking file as you said, is that right? Because you mentioned you ran cuffcompare with 4 GTF files as input -- was that one cuffcompare run with all 4 files at once, or it was 4 different runs?

Last edited by gpertea; 06-19-2010 at 08:03 AM.
gpertea is offline   Reply With Quote
Old 06-21-2010, 06:21 AM   #17
zorph
Member
 
Location: FL

Join Date: May 2010
Posts: 40
Default

Sorry about the confusion earlier.

I see more class codes than just the "u' and the "." To limit confusion, I had previously just posted a table with the relevant numbers--that clearly backfired.

I have attached my entire table (with the number of transfrags in each category) to this posting. Hopefully this makes things a little clearer.

As far as my comment of "all of the other categories are consistent across the board," as you can see from my attached table, the samples have very similar "numbers/%" transfrags in all other categories besides the "." and the "u" category.

I ran the cuffcompare with 4 GTf files as input at once, in addition to using a reference.

HTML Code:
my run line was the following
cuffcompare -r reference.gtf  ./sample1.gtf ./sample2.gtf (etc.)
I calculated all of these numbers using the tracking file because, if I understood the manual correctly, the tracking file "matches" transcripts up between samples and lists each transcript structure that is present in one or more input GTF files" will be located in this file--thus all transcripts should be present in this file. Or am i interpreting this incorrectly?

I hope this posting makes my earlier message much easier to understand. Thanks

Also, in case this helps to diagnose the problem-- here is the pipeline that I used to treat my samples:
Ga2->tophat->cufflinks (w/ accepted_hits.sam)->cuffcompare (as stated earlier with the gtf file from cufflinks)
Attached Images
File Type: jpg table.jpg (88.1 KB, 71 views)
zorph is offline   Reply With Quote
Old 06-21-2010, 01:42 PM   #18
gpertea
Member
 
Location: Maryland, US

Join Date: Jan 2010
Posts: 21
Default

Quote:
I calculated all of these numbers using the tracking file because, if I understood the manual correctly, the tracking file "matches" transcripts up between samples and lists each transcript structure that is present in one or more input GTF files"
It's correct though I think things get a little fuzzy when single-exon transfrags are considered, because in that case there is no "structure" to look at and transfrags may get merged in a single line in that file if they just overlap each other very well (though not perfectly).

However I think trying to get such per sample stats based on the tracking file is not a good idea due to the ambiguity of the '.' class code, which simply has no meaning when applied to an individual transfrags in a sample. Instead, you should use the .tmap files, which are generated for each of the input files and provide independent transfrag classification for each sample. As you probably saw in the manual, the '.' code is used in the .tracking file whenever transfrags found to be "structurally equivalent" across samples (and thus likely to come from the same transcript) do not have the same classification code when considered individually (i.e. as shown in the .tmap file). That is, say we have a transfrag t1 in sample 1 that has code 'u' when compared to the reference transcripts, and it has an "equivalent structure" (but see above the caveat for single-exon transfrags) with a transfrag t2 found in sample 2. Now say t2 may be classified as 'p' because it extends a bit closer to a known transcript. So, this combo will end up shown as '.' in the tracking file, and it doesn't make sense to classify the transfrags in both samples as having the '.' code in a table like yours. By the looks of it I suppose it could be that sample 1 and 2 had a lot of these "equivalent" transfrags with mixed individual codes (one of them being 'u') that got reported in the tracking file as '.'.

In all fairness this still looks like a strange distribution so I suppose it is also possible that there could be some inconsistency somewhere in the initialization of the classifier codes for the .tracking file, such that some transfrags with the 'u' class code (which is the default code) may end up being reported in some cases as a '.' instead. I'll take a look to check my code to see if/how that could happen. But again, if you really wanted to get a meaningful distribution of transfrag categories in each individual sample I would advise to use the .tmap files instead of the .tracking file, because the '.' category doesn't tell anything about the actual classification of transfrags in a single sample (in your case, it looks like this category actually "stole" almost all the 'u' transfrags in sample 1 and 2).
gpertea is offline   Reply With Quote
Old 06-22-2010, 09:42 AM   #19
zorph
Member
 
Location: FL

Join Date: May 2010
Posts: 40
Default

Thank you so much! That explanation will definitely help me in my analysis

Also, recreating my table using the individual .tmap files allowed me to see that the number of transfrags in each class were consistent across all samples.
zorph is offline   Reply With Quote
Old 06-23-2010, 09:30 AM   #20
elisa*_*
Junior Member
 
Location: CT

Join Date: Aug 2008
Posts: 8
Default

Hi gpertea, I have a question about the format of .tracking file. In the cufflinks manual, it says there are 6 fields for each sample transcript as follows. qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>

However when I run cufflinks0.8.2, I get 8 fields for each transcript, for example: q1:CUFF.54652|CUFF.54652.1
|100|10.306160|3.018604|17.593716|1.929078|141

Could you tell me what are the two extra fields? Thanks a lot!
elisa*_* is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:56 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO