Seqanswers Leaderboard Ad

**Cole Trapnell** · 06-07-2010, 09:14 AM

Originally posted by McBryan View Post

Locus Annotation

Looking at 0_1_gene_expr.diff, it seems that quite a few loci aren't annotated correctly even though they are sitting right on top of a gene and several are wrongly annotated.

One example is a locus found at chr16:2017016-2029027. This is annotated as NTHL1 however it should be SLC9A3R2 which is at that position.

NTHL1 is however just next door at chr16:2,029,817-2,037,868 so I thought perhaps SLC9A3R2 is just missing from the annotation and it just picked the closest one, however this is not the case as SLC9A3R2 is there and in the correct place.

Would you please email me the files you fed to Cuffcompare? We'll need to look at this one ourselves.

There are also a large number of loci (~2700) that don't have any annotated transcript from the reference at all despite many of them being right on top of a single gene.

Can you explain this in a bit more detail? Perhaps with an excerpt from the tracking file?

Weird significance

Some loci seem to be classed as significantly different despite having a very small number of reads in them.

This is a known bug that has been fixed in our development builds, and will be resolved in 0.8.3. We introduced the bug in 0.8.2, IIRC.

Weird expression values

I've got some really really really high fold change differences between genes.

Take for example

A ln(fold change) of 20.4781 ! Surely that'd be a regular fold change of about 700 million? Am I misunderstanding that?

Obviously coverage isn't directly equatable with FPKM, but still, I'm hard pressed to believe there is such a huge fold change on that gene as cufflinks reports.

Why do I get huge fold changes for loci? Presumably it occurs because one of the samples is being assigned a FPKM of 7.55E-009 when in reality it should be much higher than this, why would this be the case?

What GTF file did you feed to cuffdiff? Is this the combined GTF from cuffcompare, or is it the reference? Can you please email me both the GTF and the SAM alignments from this locus in both samples? We'll take a closer look.

**McBryan** · 06-08-2010, 03:15 AM

Originally posted by Cole Trapnell View Post

Would you please email me the files you fed to Cuffcompare? We'll need to look at this one ourselves.

cuffcompare -r Homo_sapiens.NCBI36.51.fixed.gtf -R -o summarystats.txt s_6_sequence/transcripts.gtf s_7_sequence/transcripts.gtf

with these files

Originally posted by Cole Trapnell View Post

Can you explain this in a bit more detail? Perhaps with an excerpt from the tracking file?

One of the un-annotated loci is:

test_id gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
XLOC_060829 - chrX:135407087-135422161 OK 1.37306 0.639791 -0.763656 -128.114 0 yes

which looks like this:

From the tracking file for that XLOC:

TCONS_00113902 XLOC_060829 - u q1:CUFF.228858|CUFF.228858.1|100|8.199148|5.205243|11.193054|11.432044|- -

From the summarystats.loci file for that XLOC:

XLOC_060829 chrX[-]135409980-135410636 - CUFF.228858.1

It seems to have been assigned a class code of 'u' (intergenic transcript) while it heavily overlaps with a reference transcript. It's also odd it only seems to have a CUFF id from one sample but not the other.

I would have imagined both samples would have assembled a transfrag for this region.

Looking up the s_6_sequence/transcripts.refmap and s_7_sequence/transcripts.refmap files and searching for ENST00000218364 in either file reveals no hits at all. Nor does searching for HTATSF1. Which is a bit odd.

It's in the reference annotation:

$> cat Homo_sapiens.NCBI36.51.fixed.gtf | grep ENST00000218364
chrX protein_coding exon 135407337 135407695 . + . gene_id "ENSG00000102241"; transcript_id "ENST00000218364"; exon_number "1"; gene_name "HTATSF1"; transcript_name "HTATSF1-001";
chrX protein_coding CDS 135407510 135407695 . + 0 gene_id "ENSG00000102241"; transcript_id "ENST00000218364"; exon_number "1"; gene_name "HTATSF1"; transcript_name "HTATSF1-001"; protein_id "ENSP00000218364";
chrX protein_coding start_codon 135407510 135407512 . + 0 gene_id "ENSG00000102241"; transcript_id "ENST00000218364"; exon_number "1"; gene_name "HTATSF1"; transcript_name "HTATSF1-001";
chrX protein_coding exon 135409423 135409603 . + . gene_id "ENSG00000102241"; transcript_id "ENST00000218364"; exon_number "2"; gene_name "HTATSF1"; transcript_name "HTATSF1-001";
... etc

So, I've checked another one:

test_id gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
XLOC_000027 - chr18:9536822-9604544 OK 0.862998 0.0533053 -2.78438 -707.483 0 yes

Again that seems to be a similar story in the tracking file, class 'u' and only cuff id's from one sample.

TCONS_00000074 XLOC_000027 - u q1:CUFF.97282|CUFF.97282.1|100|0.852185|0.847895|0.856475|1.184949|- -

summarystats.loci:

XLOC_000027 chr18[+]9574714-9578237 - CUFF.97282.1 -

This one does however have listings in the transcripts.refmap files:

s_6_sequence:

PPP4R1 ENST00000400555 = CUFF.97280|CUFF.97280.1

s_7_sequence:

PPP4R1 ENST00000400555 = CUFF.98756|CUFF.98756.2

However, those aren't the CUFF id's we had for the XLOC we are looking at. Sooo, I went back to the gene_exp.diff file and searched for PPP4R1 there and, lo and behold found it.

test_id gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
XLOC_000165 PPP4R1 chr18:9536822-9604544 OK 19.4349 21.174 0.0857056 1.16386 0.24448 no

There are some in the ~2700 that are genuinely un-annotatable as the ensembl gtf either doesn't have a gene there (although occasionally ucsc does) or where there are a few overlapping genes in that area which might make annotation difficult. The ones I've listed were obviously cherry picked as ones where there was only a single gene in the locus, with good coverage and where the reads align directly with the exons in the area.

So, it just occurred to me to sort the gene_exp.diff by chromosome and location and I found this:

XLOC_042941 ,SLC9A3R2 chr16:2017016-2029027 OK 53.9192 65.1133 0.188642 2.1697 0.0300296 no
XLOC_043323 NTHL1 chr16:2017016-2029027 OK 0.000487772 1.57884 8.08235 580.701 0 yes

That's the badly annotated gene from earlier. Next to a correctly (mostly if we ignore that spurious comma) annotated locus as well.

There is also a:

XLOC_043324 NTHL1 chr16:2029838-2037837 NOTEST 8.56769 7.57285 -0.123429 -0.161635 0.871593 no

Which is in the right place.

So, I looked for CCDC36 - the gene with the huge expression from earlier.

It has these:

XLOC_048941 CCDC36 chr3:49108372-49117225 OK 2.8332 3.19538 0.120299 18.6285 0 yes
XLOC_048942 CCDC36 chr3:49108372-49117225 OK 7.51669 6.85147 -0.0926624 -8.4574 0 yes
XLOC_048943 CCDC36 chr3:49108372-49117225 OK 2.33E-008 5.91318 19.3523 2546.34 0 yes
XLOC_049430 QARS chr3:49108372-49117225 OK 34.9382 39.5405 0.123745 1.91556 0.0554212 no
XLOC_048944 CCDC36 chr3:49120517-49129937 NOTEST 1.15265 0.749238 -0.430759 -18.8474 0 no
XLOC_048945 CCDC36 chr3:49120517-49129937 NOTEST 0.900771 1.22835 0.310178 18.803 0 no
XLOC_049431 USP19 chr3:49120517-49129937 NOTEST 16.4928 18.7536 0.128463 0.252713 0.80049 no

Aha! Interesting. It was my understanding that I should have a single row per gene with the sum of the different transcripts. Lots of locus which are exactly the same but assigned to different XLOC. Since one of them seems to have been assigned an almost zero value in one sample but not in the other this causes a ridiculous fold change. But if we were to naively add up the FPKM's from each row then it's a much more sensible 1.22 fold change difference for the region chr3:49108372-49117225 and 1.12 fold change for the region chr3:49120517-49129937.

So, my current theory would be that the reason I'm getting weird expression values is *also* because of the annotation and that when one of the XLOCs ends up with a very low value in one sample but anything that is not a very low value in the other sample we get a huge fold change.

Originally posted by Cole Trapnell View Post

This is a known bug that has been fixed in our development builds, and will be resolved in 0.8.3. We introduced the bug in 0.8.2, IIRC.

Excellent, one less thing to worry about.

Originally posted by Cole Trapnell View Post

What GTF file did you feed to cuffdiff? Is this the combined GTF from cuffcompare, or is it the reference? Can you please email me both the GTF and the SAM alignments from this locus in both samples? We'll take a closer look.

My cuffdiff command was:

cuffdiff -p 14 summarystats.combined.gtf s_6_sequence/accepted_hits.sam s_7_sequence/accepted_hits.sam

which accepts the combined.gtf file from cuffcompare along with the accepted_hits.sam files from s_6_sequence and s_7_sequence.

I have created accepted_hits.sam files for the region chr3:49108296-49130013 (which represents the two regions above CCDC36 was in plus some padding, i.e. chr3:49108372-49129937 + 76bp either side).

AWK used to do that:

cat accepted_hits.sam | awk '$3=="chr3" && $4 >= 49108296 && $4 <= 49130013' > accepted_hits_ccdc36.sam

Files hosted here

**thinkRNA** · 06-08-2010, 09:31 AM

I am wondering why you didn't run tophat with the corresponding GFF file. Have you tried this to see if reads then get assigned to the correctly annotated transcript.

**McBryan** · 06-08-2010, 10:13 AM

Originally posted by thinkRNA View Post

I am wondering why you didn't run tophat with the corresponding GFF file. Have you tried this to see if reads then get assigned to the correctly annotated transcript.

Funny you mention that. That was actually my initial approach where I first noticed the discrepancies I mentioned above.

I had first tried using the script from http://seqanswers.com/forums/showthr...highlight=GFF3 and running:

perl gtf_to_gff.pl Homo_sapiens.NCBI36.51.fixed.gtf > Homo_sapiens.NCBI36.51.gff

followed by:

tophat --solexa1.3-quals -p 14 -G Homo_sapiens.NCBI36.51.gff -o s_6_sequence hg18 s_6_sequence.txt
tophat --solexa1.3-quals -p 14 -G Homo_sapiens.NCBI36.51.gff -o s_7_sequence hg18 s_7_sequence.txt

I later reran it with the details I posted earlier to try and remove as many external factors as I could before posting - so in summary it doesn't stop it happening as that was what I was doing when I first noticed it.

My understanding of the process, however, was that the GFF files merely provided additional reference junctions which bowtie could map the reads against rather than relying only on the ones that tophat detects itself. Providing the reads were ending up in the right place (which I believe they are upon inspection of the coverage.wig files) I would have imagined that this would not have significantly affected cufflink's ability to make transfrags and annotate them.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Cufflinks and gene expression

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News