SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
ChIP-Seq: PeakAnalyzer: Genome-wide annotation of chromatin binding and modification Newsbot! Literature Watch 4 01-20-2017 11:50 PM
running cufflinks with a genome annotation as a strict reference or as a guide. maryb Bioinformatics 2 07-19-2012 05:59 AM
Cufflinks, Cuffdiff and annotation chrisbala RNA Sequencing 8 04-05-2011 03:10 PM
tophat/cufflinks for novel genome annotation darked89 Bioinformatics 1 11-18-2010 06:53 AM
Reference genome for MAQ - split reference genome by chromosome or not? inesdesantiago Bioinformatics 4 02-18-2009 08:44 AM

Reply
 
Thread Tools
Old 05-13-2011, 05:23 AM   #1
markr
Junior Member
 
Location: pittsburgh, pa

Join Date: May 2011
Posts: 2
Default Modification of reference genome annotation by cufflinks/cuffdiff?

Hi,

I am totally new to NGS technologies. We are just now starting an RNA-seq project, and I am trying to figure out how to find differentially expressed genes with my Illumina (100bp) reads in Drosophila.

We aligned our reads to the dm3 genome in galaxy, and I imported a gtf file from ucsc. One thing that was really weird is that some genes I knew to be expressed in the tissue we isolated was listed with a value of 0 FPKM in cuffdiff. However, when I looked at the tophat aligned reads in the ucsc browser, there were plenty of reads.

Looking more closely at the data, I noticed that the chromosomal position reported by cuffdiff for the gene I was looking at had changed. It's previous range was 264127-264127 on the chromosome, but cuffdiff had set it to 264730-364670. So the gene is reported by cuffdiff to be a 100kb gene instead of a 2kb gene. The other weird thing is that many genes in that region of the genome had been changed w.r.t. the reference genome gtf file to have identical boundaries.

Is this some kind of artifact that causes genes to become associated with a different gene's annotation?

Thanks for any help you can provide!

-Mark
markr is offline   Reply With Quote
Old 05-15-2011, 07:06 PM   #2
markr
Junior Member
 
Location: pittsburgh, pa

Join Date: May 2011
Posts: 2
Default

Update: I ran cuffdiff on my computer (in verbose mode), and have discovered that it seems to be changing the gene position data before measuring the number of reads. Does anyone have any ideas to help me?

Here is the fragment of the gtf file I used.
chrX dm3_flyBaseGene exon 240182 241740 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
chrX dm3_flyBaseGene CDS 242113 242271 0.000000 + 2 gene_id "CG3777"; transcript_id "CG3777-RC";
chrX dm3_flyBaseGene exon 242113 242271 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
chrX dm3_flyBaseGene CDS 242680 243357 0.000000 + 2 gene_id "CG3777"; transcript_id "CG3777-RC";
chrX dm3_flyBaseGene exon 242680 243357 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
chrX dm3_flyBaseGene CDS 243477 243703 0.000000 + 2 gene_id "CG3777"; transcript_id "CG3777-RC";
chrX dm3_flyBaseGene stop_codon 243704 243706 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
chrX dm3_flyBaseGene exon 243477 244237 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
chrX dm3_flyBaseGene start_codon 250712 250714 0.000000 + . gene_id "CG3757"; transcript_id "CG3757-RA";
chrX dm3_flyBaseGene CDS 250712 250950 0.000000 + 0 gene_id "CG3757"; transcript_id "CG3757-RA";
chrX dm3_flyBaseGene exon 250542 250950 0.000000 + . gene_id "CG3757"; transcript_id "CG3757-RA";
chrX dm3_flyBaseGene CDS 253650 255034 0.000000 + 1 gene_id "CG3757"; transcript_id "CG3757-RA";
chrX dm3_flyBaseGene stop_codon 255035 255037 0.000000 + . gene_id "CG3757"; transcript_id "CG3757-RA";
chrX dm3_flyBaseGene exon 253650 255278 0.000000 + . gene_id "CG3757"; transcript_id "CG3757-RA";
chrX dm3_flyBaseGene start_codon 264127 264129 0.000000 + . gene_id "CG3796"; transcript_id "CG3796-RA";
chrX dm3_flyBaseGene CDS 264127 264729 0.000000 + 0 gene_id "CG3796"; transcript_id "CG3796-RA";
chrX dm3_flyBaseGene stop_codon 264730 264732 0.000000 + . gene_id "CG3796"; transcript_id "CG3796-RA";
chrX dm3_flyBaseGene exon 264064 264980 0.000000 + . gene_id "CG3796"; transcript_id "CG3796-RA";


Here is the -verbose output :
Inspecting bundle chrX:162553-173667 with 484 reads
Inspecting bundle chrX:173735-364670 with 29393 reads
Inspecting bundle chrX:365084-370625 with 418 reads
....

Convergence reached in 24 iterations
Tossing likely garbage isoforms
Filtering isoform 162553-173667
Filtering isoform 162553-173667
Filtering isoform 162553-173667
Filtering isoform 162553-173667
Filtering isoform 162553-173667
Filtering isoform 162553-173667
Revising MLE
Convergence reached in 6 iterations
Importance sampling posterior distribution
Testing for differential expression and regulation in locus [chrX:162553-173667]

Reduced 659 frags to 55 (8.345979 percent)
Calculating intial MLE
Convergence reached in 770 iterations
Tossing likely garbage isoforms
Filtering isoform 210608-228832
Filtering isoform 210608-228832
Filtering isoform 210608-364670
Revising MLE
Convergence reached in 770 iterations
Importance sampling posterior distribution

Reduced 1244 frags to 54 (4.340836 percent)
Calculating intial MLE
Convergence reached in 567 iterations
Tossing likely garbage isoforms
Filtering isoform 210608-228832
Filtering isoform 210608-228832
Filtering isoform 210608-364670
Revising MLE
Convergence reached in 567 iterations
Importance sampling posterior distribution
Testing for differential expression and regulation in locus [chrX:173735-364670]

Reduced 33 frags to 11 (33.333333 percent)
Calculating intial MLE
Tossing likely garbage isoforms
Importance sampling posterior distribution
....
And here is my output from cuffdiff, sorted by gene and position:
CG3114 CG3114 - chrX 162553 173667 q1 q2 OK 4.26393 6.93421 0.486276
CG12470 CG12470 - chrX 173735 364670 q1 q2 OK 0 0 0
CG13374 CG13374 - chrX 173735 364670 q1 q2 OK 0 0 0
CG13375 CG13375 - chrX 173735 364670 q1 q2 OK 0 0 0
CG17885 CG17885 - chrX 173735 364670 q1 q2 OK 0 0 0
CG3258 CG3258 - chrX 173735 364670 q1 q2 OK 0.078959 3.53832 3.80248
CG32816 CG32816 - chrX 173735 364670 q1 q2 OK 0.0761256 0.950462 2.52456
CG3757 CG3757 - chrX 173735 364670 q1 q2 OK 0 0.0544975 1.79769e+308
CG3777 CG3777 - chrX 173735 364670 q1 q2 OK 135.839 262.513 0.658829
CG3796 CG3796 - chrX 173735 364670 q1 q2 OK 0.287887 6.5634 3.1267
CG3827 CG3827 - chrX 173735 364670 q1 q2 OK 0.945195 19.5633 3.03002
CG3839 CG3839 - chrX 173735 364670 q1 q2 OK 0 0 0
CG3972 CG3972 - chrX 173735 364670 q1 q2 OK 0.0259724 0 -1.79769e+308
CG3923 CG3923 - chrX 365084 370625 q1 q2 OK 3.33054 9.50618 1.04881


Any help would be much appreciated!

Mark
markr is offline   Reply With Quote
Old 07-19-2011, 11:47 AM   #3
nsl
Member
 
Location: CA

Join Date: Jan 2011
Posts: 28
Default

Hi Mark,

I find myself in your shoes, did you figure out why you have FPKM of "0" for genes you know to be expressed? I also, find that if I use a reference annotation of dmel to run cufflinks I get "0" for FPKM, but if I run it without a reference file I get values. Any insight if you have, would be much appreciated.

nsl
nsl is offline   Reply With Quote
Old 07-20-2011, 01:20 AM   #4
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

This might be a trivial advice but worth checking... Are the reference sequence file and GTF file using the same chromosome names? (E.g. ENSEMBL calls chromosomes as 1, 2, 3 etc while UCSC calls them chr1, chr2 ...)

Dario
dariober is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:42 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO