I want to estimate the expression level of the transcripts in each gene.
Rather than using the isoforms generated by the TopHat-Cufflinks pipeline, I want to use the known annotations.
When I ran cuffdiff, I provided the mapping results in SAM format and the Ensembl annotation as a GTF file.
When I checked the cuffdiff results, some of the gene boundaries used for the test looked wrong.
For example, the structure of gene ENSMUSG00000029019 is stored in the GTF file as shown below.
Code:
chr4 mm9_ensGene start_codon 147360085 147360087 0.000000 + . gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
chr4 mm9_ensGene CDS 147360085 147360210 0.000000 + 0 gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
chr4 mm9_ensGene exon 147360009 147360210 0.000000 + . gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
chr4 mm9_ensGene CDS 147360405 147360627 0.000000 + 0 gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
chr4 mm9_ensGene exon 147360405 147360627 0.000000 + . gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
chr4 mm9_ensGene CDS 147361071 147361087 0.000000 + 2 gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
chr4 mm9_ensGene exon 147361071 147361306 0.000000 + . gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
Therefore, the gene spans 147360009 to 147361306 (1-based positions).
But in the cuffdiff output file genes.fpkm_tracking, the locus for the gene is much larger than the annotated one: 147326657 to 147416061.
Code:
tracking_id class_code nearest_ref_id gene_short_name tss_id locus q0_FPKM q0_conf_lo q0_conf_hi q1_FPKM q1_conf_lo q1_conf_hi
ENSMUSG00000029019 - - - - chr4:147326657-147416061 72.4927 68.7873 76.1981 47.0939 43.9784 50.2093
Does this mean that cuffdiff sets a new gene locus (boundaries) based on the supplied short-read data, and that the provided gene annotation (e.g. the Ensembl GTF file) is used only as guidance?
If so, is there any way to estimate expression using the exact gene structures provided by the user rather than the loci defined by Cufflinks?
Thanks in advance for any comments.
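To make the mismatch concrete, here is a minimal Python sketch of the comparison. It is not part of the cuffdiff run. It takes the three exon records from the GTF excerpt above and the locus string from genes.fpkm_tracking, and compares the two spans. The helper names (annotated_span, parse_locus) are made up for illustration. The sketch assumes both files use the same coordinate convention and ignores any possible off-by-one difference between them.

```python
import re

# Exon records for ENSMUSG00000029019, copied from the GTF excerpt in this post.
# A real GTF is tab-separated; splitting on any whitespace is a simplification.
GTF_EXONS = """\
chr4 mm9_ensGene exon 147360009 147360210 0.000000 + . gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
chr4 mm9_ensGene exon 147360405 147360627 0.000000 + . gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
chr4 mm9_ensGene exon 147361071 147361306 0.000000 + . gene_id "ENSMUSG00000029019"; transcript_id "ENSMUST00000103231";
"""

# Locus column reported for the same gene in genes.fpkm_tracking.
CUFFDIFF_LOCUS = "chr4:147326657-147416061"


def annotated_span(gtf_text, gene_id):
    """Return (chrom, start, end) covering every exon record of gene_id."""
    chrom, starts, ends = None, [], []
    for line in gtf_text.splitlines():
        fields = line.split(None, 8)  # 8 fixed columns + attribute column
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match and match.group(1) == gene_id:
            chrom = fields[0]
            starts.append(int(fields[3]))
            ends.append(int(fields[4]))
    if not starts:
        raise ValueError("no exon records found for " + gene_id)
    return chrom, min(starts), max(ends)


def parse_locus(locus):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, coords = locus.rsplit(":", 1)
    start, end = coords.split("-")
    return chrom, int(start), int(end)


ann = annotated_span(GTF_EXONS, "ENSMUSG00000029019")
rep = parse_locus(CUFFDIFF_LOCUS)

# Lengths computed as inclusive intervals (end - start + 1).
ann_len = ann[2] - ann[1] + 1
rep_len = rep[2] - rep[1] + 1

print("annotated span:", ann, "length", ann_len)
print("cuffdiff locus:", rep, "length", rep_len)
print("extra bases upstream:", ann[1] - rep[1])
print("extra bases downstream:", rep[2] - ann[2])
```

Run over a whole annotation and tracking file, the same comparison would list every gene whose reported locus extends beyond its annotated exons, which might help show whether this gene is an isolated case.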