Dear forum readers,
I'm following the tophat2/cufflinks/cuffmerge/cuffdiff/cummeRbund workflow for my human RNA-seq data analysis. I'm having an issue with annotations that are missing in cummeRbund. Specifically, I'd like to have Ensemble gene IDs instead of cufflinks' XLOC_* IDs.
I aligned my data with tophat2 against GRCh38:
Then I run cufflinks with a Ensembl 81 as GTF-guide reference:
I merge the cufflinks assemblies:
I run cuffdiff on the merged data:
And finally cummeRbund, specifying the reference GTF again:
As you can see I have no annotations, except a gene_short_name. I'd like to have the ensembl_gene_id, ideally in the gene_id column.
I suspect the problem arises at the cuffmerge step, where all annotations are dropped. I can see the gene_id in both transcripts.gtf files:
but they are replaced with dummy XLOC ids in the merged file, despite the -g Homo_sapiens.GRCh38.81.gtf I supplied to cuffmerge:
Also the transcript seems to be gone. And somehow cuffdiff and cummeRbund are not able to recover the annotations later on...
How do you people normally get annotations through this workflow? Am I doing something wrong? I've read lots of threads about this and nobody seems to have a solution...
I'm following the tophat2/cufflinks/cuffmerge/cuffdiff/cummeRbund workflow for my human RNA-seq data analysis. I'm having an issue with annotations that are missing in cummeRbund. Specifically, I'd like to have Ensemble gene IDs instead of cufflinks' XLOC_* IDs.
I aligned my data with tophat2 against GRCh38:
Code:
tophat2 -p 16 -o data1_1_tophat_out Homo_sapiens.GRCh38.dna.primary_assembly data/data1_1_ATGTCA_L002_R1.fastq.gz data/data1_1_ATGTCA_L002_R2.fastq.gz tophat2 -p 16 -o data1_2_tophat_out Homo_sapiens.GRCh38.dna.primary_assembly data/data1_2_ATGTCA_L002_R1.fastq.gz data/data1_2_ATGTCA_L002_R2.fastq.gz
Code:
cufflinks -g Homo_sapiens.GRCh38.81.gtf -p 16 -o data1_1_cl_out data1_1_tophat_out/accepted_hits.bam cufflinks -g Homo_sapiens.GRCh38.81.gtf -p 16 -o data1_2_cl_out data1_2_tophat_out/accepted_hits.bam
Code:
cat data1_assemblies.txt data1_1_cl_out/transcripts.gtf data1_1_cl_out/transcripts.gtf cuffmerge -o data1_cuffmerge -g Homo_sapiens.GRCh38.81.gtf -s Homo_sapiens.GRCh38.dna.primary_assembly.fa -p 16 data1_assemblies.txt
Code:
cuffdiff -o data1_cuffdiff -b Homo_sapiens.GRCh38.dna.primary_assembly.fa -p 16 -L rep1,rep2 -u data1_cuffmerge/merged.gtf data1_1_tophat_out/accepted_hits.bam data1_2_tophat_out/accepted_hits.bam
Code:
R > library(cummeRbund) > data1 <- readCufflinks("data1_cuffdiff/", gtfFile="Homo_sapiens.GRCh38.81.gtf", genome = "GRCh38") > gene.features<-annotation(genes(data1)) > head(gene.features) gene_id class_code nearest_ref_id gene_short_name locus length coverage seqnames start end width strand source type score phase gene_version gene_name gene_source gene_biotype 1 XLOC_000001 <NA> <NA> DDX11L1 1:11868-31109 NA NA <NA> NA NA NA <NA> <NA> <NA> NA NA NA <NA> <NA> <NA> 2 XLOC_000002 <NA> <NA> MIR1302-2,RP11-34P13.3 1:11868-31109 NA NA <NA> NA NA NA <NA> <NA> <NA> NA NA NA <NA> <NA> <NA> 3 XLOC_000003 <NA> <NA> OR4G4P 1:52472-53312 NA NA <NA> NA NA NA <NA> <NA> <NA> NA NA NA <NA> <NA> <NA> 4 XLOC_000004 <NA> <NA> OR4G11P 1:62947-63887 NA NA <NA> NA NA NA <NA> <NA> <NA> NA NA NA <NA> <NA> <NA> 5 XLOC_000005 <NA> <NA> OR4F5 1:69090-70008 NA NA <NA> NA NA NA <NA> <NA> <NA> NA NA NA <NA> <NA> <NA> 6 XLOC_000006 <NA> <NA> CICP27 1:89294-134836 NA NA <NA> NA NA NA <NA> <NA> <NA> NA NA NA <NA> <NA> <NA> havana_gene havana_gene_version isoform_id transcript_version transcript_name transcript_source transcript_biotype havana_transcript havana_transcript_version tag transcript_support_level exon_number 1 <NA> NA <NA> NA <NA> <NA> <NA> <NA> NA <NA> <NA> NA 2 <NA> NA <NA> NA <NA> <NA> <NA> <NA> NA <NA> <NA> NA 3 <NA> NA <NA> NA <NA> <NA> <NA> <NA> NA <NA> <NA> NA 4 <NA> NA <NA> NA <NA> <NA> <NA> <NA> NA <NA> <NA> NA 5 <NA> NA <NA> NA <NA> <NA> <NA> <NA> NA <NA> <NA> NA 6 <NA> NA <NA> NA <NA> <NA> <NA> <NA> NA <NA> <NA> NA exon_id exon_version ccds_id protein_id protein_version 1 <NA> NA <NA> <NA> NA 2 <NA> NA <NA> <NA> NA 3 <NA> NA <NA> <NA> NA 4 <NA> NA <NA> <NA> NA 5 <NA> NA <NA> <NA> NA 6 <NA> NA <NA> <NA> NA
I suspect the problem arises at the cuffmerge step, where all annotations are dropped. I can see the gene_id in both transcripts.gtf files:
Code:
xavier@server:$ grep ENST00000370456 data1_1_cl_out/transcripts.gtf 1 Cufflinks transcript 89364058 89386461 1000 + . gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; FPKM "0.0154691357"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.046407"; cov "0.052668"; full_read_support "no"; 1 Cufflinks exon 89364058 89364127 1000 + . gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; exon_number "1"; FPKM "0.0154691357"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.046407"; cov "0.052668"; 1 Cufflinks exon 89368529 89368741 1000 + . gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; exon_number "2"; FPKM "0.0154691357"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.046407"; cov "0.052668"; [...] xavier@server:$ grep ENST00000370456 data1_2_cl_out/transcripts.gtf 1 Cufflinks transcript 89364058 89386461 1 + . gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_read_support "no"; 1 Cufflinks exon 89364058 89364127 1 + . gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; 1 Cufflinks exon 89368529 89368741 1 + . gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; [...]
Code:
xavier@service0:~/projects4/PEOPLE/xavier/RNAseq_HCC827$ grep ENST00000370456 Cet1_cuffmerge/merged.gtf 1 Cufflinks exon 89364058 89364127 . + . gene_id "XLOC_001053"; transcript_id "TCONS_00005283"; exon_number "1"; gene_name "GBP6"; oId "ENST00000370456"; nearest_ref "ENST00000370456"; class_code "="; tss_id "TSS2586"; p_id "P1592"; 1 Cufflinks exon 89368529 89368741 . + . gene_id "XLOC_001053"; transcript_id "TCONS_00005283"; exon_number "2"; gene_name "GBP6"; oId "ENST00000370456"; nearest_ref "ENST00000370456"; class_code "="; tss_id "TSS2586"; p_id "P1592"; [...]
How do you people normally get annotations through this workflow? Am I doing something wrong? I've read lots of threads about this and nobody seems to have a solution...