Cuffmerge loses gene_id annotation

xavier.robin

Junior Member

Join Date: Aug 2015
Posts: 1

Cuffmerge loses gene_id annotation

08-21-2015, 01:47 AM

Dear forum readers,

I'm following the tophat2/cufflinks/cuffmerge/cuffdiff/cummeRbund workflow for my human RNA-seq data analysis. I'm having an issue with annotations that are missing in cummeRbund. Specifically, I'd like to have Ensemble gene IDs instead of cufflinks' XLOC_* IDs.

I aligned my data with tophat2 against GRCh38:

Code:

tophat2 -p 16 -o data1_1_tophat_out Homo_sapiens.GRCh38.dna.primary_assembly data/data1_1_ATGTCA_L002_R1.fastq.gz data/data1_1_ATGTCA_L002_R2.fastq.gz
tophat2 -p 16 -o data1_2_tophat_out Homo_sapiens.GRCh38.dna.primary_assembly data/data1_2_ATGTCA_L002_R1.fastq.gz data/data1_2_ATGTCA_L002_R2.fastq.gz

Then I run cufflinks with a Ensembl 81 as GTF-guide reference:

Code:

cufflinks -g  Homo_sapiens.GRCh38.81.gtf -p 16 -o data1_1_cl_out data1_1_tophat_out/accepted_hits.bam
cufflinks -g  Homo_sapiens.GRCh38.81.gtf -p 16 -o data1_2_cl_out data1_2_tophat_out/accepted_hits.bam

I merge the cufflinks assemblies:

Code:

cat data1_assemblies.txt
data1_1_cl_out/transcripts.gtf
data1_1_cl_out/transcripts.gtf

cuffmerge -o data1_cuffmerge -g Homo_sapiens.GRCh38.81.gtf -s Homo_sapiens.GRCh38.dna.primary_assembly.fa -p 16 data1_assemblies.txt

I run cuffdiff on the merged data:

Code:

cuffdiff -o data1_cuffdiff -b Homo_sapiens.GRCh38.dna.primary_assembly.fa -p 16 -L rep1,rep2 -u data1_cuffmerge/merged.gtf data1_1_tophat_out/accepted_hits.bam data1_2_tophat_out/accepted_hits.bam

And finally cummeRbund, specifying the reference GTF again:

Code:

R
> library(cummeRbund)
> data1 <- readCufflinks("data1_cuffdiff/", gtfFile="Homo_sapiens.GRCh38.81.gtf", genome = "GRCh38")
> gene.features<-annotation(genes(data1))
> head(gene.features)
      gene_id class_code nearest_ref_id        gene_short_name          locus length coverage seqnames start end width strand source type score phase gene_version gene_name gene_source gene_biotype
1 XLOC_000001       <NA>           <NA>                DDX11L1  1:11868-31109     NA       NA     <NA>    NA  NA    NA   <NA>   <NA> <NA>    NA    NA           NA      <NA>        <NA>         <NA>
2 XLOC_000002       <NA>           <NA> MIR1302-2,RP11-34P13.3  1:11868-31109     NA       NA     <NA>    NA  NA    NA   <NA>   <NA> <NA>    NA    NA           NA      <NA>        <NA>         <NA>
3 XLOC_000003       <NA>           <NA>                 OR4G4P  1:52472-53312     NA       NA     <NA>    NA  NA    NA   <NA>   <NA> <NA>    NA    NA           NA      <NA>        <NA>         <NA>
4 XLOC_000004       <NA>           <NA>                OR4G11P  1:62947-63887     NA       NA     <NA>    NA  NA    NA   <NA>   <NA> <NA>    NA    NA           NA      <NA>        <NA>         <NA>
5 XLOC_000005       <NA>           <NA>                  OR4F5  1:69090-70008     NA       NA     <NA>    NA  NA    NA   <NA>   <NA> <NA>    NA    NA           NA      <NA>        <NA>         <NA>
6 XLOC_000006       <NA>           <NA>                 CICP27 1:89294-134836     NA       NA     <NA>    NA  NA    NA   <NA>   <NA> <NA>    NA    NA           NA      <NA>        <NA>         <NA>
  havana_gene havana_gene_version isoform_id transcript_version transcript_name transcript_source transcript_biotype havana_transcript havana_transcript_version  tag transcript_support_level exon_number
1        <NA>                  NA       <NA>                 NA            <NA>              <NA>               <NA>              <NA>                        NA <NA>                     <NA>          NA
2        <NA>                  NA       <NA>                 NA            <NA>              <NA>               <NA>              <NA>                        NA <NA>                     <NA>          NA
3        <NA>                  NA       <NA>                 NA            <NA>              <NA>               <NA>              <NA>                        NA <NA>                     <NA>          NA
4        <NA>                  NA       <NA>                 NA            <NA>              <NA>               <NA>              <NA>                        NA <NA>                     <NA>          NA
5        <NA>                  NA       <NA>                 NA            <NA>              <NA>               <NA>              <NA>                        NA <NA>                     <NA>          NA
6        <NA>                  NA       <NA>                 NA            <NA>              <NA>               <NA>              <NA>                        NA <NA>                     <NA>          NA
  exon_id exon_version ccds_id protein_id protein_version
1    <NA>           NA    <NA>       <NA>              NA
2    <NA>           NA    <NA>       <NA>              NA
3    <NA>           NA    <NA>       <NA>              NA
4    <NA>           NA    <NA>       <NA>              NA
5    <NA>           NA    <NA>       <NA>              NA
6    <NA>           NA    <NA>       <NA>              NA

As you can see I have no annotations, except a gene_short_name. I'd like to have the ensembl_gene_id, ideally in the gene_id column.

I suspect the problem arises at the cuffmerge step, where all annotations are dropped. I can see the gene_id in both transcripts.gtf files:

Code:

xavier@server:$ grep ENST00000370456 data1_1_cl_out/transcripts.gtf
1	Cufflinks	transcript	89364058	89386461	1000	+	.	gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; FPKM "0.0154691357"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.046407"; cov "0.052668"; full_read_support "no";
1	Cufflinks	exon	89364058	89364127	1000	+	.	gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; exon_number "1"; FPKM "0.0154691357"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.046407"; cov "0.052668";
1	Cufflinks	exon	89368529	89368741	1000	+	.	gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; exon_number "2"; FPKM "0.0154691357"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.046407"; cov "0.052668";
[...]
xavier@server:$ grep ENST00000370456 data1_2_cl_out/transcripts.gtf
1	Cufflinks	transcript	89364058	89386461	1	+	.	gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_read_support "no";
1	Cufflinks	exon	89364058	89364127	1	+	.	gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	89368529	89368741	1	+	.	gene_id "ENSG00000183347"; transcript_id "ENST00000370456"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
[...]

but they are replaced with dummy XLOC ids in the merged file, despite the -g Homo_sapiens.GRCh38.81.gtf I supplied to cuffmerge:

Code:

xavier@service0:~/projects4/PEOPLE/xavier/RNAseq_HCC827$ grep ENST00000370456 Cet1_cuffmerge/merged.gtf 
1	Cufflinks	exon	89364058	89364127	.	+	.	gene_id "XLOC_001053"; transcript_id "TCONS_00005283"; exon_number "1"; gene_name "GBP6"; oId "ENST00000370456"; nearest_ref "ENST00000370456"; class_code "="; tss_id "TSS2586"; p_id "P1592";
1	Cufflinks	exon	89368529	89368741	.	+	.	gene_id "XLOC_001053"; transcript_id "TCONS_00005283"; exon_number "2"; gene_name "GBP6"; oId "ENST00000370456"; nearest_ref "ENST00000370456"; class_code "="; tss_id "TSS2586"; p_id "P1592";
[...]

Also the transcript seems to be gone. And somehow cuffdiff and cummeRbund are not able to recover the annotations later on...

How do you people normally get annotations through this workflow? Am I doing something wrong? I've read lots of threads about this and nobody seems to have a solution...

Tags: annotations, cuffmerge, cummerbund, ensembl_gene_id

Previous template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
Yesterday, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, Today, 06:57 AM	0 responses 9 views 0 likes	Last Post by seqadmin Today, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, Yesterday, 07:17 AM	0 responses 14 views 0 likes	Last Post by seqadmin Yesterday, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 23 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM

Seqanswers Leaderboard Ad

Announcement

Cuffmerge loses gene_id annotation

Latest Articles

ad_right_rmr

News