  • how to change XLOC ID to Ensembl ID from Cuffdiff

    For my RNA-Seq analysis I used a GTF file from Ensembl. The cuffdiff output replaced the Ensembl IDs with XLOC IDs. It still reports gene names (e.g. MX2), but the Ensembl IDs are gone.
    Is there any way to convert the XLOC IDs back to Ensembl IDs, or simply to keep the Ensembl IDs from my GTF file? How do you guys go about this?
    Interestingly, if I don't run new gene discovery (i.e. I skip the cuffmerge step), the Ensembl IDs are kept.
    Last edited by super0925; 03-16-2015, 06:41 AM.

  • #2
    Hi super0925,

    I ran into the same problem and realized that the merged.gtf file produced by cuffmerge does not use the Ensembl ID as the transcript ID (shown below). Also, in the downstream analysis with cuffdiff, the Ensembl ID stored in the "oId" attribute is not carried over into the SQLite database.

    1 Cufflinks exon 11869 12227 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    1 Cufflinks exon 12613 12721 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    1 Cufflinks exon 13221 14409 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "3"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    I ended up writing a Python script that substitutes the 'transcript_id' with the 'oId' in order to keep the Ensembl IDs (script below, followed by the resulting GTF lines). I used the new merged2.gtf file for cuffdiff and that solved my problem.

    #!/usr/bin/python

    import re

    gtf_path = "/PATH/TO/merged.gtf"

    with open(gtf_path, "r") as fh, open("merged2.gtf", "w") as f:
        for line in fh:
            line = line.rstrip("\n")  ## strip the trailing newline
            ## use RE to pull the Cufflinks and Ensembl transcript IDs out of the line
            cuffTx = re.findall(r'transcript_id "([\w.]+)"', line)
            ensemblTx = re.findall(r'oId "([\w.]+)"', line)
            ## only rewrite lines that carry both attributes; leave others untouched
            if cuffTx and ensemblTx:
                ## unlist the transcript identifiers and replace the Cufflinks ID with the Ensembl ID
                line = line.replace(cuffTx[0], ensemblTx[0])
            print(line)
            f.write("%s\n" % line)  ## write the line out to the new .gtf file

    1 Cufflinks exon 11869 12227 . + . gene_id "XLOC_000001"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    1 Cufflinks exon 12613 12721 . + . gene_id "XLOC_000001"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    1 Cufflinks exon 13221 14409 . + . gene_id "XLOC_000001"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    Thanks
    Last edited by Seq-Rue; 06-25-2015, 10:54 AM.
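
    If you would rather leave merged.gtf untouched, the same "oId" and "gene_name" attributes can be collected into a lookup table keyed by XLOC ID and joined onto the cuffdiff output afterwards. The following is a minimal sketch under that assumption, not a tested pipeline: the function name `build_xloc_lookup` and the file path are hypothetical, and the attribute names are taken from the GTF lines quoted above. Note that the "oId" values are Ensembl transcript IDs (ENST...); getting Ensembl gene IDs (ENSG...) would need an extra join against the original Ensembl GTF.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def build_xloc_lookup(lines):
    """Map each XLOC gene_id to the oId and gene_name values seen with it.

    `lines` is any iterable of GTF lines (e.g. an open merged.gtf file).
    Hypothetical helper, written for illustration only.
    """
    lookup = defaultdict(lambda: {"oId": set(), "gene_name": set()})
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        attrs = dict(ATTR_RE.findall(line))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Record whichever of the two attributes are present on this line
        for key in ("oId", "gene_name"):
            if key in attrs:
                lookup[gene_id][key].add(attrs[key])
    return dict(lookup)


# Usage (path is a placeholder):
# with open("/PATH/TO/merged.gtf") as fh:
#     lookup = build_xloc_lookup(fh)
# lookup["XLOC_000001"]["oId"] then holds the original transcript IDs for that locus.
```

    Because one XLOC locus can contain several transcripts, the values are kept as sets rather than single IDs.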
