  • how to change XLOC ID to Ensembl ID from Cuffdiff

    When I did my RNA-Seq analysis, the GTF file I used was from Ensembl. The cuffdiff output replaced the Ensembl IDs with XLOC IDs, although it still reported gene names (e.g. MX2). The Ensembl IDs were no longer there.
    Is there any way to convert the XLOC IDs back to Ensembl IDs, or simply to keep the Ensembl IDs from my GTF file? How do you guys go about this?
    Interestingly enough, if I don't run new gene discovery (i.e. if I skip the cuffmerge step), the Ensembl IDs are kept.
    Last edited by super0925; 03-16-2015, 06:41 AM.

  • #2
    Hi super0925,

    I ran into the same problem and realized that the merged.gtf file produced by cuffmerge does not use the Ensembl ID as the transcript ID (shown below). Also, in the downstream analysis with cuffdiff, the Ensembl ID stored in "oId" is not carried over into the SQLite database.

    1 Cufflinks exon 11869 12227 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    1 Cufflinks exon 12613 12721 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    1 Cufflinks exon 13221 14409 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "3"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    I ended up writing a Python script that replaces the 'transcript_id' value with the 'oId' value in order to keep the Ensembl IDs (below). I then used the rewritten GTF file (merged2.gtf) for cuffdiff, and that solved my problem.

    #!/usr/bin/env python3
    # Replace the Cufflinks transcript_id (TCONS_...) with the original ID stored in oId.
    import re

    gtf_path = "/PATH/TO/merged.gtf"

    # Patterns for the two attributes of interest in column 9 of the GTF
    tx_re = re.compile(r'transcript_id "([^"]+)"')
    oid_re = re.compile(r'oId "([^"]+)"')

    with open(gtf_path) as fh, open("merged2.gtf", "w") as out:
        for line in fh:
            line = line.rstrip("\n")  # drop the trailing newline
            cuff_tx = tx_re.search(line)
            ensembl_tx = oid_re.search(line)
            # Rewrite only lines that carry both attributes; pass the rest through unchanged
            if cuff_tx and ensembl_tx:
                line = line.replace(
                    'transcript_id "%s"' % cuff_tx.group(1),
                    'transcript_id "%s"' % ensembl_tx.group(1),
                )
            out.write(line + "\n")  # write the record to the new GTF file
    The same three records in merged2.gtf after running the script:

    1 Cufflinks exon 11869 12227 . + . gene_id "XLOC_000001"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    1 Cufflinks exon 12613 12721 . + . gene_id "XLOC_000001"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
    1 Cufflinks exon 13221 14409 . + . gene_id "XLOC_000001"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; oId "ENST00000456328"; nearest_ref "ENST00000456328"; class_code "="; tss_id "TSS1";
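
If you would rather leave merged.gtf untouched, the oId attribute can also be used to build a lookup table from XLOC gene IDs to the original transcript IDs. This could then be joined to the cuffdiff output tables. The following is only a minimal sketch: the function name `build_xloc_map` and the sample records are made up for illustration, and it assumes every record of interest carries both `gene_id` and `oId` in the format shown in this thread.

```python
import re
from collections import defaultdict

# Attribute patterns for column 9 of a cuffmerge GTF record
GENE_RE = re.compile(r'gene_id "([^"]+)"')
OID_RE = re.compile(r'oId "([^"]+)"')


def build_xloc_map(lines):
    """Map each gene_id (XLOC_...) to the sorted list of oId values seen with it.

    `lines` is any iterable of GTF lines, e.g. an open file handle.
    (Hypothetical helper, not part of Cufflinks.)
    """
    mapping = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        gene = GENE_RE.search(line)
        oid = OID_RE.search(line)
        if gene and oid:
            mapping[gene.group(1)].add(oid.group(1))
    return {xloc: sorted(ids) for xloc, ids in mapping.items()}


# Illustrative in-memory records modelled on the lines quoted in this thread
sample = [
    '1\tCufflinks\texon\t11869\t12227\t.\t+\t.\tgene_id "XLOC_000001"; '
    'transcript_id "TCONS_00000002"; exon_number "1"; oId "ENST00000456328";',
    '1\tCufflinks\texon\t12613\t12721\t.\t+\t.\tgene_id "XLOC_000001"; '
    'transcript_id "TCONS_00000002"; exon_number "2"; oId "ENST00000456328";',
]
for xloc, ids in sorted(build_xloc_map(sample).items()):
    print(xloc + "\t" + ",".join(ids))
```

Note that oId holds transcript IDs (ENST...), so one XLOC can map to several of them. Getting from there to Ensembl gene IDs (ENSG...) would need a second lookup against the original Ensembl GTF.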
    Thanks
    Last edited by Seq-Rue; 06-25-2015, 10:54 AM.
