  • XLOC gene id

    Does anyone know what the "XLOC" gene IDs are, and how to convert them to actual gene names or some other useable identifier?

    The first few columns of my Cuffdiff data looks like this:

    test_id gene_id gene locus
    XLOC_000001 XLOC_000001 - chr1:162458-171994
    XLOC_000002 XLOC_000002 - chr1:860763-880142

  • #2
    Cufflinks IDs

    They are Cufflinks IDs. If you run Cufflinks with a GTF or GFF annotation file, you will get gene names instead of XLOC IDs. If you have a genome without an annotation file, you could extract those sequences and BLAST them for an initial identification. Ideally, though, you would run your genome through MAKER or other gene-model prediction software before running Cufflinks.
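
To make the "extract those sequences" step a little more concrete, here is a minimal sketch in base R that turns Cuffdiff `locus` strings into a BED-style table, which a tool such as bedtools getfasta could then use to pull sequences for BLAST. The two rows are the ones from the first post. The output file name is hypothetical, and the coordinates are passed through unchanged, so check the start-coordinate convention against your own GTF before relying on the intervals.

```r
# Toy input: the two rows shown in the first post.
loci <- data.frame(
  gene_id = c("XLOC_000001", "XLOC_000002"),
  locus   = c("chr1:162458-171994", "chr1:860763-880142"),
  stringsAsFactors = FALSE
)

# Split each "chr:start-end" string into its three parts.
parts <- regmatches(loci$locus,
                    regexec("^(.+):([0-9]+)-([0-9]+)$", loci$locus))

bed <- data.frame(
  chrom = vapply(parts, `[`, character(1), 2),
  start = as.integer(vapply(parts, `[`, character(1), 3)),
  end   = as.integer(vapply(parts, `[`, character(1), 4)),
  name  = loci$gene_id,
  stringsAsFactors = FALSE
)

# Assumption: coordinates are written exactly as Cuffdiff reported them.
# "xloc_regions.bed" is a hypothetical file name; it is written to a
# temporary directory here so the sketch does not touch your project.
out_file <- file.path(tempdir(), "xloc_regions.bed")
write.table(bed, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```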



    • #3
      If you run CuffLinks with a GTF or GFF file you will get gene names instead of XLocs.


      Can you please explain this in detailed steps? Thank you in advance.



      • #4
        I ran Cufflinks with the -G flag (i.e., providing an annotation file, a GTF from UCSC, and telling it not to perform novel transcript discovery), and I still got the XLOC ID format. I am having trouble converting them.



        • #5
          I saw this thread and would like to bring it back to life.

          I am having a similar issue. The GTF file I used was from Ensembl, where gene IDs are Ensembl IDs. The Cuffdiff output file replaced the Ensembl IDs with XLOC_ IDs, although it also output gene names (e.g. BCL2). The Ensembl IDs were no longer there.

          Is there any way to convert XLOC IDs back to Ensembl IDs, or simply to keep the Ensembl IDs from my GTF file? How do you go about this? I am trying to work out what the authors' intention was in replacing useful IDs with XLOC IDs.

          Interestingly enough, if I don't run new gene discovery (i.e. if I skip the cuffmerge step), I get to keep the Ensembl IDs.

          Thoughts?
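
One possible route back to reference identifiers is the merged GTF that Cuffmerge writes, since each record there carries the XLOC gene_id alongside other attributes. The sketch below parses such attributes in base R. It assumes the file has `gene_id`, `gene_name`, `oId` and `nearest_ref` attributes, which is how Cuffmerge output usually looks but should be checked against your own file. The toy lines and the ENST identifier are invented for illustration. As far as I know, `oId` and `nearest_ref` hold transcript IDs, so getting to Ensembl gene IDs would need one more lookup in the original GTF.

```r
# Toy lines in the style of a Cuffmerge merged.gtf. The attribute names are
# an assumption about your file, and the ENST identifier is made up.
gtf_lines <- c(
  'chr18\tCufflinks\texon\t100\t200\t.\t-\t.\tgene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "BCL2"; oId "ENST00000000001"; nearest_ref "ENST00000000001"; class_code "=";',
  'chr1\tCufflinks\texon\t300\t400\t.\t+\t.\tgene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.2.1"; class_code "u";'
)

# Extract one attribute from the GTF attribute column (column 9).
# Returns NA where the attribute is absent.
get_attr <- function(attr_col, key) {
  pattern <- paste0('(^|; *)', key, ' "([^"]*)"')
  hits <- regmatches(attr_col, regexec(pattern, attr_col))
  vapply(hits, function(h) if (length(h) == 3) h[3] else NA_character_,
         character(1))
}

# Column 9 of each tab-separated record holds the attributes.
attr_col <- vapply(strsplit(gtf_lines, "\t", fixed = TRUE),
                   `[`, character(1), 9)

# One row per distinct combination of XLOC ID and reference identifiers.
xloc_map <- unique(data.frame(
  gene_id     = get_attr(attr_col, "gene_id"),
  gene_name   = get_attr(attr_col, "gene_name"),
  oId         = get_attr(attr_col, "oId"),
  nearest_ref = get_attr(attr_col, "nearest_ref"),
  stringsAsFactors = FALSE
))
```

Rows where `gene_name` and `nearest_ref` are NA correspond to loci with no match in the reference annotation, which is where an XLOC ID is the only identifier available.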



          • #6
            I faced a similar problem, but then I used -g with a GTF file and got the IDs in my file during cuffdiff...



            • #7
              I used this solution (http://seqanswers.com/forums/showthread.php?t=18357):
              Thomas Doktor said:
              library(cummeRbund)
              cuff <- readCufflinks()

              # Retrieve significant gene IDs (XLOC) with a pre-specified alpha
              diffGeneIDs <- getSig(cuff, level = "genes", alpha = 0.05)

              # Use returned identifiers to create a CuffGeneSet object with all relevant info for given genes
              diffGenes <- getGenes(cuff, diffGeneIDs)

              # gene_short_name values (and corresponding XLOC_* values) can be retrieved from the CuffGeneSet by using:
              names <- featureNames(diffGenes)
              row.names(names) <- names$tracking_id
              diffGenesNames <- as.matrix(names)
              # drop = FALSE keeps a one-column matrix, so the gene_short_name column name survives the merge
              diffGenesNames <- diffGenesNames[, -1, drop = FALSE]

              # Get the data for the significant genes
              # Note: this assumes one row per gene_id (a single pairwise comparison);
              # duplicated IDs would make the row.names assignment fail
              diffGenesData <- diffData(diffGenes)
              row.names(diffGenesData) <- diffGenesData$gene_id
              diffGenesData <- diffGenesData[, -1]

              # Merge the two tables by row names
              diffGenesOutput <- merge(diffGenesNames, diffGenesData, by = "row.names")
              diffGenesOutput will then be a list of genes with the XLOC name as well as the gene name (like BATF3).
              Last edited by blakeoft; 12-08-2014, 06:43 AM.
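
For anyone who wants to try the merge step without a Cuffdiff run to hand, here is a small sketch with toy data frames standing in for `featureNames(diffGenes)` and `diffData(diffGenes)`. Only the column names mirror cummeRbund output; every value, and the sample names, are invented. It merges on the ID columns directly instead of on row names, which is a variation on the quoted approach and not part of it; one reason to consider it is that it still works when a gene appears in several pairwise comparisons.

```r
# Toy stand-in for featureNames(diffGenes); values are invented.
gene_names <- data.frame(
  tracking_id     = c("XLOC_000001", "XLOC_000002"),
  gene_short_name = c("BATF3", "BCL2"),
  stringsAsFactors = FALSE
)

# Toy stand-in for diffData(diffGenes). XLOC_000001 appears twice, as it
# would with more than one pairwise comparison.
diff_data <- data.frame(
  gene_id          = c("XLOC_000001", "XLOC_000001", "XLOC_000002"),
  sample_1         = c("CTR", "CTR", "CTR"),
  sample_2         = c("TRT", "TRT2", "TRT"),
  log2_fold_change = c(1.5, -0.3, 2.1),
  q_value          = c(0.01, 0.6, 0.002),
  stringsAsFactors = FALSE
)

# Join on the ID columns; no row.names bookkeeping is needed.
annotated <- merge(gene_names, diff_data,
                   by.x = "tracking_id", by.y = "gene_id")
```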



              • #8
                Hi All,

                This works great, many thanks. One quick question: I am having a hard time inserting a column between "value_2" and "log2_fold_change". I can make the new column, but it goes to the end of the data frame; the new column (Ratio) is placed after the 'significant' column. For example:

                myGenesOutput$Ratio <- myGenesOutput$TRT_fpkm/myGenesOutput$CTR_fpkm

                Any thoughts? Thanks
                Cheers
                G



                • #9
                  Gonza,

                  Just rearrange the columns. For example, if your data frame called df has three columns, and you want the third column to come before the second column, do
                  Code:
                  df <- df[, c(1, 3, 2)]
                  If you're still having trouble, tell me what
                  Code:
                  names(myGenesOutput)
                  gives you along with the desired order of the names, and I'll be able to help you more explicitly.
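
If the column positions are not known in advance, the same rearrangement can be done by name. This is a minimal sketch on a toy data frame: the column names follow the thread, the values are invented, and `value_2 / value_1` is only a stand-in for the FPKM ratio in the question.

```r
# Toy data frame; values are invented.
myGenesOutput <- data.frame(
  gene_id          = c("XLOC_000001", "XLOC_000002"),
  value_1          = c(10, 4),
  value_2          = c(20, 1),
  log2_fold_change = c(1, -2),
  significant      = c("yes", "yes"),
  stringsAsFactors = FALSE
)

# New columns are always appended at the end of the data frame.
myGenesOutput$Ratio <- myGenesOutput$value_2 / myGenesOutput$value_1

# Rebuild the column order by name: every column except Ratio, with Ratio
# spliced in right after value_2.
others    <- setdiff(names(myGenesOutput), "Ratio")
new_order <- append(others, "Ratio", after = match("value_2", others))
myGenesOutput <- myGenesOutput[, new_order]
```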



                  • #10
                    Thanks so much, that rearrangement worked fantastically!
                    G



                    • #11
                      Hello again,

                      I have another R question and would appreciate some advice.
                      I am plotting the FPKM expression (log data) of a certain gene using the script below, and I cannot figure out how to make the y-axis show up as "10 to the 1", "10 to the 1.5", "10 to the 2", etc.
                      Instead, the graph shows FPKM+1 values as 1, 10 and 100.

                      Any ideas?

                      Script:
                      myGeneBHLH100_isoform_logModeT<-expressionPlot(isoforms(myGeneBHLH100),logMode=T)
                      myGeneBHLH100_isoform_logModeT + theme_bw()



                      • #12
                        Gonza,

                        This appears to be the way to do it with ggplot2. I've tried it with the sample cummeRbund data, and the results are a little goofy. The y axis ticks are at 10^(2.6), 10^(2.8), etc. Maybe it would look better if your data had values that were spread over more powers of 10, or perhaps this is what you're looking for. Try

                        Code:
                        library(scales)
                        myGeneBHLH100_isoform_logModeT<-expressionPlot(isoforms(myGeneBHLH100), logMode=T)
                        myGeneBHLH100_isoform_logModeT +
                           theme_bw() +
                           scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                                             labels = trans_format("log10", math_format(10^.x)))
                        Here's a source: R Cookbook. See the section titled "Axis transformations: log, sqrt, etc." That page has an example with axis ticks that are integer powers of 10.

                        Edit: Oh. It looks like you're ok with rational powers of 10.
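
If integer powers of 10 are wanted on the axis after all, one option is to pass the breaks explicitly. The sketch below uses ggplot2 and scales on an invented data frame (the sample names and FPKM values are made up) so it runs without any cummeRbund objects; the same `scale_y_log10()` call could be added to an expressionPlot() result in the way shown above.

```r
library(ggplot2)
library(scales)

# Invented FPKM values spanning several orders of magnitude.
toy <- data.frame(
  sample_name = factor(c("CTR", "TRT", "TRT2"),
                       levels = c("CTR", "TRT", "TRT2")),
  fpkm = c(3, 45, 800)
)

# Ticks at 1, 10, 100 and 1000 only.
int_breaks <- 10^(0:3)

p <- ggplot(toy, aes(x = sample_name, y = fpkm, group = 1)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  scale_y_log10(breaks = int_breaks,
                limits = range(int_breaks),
                labels = trans_format("log10", math_format(10^.x)))

# print(p) draws the plot.
```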



                        • #13
                          Hey blakeoft, that worked beautifully, thanks very much once again! If you do not mind, one last question, please.

                          When I type the command below I get 2 different plots (one for each isoform). Is there a way to plot those isoforms in the same plot? Somehow they do it in the cummeRbund protocol (Fig. 5a - Nature Protocols 7, 562–578 (2012) doi:10.1038/nprot.2012.016).

                          Full script :

                          myGeneId<-"XLOC_010858"
                          myGeneBHLH100<-getGene(cuff_data,myGeneId)
                          myGeneBHLH100

                          XLOC_010858 <-expressionPlot(myGeneBHLH100,logMode=T)
                          XLOC_010858 + theme_bw()

                          myGeneBHLH100_isoform_logModeT <- expressionPlot(isoforms(myGeneBHLH100), logMode = T)
                          myGeneBHLH100_isoform_logModeT +
                             theme_bw() +
                             scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                                           labels = trans_format("log10", math_format(10^.x)))



                          • #14
                            Gonza,

                            It looks like expressionPlot() has been updated at some point so that the isoforms are now plotted side by side. Have you looked at the manual? It has the plots side by side in its example. It also has the FPKM values as integers in log mode, instead of the "10^x" format. I could be wrong because the paper and the manual are both dated 2012.

                            I tried to use ggplot2 to plot this for you. Anyways, this is the best that I could do.

                            Code:
                            iso_plot <- ggplot(isoforms(myGeneBHLH100)@fpkm,
                                               aes(x = sample_name, y = fpkm, group = isoform_id, color = isoform_id))
                            iso_plot +
                               geom_line() + theme_bw() +
                               scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                                             labels = trans_format("log10", math_format(10^.x))) +
                               geom_errorbar(aes(ymin = conf_lo, ymax = conf_hi)) # + geom_point(color = "black", shape = 19)
                            Some aesthetics aren't the same as the normal plots that cummeRbund makes, for example the colors of the lines are different. You can mess around with those colors, the line thickness, etc., but this looks pretty close to what they have in the paper. If you want black data points like in the manual, uncomment the geom_point part on the last line.

                            Edit: I think that some people frown on multiple line plots like this because they can get crowded. One way to mitigate this is to do what is called dodging. Here's how you'd do it for this plot:

                            Code:
                            iso <- isoforms(myGeneBHLH100)
                            # Shift each isoform slightly along x so lines and error bars do not overlap
                            pd <- position_dodge(0.3)
                            iso_plot <- ggplot(iso@fpkm,
                                               aes(x = sample_name, y = fpkm, group = isoform_id, color = isoform_id))
                            iso_plot +
                               geom_line(position = pd) + theme_bw() +
                               scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                                             labels = trans_format("log10", math_format(10^.x))) +
                               geom_errorbar(aes(ymin = conf_lo, ymax = conf_hi), position = pd) # + geom_point(color = "black", shape = 19, position = pd)
                            Last edited by blakeoft; 10-06-2014, 09:10 AM. Reason: made the black data points come after error bars



                            • #15
                              Hi blakeoft, that worked well. I am so grateful for your help!
                              But you are totally right: after playing around with it, the graph seems pretty crowded and does not look as good as I thought it would.

                              Again, many, many thanks for your help and time (and I may have more questions as I go along).

                              Best
                              G
