  • A post-doc and the case of the disappearing gene names (in Cufflinks/cuffdiff)

    I've successfully gotten scatter plots, etc. in cummeRbund, but then when I try to actually dig into the data (trying to see if individual transcripts match the patterns we've seen previously with qPCR, convincing myself and my PI that this RNAseq is working) I don't see any gene ID in the cuff_diff files.

    I don't have a .gtf file of my genome, only a .gff file and it sounds beyond my abilities to change it to a .gtf, and cuffmerge only takes .gtf files for annotation. You can run cuffmerge without a "-r <ref.gtf>" but if I do this, will I never have annotations in my later files?

    Can I use cuffcompare with my .gff annotations to get gene names in the cuff_diff output?

    Also, if I don't have specific gene annotations in my cuffdiff file, then what are showing up as my points in the cummeRbund scatter plots?

    Thanks for your help,
    Anna

  • #2
    Not very satisfying possible workaround

    Well, I've continued playing around and there are still many things that confuse me. But if you also have this problem (annotations in a .gff file, and you want an annotated list for cuffdiff), it seems you can get your gene annotations to show up in the cuffdiff output (and make shiny graphs in cummeRbund) by using cuffcompare instead of cuffmerge to make the list of transcripts, since cuffcompare will take .gff files.

    • #3
      The tuxedo suite is confusing and I feel R is more so.

      The points on your cummeRbund plots are connected to genes or transcripts in gene_exp.diff or isoform_exp.diff, respectively. I used cuffmerge/cuffdiff. There are corresponding genes.fpkm_tracking and isoforms.fpkm_tracking files, which connect Cufflinks' internal test_ids and gene_ids to either reference transcript identifiers (like RefSeq) or HUGO gene symbols. So they are there; it is just not that obvious.

      cummeRbund has a function to add annotation to the object's database, but I haven't used it yet. I will say the cummeRbund documentation at Bioconductor is more thorough than at the MIT web site.

      I found it helpful to first review the cufflinks manual carefully. I am also reading any post from the cufflinks developer, Cole Trapnell, for further insights.
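
A rough sketch of the lookup described in this post, written in Python. It assumes tab-separated files with a header row, a `test_id` column in gene_exp.diff, and `tracking_id`, `gene_id`, `gene_short_name` and `nearest_ref_id` columns in the tracking file; check the headers of your own output before relying on it. The function names and the toy rows are made up for illustration.

```python
import csv
import io


def load_tracking(handle):
    """Map tracking_id -> selected annotation columns of a tracking file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["tracking_id"]: {
            "gene_id": row["gene_id"],
            "gene_short_name": row["gene_short_name"],
            "nearest_ref_id": row["nearest_ref_id"],
        }
        for row in reader
    }


def annotate_diff(handle, tracking):
    """Yield each row of a *_exp.diff file with tracking annotation added."""
    for row in csv.DictReader(handle, delimiter="\t"):
        extra = tracking.get(row["test_id"], {})
        # Use "-" as a placeholder when the test_id is not in the tracking file.
        row["gene_short_name"] = extra.get("gene_short_name", "-")
        row["nearest_ref_id"] = extra.get("nearest_ref_id", "-")
        yield row


# Toy data standing in for real cuffdiff output (hypothetical values).
TRACKING = (
    "tracking_id\tclass_code\tnearest_ref_id\tgene_id\tgene_short_name\n"
    "XLOC_000001\t-\tNM_0001\tXLOC_000001\tabcA\n"
)
DIFF = (
    "test_id\tgene_id\tgene\tlocus\tsignificant\n"
    "XLOC_000001\tXLOC_000001\tabcA\tchr1:10-500\tyes\n"
    "XLOC_000002\tXLOC_000002\t-\tchr1:900-1500\tno\n"
)

tracking = load_tracking(io.StringIO(TRACKING))
annotated = list(annotate_diff(io.StringIO(DIFF), tracking))
for row in annotated:
    print(row["test_id"], row["gene_short_name"], row["nearest_ref_id"])
```

With real data you would pass open file handles instead of the `io.StringIO` objects.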

      • #4
        I was having a similar problem and I couldn't figure out how to get my gene names listed in the cuffdiff output. With lots of messing about, I found that cuffmerge was the program that does this using the gtf file. Once I had the gtf file, everything was great. I think it is a much cleaner method to use cuffmerge (which does cuffcompare for you anyway).

        I highly recommend trying to convert your gff file to gtf for future analyses. Maybe you have seen it already, but it seems that cufflinks already has a utility (gffread) to convert between gff and gtf formats. I have not used it myself, but the instructions seem straightforward (http://cufflinks.cbcb.umd.edu/gff.html).

        Also, out of interest, which genome are you using? Maybe there is a gtf annotation available already?

        Cheers

        Sam

        • #5
          I had this problem too and solved it by changing my gff to a gtf (following Illuminoid's suggestion).

          This is the command: gffread -E myspecies.gff3 -T -o- > myspecies.gtf

          And I ran cuffmerge with the option '-g myspecies.gtf'

          Thanks!!

          • #6
            Hi, guys

            I just wanted to clarify this point about the argument:

            “You can run cuffmerge without a "-r <ref.gtf>" but if I do this, will I never have annotations in my later files?”

            I looked at the manual, and it does have an option -g to specify the annotation file. Am I wrong? I am interested in this discussion because I followed the instructions and ran cuffmerge with the following command:
            cuffmerge -g genes.gtf -s genome.fa -p 4 assemblies_mt.txt

            At the end, in my differentially expressed gene file (gene_exp.diff), I got the gene_name (or is it the transcript_id?) for each of the transcripts rather than the gene_id (which I really wanted). Is there any way to do that in Cufflinks? Thanks a lot for your help!

            • #7
              Baoqing,
              I don't remember, but I might have meant "-g" and not "-r". In any case, I found that it really matters how your .gtf file is coded. It doesn't just matter that it is a .gtf; it also really matters how the final column is coded. So maybe play around with what you call each term in the file to make sure that cufflinks is pulling out the label that you want?

              Why not just re-code the file so that "gene_name" says what you want to show up? I think that was what finally worked for me, although to be honest, I switched to using DESeq, which ended up being much clearer for me, and works since splice variants aren't something that I'm worried about for my bacterium.
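
One way the re-coding suggested above could be scripted, as a minimal Python sketch. It assumes a nine-column, tab-separated GTF whose final column holds `key "value";` pairs, and it copies the `gene_id` value into `gene_name` (adding `gene_name` when it is absent). The function name and the example lines are hypothetical, and whether cuffdiff then reports the label you want is something to verify on your own data.

```python
import re

# \b keeps "gene_id" from matching inside longer keys such as "ref_gene_id".
GENE_ID_RE = re.compile(r'\bgene_id "([^"]*)"')
GENE_NAME_RE = re.compile(r'\bgene_name "[^"]*"')


def copy_gene_id_to_gene_name(line):
    """Return the GTF line with gene_name set to the value of gene_id."""
    columns = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(columns) != 9:
        return line  # comment or unexpected layout: pass through untouched
    match = GENE_ID_RE.search(columns[8])
    if match is None:
        return line
    new_name = f'gene_name "{match.group(1)}"'
    if GENE_NAME_RE.search(columns[8]):
        # A function replacement avoids backslash handling in the new text.
        columns[8] = GENE_NAME_RE.sub(lambda _m: new_name, columns[8])
    else:
        columns[8] = columns[8].rstrip().rstrip(";") + f"; {new_name};"
    return "\t".join(columns) + "\n"


# Hypothetical example lines, not taken from any real annotation.
with_name = 'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "b0001"; transcript_id "t1"; gene_name "thrL";\n'
without_name = 'chr1\tsrc\texon\t200\t300\t.\t+\t.\tgene_id "b0002"; transcript_id "t2";\n'

recoded_with = copy_gene_id_to_gene_name(with_name)
recoded_without = copy_gene_id_to_gene_name(without_name)
print(recoded_with, end="")
print(recoded_without, end="")
```

Editing the attribute text in place, rather than rebuilding the column from a parsed dictionary, leaves every other attribute exactly as it was written.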

              • #8
                Thank you, amcloon

                My programming skills are not good enough to change the settings of cufflinks. It seems the gene name is the default for the output, or is controlled by some argument that I do not know about yet. Anyway, I was planning to write a script that matches my gene names against the original .gtf file to pull out the gene_id, but I am not sure this is the smart way to do it. Do you know of anything related that you can share? I do not really want to do something redundant if the information is already there.
                Best,
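
The matching script mentioned in this post could look roughly like the following Python sketch. It assumes the GTF's final column carries both `gene_id "..."` and `gene_name "..."` attributes; the function name and the toy lines are hypothetical. A name that maps to more than one gene_id is kept as a set so that ambiguity stays visible.

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r'\bgene_id "([^"]*)"')
GENE_NAME_RE = re.compile(r'\bgene_name "([^"]*)"')


def build_name_to_ids(gtf_lines):
    """Map each gene_name to the set of gene_id values it appears with."""
    name_to_ids = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # skip anything that is not a nine-column record
        gene_id = GENE_ID_RE.search(columns[8])
        gene_name = GENE_NAME_RE.search(columns[8])
        if gene_id and gene_name:
            name_to_ids[gene_name.group(1)].add(gene_id.group(1))
    return dict(name_to_ids)


# Toy GTF records (hypothetical identifiers).
GTF_LINES = [
    "# toy annotation\n",
    'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "abcA";\n',
    'chr1\tsrc\texon\t150\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "abcA";\n',
    'chr2\tsrc\texon\t10\t90\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; gene_name "dup";\n',
    'chr3\tsrc\texon\t10\t90\t.\t-\t.\tgene_id "G3"; transcript_id "T3"; gene_name "dup";\n',
]

lookup = build_name_to_ids(GTF_LINES)
for name in sorted(lookup):
    ids = sorted(lookup[name])
    note = " (ambiguous)" if len(ids) > 1 else ""
    print(f"{name}\t{','.join(ids)}{note}")
```

For a real file, `build_name_to_ids(open("genes.gtf"))` would be the equivalent call.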

                • #9
                  As I said above, what worked for me was changing the way the .gtf file was coded. But still, cuffdiff wasn't the best option for me in the end, even when the gene names showed up. Good luck.

                  • #10
                    Here is my stupid method:
                    ########## in R ##########
                    library(cummeRbund)
                    # 'diff_out' is the cuffdiff output directory
                    cuff <- readCufflinks('diff_out')

                    # Gene level: annotation, FPKM matrix and count matrix
                    gene.features <- annotation(genes(cuff))
                    write.table(gene.features, 'gene_anno.txt', sep='\t', row.names=FALSE, col.names=TRUE, quote=FALSE)
                    gene.matrix <- fpkmMatrix(genes(cuff))
                    write.table(gene.matrix, 'gene_matrix.txt', sep='\t', row.names=FALSE, col.names=TRUE, quote=FALSE)
                    gene.count.matrix <- countMatrix(genes(cuff))
                    write.table(gene.count.matrix, 'gene_count_matrix.txt', sep='\t', row.names=FALSE, col.names=TRUE, quote=FALSE)

                    # Isoform level: the same three tables
                    isoform.features <- annotation(isoforms(cuff))
                    write.table(isoform.features, 'isoform_anno.txt', sep='\t', row.names=FALSE, col.names=TRUE, quote=FALSE)
                    isoform.matrix <- fpkmMatrix(isoforms(cuff))
                    write.table(isoform.matrix, 'isoform_matrix.txt', sep='\t', row.names=FALSE, col.names=TRUE, quote=FALSE)
                    isoform.count.matrix <- countMatrix(isoforms(cuff))
                    write.table(isoform.count.matrix, 'isoform_count_matrix.txt', sep='\t', row.names=FALSE, col.names=TRUE, quote=FALSE)

                    q()
                    ########## quit R ##########
                    # In the shell: paste the tables side by side.
                    # This relies on all three files listing their rows in the same order.
                    paste isoform_anno.txt isoform_count_matrix.txt isoform_matrix.txt > isoform_count_fpkm_matrix
                    paste gene_anno.txt gene_count_matrix.txt gene_matrix.txt > gene_count_fpkm_matrix

                    • #11
                      Thanks a lot! That worked brilliantly! I should have explored the cummeRbund package more. I resolved this by using:

                      samtools view sample_1_name_sorted.bam | htseq-count -i gene_id - ~/Desktop/rnaseq/trimmed/genes.gtf > sample_1.txt

                      But when I compare the counts I got from cummeRbund with the counts I obtained from htseq-count, they are not entirely the same. Usually the two are a few counts off from each other, but in other cases the difference is several hundred counts. Should that be a problem?

                      There are also some other discrepancies between the results from htseq-count and cummeRbund, for example:
                      1. I notice that the gene names displayed in the file generated from cummeRbund are the "gene_short_name", while the names I used from htseq-count were extracted from the genes.gtf file, under the column "gene_id".
                      However, I did not find any "gene_short_name" column in the genes.gtf file; there is a column "gene_name" instead. I am assuming you used this column?

                      2. Some names are present and some are absent from the files. This might actually be explained by the two different columns of names we were using. I did some pattern matching and this indeed seemed to be the case: names that were missing from the cummeRbund results could always be matched back to the gene_id in the same row of the genes.gtf file! Could you confirm this?

                      Best,

                      Baoqing
                      Last edited by Baoqing; 07-12-2013, 11:33 AM.
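
The comparison described in this post can be sketched in Python as below. It assumes both tables are reduced to `name<TAB>count` lines keyed by the same identifier, and that htseq-count summary rows are either prefixed with `__` or carry one of the unprefixed names listed in the code (an assumption about htseq-count output that may depend on the version). The function names, the tolerance and the toy numbers are hypothetical.

```python
# Assumed names of htseq-count summary rows when they have no "__" prefix.
SUMMARY_ROWS = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def read_counts(lines):
    """Parse 'name<TAB>count' lines into a dict, skipping summary rows."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, value = line.split("\t")
        if name.startswith("__") or name in SUMMARY_ROWS:
            continue
        counts[name] = float(value)
    return counts


def compare_counts(first, second, tolerance=0.0):
    """Return names unique to each table and shared names whose counts differ."""
    only_first = sorted(set(first) - set(second))
    only_second = sorted(set(second) - set(first))
    differing = {
        name: (first[name], second[name])
        for name in sorted(set(first) & set(second))
        if abs(first[name] - second[name]) > tolerance
    }
    return only_first, only_second, differing


# Toy tables (hypothetical values): one keyed partly by gene_id, one by name.
htseq = read_counts(["b0001\t10", "b0002\t250", "b0003\t7", "__no_feature\t99"])
cuff = read_counts(["b0001\t12.4", "thrA\t250", "b0003\t7.5"])

only_htseq, only_cuff, differing = compare_counts(htseq, cuff, tolerance=1.0)
print("only in htseq-count:", only_htseq)
print("only in cummeRbund:", only_cuff)
print("differ by more than 1:", differing)
```

Names that show up in only one table are the candidates for the gene_id versus gene_name mismatch raised in point 2.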
