Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fruit Fly CummeRbund Plots Are Not The Same

    Hello all,

    After several days of moving through the TopHat Cufflinks workflow with the example fruit fly data (http://www.ncbi.nlm.nih.gov/pubmed/22383036), I have finally made it to CummeRbund.

    When I execute the following commands in R, I get plots that are not the same as the ones found in the publication. Is that a problem?
    library(cummeRbund)
    cuff_data <- readCufflinks('diff_out')
    csDensity(genes(cuff_data))
    csScatter(genes(cuff_data), 'C1', 'C2')
    csVolcano(genes(cuff_data), 'C1', 'C2')

    Attached is the volcano plot. I just took a screen shot and saved it as ppt since I didn't know how to export or save the R graph.

    Thanks and God bless,
    Jason
    Attached Files

  • #2
    csDensity is really different

    I also wanted to share my csDensity plot because it is really different.
    Attached Files

    Comment


    • #3
      This does look like a problem. Assuming that you are using the right original data (GSE32038), you should get the same volcano plot. Most concerning to me is that you're missing any significant genes.

      I imagine that you are using updated versions of the software relative to the paper, but in principle this shouldn't change the results so dramatically. Several default options have changed recently, though. Without looking over everything, it's tough to say for sure.

      An interesting and potentially educational for you approach would be to run the same data through Galaxy's tophat/cufflinks/cummerbund pipeline to see if you get the same results (or if perhaps you or the tutorial had done something incorrectly).

      Comment


      • #4
        I'd like to check the assumption of using the correct data with you.

        I went to http://0-www.ncbi.nlm.nih.gov.elis.t...i?acc=GSE32038, and copied the links for the RAW.tar and simulated_fastq_files.tar.gz.

        My commands were:
        (in GSE32038 directory)
        wget ftp://ftp.ncbi.nlm.nih.gov/geo/serie...E32038_RAW.tar -O GSE32038_RAW.tar
        tar -xvf GSE32038_RAW.tar
        gunzip GSM79448*
        (in the my_rnaseq_exp directory)
        wget ftp://ftp.ncbi.nlm.nih.gov/geo/serie...q_files.tar.gz -O GSE32038_simulated_fastq_files.tar.gz #Could not find any other fastq files in the GEO Assession
        tar -xvf GSE32038_simulated_fastq_files.tar.gz
        gunzip GSM79448*

        The TopHat Cufflinks journal article says to download the "raw sequencing reads, alignment reads, assembled transfrags and differential analysis", but I was not exactly sure which of files were the ones I needed. Did I download the correct ones?

        Thank you!
        Jason

        Comment


        • #5
          Based on the name of the file (GSE32038_simulated_fastq_files.tar.gz) you seem to have downloaded the right data file.

          Figure 8 in the paper has different numeric ranges for axes (specially the Y-axis) that may be why your volcano plot is looking different. We can't eliminate the possibility that you have different results in the underlying data. Same is true for the other graph.

          Also keep in mind that the example in the nature protocols paper used previous versions of the Tophat/Cufflinks suite. You are probably using the newer versions (v.2.x).
          Last edited by GenoMax; 06-17-2013, 11:38 AM.

          Comment


          • #6
            Okay, and for the fruit fly genome I went to ftp://ussd-ftp.illumina.com/.
            wget ftp://igenome:[email protected] -O Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz
            tar zxvf Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz

            Is that also correct?

            I also noticed the axis were different in Figure 6. If FPKM (fragments per kilobase of exon per million fragments mapped) is a decimal, log(FPKM) is negative. This is what we see in my figure but not theirs. If FPKM is a decimal than that means there are fewer fragments per kb of exon than million fragments mapped.

            In a previous thread (http://seqanswers.com/forums/showthread.php?t=31074), I posted several warnings that I got from ran Cuffmerge. These warnings were for missing fasta records such as heterochromatin and mitochondrial genome. Perhaps, without these index files, fragments mapped incorrectly all the way back at the TopHat step. This may account for negative values for log10(FPKM) in the csDensity graph.

            If that is the case, for me to get the same results as in the publication, I will need a fasta index with everything including heterochromatin and mitochondrial genome. If the website (ftp://ussd-ftp.illumina.com/) is not the correct source, does anyone know what is the website with the full genome?

            Comment


            • #7
              I actually have similar problems to yours.
              I also analyzed the simulated data, in combination with the iGenomes data. My results also differ from the results shown in the paper. However, my results also differ from yours.
              The reason you're not showing any significant genes in your volcano plots, is because you did not switch on the significant genes. For some reason you have to do this explicitly:

              e.g. csVolcano(genes(cuff_data), 'WT', 'KO', alpha=0.05, showSignificant=T)

              My error bars/standard deviation is also really high for each gene. Much bigger than one standard deviation, I don't know why this is.

              Did you ever figure out if you did something wrong, or what the reason for the changes are?

              Cheers
              Last edited by rubbertjes; 07-08-2013, 08:42 AM.

              Comment


              • #8
                Hello rubbertjes,

                I was never able to test my hypothesis since I could never find (or was given) a full fruit fly fasta index with the heterochromatin and mitochondrial genomes.

                Thanks for tip about showSignificant=T. I will give that a try this week to see what happens.

                God bless

                Comment


                • #9
                  Hello jmwhitha,

                  Did you finally guessed what were the reasons for the different plot outcomes? I am getting exactly the same "wrong" plots as you. If I run cummerbund using the reference result files from the tutorial, some of the plots are corrected (such as the volcano and the scatter plots), but others are not (eg the density plot and the barplots, which only show the error bars). I get some warnings/errors from cummerbund for these last commands (these, even when using the result data provided as reference, thus might be something due to R packages version I guess):

                  > csDensity(genes(cuff_data))
                  Warning messages:
                  1: Removed 4075 rows containing non-finite values (stat_density).
                  2: Removed 4082 rows containing non-finite values (stat_density).

                  > mygene <- getGene(cuff_data,'regucalcin')
                  > expressionBarplot(mygene)
                  Error : Mapping a variable to y and also using stat="bin".
                  With stat="bin", it will attempt to set the y value to the count of cases in each group.
                  This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
                  If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
                  If you want y to represent values in the data, use stat="identity".
                  See ?geom_bar for examples. (Defunct; last used in version 0.9.2)


                  I am using Tophat2, cufflinks2 and R3.1.1 for these analysis, while I think in the tutorial they used older versions for these tools...however, I guess resutls should not differ so much, right?

                  Santi

                  Comment


                  • #10
                    Hello Santi,

                    I found a variety of issues with this pipeline so I went with DEGseq. Besides not having as many issues, the processing time was much faster. So, actually I'd recommend you try DEGseq or another pipeline rather than trying to get TopHat Cufflinks... to work.

                    God bless,
                    Jason

                    Comment


                    • #11
                      Hi Jason,

                      Thanks for the feedback! Cheers,

                      Santi

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      11 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      51 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      67 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X