  • RNA-seq count based vs Cuff Diff

    Hi All,

    I have a question regarding differential expression analysis of RNA-seq data.

    I have a number of different samples. After alignment with Tophat2, I used either Cuffdiff or summarizeOverlaps to generate the count matrix, and then performed the differential analysis using DESeq2, edgeR and NBPSeq. I have named the comparisons Lists A-J.
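
    For reference, a minimal sketch of the count-based arm of this workflow (summarizeOverlaps from GenomicAlignments feeding DESeq2); the GTF/BAM file names and the two-condition, three-replicate design are placeholders:

    library(GenomicFeatures)
    library(GenomicAlignments)
    library(Rsamtools)
    library(DESeq2)

    # gene model from the same annotation used for alignment (path is a placeholder)
    txdb <- makeTxDbFromGFF("genes.gtf", format = "gtf")
    gene.ranges <- exonsBy(txdb, by = "gene")

    # count reads overlapping each gene in the Tophat2 BAMs (names are placeholders)
    bams <- BamFileList(c("s1.bam", "s2.bam", "s3.bam", "t1.bam", "t2.bam", "t3.bam"))
    se <- summarizeOverlaps(gene.ranges, bams, mode = "Union", ignore.strand = TRUE)

    # differential expression with DESeq2
    colData(se)$condition <- factor(c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt"))
    dds <- DESeqDataSet(se, design = ~ condition)
    dds <- DESeq(dds)
    res <- results(dds)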

    Using these 4 methods, I get different numbers of DE genes for each comparison. See the table below:

               DESeq2   NBPSeq   edgeR   Cuffdiff
    ListA         388      491     497        585
    ListB           2        5       3         16
    ListC         386      487     494        576
    ListD        4381     4412    4482       5405
    ListE        3885     4388    4159       4753
    ListF           2        5       3         16
    ListG        1376     1254    1375       1455
    ListH         148      188     190        213
    ListI        3885     4400    4159       4754
    ListJ        3626     2300    3566       5607

    Overall, the trend is that Cuffdiff identifies more genes than the count-based methods.

    I also looked at the overlap between the genes identified by the different methods; see the attached file.

    Now my question is: which gene set for each list do I use as the final list?

    Even though Cuffdiff reports more genes, many of these could be false positives. I am inclined to take the overlap of the 4 methods; however, I have not come across a publication where this is done.
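
    For instance, the consensus set could be pulled out like this (a sketch; the four per-method gene ID vectors are placeholders):

    # DE gene IDs reported by each method (placeholder character vectors)
    lists <- list(deseq2   = deseq2_genes,
                  edger    = edger_genes,
                  nbpseq   = nbpseq_genes,
                  cuffdiff = cuffdiff_genes)

    # genes called DE by all four methods
    consensus <- Reduce(intersect, lists)

    # how many of the four methods call each gene
    calls <- table(unlist(lapply(lists, unique)))
    head(sort(calls, decreasing = TRUE))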

    Are there any bioinformatics ways in which I could perform additional validation of the gene lists that I have observed? I am aware of over-representation analysis using GO terms or KEGG pathways, but these also produce a large number of false positives, so IMHO they don't add any more certainty/confidence that my final list is the correct one.

    Thanks in advance for taking the time to answer my question and provide me with some guidance.
    Attached Files

  • #2
    Are there any bioinformatics ways in which I could perform additional validation of the gene lists that I have observed?
    Once you've got your list of possible DE genes, the next step is to validate them using additional experimental methods (e.g. real-time PCR).

    It's far too easy to get bogged down in the details of statistics. These methods will never give you 100% certainty that a particular transcript (or isoform) is differentially expressed, particularly because there are too many unknown factors that need to be estimated in the calculation. Bioinformatics will only add more haystacks to search through, as you've discovered here already by trying multiple methods.
    Last edited by gringer; 03-20-2014, 01:13 AM.



    • #3
      hi,

      "Are there any bioinformatics ways in which I could perform additional validation of the gene lists that I have observed?"

      I would load up the coverage (normalized somehow for library size) for all samples in IGV ( http://www.broadinstitute.org/igv/home ) or the Genome Browser, and look at genes which are called by one algorithm but not the others, or at genes which are called by all algorithms except one. Or look by eye at normalized counts for such genes using stripchart() or boxplot() in R. Or you can use ReportingTools ( http://www.bioconductor.org/packages...tingTools.html ) to generate boxplots of the read counts. Basically, in addition to validation, it's a good idea to look at the raw data (but do take differences in library size into account).
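
      A minimal stripchart sketch, assuming dds is a fitted DESeqDataSet and the gene ID is a placeholder:

      # normalized counts for one disputed gene, grouped by condition
      norm.counts <- counts(dds, normalized = TRUE)
      gene <- "GeneA"  # placeholder: e.g. a gene called by only one method
      stripchart(norm.counts[gene, ] ~ colData(dds)$condition,
                 vertical = TRUE, method = "jitter", pch = 19,
                 ylab = "normalized counts", main = gene)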



      • #4
        Looking at your figures, the standout result to my mind is that Cuffdiff is the odd man out. DESeq2, edgeR and NBPSeq all seem to agree quite well, while Cuffdiff is the one showing very large numbers of uniquely identified results. Occam's razor alone would make me suspicious of those results, over anything from the other three at least.

        Any time I see one algorithm or method giving radically more significant results, my first suspicion is the method itself, or my particular implementation of it.

        One question I would ask, though: how are you defining differential expression for these gene lists? Is this based solely on FDR, or did you apply a magnitude threshold as well in deriving those gene lists? If so, does FDR correlate well across all four methods? How well do the estimated fold changes correlate?
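
        For example, in DESeq2 applying both filters would look something like this (the cutoffs are only illustrative):

        # FDR plus magnitude threshold on a DESeq2 results table
        res <- results(dds)
        sig <- res[which(res$padj < 0.05 & abs(res$log2FoldChange) >= 1), ]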
        Michael Black, Ph.D.
        ScitoVation LLC. RTP, N.C.



        • #5
          Originally posted by mbblack View Post
          Looking at your figures, the standout result to my mind is that Cuffdiff is the odd man out. DESeq2, edgeR and NBPSeq all seem to agree quite well, while Cuffdiff is the one showing very large numbers of uniquely identified results. Occam's razor alone would make me suspicious of those results, over anything from the other three at least.

          Any time I see one algorithm or method giving radically more significant results, my first suspicion is the method itself, or my particular implementation of it.
          The other three all use the same count matrix, so you can't really say it is 3:1. Much of the difference is likely in how they handle ambiguous reads and assign reads to transcripts.

          I would compare counts for Cuffdiff-only genes and look for signs of reads originating from other truly differentially expressed genes, or try to validate some of the Cuffdiff genes with qPCR.
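
          Pulling out that Cuffdiff-only set is straightforward (a sketch; the gene ID vectors are placeholders):

          # genes called only by Cuffdiff and by none of the count-based methods
          cuffdiff.only <- Reduce(setdiff,
                                  list(cuffdiff_genes, deseq2_genes,
                                       edger_genes, nbpseq_genes))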



          • #6
            Cuffdiff would be expected to find more DE genes because of isoform-level analysis -- it's not clear if isoform-specific changes that aren't replicated at a gene level were excluded in the initial post.



            • #7
              A while ago I did an analysis comparing the lists of genes called differentially expressed across a set of mixture samples by a number of different algorithms. These mixture samples had some "truth" built in, and the genes called by all algorithms were closer to the truth than those called by only one.

              I'd argue for sending all of them out for validation, but start with the concordant set if you have too many to do.

              After looking at the size of each of your lists, I'd think you could do a comprehensive examination of all genes on list B, for example, and use the validation rate from that list to inform which way to go for the 1000+ gene lists.
              Last edited by jparsons; 03-20-2014, 02:37 PM.



              • #8
                Thanks for the responses.

                The lists of genes were generated based on FDR-corrected p-values < 0.05 for all methods. I haven't applied a log2 fold change cut-off.

                Originally posted by gringer View Post
                Cuffdiff would be expected to find more DE genes because of isoform-level analysis -- it's not clear if isoform-specific changes that aren't replicated at a gene level were excluded in the initial post.
                For the Cuffdiff results I extracted the significantly DE genes from the gene_exp.diff file. So the summed FPKM of all transcripts associated with a gene will be used, and should therefore, in my mind, represent gene-specific changes. Is this what you were referring to?
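
                A minimal sketch of that extraction, assuming the standard Cuffdiff column names:

                # read gene-level Cuffdiff results and keep the significant genes
                gd <- read.delim("gene_exp.diff", stringsAsFactors = FALSE)
                sig.cuffdiff <- gd[gd$significant == "yes", ]  # Cuffdiff's own call at its default FDR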

                Based on the literature, I would expect DESeq2, edgeR and NBPSeq to produce similar gene lists, as they are all count-based methods that model the data using the negative binomial distribution. They differ only in how they calculate normalization factors for library-size normalization and how the dispersion parameter is estimated. As for Cuffdiff, it has been reported to return a high number of false positives, so I am not surprised it found more genes than the count-based methods. This is a good review paper comparing the different methods: http://genomebiology.com/2013/14/9/R95

                Looking at the adjusted p-values (FDR) and log2 fold change values for the four methods, the log2 fold changes are very similar. There are some differences in the p-values, however; for all lists Cuffdiff reports a higher adjusted p-value. An extract:

                Log2 fold change:
                           DESeq2     edgeR      NBPSeq     Cuffdiff
                GeneA      3.740054   3.707668   3.714246   3.69571
                GeneB      3.400678   3.282097   3.392317   3.37794
                GeneC      2.92362    2.879733   2.981853   2.97935
                GeneD      2.914234   2.853205   2.920566   2.90528

                Adjusted p-value (q_value for Cuffdiff):
                           DESeq2     edgeR      NBPSeq     Cuffdiff
                GeneA      2.73E-16   2.80E-31   4.20E-37   0.000219799
                GeneB      1.24E-04   7.10E-08   1.85E-07   0.00266667
                GeneC      1.05E-07   5.67E-13   4.48E-12   0.000219799
                GeneD      5.23E-05   4.95E-08   4.06E-08   0.000219799
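
                One quick way to quantify that agreement (a sketch; "merged" and its column names are placeholders for the per-method results joined into one data frame):

                # rank correlation of the fold-change estimates across methods
                fc <- merged[, c("DESeq2_Log2FC", "edgeR_Log2FC",
                                 "NBPSeq_Log2FC", "Cuffdiff_Log2FC")]
                cor(fc, method = "spearman")
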
                Validating the lists using qPCR is what ultimately needs to be done to filter out the false positives. However, perhaps I should have been clearer in my original post: I am wondering whether there is a way to assess whether some methods are more accurate at modeling the data than others.

                Attached are the dispersion plots for DESeq2 and edgeR for the gene lists (not all gene lists are shown because some were generated by subtracting one set of results from another, for biological reasons). The plots look similar to each other and resemble the examples in the documentation, so this gives some confidence that nothing strange is going on.
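
                For anyone reproducing those plots, these are the standard calls (assuming dds is a fitted DESeqDataSet and y a DGEList after dispersion estimation):

                plotDispEsts(dds)  # DESeq2: per-gene dispersion estimates and fitted trend
                plotBCV(y)         # edgeR: biological coefficient of variation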

                Perhaps I am searching for something impossible, validation of the models used to determine DE genes, and the overlap of the 4 lists should be the closest to the "truth".
                Attached Files



                • #9
                  For the Cuffdiff results I extracted the significantly DE genes from the gene_exp.diff file. So the summed FPKM of all transcripts associated with a gene will be used, and should therefore, in my mind, represent gene-specific changes. Is this what you were referring to?
                  Yes, good. That looks like the right thing to do.

                  Based on the literature, I would expect DESeq2, edgeR and NBPSeq to produce similar gene lists, as they are all count-based methods that model the data using the negative binomial distribution. They differ only in how they calculate normalization factors for library-size normalization and how the dispersion parameter is estimated.
                  Well, if you're looking for a different count-based method, why not try the others mentioned in that paper, for example baySeq (after all, it had the best correlation with the TaqMan assay)?
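
                  A rough baySeq sketch along the lines of its vignette (m is a placeholder count matrix, the replicate/group layout is made up, and argument details vary between package versions):

                  library(baySeq)
                  cd <- new("countData", data = m,
                            replicates = c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt"),
                            groups = list(NDE = c(1, 1, 1, 1, 1, 1),
                                          DE  = c(1, 1, 1, 2, 2, 2)))
                  libsizes(cd) <- getLibsizes(cd)
                  cd <- getPriors.NB(cd, samplesize = 1e5, estimation = "QL", cl = NULL)
                  cd <- getLikelihoods(cd, cl = NULL)
                  topCounts(cd, group = "DE")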

                  However, perhaps I should have been clearer in my original post: I am wondering whether there is a way to assess whether some methods are more accurate at modeling the data than others.
                  As I alluded to in my previous response, you're going to have trouble getting a concrete answer to this. You may notice that all these programs produce probability values; those express confidence that a result is correct, not ultimate confirmation that it is correct. Each program carries its own set of assumptions about the underlying data, and those assumptions probably have a reasonably good basis (or are equally bad), given the high degree of correlation you're observing.

                  You should choose the program that best fits your own hypothesis of how your data behaves (or is the most well-validated after other biological testing), and then go find some other ocean-dwelling dataset to put in the fry pan.



                  • #10
                    Originally posted by Anomilie View Post
                    Thanks for the responses.

                    Based on the literature, I would expect DESeq2, edgeR and NBPSeq to produce similar gene lists, as they are all count-based methods that model the data using the negative binomial distribution. They differ only in how they calculate normalization factors for library-size normalization and how the dispersion parameter is estimated. As for Cuffdiff, it has been reported to return a high number of false positives, so I am not surprised it found more genes than the count-based methods. This is a good review paper comparing the different methods: http://genomebiology.com/2013/14/9/R95
                    Keep in mind, though, that it is in fact the specifics of the data normalization in each case that will be the single largest contributor to differences in the statistical results. If the normalizations were truly equivalent, then the particular statistical test used would often be largely immaterial. And even starting from simple transcript or gene read counts, apparently subtle differences in normalization algorithms can translate into radical differences in statistical test results.
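
                    This is easy to see directly by comparing what the packages compute from the same count matrix (a sketch; m is a placeholder count matrix, and the two sets of factors live on different scales, so both are rescaled before plotting):

                    library(edgeR)
                    library(DESeq2)

                    # edgeR's TMM normalization factors and effective library sizes
                    tmm <- calcNormFactors(DGEList(counts = m))$samples$norm.factors
                    eff <- colSums(m) * tmm

                    # DESeq2's median-of-ratios size factors
                    sf <- estimateSizeFactorsForMatrix(m)

                    # rescale both to geometric mean 1 and compare
                    plot(eff / exp(mean(log(eff))), sf / exp(mean(log(sf))),
                         xlab = "edgeR (scaled)", ylab = "DESeq2 (scaled)")
                    abline(0, 1)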
                    Michael Black, Ph.D.
                    ScitoVation LLC. RTP, N.C.



                    • #11
                      Well, if you're looking for a different count-based method, why not try the others mentioned in that paper, for example baySeq (after all, it had the best correlation with the TaqMan assay)?
                      I had a go at running baySeq on my data set; however, as the algorithm includes a randomization step, the differentially expressed genes it returned were not consistent. For example, if I ran the code today and then ran the same code the next day, I would get two slightly different lists. Hence I decided not to proceed with the results from this method.
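
                      The randomization is the sampling step in the prior estimation, so fixing the RNG seed beforehand should make repeat runs identical (a sketch, run serially, reusing the cd object from the baySeq sketch above; with a parallel cluster, parallel::clusterSetRNGStream() plays the same role):

                      set.seed(42)  # fix the RNG so repeated runs return the same gene list
                      cd <- getPriors.NB(cd, samplesize = 1e5, estimation = "QL", cl = NULL)
                      cd <- getLikelihoods(cd, cl = NULL)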

                      ...apparently subtle differences in normalization algorithms can translate into radical differences in statistical test results
                      I agree that the normalization algorithms will have a large impact on the results; however, considering that no consensus has been reached on which normalization method is superior in which situations, I have just used the default method associated with each algorithm.


                      In regards to NBPSeq, I have discovered that it does not handle low-count data very well. For example, when using this method to determine the fold change of a gene with 0 counts for all samples in one group, the reported value is infinity.

                                 s1   s2   s3   t1   t2   t3   logFC   p.adj.value
                      GeneA      58   51   43    0    0    0   Inf     5.55E-40
                      It appears that DESeq2 does not report the genes where this occurs; a total of 2 counts across all samples seems to be the minimum for which it will call a gene significant.
                      edgeR does report this gene as significantly differentially expressed.

                                 s1   s2   s3   t1   t2   t3   logFC         p.adj.value
                      GeneA      58   51   43    0    0    0   8.653092142   1.13E-40
                      Has anyone come across this as well?

                      If I take the intersection of these 3 methods, genes with the above scenario will be excluded. However, there should be a distinction between comparing an average of 0 against an average of 4 and comparing an average of 0 against an average of 50.667: the former is most likely a false positive, since 4 reads could be misaligned, while the latter is most likely a true positive.

                      What are your opinions on this?
                      Last edited by Anomilie; 03-27-2014, 03:01 PM.



                      • #12
                        "It appears that DESeq2 does not report the genes where this is occurring, and a total of 2 counts for all samples seems the minimum for which it returns the genes to be significant. "
                        see "independent filtering" in the DESeq2 documentation.



                        • #13
                          Originally posted by Anomilie View Post
                          Hi All,

                          I have a question regarding differential expression analysis of RNA-seq data. I have a number of different samples. After alignment with Tophat2, I used either Cuffdiff or summarizeOverlaps to generate the count matrix, and then performed the differential analysis using DESeq2, edgeR and NBPSeq. [...]

                          Hi, how do you make the gene names uniform across the different methods?
                          I am doing RNA-Seq on bovine samples, with a reference genome from the Ensembl database.
                          As you know, the Tuxedo pipeline is Tophat2-Cufflinks-Cuffmerge-Cuffdiff.
                          After Cuffmerge, the merged.gtf gives you only gene names as output, not Ensembl IDs. Moreover, some of the Cuffdiff gene names are things like "Q865A2_BOVIN", which confused me, because that is not an official gene name.
                          Is there any method to make the gene names uniform (whether Ensembl ID, gene name, UCSC ID, etc.) across the different methods?
                          I think I could run Tophat2-Cuffdiff without Cufflinks+Cuffmerge, but I am not sure that is right. Is there any other way to unify them?
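
                          One option might be biomaRt, mapping whatever names Cuffdiff emits back to Ensembl IDs (a sketch; check the dataset/attribute names with listDatasets()/listAttributes(), and cuffdiff.genes is a placeholder vector of names):

                          library(biomaRt)
                          mart <- useMart("ensembl", dataset = "btaurus_gene_ensembl")  # cow
                          map <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
                                       filters    = "external_gene_name",
                                       values     = unique(cuffdiff.genes),
                                       mart       = mart)
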
                          Thank you!



                          • #14
                            Originally posted by Anomilie View Post
                            In regards to NBPSeq, I have discovered that it does not handle low-count data very well. For example, when using this method to determine the fold change of a gene with 0 counts for all samples in one group, the reported value is infinity.
                            The issue of very low count data has actually been discussed quite a bit in the literature. I would say the growing consensus is that you really should not include very low count data in your analysis at all. The statisticians I've worked with certainly feel that way.

                            The reason lies in the dynamics of the count distribution of RNA-seq data. Very low count data (say, counts < 10, though the actual cutoff is arbitrary; publications I have seen use thresholds anywhere from < 5 to < 25) has a much, much higher variance than higher-count data, so the estimated counts are much less reliable. Including counts of zero never makes sense to me, as all you know about that transcript is that you did not detect it; you know nothing at all about its actual relative abundance in your sampled tissues. And including features for which you failed to obtain any data at all may actually hurt your ability to detect significant differences among the data you do have reliable counts for, since you inflate the number of simultaneous tests you need to correct for.
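
                            In practice this is usually done with a filter along these lines before fitting (a sketch in edgeR; y is a placeholder DGEList, and both the CPM cutoff and the requirement of 3 samples, the smallest group size here, are arbitrary, as noted above):

                            library(edgeR)
                            keep <- rowSums(cpm(y) > 1) >= 3  # >1 CPM in at least 3 samples
                            y <- y[keep, , keep.lib.sizes = FALSE]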
                            Michael Black, Ph.D.
                            ScitoVation LLC. RTP, N.C.



                            • #15
                              Hello All,

                              I have extracted counts from cuff_data and I am having an issue when generating a DGEList using edgeR. It keeps saying --> Error: could not find function "DGEList", which is odd because I am specifying it. Any ideas? Thanks

                              source("http://bioconductor.org/biocLite.R")
                              biocLite("edgeR")
                              library('edgeR')

                              # extract counts from cuff_data
                              rep.gene.count.matrix<-repCountMatrix(genes(cuff_data))
                              head(rep.gene.count.matrix)

                              # make matrix integer
                              m <- ceiling(rep.gene.count.matrix)
                              m

                              # build DGEList
                              group <- c(1,1,1,2,2,2)
                              y <- DGEList(counts = m, group=c(1,1,1,2,2,2))
                              dim (y)

                              Error: could not find function "DGEList"

