  • RNA-seq analysis of DEGs by different software?

    I recently got some RNA-seq data for differential expression analysis. Because there are no biological replicates, I tried several programs (edgeR, DESeq and DEGseq), and they gave different results.

    For edgeR: I got about 8000 DEGs using the scripts in the manual, filtered by logFC > 1 or < -1.

    For DEGseq: I got about 7000 DEGs using the scripts in the manual, filtered by logFC > 1 or < -1; the method used is MARS.

    For DESeq: I only got about 250 DEGs based on padj < 0.05.

    The differences between the programs are large: edgeR and DEGseq gave more genes than I expected, and DESeq gave fewer than I expected. So the old question comes up again: which method is good for finding DEGs in RNA-seq data without biological replicates? Can anyone summarize the standard criteria for filtering genes with each program? Or does Fisher's exact test (R built-in function) work better for data without biological replication?

    I have just begun to learn how to do this analysis, and it is quite new to me. I would very much appreciate any suggestions.

  • #2
    1) There is no good method for DEGs without Biological reps. DEG analysis needs Bio reps.

    2) How do you know that there are too many or too few genes? One of the advantages of genome scale experiments is that if they are done right, you should be avoiding this sort of a priori reasoning, compared to other methods like qPCR.

    3) Why are you not filtering the edgeR and DEGseq results by FDR-corrected p-values? I bet most of those 8000 genes have a p-value > 0.05 and are not significant, but then without biological reps, there is no good way to know.
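
As a rough illustration of what filtering by FDR-corrected p-values means, here is a minimal Python sketch of the Benjamini-Hochberg adjustment. The p-values are made up for the example, and the function name `benjamini_hochberg` is hypothetical; it is not taken from edgeR, DESeq or DEGseq.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for five genes (illustration only).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
padj = benjamini_hochberg(raw_p)

# Genes that pass a 5% FDR cutoff.
significant = [i for i, q in enumerate(padj) if q < 0.05]
print(padj)
print(significant)
```

In this toy example four genes have a raw p-value below 0.05, but only the first two still pass once the p-values are adjusted.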

    My suggestion is to go to the PI and say, there needs to be Bio reps. Then you can really do some analysis. I've been there. I have told my Boss before, hey we need more replicates. I was nervous that he might not like the answer, but it had to be done.
    Last edited by chadn737; 06-26-2012, 02:37 PM.

    • #3
      Thank you very much for your reply, chadn737!

      1. I agree it is good to have biological replicates for the analysis. The problem is that I have quite a few samples, so multiple replicates would be very expensive. I can try asking the PI for one more replicate. But I have heard that there can be big differences between biological replicates, so it would be difficult to analyze them if that is true.

      2. I agree with your point: it is hard to say whether there are too many or too few DEGs. When I went back to the raw data after the DESeq analysis, I found that many genes with fold change > 50 were filtered out even though they have a good number of reads, e.g. 10 in one sample and 500 in the other. I guessed those should be real DEGs, but maybe I am wrong. I am puzzled about how to decide what counts as a DEG based on the analysis.

      3. For the DEGs from DEGseq, I filtered by q-value (Benjamini) < 0.001 and logFC > 1 or < -1. For edgeR, the output file has no q-value or FDR column, so I filtered in Excel. Does anyone know how to get the q-value or FDR from edgeR?

      Thank you very much.

      • #4
        1) It will depend on the nature of your samples. Biological variation is typically greater than technical variation, but biological variation between inbred lines raised under controlled conditions will be much less than, say, between samples from two genetically diverse patients given the same treatment. It does not complicate the analysis, though. All three of these methods (DESeq, edgeR, DEGseq) are designed to work with biological replicates. They are more powerful with biological reps because then you can at least estimate the variance, and that makes the results meaningful. It may be that the variation is such that very few genes are differentially expressed. But what you have to remember is that you are trying to find biologically meaningful results that will hold up if anyone reproduces your work. It's a very bad thing to publish results that cannot be reproduced. Even if all you can do is convince the PI to add one biological replicate, that is far, far better than having none. It will cost more, yes. But it's a one-time cost and far cheaper than paying people (like you) to waste their time chasing after a result that may be wrong. I would also point that out to the PI. You will have a very hard time publishing without replicates these days, and it's better to plan for them up front than to find that out later when you try to publish.

        2) The fewer the reads in a sample, the greater the variance relative to the mean. In the DESeq paper and vignettes they refer to this as the shot noise. I am not sure how DESeq works without bio reps; that is a question for Simon Anders. But I'm thinking that without any replication, DESeq may not assign any significance to DEGs when one of the samples has a very low read count. This is where having just that one additional biological rep will help, because it will help DESeq estimate the variance, and some of those large changes may actually have a significant p-value when you add in replicates.

        3) It's been a while since I have used edgeR. Look for the topTags() function or something along those lines.
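
To put rough numbers on the shot-noise point in 2), here is a small Python sketch. It assumes read counts behave like Poisson counts (variance equal to the mean), which covers sampling noise only and says nothing about biological variability. The counts 10 and 500 are the ones mentioned earlier in the thread; the function name `poisson_cv` is hypothetical.

```python
import math


def poisson_cv(mean_count):
    """Coefficient of variation (sd / mean) of a Poisson-distributed count."""
    # For a Poisson count the variance equals the mean, so sd = sqrt(mean).
    return math.sqrt(mean_count) / mean_count


for count in (10, 500):
    print(f"{count} reads: expected relative noise ~{poisson_cv(count):.1%}")
```

Under this assumption a gene with about 10 reads has a relative sampling noise of roughly 32%, against roughly 4.5% for a gene with about 500 reads.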

        Another thing to consider: when you filter DESeq results by padj, it sets a cutoff of 1.2/-1.2 or 1.15/-1.15 fold change (can't remember off the top of my head which). This probably explains a lot of why DEGseq and edgeR give you so many DEGs. The way you are filtering, you are including anything with a 1/-1 fold change. That means you will be including genes that have only a 1.01 or so change. Such a small fold change is questionable even with biological reps, and I wouldn't be surprised if many of the genes in those two lists fall into that category. You may want to raise that cutoff and see what happens. I bet the results will be more meaningful.
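
When comparing cutoffs it helps to check which scale they are on. The sketch below assumes the logFC values are log2 fold changes (edgeR's logFC column is on a log2 scale); under that assumption a cutoff of logFC > 1 or < -1 corresponds to at least a two-fold change. The helper names are hypothetical, and the counts 10 and 500 are the example from earlier in the thread.

```python
import math


def fold_change_from_log2(log2_fc):
    """Convert a log2 fold change back to a plain ratio."""
    return 2.0 ** log2_fc


def log2_fold_change(count_a, count_b):
    """log2 ratio of count_b over count_a (both must be positive)."""
    return math.log2(count_b / count_a)


print(fold_change_from_log2(1.0))            # ratio implied by logFC = 1
print(round(log2_fold_change(100, 101), 4))  # logFC of a 1.01-fold change
print(round(log2_fold_change(10, 500), 2))   # logFC of the 10 vs 500 example
```

A 1.01-fold change has a log2 fold change of about 0.014, so it would not pass a log2 cutoff of 1, while the 10 versus 500 read example has a log2 fold change of about 5.64.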
        Last edited by chadn737; 06-26-2012, 04:02 PM.

        • #5
          Thank you so much, chadn737! What you said gives me the confidence to ask my PI for another replicate.

          I still have one question: I checked some papers published earlier, and they did not have biological replicates for the RNA-seq experiment because it was more expensive at that time. I wonder how they did the DEG analysis?

          • #6
            It is good to have biological replicates to do the analysis. Just I have a little bit more samples, if have multiple replications, it will be very expensive. I can try to ask the PI to do one more replicate. But I heard that there is big difference from different biological replicate. So it will be difficult to analyze them if that is the true.
            As you are new to the field, let me spell it out again: 1. Replication is not "good to have" but essential. An experiment without replicates is flawed, unlikely to give correct results and hence not publishable. 2. The minimum number of replicates is 2. This is not "very expensive". 3. Replicates do not increase sequencing costs at all. You simply do everything in triplicate but sequence each sample to only a third of the depth by putting several samples in one lane with multiplexing. You have three times the sample prep cost but the sequencing cost stays the same.

            Originally posted by beefeng123
            I still have one question: I checked some papers published earlier, they did not have biological replicates for the RNA-seq experiment because it is more expensive at that time. I wonder how they did the DEGs analysis?
            Very simple: They did it incorrectly. All these papers are simply plain wrong, and it is an embarrassment that they ever got published in respectable journals, and so confuse newcomers like you.

            Simon

            • #7
              Hi Simon,
              I totally agree with your points and replicates are absolutely necessary to make any conclusions. Just an addendum though - adding more samples (replicates or not) DOES increase the sequencing cost (sometimes dramatically) because most centres have a pricing model where the sequencing cost is fixed (and cheap) while the library prep costs are per-sample (and expensive).

              It's very difficult to add more samples and "split" the run for cost and coverage because individual barcoded libraries are required for pooled samples.

              • #8
                Hi Jean

                this is an interesting point, and surprises me a bit. My wet-lab colleagues usually barcode their samples themselves and then give the already pooled sample to the core facility for the final library prep steps. I understand that they do not use Illumina's multiplexing kit, but rather some other (but also commonly used) protocol and that this makes things cheaper because one has to buy only one sample prep kit per pooled sample from Illumina. Our core facility does the same if you ask them to do the multiplexing for you.

                Maybe our core facility is ahead of the pack in such matters, but I hope that market forces will soon weed out unreasonable pricing schemes such as the ones you describe.

                Simon

                • #9
                  Jean

                  I don't know what your core charges per sample, but ours charges $250 per prep. A simple DE experiment with 3 reps per condition would cost $1500 in library preparation. 2 reps/cond would cost $1000

                  That sounds expensive, except... it's a one-time cost and cheaper than the monthly salary of a grad student or postdoc.

                  So... spend a little money up front and get the replicates, or skimp and then spend the next several months paying far more for a grad student or postdoc to chase after results that cannot be published.

                  That's how I see it. It's always cheaper to do the experiment right the first time, because salaries are the biggest expense and bad experiments only waste time.

                  • #10
                    Originally posted by Simon Anders
                    As you are new to the field, let me spell it out again: 1. Replication is not "good to have" but essential. An experiment without replicates is flawed, unlikely to give correct results and hence not publishable. 2. The minimum number of replicates is 2. This is not "very expensive". 3. Replicates do not increase sequencing costs at all. You simply do everything in triplicate but sequence each sample to only a third of the depth by putting several samples in one lane with multiplexing. You have three times the sample prep cost but the sequencing cost stays the same.

                    Very simple: They did it incorrectly. All these papers are simply plain wrong, and it is an embarrassment that they ever got published in respectable journals, and so confuse newcomers like you.

                    Simon
                    Hi Simon,
                    Thank you so much for your reply. It is very important to have replication, and I am planning to do that soon.
                    As for the price, most of the cost is actually the library preparation. We sent RNA to the sequencing facility to make the libraries, which cost about $560 each. Running one lane costs about $1200. I have 12 samples, multiplexed in one lane, and I ran two lanes for more depth. So one experiment cost me about $9000, and one more biological replicate would cost another $9000. That is why I did not have biological replicates at the beginning.
                    I still have one question: what is the difference in how DESeq, edgeR and DEGseq calculate padj or q-values? Apart from the lack of biological replicates, why do these programs give such different DEG results? Thank you very much.
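
The cost figures quoted in this post can be checked with a few lines of Python. This is only a sketch of the arithmetic: the per-library and per-lane prices are the ones stated above, the function name `experiment_cost` is hypothetical, and keeping the lane count fixed when doubling the samples is an assumption borrowed from the multiplexing suggestion earlier in the thread.

```python
def experiment_cost(n_libraries, n_lanes, prep_cost=560, lane_cost=1200):
    """Total cost in dollars: per-library prep plus per-lane sequencing."""
    return n_libraries * prep_cost + n_lanes * lane_cost


# 12 libraries on 2 lanes, as described in the post.
single = experiment_cost(12, 2)

# One extra biological replicate per sample (24 libraries), same 2 lanes,
# i.e. each library is sequenced to about half the depth.
replicated_same_lanes = experiment_cost(24, 2)

# Same 24 libraries, but doubling the lanes to keep the per-library depth.
replicated_more_lanes = experiment_cost(24, 4)

print(single, replicated_same_lanes, replicated_more_lanes)
```

With these prices the original design comes to $9,120, in line with the roughly $9,000 quoted. Adding a replicate costs $6,720 more if the lanes are shared and $9,120 more if the lanes are doubled, so the library preparation dominates either way.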

                    • #11
                      Simon is correct that replicates are necessary. However his comment that data without replicates is useless is an exaggeration, and the idea that published papers without replicates are an embarrassment is hysterical hyperbole.

                      First of all, there is plenty of value in RNA-Seq that has nothing to do with differential expression. For example, it can be used for assembly.

                      Second, many studies _with_ replicates have shown that for a large range of expression values biological variability is small compared to "shot noise", so that ignoring the variability that can be estimated with replicates is not necessarily "wrong". Even leaving this argument aside, there are plenty of cases where change in expression is so dramatic (e.g. an exon is not used at all in one condition, is present and very abundant in another), that replicates are not needed.

                      It's also untrue that getting replicates is not expensive. It depends on the experiment and how much RNA is available. "Doing everything in triplicate" can be very non-trivial. Having said that, obviously replicates are preferable and needed in many applications. But RNA-Seq can still be useful without them.

                      To answer the original question, if you'd like to try another program that may give fewer false positives, try Cuffdiff. It's also possible with CummeRbund to do a careful quality assessment afterwards to see what happened on individual genes.
                      Sincerely,
                      Lior

                      • #12
                        Second, many studies _with_ replicates have shown that for a large range of expression values biological variability is small compared to "shot noise"
                        Really? Aren't you thinking of technical replicates now? I know I have several data sets I am working on now (many of them human clinical samples) with very substantial differences between biological replicates but where technical replicates are highly similar. And the amount of variation is very different in different experiments. I don't see how you can draw valid conclusions without doing biological replicates.

                        Even leaving this argument aside, there are plenty of cases where change in expression is so dramatic (e.g. an exon is not used at all in one condition, is present and very abundant in another), that replicates are not needed.
                        I wouldn't publish such a finding without replicates in a case-control study. How do you know that this behavior is consistently different between the conditions independent of the individual?

                        Assembly is different, of course.

                        • #13
                          It's interesting to see this topic come up again and the strong responses it always seems to generate. I think you would find it hard to find anyone to argue the case that it would not be preferable to have multiple replicates in any experiment, but it's also hyperbole to say that an RNA-Seq experiment with no replicates is intrinsically flawed and should hence be unpublishable. I have a dataset at the moment where we currently only have one set of data per condition (replicates are in process but we've looked at this data), and we have results which are so startlingly clear there's no conceivable chance that they're just happening through any kind of noise. However, I would say there are caveats to add to my statement:
                          1. If the only conclusion from your data is that some small set of genes is changing expression, with no corroborating evidence, then a single replicate isn't enough. You could either do more replicates, or validate with qPCR or some other orthogonal technique. If what you see is a coordinated change in a set of genes linked by something other than expression (a functional cluster or a set defined by some other type of experiment), then a single replicate can give you very convincing results. I guess the point I'm making is that if the conclusions you're drawing come from the combined expression of several genes then these are effectively replicates, since the chances of them changing through noise should be independent.
                          2. Even on a single replicate you can still do a statistical analysis taking the level of observation into account. In RNA-Seq the level of noise and the level of observation are very strongly connected. If your analysis takes this into account you can perform a reasonable statistical analysis even on single replicates. These types of analysis don't tell you the same things as analyses spanning biological replicates and they lack the power of those experiments, but they shouldn't be dismissed out of hand. On the other hand, analyses just using fold change or something similar are a waste of time, especially as there are better ways to do this.
                          3. At least as important as replication is decent quality control and data exploration. We have all sorts of instances of biases in RNA-Seq data which result from technical artefacts either in the library preparation or sequencing steps. Having a good experimental setup is crucial in getting believable results, especially as some biases can produce very convincing statistically significant differences. It seems relatively rare to actually visualise the basic quantitation of your RNA-Seq data, but this can help you to quickly spot whether something weird is happening in your data.
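
One example of a single-replicate test that takes the level of observation into account, as in point 2, is Fisher's exact test on a gene's read counts against the library totals, which the original question also asked about. The Python sketch below is an illustration under stated assumptions, not the method used by DESeq, edgeR or DEGseq: the function names are hypothetical, the library sizes of 10 million reads are made up, and the test addresses sampling noise only, not biological variability.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(gene_a, lib_a, gene_b, lib_b):
    """Two-sided Fisher's exact test for one gene in two libraries.

    gene_a, gene_b: reads assigned to the gene in samples A and B.
    lib_a, lib_b:   total reads in samples A and B.
    """
    total_gene = gene_a + gene_b
    log_denominator = log_comb(lib_a + lib_b, total_gene)

    def prob(x):
        # Hypergeometric probability that x of the gene's reads fall in A.
        return math.exp(log_comb(lib_a, x)
                        + log_comb(lib_b, total_gene - x)
                        - log_denominator)

    lowest = max(0, total_gene - lib_b)
    highest = min(total_gene, lib_a)
    p_observed = prob(gene_a)
    # Sum every outcome that is no more likely than the observed one.
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


# 10 vs 500 reads for one gene; library sizes are hypothetical.
print(fisher_exact_two_sided(10, 10_000_000, 500, 10_000_000))
```

A small p-value here only says that the difference between these two libraries is larger than sampling noise would explain. It cannot say whether the difference would hold across other biological samples from the same conditions.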

                          • #14
                            Biological replicas and deSeq

                            Hi guys,

                            wrt biological replicas I can only reiterate the messages above: biological replicas are necessary. The lists of genes you generate without biological replicas will contain a large number of false positives.

                            DESeq includes a nice trick to deal with a lack of replicas, which is to guess the variance in some samples based on the variance in all the samples. However, this trick is most useful when you have good replicas for some biological conditions and have "lost" the replica in one condition, so that you "emulate" the biological variance in that particular condition. It is not something that will help you avoid the overall need for biological replicas.

                            On a different note, yes, unfortunately different software packages for RNA-Seq DEG analysis will still often yield quite different results. This is partly a sign of how much the technology and software still need to mature. However, the stronger the experiment, the more the different packages will tend to converge...

                            Elia
                            --------------------------------------
                            Elia Stupka
                            Co-Director and Head of Unit
                            Center for Translational Genomics and Bioinformatics
                            San Raffaele Scientific Institute
                            Via Olgettina 58
                            20132 Milano
                            Italy
                            ---------------------------------------

                            • #15
                              Originally posted by eslondon
                              Hi guys,

                              wrt to biological replicas I can only reiterate the messages above, biological replicas are necessary. The lists of genes you generate without biological replicas will contain a large number of false positives.
                              Actually, our experience (at least with the statistics we use), is that lists generated without biological replicates are more likely to contain no hits at all (which is as it should be). If your test is predicting large numbers of changing genes from a single sample then there's probably something wrong with the test.

                              The point we should be stressing here is the distinction between biological and technical noise. A single sample, given the right analysis, should be able to provide you with a list of genes which were measured as changing in *that* sample. What it can't tell you is whether the change that you measured was representative of what that gene was doing in the whole population of biological samples under the same condition as yours. It's perfectly possible (and correct) for you to get data which says that a gene was changing in one sample, but that it wasn't changing in the biological condition from which it came overall.

                              RNA-Seq provides enough data that you can be pretty confident in the technical measures you make (or at least be able to make a good estimation of the quality of measure that you've made). Replicates are there to help you to assess biological variability - something you can't possibly see in a single sample.
