  • Comparing 2 Sets of RNA-seq Data

    I have 2 sets of RNA-seq data.
    One is for cancer patients and is from one of our collaborators. The RNA-seq data is already processed, and all I have is normalized counts for each Ensembl ID for the cancer samples.
    The other one is also processed RNA-seq data, which I downloaded from the TCGA website. It provides normalized counts for each isoform (UCSC Gene) for normal samples.
    I need to identify differentially expressed genes between cancer and normal.
    I drew a cluster dendrogram of all samples (cancer and normal) from the original data, and the cancer samples and normal samples formed 2 large clusters. So I used ComBat to adjust for the batch effect and plotted the cluster dendrogram again. This time, the cancer and normal samples did not separate cleanly into 2 large clusters, but each subcluster was either all normal or all cancer.

    I wonder how people combine their own data and online data?

  • #2
    This comparison is not valid, as you describe it.
    You are comparing 2 sets of data processed differently, each belonging to different conditions.
    You cannot distinguish the differences between the 2 datasets that are due to the conditions from the differences that are due to the processing of the samples.

    At the very least, you should rerun the bioinformatics analysis. If the same programs (and the same versions) with the same parameters are not used for both datasets, the comparisons will not be valid. I've run different versions of Cufflinks on the same dataset and got different FPKM values.

    This will still not solve the issue that the experimental preparation will differ between the 2 datasets, therefore making it impossible to distinguish what differences in the counts are due to the experimental preparation and what differences are due to differences between the conditions.

    You cannot correct for the batch effect if you only have 2 datasets, one for cancer patients and one for normal patients. ComBat would only work if both datasets had cancer and normal patients. You would then be able to distinguish differences between the datasets due to differences between the conditions and differences due to processing of the samples. In my opinion, it is not possible to run ComBat on these 2 datasets, one with only normal samples and the other with only cancer samples.

    At least the normal and cancer patients cluster separately, and not according to the origin of the datasets.

    To sum it up, this is a poorly thought-out comparison. Unfortunately, it is commonly requested by biologists with little experience analyzing next-generation sequencing data.

    If you must do this comparison, only do it reluctantly and make it very clear that the comparison is not valid. There is no way of determining what differences are due to differences between the conditions (normal vs cancer), and what differences are due to differences between the processing of the samples, especially the wet lab portion. The bioinformatics processing of the samples can be redone so as to be identical for both datasets.

    You might be able to draw some hypotheses for further experiments based on this comparison, but I still think it is a bad idea for the reasons I have outlined above.
    Last edited by blancha; 04-26-2014, 05:27 PM.



    • #3
      Thank you very much for your reply blancha.

      I understand the many drawbacks of the comparison I described.
      Now I am working on DNA methylation data. Again, we have methylation data only for cancer samples. For normal samples, we downloaded data from TCGA. I processed the cancer and normal data together, starting from the .idat files (similar to .CEL files). In theory, since the methylation beta value is the methylated proportion, and both the methylated and unmethylated measurements are done on the same chip in the same well, I thought the comparison would make some sense. But the p-values for most of the probes are very small.
      If we use ComBat to adjust the 2 data sets, would it make enough sense? Here is what I think: the difference between the normal and cancer genomes is not a systematic difference, but the variance caused by batch is systematic. And ComBat only adjusts for those systematic differences between the 2 data sets. Of course this is not perfect, but I wonder if it could make enough sense.
      Last edited by HeidiLee; 04-29-2014, 01:32 PM.



      • #4
        You cannot use ComBat for this comparison.
        You could only use ComBat if both datasets had both normal and cancer samples.
        Forget the mathematics behind ComBat. Think about it conceptually. How can ComBat know which differences result from the different processing of the samples, and which differences result from the samples belonging to different conditions, normal and cancer?
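
        The conceptual question can be made concrete with a small simulation. This is only an illustrative sketch, not ComBat itself: the baseline of 5.0, the effect size of 2.0, the noise level, and the function name `simulate` are all assumptions chosen for the example. When every cancer sample comes from one dataset and every normal sample from the other, a purely biological shift and a purely technical shift produce exactly the same numbers.

```python
import math
import random

random.seed(0)

# Hypothetical layout: 3 normal samples from dataset 0, 3 cancer samples from dataset 1.
# Condition and batch labels are identical, i.e. perfectly confounded.
condition = [0, 0, 0, 1, 1, 1]
batch = list(condition)

# One fixed noise draw per sample, shared by both scenarios.
noise = [random.gauss(0, 0.1) for _ in condition]


def simulate(condition_effect, batch_effect):
    """Simulated values for one feature under an additive model (an assumption)."""
    return [
        5.0 + condition_effect * c + batch_effect * b + e
        for c, b, e in zip(condition, batch, noise)
    ]


all_biology = simulate(condition_effect=2.0, batch_effect=0.0)
all_technical = simulate(condition_effect=0.0, batch_effect=2.0)

# The two scenarios cannot be told apart from the data alone.
identical = all(math.isclose(x, y) for x, y in zip(all_biology, all_technical))
print("Scenarios give the same data:", identical)
```

        Any method that sees only these values has no basis for choosing between the two explanations, which is the point being made about single-condition datasets.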

        This approach is a very common mistake, since researchers are so excited about all the online data available. For a comparison to be valid, both datasets must contain samples belonging to both conditions. This is even more important for microarray data, since microarrays are much more susceptible to the batch effect than next-generation sequencing.



        • #5
          blancha: "How can ComBat know which differences result from the different processing of the samples, and which differences result from the samples belonging to different conditions, normal and cancer?"

          The differences that result from different disease conditions are not systematic, while the differences that result from different processing are systematic. ComBat only corrects systematic differences.



          • #6
            That's what you want ComBat to do for you.
            My question is how could ComBat do that for you when the different datasets only have samples belonging to one condition?
            How can ComBat distinguish between the "systematic differences" and the "different disease conditions"?

            Only if both datasets had samples belonging to both conditions, could you use ComBat to achieve your objectives.
            For example, I've successfully used ComBat when 2 technicians had prepared cancer and normal microarrays differently.
            Had one technician prepared all the normal microarrays and one technician prepared all the cancer microarrays, there would be no way of removing the batch effect.
            Since both technicians had prepared both types of microarrays, I was able to use ComBat to determine what differences were due to the technician and what differences were due to the microarrays being normal or cancer samples.
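
            The two technician scenarios can be compared by looking at the design matrix of a linear model with an intercept, a condition column, and a batch column. The sketch below is an illustration under assumed sample layouts (6 arrays, hypothetical 0/1 labels), with a small hand-written rank function so it needs no external libraries. It is not how ComBat is implemented; it only shows why the confounded layout leaves the batch term inseparable from the condition term.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(condition, batch):
    """Columns: intercept, condition (0 = normal, 1 = cancer), batch (technician 0 or 1)."""
    return [[1, c, b] for c, b in zip(condition, batch)]


# One technician did all normal arrays, the other did all cancer arrays.
confounded = design_matrix(condition=[0, 0, 0, 1, 1, 1], batch=[0, 0, 0, 1, 1, 1])
# Both technicians prepared both types of arrays.
balanced = design_matrix(condition=[0, 1, 0, 1, 0, 1], batch=[0, 0, 0, 1, 1, 1])

rank_confounded = matrix_rank(confounded)
rank_balanced = matrix_rank(balanced)
print("confounded rank:", rank_confounded, "of 3 columns")
print("balanced rank:", rank_balanced, "of 3 columns")
```

            In the confounded layout the condition and batch columns are identical, so the matrix has fewer independent columns than terms in the model and the two effects cannot be estimated separately. In the balanced layout all three columns are independent.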

            If I'm wrong, and it wouldn't be the first time, I'd love for someone to demonstrate that ComBat can actually be used to achieve your objectives.



            • #7
              Originally posted by blancha
              That's what you want ComBat to do for you.
              My question is how could ComBat do that for you when the different datasets only have samples belonging to one condition?
              How can ComBat distinguish between the "systematic differences" and the "different disease conditions"?

              A different disease condition wouldn't produce systematic differences. The significant differences caused by disease are at specific genes or genomic locations. If you read the article about ComBat, you will see that ComBat does not, or is not intended to, adjust for this kind of difference.
              Last edited by HeidiLee; 04-29-2014, 02:00 PM.



              • #8
                Batch effects aren't systematic: they won't affect all species and won't uniformly modulate the species they do affect. That's the crux of the issue. The best you might be able to do is use the literature or previous experiments to choose a set of representative species that you expect to be unaffected and then normalize according to those. This is similar in principle to how ERCC spike-ins work.
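
                A minimal sketch of that idea, assuming log-scale values in a features-by-samples table and a hand-picked list of reference rows (both hypothetical here, as is the function name `normalize_to_reference`). Each sample is shifted so that its reference features have a median of zero. This removes only a per-sample offset, so it would not address the non-uniform effects described above.

```python
from statistics import median


def normalize_to_reference(log_values, reference_rows):
    """Subtract, per sample, the median of the reference features (log scale)."""
    n_samples = len(log_values[0])
    offsets = [
        median(log_values[r][s] for r in reference_rows)
        for s in range(n_samples)
    ]
    return [[v - offsets[s] for s, v in enumerate(row)] for row in log_values]


# Made-up values: rows are features, columns are samples.
# Samples 0 and 2 stand for one dataset, samples 1 and 3 for the other.
log_values = [
    [5.0, 6.0, 5.5, 6.5],  # reference feature, assumed unaffected by condition
    [3.0, 4.0, 3.5, 4.5],  # reference feature
    [7.0, 8.0, 7.5, 8.5],  # reference feature
    [2.0, 5.0, 2.5, 5.5],  # feature of interest
]
reference_rows = [0, 1, 2]

normalized = normalize_to_reference(log_values, reference_rows)
for row in normalized:
    print(row)
```

                The result depends entirely on the reference set really being unaffected by the condition, which has to be justified from the literature or earlier experiments.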



                • #9
                  Originally posted by dpryan
                  Batch effects aren't systematic: they won't affect all species and won't uniformly modulate the species they do affect. That's the crux of the issue. The best you might be able to do is use the literature or previous experiments to choose a set of representative species that you expect to be unaffected and then normalize according to those. This is similar in principle to how ERCC spike-ins work.
                  Do you mean "probe" by "species"?

                  If the batch effect is completely random, what does ComBat do?
                  Last edited by HeidiLee; 04-29-2014, 02:28 PM.



                  • #10
                    In your case (with methylation arrays), yes, but that's not universally the case. In RNAseq, a species would typically be a gene. On arrays it'd be a probe. In targeted resequencing it'd be a region.



                    • #11
                      Originally posted by dpryan
                      In your case (with methylation arrays), yes, but that's not universally the case. In RNAseq, a species would typically be a gene. On arrays it'd be a probe. In targeted resequencing it'd be a region.
                      Thank you very much dpryan.
