**blancha** · 04-26-2014, 05:23 PM

This comparison is not valid, as you describe it.
You are comparing 2 sets of data processed differently, each belonging to different conditions.
You cannot distinguish differences between the 2 datasets due to the differences between the conditions, and differences due to the processing of the samples.

At the very least, you should rerun the bioinformatics analysis. If the same programs (and the same versions) with the same parameters are not used for both datasets, the comparisons will not be valid. I've run different versions of Cufflinks on the same dataset and got different FPKM values.

This will still not solve the issue that the experimental preparation will differ between the 2 datasets, therefore making it impossible to distinguish what differences in the counts are due to the experimental preparation and what differences are due to differences between the conditions.

You cannot correct for the batch effect if you only have 2 datasets, one for cancer patients and one for normal patients. ComBat would only work if both datasets had cancer and normal patients. You would then be able to distinguish differences between the datasets due to differences between the conditions and differences due to processing of the samples. In my opinion, it is not possible to run ComBat on these 2 datasets, one with only normal samples and the other with only cancer samples.
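The confounding described above can be sketched as a rank check on the design matrix. This is a minimal illustration, not how ComBat is implemented: the helpers `matrix_rank` and `design_matrix` and the 0/1 sample labels are hypothetical, chosen only to show that when each dataset holds a single condition, the batch column and the condition column are identical and cannot be separated.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(batches, conditions):
    """One row per sample: intercept, batch indicator, condition indicator."""
    return [[1, b, c] for b, c in zip(batches, conditions)]


# Confounded: dataset 0 holds only normal (0), dataset 1 holds only cancer (1).
confounded = design_matrix(batches=[0, 0, 1, 1], conditions=[0, 0, 1, 1])
# Balanced: each dataset holds both normal and cancer samples.
balanced = design_matrix(batches=[0, 0, 1, 1], conditions=[0, 1, 0, 1])

confounded_rank = matrix_rank(confounded)  # fewer than 3: batch and condition are one column
balanced_rank = matrix_rank(balanced)      # 3: all three effects can be estimated
print(confounded_rank, balanced_rank)
```

With three columns, a rank below 3 means no model can assign separate coefficients to batch and condition, whatever software is used.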

At least the normal and cancer patients cluster separately, and not according to the origin of the datasets.

To sum it up, this is a poorly thought-out comparison. Unfortunately, it is commonly requested by biologists with little experience analyzing next-generation sequencing data.

If you must do this comparison, only do it reluctantly and make it very clear that the comparison is not valid. There is no way of determining what differences are due to differences between the conditions (normal vs cancer), and what differences are due to differences between the processing of the samples, especially the wet lab portion. The bioinformatics processing of the samples can be redone so as to be identical for both datasets.

You might be able to draw some hypotheses for further experiments based on this comparison, but I still think it is a bad idea for the reasons I have outlined above.

**HeidiLee** · 04-29-2014, 01:26 PM

Thank you very much for your reply blancha.

I understand the many drawbacks of the comparison I described.
Now I am working on DNA methylation data. Again, we have methylation data only for cancer samples. For normal samples, we downloaded data from TCGA. I processed the cancer and normal data together, starting from .idat files (similar to .CEL files). In theory, since the methylation beta value is the methylated proportion, and both the methylated and unmethylated measurements are done on the same chip, in the same well, I thought the comparison would make some sense. But the p-values for most of the probes are very small.
If we use ComBat to adjust the 2 datasets, would it make enough sense? Here is what I think: the difference between the normal and cancer genomes is not a systematic difference, but the variance caused by batch is systematic. And ComBat only adjusts for those systematic differences between the 2 datasets. Of course this is not perfect, but I wonder if this could make enough sense.

**blancha** · 04-29-2014, 01:39 PM

You cannot use ComBat for this comparison.
You could only use ComBat if both datasets had both normal and cancer samples.
Forget the mathematics behind ComBat. Think about it conceptually. How can ComBat know which differences result from the different processing of the samples, and which differences result from the samples belonging to different conditions, normal and cancer?

This approach is a very common mistake, since researchers are so excited about all the online data available. For a comparison to be valid, both datasets must contain samples belonging to both conditions. This is even more important for microarray data, since microarrays are much more susceptible to the batch effect than next-generation sequencing.

**HeidiLee** · 04-29-2014, 01:43 PM

blancha: "How can ComBat know which differences result from the different processing of the samples, and which differences result from the samples belonging to different conditions, normal and cancer?"

The differences resulting from different disease conditions are not systematic differences, while the differences resulting from different processing are systematic. ComBat only corrects systematic differences.

**blancha** · 04-29-2014, 01:52 PM

That's what you want ComBat to do for you.
My question is how could ComBat do that for you when the different datasets only have samples belonging to one condition?
How can ComBat distinguish between the "systematic differences" and the "different disease conditions"?

Only if both datasets had samples belonging to both conditions, could you use ComBat to achieve your objectives.
For example, I've successfully used ComBat when 2 technicians had prepared cancer and normal microarrays differently.
Had one technician prepared all the normal microarrays and the other technician prepared all the cancer microarrays, there would have been no way of removing the batch effect.
Since both technicians had prepared both types of microarrays, I was able to use ComBat to determine what differences were due to the technician and what differences were due to the microarrays being normal or cancer samples.
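The technician example can be made concrete with a noise-free additive model. This is only a sketch under stated assumptions: the function names, the baseline of 5.0, and the effect sizes of 3.0 are all made up for illustration, and real data would add noise and probe-specific batch effects. It shows that a "pure biology" scenario and a "pure batch" scenario produce the same observed difference when each batch holds one condition, but different within-batch differences when both batches hold both conditions.

```python
def cell_mean(baseline, condition_effect, batch_effect, is_cancer, batch):
    """Noise-free expected value for one (condition, batch) cell of an additive model."""
    return baseline + condition_effect * is_cancer + batch_effect * batch


def confounded_difference(condition_effect, batch_effect, baseline=5.0):
    """Cancer minus normal when batch 0 is all normal and batch 1 is all cancer."""
    normal = cell_mean(baseline, condition_effect, batch_effect, is_cancer=0, batch=0)
    cancer = cell_mean(baseline, condition_effect, batch_effect, is_cancer=1, batch=1)
    return cancer - normal


def within_batch_difference(condition_effect, batch_effect, baseline=5.0):
    """Cancer minus normal, computed inside each batch and then averaged."""
    diffs = []
    for batch in (0, 1):
        normal = cell_mean(baseline, condition_effect, batch_effect, is_cancer=0, batch=batch)
        cancer = cell_mean(baseline, condition_effect, batch_effect, is_cancer=1, batch=batch)
        diffs.append(cancer - normal)
    return sum(diffs) / len(diffs)


# Two very different hypothetical truths.
all_biology = {"condition_effect": 3.0, "batch_effect": 0.0}
all_batch = {"condition_effect": 0.0, "batch_effect": 3.0}

# One condition per batch: both truths give the same observed difference.
print(confounded_difference(**all_biology), confounded_difference(**all_batch))

# Both conditions in both batches: the batch term cancels inside each batch.
print(within_batch_difference(**all_biology), within_batch_difference(**all_batch))
```

In the confounded layout, only the sum of the two effects is observable, so no adjustment can recover either one on its own.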

If I'm wrong, and it wouldn't be the first time, I'd love for someone to demonstrate that ComBat can actually be used to achieve your objectives.

**HeidiLee** · 04-29-2014, 01:57 PM

Originally posted by blancha:

That's what you want ComBat to do for you.
My question is how could ComBat do that for you when the different datasets only have samples belonging to one condition?
How can ComBat distinguish between the "systematic differences" and the "different disease conditions"?

Different disease conditions wouldn't produce systematic differences. The significant differences caused by disease are at specific genes or genomic locations. If you read the article about ComBat, you will see that ComBat does not, or is not intended to, adjust for this kind of difference.

**dpryan** · 04-29-2014, 02:15 PM

Batch effects aren't systematic: they won't affect all species and won't uniformly modulate the species they do affect. That's the crux of the issue. The best you might be able to do is use the literature or previous experiments to choose a set of representative species that you expect to be unaffected and then normalize according to those. This is similar in principle to how ERCC spike-ins work.
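Normalizing against a reference set could look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the feature IDs (`ref_a`, `gene_x`), the toy values, and the choice of a single multiplicative factor per sample based on the median of the reference features. A real analysis would need a reference set justified by prior evidence, and a scale appropriate to the data type (a multiplicative factor may not suit bounded values such as methylation beta values).

```python
from statistics import median


def reference_scale_factors(samples, reference_ids):
    """Per-sample factor that brings the median of the reference features to a common target."""
    ref_medians = {
        name: median(values[f] for f in reference_ids)
        for name, values in samples.items()
    }
    # Common target: the median of the per-sample reference medians.
    target = median(ref_medians.values())
    return {name: target / m for name, m in ref_medians.items()}


def normalize(samples, reference_ids):
    """Scale every feature in each sample by that sample's reference-derived factor."""
    factors = reference_scale_factors(samples, reference_ids)
    return {
        name: {f: v * factors[name] for f, v in values.items()}
        for name, values in samples.items()
    }


# Toy data: the cancer sample reads twice as high on every reference feature,
# which this sketch treats as a technical shift because the references are
# assumed to be biologically unaffected.
samples = {
    "normal_1": {"ref_a": 10.0, "ref_b": 20.0, "ref_c": 30.0, "gene_x": 8.0},
    "cancer_1": {"ref_a": 20.0, "ref_b": 40.0, "ref_c": 60.0, "gene_x": 48.0},
}
normalized = normalize(samples, ["ref_a", "ref_b", "ref_c"])
print(normalized["normal_1"]["gene_x"], normalized["cancer_1"]["gene_x"])
```

The result is only as trustworthy as the assumption that the reference features are unaffected by the condition, and it corrects only a shift shared by all features in a sample.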

**HeidiLee** · 04-29-2014, 02:20 PM

Originally posted by dpryan:

Batch effects aren't systematic: they won't affect all species and won't uniformly modulate the species they do affect. That's the crux of the issue. The best you might be able to do is use the literature or previous experiments to choose a set of representative species that you expect to be unaffected and then normalize according to those. This is similar in principle to how ERCC spike-ins work.

Do you mean "probe" by "species"?

If the batch effect is completely random, what does ComBat do?

**dpryan** · 04-29-2014, 02:30 PM

In your case (with methylation arrays), yes, but that's not universally the case. In RNAseq, a species would typically be a gene. On arrays it'd be a probe. In targeted resequencing it'd be a region.

**HeidiLee** · 04-29-2014, 04:58 PM

Originally posted by dpryan:

In your case (with methylation arrays), yes, but that's not universally the case. In RNAseq, a species would typically be a gene. On arrays it'd be a probe. In targeted resequencing it'd be a region.

Thank you very much dpryan.

Comparing 2 Sets of RNA-seq Data