**Simon Anders** · 02-28-2013, 12:09 AM

Sigh. Well, at least you ask before doing the experiment and ruining your project. No, the unequal sample sizes are not your problem.

But how would you ever know whether an observed difference is statistically significant, i.e., large compared to what you observe between samples treated the same way, if you don't know how strong the differences between samples in the same treatment group are?

Maybe I'm in a bad mood because it's early in the morning, but as you are the n-th person to ask this question here: I still don't get it. Why would anyone even think about pooling samples without multiplexing? I have met people who claimed to know that the differences between equally treated samples are so small that they don't need to check, but curiously, these are only people who have never done such an experiment.

**shocker8786** · 02-28-2013, 10:35 AM

Thank you for your reply. I'm new to NGS analysis, so I may have this wrong, but my understanding was that when comparing differentially methylated sites between groups your statistics are based on comparing the number of methylated/unmethylated reads for each group.

For example, you have a region where 50 reads are aligned in both pools. You would then determine statistical significance by comparing the methylated and unmethylated read counts of the two pools at that region.

I was under this assumption based on the paper below.

http://www.pnas.org/content/110/6/2354.short

The sentence below was taken from the supplemental methods, where they explain how statistical significance was determined between the two cell lines.

"For each methylation region, statistical significance of differential methylation was calculated using a Fisher's exact test on a 2 × 2 contingency table of methylated and nonmethylated counts in the two cell lines."

The way I interpret that is the reads are what give you statistical significance. If I'm mistaken would you be able to explain what I am missing? Thank you very much for your help, I really appreciate it!

**Simon Anders** · 02-28-2013, 10:51 AM

Short answer: Using Fisher's exact test for this purpose is wrong. I don't have much time at the moment to look at it in detail, but the paper's analysis is most likely seriously flawed.

Imagine you have 2 treated and 2 control samples:

Control 1: 10 of 50 reads methylated
Control 2: 30 of 50 reads methylated
Treatment 1: 20 of 50 reads methylated
Treatment 2: 40 of 50 reads methylated

So, on average, the methylation goes up by 10 reads, but between two samples within the same treatment group, the difference is 20 reads. Would you believe that this increase in methylation by 10 reads is due to the treatment? I'd rather say it is due to the same random variation that you see within a group. Next time you do the same experiment, you might get the opposite result if things vary so much.

Now, imagine you pooled the samples, so you see only the averages:

Control: 20 of 50 reads methylated
Treatment: 30 of 50 reads methylated

Now you no longer know that there was a difference of 20 between replicates, and might think that an increase by 10 is a lot. Fisher's exact test cannot know this either, which is why it is wrong to use this test.
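To make this concrete, here is a minimal Python sketch (not taken from the paper or from any methylation package). It implements a two-sided Fisher's exact test with the standard library only and applies it to pooled counts from two designs: the noisy replicates above, and an invented "tight" design (19 and 21 vs. 29 and 31 of 50) that has the same averages. The function names and the tight design are assumptions made purely for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x methylated reads in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def pool(methylated_counts):
    """Average methylated count across replicates: all that a pool retains."""
    return sum(methylated_counts) // len(methylated_counts)


DEPTH = 50  # reads per sample at the region, as in the example above

designs = {
    "noisy": {"control": [10, 30], "treatment": [20, 40]},
    "tight": {"control": [19, 21], "treatment": [29, 31]},  # hypothetical
}

results = {}
for name, d in designs.items():
    ctrl, trt = pool(d["control"]), pool(d["treatment"])
    p = fisher_exact_two_sided(ctrl, DEPTH - ctrl, trt, DEPTH - trt)
    spread = max(d["control"]) - min(d["control"])
    results[name] = (p, spread)
    print(f"{name}: within-group spread = {spread} reads, pooled Fisher p = {p:.3f}")
```

Both designs print the same p-value, because after pooling the test only ever sees 20/50 versus 30/50. The within-group spread (20 reads in one design, 2 in the other) never enters the calculation.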

The advantage of pooling is, of course, precisely that you do not see that your results are unlikely to be reproducible, and hence are not discouraged from writing a paper anyway. The fact that referees still fail to spot this elementary mistake seems to help.

**microgirl123** · 02-28-2013, 10:51 AM

I think what Simon is trying to say doesn't relate to NGS sequencing specifically. It relates to any set of samples you are trying to perform statistics on and get meaningful results. Basically, you cannot statistically compare two things unless you have replicates (n must be greater than 1 in your statistics formulas!). If you pool all your samples together into two groups, then you can't perform statistics because you only have one of each of two things (n=1).

You should index each of your 4 samples for Treatment A and each of your 5 samples for Treatment B before pooling. Then you can perform your NGS analysis on the pooled sample and see how the differences between samples in Treatment A compare to the differences between samples in Treatment B.
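In code terms, the n=1 problem looks roughly like this. The sketch uses Python's standard `statistics` module and borrows Simon's control numbers as methylation fractions; the variable names are just illustrative.

```python
import statistics

replicates = [10 / 50, 30 / 50]   # two indexed control samples
pooled = [20 / 50]                # the same material sequenced as one pool

# With two or more values, a sample variance can be estimated.
replicate_variance = statistics.variance(replicates)
print("variance from replicates:", replicate_variance)

# With a single value, the sample variance is undefined.
try:
    pooled_variance = statistics.variance(pooled)
except statistics.StatisticsError as err:
    pooled_variance = None
    print("no variance estimate from a single pool:", err)
```

Without a variance estimate there is nothing to compare the between-group difference against, which is the whole point of indexing before pooling.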

**shocker8786** · 02-28-2013, 11:28 AM

Thank you very much for taking the time to explain, I understand what you are saying now. I cannot remember why the decision to pool was originally made, but your argument against it makes perfect sense. I'm definitely going to talk with my group about reconsidering our experimental design.

Thanks again!

**Simon Anders** · 02-28-2013, 11:29 AM

Originally posted by microgirl123:

I think what Simon is trying to say doesn't relate to NGS sequencing specifically.

Of course. But NGS is one of the few fields where people don't know this and nevertheless routinely get papers into high-ranking journals, which then causes newcomers to think that this is how it should be done.

**Rick_R** · 09-20-2013, 11:46 AM

I know this is many months after the original post, but I would like to pose a similar question.

I work with cell lines, and can therefore produce many biological replicates. However, the cost of sequencing them all separately would be too high. One could sequence, say, 6 samples:
1. Control A
2. Control B
3. Control C
4. Treatment A
5. Treatment B
6. Treatment C

Might it be better to sequence this instead:
1. Control A + Control B
2. Control C + Control D
3. Control E + Control F
4. Treatment A + Treatment B
5. Treatment C + Treatment D
6. Treatment E + Treatment F

Is this a reasonable way to reduce the "noise" from biological variability/random variation while maintaining the number of samples sequenced?

**Simon Anders** · 09-20-2013, 11:54 AM

Yes, it is.

It's still worth double-checking whether multiplexing really is that expensive: Even if you want to use only one lane for two samples, you can still gain information by marking the fragments from each sample with a barcode. You don't pay more for the sequencing, but you do pay extra for the steps up to the barcode ligation because they cannot be performed in a pooled fashion.
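As a rough illustration of what the barcode buys you, here is a toy demultiplexing sketch in Python. The 4-base barcodes, the sample names and the reads are all invented, and it assumes the barcode sits at the start of each read; real kits define their own index sequences and often read them separately.

```python
from collections import defaultdict

# Hypothetical 4-base barcodes; real kits define their own index sequences.
BARCODES = {"ACGT": "control_A", "TGCA": "treatment_A"}
BARCODE_LEN = 4


def demultiplex(reads):
    """Group reads by the sample whose barcode starts the read."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        # Reads with an unknown barcode are kept aside, not silently dropped.
        sample = BARCODES.get(tag, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


# One shared lane containing reads from both samples.
lane = ["ACGTTTGACC", "TGCAGGATCA", "ACGTCCATGA", "NNNNACGTAC"]
per_sample = demultiplex(lane)

for sample, inserts in per_sample.items():
    print(sample, inserts)
```

The lane is shared, but each read can still be traced back to its sample, so the per-sample counts (and hence the within-group variation) are recovered.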

**aliceb** · 01-08-2014, 01:39 PM

Hi all,

To dredge up an old question again, I was wondering if I could get an opinion on a pooling / not pooling design.

First, I understand that I want biological replicates! But is it better to work with replicates of pools or replicates of individuals? I'm leaning towards individuals because we can better call alleles, I think. But my main goal is to identify differentially expressed genes.

An example. We have 3 treatments to compare:

Option A: 5 individuals per treatment, giving me 15 libraries.
Option B: 5 pools (of 10 individuals?) per treatment, again giving me 15 libraries, but summarizing 150 individuals.

Any thoughts on this option would be appreciated.

Thanks!

**Simon Anders** · 01-08-2014, 03:12 PM

Of course, B is the better option if you have so many samples anyway. (What are we talking about? Flies?) Unless you want to look at allele-specific expression, as you already noted. The trade-off here depends on how much signal you gain with B vs A and how much potentially interesting biology you lose by not being able to look at alleles.

The option I argued against is

Option C: Pool all the samples from each treatment, giving you 3 libraries in total.

It seems to be non-obvious to distressingly many practitioners why that one is not acceptable.

If it does not cost anything extra, you should consider

Option D: Label the cDNA from each individual with a barcode, then pool them all in one big library, spread over 15 sequencing lanes.

This offers you the most information, but requires you to do all the sample-prep steps up to the barcoding 150 times in parallel, which is practicable only if there are just a few steps before the pooling and/or you have suitable robotics or lots of patience.
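For the A-versus-B comparison, a small simulation gives a feel for how much signal pooling can gain. This is only a sketch under strong assumptions: every individual contributes equally to its pool, biological variation is Gaussian, there is no technical noise, and the expression level and spread are invented numbers.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN, BIO_SD = 100.0, 30.0   # invented expression level and spread
N_LIBRARIES = 5                   # libraries per treatment group
N_EXPERIMENTS = 2000              # simulated repeats of the whole experiment


def library_value(pool_size):
    """Expression seen in one library built from pool_size individuals."""
    individuals = [random.gauss(TRUE_MEAN, BIO_SD) for _ in range(pool_size)]
    return statistics.fmean(individuals)


def group_mean_sd(pool_size):
    """Spread of the estimated group mean over many repeated experiments."""
    estimates = []
    for _ in range(N_EXPERIMENTS):
        libraries = [library_value(pool_size) for _ in range(N_LIBRARIES)]
        estimates.append(statistics.fmean(libraries))
    return statistics.stdev(estimates)


sd_individuals = group_mean_sd(pool_size=1)   # option A: 5 individuals
sd_pools = group_mean_sd(pool_size=10)        # option B: 5 pools of 10

print(f"option A: group mean varies with SD {sd_individuals:.1f}")
print(f"option B: group mean varies with SD {sd_pools:.1f}")
```

Under these assumptions the group mean from pools of 10 should scatter about sqrt(10) times less than the one from single individuals, while both options still leave 5 libraries per group from which to estimate the variation. Unequal contributions to a pool or substantial technical noise would shrink that gain.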

**aliceb** · 01-09-2014, 12:49 AM

Thanks for the reply! We're working with wasps that can be grown up, but high numbers will be a bit of a struggle. And as they're variable, sexual populations, there will certainly be information that is lost by pooling.

Option D sounds fantastic. But as I actually have 12 experimental lines to sequence (well, 3 blocks of 4 parallel lines), with at least 5 biological replicates each, I think it's outside of my budget and pipetting capacity.

Also, when it comes to pooling, do you have an opinion on how many individuals to use? It seems like pools of only 5 individuals might have problems with one weirdo dominating the response. But how high would one have to go to avoid that? This is where my number limitations come in. I would like 10 per pool, but might be limited to fewer.
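A back-of-the-envelope way to think about the "one weirdo" worry, sketched in Python. It assumes every individual contributes equal RNA, all but one individual sit at a common baseline, and the outlier is some fold above it; `pool_shift` is just an illustrative helper, and the 10-fold outlier is an invented number.

```python
def pool_shift(pool_size, outlier_fold):
    """Relative change in pooled expression caused by one outlier individual.

    Assumes equal contribution per individual, with pool_size - 1
    individuals at the baseline and one at outlier_fold times the baseline.
    """
    pooled = (pool_size - 1 + outlier_fold) / pool_size
    return pooled - 1.0


for k in (5, 10, 20):
    print(f"pool of {k}: +{pool_shift(k, outlier_fold=10):.0%}")
```

Under these assumptions the outlier's effect falls off as 1/k: a single 10-fold outlier raises the pooled level by about 180% in a pool of 5, 90% in a pool of 10 and 45% in a pool of 20.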

**revAMI** · 01-22-2014, 01:16 PM

Library prep can be more expensive than the sequencing, so option D would have a significant added cost.

I have money for 18 preps and one run. I have three treatment groups and hundreds of samples. Is it better to pick six from each group at random, or do six pools (of how many?) for each group?

Pooling would reduce chance bias from biological variability, and give a stronger signal for the most changed genes. It would also be more emotionally satisfying to use more of my samples. On the other hand, it would make allele-specific expression and alternative splicing much harder to do.

This is in humans, so I'm not concerned about creating a de novo transcriptome.

Which would look better when applying for a follow-up grant to do more samples?

Pooling Samples for Sequencing