Originally posted by pmiguel
I guess what I'm getting at is that I think those plots show a property of subsetting a fixed-size sample at different percentages. It's not a property of RNA-Seq data, and I'm afraid the analysis will show similar results regardless of the actual number of reads collected. For example, if you have 40 million reads, the analysis might show that once you're past 20 or 25 million reads you have a pretty good picture of what's happening at 40 million reads. If you had 900 million reads, it might show that around 400 or 500 million reads you have a good picture of what's happening at 900 million reads. I suspect that whenever you have N total reads, this analysis will show that you have a pretty good picture of the quantification of N reads at N/2 or 2N/3 reads. It's just a property of percentages. Take 10% of your data, and that's 100% different from no data. At 20%, 50% of the data came from the previous subset. At 30%, 66% of the data was already quantified in the 20% subset. By 60%, as I mentioned before, you quantified 83% of that sample in the 50% subset. With each additional subset, the chance that the quantification looks different from the previous subset gets smaller and smaller, independent of the number of reads involved in each subset.
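A quick simulation makes the percentage argument concrete. This is a hypothetical sketch, not anyone's actual pipeline: gene abundances are drawn from a log-normal (a stand-in for real expression), one sample is drawn as a multinomial, and each "subset" is a binomial thinning of the full counts (equivalent to randomly subsampling reads). The point is that the saturation curve looks the same whether the full sample is 1 million or 100 million reads.

```python
import numpy as np

rng = np.random.default_rng(0)

def saturation_curve(total_reads, n_genes=2000, fractions=(0.1, 0.25, 0.5, 0.75)):
    """Simulate one RNA-Seq sample and correlate each subset with the full sample."""
    # Hypothetical expression profile: log-normal abundances, normalized to proportions.
    props = rng.lognormal(sigma=2.0, size=n_genes)
    props /= props.sum()
    full = rng.multinomial(total_reads, props)
    curve = []
    for f in fractions:
        sub = rng.binomial(full, f)  # keep a fraction f of the reads, per gene
        keep = (full > 0) & (sub > 0)
        # Pearson correlation of log-expression between subset and full sample
        r = np.corrcoef(np.log(sub[keep]), np.log(full[keep]))[0, 1]
        curve.append(r)
    return curve

shallow = saturation_curve(1_000_000)       # 1 M reads total
deep    = saturation_curve(100_000_000)     # 100 M reads total
print("1M reads:  ", [round(r, 3) for r in shallow])
print("100M reads:", [round(r, 3) for r in deep])
```

Both curves climb toward 1 by the 50% subset, at either depth: the apparent "saturation" is driven by the fraction subsampled, not by the absolute number of reads.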
Of course depth is important for other things, such as guaranteeing good coverage of transcripts of a given length down to some RPKM level, and for splicing analysis. But to say that the gene expression values at 100 million reads are more correct than the values at 50 million seems to me irrelevant, and it's certainly not demonstrated by comparing 50% of your reads to 100% of your reads within a single sample. The gene expression values of a single sample sequenced to 500 million reads are still just from a single sample. The distribution of expression values from 10 biological replicates sequenced at 50 million reads each would be much more reliable... which is obvious, right? It's just like any other kind of experiment.
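The replicates-versus-depth point can also be sketched with made-up numbers. Assume (hypothetically) one gene whose true expression varies between individuals with a biological SD of 30 around a population mean of 100, plus Poisson counting noise from sequencing. One very deep sample pins down that one individual precisely but still carries the full biological variance; ten shallower replicates average it out.

```python
import numpy as np

rng = np.random.default_rng(1)

pop_mean = 100.0   # assumed population-mean expression (reads per million)
bio_sd   = 30.0    # assumed between-individual biological SD

def estimate(n_reps, depth_millions):
    """Estimate the population mean from n_reps replicates at a given depth."""
    true = rng.normal(pop_mean, bio_sd, size=n_reps)              # biological variation
    counts = rng.poisson(np.clip(true, 0, None) * depth_millions) # technical (Poisson) noise
    return (counts / depth_millions).mean()

trials = 2000
single_deep = np.array([estimate(1, 500) for _ in range(trials)])  # 1 x 500M reads
ten_reps    = np.array([estimate(10, 50) for _ in range(trials)])  # 10 x 50M reads
print("SD of estimate, 1 x 500M reads:", round(float(single_deep.std()), 2))
print("SD of estimate, 10 x 50M reads:", round(float(ten_reps.std()), 2))
```

At 500 million reads the Poisson noise is negligible, so the single-sample estimate still scatters with nearly the full biological SD (~30), while the ten-replicate mean scatters at roughly SD/sqrt(10) (~9.5): spending reads on replicates beats spending them on depth, under these assumptions.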