Originally posted by pmiguel
I guess what I'm getting at is that I think those plots show a property of subsetting a fixed-size sample at different percentages. It's not a property of RNA-Seq data, and I'm afraid the analysis will show similar results regardless of the actual number of reads collected. For example, if you have 40 million reads, the analysis might show that once you're past 20 or 25 million reads you have a pretty good picture of what's happening at 40 million reads. If you had 900 million reads, it might show that around 400 or 500 million reads you have a good picture of what's happening at 900 million reads. I suspect that whenever you have N total reads, this analysis will show that you have a pretty good picture of the quantification of N reads at N/2 or 2N/3 reads. It's just a property of percentages. Take 10% of your data, and that's 100% different from no data. At 20%, 50% of the data came from the previous subset. At 30%, 66% of the data was already quantified in the 20% subset. By 60%, as I mentioned before, you quantified 83% of that sample in the 50% subset. With each additional subset, the chance that the quantification looks different from the previous subset gets smaller and smaller, independent of the number of reads involved in each subset.
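A quick simulation makes the percentage argument concrete. This is a hypothetical sketch, not anyone's actual pipeline: gene abundances are drawn from a log-normal (a stand-in for real expression), one sample is drawn as a multinomial, and each "subset" is a binomial thinning of the full counts (equivalent to randomly subsampling reads). The point is that the saturation curve looks the same whether the full sample is 1 million or 100 million reads.

```python
import numpy as np

rng = np.random.default_rng(0)

def saturation_curve(total_reads, n_genes=2000, fractions=(0.1, 0.25, 0.5, 0.75)):
    """Simulate one RNA-Seq sample and correlate each subset with the full sample."""
    # Hypothetical expression profile: log-normal abundances, normalized to proportions.
    props = rng.lognormal(sigma=2.0, size=n_genes)
    props /= props.sum()
    full = rng.multinomial(total_reads, props)
    curve = []
    for f in fractions:
        sub = rng.binomial(full, f)  # keep a fraction f of the reads, per gene
        keep = (full > 0) & (sub > 0)
        # Pearson correlation of log-expression between subset and full sample
        r = np.corrcoef(np.log(sub[keep]), np.log(full[keep]))[0, 1]
        curve.append(r)
    return curve

shallow = saturation_curve(1_000_000)       # 1 M reads total
deep    = saturation_curve(100_000_000)     # 100 M reads total
print("1M reads:  ", [round(r, 3) for r in shallow])
print("100M reads:", [round(r, 3) for r in deep])
```

Both curves climb toward 1 by the 50% subset, at either depth: the apparent "saturation" is driven by the fraction subsampled, not by the absolute number of reads.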
Of course depth is important for other things, such as guaranteeing good coverage of transcripts of a given length down to some RPKM level, and for splicing analysis. But to say that the gene expression values at 100 million reads are more correct than the values at 50 million seems to me irrelevant, and it's certainly not demonstrated by comparing 50% of your reads to 100% of your reads within a single sample. The gene expression values of a single sample sequenced to 500 million reads are still just from a single sample. The distribution of expression values from 10 biological replicates sequenced at 50 million reads each would be much more reliable... which is obvious, right? It's just like any other kind of experiment.
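The replicates-versus-depth point can also be sketched with made-up numbers. Assume (hypothetically) one gene whose true expression varies between individuals with a biological SD of 30 around a population mean of 100, plus Poisson counting noise from sequencing. One very deep sample pins down that one individual precisely but still carries the full biological variance; ten shallower replicates average it out.

```python
import numpy as np

rng = np.random.default_rng(1)

pop_mean = 100.0   # assumed population-mean expression (reads per million)
bio_sd   = 30.0    # assumed between-individual biological SD

def estimate(n_reps, depth_millions):
    """Estimate the population mean from n_reps replicates at a given depth."""
    true = rng.normal(pop_mean, bio_sd, size=n_reps)              # biological variation
    counts = rng.poisson(np.clip(true, 0, None) * depth_millions) # technical (Poisson) noise
    return (counts / depth_millions).mean()

trials = 2000
single_deep = np.array([estimate(1, 500) for _ in range(trials)])  # 1 x 500M reads
ten_reps    = np.array([estimate(10, 50) for _ in range(trials)])  # 10 x 50M reads
print("SD of estimate, 1 x 500M reads:", round(float(single_deep.std()), 2))
print("SD of estimate, 10 x 50M reads:", round(float(ten_reps.std()), 2))
```

At 500 million reads the Poisson noise is negligible, so the single-sample estimate still scatters with nearly the full biological SD (~30), while the ten-replicate mean scatters at roughly SD/sqrt(10) (~9.5): spending reads on replicates beats spending them on depth, under these assumptions.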