SEQanswers > Applications Forums > RNA Sequencing

04-18-2011, 04:40 AM   #1
gavin.oliver (Senior Member; Location: UK; Joined: Jan 2010; Posts: 110)

Planning an RNA-Seq Experiment

Hi all,

If planning an experiment to compare, for example, two types (stages) of human tumour, what number of reads and how many biological replicates are currently considered acceptable?

04-18-2011, 08:35 AM   #2
Jon_Keats (Senior Member; Location: Phoenix, AZ; Joined: Mar 2010; Posts: 279)

What do you want to detect? What tissue type? The number of reads and the read length really depend on the immediate goal associated with your hypothesis and on the tissue type. If you want to have your cake and eat it too (gene expression, transcript expression, and mutation detection), you will need far more reads than for gene expression alone, say 150 million versus 25 million. Also, replicates are nice to have for the expression estimates.

04-18-2011, 11:45 AM   #3
gavin.oliver (Senior Member; Location: UK; Joined: Jan 2010; Posts: 110)

Hi, thanks for the reply.

I guess my ideal answer would explain how read length and read number change with the application, e.g. gene expression vs. transcript expression, or transcript discovery vs. SNP detection.

As an example: good vs. poor prognosis colorectal tumour samples.

What number of replicates is desirable/computationally realistic?

04-18-2011, 10:19 PM   #4
Simon Anders (Senior Member; Location: Heidelberg, Germany; Joined: Feb 2010; Posts: 994)

Usually, people do two to three replicates per condition. This is enough to estimate variance but not sufficient to overcome a bad signal-to-noise ratio (SNR). In your case, you should expect the SNR to be very bad: the signal (differences in expression due to difference in prognosis) will probably only rarely be larger than the noise (differences due to the fact that each sample is from another patient with another genotype).

This is the reason that all these experiments attempting to link cancer prognosis to expression levels are done with tens of replicates (and then with microarrays, because they are still cheaper), and why, even so, they usually lead to nothing.

Are you sure you have the resources to do such a project? Your post does not sound as if you were aware that this is far more ambitious than the average RNA-Seq project.
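
To put rough numbers on the replicates-versus-noise point, here is a minimal sample-size sketch, assuming a plain two-sample comparison of log2 expression with normally distributed patient-to-patient variation (the function, thresholds, and numbers are illustrative, not from this thread):

Code:
# Minimal sketch: replicates per group needed to detect a log2 fold
# change (delta) against a between-patient SD (sigma), two-sided z-test.
# alpha = 0.05 ignores multiple testing across ~20,000 genes, which
# would push the numbers higher still.
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * (sigma / delta) ** 2 * z ** 2

print(round(n_per_group(delta=1.0, sigma=1.0)))  # signal = noise: ~16 per group
print(round(n_per_group(delta=1.0, sigma=2.0)))  # noise 2x signal: ~63 per group

With only two or three replicates per group, you can only hope to detect differences several times larger than the patient-to-patient variation.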

04-18-2011, 11:34 PM   #5
gavin.oliver (Senior Member; Location: UK; Joined: Jan 2010; Posts: 110)

Hi Simon, and thanks again.

In short: not certain at all.

This is very much fact-finding.

Are there any published studies in this vein using (or attempting to use) NGS in place of microarrays?

I guess I'm trying to get a concrete feel for how close NGS is to 'plugging into' areas that traditionally use microarrays, e.g. multivariate diagnostic/prognostic classification, SNP association studies, etc.

04-19-2011, 12:09 PM   #6
gavin.oliver (Senior Member; Location: UK; Joined: Jan 2010; Posts: 110)

No more luck here

If no one is aware of work in this area, let me ask in theory:

If one had the will/finances to profile 50 good prognosis and 50 poor prognosis tumors with 150 million reads per patient, what size of an undertaking would it be computationally? Is it huge? Or by generating read counts one patient at a time, does it become nothing more than a collection of 100 data matrices?

04-19-2011, 12:36 PM   #7
pbluescript (Senior Member; Location: Boston; Joined: Nov 2009; Posts: 224)

Quote:
Originally Posted by gavin.oliver View Post
No more luck here

If no one is aware of work in this area, let me ask in theory:

If one had the will/finances to profile 50 good prognosis and 50 poor prognosis tumors with 150 million reads per patient, what size of an undertaking would it be computationally? Is it huge? Or by generating read counts one patient at a time, does it become nothing more than a collection of 100 data matrices?
That would be a massive undertaking. Even at the lowest level of sequencing (36 bp, single-end), you would generate 540 gigabases of data (50 * 2 * 150,000,000 * 36). However, 36 bp reads aren't that great for RNA-seq, especially in a complex genome. If you go for 100 bp paired-end reads, you'd be talking about 3 terabases of data, which approaches what was generated for the 1000 Genomes pilot paper published in Nature last year (4.9 terabases). Generating and analyzing that amount of data would probably cost millions of dollars.
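
For reference, the arithmetic above as a short sketch (pure Python; all figures are the ones quoted in this post):

Code:
# Data volume for 50 + 50 tumours at 150 million reads per patient.
patients = 50 + 50
reads = 150e6

se_36 = patients * reads * 36        # single-end, 36 bp
pe_100 = patients * reads * 2 * 100  # paired-end, 2 x 100 bp mates

print(f"{se_36 / 1e9:.0f} gigabases")    # 540 gigabases
print(f"{pe_100 / 1e12:.1f} terabases")  # 3.0 terabases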

04-19-2011, 12:41 PM   #8
Simon Anders (Senior Member; Location: Heidelberg, Germany; Joined: Feb 2010; Posts: 994)

If we are talking about analysing mRNA to see changes in gene expression, high-throughput sequencing will not provide much advantage over microarrays. With HTS, you might get better precision for your expression estimates (and even this only if you have enough reads), but measurement precision is not the limiting factor here anyway; the patient-to-patient variation is.

You may hope to get better prognostic signatures by looking at features that are hard to see with microarrays, e.g. changes in splicing rather than expression, or the appearance of fusion genes. This might be a long shot, though.

Finally, if you go for genomic sequencing and look for structural variants, you may hope to find something in the size range of variants too large for SNP chips and too small for array-CGH / tiling arrays. Again, whether cancer signatures are likely to be found there is anyone's guess.

04-20-2011, 12:44 AM   #9
gavin.oliver (Senior Member; Location: UK; Joined: Jan 2010; Posts: 110)

These are great answers, guys - thanks a lot.

A few small things:

1) How do bases convert to bytes when we're talking about 3 terabases of data? I'm guessing it depends on the format it's supplied in?

2) As that data is mapped and converted to read counts, how much does it shrink?

3) How would that data likely be supplied by a sequencing provider? On disk?

Thanks for all your help.

04-20-2011, 05:57 AM   #10
Jon_Keats (Senior Member; Location: Phoenix, AZ; Joined: Mar 2010; Posts: 279)

Most suppliers will provide portable drives with the data as FASTQ files, which contain the read sequence and a quality value for each base (i.e. two characters per base). These files also contain read identifiers, which can be of varying lengths, so file size per base will vary by vendor and platform, but expect 3-4 bytes per base. Depending on the processing method, the intermediate files can take up a huge amount of space, and, to be conservative, expect the final BAM file that you will likely want to keep to be roughly the same size as the FASTQ. From that file you can use a variety of programs to generate the count data, which ends up being very small, 1-5 MB per sample. If you did one sample per lane on the Illumina HiSeq with 50x50 reads, you would end up with two 6-15 GB FASTQ files, a single BAM file of 6-15 GB after alignment, and then a couple of count files in the 1-5 MB range.

Do consider many of the comments from Simon; he is definitely one of the more knowledgeable contributors to this forum. In general I agree with him that these types of questions are still best answered with microarrays rather than RNA-seq, largely because there are many more nuances to consider when designing an RNA-seq experiment, and doing it properly costs enough to scare you. If you cut corners because of cost, you will likely regret the decision later when you realize you can't do XYZ, or that the results are inaccurate. That being said, I have more hope that expression profiling provides a good means of identifying prognostic groups; in fact I often say that subset identification is the only thing it is good for, and I have a number of similar studies running currently. The one issue you always run into, regardless of arrays or sequencing, is the old "garbage in, garbage out" phenomenon. If you cannot purify the tumor cells to high purity (i.e. 85% or greater at a minimum), you are likely to end up with feelings similar to Simon's comment that "they usually lead to nothing". Even in our field, where we can robustly purify tumor cells by magnetic sorting to an average of 95% purity, the really good risk models only fell out in cohorts of 250-350 patients.
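
As a rough sketch of what those figures imply for the 100-patient design discussed above (the 3.5 bytes per base is just the midpoint of the 3-4 bytes/base estimate, uncompressed; BAM roughly equal to FASTQ follows the conservative assumption in this post):

Code:
# Storage estimate: 100 patients, 150M read pairs, 100 bp mates.
reads = 150e6
read_len = 100
bytes_per_base = 3.5          # midpoint of the 3-4 bytes/base estimate

fastq = reads * 2 * read_len * bytes_per_base  # both mates
bam = fastq                                    # conservative: BAM ~ FASTQ

print(f"per sample: {fastq / 1e9:.0f} GB FASTQ + {bam / 1e9:.0f} GB BAM")
print(f"cohort: {(fastq + bam) * 100 / 1e12:.0f} TB")  # ~21 TB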

04-20-2011, 06:06 AM   #11
gavin.oliver (Senior Member; Location: UK; Joined: Jan 2010; Posts: 110)

Jon - thanks for the comprehensive response!

Rest assured I take Simon's (and your) responses very seriously and disagree with none of them!

I am just keen to build a strong concept of the why-nots in an area where I have limited practical experience.

So it's fair to say that prognostic classification should remain a microarray-based pursuit.

Do you feel that will change in the near future?


04-20-2011, 06:58 AM   #12
Jon_Keats (Senior Member; Location: Phoenix, AZ; Joined: Mar 2010; Posts: 279)

I don't think it should remain a microarray-based pursuit; the sequencing-based readouts have so many advantages that they will unquestionably be the future. The question is what to do today. On my end we are pushing forward with sequencing-based measurements for a couple of reasons. First, our institute is heavily invested in NGS technology and has sold off our rooms full of Affymetrix equipment to make room for all the sequencers, so internally our hands are a bit tied, though we still have our Agilent platforms, which I actually prefer anyway. Second, we and others have had success in our field using Affy arrays, so we assume similar models should fall out of a sequencing-based study. But I have to say my major impetus was, and is, that those arrays may not exist in a couple of years, so we had better start moving our models to sequencing-based readouts to stay ahead of the curve. We also want to start integrating exome/genome sequencing with expression estimates, and the sequencing-based approach allows for things like allele-specific expression analysis that cannot be done on a conventional microarray platform.

The short answer is: if you have the money, and feel comfortable with NGS data or have collaborators who are, then go ahead. But like any experiment, and maybe more so given the cost and risk, make sure the sample selection and the analytical goal are laid out in advance so library production and sequencing are performed correctly. Once that decision is made, I would always suggest a three-sample test batch to see whether you can handle the data and whether the outlined plan generates the data you need for your analytical goals, then ramp up to mass production of the 100-sample batch.

04-20-2011, 07:08 AM   #13
gavin.oliver (Senior Member; Location: UK; Joined: Jan 2010; Posts: 110)

Apologies - I should have said "remain a microarray-based pursuit for now".

Our group has been firmly Affymetrix microarray-based for many years now and has been involved in some large-scale prognostic classifier work.

The thing is that I am trying to encourage a (tentative or partial, at least) move toward NGS, and I want a strong idea of what we can already do with it and what remains in the future, i.e. to what degree, and in which applications, it can already replace microarrays cost-effectively...

04-20-2011, 07:40 AM   #14
Jon_Keats (Senior Member; Location: Phoenix, AZ; Joined: Mar 2010; Posts: 279)

RNA-seq is currently best used for small-scale test vs. control comparisons or time series. But that largely assumes you want to look at gene expression and transcript expression comparisons, where you need significant read depth for the latter. In your situation, limiting the analysis to "gene" expression, you could generally replace Affy arrays, and get rid of their many inaccuracies, for around double the cost per sample, likely less depending on the vendor. Depending on the tissue, the cost could drop even more if you can multiplex, but the limitation will be relative gene expression. Take my situation of working on multiple myeloma, a plasma cell disease where the cell really is a factory producing immunoglobulin: we need to double the read count to get equal counts on the non-immunoglobulin genes compared to breast cancer, because ~50% of all the transcripts in the cell are immunoglobulin.
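
The immunoglobulin point is just depth arithmetic; a minimal sketch (the 50% fraction is the estimate from this post, the target depth is illustrative):

Code:
# Effective depth when ~50% of transcripts are immunoglobulin.
target_informative = 25e6  # reads wanted on non-Ig genes
ig_fraction = 0.5          # fraction of transcripts that are Ig

required = target_informative / (1 - ig_fraction)
print(f"sequence {required / 1e6:.0f}M reads "
      f"to keep {target_informative / 1e6:.0f}M informative")  # 50M reads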

04-20-2011, 10:15 AM   #15
gavin.oliver (Senior Member; Location: UK; Joined: Jan 2010; Posts: 110)

Quote:
Originally Posted by Jon_Keats View Post
but the limitation will be relative gene expression. Take my situation of working on multiple myeloma, a plasma cell disease where the cell really is a factory producing immunoglobulin: we need to double the read count to get equal counts on the non-immunoglobulin genes compared to breast cancer, because ~50% of all the transcripts in the cell are immunoglobulin.
Is this where protocols like DSN normalisation can be of use? Or am I off the mark?

04-21-2011, 07:24 AM   #16
Joann (Senior Member; Location: Woodbridge, CT; Joined: Oct 2008; Posts: 231)

Another question

Quote:
Originally Posted by Jon_Keats View Post

If you cannot purify the tumor cells to high purity (i.e. 85% or greater at a minimum), you are likely to end up with feelings similar to Simon's comment that "they usually lead to nothing". Even in our field, where we can robustly purify tumor cells by magnetic sorting to an average of 95% purity, the really good risk models only fell out in cohorts of 250-350 patients.
How many tumor cells contribute to each DNA or RNA sequencing sample in this type of comparison (risk model)?

04-21-2011, 09:43 AM   #17
husamia (Member; Location: cinci; Joined: Apr 2010; Posts: 66)

A colleague of mine ran an RNA-seq experiment and I analyzed the reads for a differential two-sample test using RPKM values. One comment on the design: the reads were Illumina 25 bp plus adapter, so I first had to remove the adapter and quality-trim, then align to the mouse genome and map to miRBase to get my "expression".

The first issue I was concerned about is the alignment of the 25 bp reads, for which the cutoff was, I believe, set to an 80% match. When I used a 90% match, more than half of the reads didn't align; with an 80% match I got 80% aligned, which is what I expected. That is 5 mismatches, which may be too high; however, I noticed high heterogeneity at the read ends, and I have read reports of this in the literature. Does everyone use this cutoff for alignment? I believe it may not be stringent enough, since 25 bp with 5 mismatches may not represent the target well.

Another concern is the assumption that coverage (depth) is a proxy for expression. I believe there is amplification bias here that may be overwhelming in some cases. The read depth also varied, which is another issue: what is a good depth range, and should I remove duplicate reads or not? If I see a difference, how many samples should I replicate to be confident that I am measuring a difference in expression rather than amplification bias, or is it possible to make this stronger?
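
For readers unfamiliar with the unit used above, a minimal sketch of the RPKM calculation (the gene length and counts are illustrative):

Code:
# RPKM: reads per kilobase of transcript per million mapped reads.
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    return gene_reads / ((gene_length_bp / 1e3) * (total_mapped_reads / 1e6))

# 500 reads on a 2,000 bp transcript in a 10-million-read library:
print(rpkm(500, 2000, 10e6))  # 25.0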

05-01-2011, 10:06 PM   #18
christinawu2008 (Member; Location: Australia; Joined: Feb 2011; Posts: 13)

Quote:
Originally Posted by husamia View Post
...should I remove duplicate reads or not? If I see a difference, how many samples should I replicate to be confident that I am measuring a difference in expression rather than amplification bias...
There may be some artefacts in your data; you could detect them by comparing replicates and then discard them.
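
One way to act on that suggestion; a minimal sketch that flags genes whose two replicates disagree wildly (the threshold and the toy data are illustrative):

Code:
# Flag genes whose replicate measurements differ by more than
# max_log2_gap on the log2 scale (a crude artefact screen).
import math

def discordant(rep1, rep2, max_log2_gap=2.0):
    flagged = []
    for gene in rep1:
        a, b = rep1[gene] + 1e-9, rep2[gene] + 1e-9  # avoid log(0)
        if abs(math.log2(a / b)) > max_log2_gap:
            flagged.append(gene)
    return flagged

rep1 = {"mir21": 120.0, "mir155": 3.0}
rep2 = {"mir21": 110.0, "mir155": 40.0}
print(discordant(rep1, rep2))  # ['mir155']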

05-01-2011, 10:48 PM   #19
christinawu2008 (Member; Location: Australia; Joined: Feb 2011; Posts: 13)

Hi Simon,

Could you post the literature comparing gene expression measured with microarrays versus RNA-Seq that supports your view? I agree with you, but I'd like to see some published statistics on it.

Thank you!

02-11-2014, 10:19 PM   #20
ymc (Senior Member; Location: Hong Kong; Joined: Mar 2010; Posts: 498)

Quote:
Originally Posted by Jon_Keats View Post
...Take my situation of working on multiple myeloma, a plasma cell disease where the cell really is a factory producing immunoglobulin: we need to double the read count to get equal counts on the non-immunoglobulin genes compared to breast cancer, because ~50% of all the transcripts in the cell are immunoglobulin.
Hi Jon, can RNA-Seq be used to measure expression of the immunoglobulin genes? I don't see them (i.e. IGH*, IGK*, IGL*) in the genes.gtf file, so they never appear in the output. Is there a genes.gtf that contains them, and where can I download it?

Thanks a lot in advance.
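
One quick way to check whether a given annotation covers the Ig loci before rerunning a pipeline; a minimal sketch, assuming a local genes.gtf with gene_name attributes (the path is an assumption). For what it's worth, Ensembl/GENCODE annotations do include the Ig segments, with biotypes such as IG_V_gene.

Code:
# Count distinct IGH*/IGK*/IGL* gene names in a GTF.
import re

pattern = re.compile(r'gene_name "((?:IGH|IGK|IGL)[^"]*)"')
names = set()
with open("genes.gtf") as gtf:  # path is an assumption
    for line in gtf:
        names.update(pattern.findall(line))

print(f"{len(names)} distinct IGH*/IGK*/IGL* gene names found")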