Old 11-11-2011, 02:12 AM   #1
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23
Question: How many reads are acceptable from an RNA-seq experiment?

Hi

We have data from an RNA-seq experiment: 48 samples on Illumina v2.5 chemistry. We had roughly the recommended number of clusters and an even distribution across samples, so we've ended up with roughly 6-7 million paired reads (12-14 million single reads) per sample.

I've heard people claim that you need at least 20-25 million reads per sample. So I'm wondering if anyone knows of, or has an article that has looked at, a good read number for an RNA-seq experiment. The data quality is really nice; if someone asks me how our runs look, I always show the FastQC report from this run...
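For what it's worth, the per-sample numbers above are just the lane output divided by the multiplexing level. A back-of-envelope sketch in Python, where the pass-filter cluster count and the 6-samples-per-lane layout are assumptions for illustration rather than the exact numbers from our run:

[CODE]
# Back-of-envelope read budget per sample. The cluster count per lane and the
# samples-per-lane layout are illustrative assumptions, not run metrics.
clusters_per_lane = 40e6      # assumed pass-filter clusters per lane
samples_per_lane = 6          # assumed: 48 samples spread over 8 lanes
pairs_per_sample = clusters_per_lane / samples_per_lane
print(f"~{pairs_per_sample / 1e6:.1f} M read pairs per sample "
      f"(~{2 * pairs_per_sample / 1e6:.1f} M single reads)")
[/CODE]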

/Petter
Old 11-11-2011, 03:23 AM   #2
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

I'd guess it depends on the analysis you want to do with the data, or the purpose of your experiment. Generally, for SNP calling, I'd suppose this number of reads is sufficient. However, if you are looking at gene expression, especially detecting differential expression of lowly expressed genes, then more reads would probably help.

I'd love to see the FastQC results, to see how good RNA-seq data can look. The libraries I work with look good after preprocessing (adapter clipping + quality trimming), but I have never seen one that looked good enough in the raw data.
Also, it would be great if you could say how much total RNA you used, and a bit about the pre-amplification of the library: if it was performed, how many cycles, etc.
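By quality trimming I just mean cutting back low-quality 3' ends, something along these lines. A minimal sketch assuming Phred+33 encoded qualities, not the actual trimming tool we use:

[CODE]
# Minimal 3'-end quality trimming of one FASTQ record (sketch only; assumes
# Phred+33 encoded quality strings).
def trim_3prime(seq, qual, cutoff=20, offset=33):
    """Drop bases from the 3' end while their quality is below the cutoff."""
    keep = len(seq)
    while keep > 0 and ord(qual[keep - 1]) - offset < cutoff:
        keep -= 1
    return seq[:keep], qual[:keep]

print(trim_3prime("ACGTACGTNN", "IIIIIIII##"))   # -> ('ACGTACGT', 'IIIIIIII')
[/CODE]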

Thank you.
Old 11-11-2011, 03:46 AM   #3
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

This is a hotly debated topic; see e.g. http://blog.fejes.ca/?p=607 where Anthony Fejes discusses a paper claiming that 500 million reads are needed to estimate transcription levels... There has been a kind of mini-trend lately, with several papers claiming that RNA-seq is actually not that good compared to microarrays unless you have very deep coverage.

As cedance said, it really depends on what you are interested in. I have run some simulations where I downsampled the data and looked at the resulting isoform abundance estimates from Cufflinks and other tools, and so far I haven't seen much difference beyond 10 million paired-end reads. The number of detected transcripts always grows with sequencing depth, but again the curve is almost flat after 10-20M reads in the cases I've looked at.
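In case it helps, the downsampling itself is simple to do. A minimal sketch of the idea, keeping each read pair with probability p; the file names are placeholders, plain uncompressed FASTQ with mates in the same order is assumed, and this is not the exact code behind the simulations above:

[CODE]
# Keep each read pair with probability p (sketch; paths are placeholders and
# plain uncompressed FASTQ with mates in the same order is assumed).
import itertools
import random

def subsample_pairs(fq1_in, fq2_in, fq1_out, fq2_out, p, seed=1):
    random.seed(seed)
    with open(fq1_in) as f1, open(fq2_in) as f2, \
         open(fq1_out, "w") as o1, open(fq2_out, "w") as o2:
        while True:
            rec1 = list(itertools.islice(f1, 4))   # one FASTQ record = 4 lines
            rec2 = list(itertools.islice(f2, 4))
            if len(rec1) < 4 or len(rec2) < 4:
                break
            if random.random() < p:
                o1.writelines(rec1)
                o2.writelines(rec2)

subsample_pairs("sample_R1.fastq", "sample_R2.fastq",
                "sub_R1.fastq", "sub_R2.fastq", p=0.25)
[/CODE]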
Old 11-11-2011, 06:04 AM   #4
adameur
Member
 
Location: Uppsala, Sweden

Join Date: Nov 2009
Posts: 23
Default

To make it even more complex, we have seen that polyA+ RNA gives a much higher fraction of reads mapping to exons than total RNA (rRNA-depleted), which instead has lots of intronic reads. Our explanation is that total RNA-seq captures many nascent transcripts that have not yet been fully transcribed, while polyA+ RNA-seq captures mainly mature transcripts (see http://dx.doi.org/10.1038/nsmb.2143).

So I think fewer reads are required for polyA+ RNA-seq than for total RNA-seq if you are interested in mRNA expression.
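For anyone who wants to compute a similar exonic fraction on their own data, here is a simplified Python/pysam sketch, not the exact pipeline behind the numbers above. The file names are placeholders, spliced alignments and strandedness are ignored, and a real implementation would use interval trees rather than a linear scan:

[CODE]
# Fraction of mapped reads overlapping annotated exons (simplified sketch;
# paths are placeholders, spliced reads and strand are ignored).
from collections import defaultdict
import pysam

exons = defaultdict(list)                       # chrom -> [(start, end), ...]
with open("exons.bed") as bed:                  # BED: chrom, start, end
    for line in bed:
        chrom, start, end = line.split()[:3]
        exons[chrom].append((int(start), int(end)))

def overlaps_exon(chrom, start, end):
    # Linear scan for clarity only; use an interval tree for real data.
    return any(s < end and start < e for s, e in exons[chrom])

exonic = total = 0
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.is_secondary:
            continue
        total += 1
        if overlaps_exon(read.reference_name, read.reference_start,
                         read.reference_end):
            exonic += 1

print(f"{exonic}/{total} mapped reads overlap an exon "
      f"({100.0 * exonic / total:.1f}%)")
[/CODE]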
Old 11-11-2011, 07:58 AM   #5
harryzs
Member
 
Location: Germany

Join Date: Dec 2010
Posts: 29
Default

You should read this:
http://rna-seqblog.com/information/h...ds-are-enough/
Old 11-18-2011, 12:50 AM   #6
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23
Default

Thanks for all the answers. I've decided to resequence a couple of samples to a much higher depth, as well as to do some data pooling, to see how things look in our system. I'm assuming that the coverage needed will depend on read length as well as read depth, and since we have 101 bp reads we might be better off. I'm also uncertain about the number of transcripts to expect: we're working with a highly specialised cell type, not a cell line, so I'm expecting fewer transcripts, and far from all that could exist, compared to the vast numbers found in immortalised cell lines.

I'm also curious whether it depends a lot on the highly expressed genes in the sample, since they "steal" a lot of the data being produced. I know that it's possible to select the genes one is interested in, but has anyone tried to remove the uninteresting/highly expressed genes to increase the coverage of the other genes? This would allow higher coverage even of genes you don't know exist, in contrast to positive selection, where you only find what you expected to find.

I also wanted to attach a figure to show what I call high-quality data, since cedance asked for it, but the forum asks for a URL and I only have the figures on my computer, so I can't. Are there any nice (fast and simple) ways of doing this?
Old 11-18-2011, 01:00 AM   #7
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

Pettervikman,
About posting images: I use ImageShack to upload them and then paste the URL here with the URL button.
Old 11-18-2011, 01:38 AM   #8
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23
Default

A new try for the figures:

[FastQC figures were embedded here in the original post.]
Old 11-18-2011, 03:16 AM   #9
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

That looks really great. Could you also post the plots for "Sequence duplication levels" and "Per base sequence content"? Those are the ones I am not quite satisfied with in our data.
Old 11-18-2011, 03:58 AM   #10
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23
Default





[The "Per base sequence content" and "Sequence duplication levels" FastQC plots were embedded here in the original post.]

Here are the per-base content and duplication levels. Since we've used poly(A) pulldown, I'm not surprised by the initial increase in A/T. The duplication levels are much higher than I'd accept for a genomic project, but since there is much less diversity in the transcriptome I'm fine with that. Consider that there are hard end points that really can't change (the 5' and 3' ends of transcripts) and maybe only 10-15k transcripts to start with.
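A toy calculation of why high duplication is expected when reads come from a limited pool of distinct fragments; both numbers below are illustrative assumptions, and real expression skew pushes the duplication even higher:

[CODE]
# Expected duplicate fraction when R reads are sampled uniformly at random
# from D distinct cDNA fragments (toy model; both numbers are assumptions).
D = 5e6     # assumed number of distinct fragments in the library
R = 12e6    # reads sequenced
expected_unique = D * (1 - (1 - 1 / D) ** R)
dup_fraction = 1 - expected_unique / R
print(f"expected duplicate fraction ~ {100 * dup_fraction:.0f}%")   # ~62%
[/CODE]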

Another question, though. After running Cufflinks with RABT assembly (-g), the transcript assembly looks a lot nicer. That said, does anyone know why some transcripts are labelled OK despite their FPKM_low being 0? I'm also wondering about transcripts labelled FAIL that have positive numbers for coverage, FPKM and FPKM_high.

To sum it up: why are transcripts with positive numbers for coverage, FPKM and FPKM_high, but 0 for FPKM_low, sometimes labelled OK, LOWDATA or FAIL?
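For reference, here is a quick way to list those cases from the isoform tracking file. The path is a placeholder, and the FPKM_low/FPKM_high fields mentioned above appear as FPKM_conf_lo/FPKM_conf_hi in the *.fpkm_tracking header of the Cufflinks versions I have seen, so check the column names against your own output:

[CODE]
# List transcripts with a positive FPKM but a lower confidence bound of 0,
# together with their status flag (sketch; path and column names should be
# checked against your own Cufflinks output).
import csv

with open("isoforms.fpkm_tracking") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        if float(row["FPKM"]) > 0 and float(row["FPKM_conf_lo"]) == 0:
            print(row["tracking_id"], row["FPKM_status"],
                  row["coverage"], row["FPKM"], row["FPKM_conf_hi"])
[/CODE]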
Old 11-18-2011, 04:06 AM   #11
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

Thanks again. Sorry, I haven't used Cufflinks yet.
One more question: why is poly-A pulldown responsible for the initial increase in A/T?
Old 11-18-2011, 04:09 AM   #12
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Petter, those data look super. Did you get them sequenced in Uppsala?
Old 11-18-2011, 04:23 AM   #13
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23
Default

Thanks! They were sequenced here on "my" HiSeq. We have a HiSeq here at CRC in Malmö, and we are part of Lund University/LUDC (Lund University Diabetes Centre).

The pulldown uses a poly(T) oligo, and this will bind somewhere in the poly(A) tail (just to be super clear), hopefully close to the 3' end of the CDS/3' non-coding region. But if it binds further down the tail, there will be a few As or Ts before the actual transcript sequence, hence the slight increase in A/T.
Old 11-18-2011, 04:34 AM   #14
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by cedance
Thanks again. Sorry, I haven't used Cufflinks yet.
One more question: why is poly-A pulldown responsible for the initial increase in A/T?
It isn't. The non-random base distribution in the first 10 bases is attributed to hexamer-primed 2nd strand synthesis. (The hexamers do not prime perfectly randomly.)
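This is easy to check on raw data: tabulating the base composition of the first cycles straight from the FASTQ reproduces the pattern FastQC shows under "Per base sequence content". A minimal sketch, with the path as a placeholder and plain uncompressed FASTQ assumed:

[CODE]
# Base composition of the first sequencing cycles, computed directly from a
# FASTQ file (sketch; path is a placeholder, uncompressed FASTQ assumed).
from collections import Counter

n_cycles = 15
counts = [Counter() for _ in range(n_cycles)]
with open("sample_R1.fastq") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 1:                          # sequence line of each record
            for pos, base in enumerate(line.strip()[:n_cycles]):
                counts[pos][base] += 1

for pos, c in enumerate(counts, start=1):
    total = sum(c.values())
    if total == 0:
        continue
    comp = " ".join(f"{b}:{100 * c[b] / total:.1f}%" for b in "ACGT")
    print(f"cycle {pos:2d}  {comp}")
[/CODE]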

--
Phillip
Old 11-18-2011, 04:38 AM   #15
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23
Default

Thanks, pmiguel, I didn't know that. But I've heard that this is much more common in RNA-seq experiments than in DNA-seq, hence the poly(A)-tail story. So you're saying that it's only due to the second-strand synthesis?
Old 11-18-2011, 04:41 AM   #16
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by pettervikman
I'm also curious whether it depends a lot on the highly expressed genes in the sample, since they "steal" a lot of the data being produced. I know that it's possible to select the genes one is interested in, but has anyone tried to remove the uninteresting/highly expressed genes to increase the coverage of the other genes? This would allow higher coverage even of genes you don't know exist, in contrast to positive selection, where you only find what you expected to find.
You can deplete the highly expressed genes globally using normalization. My naive presumption would be that this distorts the relative numbers of all the transcripts, but recent studies have shown this not to be the case.

--
Phillip
Old 11-18-2011, 04:45 AM   #17
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23
Default

pmiguel: I've thought about depleting them through the same kind of technical pipeline as is used for rRNA depletion, for example. Depleting them only in the data wouldn't affect the actual sequencing, though; I would not get more reads for the lower-expressed genes. But with a physical depletion, where I actually remove these transcripts, I'd get more reads and maybe more transcripts sequenced.

But I guess you're talking about depletion in the data rather than depletion of the transcripts themselves?
Old 11-18-2011, 04:54 AM   #18
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

I guess the normalization pmiguel is talking about is experimental. One of my colleagues explained something about library normalization using enzymes: the enzymes act on the cDNA-mRNA complex, degrading it. It's probably explained better here, though I'm not sure. Since the highly expressed genes are abundant, they'll be depleted in larger quantity. (I am not a biologist, so pardon my terminology.)

@pmiguel, regarding the bias being due to random hexamer priming, this is what I have heard as well. Thanks for pointing that out.

@petter, I am not sure about the poly-A story: 1) I don't think there is necessarily an excess of A/T at the 3' end. 2) Even so, you then fragment and amplify, so why would those bases still be selectively amplified?
Old 11-18-2011, 05:04 AM   #19
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

No, I mean using one of the cDNA normalization methods based on duplex-specific nuclease (DSN). You create double-stranded cDNA, denature the strands, allow re-annealing to occur for an interval long enough for highly expressed transcript strands to find one another, but not those expressed at a lower level. Add DSN. Purify. Re-synthesize the second strand, and continue as normal.
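The reason this compresses the dynamic range rather than removing everything equally is second-order hybridization kinetics: the fraction of a species still single-stranded after re-annealing, and therefore surviving DSN digestion, goes roughly as 1 / (1 + k*C*t), so abundant strands re-anneal and get digested far more completely. A toy illustration with arbitrary numbers:

[CODE]
# Idealized re-annealing model: after time t, the fraction of a species still
# single-stranded (and so surviving DSN digestion) is ~ 1 / (1 + k*C*t).
# The rate*time constant and the abundances are arbitrary illustrative values.
kt = 1e-3
abundances = {"high": 100000, "medium": 1000, "low": 10}

for name, copies in abundances.items():
    surviving = copies / (1 + kt * copies)     # copies left single-stranded
    print(f"{name:6s} before={copies:6d} after~{surviving:7.1f} "
          f"kept={100 * surviving / copies:5.1f}%")
[/CODE]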

There was a methods paper published showing the validity of this approach for RNA-seq; I can't seem to find it now, though.

--
Phillip
Old 11-18-2011, 05:06 AM   #20
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

The link I pointed to basically tells the same story. Phillip explained it crisply and well.