SEQanswers

Old 07-12-2011, 08:46 AM   #1
eab
Member
 
Location: Maryland

Join Date: May 2011
Posts: 63
Default Intra-sample variability, Illumina TruSeq mRNA

Does anyone know if there is a technical issue during either mRNA library prep or data handling that could cause two libraries prepared from the same cell population to look radically different? We have made multiple libraries from the same sample and the results appear quite discouraging. We're not sure if we're doing something wrong with analysis, or whether there was a problem with sample prep. Anyone know of any common pitfalls that could explain our problem?

Last edited by eab; 07-12-2011 at 08:53 AM.
Old 07-12-2011, 08:52 AM   #2
mnkyboy
Member
 
Location: Seattle, WA

Join Date: Mar 2009
Posts: 87
Default

I did some QC runs using MAQC UHR and Human Brain with three replicates for TruSeq Whole Transcriptome and the libraries had an R-squared value of .98 or higher when I compared their FPKM as generated by Cufflinks. Granted I did not do the mRNA selection step.

I have also made libraries from the same experimental sample using different methods for ribosomal reduction and when comparing the non-ribosomal FPKM I also get a very high correlation of >0.9.

Were they prepared at different times? What kind of RNA, and how does the Bioanalyzer trace look?
Old 07-12-2011, 09:07 AM   #3
eab
Member
 
Location: Maryland

Join Date: May 2011
Posts: 63
Default

Hey mnkyboy, thanks for the superfast reply! Here are details.

Cells: sorted human naive T cells, approximately 15 million in one tube. Cells aliquoted into 5 tubes: one (1) tube of 1x10^7, two (2) tubes of 2x10^6, and two (2) tubes of 2x10^5.

Extractions: cells pelleted and lysed in RNAzol RT immediately after aliquoting, then stored at -80 until total RNA extraction. RNA extraction done at same time with same tubes of reagents on all 5 tubes.

Library prep: TruSeq RNA sample prep kit A, all libraries prepared together in a single 96-well plate using high-throughput protocol (with a few minor mods).

Library QC: completed, purified libraries run on bioanalyzer and showed appropriate size peak + a large peak that I took to represent the "bubble form" Illumina describes. Libraries quantified by Kapa qPCR with flowcell primers and SYBR Green reporter.

Clustering: cBot using the TruSeq PE Cluster Kit v2 - HiSeq.

I did not run the starting RNA on the BioA before library prep. The cells were handled as immaculately as was possible, so I figured that no matter what the BioA gave me for an RIN, I would not be able to improve on it and I needed to just go forward. I have some RNA saved back that I can run now on the BioA, but I would be shocked if differential degradation were the problem.

Any ideas? We're wondering especially about trivial informatics sorts of things that can lead to false differences.....

Thanks!
Eli
Old 07-12-2011, 09:22 AM   #4
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

When you say they look radically different, what do you mean? Is this before alignment or after alignment?
Old 07-12-2011, 09:28 AM   #5
mnkyboy
Member
 
Location: Seattle, WA

Join Date: Mar 2009
Posts: 87
Default

That is definitely a head scratcher. How long were your reads? We have found for RNA-seq that if we go over 75 bases we start hitting adapter and our mapping goes awry. Did you multiplex? Was there anything that stuck out across the lanes in your QC? We generally multiplex and spread across the flow cell to reduce any lane variation.

The only other thing that I think could be an issue is if something odd happened during the poly-A selection. One way to check this is to see if you map to any known non-polyadenylated non-coding RNA and whether there are differences across the samples.
Old 07-12-2011, 09:33 AM   #6
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Quote:
Originally Posted by mnkyboy
That is definitely a head scratcher. How long were your reads? We have found for RNA-seq that if we go over 75 bases we start hitting adapter and our mapping goes awry.
This is exactly the problem I had with the truseq libraries and I wonder if this is the problem now. We had 100bp reads and I was only getting ~60% to map. When I would blast random reads, the last 25 or so bps often had no match at all and turned out to be adapter sequence. I have heard of other people also having this problem with correct size selection.
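
A minimal sketch of how the adapter read-through described here could be trimmed before mapping. It assumes the read-through begins with the commonly cited TruSeq adapter prefix AGATCGGAAGAGC and uses exact matching only; the function name and the `min_overlap` parameter are hypothetical, and a dedicated trimming tool that tolerates sequencing errors would be the better choice for real data.

```python
# Assumption: adapter read-through starts with the common TruSeq prefix.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def trim_adapter(read, adapter=ADAPTER_PREFIX, min_overlap=5):
    """Return the read with a 3' adapter (full or partial) removed.

    Exact matching only: a read with a sequencing error inside the
    adapter will be left untrimmed by this sketch.
    """
    # Case 1: the whole adapter prefix occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the first k bases of the adapter fit at the read's end.
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read
```

Comparing the mapping rate before and after trimming would show whether read-through explains the unmapped fraction.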
Old 07-12-2011, 09:37 AM   #7
mnkyboy
Member
 
Location: Seattle, WA

Join Date: Mar 2009
Posts: 87
Default

Quote:
Originally Posted by chadn737 View Post
This is exactly the problem I had with the truseq libraries and I wonder if this is the problem now. We had 100bp reads and I was only getting ~60% to map. When I would blast random reads, the last 25 or so bps often had no match at all and turned out to be adapter sequence. I have heard of other people also having this problem with correct size selection.
Yeah our standard WT or mRNA-seq is now 2x75 bp and then 2x50 if we do FFPE.
Old 07-12-2011, 10:49 AM   #8
sdarko
Member
 
Location: Bethesda, MD

Join Date: Apr 2009
Posts: 51
Default

Quote:
Originally Posted by chadn737 View Post
When you say they look radically different, what do you mean? Is this before alignment or after alignment?
I'm the bioinformatician working on this.

They looked vastly different.

In the first image I uploaded, I had used the wrong GTF file for the Cufflinks analysis (ucsc_all_known_mRNA, which contains multiple entry names for the same transcript), and that was the cause of much of the disparity. The R^2 value was only 0.60 or so.

After realizing my error, I grabbed the RefSeq GTF file from the UCSC Genome Browser. After using it in Cufflinks, we obtained the second image. The R^2 value for that one is much better at 0.90 or so, but it probably should be a bit better.
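
One way to spot this kind of annotation redundancy ahead of time is to look for transcript IDs that share an identical exon structure. The sketch below assumes a standard 9-column GTF with `exon` features and a `transcript_id` attribute; the function name is hypothetical, and it does not handle a transcript ID that is reused at several loci.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def redundant_transcripts(gtf_lines):
    """Return sorted groups of transcript IDs with identical exon structure."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # Record chromosome, strand, start, end for this exon.
            exons[match.group(1)].append(
                (fields[0], fields[6], int(fields[3]), int(fields[4]))
            )
    # Group transcript IDs by their full, sorted exon structure.
    by_structure = defaultdict(list)
    for tid, coords in exons.items():
        by_structure[tuple(sorted(coords))].append(tid)
    groups = [sorted(ids) for ids in by_structure.values() if len(ids) > 1]
    return sorted(groups)
```

Passing an open file handle, for example `redundant_transcripts(open("annotation.gtf"))` with a placeholder file name, gives the groups of IDs that describe the same structure.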

Sam
Attached Images
File Type: png 43_44_old.png (17.4 KB, 14 views)
File Type: png 43_44_new.png (13.5 KB, 18 views)
Old 07-12-2011, 11:22 AM   #9
eab
Member
 
Location: Maryland

Join Date: May 2011
Posts: 63
Default

As Sam (sdarko) writes, a change in the gtf improved the correlation between duplicate libraries, but we hope the actual correlation is even better. First off, if you look at the right-hand plot from his post, there are a good number of reads stacked up along the axes, meaning that they occurred in only one of the two libraries. Second, of the reads that occurred in both libraries, correlation between libraries is not so close, especially at the middle and lower ranges of abundance.
Old 07-12-2011, 11:26 AM   #10
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

How deep was your sequencing? I almost always find a large number of genes with 1 or 2 reads mapping that may be in one sample but not in the other. Still, even 0.9 seems a bit low for technical replicates. We only do biological replicates, and there we usually get an r2 of around .96 - .97.
Old 07-13-2011, 04:45 AM   #11
sdarko
Member
 
Location: Bethesda, MD

Join Date: Apr 2009
Posts: 51
Default

Quote:
Originally Posted by chadn737 View Post
How deep was your sequencing? I almost always find a large number of genes with 1 or 2 reads mapping that may be in one sample but not in the other. Still, even 0.9 seems a bit low for technical replicates. We only do biological replicates, and there we usually get an r2 of around .96 - .97.
I think that one issue may be that in one "identical" library we have ~ 4 million reads (with ~83% aligning to genome) while in the other "identical" library we have ~1 million reads (with ~71% aligning to genome).

So we have greater than 4x the reads aligning for one library versus the other.

Sam
Old 07-13-2011, 04:51 AM   #12
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 533
Default

Quote:
Originally Posted by sdarko View Post
I think that one issue may be that in one "identical" library we have ~ 4 million reads (with ~83% aligning to genome) while in the other "identical" library we have ~1 million reads (with ~71% aligning to genome).

So we have greater than 4x the reads aligning for one library versus the other.

Sam
That can be a big factor. Since you're a bioinformatician who is presumably much better at programming than I am, can you take random samples of 1M reads from the total 4M, align them, and see how the R^2 looks? How much coverage did you get overall?
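
A minimal sketch of the random subsampling suggested here, using reservoir sampling so the full read set never has to sit in memory. It assumes unwrapped 4-line FASTQ records; the function names and the `seed` parameter are hypothetical. For paired-end data the same record positions would have to be kept from both files, which this sketch does not do.

```python
import random


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples of lines (assumes unwrapped records)."""
    it = iter(lines)
    # zip over the same iterator four times groups consecutive lines;
    # an incomplete trailing record is silently dropped.
    for record in zip(it, it, it, it):
        yield record


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample
```

Drawing several subsets with different seeds, for example `reservoir_sample(read_fastq(handle), 1_000_000, seed=s)`, and aligning each one would show how much of the R^2 gap is explained by depth alone.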
Old 07-13-2011, 05:00 AM   #13
sdarko
Member
 
Location: Bethesda, MD

Join Date: Apr 2009
Posts: 51
Default

Quote:
Originally Posted by Heisman View Post
That can be a big factor. Since you're a bioinformatician who is presumably much better at programming than I am, can you take random samples of 1M reads from the total 4M, align them, and see how the R^2 looks? How much coverage did you get overall?
Taking a random subset is on the agenda for today. Will let you know.
Old 07-13-2011, 09:54 AM   #14
eab
Member
 
Location: Maryland

Join Date: May 2011
Posts: 63
Default

We noticed that many of the species "unique" to one of the two duplicates appear to be ubiquitously expressed genes mapping to loci encompassing several possible transcripts. So there is no way they should have been unique to one of the starting RNA samples. Perhaps a single species is being called one thing from one duplicate library, and something else from the other? Either that, or PCR is so chaotic that it completely loses large numbers of moderately abundant species in a somewhat random fashion? I feel like the field would be aware of that if it were the case, though.

Last edited by eab; 07-13-2011 at 10:00 AM.
Old 07-13-2011, 11:17 AM   #15
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Quote:
Originally Posted by sdarko View Post
I think that one issue may be that in one "identical" library we have ~ 4 million reads (with ~83% aligning to genome) while in the other "identical" library we have ~1 million reads (with ~71% aligning to genome).

So we have greater than 4x the reads aligning for one library versus the other.

Sam
Yeah, that's not very deep, so I would expect a lot more singletons. If you set an arbitrary cutoff and filter out the singletons, I wonder if your r2 will increase.
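
A minimal sketch of that filter-then-correlate check. It assumes each library is available as a dictionary mapping gene name to an expression value (read count or FPKM); the log2 transform, the default cutoff, and the function names are assumptions for illustration, not something stated in the thread.

```python
import math


def pearson_r2(xs, ys):
    """Squared Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def filtered_r2(lib_a, lib_b, min_value=1.0):
    """r2 on log2 values, keeping genes at or above min_value in both libraries.

    min_value must be positive so that the log transform is defined.
    """
    genes = [
        g for g in lib_a
        if g in lib_b and lib_a[g] >= min_value and lib_b[g] >= min_value
    ]
    xs = [math.log2(lib_a[g]) for g in genes]
    ys = [math.log2(lib_b[g]) for g in genes]
    return pearson_r2(xs, ys)
```

Running it at a few cutoffs shows how much of the low correlation comes from the sparsely covered genes.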
Old 07-13-2011, 11:46 AM   #16
eab
Member
 
Location: Maryland

Join Date: May 2011
Posts: 63
Default

Hi chadn737, thanks for your reply - you raise an interesting point about the depth. We need to start with cell numbers in the millions because that's the number of human peripheral blood T cells one needs to generate enough RNA to be in Illumina's recommended range. On the HiSeq it looks like we can expect around 100 million reads per lane. With 10 barcodes per lane, we can expect 10 million reads per sample, which would be a coverage of 10x if we start with 1 million cells, but even less if we start with more material. Do you think this is a big problem with our system? I must admit I struggle to understand the significance of coverage in mRNA seq experiments....
Old 07-13-2011, 12:14 PM   #17
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,237
Default

Hi eab,
The Illumina TruSeq RNA kit is cheap and fast, but I think it is easy to get poor yields for some of the samples in one or two steps and not really notice it. After the final PCR amplification everything looks fine. But is it?

On the other hand, 10 million reads on 200 bp amplicons is only asking for 2 pg of DNA, provided you are willing to blow your whole library in a single lane (mixed in with 9 other libraries) and every amplicon molecule produces a cluster.

That may seem confusing and not very useful, but thinking about a library as a collection of individual amplicon molecules derived from the RNA you started with, rather than a 12 pM solution you load into a flowcell, seems more concrete to me.

--
Phillip
Old 07-13-2011, 02:52 PM   #18
eab
Member
 
Location: Maryland

Join Date: May 2011
Posts: 63
Default

Hi Phillip, I think I follow what you're saying up top - that having a peak on the bioanalyzer at the end of a multistep, high-throughput library prep does not prove that everything went ok at every step. Maybe that bioA peak should be 10x larger than it is, and you lost 90% of your material, and the losses were uneven, resulting in bias. So when two libraries that were supposed to be identical yield divergent sequence data, bias due to uneven losses during library prep is a potential cause. Is that what you're saying?

I don't understand what you're driving at with the rest of your post, though. Are you advocating fewer PCR cycles/no PCR?

Eli
Old 07-14-2011, 04:45 AM   #19
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,237
Default

Hi eab,

What I see as the RNA TruSeq's main problem is that it has no QC that happens prior to the amplification enrichment step. Specifically, the numbers of library molecules could drop below the number of reads generated. If that is the case the data set will be "bottomed out". (Lots of PCR duplicate reads.)

The rest of my post was me taking the contrary position and saying that really it was not that likely your library molecule numbers would end up being that low prior to PCR because 10 million amplicons would be 2 pg of DNA.

To take that another step, what would 2 pg of amplicons look like after 15 cycles of amplification? 2^15 ~= 32,000x. So it would look like 64 ng. I guess if you see less than 100 ng of amplified library for a sample you might begin to worry about your library being bottomed out.
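
The arithmetic above can be reproduced with a short sketch. It assumes an average mass of about 650 g/mol per base pair of double-stranded DNA and perfect doubling at every PCR cycle; the function names and the `efficiency` parameter are hypothetical.

```python
AVOGADRO = 6.022e23      # molecules per mole
BP_MASS_G_PER_MOL = 650  # approximate average mass of one dsDNA base pair


def dsdna_mass_pg(molecules, length_bp):
    """Mass in picograms of a given number of dsDNA molecules."""
    grams = molecules * length_bp * BP_MASS_G_PER_MOL / AVOGADRO
    return grams * 1e12


def amplified_mass_ng(start_pg, cycles, efficiency=1.0):
    """Mass in nanograms after PCR; efficiency=1.0 means perfect doubling."""
    return start_pg * (1 + efficiency) ** cycles / 1000.0


# 10 million 200 bp amplicons come to roughly 2 pg.
start_pg = dsdna_mass_pg(10_000_000, 200)
# 2 pg doubled 15 times comes to roughly 65 ng.
yield_ng = amplified_mass_ng(2.0, 15)
```

Real PCR efficiency is below 1.0, so the actual yield from the same starting material would be lower than this ideal figure.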

Of course you actually have the sequence data, so you can probably tell from it whether your issue results from a surfeit of PCR duplicates. (With some caveats...)
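
A rough sketch of one way to estimate this from the reads themselves: count how many read sequences are exact copies of another read. The function name is hypothetical. One of the caveats is that highly expressed transcripts naturally produce identical reads in RNA-seq, so this figure overestimates true PCR duplication.

```python
from collections import Counter


def duplicate_fraction(sequences):
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence counts once as an original;
    # the remaining reads are duplicates.
    return 1.0 - len(counts) / total
```

Comparing this fraction between the two "identical" libraries at the same read depth would hint at whether one of them bottomed out.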

--
Phillip
Old 07-16-2011, 10:38 AM   #20
eab
Member
 
Location: Maryland

Join Date: May 2011
Posts: 63
Default

Hi Phillip, I think I get you now. Thanks for the clarification.

Picking up on your point about the lack of QC built into TruSeq - do you think that a qPCR using flowcell primers (as in the Kapa library quant kit) would be useful if added just after purifying the ligation reaction, before amplification enrichment? If I did that, how would I use the info? Would I just have to question the sequence data from any library that was surprisingly scant before amplification? If I wind up with only a small amount of library after amplification, doesn't that provide the same info?

Thanks very much for your advice

Eli
All times are GMT -8. The time now is 08:49 AM.

