SEQanswers


eab 07-12-2011 07:46 AM

Intra-sample variability, Illumina TruSeq mRNA
 
Does anyone know if there is a technical issue during either mRNA library prep or data handling that could cause two libraries prepared from the same cell population to look radically different? We have made multiple libraries from the same sample and the results appear quite discouraging. We're not sure if we're doing something wrong with analysis, or whether there was a problem with sample prep. Anyone know of any common pitfalls that could explain our problem?

mnkyboy 07-12-2011 07:52 AM

I did some QC runs using MAQC UHR and Human Brain with three replicates for TruSeq Whole Transcriptome and the libraries had an R-squared value of .98 or higher when I compared their FPKM as generated by Cufflinks. Granted I did not do the mRNA selection step.

I have also made libraries from the same experimental sample using different methods for ribosomal reduction and when comparing the non-ribosomal FPKM I also get a very high correlation of >0.9.

Were they prepared at different times? What kind of RNA is it, and how does the Bioanalyzer trace look?
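
A minimal sketch of the replicate comparison mentioned above: the squared Pearson correlation of per-gene FPKM between two libraries. The post does not say whether values were log-transformed first, so the log2 transform and the pseudocount are assumptions, and `replicate_r_squared` with its dict inputs (gene ID to FPKM) is a hypothetical helper, not a Cufflinks output format.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


def replicate_r_squared(fpkm_a, fpkm_b, pseudocount=1.0):
    """R^2 of log2(FPKM + pseudocount) over the union of genes.

    Assumption: a gene absent from one library counts as FPKM 0 there.
    """
    genes = sorted(set(fpkm_a) | set(fpkm_b))
    xs = [math.log2(fpkm_a.get(g, 0.0) + pseudocount) for g in genes]
    ys = [math.log2(fpkm_b.get(g, 0.0) + pseudocount) for g in genes]
    return r_squared(xs, ys)
```

Because missing genes are scored as 0, transcripts detected in only one library sit on an axis and pull the value down, which is worth keeping in mind when comparing numbers between labs.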

eab 07-12-2011 08:07 AM

Hey mnkyboy, thanks for the superfast reply! Here are details.

Cells: sorted human naive T cells, approximately 15 million in one tube. Cells were aliquoted into 5 tubes: one (1) tube of 1x10^7, two (2) tubes of 2x10^6, and two (2) tubes of 2x10^5.

Extractions: cells pelleted and lysed in RNAzol RT immediately after aliquoting, then stored at -80 until total RNA extraction. RNA extraction done at same time with same tubes of reagents on all 5 tubes.

Library prep: TruSeq RNA sample prep kit A, all libraries prepared together in a single 96-well plate using high-throughput protocol (with a few minor mods).

Library QC: completed, purified libraries run on bioanalyzer and showed appropriate size peak + a large peak that I took to represent the "bubble form" Illumina describes. Libraries quantified by Kapa qPCR with flowcell primers and SYBR Green reporter.

Clustering: cBot using the TruSeq PE cluster kit v2 - HiSeq.

I did not run the starting RNA on the BioA before library prep. The cells were handled as immaculately as was possible, so I figured that no matter what the BioA gave me for an RIN, I would not be able to improve on it and I needed to just go forward. I have some RNA saved back that I can run now on the BioA, but I would be shocked if differential degradation were the problem.

Any ideas? We're wondering especially about trivial informatics sorts of things that can lead to false differences.....

Thanks!
Eli

chadn737 07-12-2011 08:22 AM

When you say they look radically different, what do you mean? Is this before alignment or after alignment?

mnkyboy 07-12-2011 08:28 AM

That is definitely a head scratcher. How long were your reads? We have found for RNA-seq that if we go over 75 bases we start hitting adapter and our mapping goes awry. Did you multiplex? Was there anything that stuck out across the lanes in your QC? We generally multiplex and spread across the flow cell to reduce any lane variation.

The only other thing that I think could be an issue is if something odd happened during the poly-A selection. One way to check this is to see if you map to any known non-polyadenylated non-coding RNA and see if there are differences across the samples.

chadn737 07-12-2011 08:33 AM

Quote:

Originally Posted by mnkyboy
That is definitely a head scratcher. How long were your reads? We have found for RNA-seq that if we go over 75 bases we start hitting adapter and our mapping goes awry.

This is exactly the problem I had with the TruSeq libraries and I wonder if this is the problem now. We had 100 bp reads and I was only getting ~60% to map. When I would BLAST random reads, the last 25 or so bp often had no match at all and turned out to be adapter sequence. I have heard of other people also having this problem with correct size selection.
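
A rough sketch of how adapter read-through like this can be found and clipped from the 3' end of a read. It assumes the adapter begins with `AGATCGGAAGAGC`, the shared start of the TruSeq adapter sequences as commonly used by trimming tools (check your kit documentation). The function name, minimum overlap, and mismatch rate are illustrative choices, not values from this thread.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumption: common prefix of the TruSeq adapters


def trim_3prime_adapter(read, adapter=ADAPTER, min_overlap=5,
                        max_mismatch_rate=0.1):
    """Return the read with a 3' adapter (full or partial) removed.

    Scans left to right and cuts at the first position where the rest of
    the read matches the start of the adapter within the mismatch budget.
    Substitutions only; indels are not handled.
    """
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:start + len(adapter)]
        allowed = int(len(tail) * max_mismatch_rate)
        mismatches = sum(1 for a, b in zip(tail, adapter) if a != b)
        if mismatches <= allowed:
            return read[:start]
    return read
```

Dedicated trimmers also deal with indels and base qualities; this only illustrates why the last ~25 bases of a 100 bp read from a short insert fail to align until they are removed.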

mnkyboy 07-12-2011 08:37 AM

Quote:

Originally Posted by chadn737 (Post 46244)
This is exactly the problem I had with the TruSeq libraries and I wonder if this is the problem now. We had 100 bp reads and I was only getting ~60% to map. When I would BLAST random reads, the last 25 or so bp often had no match at all and turned out to be adapter sequence. I have heard of other people also having this problem with correct size selection.

Yeah our standard WT or mRNA-seq is now 2x75 bp and then 2x50 if we do FFPE.

sdarko 07-12-2011 09:49 AM

2 Attachment(s)
Quote:

Originally Posted by chadn737 (Post 46242)
When you say they look radically different, what do you mean? Is this before alignment or after alignment?

I'm the bioinformatician working on this.

They looked vastly different.

For the first image I uploaded, I had used the wrong GTF file for the Cufflinks analysis (ucsc_all_known_mRNA, which contains multiple entry names for the same transcript), and that caused much of the disparity. The R^2 value was only 0.60 or so.

After realizing my error, I grabbed the RefSeq GTF file from the UCSC genome browser. After using it in Cufflinks, we obtained the second image. The R^2 value for that one is much better at 0.90 or so, but probably should be a bit better.

Sam

eab 07-12-2011 10:22 AM

As Sam (sdarko) writes, a change in the gtf improved the correlation between duplicate libraries, but we hope the actual correlation is even better. First off, if you look at the right-hand plot from his post, there are a good number of reads stacked up along the axes, meaning that they occurred in only one of the two libraries. Second, of the reads that occurred in both libraries, correlation between libraries is not so close, especially at the middle and lower ranges of abundance.

chadn737 07-12-2011 10:26 AM

How deep was your sequencing? I almost always find a large number of genes with 1 or 2 reads mapping that may be in one sample but not in the other. Still, even 0.9 seems a bit low for technical replicates. We only do biological replicates, and there we usually get an r2 of around .96 - .97.

sdarko 07-13-2011 03:45 AM

Quote:

Originally Posted by chadn737 (Post 46257)
How deep was your sequencing? I almost always find a large number of genes with 1 or 2 reads mapping that may be in one sample but not in the other. Still, even 0.9 seems a bit low for technical replicates. We only do biological replicates, and there we usually get an r2 of around .96 - .97.

I think that one issue may be that in one "identical" library we have ~ 4 million reads (with ~83% aligning to genome) while in the other "identical" library we have ~1 million reads (with ~71% aligning to genome).

So we have greater than 4x the reads aligning for one library versus the other.

Sam

Heisman 07-13-2011 03:51 AM

Quote:

Originally Posted by sdarko (Post 46334)
I think that one issue may be that in one "identical" library we have ~ 4 million reads (with ~83% aligning to genome) while in the other "identical" library we have ~1 million reads (with ~71% aligning to genome).

So we have greater than 4x the reads aligning for one library versus the other.

Sam

That can be a big factor. Since you're a bioinformatician who is presumably much better at programming than I am, can you take random samples of 1M reads from the total 4M, align them, and see how the R^2 looks? How much coverage did you get overall?
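
One way to draw such a random subset is reservoir sampling, which picks a uniform sample in a single pass without holding all 4M reads in memory. This is a sketch under assumptions: plain 4-line FASTQ records, and `read_fastq` / `subsample` are hypothetical helper names.

```python
import random


def read_fastq(lines):
    """Yield 4-line FASTQ records as tuples from an iterable of lines."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def subsample(records, k, seed=0):
    """Reservoir sampling: uniform random sample of up to k records."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir


# Hypothetical usage (file name is a placeholder):
# with open("library_a.fastq") as fh:
#     subset = subsample(read_fastq(fh), 1_000_000, seed=42)
```

For paired-end data, running it with the same seed on both mate files keeps pairs together, since the choices depend only on the record index and the seed, as long as both files list the reads in the same order.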

sdarko 07-13-2011 04:00 AM

Quote:

Originally Posted by Heisman (Post 46336)
That can be a big factor. Since you're a bioinformatician who is presumably much better at programming than I am, can you take random samples of 1M reads from the total 4M, align them, and see how the R^2 looks? How much coverage did you get overall?

Taking a random subset is on the agenda for today. Will let you know.

eab 07-13-2011 08:54 AM

We noticed that many of the species "unique" to one of the two duplicates appear to be ubiquitously expressed genes mapping to loci encompassing several possible transcripts. So there is no way they should have been unique to one of the starting RNA samples. Perhaps a single species is being called one thing from one duplicate library, and something else from the other? Either that, or PCR is so chaotic that it completely loses large numbers of moderately abundant species in a somewhat random fashion? I feel like the field would be aware of that if it were the case, though.

chadn737 07-13-2011 10:17 AM

Quote:

Originally Posted by sdarko (Post 46334)
I think that one issue may be that in one "identical" library we have ~ 4 million reads (with ~83% aligning to genome) while in the other "identical" library we have ~1 million reads (with ~71% aligning to genome).

So we have greater than 4x the reads aligning for one library versus the other.

Sam

Yeah, that's not very deep, so I would expect a lot more singletons. If you set an arbitrary cutoff and filter out the singletons, I wonder if your r2 will increase.
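
A small sketch of that filtering idea: drop genes below a read-count cutoff, then recompute r2 on what is left. The cutoff of 5 reads, the rule that a gene must pass in both libraries, and the log2(count + 1) transform are all assumptions for illustration; the function names are hypothetical.

```python
import math


def on_axis_genes(counts_a, counts_b):
    """Genes with reads in exactly one of the two libraries."""
    genes = set(counts_a) | set(counts_b)
    return sorted(g for g in genes
                  if (counts_a.get(g, 0) > 0) != (counts_b.get(g, 0) > 0))


def filter_low_counts(counts_a, counts_b, min_reads=5):
    """Genes with at least min_reads in both libraries (assumed rule)."""
    genes = set(counts_a) | set(counts_b)
    return sorted(g for g in genes
                  if counts_a.get(g, 0) >= min_reads
                  and counts_b.get(g, 0) >= min_reads)


def r2_on_genes(counts_a, counts_b, genes):
    """Squared Pearson correlation of log2(count + 1) over the given genes."""
    xs = [math.log2(counts_a.get(g, 0) + 1) for g in genes]
    ys = [math.log2(counts_b.get(g, 0) + 1) for g in genes]
    n = len(genes)
    if n < 2:
        raise ValueError("need at least two genes")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy * sxy / (sxx * syy)
```

Comparing `r2_on_genes` over all genes with the value over `filter_low_counts(...)` shows how much of the gap comes from low-count genes. Requiring the cutoff in both libraries removes every on-axis gene by construction, so some increase is expected regardless of library quality.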

eab 07-13-2011 10:46 AM

Hi chadn737, thanks for your reply - you raise an interesting point about the depth. We need to start with cell numbers in the millions because that's the number of human peripheral blood T cells one needs to generate enough RNA to be in Illumina's recommended range. On the HiSeq it looks like we can expect around 100 million reads per lane. With 10 barcodes per lane, we can expect 10 million reads per sample, which would be a coverage of 10x if we start with 1 million cells, but even less if we start with more material. Do you think this is a big problem with our system? I must admit I struggle to understand the significance of coverage in mRNA seq experiments....

pmiguel 07-13-2011 11:14 AM

Hi eab,
The Illumina TruSeq RNA kit is cheap and fast, but I think it is easy to get poor yields for some of the samples in one or two steps and not really notice it. After the final PCR amplification everything looks fine. But is it?

On the other hand, 10 million reads on 200 bp amplicons is only asking for 2 pg of DNA, provided you are willing to blow your whole library in a single lane (mixed in with 9 other libraries) and every amplicon molecule produces a cluster.

That may seem confusing and not very useful, but thinking about a library as a collection of individual amplicon molecules derived from the RNA you started with, rather than a 12 pM solution you load into a flowcell, seems more concrete to me.
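
The 2 pg figure can be checked with a short calculation. This sketch assumes an average of about 650 g/mol per base pair of double-stranded DNA, a common rule-of-thumb value; the function name is hypothetical.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
GRAMS_PER_MOL_PER_BP = 650.0  # assumption: average mass of one dsDNA base pair


def amplicon_mass_pg(n_molecules, length_bp):
    """Mass in picograms of n double-stranded amplicons of the given length."""
    grams = n_molecules * length_bp * GRAMS_PER_MOL_PER_BP / AVOGADRO
    return grams * 1e12


# 10 million amplicons of 200 bp come out at roughly 2 pg.
mass = amplicon_mass_pg(10_000_000, 200)
```

The point is the order of magnitude: one library molecule per read is a tiny amount of DNA compared with what is usually loaded or measured.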

--
Phillip

eab 07-13-2011 01:52 PM

Hi Phillip, I think I follow what you're saying up top - that having a peak on the bioanalyzer at the end of a multistep, high-throughput library prep does not prove that everything went ok at every step. Maybe that bioA peak should be 10x larger than it is, and you lost 90% of your material, and the losses were uneven, resulting in bias. So when two libraries that were supposed to be identical yield divergent sequence data, bias due to uneven losses during library prep is a potential cause. Is that what you're saying?

I don't understand what you're driving at with the rest of your post, though. Are you advocating fewer PCR cycles/no PCR?

Eli

pmiguel 07-14-2011 03:45 AM

Hi eab,

What I see as the RNA TruSeq's main problem is that it has no QC that happens prior to the amplification enrichment step. Specifically, the numbers of library molecules could drop below the number of reads generated. If that is the case the data set will be "bottomed out". (Lots of PCR duplicate reads.)

The rest of my post was me taking the contrary position and saying that really it was not that likely your library molecule numbers would end up being that low prior to PCR because 10 million amplicons would be 2 pg of DNA.

To take that another step, what would 2 pg of amplicons look like after 15 cycles of amplification? 2^15 ~= 32,000x. So it would look like 64 ng. I guess if you see less than 100 ng of amplified library for a sample you might begin to worry about your library being bottomed out.
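
The same arithmetic as a sketch. With perfect doubling every cycle, 15 cycles give 2^15 = 32,768x, so 2 pg becomes about 65.5 ng (the 64 ng above uses the rounded 32,000x). The `efficiency` parameter is an added assumption so that imperfect PCR can be explored; it is not something stated in the post.

```python
def amplified_mass_ng(input_pg, cycles, efficiency=1.0):
    """Expected mass in ng after PCR, assuming a constant per-cycle efficiency.

    efficiency=1.0 means perfect doubling each cycle; real reactions are
    lower and plateau, so treat the result as an upper bound.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return input_pg * (1.0 + efficiency) ** cycles / 1000.0


ideal = amplified_mass_ng(2.0, 15)         # perfect doubling
lossy = amplified_mass_ng(2.0, 15, 0.9)    # hypothetical 90% efficiency
```

Read the other way round: if the final yield is well under what roughly 2 pg of input would give, the pre-PCR library may hold fewer molecules than the reads requested from it.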

Of course you actually have the sequence data, so you can probably tell from it whether your issue results from a surfeit of PCR duplicates. (With some caveats...)
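
A minimal sketch of such a check: count reads that share the same alignment start and strand and report the fraction that are repeats. It assumes alignments have already been reduced to `(chromosome, position, strand)` tuples; parsing the aligner output is left out, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_fraction(alignments):
    """Fraction of reads that repeat an already-seen (chrom, pos, strand)."""
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


reads = [("chr1", 100, "+")] * 3 + [("chr1", 200, "+"), ("chr2", 5, "-")]
fraction = duplicate_fraction(reads)  # 2 of 5 reads are repeats -> 0.4
```

One of the caveats: in RNA-seq, highly expressed transcripts legitimately produce many reads with identical start positions, so this number is an upper bound on PCR duplication rather than a direct measure of it.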

--
Phillip

eab 07-16-2011 09:38 AM

Hi Phillip, I think I get you now. Thanks for the clarification.

Picking up on your point about the lack of QC built into TruSeq - do you think that a qPCR using flowcell primers (as in the Kapa library quant kit) would be useful if added just after purifying the ligation reaction, before amplification enrichment? If I did that, how would I use the info? Would I just have to question the sequence data from any library that was surprisingly scant before amplification? If I wind up with only a small amount of library after amplification, doesn't that provide the same info?

Thanks very much for your advice

Eli


All times are GMT -8.
