**Nitrogen-DNE-sulfer** · 11-28-2009, 03:37 PM

The 3' end error is from the random hexamers priming the cDNA synthesis and should not be seen on the 5' ends (of the template not the growing strand). I suspect an alignment issue is confusing the SOLiD strands if you see it on both ends.

The hexamer error is known to Ambion and they are working towards a fix. It's often confused with poor sequencing quality, but genomic samples on the same run don't show the same concentration of errors, and the QVs for many of the RNA errors are very good, supporting the notion that misprimed hexamers create real bases in the templates, which are then properly sequenced. The effect is also seen as the first 6 bases in the more recent reverse reads for the paired-end data.

Poly A capture won't fix it as it uses the same priming approach.

**pmiguel** · 11-30-2009, 04:29 AM

Originally posted by Nitrogen-DNE-sulfer View Post

The 3' end error is from the random hexamers priming the cDNA synthesis and should not be seen on the 5' ends (of the template not the growing strand). I suspect an alignment issue is confusing the SOLiD strands if you see it on both ends.

The hexamer error is known to Ambion and they are working towards a fix. It's often confused with poor sequencing quality, but genomic samples on the same run don't show the same concentration of errors, and the QVs for many of the RNA errors are very good, supporting the notion that misprimed hexamers create real bases in the templates, which are then properly sequenced. The effect is also seen as the first 6 bases in the more recent reverse reads for the paired-end data.

Poly A capture won't fix it as it uses the same priming approach.

If this is the issue, then the fix would be to make sure your inserts are > 55 bases long. That way the F3 read would not reach the pentamer overhang of the P2 adaptor.

(That won't work for a paired-end read, obviously. But there you can trim the first 5 bases of your read, if necessary.)

Still seems weird to me. Why such a high number of amplicons with exactly 50 nt inserts?

--
Phillip

**hingamp** · 11-30-2009, 11:57 PM

Indeed our implementation of the Bio::DB::Sam API was buggy in generating the mismatch histogram (reverse reads were not rotated). Attached is the new, correct histogram, which shows the expected high frequency of mismatches at the beginning of reads, presumably due to random-pentamer RT priming (although this stretch is hexameric rather than pentameric), and then a more typical gradual increase in mismatch frequency as we reach the read end. Observations are now coherent with the SWTAK protocol. Is the maize figure above also SWTAK?

Pascal
PS: these QV-filtered data were obtained from the SOLiD BAM files with a Perl script that uses the Bio::DB::Sam module to scan reads.

Attached Files

Capture-4.jpg (7.8 KB, 48 views)

**pmiguel** · 12-01-2009, 05:15 AM

Originally posted by hingamp View Post

Indeed our implementation of the Bio::DB::Sam API was buggy in generating the mismatch histogram (reverse reads were not rotated). Attached is the new, correct histogram, which shows the expected high frequency of mismatches at the beginning of reads, presumably due to random-pentamer RT priming (although this stretch is hexameric rather than pentameric), and then a more typical gradual increase in mismatch frequency as we reach the read end. Observations are now coherent with the SWTAK protocol. Is the maize figure above also SWTAK?

Pascal
PS: these QV-filtered data were obtained from the SOLiD BAM files with a Perl script that uses the Bio::DB::Sam module to scan reads.

Hi Pascal,
Ah! All is clear now. (See below.)

But first, yes, the maize figure was v3 chemistry data deriving from the SWTAK protocol. However, no quality value screening was done. That figure is just the plot generated by secondary analysis by SETS running on the instrument control cluster against the maize ribosomal repeat unit. (We depleted ribosomal RNA, but some always gets through.)

Also, another possibly non-obvious point, "hexamer instead of pentamer": 5 "base space" (random) nucleotides will affect 6 "color space" bases. This is because each color space base derives from 2 adjacent base space nucleotides. Hence random pentamers have a hexameric "footprint" in color space in your circumstance.

Here is what I think is going on with your data:

SWTAK allows construction of amplicons such that the P1 (sequence priming side) adaptor is adjacent to the 5' end of an RNA (Adaptor Mix A) or the 3' end (Adaptor Mix B). We have always used Adaptor Mix A. I think this is key to why we do not see the high hexameric mismatch signature that you diagram in your histogram.

Your results also explain a mysterious comment made by Session Co-Lead Bob Nutter at the most recent SOLiD User's Summit (Sept 9th-11th 2009). Bob recommended using only "Adaptor Mix A", not "Adaptor Mix B", because the results with Adaptor Mix A were better.

Adaptor Mix B ligates adaptor P1 to the 3' end of RNA fragments. Because sequence is primed out of adaptor P1 (F3 reads), this results in minus strand sequence[1]. But, critical to your results, the first 6 color space bases of these minus-strand reads would derive from the random pentamer used to prime reverse transcription.

So, all is explained if you used Adaptor Mix B--or even equal amounts of Adaptor Mix B and Adaptor Mix A--for construction of your SWTAK amplicons. And it explains Bob Nutter's cryptic recommendation at the 2009 SOLiD User's Summit.

Am I right? Was Adaptor Mix B used?

--
Phillip

[1] By "minus strand" here, I mean that the sequence read represents the (notional) minus strand of the RNA sequenced. I don't mean sequence deriving from the backwards (3' -> 5') ligation chemistry AB is planning to release with v4 (or sooner) to allow "paired end" sequence.

**hingamp** · 12-02-2009, 11:52 PM

Originally posted by pmiguel View Post

Am I right? Was Adaptor Mix B used?

Well spotted! Indeed Mix B was used, because we were particularly interested in examining transcript ends. Not many have seen the shadow of the random pentamers because most users indeed use Mix A (also very clearly recommended by AB in the SWTAK protocol booklet).

This leads me to another question: even though we used Mix B and therefore sequence transcripts from their 3' ends, we have strictly no sign of poly-A tails! We've even looked in raw color space before any alignment is attempted, in case some cleaning was going on somewhere in the pipeline. Where have these poly-A tails gone? Would the RNase III that is used to fragment transcripts have some poly-A specific 3'->5' exonuclease activity? Mystery.

**pmiguel** · 12-03-2009, 04:57 AM

Originally posted by hingamp View Post

Well spotted! Indeed Mix B was used, because we were particularly interested in examining transcript ends. Not many have seen the shadow of the random pentamers because most users indeed use Mix A (also very clearly recommended by AB in the SWTAK protocol booklet).

This leads me to another question: even though we used Mix B and therefore sequence transcripts from their 3' ends, we have strictly no sign of poly-A tails! We've even looked in raw color space before any alignment is attempted, in case some cleaning was going on somewhere in the pipeline. Where have these poly-A tails gone? Would the RNase III that is used to fragment transcripts have some poly-A specific 3'->5' exonuclease activity? Mystery.

Difficult to say. RNase III is classified as a double-stranded RNA endonuclease. So it should not cleave single-stranded RNA at all under "normal" conditions. Word from AB/Lifetech at the User's Summit was that Ambion was deploying conditions under which RNAseIII would cleave single-stranded RNA in the SWTAK kit.

Is it possible that your RNA is 3' modified in a way that would block ligation to the adaptor? (For example, by 3' phosphorylation.)

Ah, wait. I think I know what might be happening. When you say your reads show no sign of polyA tails--what reads, specifically, are you talking about? Pre or post mapping?

My guess is that if you chose a few highly expressed messages and looked into the raw reads for their 3' terminus, you would find poly-adenylated ones. The tail, at best, is probably discarded by the WT mapping pipeline. (At worst the entire read may be discarded.)

Alternatively you could create a "transcriptome" reference set and add polyA tails to the transcripts in the reference (maybe only 10 bases of A, though--otherwise you may trigger repeat masking algorithms, if any). Then do your mapping.

Phillip

**hingamp** · 12-03-2009, 11:14 PM

Originally posted by pmiguel View Post

When you say your reads show no sign of polyA tails--what reads, specifically, are you talking about? Pre or post mapping?

We never see the polyA's, even at pre-mapping stage in the raw data... We've had a quick look at AB's raw data processing to check if any clipping might be going on, but found no reason why polyA tails should be removed.

**pmiguel** · 12-04-2009, 04:11 AM

Originally posted by hingamp View Post

We never see the polyA's, even at pre-mapping stage in the raw data... We've had a quick look at AB's raw data processing to check if any clipping might be going on, but found no reason why polyA tails should be removed.

The other possibility is that reads with polyA tails are just being discarded (not mapped.) How have you looked for polyA's in the pre-mapping stage?

--
Phillip

**hingamp** · 12-07-2009, 08:09 AM

Originally posted by pmiguel View Post

The other possibility is that reads with polyA tails are just being discarded (not mapped.) How have you looked for polyA's in the pre-mapping stage?

We've just looked again in the raw read data (in color space, before mapping). We expect to see more polyT homopolymers than others (because we're using adapters B).

We've counted the reads that start with each possible homopolymer (polyA, polyG, polyC & polyT) and that gives the histograms in the attached figure. Short stretches of polyC and polyG (less than 10bp) are far more frequent than short stretches of polyT and polyA. However, only polyT are seen above 20bp, which is good (they could correspond to our disappeared polyA tails). What is totally unexpected is their absolute frequency: only around 300 reads begin with polyA-mers over 20 in length! Granted we are working with total RNA - not polyA+ purified RNA - but still this is much lower than the number of genes, some of which are individually covered many times that figure... Put another way, 300 polyA tails doesn't account for the number of bona fide mRNA messengers we are seeing in this SOLiD run. It doesn't add up.

Could anyone with SWTAK data (adaptors B only) count the reads beginning with polyT's?

Attached Files

homopoly.jpg (11.6 KB, 45 views)

**pmiguel** · 12-07-2009, 08:49 AM

Originally posted by hingamp View Post

We've just looked again in the raw read data (in color space, before mapping). We expect to see more polyT homopolymers than others (because we're using adapters B).

We've counted the reads that start with each possible homopolymer (polyA, polyG, polyC & polyT) and that gives the histograms in the attached figure. Short stretches of polyC and polyG (less than 10bp) are far more frequent than short stretches of polyT and polyA. However, only polyT are seen above 20bp, which is good (they could correspond to our disappeared polyA tails). What is totally unexpected is their absolute frequency: only around 300 reads begin with polyA-mers over 20 in length! Granted we are working with total RNA - not polyA+ purified RNA - but still this is much lower than the number of genes, some of which are individually covered many times that figure... Put another way, 300 polyA tails doesn't account for the number of bona fide mRNA messengers we are seeing in this SOLiD run. It doesn't add up.

Could anyone with SWTAK data (adaptors B only) count the reads beginning with polyT's?

It is possible that your polyA RNA are 3' modified (eg, phosphorylated). That would block adaptor ligation to a true 3' message end. Then only in cases where RNAse III happened to cut inside the polyA tail, would you see polyA in your sequence.

--
Phillip

**hingamp** · 12-08-2009, 01:15 AM

Originally posted by pmiguel View Post

It is possible that your polyA RNA are 3' modified (eg, phosphorylated). That would block adaptor ligation to a true 3' message end. Then only in cases where RNAse III happened to cut inside the polyA tail, would you see polyA in your sequence.

3' end modification is indeed possible, but random cutting by RNAse III inside polyA tails should be far more frequent than what we observe: this only happened 300 times over 249721009 reads. If we assume conservative estimates that 1% of total RNA is mRNA, and that 1% of mRNA is polyA tails, then we would expect 25000 reads out of the 250M randomly sequenced RNAse III cut molecules to contain 5' end polyTs? Observation is two orders of magnitude lower than conservative estimations.
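For anyone wanting to redo this back-of-envelope estimate with their own run, a small sketch follows. The two 1% fractions are the assumptions stated in the post, not measured values, and the variable names are only illustrative.

```python
# Back-of-envelope estimate of reads expected to start inside a polyA tail.
total_reads = 249_721_009   # reads in the run, from the post
mrna_fraction = 0.01        # assumed: 1% of total RNA is mRNA
polya_fraction = 0.01       # assumed: 1% of mRNA length is polyA tail
observed = 300              # reads seen starting with a long homopolymer

expected = total_reads * mrna_fraction * polya_fraction
shortfall = expected / observed

print(f"expected ~{expected:,.0f} reads, observed {observed}, ratio ~{shortfall:.0f}x")
```

The ratio comes out a little under 100-fold, which is where the "two orders of magnitude" figure comes from.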

**pmiguel** · 12-08-2009, 06:06 AM

Originally posted by hingamp View Post

3' end modification is indeed possible, but random cutting by RNAse III inside polyA tails should be far more frequent than what we observe: this only happened 300 times over 249721009 reads. If we assume conservative estimates that 1% of total RNA is mRNA, and that 1% of mRNA is polyA tails, then we would expect 25000 reads out of the 250M randomly sequenced RNAse III cut molecules to contain 5' end polyTs? Observation is two orders of magnitude lower than conservative estimations.

I see. A few points:

(1) You can't trust conversion of raw SOLiD reads into base space. As you probably know, a base calling error anywhere in the read will likely place you in a different "color frame", resulting in every base from that point on in the read being converted incorrectly.

A polyA tail is a poly "0" tail in color space. To convert it correctly to polyA requires that there be no color space sequencing errors. A single miscall could result in the rest of the polyA tract being called polyG, polyC, or polyT.
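A minimal sketch of that frame-shift effect, assuming the usual two-base encoding (A=0, C=1, G=2, T=3, next base = previous base XOR color). The read and the position of the miscall are hypothetical.

```python
# Decoding color space to base space: each base depends on the previous one,
# so a single wrong color changes every base after it.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def decode(first_base, colors):
    """Convert a list of colors to base space, starting from a known first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BASE[CODE[bases[-1]] ^ color])
    return "".join(bases)

clean = [0] * 12            # a poly "0" tract
miscalled = clean.copy()
miscalled[4] = 2            # one hypothetical color miscall

print(decode("A", clean))       # homopolymer A throughout
print(decode("A", miscalled))   # switches to G after the miscall
```

So a search for long polyA in converted base space can miss tails that are perfectly visible as runs of 0 in color space.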

(2) RNAse III may be biased against polyA, or biased against cutting near the end of an RNA.

(3) Okay, I like this one, but it is somewhat counter-intuitive. So please bear with me:

If, prior to fragmentation, your RNAs are blocked for 3' ligation, then their non-ligatable polyA ends may be effectively depleting your pool of random-pentamer adaptors of those with a "TTTTT" overhang. That is, the "TTTTT" terminated P1-adaptors would spend a large fraction of the ligation time available to them annealed to blocked polyA ends that cannot be ligated to. The result would be that the higher the percentage you had of polyA RNAs, the fewer polyA tails you would sequence!

I especially like this scenario because it fits what I understand the biology to be. That is, the easiest way to degrade RNA is via the 2'-OH->3'-phosphate nucleophilic attack. So most ribonucleases produce degradation products that have 5'-OH and 3'-phosphate ends. (Or 5'-OH and 2',3'-cyclic monophosphate ends.) So, it follows that mRNA may be largely 3'-phosphorylated.

A counter-argument is that the adaptors are probably present in a large molar excess to the RNA ends. This may be the case, but "TTTTT"-terminated P1 adaptors would be only 1/1024th of the full complement of "NNNNN"-terminated adaptors. So achieving such a large molar excess may not be feasible. Also, at a large enough molar excess, the 3' adaptors (P1 in your case) would ligate to themselves, producing adaptor-dimers. So adaptors probably are not present in a large molar excess over RNA ends.

(4) This one is unlikely, but I thought I would throw it in anyway. "NNNNN" may be "IIIII" (all inosines.) Inosine base nucleotides would anneal promiscuously. However I think they would tend to be replicated as "G's". So, unless you are seeing nearly every sequence beginning either with 5 C's or some sequence that will not align to the genome, I think we can rule this one out. Especially since the AB guys seem to strongly dislike the sort of "color balance" issue that this would cause for the first 5 bases. The second satay of every primer would be crazy out of balance if my "five inosines replicates as 5 guanosines conjecture" is correct. I think the SOLiD guys would complain to Ambion and Ambion would change their chemistry.
--
Phillip

**hingamp** · 12-09-2009, 09:33 AM

Originally posted by pmiguel View Post

(1) You can't trust conversion of raw SOLiD reads into base space. As you probably know, a base calling error anywhere in the read will likely place you in a different "color frame", resulting in every base from that point on in the read being converted incorrectly.

A polyA tail is a poly "0" tail in color space. To convert it correctly to polyA requires that there be no color space sequencing errors. A single miscall could result in the rest of the polyA tract being called polyG, polyC, or polyT.

When looking for homopolymer tracts at read beginnings, we used the unconverted color fasta files (so we looked for [1|2|3|4]00000000000... in color space, which corresponds to either TTTTTTTTTTTTT, TGGGGGGGGGG, TCCCCCCCCCC, or TAAAAAAAAAA in base space, given that T is the first deconvolution base). Indeed there can still be miscalls, but these would be balanced across all four homopolymers, instead of the skew towards long polyT's that we do observe?
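A rough sketch of this kind of count, for anyone who wants to repeat it on their own reads. It is not the script used here: it assumes the common csfasta convention of a leading primer base followed by colors 0 to 3, and the function names and the 20-base threshold are only illustrative.

```python
import re
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"
# Primer base, first color, then the run of "0" colors that follows it.
PREFIX = re.compile(r"^([ACGT])([0-3])(0*)")

def leading_homopolymer(cs_read):
    """Return (base, run_length) for the homopolymer that starts a color-space read."""
    match = PREFIX.match(cs_read)
    if match is None:          # e.g. the first color is a missing call "."
        return None
    primer, first_color, zeros = match.groups()
    base = BASE[CODE[primer] ^ int(first_color)]   # first real base of the read
    return base, 1 + len(zeros)                    # each "0" repeats that base

def count_long_runs(reads, min_len=20):
    """Count reads that begin with a homopolymer of at least min_len bases."""
    counts = Counter()
    for read in reads:
        hit = leading_homopolymer(read)
        if hit is not None and hit[1] >= min_len:
            counts[hit[0]] += 1
    return counts

# Hypothetical reads: one long polyT start, one short polyA start, one unreadable.
reads = ["T" + "0" * 25 + "12", "T3000012", "T.0000"]
print(count_long_runs(reads))
```

Working on the colors directly means a miscall only truncates the run instead of changing which base the rest of the tract is reported as.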

Originally posted by pmiguel View Post

(2) RNAse III may be biased against polyA, or biased against cutting near the end of an RNA.

That option still nags me, but RNAse III isn't known for polyA biases. Then again it isn't known to cut single stranded molecules, so it's been tampered with?

Originally posted by pmiguel View Post

(3) Okay, I like this one, but it is somewhat counter-intuitive. So please bear with me:

If, prior to fragmentation, your RNAs are blocked for 3' ligation, then their non-ligatable polyA ends may be effectively depleting your pool of random-pentamer adaptors of those with a "TTTTT" overhang. That is, the "TTTTT" terminated P1-adaptors would spend a large fraction of the ligation time available to them annealed to blocked polyA ends that cannot be ligated to. The result would be that the higher the percentage you had of polyA RNAs, the fewer polyA tails you would sequence!

That's an interesting point, although to explain the observed huge depletion of polyA fragments the adaptors would indeed need to be dangerously close in concentration to the RNAs - which are unlikely conditions for optimal quantitative hybridization?

Originally posted by pmiguel View Post

(4) This one is unlikely, but I thought I would throw it in anyway. "NNNNN" may be "IIIII" (all inosines.) Inosine base nucleotides would anneal promiscuously. However I think they would tend to be replicated as "G's". So, unless you are seeing nearly every sequence beginning either with 5 C's or some sequence that will not align to the genome, I think we can rule this one out.

Most reads align to the reference genome, including at their beginning (apart from the observed higher mismatch frequency). So polyI pentamers seem unlikely.

It would be most interesting if someone else has used SWTAK with adaptor mix B (ie sequencing RNA's from their 3' ends) on total RNA from an eukaryote with polyadenylated transcripts. If they didn't confirm this huge underrepresentation of polyA at read beginnings, we would know it's something with our model or RNA prep...

**pmiguel** · 12-09-2009, 09:45 AM

Originally posted by hingamp View Post

That's an interesting point, although to explain the observed huge depletion of polyA fragments the adaptors would indeed need to be dangerously close in concentration to the RNAs - which are unlikely conditions for optimal quantitative hybridization?

Not really. Remember the "TTTTT" would only be 1/1024th of your total random pentamers whereas a large fraction of your RNA may be poly-adenylated.

Then if (conservatively), after fragmentation, 1% of your RNAs end in polyA, there would be a 10x excess of them over your "TTTTT" pentamers, presuming Ambion uses a 1:1 molar ratio of adaptor to insert. (There is a potential problem with going 10:1 adaptor:insert, so I don't think they did that.)
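The arithmetic behind that 10x figure, as a small sketch. The 1% polyA-end fraction and the 1:1 adaptor-to-insert ratio are the guesses stated above, not kit specifications, and the function name is just for illustration.

```python
def polya_excess(polya_end_fraction=0.01, adaptor_to_insert=1.0, overhang_len=5):
    """Molar excess of polyA RNA ends over adaptors carrying an all-T overhang."""
    ttttt_fraction = 1 / 4 ** overhang_len           # 1/1024 for a random pentamer
    ttttt_per_insert = ttttt_fraction * adaptor_to_insert
    return polya_end_fraction / ttttt_per_insert

print(polya_excess())                        # about 10x at a 1:1 ratio
print(polya_excess(adaptor_to_insert=10.0))  # about 1x at a 10:1 ratio
```

Even at 10:1 adaptor:insert the two pools would only be roughly equimolar under these assumptions.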

--
Phillip

**pmiguel** · 12-09-2009, 11:28 AM

Originally posted by hingamp View Post

That option still nags me, but RNAse III isn't known for polyA biases. Then again it isn't known to cut double stranded molecules, so it's been tampered with?

RNAse III is a double stranded RNA endonuclease.

At the September SOLiD Users Summit, I asked about the SWTAK's using it to fragment (largely) single stranded RNA. A rep (Lifetech or Ambion, I think) said that Ambion had found reaction conditions under which RNAse III would cleave single stranded RNA. They also admitted that it cleaved in a biased fashion.

The sole advantage of using RNAse III to fragment RNA that I can think of is that it is one of the rare endoribonucleases that result in 5'-phosphate and 3'-hydroxyl ends. (RNaseH and RNaseP being the only other two that I'm aware of.) This simplifies the protocol because ligases require 5'-phosphates and 3'-hydroxyl ends.

My understanding is that were a more random method (like MgCl2/heat) used, a T4-PNK step might be necessary after fragmentation to convert the resulting 5'-hydroxyl/3'-phosphate ends into something ligatable. That is an extra step and would likely require a clean-up step afterward. So it would add an hour or two and some yield loss to the protocol.

--
Phillip
