  • Concentration of mismatches close to read ends in "Whole Transcriptome" protocol

    Hi,

When using the Whole Transcriptome protocol on total RNA (i.e. not poly-A+) with random priming of the RT, have others noticed frequent mismatches with the reference sequence close to the ends of the reads (see attached image for an example)? These are presumably due to the random N-mers used to prime the RT? How long are those N-mers?

Also, if anyone reading this has any thoughts on how to treat multiple exactly identical reads in RNA-seq (i.e. count them once if they are PCR artefacts), I have started a specific thread on this seemingly overlooked topic...
    Attached Files

  • #2
    Originally posted by hingamp View Post
    Hi,

When using the Whole Transcriptome protocol on total RNA (i.e. not poly-A+) with random priming of the RT, have others noticed frequent mismatches with the reference sequence close to the ends of the reads (see attached image for an example)? These are presumably due to the random N-mers used to prime the RT? How long are those N-mers?
Pictures of the N-mers show them as 5 bases long. They are also double-stranded over much of their length, but have a 5-base 3' overhang on the 5' adaptor and a 5-base 5' overhang on the 3' adaptor.

The overhangs presumably anneal to the ends of the fragmented RNA, providing just enough double-strandedness to allow the ligase to function optimally. The top strands of the adaptors are ligated to the RNA. One also presumes the bottom strand of the 5' adaptor must be degraded/displaced during first-strand cDNA synthesis. If the above is correct, the only bases that might represent oligomer sequence not derived from the RNA being reverse transcribed would be the 5 bases immediately adjacent to the right-hand adaptor.

So it seems unlikely that the sequence mismatches you are observing derive from the N-mers. There are two other, more likely sources:

(1) Sequence quality: low-quality bases near the end of the read. As you near the end of the reads, the satay plots trend towards beachballs in shape, and your lowest-quality bases will tend to be there.

    and

(2) Library quality: if any of your inserts are shorter than 50 bases, then the 3' end of your sequence will be derived from adaptor sequence. Since this sequence is often too short to recognize unambiguously as adaptor, the SOLiD software does not bother to attempt to locate adaptor sequence. At least the Corona-lite tools do not; I have not looked at the new Bioscope code.

    I attempt to assess library quality by searching for the 5' terminus of the SOLiD RNAseq multiplex adaptor:

    Code:
    #Internal_Adaptor_Multiplex (prefixes 10bp Barcode)
    #C G C C T T G G C C G T A C A G C A G
    #C3 3 0 2 0 1 0 3 0 3 1 3 1 1 2 3 1 2
I search for zero through 42-base inserts among my sequence reads using just the "3020103" terminus.
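A rough sketch of that scan (illustrative Python, not my actual script; the file name is a placeholder): it looks for the "3020103" color-space terminus in each csfasta read and histograms the offset at which it is found, which approximates the insert length.

Code:
from collections import Counter

TERMINUS = "3020103"  # color-space terminus of the internal multiplex adaptor

def insert_length_histogram(csfasta_path, max_insert=42):
    # Histogram the offset of the adaptor terminus within each read;
    # the offset approximates the insert length in color space.
    hist = Counter()
    with open(csfasta_path) as fh:
        for line in fh:
            if line.startswith((">", "#")):
                continue  # skip headers and comment lines
            colors = line.strip()[1:]  # drop the leading primer base, e.g. 'T'
            pos = colors.find(TERMINUS)
            if 0 <= pos <= max_insert:
                hist[pos] += 1
    return hist

for insert, n in sorted(insert_length_histogram("reads.csfasta").items()):
    print(insert, n)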

I think AB should build this check into their software, because issues with library prep will otherwise be blamed on some failing of the SOLiD. And short-insert amplicons are easily corrected by cutting a larger band out of the acrylamide gel during library prep.

    --
    Phillip



    • #3
Check out the samtools rmdup utility.

      PCR duplicates are a common problem in most next gen sequencing results.
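For single-end data the invocation is along the lines of samtools rmdup -s in.bam out.bam. Conceptually it keeps one read per mapping position and strand; below is a simplified pysam sketch of that idea (file names are placeholders, and unlike real rmdup, which keeps the highest-quality duplicate, this keeps the first one seen).

Code:
import pysam

def rmdup_single_end(in_bam, out_bam):
    # Keep one mapped read per (reference, start, strand) key.
    seen = set()
    with pysam.AlignmentFile(in_bam, "rb") as bam, \
         pysam.AlignmentFile(out_bam, "wb", template=bam) as out:
        for read in bam:
            if read.is_unmapped:
                continue
            key = (read.reference_id, read.reference_start, read.is_reverse)
            if key not in seen:
                seen.add(key)
                out.write(read)

rmdup_single_end("mapped.sorted.bam", "dedup.bam")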



      • #4
        Originally posted by Michael.James.Clark View Post
Check out the samtools rmdup utility.

        PCR duplicates are a common problem in most next gen sequencing results.
Again, if you are attempting to do Digital Gene Expression, removing "duplicate" reads would be a mistake. With the SWTAK, it would likely result in undercounting transcripts with secondary structure.

        --
        Phillip



        • #5
In the figures below are the mismatch distributions along SOLiD reads for:
- total RNA whole transcriptome
- classic genomic DNA resequencing
In both cases, only mismatches with reliably high quality scores were counted.

Classic genomic DNA sequencing shows increased errors at the first and last base (presumably because there can be no "two-color" corrections there?).

Concerning our initial question, the whole transcriptome data indeed show that errors are concentrated in well-delimited 6-base stretches at both ends. Since only the total RNA whole transcriptome protocol has these random N-mers in both adaptors, it is tempting to speculate that they might explain the high-error hexamer stretches at both ends (maybe the outermost base of the 6 bp stretch is the "usual" high-error base seen in genomic DNA)?

In any case, the two histograms suggest that the high error rate at the ends of total RNA whole transcriptome reads must be due to a difference in protocol compared to standard genomic DNA sequencing.
          Attached Files
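For anyone wanting to build the same kind of profile, here is a pysam sketch of the approach (not our actual code; it assumes the BAM carries MD tags, and the QV and MAPQ cutoffs are illustrative). Positions of reverse-strand reads are flipped so that they correspond to sequencing cycles.

Code:
import pysam
from collections import Counter

def mismatch_profile(bam_path, min_qv=20, min_mapq=10):
    # Count high-QV mismatches at each read position (sequencing cycle).
    profile = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if (read.is_unmapped or read.mapping_quality < min_mapq
                    or not read.has_tag("MD")):
                continue
            quals = read.query_qualities
            rlen = read.query_length
            for qpos, rpos, refbase in read.get_aligned_pairs(with_seq=True):
                if qpos is None or refbase is None:
                    continue  # skip insertions and deletions
                if refbase.islower() and quals[qpos] >= min_qv:
                    # a lowercase reference base marks a mismatch
                    cycle = rlen - 1 - qpos if read.is_reverse else qpos
                    profile[cycle] += 1
    return profile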



          • #6
            Originally posted by pmiguel View Post
Again, if you are attempting to do Digital Gene Expression, removing "duplicate" reads would be a mistake. With the SWTAK, it would likely result in undercounting transcripts with secondary structure.

            --
            Phillip
But wouldn't leaving in duplicate reads due to PCR artefacts also lead to errors when doing digital gene expression?

These PCR artefacts are a reality, which is why duplicates are usually counted only once in the other read-counting NGS application, ChIP-Seq. Considering the narrower peaks of ChIP-Seq, ignoring duplicate reads is even more drastic because sequence coverage tends to be much higher than in RNA-Seq, but it is still done?...

Leaving them in carries a risk of overestimation; leaving them out carries a risk of underestimation. The best decision depends on how prevalent PCR artefacts really are. Since that is difficult to know, in the end it is a question of whether one is more concerned about false positives or false negatives?



            • #7
              Originally posted by hingamp View Post
In the figures below are the mismatch distributions along SOLiD reads for:
- total RNA whole transcriptome
- classic genomic DNA resequencing
In both cases, only mismatches with reliably high quality scores were counted.

Classic genomic DNA sequencing shows increased errors at the first and last base (presumably because there can be no "two-color" corrections there?).

Concerning our initial question, the whole transcriptome data indeed show that errors are concentrated in well-delimited 6-base stretches at both ends. Since only the total RNA whole transcriptome protocol has these random N-mers in both adaptors, it is tempting to speculate that they might explain the high-error hexamer stretches at both ends (maybe the outermost base of the 6 bp stretch is the "usual" high-error base seen in genomic DNA)?

In any case, the two histograms suggest that the high error rate at the ends of total RNA whole transcriptome reads must be due to a difference in protocol compared to standard genomic DNA sequencing.
I agree that your charts are really suggestive that random hexamers are the culprit here, because you see a pretty significant drop-off in error past six bases in from either end. This effect is compounded by the fact that the last few bases of SOLiD reads are significantly lower quality and more error-prone.

              What does this look like if you use an oligo-dT approach as opposed to random hexamers?

              How about when using the SOLiD whole transcriptome kit and protocol from Ambion?

              Originally posted by hingamp View Post
But wouldn't leaving in duplicate reads due to PCR artefacts also lead to errors when doing digital gene expression?

These PCR artefacts are a reality, which is why duplicates are usually counted only once in the other read-counting NGS application, ChIP-Seq. Considering the narrower peaks of ChIP-Seq, ignoring duplicate reads is even more drastic because sequence coverage tends to be much higher than in RNA-Seq, but it is still done?...

Leaving them in carries a risk of overestimation; leaving them out carries a risk of underestimation. The best decision depends on how prevalent PCR artefacts really are. Since that is difficult to know, in the end it is a question of whether one is more concerned about false positives or false negatives?
Yes, I have to agree with you. PCR duplicates are not informative, so leaving in duplicate reads is risky business, even for digital gene expression, because it is difficult to differentiate between duplicate reads due to PCR and natural duplicate reads (which will be rare, but by no means non-existent).

If you consider that in every application coverage equates to copy number, be it digital gene expression or whole genome sequencing, then removing all duplicate reads causes underestimation of true coverage. However, the assumption is that removing them from the whole data set will give more accurate results, because the occurrence of PCR-duplicate reads is significantly higher than the occurrence of duplicate reads arising by chance.

But I agree strongly with your point that applications with much higher coverage, like ChIP-Seq, will see a higher abundance of reads from PCR duplicates and will thereby be even more affected by them.

Does anyone have any published studies or data to share that prove that removing duplicates from the whole data set and analyzing from there leads to inaccurate results?



              • #8
                Originally posted by hingamp View Post
But wouldn't leaving in duplicate reads due to PCR artefacts also lead to errors when doing digital gene expression?
It might. But if you have 100 reads all starting at the same place, that is real data, artefact or not. Discarding replicates because it is possible they derive from PCR distortion of the original message counts seems invalid to me. Either you trust your assay to give you a correct count or you don't. If not, redesign the assay (say, by eliminating or reducing the PCR steps) so that it gives you an accurate count.

The downstream validation of SOLiD WT by qPCR that is trotted out in AB advertisements shows decent agreement over a large dynamic range. And no removal of reads mapping to the same start site as an earlier read is done in the SOLiD pipeline.

                Originally posted by hingamp View Post
These PCR artefacts are a reality, which is why duplicates are usually counted only once in the other read-counting NGS application, ChIP-Seq. Considering the narrower peaks of ChIP-Seq, ignoring duplicate reads is even more drastic because sequence coverage tends to be much higher than in RNA-Seq, but it is still done?...

Leaving them in carries a risk of overestimation; leaving them out carries a risk of underestimation. The best decision depends on how prevalent PCR artefacts really are. Since that is difficult to know, in the end it is a question of whether one is more concerned about false positives or false negatives?
First, PCR can bias the distribution of a pool of amplicons, but that doesn't mean it did bias the distribution of amplicons substantially in any given sample.

However, if you have a highly expressed gene 1000 nt in length, a methodology that discards any read mapping to the same start position as a previous read will be able to count no more than roughly 950 molecules of that transcript in any given sample (presuming your reads are 50 bases long). That means you are imposing a ceiling on the dynamic range of your assay equal to the length of the transcript you are assaying. Even inside that dynamic range you will be distorting your results massively because, as I mentioned earlier, RNase III has preferences for where it cleaves an RNA. Substantial biases are likely because it is a double-strand-specific RNase.
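To put numbers on that ceiling: with transcript length L = 1000 nt and read length r = 50 nt, there are only L - r + 1 = 951 distinct start positions per strand, so a start-position dedup can never report more than about 950 copies of the transcript, however abundant it is.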

For structural studies, where the relative counts are not germane, by all means discard the reads with duplicate start sites; if for no other reason, you can reduce the amount of data you have to store. But I don't think this is useful for quantification of transcript levels in an RNA pool.

                --
                Phillip



                • #9
                  Originally posted by hingamp View Post
Concerning our initial question, the whole transcriptome data indeed show that errors are concentrated in well-delimited 6-base stretches at both ends. Since only the total RNA whole transcriptome protocol has these random N-mers in both adaptors, it is tempting to speculate that they might explain the high-error hexamer stretches at both ends (maybe the outermost base of the 6 bp stretch is the "usual" high-error base seen in genomic DNA)?
Are you using the SOLiD Whole Transcriptome Analysis Kit to make the amplicons you are sequencing? Your plots are compelling, but they do not make sense for the SWTAK methodology (and please correct me if I have this wrong):

(1) The random pentamers utilized in the SWTAK mediate ligation of the adaptors they are part of to the RNA fragment. The 5' pentamer should be displaced entirely during reverse transcription. I don't see any mechanism by which that pentamer would become part of the amplicon.

(2) The 3' random pentamer would become part of the amplicon, but in most cases one presumes the tag (insert) would be >50 nt in length. The mismatch distribution you show implies that a large percentage of the amplicons contained inserts exactly 50 bases long; otherwise the pentamer would be too far from the P1 sequencing primer to be included in a 50-base read.

                  --
                  Phillip



                  • #10
                    Originally posted by Michael.James.Clark View Post
                    What does this look like if you use an oligo-dT approach as opposed to random hexamers?
                    We have not tried the SOLiD with oligo-dT.

                    Originally posted by Michael.James.Clark View Post
                    How about when using the SOLiD whole transcriptome kit and protocol from Ambion?
The data I discussed here were obtained using the SOLiD whole transcriptome kit and protocol from Ambion (PN 4409491D) with barcoding.

I've uploaded a SOLiD read alignment image at full resolution that shows the typical stretches of high-QV mismatches at both ends (colored positions).



                    • #11
                      Originally posted by pmiguel View Post
It might. But if you have 100 reads all starting at the same place, that is real data, artefact or not. Discarding replicates because it is possible they derive from PCR distortion of the original message counts seems invalid to me. Either you trust your assay to give you a correct count or you don't. If not, redesign the assay (say, by eliminating or reducing the PCR steps) so that it gives you an accurate count.
The suspicion that there are significant PCR artefacts comes from observing lots of read alignments. In this alignment detail, for instance, even though the coverage is very high and most reads appear staggered and evenly distributed, some pileups are clearly visible with not only identical starts and ends, but also identical high-QV mismatches and indels (the mismatches are not SNPs; the genome has been resequenced). In some cases, a weird read group that looks convincingly related (same start, end, indels and mismatches) can be repeated at several times the average transcript coverage.

I agree totally that it would be better to rerun with protocol parameters such that these unwanted artefacts do not pop up. But PCR is at the heart of the protocol, and I don't know of a way to do without it. Cleaning data of manifest artefacts is not optimal, and it is a slippery slope towards data massage, but it is necessary when experimental conditions are such that artefacts are expected. Producing microarrays with 100% perfectly shaped spots is nigh impossible, so weirdly shaped spots or outliers are routinely excluded even though they are indeed data. In fact, quite a bit of SOLiD data is excluded by the image analysis software well before we get a chance to count it, and some read ends are trashed because of unreliable QVs. I have a feeling that, although it would be really welcome, only keeping NGS experiments where 100% of the data passes all quality criteria would leave us with nothing at all?

                      Originally posted by pmiguel View Post
The downstream validation of SOLiD WT by qPCR that is trotted out in AB advertisements shows decent agreement over a large dynamic range. And no removal of reads mapping to the same start site as an earlier read is done in the SOLiD pipeline.
Indeed, we see a drop in dynamic range when we condense identical reads. But if the PCR artefacts are real, then we have to accept and work with this reduced dynamic range.

It may well be true that the SOLiD analysis pipeline doesn't remove duplicates, but then again there are a number of things we prefer to do differently from the SOLiD pipeline. For one, we don't use the SOLiD mapping procedure, because of the significant number of reads that are simply left out (again, real data that is trashed quite silently). The alignments I have shown here were obtained with bfast, and many of the odder ones (notably those with indels) would simply never show up with the SOLiD pipeline. This might allow the SOLiD pipeline not to have to deal with them...

                      Originally posted by pmiguel View Post
However, if you have a highly expressed gene 1000 nt in length, a methodology that discards any read mapping to the same start position as a previous read will be able to count no more than roughly 950 molecules of that transcript in any given sample (presuming your reads are 50 bases long). That means you are imposing a ceiling on the dynamic range of your assay equal to the length of the transcript you are assaying.
Exactly: removing duplicates will result in a maximum count linked to the transcript length. The limit is, however, quite a lot higher than 950 per kb if one accounts for reads of different lengths and only removes reads that also share high-QV mismatches or indels. That limit would then leave ample room for an interesting dynamic range?
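As a sketch of what I mean (illustrative pysam, not our pipeline; QV thresholding of the mismatches is omitted for brevity), one can collapse only those reads that share start, end, strand, CIGAR (indels) and MD tag (mismatch pattern):

Code:
import pysam
from collections import Counter

def collapse_strict_duplicates(bam_path):
    # Group reads by start, end, strand, CIGAR and MD tag; reads are
    # treated as PCR copies only if they agree on all of these.
    groups = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped:
                continue
            md = read.get_tag("MD") if read.has_tag("MD") else ""
            key = (read.reference_id, read.reference_start,
                   read.reference_end, read.is_reverse,
                   read.cigarstring, md)
            groups[key] += 1
    # Counting each group once gives the condensed count; large groups
    # flag likely PCR pileups.
    return len(groups), {k: n for k, n in groups.items() if n > 1}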



                      • #12
                        Originally posted by pmiguel View Post
Are you using the SOLiD Whole Transcriptome Analysis Kit to make the amplicons you are sequencing? Your plots are compelling, but they do not make sense for the SWTAK methodology (and please correct me if I have this wrong):

(1) The random pentamers utilized in the SWTAK mediate ligation of the adaptors they are part of to the RNA fragment. The 5' pentamer should be displaced entirely during reverse transcription. I don't see any mechanism by which that pentamer would become part of the amplicon.

(2) The 3' random pentamer would become part of the amplicon, but in most cases one presumes the tag (insert) would be >50 nt in length. The mismatch distribution you show implies that a large percentage of the amplicons contained inserts exactly 50 bases long; otherwise the pentamer would be too far from the P1 sequencing primer to be included in a 50-base read.

                        --
                        Phillip
I just can't make sense of the high-mismatch stretches at both the 5' and 3' ends. The SWTAK protocol doesn't really account for them (figure attached). I'll try to find out whether any deviations from the SWTAK protocol were attempted in this experiment. Could other users with SWTAK data plot high-QV mismatches along read position?
                        Attached Files



                        • #13
                          Originally posted by hingamp View Post
In the figures below are the mismatch distributions along SOLiD reads for:
- total RNA whole transcriptome
- classic genomic DNA resequencing
In both cases, only mismatches with reliably high quality scores were counted.

Classic genomic DNA sequencing shows increased errors at the first and last base (presumably because there can be no "two-color" corrections there?).

Concerning our initial question, the whole transcriptome data indeed show that errors are concentrated in well-delimited 6-base stretches at both ends. Since only the total RNA whole transcriptome protocol has these random N-mers in both adaptors, it is tempting to speculate that they might explain the high-error hexamer stretches at both ends (maybe the outermost base of the 6 bp stretch is the "usual" high-error base seen in genomic DNA)?

In any case, the two histograms suggest that the high error rate at the ends of total RNA whole transcriptome reads must be due to a difference in protocol compared to standard genomic DNA sequencing.
Given the symmetry of the plot, is it possible that the mismatches come from intron-exon boundaries (assuming that values from the reverse strand are not going in the wrong direction...)?

I would also look at regions with low coverage; if you see identical reads there, the library was likely sequenced too deep. And make sure that the stacks with mismatches are not from reads with low mapping quality.



                          • #14
                            Originally posted by Chipper View Post
Given the symmetry of the plot, is it possible that the mismatches come from intron-exon boundaries (assuming that values from the reverse strand are not going in the wrong direction...)?

I would also look at regions with low coverage; if you see identical reads there, the library was likely sequenced too deep. And make sure that the stacks with mismatches are not from reads with low mapping quality.
Indeed, the histogram symmetry has got us worried: it seems the strand attribute returned by the Bio::DB::Sam alignments is always zero (in which case, once corrected, the histogram will be one-sided, with massive mismatches at one of the read's ends)... We're looking into it at the moment.
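In the meantime, a quick cross-check with another toolkit (a pysam sketch; the file name is a placeholder) is to tally the reverse-strand FLAG bit directly; if the BAM itself is fine, both strands should show up:

Code:
import pysam

fwd = rev = 0
with pysam.AlignmentFile("alignments.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped:
            continue
        if read.is_reverse:  # FLAG bit 0x10 set
            rev += 1
        else:
            fwd += 1
print("forward:", fwd, "reverse:", rev)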

I did indeed look for evidence of duplicates in regions with low coverage: less dramatic, but tentatively still there (look at both sides of this alignment window)? Concerning exon boundaries, our biological model is essentially splice-free.



                            • #15
                              Originally posted by hingamp View Post
I just can't make sense of the high-mismatch stretches at both the 5' and 3' ends. The SWTAK protocol doesn't really account for them (figure attached). I'll try to find out whether any deviations from the SWTAK protocol were attempted in this experiment. Could other users with SWTAK data plot high-QV mismatches along read position?
                              I'm unaware of an AB pipeline that uses the csfasta QV files as part of an analysis. I take it you have your own software for doing this?

By the way, unfortunately most of the images you are attaching are at too low a resolution to see much, so mine may suffer the same fate.

I will attach the error profile for a recent WT maize run we did. Note, however, that no quality filtering was performed on these.

                              --
                              Phillip
                              Attached Files

