
**NextGenSeq** · 04-28-2010, 07:07 AM

I don't think a single data point is enough to say this is a major problem. This could be a poor library construction and have nothing to do with the switch to longer sequence reads.

**BaCh** · 04-28-2010, 07:17 AM

Originally posted by NextGenSeq

I don't think a single data point is enough to say this is a major problem. This could be a poor library construction and have nothing to do with the switch to longer sequence reads.

Oh, I didn't make this clear enough then: these were just examples. Here we go:

all my 36mers (some 15 projects over a time period of 18 months) looked like the example given
all my 75mers look like the example given (more than 30 projects over a period of 9 months)

Would the sample size now be big enough?

B.

**henry.wood** · 04-28-2010, 07:53 AM

This has fascinated me enough to join and contribute rather than just snoop.
I have found a similar thing happening with human samples. The coverage appears to vary according to chromosomal band, which is associated with GC content. This has only happened where I have used really good quality DNA, which isn't normally a problem I have. I've only seen it with 76bp reads. The machine doesn't know after 36bp that you are planning another 40, so I might try re-aligning the troublesome samples just using the first 36bp.
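For anyone who wants to try the same re-alignment test, here is a minimal sketch of trimming every read to its first 36 bases before mapping. It assumes strict four-line FASTQ records (no wrapped sequence lines); the function names `parse_fastq` and `trim_records` are made up for illustration and do not come from any particular tool.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from strict 4-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups the input into records;
    # an incomplete trailing record is silently dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def trim_records(records: Iterable[FastqRecord], keep: int = 36) -> Iterator[FastqRecord]:
    """Keep only the first `keep` bases (and matching qualities) of each read."""
    for header, seq, qual in records:
        yield header, seq[:keep], qual[:keep]


def format_fastq(record: FastqRecord) -> str:
    """Turn a record back into FASTQ text."""
    header, seq, qual = record
    return f"{header}\n{seq}\n+\n{qual}\n"
```

The trimmed records can then be written out with `format_fastq` and fed to whichever aligner was used for the full-length reads, so that read length is the only thing that changes between the two runs.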

**NextGenSeq** · 04-28-2010, 07:54 AM

Is it from a single commercial sequence provider? You might want to see the following article.

Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of GC-biased genomes - PMC

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2664327/

Amplification artifacts introduced during library preparation for the Illumina Genome Analyzer increase the likelihood that an appreciable proportion of these sequences will be duplicates, and cause an uneven distribution of read coverage across the ...

**BaCh** · 04-28-2010, 09:09 AM

Originally posted by henry.wood

The machine doesn't know after 36bp that you are planning another 40, so I might try re-aligning the troublesome samples just using the first 36bp.

I tried exactly this approach this morning ... to no avail. Please tell us if you see something different.

**NextGenSeq** · 04-28-2010, 11:46 AM

I bet the reason is a change made in the DNA shearing protocol by the commercial provider. Previous versions of the Illumina protocol used nebulization. Most people have switched to using Covaris or HydroShear shearing now. Some people use enzymatic shearing (fragmentase). Supposedly Covarising the DNA at 4C does not give GC bias but I haven't seen anyone prove it.

**Torst** · 04-28-2010, 05:57 PM

Originally posted by BaCh

Many of these holes show the infamous GGCxG problem.

Could you please explain more about the GGCxG problem?

We have done tens of runs on a GAI/GAII/GAIIx, and a recent re-sequencing of a high-GC Mycobacterium exhibited zero coverage across very high GC regions - there are literally no reads (depth 0) at those positions. But this one was done in 2009 with 36bp PE.
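One way to check whether such coverage holes track GC content is to tabulate GC fraction against mean depth per window. The sketch below is only an illustration: it assumes the reference is available as a plain string and the per-base depth as a list of the same length (for example, parsed from whatever depth report your mapper produces), and the 100-base default window is an arbitrary choice.

```python
from typing import List, Sequence, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def window_gc_and_depth(
    reference: str, depth: Sequence[int], window: int = 100
) -> List[Tuple[int, float, float]]:
    """Return (start, gc_fraction, mean_depth) for non-overlapping windows."""
    if len(reference) != len(depth):
        raise ValueError("reference and depth must have the same length")
    rows = []
    for start in range(0, len(reference), window):
        chunk = reference[start:start + window]
        chunk_depth = depth[start:start + window]
        rows.append((start, gc_fraction(chunk), sum(chunk_depth) / len(chunk_depth)))
    return rows


# Toy example: an AT-only window with coverage, then a GC-only window without.
rows = window_gc_and_depth("AT" * 5 + "GC" * 5, [10] * 10 + [0] * 10, window=10)
```

Plotting the second column against the third for a 36mer run and a 75mer run of the same organism would show directly whether the depth-0 regions sit at the GC extremes.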

**BaCh** · 04-29-2010, 12:18 AM

Originally posted by NextGenSeq

I bet the reason is a change made in the DNA shearing protocol by the commercial provider. Previous versions of the Illumina protocol used nebulization. Most people have switched to using Covaris or HydroShear shearing now. Some people use enzymatic shearing (fragmentase). Supposedly Covarising the DNA at 4C does not give GC bias but I haven't seen anyone prove it.

I have no idea whether they changed it or not. I think I remember them saying that they don't use enzymatic shearing.

On another note: I just tested the 72mer data that Illumina deposited at the SRA at the end of 2007 for the E. coli MG1655 genome (~50% GC). Regarding coverage, it looks pretty good to me, just like my earlier 36mer projects.

Originally posted by Torst

Could you please explain more about the GGCxG problem?

We have done tens of runs on a GAI/GAII/GAIIx, and a recent re-sequencing of a high-GC Mycobacterium exhibited zero coverage across very high GC regions - there are literally no reads (depth 0) at those positions. But this one was done in 2009 with 36bp PE.

High GC is known to be a problem for Solexa. Regarding GGCxG, have a look at http://chevreux.org/GGCxG_problem.html where I describe in more detail what I see. There's also an outdated guide on how to reproduce it with the MG1655 data from Illumina at the SRA. Outdated because: a) the SRA now has FASTQ files and b) MIRA now knows about GGCxG and clips accordingly.
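For readers who want to see where such motifs fall in their own reference, a rough sketch follows. It assumes that the "x" in GGCxG stands for any single base and that only the strand as given is scanned; both are assumptions for illustration, and the linked page remains the actual description of the problem.

```python
import re
from typing import List

# Lookahead so that overlapping occurrences are all reported.
# Assumption: "x" means any one of A, C, G, T.
GGCXG = re.compile(r"(?=GGC[ACGT]G)")


def find_ggcxg(sequence: str) -> List[int]:
    """Return 0-based start positions of GGCxG motifs on the given strand."""
    return [m.start() for m in GGCXG.finditer(sequence.upper())]


positions = find_ggcxg("AAGGCAGTT")  # motif GGCAG starts at index 2
```

Comparing these positions with the start of each coverage hole is a quick way to see how many of the holes the motif could account for. Scanning the reverse complement as well would be a natural extension.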

Back to my original problem, the low-GC bias: if they did not change the fragmentation protocol (I need to ask them), what else could lead to this bias?

**henry.wood** · 04-29-2010, 12:40 AM

I'd always assumed the problem was due to shearing. We've only really had problems with very good quality DNA, which looks like a tight band on a gel. Once the genomic DNA has a bit of a smear it behaves fine. We're in the process of playing around with different Covaris settings and enzymatic fragmentation, as well as doing all the things you're not supposed to do with DNA (repeated freeze/thaws, leaving it on the bench for a few hours) to see if we can degrade it a little.

**Hamid** · 05-13-2010, 12:24 PM

I sincerely doubt that DNA fragmentation is the cause of the low coverage BaCh is seeing in the 75mer runs. My reasoning is as follows:
1. Several large sequencing centers, such as Sanger, Broad, and JGI, use Covaris AFA extensively for their DNA fragmentation. To date they have sequenced billions of bases but have not reported such an obvious lack of coverage due to sequence bias in their sequencing runs.
2. Switching from 36mer reads to 75mer reads does involve some changes. Have all the changes in the protocol been investigated?
3. Has anyone done a side-by-side comparison of 36mer and 75mer reads from Covaris- and nebulization-processed samples? To me that seems like an obvious experiment to carry out to see whether fragmentation or another aspect of the sample preparation/sequencing is the culprit.

BaCh: which tube type, sample volume, and setting did you use for the fragmentation of your samples using Covaris AFA?
Here is a link to an interesting paper from July 2008 with regards to sequence bias: http://nar.oxfordjournals.org/cgi/reprint/gkn425v1

Thank you

Hamid

**BaCh** · 05-14-2010, 05:23 AM

Originally posted by Hamid

[...]
BaCh: which tube type, sample volume, and setting did you use for the fragmentation of your samples using Covaris AFA?
Here is a link to an interesting paper from July 2008 with regards to sequence bias: http://nar.oxfordjournals.org/cgi/reprint/gkn425v1

As I wrote in my original post: we didn't do any fragmentation or things like that. All our labs do is extract the DNA. I think it's a Qiagen kit, don't ask me which, but the protocol has been unchanged for ... many years. We then send the DNA to the provider(s) and after some time get the reads back as FASTQ files.

Something *has* changed (but what?) and it's a big problem now.

**Hamid** · 05-14-2010, 12:05 PM

Hi BaCh,

I think it is important to know exactly which tubes, what volume, and what settings they are using to fragment your DNA samples. I also think it is equally important to get their protocols for the 36mer reads and the 75mer reads to see what is different.
Without having that information, it will be impossible to identify what is causing the issue you are noticing.

Thank you

Hamid

**Nitrogen-DNE-sulfer** · 05-19-2010, 11:53 AM

GC Bias in ILMN 75mers

We see a similar effect with ILMN 75mers on the GAIIx, also done through a service provider, with human DNA.
DNA sheared with a Covaris and run on SOLiD shows a different effect, so I don't think it's your shearing either. I think it's related to changes in the polymerases and nucleotides, which they have quickly altered to push read lengths out as far as possible without paying much attention to the subtle side effects.

The thread above points to Kozarewa et al. This pertains to amplification bias, which can occur. It is certainly important to know how they amplified the library, but in our experience over-amplifying usually selects or enriches for GC-rich regions... not a depletion like you are seeing.

Kozarewa et al. are consistent with this observation:
"For both the raw and mapped datasets of the 3D7 STD-PF2 sequence data, there is an appreciable shift away from the theoretical shredded data towards higher GC content, indicating severe anti AT bias in the sequences."

We did trim the ILMN 75mers back to 50mers and got better dbSNP concordance after supplementing with 50% more reads to offset the trim (we didn't check CNV results), so I'm surprised your 36mer trimming didn't have a similar positive impact. Did you allow for a comparable number of mismatches (MM), i.e. 75mers with ~7 MM and 36mers with 3 MM, or was there a different MM threshold for each read length?
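To make the arithmetic behind those two points explicit, here is a small sketch. The read count of one million is an arbitrary example, and scaling the mismatch allowance linearly with read length is just one plausible convention, not something any particular aligner prescribes.

```python
def total_bases(n_reads: int, read_len: int) -> int:
    """Total sequenced bases for a given number of reads of fixed length."""
    return n_reads * read_len


def scaled_mismatches(read_len: int, ref_len: int = 75, ref_mismatches: int = 7) -> int:
    """Scale a mismatch allowance linearly with read length (an assumed convention)."""
    return round(read_len * ref_mismatches / ref_len)


# Trimming 75mers to 50mers drops a third of the bases;
# 50% more reads brings the total back to the same number.
full = total_bases(1_000_000, 75)
trimmed_plus_extra = total_bases(1_500_000, 50)

# 7 MM in 75 bases is about 9.3% per base; 3 MM in 36 bases is about 8.3%.
rate_75 = 7 / 75
rate_36 = 3 / 36
```

With these numbers `full` and `trimmed_plus_extra` are equal, and `scaled_mismatches(36)` gives 3, so the 75mer/7 MM and 36mer/3 MM settings mentioned above are roughly equivalent per base.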

Attached is a chart of the 5mer error probabilities from the ILMN 75mers delivered to us by a service provider. It's sorted on Tm, so the GC-rich regions are further to the right of the chart and the AT-rich regions further to the left (2nd attachment). The Dohm et al. reference above suggests this could be due to poor deprotection of Gs, but it may also just be a polymerase artifact.
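For anyone wanting to build a similar x-axis for their own data, here is a sketch of ordering all 5mers by an approximate Tm. It uses the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is an assumption made for illustration; it is not necessarily how the attached chart was computed.

```python
from itertools import product
from typing import List


def wallace_tm(kmer: str) -> int:
    """Approximate melting temperature by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    kmer = kmer.upper()
    at = kmer.count("A") + kmer.count("T")
    gc = kmer.count("G") + kmer.count("C")
    return 2 * at + 4 * gc


def kmers_sorted_by_tm(k: int = 5) -> List[str]:
    """All k-mers over ACGT, AT-rich first and GC-rich last; ties broken alphabetically."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted(kmers, key=lambda s: (wallace_tm(s), s))


ordered = kmers_sorted_by_tm(5)  # 4**5 = 1024 entries
```

Under this rule the ordering for a fixed k is effectively by GC count, so per-5mer error rates plotted in this order run from AT-rich on the left to GC-rich on the right.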

Also attached are the ILMN data overlaid with the SOLiD data from the same genome. This does not represent the same library sequenced on both platforms, but it was as close as we could get. After reviewing Dohm et al. Supplementary Figure 5, I think we are seeing the same thing, and it seems to reinforce your pattern of GGC etc.

Something to think about is to ask your provider if they have a SOLiD. I saw data from AB that they can sequence the ILMN libraries directly on SOLiD so you can tease apart library artifacts from sequencing chemistry artifacts.

Would also recommend Picardi et al. which compares some of the error rates.

Large-scale detection and analysis of RNA editing in grape mtDNA by RNA deep-sequencing - PubMed

http://www.ncbi.nlm.nih.gov/pubmed/20385587?dopt=Abstract

RNA editing is a widespread post-transcriptional molecular phenomenon that can increase proteomic diversity, by modifying the sequence of completely or partially non-functional primary transcripts, through a variety of mechanistically and evolutionarily unrelated pathways. Editing by base substituti …

Harismendy et al. is another comparison paper, but it is rather dated and compares radically different library preps to each other: circularized kilobase, nick-translated libraries vs. fragment libraries.


GC coverage bias: difference between 2008 (36mers) and 2010 (75mers) data
