
  • GC coverage bias: difference between 2008 (36mers) and 2010 (75mers) data

    Dear all,

    I get my data from sequencing providers, i.e., I don't have a lab of my own.

    Having made the switch from 36mers to 75mers last year, I'm finding myself in unexpected trouble. Namely, that when looking at data sets with the same nominal coverage, the coverage variance is now much, much higher with the 75mers than with the old 36mers.

    As an example: one of our workhorses is a 45% GC bacterium which poses no big problems regarding GC content or repetitiveness.

    When I did resequencing projects in the past, 30x coverage with 36mers was enough to ensure no holes were left in the genome. Also, when there were genome duplications, these could be clearly and easily detected.

    Nowadays with 75mers, things got really, really nasty. There is now a very clear coverage bias toward low GC regions. It is so strong that one could think they are duplicated (and they clearly are not). Furthermore, having a 30x coverage is not nearly enough anymore to ensure that the whole genome is covered, there are literally hundreds of holes left open. Many of these holes show the infamous GGCxG problem. To get complete coverage I now need to go to at least 70x or 80x ... but this does not solve the problem of false positive genome duplications.
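One way to put a number on this kind of bias is to correlate per-window GC content with per-window mean read depth. The following is a minimal sketch in Python, not the method behind the attached slides. It assumes the reference is available as a plain string and the per-base depth as a list of the same length; the function names and the window size are hypothetical choices.

```python
from math import sqrt


def gc_fraction(seq):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def window_stats(reference, depth, window=1000):
    """Return (gc_fraction, mean_depth) for each full non-overlapping window."""
    stats = []
    for start in range(0, len(reference) - window + 1, window):
        end = start + window
        mean_depth = sum(depth[start:end]) / window
        stats.append((gc_fraction(reference[start:end]), mean_depth))
    return stats


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Toy data: depth drops as GC content rises.
toy_reference = "AAAA" + "GGCC" + "ATGC"
toy_depth = [10] * 4 + [2] * 4 + [6] * 4
toy_stats = window_stats(toy_reference, toy_depth, window=4)
toy_r = pearson([gc for gc, _ in toy_stats], [d for _, d in toy_stats])
```

A strongly negative correlation on real data would be consistent with a bias toward low-GC regions; a value near zero would suggest that coverage is largely independent of GC content at that window size.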



    I have attached a PDF with two slides which show what things look like.

    Has anyone else made this kind of observation? Any idea what could be the cause ... or what a remedy could be?

    Regards,
    B.
    Attached Files

  • #2
    I don't think a single data point is enough to say this is a major problem. This could be a poor library construction and have nothing to do with the switch to longer sequence reads.



    • #3
      Originally posted by NextGenSeq
      I don't think a single data point is enough to say this is a major problem. This could be a poor library construction and have nothing to do with the switch to longer sequence reads.
      Oh, I didn't make this clear enough then: these were just examples. Here we go:
      • all my 36mers (some 15 projects over a time period of 18 months) looked like the example given
      • all my 75mers look like the example given (more than 30 projects over a period of 9 months)

      Would the sample size now be big enough?

      B.



      • #4
        This has fascinated me enough to join and contribute rather than just snoop.
        I have found a similar thing happening with human samples. The coverage appears to vary according to chromosomal band, which is associated with GC content. This has only happened where I have used really good quality DNA, which isn't normally a problem I have. I've only seen it with 76bp reads. The machine doesn't know after 36bp that you are planning another 40, so I might try re-aligning the troublesome samples just using the first 36bp.
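The re-alignment test suggested above needs the reads cut back to their first 36 bases before mapping. A minimal sketch of that trimming step, assuming plain four-line FASTQ records (no wrapped sequence lines); the function name `trim_fastq` and the default of 36 are hypothetical:

```python
from itertools import islice


def trim_fastq(handle, keep=36):
    """Yield (header, sequence, separator, quality) with reads cut to `keep` bases.

    `handle` is any iterable of lines, e.g. an open FASTQ file.
    Assumes each record occupies exactly four lines.
    """
    lines = iter(handle)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if len(record) < 4:
            break  # end of input (or a truncated trailing record)
        header, seq, separator, qual = record
        # Sequence and quality strings must stay the same length.
        yield header, seq[:keep], separator, qual[:keep]


# Toy record standing in for one 76 bp read.
toy_lines = ["@read1\n", "A" * 76 + "\n", "+\n", "I" * 76 + "\n"]
toy_trimmed = list(trim_fastq(toy_lines))
```

Mapping the trimmed and untrimmed reads with otherwise identical settings then shows whether the later cycles contribute to the uneven coverage.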



        • #5
          Is it from a single commercial sequence provider? You might want to see the following article.



          • #6
            Originally posted by henry.wood
            The machine doesn't know after 36bp that you are planning another 40, so I might try re-aligning the troublesome samples just using the first 36bp.
            I had a try at exactly this approach this morning ... to no avail. Please tell if you see something else.



            • #7
              I bet the reason is a change made in the DNA shearing protocol by the commercial provider. Previous versions of the Illumina protocol used nebulization. Most people have switched to using Covaris or HydroShear shearing now. Some people use enzymatic shearing (fragmentase). Supposedly Covarising the DNA at 4C does not give GC bias but I haven't seen anyone prove it.



              • #8
                Originally posted by BaCh
                Many of these holes show the infamous GGCxG problem.
                Could you please explain more about the GGCxG problem?

                We have done 10's of runs on a GAI/GAII/GAIIx and a recent RE-sequencing of a high GC Mycobacterium exhibited zero coverage across very high GC regions - there are literally no reads (depth 0) at those positions. But this one was done in 2009 with 36bp PE.



                • #9
                  Originally posted by NextGenSeq
                  I bet the reason is a change made in the DNA shearing protocol by the commercial provider. Previous versions of the Illumina protocol used nebulization. Most people have switched to using Covaris or HydroShear shearing now. Some people use enzymatic shearing (fragmentase). Supposedly Covarising the DNA at 4C does not give GC bias but I haven't seen anyone prove it.
                  I have no idea whether they changed it or not. I think I remember them saying that they don't use enzymatic shearing.

                  On another note: I just tested the 72mer data deposited by Illumina end of 2007 at the SRA for the E.coli MG1655 genome (~50%GC), it looks pretty good to me regarding coverage, just like my earlier 36mer projects.

                  Originally posted by Torst
                  Could you please explain more about the GGCxG problem?

                  We have done 10's of runs on a GAI/GAII/GAIIx and a recent RE-sequencing of a high GC Mycobacterium exhibited zero coverage across very high GC regions - there are literally no reads (depth 0) at those positions. But this one was done in 2009 with 36bp PE.
                  High-GC is known to be a problem for Solexa. Regarding GGCxG, have a look at http://chevreux.org/GGCxG_problem.html where I describe in more detail what I see. There's also an outdated guide on how to reproduce it with the MG1655 data from Illumina at the SRA. Outdated because: a) the SRA now has FASTQ files and b) MIRA now knows about GGCxG and clips accordingly.
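To check whether coverage holes coincide with the motif, its positions in a reference can be listed with a short script. This is a minimal sketch and not how MIRA detects or clips the motif; it assumes that the "x" in GGCxG stands for any single base and scans the forward strand only. The names `GGCXG` and `find_ggcxg` are hypothetical.

```python
import re

# Zero-width lookahead so that overlapping occurrences are all reported.
# Assumption: "x" in GGCxG means any one of A, C, G, T.
GGCXG = re.compile(r"(?=GGC[ACGT]G)")


def find_ggcxg(seq):
    """Return 0-based start positions of GGCxG on the given strand."""
    return [match.start() for match in GGCXG.finditer(seq.upper())]


# Toy sequence with two overlapping occurrences (they share the G at index 4).
toy_hits = find_ggcxg("GGCAGGCTG")
```

Running the same function on the reverse complement of the reference covers reads from the other strand.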


                  Back to my original problem, the bias toward low-GC regions: in case they did not change the fragmentation protocol (I need to ask them), what else could lead to this bias?



                  • #10
                    I'd always assumed the problem was due to shearing. We've only really had problems with very good quality DNA which looks like a tight band on a gel. Once the genomic DNA has a bit of a smear it behaves fine. We're in the process of playing around with different covaris settings, enzyme fragmentation, as well as doing all the things which you're not supposed to do with DNA - repeated freeze/thaws, leaving it on the bench for a few hours, to see if we can degrade it a little.



                    • #11
                      I sincerely doubt that DNA fragmentation is the cause for low coverage being noticed in the 75mer runs by BaCh. My reasoning is as follows:
                      1. There are several large sequencing centers such as Sanger, Broad, and JGI who extensively use Covaris AFA for their DNA fragmentation. To date they have sequenced billions of bases but have not reported seeing such an obvious coverage lack due to sequence bias in their sequencing runs.
                      2. Switching from 36mer reads to 75mer reads does involve some changes. Have all the changes in the protocol been investigated?
                      3. Has anyone done a side-by-side comparison of 36mer and 75mer reads from Covaris-processed and nebulization-processed samples? To me that would seem like an obvious experiment to carry out to see if fragmentation or another aspect of the sample preparation/sequencing is the culprit.

                      BaCh: which tube type, sample volume, and setting did you use for the fragmentation of your samples using Covaris AFA?
                      Here is a link to an interesting paper from July 2008 with regards to sequence bias: http://nar.oxfordjournals.org/cgi/reprint/gkn425v1

                      Thank you

                      Hamid
                      Last edited by Hamid; 05-13-2010, 12:51 PM.



                      • #12
                        Originally posted by hamid
                        [...]
                        BaCh: which tube type, sample volume, and setting did you use for the fragmentation of your samples using Covaris AFA?
                        Here is a link to an interesting paper from July 2008 with regards to sequence bias: http://nar.oxfordjournals.org/cgi/reprint/gkn425v1
                        As I wrote in my original post: we didn't do any fragmentation or things like that. All our labs do is to extract the DNA. I think it's a Qiagen kit, don't ask me which, but it's an unchanged protocol since ... many years. We then send the DNA to the provider(s) and after some time get the reads back as FASTQ file.

                        Something *has* changed (but what?) and it's a big problem now.



                        • #13
                          Hi BaCh,

                          I think it is important to know exactly which tubes, what volume, and what settings they are using to fragment your DNA samples. I also think it is equally important to get their protocols for the 36mer reads and the 75mer reads to see what is different.
                          Without having that information, it will be impossible to identify what is causing the issue you are noticing.

                          Thank you

                          Hamid



                          • #14
                            GC Bias in ILMN 75mers

                            We see a similar effect with ILMN 75mers on GAIIx. Also done through a service provider with Human DNA.
                            DNA sheared with a Covaris and run on SOLiD shows a different effect, so I don't think it's your shearing either. I think it's related to changes in the polymerases and nucleotides, which they have quickly altered to push the read lengths out as far as possible without much attention paid to the subtle side effects.

                            The thread above points to Kozarewa et al. This pertains to amplification bias which can occur. Certainly important to know how they amplified the library but in our experience we usually see a selection for GC rich regions or an enrichment for GC rich regions when over amplifying... not a depletion like you are seeing.

                            Kozarewa et al are consistent with this observation
                            "For both the raw and mapped datasets of the 3D7 STD-PF2 sequence data, there is an appreciable shift away from the theoretical shredded data towards higher GC content, indicating severe anti AT bias in the sequences."

                            We did trim the ILMN 75mers back to 50mers and got better dbSNP concordance after supplementing with 50% more reads to offset the trim (didn't check CNV results), so I'm surprised your 36mer trimming didn't have a similar positive impact. Did you allow for the same number of mismatches (MM)? I.e., 75mers with ~7MM and 36mers with 3MM, or was there a different MM threshold for each read length?
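The arithmetic behind those numbers is easy to check. Trimming 75mers to 50mers removes a third of the bases, so 50% more reads restores the same total (1.5 × 50 = 75), and 7 mismatches in 75 bases is a slightly more permissive per-base rate than 3 in 36. A small sketch of these calculations; the function names and the read count and genome size in the example are hypothetical:

```python
def nominal_coverage(n_reads, read_len, genome_size):
    """Average depth if reads were spread evenly over the genome."""
    return n_reads * read_len / genome_size


def read_multiplier(old_len, new_len):
    """Factor by which the read count must grow to keep total bases constant."""
    return old_len / new_len


def mismatch_rate(max_mismatches, read_len):
    """Allowed mismatches per base for a given alignment threshold."""
    return max_mismatches / read_len


# Hypothetical example: 2 million 75mers on a 5 Mb genome give 30x.
example_cov = nominal_coverage(2_000_000, 75, 5_000_000)
extra_reads = read_multiplier(75, 50)   # 1.5, i.e. 50% more reads
rate_75 = mismatch_rate(7, 75)          # about 0.093 per base
rate_36 = mismatch_rate(3, 36)          # about 0.083 per base
```

If the per-base rates differ between the two alignments, part of any coverage difference may come from the mapping stringency rather than from the reads themselves.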

                            Attached is a chart of the 5mer error probabilities from the ILMN 75mers delivered to us from a service provider. It's sorted on Tm, so the GC-rich regions are further to the right of the chart and the AT regions further to the left (2nd attachment). The Dohm et al reference above suggests this could be due to poor deprotection of Gs, but it may also just be a polymerase artifact?

                            Also attached are the ILMN data overlaid with the SOLiD data from the same genome. This does not represent the same library sequenced on both platforms but it was as close as we could get. After reviewing the Dohm et al Supplement Figure 5, I think we are seeing the same thing and it seems to reinforce your pattern of GGC etc.

                            Something to think about is to ask your provider if they have a SOLiD. I saw data from AB that they can sequence the ILMN libraries directly on SOLiD so you can tease apart library artifacts from sequencing chemistry artifacts.

                            Would also recommend Picardi et al. which compares some of the error rates.
                            RNA editing is a widespread post-transcriptional molecular phenomenon that can increase proteomic diversity, by modifying the sequence of completely or partially non-functional primary transcripts, through a variety of mechanistically and evolutionarily unrelated pathways. Editing by base substituti …


                            Harismendy et al is another comparison paper but is rather dated and compares radically different library preps to each other. Circularized kilobase, Nick Translated libraries vs Fragment libraries.
                            Attached Files
