  • MiSeq & Bisulphite Sequencing

    Hi,

    Is anyone else out there using the MiSeq (or HiSeq) platform for paired-end whole-genome bisulphite (aka bisulfite) sequencing? I'd like something to compare my QC reports against, to determine whether they look as expected or not.

    For example (after trimming, etc):
    I have a hump of read lengths around 75-100 bp, which drops off before the main peak at the full read length of 260 bp.
    The quality distribution is U-shaped from Phred 20 to Phred 40.
    The nucleotide contributions for each base position are bowed.

    *Using EpiTect Bisulfite Kit & EpiGnome Methyl-Seq Kit*

    Cheers

  • #2
    You might post a couple of the images just so we can get an idea how bad/good things actually look. I'm doing some WGBS on the HiSeq, so I might have some comparison graphs (I send everything to our core facility for sequencing, so I only have graphs from things such as fastqc).

    I suspect that others might also have comparison graphs.

    • #3
      QC

      Thanks for the reply Devon.

      I have attached the three images I was referring to in my original post. They are based on 16.5 million paired-end reads from a single run, after trimming for quality (0.05), length (>30 bp) and adapters, and after duplicate removal. All steps were performed using CLC Genomics.

      Is this a similar result to what you're getting on the HiSeq?

      Cheers
      Attached Files

      • #4
        I'm not familiar with those kits, but wouldn't you expect a much lower % of cytosines in a bisulfite library? I usually get ~1%.

        Also could your size distribution graphs be plotting paired end fragment sizes (the right hand peak) and single end reads (left hand side) together? Is your alignment a mixture of paired and unpaired reads?
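
The cytosine percentage mentioned above is easy to check directly on the trimmed reads. Here is a minimal sketch, assuming the read sequences have already been pulled out of the FASTQ file as plain strings; `base_fractions` and the toy reads are made up for illustration. Depending on the library type and which read of the pair you look at, the depletion may show up in C or in G.

```python
from collections import Counter

def base_fractions(reads):
    """Return the fraction of each base (A, C, G, T, N) across all reads."""
    counts = Counter()
    for seq in reads:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts[base] / total for base in "ACGTN"}

# Hypothetical toy reads, not real data from this thread
toy_reads = ["TTTAGGTTTA", "ATTTGGATTC"]
fractions = base_fractions(toy_reads)
print(f"C fraction: {fractions['C']:.2%}")
```

How low the C fraction should be depends on the genome: heavily methylated genomes (plants, for example) will keep more cytosines than a mostly unmethylated one.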

        • #5
          I do expect a lower CG % in a whole genome bisulphite library, although much greater than ~1% as there are large genomic regions which are hypermethylated in plants (e.g. transposable elements).

          The size distribution graph reports both paired-end and single-end reads (those that became single after QC). However, the single-end reads total 82, compared with almost 16.5 million paired reads, so they shouldn't be noticeable on the graph.

          I have attached a further image, which is from a quick, and by no means comprehensive, alignment of the BS reads to the UT genome. It illustrates that the reads which do align are certainly converted, which agrees with QC by Sanger sequencing of PCR products, post BS conversion, from known methyldesert genes.
          Attached Files
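
For anyone wanting to put a number on "certainly converted", a conversion rate can be estimated by counting reference cytosines that read as T. This is only a rough sketch, assuming gap-free, forward-strand alignments given as (reference, read) string pairs over a region believed to be unmethylated; the function name and the toy pairs are hypothetical.

```python
def conversion_rate(aligned_pairs):
    """Estimate bisulfite conversion from (reference, read) aligned strings.

    Counts reference C positions where the read shows C or T, and
    returns the fraction that read as T. Assumes equal-length,
    gap-free, forward-strand alignments.
    """
    ref_c = 0
    converted = 0
    for ref, read in aligned_pairs:
        if len(ref) != len(read):
            raise ValueError("reference and read must be the same length")
        for r, q in zip(ref.upper(), read.upper()):
            if r == "C" and q in ("C", "T"):
                ref_c += 1
                if q == "T":
                    converted += 1
    return converted / ref_c if ref_c else float("nan")

# Hypothetical toy alignments, not real data
pairs = [("ACGTCCA", "ATGTTCA"), ("CCGA", "TTGA")]
rate = conversion_rate(pairs)
print(f"conversion: {rate:.1%}")
```

In a methylated genome this only gives a lower bound unless the region used is genuinely unmethylated, since retained cytosines may be methylated rather than unconverted.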

          • #6
            The fragment size distribution looks a bit weird, though I guess I've never looked at that (usually just look at the bioanalyzer trace beforehand). The nucleotide contribution and quality scores concern me. Do you have a graph showing the quality score distribution as a function of position in the read rather than averaged over the read? I wonder if the wavy nucleotide contributions are simply due to bad quality bases that need to be trimmed. Particularly with BS-seq, quality/adapter trimming is very important.

            I'm a bit confused by the "nucleotide mapping" graph. It seems to be normalized to something, but it's unclear what. The "C/G in reference T/A in read" bars look good, but it seems odd that the "A/T in reference A/T in read" aren't similarly high, though I guess I don't know exactly how that graph was made.
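
If the QC tool only reports quality averaged over the read, a per-position profile can be computed from the FASTQ quality lines directly. A minimal sketch, assuming Phred+33 encoding and that the quality strings have already been read into memory or are streamed from the file; `per_position_mean_quality` is a made-up helper name.

```python
def per_position_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position.

    quality_strings: iterable of FASTQ quality lines (Phred+33 assumed).
    Reads may differ in length; each position averages only the reads
    long enough to cover it. Keeps running sums, so memory use depends
    on the longest read, not on the number of reads.
    """
    sums, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(sums):
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

# Hypothetical toy quality strings: 'I' = Q40, '5' = Q20, '#' = Q2
toy_quals = ["IIII5", "II5", "I55#"]
profile = per_position_mean_quality(toy_quals)
for pos, q in enumerate(profile, start=1):
    print(pos, round(q, 1))
```

A profile that dips sharply at particular positions points to bases that still need trimming, which an averaged per-read score can hide.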

            • #7
              Originally posted by dpryan
              The fragment size distribution looks a bit weird, though I guess I've never looked at that (usually just look at the bioanalyzer trace beforehand). The nucleotide contribution and quality scores concern me. Do you have a graph showing the quality score distribution as a function of position in the read rather than averaged over the read? I wonder if the wavy nucleotide contributions are simply due to bad quality bases that need to be trimmed. Particularly with BS-seq, quality/adapter trimming is very important.

              I'm a bit confused by the "nucleotide mapping" graph. It seems to be normalized to something, but it's unclear what. The "C/G in reference T/A in read" bars look good, but it seems odd that the "A/T in reference A/T in read" aren't similarly high, though I guess I don't know exactly how that graph was made.
              After running FastQC on the exported data, it appears that something is going wrong during the QC in CLC. The Q scores start dropping below 15 within the first 20 bp... This is really strange, considering that in the raw data the Q score doesn't drop below Q30 until after 150 bp, and the reads were quality trimmed to Q20.

              It appears that adapter trimming is where the problem occurs: 91% of all reads get trimmed, and then the strange quality and distribution appear. Instead of doing an adapter trim, I tried mapping the reads to the adapters as reference sequences, and only 9 reads mapped to the adapters. I think I'll have to contact CLC and find out what the difference is between "trim adapters" and "map to reference" (keep unmapped reads).

              • #8
                Keep in mind that the fastqc results are on a subset of the file (the first million reads or something like that), so if CLC is using the whole thing then you'll get somewhat different results. If you have a bit of computer savvy, try using trim_galore to adapter and quality trim your reads (it's quite good for bisulfite sequencing data). You could probably then reimport things for alignment.
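
As a starting point for the trim_galore suggestion, here is a small sketch that builds a paired-end command line from Python. The file names and output directory are placeholders, and the thresholds simply mirror the Q20 and 30 bp values mentioned earlier in the thread; check `trim_galore --help` for the options available in your version before relying on it.

```python
import shlex

def trim_galore_command(read1, read2, out_dir, min_quality=20, min_length=30):
    """Build a paired-end trim_galore command line as an argument list."""
    return [
        "trim_galore",
        "--paired",                      # treat the two files as read pairs
        "--quality", str(min_quality),   # Phred cutoff for 3' quality trimming
        "--length", str(min_length),     # discard reads shorter than this
        "--output_dir", out_dir,
        read1,
        read2,
    ]

# Placeholder file names, not files from this thread
cmd = trim_galore_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "trimmed")
print(shlex.join(cmd))

# To actually run it (requires trim_galore and cutadapt on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems if file names contain spaces.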

                The whole bisulfite alignment process in CLC is pretty new, so I wouldn't be surprised if they still have bugs. If you're able, you'll generally be better served with the open source stuff than the commercial packages. The latter are easier to use, but less powerful (and often a couple years behind).

                • #9
                  Thanks for the heads up on FastQC only doing a subset, I had no idea. As far as I understand, CLC uses the whole thing... although I could be wrong. I'll give trim_galore a go and see what the difference is.

                  I have also found that the issue I was having was due to user error: the match score for finding adapters within a sequence was still at the default of '10', so for an adapter length of 58 a match was being found quite easily within the reads. I also found the quality trimming was allowing, in some cases, up to 20 bp of very low (<14) Q scores at the end of reads. After correcting for this, the data look far more beautiful - see attached.
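
To get a feel for why a short match requirement against a long adapter fires so often, here is a back-of-the-envelope estimate. It is only a sketch under strong assumptions: a score of 10 is treated as 10 consecutive exact matches (CLC's actual scoring is not modelled here, and trimmers that tolerate mismatches or gaps will hit more often), and bases are treated as independent and equally likely. The 260 bp read length and 58 bp adapter are the figures quoted in this thread; the function name is made up.

```python
def expected_chance_matches(read_len, adapter_len, match_len, alphabet_size=4):
    """Expected number of exact chance matches of length match_len
    between a random read and the adapter, assuming independent,
    equally likely bases."""
    if match_len > min(read_len, adapter_len):
        return 0.0
    # Every (read offset, adapter offset) pair is one possible placement
    placements = (read_len - match_len + 1) * (adapter_len - match_len + 1)
    return placements / alphabet_size ** match_len

four_letter = expected_chance_matches(260, 58, 10)
# Bisulfite-converted reads are close to a three-letter alphabet
three_letter = expected_chance_matches(260, 58, 10, alphabet_size=3)
print(f"4-letter: {four_letter:.4f} expected chance matches per read")
print(f"3-letter: {three_letter:.4f} expected chance matches per read")
```

The point is the trend rather than the exact numbers: the reduced complexity of bisulfite reads makes spurious adapter hits far more likely, so the match threshold needs to scale with the adapter length.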

                  I agree with you on bisulphite alignment with CLC, in fact I don't think it's even possible yet? Currently I'm using bismark on our dedicated server and it still takes several hours.

                  Thanks for all the discussion and advice
                  Attached Files

                  • #10
                    Ah, I had thought that you were doing all of this within CLC, good to know that they're still well behind the times. If bismark is taking forever and you have some comfort compiling code, you can try bison, of which I'm the author. It's generally faster, particularly if you have access to a cluster (I'll be releasing a version later today or tomorrow that scales up to more nodes, thus making it MUCH faster).

                    • #11
                      Originally posted by dpryan
                      Keep in mind that the fastqc results are on a subset of the file (the first million reads or something like that)
                      Sorry, but that's not true. For all of the plots shown here FastQC will analyse the full file and the results should match between CLC and fastqc. The only place where fastqc samples the file is for the duplicated and overrepresented sequences analysis where it tracks a subset of sequences through the whole file and then extrapolates from that so it doesn't end up holding every sequence (potentially) in memory. All of the quality and composition plots always use the full dataset.

                      • #12
                        Originally posted by EpiBrass
                        Thanks for the heads up on FastQC only doing a subset, I had no idea.
                        That's because it's not true. See my other reply later.

                        • #13
                          Ah, that was a misunderstanding on my part then, thanks for clarifying Simon.

                          • #14
                            Hi,

                            Just to let you all know, I managed to sort out all the trimming issues (thanks for the advice!). I've attached a couple of images so you can see the difference it has made since the original ones I posted. If anyone needs/wants some advice on how I finally got there, just let me know. I can even send you the workflow if you're interested in how to trim raw WGBS data from the MiSeq.

                            Cheers,
                            Justin
                            Attached Files

                            • #15
                              help

                              Hey, can you send me what tool you used for trimming? I am analyzing the same kind of data.
