SEQanswers

Old 11-19-2013, 09:03 PM   #1
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default MiSeq & Bisulphite Sequencing

Hi,

Is anyone else out there using the MiSeq (or HiSeq) platform for paired-end whole-genome bisulphite (aka bisulfite) sequencing? I'd like something to compare my QC reports against, to determine whether they look as expected.

For example (after trimming, etc):
There is a hump of read lengths around 75-100 bp, which drops off before the main peak at the full read length of 260 bp.
The quality distribution is U-shaped from phred-20 to phred-40.
The nucleotide contributions for each base position are bowed.

*Using EpiTect Bisulfite Kit & EpiGnome Methyl-Seq Kit*

Cheers
Old 11-20-2013, 12:13 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You might post a couple of the images just so we can get an idea how bad/good things actually look. I'm doing some WGBS on the HiSeq, so I might have some comparison graphs (I send everything to our core facility for sequencing, so I only have graphs from things such as fastqc).

I suspect that others might also have comparison graphs.
Old 11-20-2013, 01:04 PM   #3
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default QC

Thanks for the reply Devon.

I have attached the three images I was referring to in my original post. They are based on 16.5 million paired-end reads from a single run, after trimming for quality (0.05), length (>30 bp) and adapters, and after duplicate removal. All of this was performed using CLC Genomics.
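For anyone comparing settings between tools: the quality limit of 0.05 quoted above is an error probability, while most other trimmers take a Phred score. The standard relation is Q = -10 * log10(p). Below is a minimal sketch of the conversion; the function names are made up for illustration, and it assumes the limit can be read as a per-base error probability. CLC may apply the limit through a running-sum algorithm rather than as a strict per-base cutoff, so treat the result as a rough guide only.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability to a Phred quality score."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score back to an error probability."""
    return 10.0 ** (-q / 10.0)


# Compare the 0.05 limit with the more familiar Q20 and Q30 thresholds.
for p in (0.05, 0.01, 0.001):
    print(f"p = {p:<6} -> Q{error_prob_to_phred(p):.1f}")
```

On this reading a limit of 0.05 sits near Q13, which is noticeably more permissive than a Q20 cutoff (p = 0.01).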

Is this a similar result to what you're getting on the HiSeq?

Cheers
Attached Images
File Type: jpg Lengths distribution.jpg (30.4 KB, 38 views)
File Type: jpg Quality distribution.jpg (27.6 KB, 33 views)
File Type: jpg Nucleotide contributions.jpg (36.4 KB, 37 views)
Old 11-20-2013, 06:39 PM   #4
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

I'm not familiar with those kits, but wouldn't you expect a much lower % of cytosines in a bisulfite library? I usually get ~1%.

Also could your size distribution graphs be plotting paired end fragment sizes (the right hand peak) and single end reads (left hand side) together? Is your alignment a mixture of paired and unpaired reads?
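A quick way to check the overall cytosine percentage independently of any QC package is to count bases straight from the FASTQ file. The sketch below assumes plain 4-line FASTQ records; the function names and the toy reads are hypothetical.

```python
import io
from collections import Counter
from typing import Dict, Iterable, Iterator, TextIO


def read_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip().upper()


def base_fractions(seqs: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of A, C, G and T over all reads (N is ignored)."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update(seq)
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        return {b: 0.0 for b in "ACGT"}
    return {b: counts[b] / total for b in "ACGT"}


# Toy example: two made-up reads held in memory instead of a real file.
toy_fastq = io.StringIO(
    "@read1\nTTTAGGTTAATTG\n+\nIIIIIIIIIIIII\n"
    "@read2\nATTTTGTAGTCGA\n+\nIIIIIIIIIIIII\n"
)
for base, frac in base_fractions(read_fastq_sequences(toy_fastq)).items():
    print(f"{base}: {frac:.1%}")
```

For paired-end data it may be worth running this separately on each mate file, since the depleted base can differ between read 1 and read 2.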
Old 11-20-2013, 07:06 PM   #5
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default

I do expect a lower CG % in a whole genome bisulphite library, although much greater than ~1% as there are large genomic regions which are hypermethylated in plants (e.g. transposable elements).

The size distribution graph is reporting both paired-end and single-end reads (the latter became single after QC). However, the single-end reads total 82, compared to the paired reads, which total almost 16.5 million... so this shouldn't be noticeable on the graph.

I have attached a further image, which is from a quick, and by no means comprehensive, alignment of the BS reads to the UT genome. It illustrates that the reads which are aligning are certainly converted, which is in agreement with QC using Sanger sequencing of PCR products, post BS conversion, from known methyldesert genes.
Attached Images
File Type: jpg Nucleotide mapping.jpg (39.1 KB, 19 views)
Old 11-21-2013, 12:52 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The fragment size distribution looks a bit weird, though I guess I've never looked at that (usually just look at the bioanalyzer trace beforehand). The nucleotide contribution and quality scores concern me. Do you have a graph showing the quality score distribution as a function of position in the read rather than averaged over the read? I wonder if the wavy nucleotide contributions are simply due to bad quality bases that need to be trimmed. Particularly with BS-seq, quality/adapter trimming is very important.

I'm a bit confused by the "nucleotide mapping" graph. It seems to be normalized to something, but it's unclear what. The "C/G in reference T/A in read" bars look good, but it seems odd that the "A/T in reference A/T in read" aren't similarly high, though I guess I don't know exactly how that graph was made.
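If a per-position quality plot isn't readily available, the numbers behind it are easy to compute by hand. Here is a minimal sketch, assuming 4-line FASTQ records with Phred+33 encoded qualities; the function name and the toy reads are hypothetical.

```python
import io
from typing import List, TextIO


def mean_quality_by_position(handle: TextIO, offset: int = 33) -> List[float]:
    """Return the mean Phred score at each read position of a FASTQ file.

    Reads may differ in length; each position is averaged only over the
    reads that are long enough to cover it.
    """
    sums: List[int] = []
    counts: List[int] = []
    for i, line in enumerate(handle):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, ch in enumerate(line.rstrip("\r\n")):
            if pos == len(sums):
                sums.append(0)
                counts.append(0)
            sums[pos] += ord(ch) - offset
            counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy example with two made-up reads of different lengths.
toy_fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n5+\n")
for pos, q in enumerate(mean_quality_by_position(toy_fastq), start=1):
    print(f"position {pos}: mean Q = {q:.1f}")
```

A mean per position hides the spread that a box plot shows, but a drop towards the 3' end would still be obvious in this output.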
Old 11-25-2013, 05:37 PM   #7
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default

Quote:
Originally Posted by dpryan View Post
The fragment size distribution looks a bit weird, though I guess I've never looked at that (usually just look at the bioanalyzer trace beforehand). The nucleotide contribution and quality scores concern me. Do you have a graph showing the quality score distribution as a function of position in the read rather than averaged over the read? I wonder if the wavy nucleotide contributions are simply due to bad quality bases that need to be trimmed. Particularly with BS-seq, quality/adapter trimming is very important.

I'm a bit confused by the "nucleotide mapping" graph. It seems to be normalized to something, but it's unclear what. The "C/G in reference T/A in read" bars look good, but it seems odd that the "A/T in reference A/T in read" aren't similarly high, though I guess I don't know exactly how that graph was made.
After running FastQC on the exported data, it appears that something is going wrong during the QC in CLC. The Q scores start dropping below 15 within the first 20 bp... This is really strange, considering that in the raw data the Q score doesn't drop below Q30 until after 150 bp, and the data was quality trimmed to Q20.

It appears that adapter trimming is where the problem is occurring, with 91% of all reads getting trimmed and then the strange quality and length distributions appearing. Instead of doing an adapter trim, I tried mapping the reads to the adapters as reference sequences, and this results in only 9 reads mapping to the adapters. I think I'll have to contact CLC and find out what the difference is between "trim adapters" and "map to reference" (keep unmapped reads).
Old 11-26-2013, 01:13 AM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Keep in mind that the fastqc results are on a subset of the file (the first million reads or something like that), so if CLC is using the whole thing then you'll get somewhat different results. If you have a bit of computer savvy, try using trim_galore to adapter and quality trim your reads (it's quite good for bisulfite sequencing data). You could probably then reimport things for alignment.
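For reference, a paired-end trim_galore call can be put together roughly as below. This is only a sketch: the file names and output directory are placeholders, the thresholds are illustrative rather than recommendations for this dataset, and trim_galore (plus cutadapt, which it wraps) must be installed for the command to actually run.

```python
import shlex
from typing import List


def trim_galore_command(
    read1: str,
    read2: str,
    quality: int = 20,
    min_length: int = 20,
    output_dir: str = "trimmed",
) -> List[str]:
    """Build the argument list for a paired-end trim_galore run."""
    return [
        "trim_galore",
        "--paired",                   # keep both mates in sync
        "--quality", str(quality),    # Phred cutoff for 3' quality trimming
        "--length", str(min_length),  # drop reads shorter than this after trimming
        "--output_dir", output_dir,
        read1,
        read2,
    ]


# Placeholder file names; substitute your own FASTQ files.
cmd = trim_galore_command("sample_R1.fastq.gz", "sample_R2.fastq.gz")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if file names contain spaces.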

The whole bisulfite alignment process in CLC is pretty new, so I wouldn't be surprised if they still have bugs. If you're able, you'll generally be better served with the open source stuff than the commercial packages. The latter are easier to use, but less powerful (and often a couple years behind).
Old 11-26-2013, 03:58 PM   #9
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default

Thanks for the heads up on FastQC only doing a subset, I had no idea. As far as I understand CLC uses the whole thing... although I could be wrong. I'll give trim-galore a go and see what the difference is.

I have also found that the issue I was having was due to user error: the match score for finding adapters within the sequence was still at the default of 10, which, for an adapter 58 bp long, meant matches were being found quite easily within the reads. I also found that the quality trimming was allowing, in some cases, up to 20 bp with very low (<14) Q scores at the end of reads. After correcting for this the data is looking far more beautiful - see attached.
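To illustrate the second point, a low-quality 3' tail can be removed with a very simple rule: drop bases from the end of the read until one meets the threshold. The sketch below is that naive trailing trim only. It is not the algorithm used by CLC or trim_galore, the function name is made up, and it assumes Phred+33 encoded quality strings.

```python
from typing import Tuple


def trim_low_quality_tail(
    seq: str, qual: str, min_q: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove bases from the 3' end while their Phred score is below min_q."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: five good bases (Q40, 'I') followed by three poor ones (Q2, '#').
seq, qual = trim_low_quality_tail("ACGTACGT", "IIIII###")
print(seq, qual)
```

Running-sum trimmers behave differently from this: they can also cut away a few good bases that sit between bad ones, so the results will not match base for base.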

I agree with you on bisulphite alignment with CLC, in fact I don't think it's even possible yet? Currently I'm using bismark on our dedicated server and it still takes several hours.

Thanks for all the discussion and advice
Attached Images
File Type: jpg Quality distribution.jpg (30.1 KB, 7 views)
File Type: jpg Lengths distribution.jpg (51.3 KB, 5 views)
Old 11-26-2013, 07:43 PM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Ah, I had thought that you were doing all of this within CLC, good to know that they're still well behind the times. If bismark is taking forever and you have some comfort compiling code, you can try bison, of which I'm the author. It's generally faster, particularly if you have access to a cluster (I'll be releasing a version later today or tomorrow that scales up to more nodes, thus making it MUCH faster).
Old 11-27-2013, 02:43 AM   #11
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by dpryan View Post
Keep in mind that the fastqc results are on a subset of the file (the first million reads or something like that)
Sorry, but that's not true. For all of the plots shown here FastQC will analyse the full file and the results should match between CLC and fastqc. The only place where fastqc samples the file is for the duplicated and overrepresented sequences analysis where it tracks a subset of sequences through the whole file and then extrapolates from that so it doesn't end up holding every sequence (potentially) in memory. All of the quality and composition plots always use the full dataset.
Old 11-27-2013, 02:45 AM   #12
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by EpiBrass View Post
Thanks for the heads up on FastQC only doing a subset, I had no idea.
That's because it's not true. See my other reply later.
Old 11-27-2013, 03:13 AM   #13
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Ah, that was a misunderstanding on my part then, thanks for clarifying Simon.
Old 12-10-2013, 02:45 PM   #14
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default

Hi,

Just to let you all know, I managed to sort out all the trimming issues (thanks for the advice!). I've attached a couple of images so you can see the difference it has made compared with the original ones I posted. If anyone wants some advice on how I finally got there, just let me know. I can even send you the workflow if you're interested in how to trim raw WGBS data from the MiSeq.

Cheers,
Justin
Attached Images
File Type: jpg Quality distribution1.jpg (30.6 KB, 17 views)
File Type: jpg Nucleotide contributions1.jpg (44.8 KB, 20 views)
Old 06-16-2014, 09:43 AM   #15
BFM
Member
 
Location: USA

Join Date: Jun 2014
Posts: 10
Default help

Hey, can you send me the tool you used for trimming? I am analyzing the same kind of data.
Old 06-16-2014, 07:18 PM   #16
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default

Hi BFM,

I used Trimmomatic to trim for adapters and quality (collecting orphaned reads). I imported this into CLC Genomics Server to remove chloroplast-aligning reads and duplicate reads (again collecting orphans). I then exported as .fastq and ran Trimmomatic again to remove reads <20 bp. I also trimmed to a maximum of 200 bp, as I was finding some bias towards the ends of reads.
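The final step (cap reads at 200 bp, then drop anything shorter than 20 bp) corresponds to Trimmomatic's CROP and MINLEN steps. In plain Python the logic looks roughly like the sketch below. It is a single-read illustration only, with a made-up function name and toy records, and it does not handle the pairing and orphan collection mentioned above.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def crop_and_filter(
    records: Iterable[Record], max_len: int = 200, min_len: int = 20
) -> Iterator[Record]:
    """Crop each read to max_len bases, then discard reads shorter than min_len."""
    for name, seq, qual in records:
        seq, qual = seq[:max_len], qual[:max_len]
        if len(seq) >= min_len:
            yield name, seq, qual


# Toy records: one long read, one too-short read, one exactly at the limit.
toy = [
    ("r1", "A" * 250, "I" * 250),
    ("r2", "A" * 10, "I" * 10),
    ("r3", "A" * 20, "I" * 20),
]
for name, seq, _ in crop_and_filter(toy):
    print(name, len(seq))
```

With paired-end data, dropping one mate leaves the other as an orphan, which is why the paired-end mode of a dedicated trimmer is the safer choice for real runs.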

I hope that helps.

Tags
bisulphite, methylation, miseq, whole-genome