SEQanswers

Old 03-22-2013, 01:49 AM   #1
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17
FastQC on ChIP-Seq library: confusion

Hello everybody,

I am working on my first ever ChIP-Seq experiment (transcription factor binding, HiSeq, 51 bp single-end reads) and at the moment I am looking at my libraries using FastQC.

Among several fails that I could track down the reasons for, the following seems odd to me (pictures are attached):

Under Sequence content across all bases I find that my data seems quite AT rich.
Then, under sequence duplication level I find the duplication is greater than 95%.
As suggested in various posts in this forum, I read up on this following this link:
http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/

Is it likely that in the course of library prep or sequencing a bias was created that I now find as a duplication of lots of AT rich reads?
And if that could be the case, how could I confirm this?

Maybe I should add that my bioinformatics skills are very basic, so at the moment I rely solely on the functions found in FastQC and anything that has a GUI.

Thanks a lot for your input!

Tobias
Attached Images
File Type: jpg duplication_levels copy.jpg (18.9 KB, 42 views)
File Type: jpg per_base_sequence_content copy.jpg (28.5 KB, 43 views)
Old 03-22-2013, 08:30 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 979

What species are you working with and what is the normal GC content of that species' genome? Your Sequence content plot looks perfectly normal if the genome of your species of interest has a GC content of 40%, lots of species do.

Regarding the Sequence duplication plot that may be entirely expected as well. You are doing a ChIP-Seq experiment. How many total sequences did you generate? How big is the genome of your organism? How big is the total target size of your ChIP enrichment? This plot may simply indicate that the target size you were enriching for is not that large and your ChIP enrichment worked very well. If you sequenced very deeply (e.g. 200 million reads) on such a small target you are inevitably going to get a lot of duplicate reads.
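A back-of-the-envelope way to see this: if read start positions are drawn uniformly at random from a target of a given size, the expected duplicate fraction follows from a Poisson approximation. The sketch below is my own rough illustration (the 100 kb target is a made-up figure, and it assumes perfect enrichment), not a number from this thread:

```python
import math

def expected_duplicate_fraction(n_reads, target_positions):
    """Expected fraction of duplicate reads when n_reads start positions
    are drawn uniformly at random from target_positions possible sites
    (Poisson approximation to sampling with replacement)."""
    lam = n_reads / target_positions
    expected_unique = target_positions * (1 - math.exp(-lam))
    return 1 - expected_unique / n_reads

# Very deep sequencing (200 million reads) of a small, perfectly
# enriched target (here 100 kb): essentially every read is a duplicate.
deep_small = expected_duplicate_fraction(200_000_000, 100_000)
print(f"200M reads on a 100 kb target: {deep_small:.2%} duplicates")
```

With numbers like these the duplication level saturates near 100%, which is why a high FastQC duplication figure alone cannot distinguish excellent enrichment from a low-diversity library.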

These plots cannot be properly interpreted without a more thorough understanding of the biology of your system and of the steps carried out to generate your sequence data.

Last edited by kmcarr; 03-23-2013 at 03:51 AM.
Old 03-24-2013, 04:09 PM   #3
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

Quote:
Originally Posted by kmcarr View Post
What species are you working with and what is the normal GC content of that species' genome? Your Sequence content plot looks perfectly normal if the genome of your species of interest has a GC content of 40%, lots of species do.

I am working in mouse.

Quote:
Regarding the Sequence duplication plot that may be entirely expected as well. You are doing a ChIP-Seq experiment. How many total sequences did you generate? How big is the genome of your organism?

The total genome size should be 2,644,093,988 bases. The total number of reads obtained for the data I posted previously is 30,223,517.

Quote:
How big is the total target size of your ChIP enrichment? This plot may simply indicate that the target size you were enriching for is not that large and your ChIP enrichment worked very well. If you sequenced very deeply (e.g. 200 million reads) on such a small target you are inevitably going to get a lot of duplicate reads.


If by target size you are referring to the size the chromatin was fragmented to, then the answer is around 150 bp.

Quote:
These plots cannot be properly interpreted without a more thorough understanding of the biology of your system and what steps were carried out to generate your sequence data.

In this ChIP-Seq experiment, I used a Bioruptor to shear the chromatin of mouse neural stem cells to a size of 150 bp after crosslinking. From the subsequent immunoprecipitation I aimed at 2 biological replicates with a yield of 5 nanograms of dsDNA as determined by a PicoGreen assay. This DNA, as well as input and IgG controls, went into an Illumina TruSeq ChIP Sample Prep Kit; the samples were then evenly pooled into a 4-plex library and sequenced single-end on one lane of a HiSeq 2000 flow cell. The yield from all libraries was between 1,500 and 2,100 Mbases, with 30,000,000 to 42,000,000 reads.
I am currently unsure about how the starting DNA was treated in terms of PCR conditions, as this and the library prep were carried out by a commercial service, but I am about to find out.

Thank you very much again for your help!

Tobias
Old 03-25-2013, 06:31 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 979

Quote:
Originally Posted by Tobikenobi View Post
Quote:
Originally Posted by kmcarr View Post
What species are you working with and what is the normal GC content of that species' genome? Your Sequence content plot looks perfectly normal if the genome of your species of interest has a GC content of 40%, lots of species do.
I am working in mouse.
The %GC of the mouse genome is 41-42%, so your base composition plot looks exactly as you would expect.

Quote:
Quote:
How big is the total target size of your ChIP enrichment?
This plot may simply indicate that the target size you were enriching for is not that large and your ChIP enrichment worked very well. If you sequenced very deeply (e.g. 200 million reads) on such a small target you are inevitably going to get a lot of duplicate reads.
If by target size you are referring to the size the chromatin was fragmented to, then the answer is around 150 bp.
No, the fragment size is not what I was referring to. By target size I mean how many regions your transcription factor binds and what their total length is; that is the target of your enrichment in this ChIP experiment. Is it a general transcription factor or one that is highly specific to a relatively small number of promoters? As a mental exercise, let's imagine that your transcription factor targets 1,000 genes and the binding site size is ~100 bp. This means that your target size is 100,000 bp of DNA.

Now your input was a mouse genome, 2.64 Gbp of DNA. You obtained approximately 1.54 Gbp of DNA sequence data, or < 1X coverage. In an unenriched sample the probability of duplicate reads would be close to 0. Honestly, I am not that familiar with the normal statistics of ChIP enrichment, but it seems to me that your enrichment would have to be off the charts to explain the level of duplication you are seeing by enrichment efficiency alone. I would start to worry that at some point during the ChIP process you ended up with an extremely limiting amount of DNA and subsequent PCR produced a biased, low-diversity sample.
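The expectation for an unenriched sample can be made concrete with a quick calculation from the numbers in this thread (my own rough sketch, assuming uniform sampling of read start positions across the genome):

```python
import math

genome_bp = 2_644_093_988   # mouse genome size quoted in the thread
n_reads = 30_223_517        # reads in the IP library
read_len = 51

# Coverage: ~1.54 Gbp of sequence over a 2.64 Gbp genome, i.e. < 1X
coverage = n_reads * read_len / genome_bp
print(f"coverage: {coverage:.2f}X")            # ~0.58X

# If reads were sampled uniformly from the whole genome, duplicates
# would need identical start positions (Poisson approximation):
lam = n_reads / genome_bp
expected_unique = genome_bp * (1 - math.exp(-lam))
dup_fraction = 1 - expected_unique / n_reads
print(f"expected duplicates without enrichment: {dup_fraction:.2%}")  # well under 1%
```

So a >95% duplication level in a sub-1X library is very hard to explain by sampling alone, which is why PCR-driven loss of diversity is the prime suspect.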

Have you tried mapping the reads to the mouse genome yet to see where they align?
Old 03-25-2013, 04:21 PM   #5
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

Dear kmcarr,

thank you very much for your help.
In fact, I am not looking at a general TF but a rather specific one. People have done FLAG-ChIP-Seq on this factor in human cells and identified about 5,500 target genes. Using that figure, the enrichment target would be about 500,000 bp, I guess.

I need to apologize, I should have probably attached the duplication level for my input control as well. This should not be enriched in any way, right? Even though the graph looks different, the duplication level is >80% here as well.

I have mapped the reads using bowtie and tried to look at them in the UCSC browser. In case of low complexity, should I see regions with high numbers of aligned reads versus regions with few or no aligned reads?
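One way to quantify this after mapping, independent of FastQC, is to count how often each alignment start position occurs. A minimal sketch with made-up toy positions, assuming single-end reads where duplicates share chromosome, start, and strand:

```python
from collections import Counter

def duplicate_fraction(alignments):
    """Fraction of reads whose (chrom, start, strand) tuple has been
    seen before; single-end duplicates share all three fields."""
    counts = Counter(alignments)
    n_reads = len(alignments)
    n_unique = len(counts)
    return 1 - n_unique / n_reads

# Made-up toy data: 4 reads, two stacked at the same position
toy = [("chr1", 100, "+"), ("chr1", 100, "+"),
       ("chr2", 500, "-"), ("chr3", 42, "+")]
print(duplicate_fraction(toy))  # 0.25
```

In practice the (chrom, start, strand) tuples would come from the bowtie alignments rather than a hand-written list.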

Thank you again for your help!
Tobias
Attached Images
File Type: jpg Input duplication_levels.jpg (59.3 KB, 19 views)
Old 03-26-2013, 04:39 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 979

Quote:
Originally Posted by Tobikenobi View Post
I need to apologize, I should have probably attached the duplication level for my input control as well. This should not be enriched in any way, right? Even though the graph looks different, the duplication level is >80% here as well.
For the input control, did you simply sequence some of the starting material, after fragmentation but before any immunoprecipitation? You are saying that this image is NOT from a no-antibody IP control?

If that is the case then there is something significantly wrong with your input DNA. If you are sequencing random mouse genomic DNA and only collecting ~1.5 Gbp of sequence data (< 1X coverage of the genome), there is no way you should be observing read duplication like that. Did you start out with an extremely limiting amount of input DNA? That can lead to a low-diversity library. If you started with an adequate amount of genomic DNA, then something went wrong with the library prep which drastically reduced the diversity of your sample.
Old 03-26-2013, 04:32 PM   #7
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

This is in fact the input control, i.e. fragmented chromatin that was put aside before the immunoprecipitation.
The amount of starting material was indeed limiting in this experiment, as a specific type of neural stem cell was targeted. After discussion with the service facility that provides library construction and sequencing at our institute, we agreed that 5 nanograms of immunoprecipitated, double-stranded DNA should be sufficient starting material for the TruSeq ChIP-Seq Kit. I assume that a similar amount was used for the input. The yield for the input control was 1.9 Gbp and 38 million reads.

I guess the bottom line is that I am looking at libraries with very poor complexity. How could that affect later peak calling?

In the meantime I have used bowtie to map the reads to the mm9 reference and filtered for duplicates. I received the following numbers:

Input: 30,512,219 mapped reads (80%)
IP: 20,586,367 mapped reads (68%)

Thank you very much for your input!

Tobias
Old 03-27-2013, 06:07 AM   #8
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 979

Quote:
Originally Posted by Tobikenobi View Post
I guess the bottom line is that I am looking at libraries with very poor complexity. How could that affect later peak calling?
Clearly the input control doesn't represent the true background (the whole mouse genome); furthermore, you cannot know that the amplification bias in your IP sample was the same as in the control. Given these results I would be skeptical about the validity of any "peaks" observed in your IP sample.
Old 03-28-2013, 09:52 PM   #9
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

That certainly does not make things easier for me.
In any case your help is much appreciated!
Old 04-03-2013, 10:58 AM   #10
silkiechicken
Junior Member
 
Location: Oregon

Join Date: May 2011
Posts: 3

I had a sequence duplication of like 90% once with mouse tissues... to fix it we now do library size selection after adapter ligation. Good luck.
Old 04-03-2013, 04:14 PM   #11
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

Quote:
Originally Posted by silkiechicken View Post
I had a sequence duplication of like 90% once with mouse tissues... to fix it we now do library size selection after adapter ligation. Good luck.
Hi!
Could you please explain that in more detail?
What was the size of your libraries before and after the adapters were ligated and which size did you purify?
How much starting material did you use?

Thank you very much!

Tobias
Old 04-03-2013, 04:33 PM   #12
silkiechicken
Junior Member
 
Location: Oregon

Join Date: May 2011
Posts: 3

So I was doing a ChIP-seq with embryonic tissues dissected from mouse. Samples were fixed and sonicated to fragment sizes between 200-500bp.

These samples were then IP'ed, and we were able to recover about 15 ng of total DNA from about 500 µg of starting chromatin.

Using the Illumina TruSeq kits as described, for the input and ChIP libraries we got low diversity and over 90% duplicate reads, randomly distributed (i.e. not adapter dimers and not from the IP). This was after Bioanalyzer results verified that our resulting product was indeed centered around 275 bp. For the second round, we requested that the gel size selection be done after adapter ligation and amplification. This gave a similar Bioanalyzer result but, when run on the sequencer, only about 10% non-unique reads.

Does that make more sense? It can be rather confusing.
Old 04-03-2013, 04:42 PM   #13
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

That makes it very clear!
Thank you very much for your input!
Old 04-03-2013, 04:56 PM   #14
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17
Actually still confused

Sorry, I think I still don't get it. I just went back to the Illumina TruSeq DNA protocol and, if I understand correctly, the gel excision step there is after adapter ligation. How does this differ from your protocol?
Old 04-03-2013, 05:20 PM   #15
silkiechicken
Junior Member
 
Location: Oregon

Join Date: May 2011
Posts: 3

We did the gel extraction as the very last step, i.e. after ligation and PCR amplification. Our guess is that we lost too much DNA during gel purification, resulting in amplification of only a small subset of our sample.

ETA: We didn't gel extract twice; we just moved it to the very last step.