#1
Member | Location: Japan | Join Date: Mar 2013 | Posts: 17
Hello everybody,
I am working on my first ever ChIP-Seq experiment (transcription factor binding, sequenced on a HiSeq with 51 bp single-end reads), and at the moment I am looking at my libraries with FastQC. Among several fails whose reasons I could track down, the following seems odd to me (pictures are attached): under "Sequence content across all bases" I find that my data seems quite AT-rich. Then, under "Sequence duplication levels" I find a duplication level greater than 95%. As suggested in various posts in this forum, I read up on this here: http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/

Is it likely that during library prep or sequencing a bias was created that I now see as duplication of lots of AT-rich reads? And if that could be the case, how could I confirm it? Maybe I should add that my bioinformatics level is very low, so at the moment I rely solely on the functions found in FastQC and anything that has a GUI.

Thanks a lot for your input!

Tobias
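For a quick cross-check outside of FastQC, a minimal Python sketch like the one below (standard library only; the gzipped FASTQ file name is just a placeholder, and it keeps every distinct read sequence in memory, so it needs a few GB of RAM for ~30 million 51 bp reads) tallies the overall GC content and the fraction of exact-duplicate reads:

Code:
import gzip
from collections import Counter

fastq_path = "sample.fastq.gz"   # placeholder; point this at one of your libraries

gc = at = 0
read_counts = Counter()

with gzip.open(fastq_path, "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:                       # FASTQ sequence lines only
            seq = line.strip().upper()
            gc += seq.count("G") + seq.count("C")
            at += seq.count("A") + seq.count("T")
            read_counts[seq] += 1

total_reads = sum(read_counts.values())
duplicates = total_reads - len(read_counts)

print(f"Overall GC content: {100 * gc / (gc + at):.1f}%")
print(f"Exact-duplicate reads: {100 * duplicates / total_reads:.1f}%")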
#2
kmcarr — Senior Member | Location: USA, Midwest | Join Date: May 2008 | Posts: 1,143
What species are you working with, and what is the normal GC content of that species' genome? Your Sequence content plot looks perfectly normal if the genome of your species of interest has a GC content of around 40%; lots of species do.

Regarding the Sequence duplication plot, that may be entirely expected as well. You are doing a ChIP-Seq experiment. How many total sequences did you generate? How big is the genome of your organism? How big is the total target size of your ChIP enrichment? This plot may simply indicate that the target size you were enriching for is not that large and your ChIP enrichment worked very well. If you sequenced very deeply (e.g. 200 million reads) on such a small target you are inevitably going to get a lot of duplicate reads.

These plots cannot be properly interpreted without a more thorough understanding of the biology of your system and what steps were carried out to generate your sequence data.
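As a rough illustration of that last point, here is a back-of-the-envelope sketch. It assumes read start positions are drawn uniformly at random from the enriched target and counts any repeated start position as a duplicate; the read count and target sizes are purely illustrative. Duplication climbs very quickly once the target is small relative to the read count:

Code:
import math

def expected_duplicate_fraction(n_reads, target_bp):
    """Expected duplicate fraction when n_reads start positions are drawn
    uniformly at random from target_bp possible positions (Poisson model)."""
    expected_unique = target_bp * (1.0 - math.exp(-n_reads / target_bp))
    return 1.0 - expected_unique / n_reads

# Illustrative target sizes, from "whole genome" down to a small enriched target
for target_bp in (2_600_000_000, 10_000_000, 500_000):
    frac = expected_duplicate_fraction(200_000_000, target_bp)
    print(f"200M reads over {target_bp:>13,} bp -> ~{frac:.1%} duplicates")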
#3
Member | Location: Japan | Join Date: Mar 2013 | Posts: 17
Quote: Originally Posted by kmcarr
What species are you working with, and what is the normal GC content of that species' genome?

I am working in mouse.

Quote: Originally Posted by kmcarr
How many total sequences did you generate? How big is the genome of your organism?

The total genome size should be 2,644,093,988 bases. The total number of reads obtained for the data I posted previously is 30,223,517.

Quote: Originally Posted by kmcarr
How big is the total target size of your ChIP enrichment? This plot may simply indicate that the target size you were enriching for is not that large and your ChIP enrichment worked very well. If you sequenced very deeply (e.g. 200 million reads) on such a small target you are inevitably going to get a lot of duplicate reads.

If by target size you are referring to the size the chromatin was fragmented to, then the answer is around 150 bp.

Quote: Originally Posted by kmcarr
These plots cannot be properly interpreted without a more thorough understanding of the biology of your system and what steps were carried out to generate your sequence data.

In this ChIP-Seq experiment, I used a Bioruptor to shear the chromatin of mouse neural stem cells to a size of 150 bp after crosslinking. From the following immunoprecipitation I aimed at two biological replicates with a yield of 5 nanograms of double-stranded DNA as determined by PicoGreen assay. This DNA, as well as input and IgG controls, went into an Illumina TruSeq ChIP Sample Prep Kit and was then evenly pooled into a 4-plex library for single-end sequencing on one lane of a HiSeq 2000 flow cell. The yield from all libraries was between 1,500 and 2,100 Mbases, with 30,000,000 to 42,000,000 reads. I am currently unsure how the starting DNA was treated in terms of PCR conditions, as this and the library prep were carried out by a commercial service, but I am about to find out.

Thank you very much again for your help!

Tobias
#4
kmcarr — Senior Member | Location: USA, Midwest | Join Date: May 2008 | Posts: 1,143
Now, your input was a mouse genome, 2.64 Gbp of DNA. You obtained approximately 1.54 Gbp of DNA sequence data, or less than 1X coverage. In an unenriched sample the probability of duplicate reads would be close to 0. Honestly, I am not that familiar with the normal statistics of ChIP enrichment, but it seems to me that your enrichment would have to be off-the-charts fantastic for the level of duplication you are seeing to be explained by enrichment efficiency alone. I would start to worry that at some point during the ChIP process you ended up with an extremely limiting amount of DNA and that subsequent PCR produced a biased, low-diversity sample.

Have you tried mapping the reads to the mouse genome yet to see where they align?
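To make the coverage arithmetic explicit, here is a quick sketch using the numbers quoted above, with the simplifying assumption that read start positions fall uniformly at random across the genome:

Code:
import math

genome_bp = 2_644_093_988   # mouse genome size quoted above
n_reads   = 30_223_517      # reads in the posted library
read_len  = 51

print(f"Coverage: ~{n_reads * read_len / genome_bp:.2f}X")   # ~0.58X, i.e. < 1X

# Expected duplicate fraction if start positions were uniform across the
# whole, unenriched genome: essentially negligible (well under 1%).
expected_unique = genome_bp * (1.0 - math.exp(-n_reads / genome_bp))
print(f"Expected duplicates if unenriched: ~{1 - expected_unique / n_reads:.2%}")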
#5
Member | Location: Japan | Join Date: Mar 2013 | Posts: 17
Dear kmcarr,
thank you very much for your help.

In fact, I am not looking at a general TF but a rather specific one. People have done FLAG-ChIP-Seq on this factor in human cells and identified about 5,500 target genes. So enrichment using this figure would mean roughly 500,000 bp, I guess.

I need to apologize: I probably should have attached the duplication level for my input control as well. This should not be enriched in any way, right? Even though the graph looks different, the duplication level is >80% here as well.

I have mapped the reads using bowtie and tried to look at them in the UCSC browser. In case of low complexity, should I expect to see regions with high numbers of aligned reads versus regions with few or no aligned reads?

Thank you again for your help!

Tobias
#6
kmcarr — Senior Member | Location: USA, Midwest | Join Date: May 2008 | Posts: 1,143
If that is the case, then there is something significantly wrong with your input DNA. If you are sequencing random mouse genomic DNA and only collecting ~1.5 Gbp of sequence data (< 1X coverage of the genome), there is no way you should be observing read duplication like that. Did you start out with an extremely limiting amount of input DNA? That can lead to a low-diversity library. If you started with an adequate amount of genomic DNA, then something went wrong with the library prep which drastically reduced the diversity of your sample.
#7
Member | Location: Japan | Join Date: Mar 2013 | Posts: 17
This is in fact the input control, i.e. fragmented chromatin that was put aside before the immunoprecipitation.
The amount of starting material was indeed limiting in this experiment, as a specific type of neural stem cell was targeted. After discussing with the service facility that provides library construction and sequencing at our institute, it was agreed that 5 nanograms of immunoprecipitated, double-stranded DNA should be sufficient as starting material for the TruSeq ChIP-Seq Kit. I assume a similar amount was used for the input. The yield for the input control was 1.9 Gbp and 38 million reads.

I guess the bottom line is that I am looking at libraries with very poor complexity. How would that affect later peak calling?

In the meantime I have used bowtie to map the reads to the mm9 reference and filtered for duplicates. I got the following numbers:

Input: 30,512,219 mapped reads (80%)
IP: 20,586,367 mapped reads (68%)

Thank you very much for your input!

Tobias
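For what it's worth, a small Python sketch along these lines (using pysam, assuming the bowtie output has been converted to a coordinate-sorted BAM; the file name is a placeholder) reports the mapped fraction and estimates duplication from repeated start positions, which is roughly what single-end duplicate marking does:

Code:
import pysam
from collections import Counter

bam_path = "IP.sorted.bam"   # placeholder for a coordinate-sorted BAM

total = mapped = 0
start_positions = Counter()   # needs a few GB of RAM for tens of millions of reads

with pysam.AlignmentFile(bam_path, "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.is_secondary or read.is_supplementary:
            continue
        total += 1
        if read.is_unmapped:
            continue
        mapped += 1
        # Single-end reads sharing chromosome, start coordinate and strand
        # are counted as duplicates of one another.
        start_positions[(read.reference_name, read.reference_start, read.is_reverse)] += 1

duplicates = mapped - len(start_positions)
print(f"Mapped: {mapped:,} / {total:,} ({mapped / total:.0%})")
print(f"Duplicate rate among mapped reads: {duplicates / mapped:.1%}")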
#8
kmcarr — Senior Member | Location: USA, Midwest | Join Date: May 2008 | Posts: 1,143
Clearly the input control does not represent the true background (the whole mouse genome); furthermore, you cannot know that the bias in amplifying your IP sample was the same as the bias during amplification of the control. Given these results I would be skeptical about the validity of any "peaks" observed in your IP sample.
#9
Member | Location: Japan | Join Date: Mar 2013 | Posts: 17
That certainly does not make things easier for me.
In any case, your help is much appreciated!
#10
Junior Member | Location: Oregon | Join Date: May 2011 | Posts: 3
I once had sequence duplication of about 90% with mouse tissues... to fix it, we now do the library size selection after adapter ligation. Good luck.
#11
Member | Location: Japan | Join Date: Mar 2013 | Posts: 17
Could you please explain that in more detail? What was the size of your libraries before and after the adapters were ligated, and which size did you purify? How much starting material did you use?

Thank you very much!

Tobias
#12
Junior Member | Location: Oregon | Join Date: May 2011 | Posts: 3
So I was doing ChIP-seq with embryonic tissues dissected from mouse. Samples were fixed and sonicated to fragment sizes between 200 and 500 bp. These samples were then IP'ed, and we were able to recover about 15 ng of total DNA from about 500 µg of starting chromatin.

Using the Illumina TruSeq kits as described for the input and ChIP libraries, we got low diversity and over 90% duplicate reads, randomly distributed, i.e. not adapter dimers and not from the IP. This was after Bioanalyzer results had verified that our resulting product was indeed centered around roughly 275 bp.

For the second round, we requested that the gel size selection be done after the adapter ligation and amplification. This gave a similar Bioanalyzer result and, when run on the sequencer, only about 10% non-unique reads.

Does that make more sense? It can be rather confusing.
#13
Member | Location: Japan | Join Date: Mar 2013 | Posts: 17
That makes it very clear!
Thank you very much for your input!
#14
Member | Location: Japan | Join Date: Mar 2013 | Posts: 17
Sorry, I think I still don't get it. I just went back to the Illumina TruSeq DNA protocol and, if I understand correctly, the gel excision step there is after adapter ligation. How does this differ from your protocol?
#15
Junior Member | Location: Oregon | Join Date: May 2011 | Posts: 3
We did the gel extraction as the very last step, i.e. after both ligation and PCR amplification. Our guess is that we had previously lost too much DNA during gel purification, resulting in amplification of only a small subset of our sample.

ETA: We didn't gel extract twice, we just moved it to the very last step.