SEQanswers


Old 01-24-2013, 11:23 AM   #1
DunderChief
Junior Member
 
Location: Baltimore, MD

Join Date: Aug 2012
Posts: 6
Is my ChIP-seq data garbage?

I received some ChIP-seq data (H3K4me3) with a very high level of sequence duplication (over 90% of the reads). I aligned with bowtie2, removed duplicates with rmdup, and ended up with only about 1 million uniquely mapped reads. Most of the peaks that MACS calls contain only about 5 reads. Is this data complete garbage, or can I still get something legitimate out of these peaks?
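
Roughly what I ran, in sketch form (file and index names here are made up, and exact samtools syntax varies a bit between versions):

Code:
# Align single-end reads with bowtie2 (hypothetical index/file names)
bowtie2 -x genome_index -U chip_H3K4me3.fastq.gz -S chip.sam

# Convert, sort, then remove duplicates (samtools rmdup, -s = single-end)
samtools view -bS chip.sam > chip.bam
samtools sort -o chip.sorted.bam chip.bam
samtools rmdup -s chip.sorted.bam chip.dedup.bam

# Compare read counts before and after deduplication
samtools flagstat chip.sorted.bam
samtools flagstat chip.dedup.bam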
Old 01-24-2013, 07:12 PM   #2
xubeisi
Junior Member
 
Location: Memphis, TN, USA

Join Date: Dec 2010
Posts: 2

It seems so. Check the MACS model file: if the Watson/Crick peak distance is small, the data is probably useless. You may also want to check the reads with FastQC; duplication this high could well be due to adapters.
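
Something like this would check both (file names are placeholders; MACS writes the model script as NAME_model.r, where NAME is whatever you passed with -n):

Code:
# Look for adapter contamination and duplication in the raw reads
fastqc chip_reads.fastq.gz

# MACS's model script plots the Watson/Crick tag distributions and the
# estimated fragment size d; running it produces NAME_model.pdf
Rscript NAME_model.r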
Old 03-28-2013, 09:55 PM   #3
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

Quote:
Originally Posted by xubeisi
... check the MACS model file: if the Watson/Crick peak distance is small, the data is probably useless ...
How small are we talking about?
Old 03-29-2013, 01:30 AM   #4
xubeisi
Junior Member
 
Location: Memphis, TN, USA

Join Date: Dec 2010
Posts: 2

Quote:
Originally Posted by Tobikenobi
How small are we talking about?
~100 should be fine; to me, samples with d less than 50 are trash.
Old 03-29-2013, 03:34 AM   #5
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Have you actually looked at your data (both before and after deduplication)?

Simply looking at the pattern of mapped reads will very quickly tell you whether you're wasting your time spending more effort on the analysis.
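
If it helps, one quick way to do that (file names are placeholders) is to index the BAMs and load them into a genome browser such as IGV, or to build a rough coverage track:

Code:
# Index the pre- and post-deduplication BAMs so a browser can display them
samtools index chip.sorted.bam
samtools index chip.dedup.bam

# Optional: a quick bedGraph coverage track to load alongside the reads
bedtools genomecov -ibam chip.sorted.bam -bg > chip.sorted.bedgraph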
Old 03-31-2013, 05:27 AM   #6
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17
Sorry to hijack this thread...

Quote:
Originally Posted by xubeisi
~100 should be fine; to me, samples with d less than 50 are trash.
Depending on what number I enter as mfold in MACS (>10), I can get anything from d=51 to d=118. Does that tell me anything, and is it desirable to go for the highest d possible?
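
For reference, this is the kind of sweep I mean (placeholder file names; I've written it with MACS2-style options, which may differ from the MACS version I'm running on Galaxy):

Code:
# Re-run the model with different mfold settings and compare the
# estimated fragment size d reported in each peaks.xls header
for m in 10 15 20 30; do
    macs2 callpeak -t chip.bam -c input.bam -f BAM -g 1.87e9 \
        -n "mfold_${m}" --mfold "$m" 50 -p 1e-5
    grep "^# d = " "mfold_${m}_peaks.xls"
done
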
Thank you very much!
Old 03-31-2013, 04:10 PM   #7
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

Quote:
Originally Posted by simonandrews
Have you actually looked at your data (both before and after deduplication)?

Simply looking at the pattern of mapped reads will very quickly tell you whether you're wasting your time spending more effort on the analysis.
Could you please specify what you mean by `before and after deduplication`?

Also, what would I expect to see in the case of high duplication levels (I am looking at ~75% duplication according to FastQC myself)?
Old 04-02-2013, 12:23 AM   #8
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Quote:
Originally Posted by Tobikenobi
Could you please specify what you mean by `before and after deduplication`?

Also, what would I expect to see in the case of high duplication levels (I am looking at ~75% duplication according to FastQC myself)?
High duplication can come from a few different sources. It could be that you've got very well enriched data and have saturated the coverage of the enriched regions, in which case it would be bad to deduplicate. It could be that you have a very small number of regions with enormous coverage, or you could have more general low-level duplication across your whole library. A 75% duplication level could come from most sequences appearing exactly twice in your data, or from a small number of sites with huge coverage.
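
If it's useful, a rough way to tell these cases apart (placeholder file name, strand ignored) is to tabulate how many reads start at each position:

Code:
# Histogram of per-position read counts: many positions seen exactly twice
# suggests general library-wide duplication, while a few positions with very
# large counts suggests isolated towers of duplicates
samtools view chip.sorted.bam \
  | awk '{print $3, $4}' \
  | sort | uniq -c \
  | awk '{print $1}' | sort -n | uniq -c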

If you look at the mapped data before you've done any deduplication, you will be able to see whether there is a problem. If you see nicely enriched peaks with even coverage over them, then the data might well be OK as it is. If you see obviously biased coverage, with isolated towers of reads where you have duplication, then you will need to deduplicate to stand any chance of getting sensible results out of your data.

Don't assume that you should always deduplicate your data. There are definite downsides to doing so: for high-coverage regions you can end up compressing the dynamic range of your data and reducing the amount of information you have to work with. It can help in some cases, but when we're analysing data we only deduplicate if we can see a problem with the data which deduplication would help to fix.
Old 04-03-2013, 09:27 PM   #9
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

Quote:
Originally Posted by simonandrews
High duplication can come from a few different sources... If you see obviously biased coverage, with isolated towers of reads where you have duplication, then you will need to deduplicate to stand any chance of getting sensible results out of your data...
Thank you very much for your help!
I actually looked at the data before and after filtering for duplicates, and have attached a picture of my four samples before (top four tracks) and after deduplication (lower four tracks). Your second suggestion of isolated towers seems to be the case, as I saw similar patterns across all chromosomes.
I then went on to try peak calling on my original files (only adapter-clipped and slightly trimmed at the 3' end), for which I randomly selected and omitted lines in the input to get equal numbers of tags. MACS then gives me the following output in the peaks.xls file:

# This file is generated by MACS
# ARGUMENTS LIST:
# name = E_2_mfold_20
# format = SAM
# ChIP-seq file = /galaxy/main_pool/pool7/files/005/979/dataset_5979847.dat
# control file = /galaxy/main_pool/pool7/files/005/965/dataset_5965128.dat
# effective genome size = 1.87e+09
# tag size = 50
# band width = 300
# model fold = 20
# pvalue cutoff = 1.00e-05
# Ranges for calculating regional lambda are : peak_region,1000,5000,10000
# unique tags in treatment: 2868667
# total tags in treatment: 22927127
# unique tags in control: 8014554
# total tags in control: 22927127

# d = 51

The number of unique tags, especially in the treatment, is very low compared to the control, which makes the FDR unreliable.

Is it advisable to deduplicate the data and then try peak calling?
Also, as I have two replicates, would it be reasonable to combine them to obtain more unique reads, and then try the peak calling again?
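
Something like this is what I have in mind (only a sketch; file names are placeholders and the MACS2-style command may not match the MACS version available on Galaxy):

Code:
# Deduplicate each ChIP replicate and the input (-s = single-end reads)
samtools rmdup -s chip_rep1.sorted.bam chip_rep1.dedup.bam
samtools rmdup -s chip_rep2.sorted.bam chip_rep2.dedup.bam
samtools rmdup -s input.sorted.bam input.dedup.bam

# Pool the two ChIP replicates to gain unique reads
samtools merge chip_pooled.dedup.bam chip_rep1.dedup.bam chip_rep2.dedup.bam

# Optionally downsample the input so treatment and control have similar
# tag numbers (-s keeps roughly the given fraction of reads; adjust to taste)
samtools view -b -s 0.35 input.dedup.bam > input.down.bam

# Call peaks again on the deduplicated, pooled data
macs2 callpeak -t chip_pooled.dedup.bam -c input.down.bam \
    -f BAM -g 1.87e9 -n pooled_dedup -p 1e-5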

Again, thank you very much for your input!
Attached Images
File Type: jpg combined_tracks copy.jpg (76.0 KB)
Old 04-03-2013, 11:30 PM   #10
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

It might be worth noting that MACS does an internal deduplication of your data whilst peak calling. It works out the likely duplication level in your data and then removes any tags which are duplicated above that level when calling peaks. It may not remove as much data as a complete strict deduplication, but it does take duplication into account.
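
For reference, in MACS2 this behaviour is controlled by the --keep-dup option (placeholder file names):

Code:
# 'auto' lets MACS estimate the maximum number of duplicates to keep at each
# position from a binomial model; 'all' disables the internal deduplication;
# an integer keeps at most that many tags per position
macs2 callpeak -t chip.bam -c input.bam -f BAM -g 1.87e9 \
    -n with_auto_dedup --keep-dup auto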

I had a look at the image you posted, but at that resolution it's hard to see what's going on. It's not unusual to see a few huge outliers in the data (which can skew the scale on the y-axis); it's what happens at a more local level that is important, especially the actual pattern of mapped reads rather than quantitated values.
Old 04-03-2013, 11:54 PM   #11
Tobikenobi
Member
 
Location: Japan

Join Date: Mar 2013
Posts: 17

So if I understand correctly, it may not be necessary to deduplicate the data at all before using MACS, as it will attempt this on its own.
Moreover, if I deduplicated myself, I would also remove true duplicates that simply arise from sequencing depth. So deduplicating would really only make sense if I wanted an accurate FDR from MACS, which I can only get if I adjust the unique tag numbers beforehand?

Tags
chip-seq, quality assessment
