SEQanswers

Old 10-24-2016, 10:25 AM   #1
fh331
Member
 
Location: UK

Join Date: Apr 2016
Posts: 19
Default Advice on PE ChIP Alignment & Pre-processing Steps??

Hi All

Being a newbie to bioinformatics, once again I am here with what some would find a really silly question, but if I don't ask, how am I going to learn? So here goes:

We have done some ChIP-Seq experiments. Samples were run on HiSeqV4 and we got PE reads. The sequencing facility aligns and processes the reads and gives us the data as CRAM files. I go from CRAM to BAM using cramtools-3.0. If I check the stats of my BAM using bamtools, all samples except input have a very high level of duplication (over 80%). If I remove duplicates, I am left with very few reads, and hence the peak callers (macs2 & homer) don't return enough peaks.

If I decompress the CRAM to FASTQ (using cramtools) and then align it myself using bwa mem, I get 0% duplicates. Note that I am aligning to the same reference genome (obtained via cramtools getref) as the sequencing facility. Using samtools view -H path/to/my/cram/file.cram | grep PG, I figured out that the pipeline used to align and process the reads at the sequencing facility uses the same aligner (bwa) but processes the data further with other tools, e.g. bamcollate2, bam12auxmerge, bamsormadup, AlignmentFilter, bamstreamingmarkduplicates, etc.
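For anyone who would rather script that header check, here is a minimal Python sketch of what the `grep PG` step reports. It assumes the header has already been dumped to plain text; the function name `list_pg_programs` and the example header are made up for illustration and are not the facility's real header.

```python
def list_pg_programs(sam_header_text):
    """Return (ID, PN) pairs for each @PG line in a SAM header, in file order."""
    programs = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        # After the record type, header fields are tab-separated TAG:VALUE pairs.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        programs.append((fields.get("ID"), fields.get("PN")))
    return programs


# Hypothetical header, loosely modelled on the tool names mentioned in this post.
example_header = "\n".join([
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@PG\tID:bwa\tPN:bwa\tCL:bwa mem ref.fa r1.fq r2.fq",
    "@PG\tID:bamsormadup\tPN:bamsormadup\tPP:bwa",
])

for program_id, program_name in list_pg_programs(example_header):
    print(program_id, program_name)
```

Each @PG line records one program that touched the file, so a duplicate-marking tool in that list is a hint that the duplicate flag was set by the pipeline rather than by the aligner.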

In the ChIP-Seq papers I have read, I have never seen the data processed and filtered this heavily. My feeling is that these filters and processing steps might not be required for ChIP-Seq data.

My question is: am I alright to decompress the crams and align myself or is my data not that great after all?

Any tips, tricks, comments, remarks will be highly appreciated!!!

Thanks very much

fh
Old 10-24-2016, 11:21 AM   #2
HESmith
Senior Member
 
Location: Washington DC

Join Date: Oct 2009
Posts: 486
Default

For starters, why don't you identify a few reads that are duplicated in the facility-generated BAM, then compare to the same reads in your self-generated BAM? If you have trouble interpreting the results, post the reads here so we can help.
Old 10-24-2016, 01:17 PM   #3
fh331
Member
 
Location: UK

Join Date: Apr 2016
Posts: 19
Default

Quote:
Originally Posted by HESmith View Post
For starters, why don't you identify a few reads that are duplicated in the facility-generated BAM, then compare to the same reads in your self-generated BAM? If you have trouble interpreting the results, post the reads here so we can help.
Hi HESmith,
Thanks for the reply. Do you mean I should just extract some duplicated reads from the facility-generated BAM and see if they're present in my self-generated BAM? The number of reads is very similar between the two BAMs.

My impression is that the extra tools used at the facility somehow flag reads as duplicates, but when I decompress and realign the data, I get rid of the flags somehow and hence there are no duplicates.
Old 10-24-2016, 01:35 PM   #4
HESmith
Senior Member
 
Location: Washington DC

Join Date: Oct 2009
Posts: 486
Default

Question: how did you determine that your self-aligned reads did not contain any duplicates?
Old 10-24-2016, 01:35 PM   #5
fanli
Senior Member
 
Location: California

Join Date: Jul 2014
Posts: 196
Default

I'd find it unlikely that a ChIP library had 0% duplication. They are in general highly duplicated as you are sequencing a very limited set of input template.
Old 10-24-2016, 01:45 PM   #6
fh331
Member
 
Location: UK

Join Date: Apr 2016
Posts: 19
Default

@HESmith

bamtools stats -in /path/to/my/bam/self-aligned.bam

Last edited by fh331; 10-24-2016 at 01:49 PM.
Old 10-24-2016, 01:54 PM   #7
fh331
Member
 
Location: UK

Join Date: Apr 2016
Posts: 19
Default

Quote:
Originally Posted by fanli View Post
I'd find it unlikely that a ChIP library had 0% duplication. They are in general highly duplicated as you are sequencing a very limited set of input template.
Hi fanli

I agree with you. I only used the 'bamtools stats' function to get a quick idea, so I am assuming bamtools isn't very stringent in marking duplicates. If I use Picard MarkDuplicates, I think there will be some level of duplication. I can post updated info about that tomorrow.

On the other hand, what level of duplication is normal?
Old 10-24-2016, 04:07 PM   #8
HESmith
Senior Member
 
Location: Washington DC

Join Date: Oct 2009
Posts: 486
Default

Quote:
Originally Posted by fh331 View Post
Do you mean I should just extract some duplicated reads from the facility-generated bam and see if they're present in my self-generated bam?
Duplicate reads are identified by alignment (chromosome/position) information. You want to determine if that information is the same for facility vs. self alignments. Find a few duplicates in the former, then examine the same reads in the latter. Either the alignment information will match (which means that bamtools is not counting the duplicates), or it will not match (which indicates a discrepancy between the aligners), or the duplicates will be missing from the latter (which indicates removal of duplicates).
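The three outcomes above can be sketched in Python, assuming both files have been converted to SAM text and a handful of duplicate read names picked by hand. Function names and the example records are hypothetical, and the comparison uses only chromosome, leftmost position, and strand; it ignores soft clipping and other details a real duplicate marker would consider.

```python
def parse_primary(sam_text):
    """Map (read name, mate number) -> (chrom, pos, strand) for primary alignments."""
    records = {}
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        strand = "-" if flag & 0x10 else "+"
        records[(fields[0], mate)] = (fields[2], int(fields[3]), strand)
    return records


def compare_reads(facility_sam, self_sam, read_names):
    """Classify each read name by how its alignment compares between the two files."""
    fac, own = parse_primary(facility_sam), parse_primary(self_sam)
    result = {}
    for name in read_names:
        fac_hits = {k: v for k, v in fac.items() if k[0] == name}
        own_hits = {k: v for k, v in own.items() if k[0] == name}
        if not own_hits:
            result[name] = "missing from self-aligned file"
        elif fac_hits == own_hits:
            result[name] = "same alignment"
        else:
            result[name] = "different alignment"
    return result


def rec(name, flag, chrom, pos):
    """Build a minimal 11-column SAM record (made-up sequence and qualities)."""
    return "\t".join([name, str(flag), chrom, str(pos), "60", "4M",
                      "=", str(pos), "0", "ACGT", "IIII"])


# Hypothetical records: flag 1123 is flag 99 plus the duplicate bit (0x400).
facility_sam = "\n".join([rec("readA", 1123, "chr1", 100),
                          rec("readB", 99, "chr1", 100),
                          rec("readC", 99, "chr2", 500)])
self_sam = "\n".join([rec("readA", 99, "chr1", 100),
                      rec("readB", 99, "chr1", 250)])

print(compare_reads(facility_sam, self_sam, ["readA", "readB", "readC"]))
```

A "same alignment" result for a read that is flagged as a duplicate only in the facility file points at the flag, not the aligner, as the source of the difference.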
Old 10-25-2016, 02:37 AM   #9
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Samtools tview

Always look at the reads, not just the stats. The number of unique fragments is what matters, not the duplication rate. 80% duplicates would be useless if you sequenced 2 M reads, but may be ok if you sequenced 200 M.
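To put rough numbers on that, here is a small Python sketch of the arithmetic. It makes the simplifying assumption that every non-duplicate read counts as one unique fragment; the function name is made up for illustration.

```python
def unique_fragments(total_reads, duplication_rate):
    """Estimate unique fragments as the reads left after removing duplicates."""
    return round(total_reads * (1.0 - duplication_rate))


# The two sequencing depths mentioned above, both at 80% duplication.
for total in (2_000_000, 200_000_000):
    print(f"{total:>11,} reads -> {unique_fragments(total, 0.80):>10,} unique")
```

At 80% duplication, 2 M reads leave about 0.4 M unique fragments, while 200 M reads leave about 40 M, which is why the same rate can be useless in one run and acceptable in another.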
Old 10-25-2016, 05:53 AM   #10
fh331
Member
 
Location: UK

Join Date: Apr 2016
Posts: 19
Default

Quote:
Originally Posted by HESmith View Post
Duplicate reads are identified by alignment (chromosome/position) information. You want to determine if that information is the same for facility vs. self alignments. Find a few duplicates in the former, then examine the same reads in the latter. Either the alignment information will match (which means that bamtools is not counting the duplicates) or not (indicates a discrepancy b/t the aligners) or the duplicates are missing from the latter (indicates removal of duplicates).
I ran Picard MarkDuplicates on my self-aligned BAMs, and if I check the stats with bamtools after marking duplicates, it returns the same level of duplication. So I think I was just not doing it the right way. After running bwa, I guess I need to mark duplicates before checking the stats. Something learnt by a newbie!
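A toy Python sketch of why the order matters, assuming the stats step simply counts records whose duplicate bit (0x400) is set in the SAM FLAG. The marking function here is a deliberately simplified stand-in, not Picard's algorithm: it treats records sharing chromosome, position, strand, and mate position as duplicates and leaves the first one seen unmarked. All names and records are hypothetical.

```python
DUP_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def duplicate_fraction(records):
    """Fraction of records with the duplicate bit set (what a flag-based count sees)."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["flag"] & DUP_FLAG) / len(records)


def mark_duplicates(records):
    """Simplified marking: same chrom, pos, strand and mate pos means duplicate."""
    seen = set()
    marked = []
    for r in records:
        key = (r["chrom"], r["pos"], bool(r["flag"] & 0x10), r["mate_pos"])
        new = dict(r)  # copy so the input list is left untouched
        if key in seen:
            new["flag"] |= DUP_FLAG
        else:
            seen.add(key)
        marked.append(new)
    return marked


# Freshly aligned records: three share coordinates, but no duplicate bit is set yet.
fresh = [
    {"name": "r1", "flag": 99, "chrom": "chr1", "pos": 100, "mate_pos": 300},
    {"name": "r2", "flag": 99, "chrom": "chr1", "pos": 100, "mate_pos": 300},
    {"name": "r3", "flag": 99, "chrom": "chr1", "pos": 100, "mate_pos": 300},
    {"name": "r4", "flag": 99, "chrom": "chr1", "pos": 700, "mate_pos": 900},
]

print("before marking:", duplicate_fraction(fresh))
print("after marking: ", duplicate_fraction(mark_duplicates(fresh)))
```

Converting to FASTQ and realigning discards the old flags, so a flag-based count reads zero until a marking step is run again.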
Old 10-25-2016, 05:58 AM   #11
HESmith
Senior Member
 
Location: Washington DC

Join Date: Oct 2009
Posts: 486
Default

Glad that you were able to sort out the problem.
Old 10-25-2016, 05:59 AM   #12
fh331
Member
 
Location: UK

Join Date: Apr 2016
Posts: 19
Default

Quote:
Originally Posted by Chipper View Post
Samtools tview

Always look at the reads, not just the stats. The number of unique fragments is what matters, not the duplication rate. 80% duplicates would be useless if you sequenced 2 M reads, but may be ok if you sequenced 200 M.
Hi Chipper,

Thanks for the reply. How does this tview work? I can't seem to find anything about it besides this: http://samtools.sourceforge.net/tview.shtml

which isn't very informative
Old 10-25-2016, 06:03 AM   #13
HESmith
Senior Member
 
Location: Washington DC

Join Date: Oct 2009
Posts: 486
Default

'tview' is a terminal-based genome viewer. It would allow a quick spot-check of duplication (by visualizing the endpoints of the aligned reads), but it wouldn't calculate the fraction of your reads that are unique.
Old 10-25-2016, 06:08 AM   #14
fh331
Member
 
Location: UK

Join Date: Apr 2016
Posts: 19
Default

Quote:
Originally Posted by fh331 View Post
Hi Chipper,

Thanks for the reply. How does this tview work? I can't seem to find anything about it besides this: http://samtools.sourceforge.net/tview.shtml

which isn't very informative
found it in samtools manual!!! thanks
Old 10-25-2016, 06:21 AM   #15
fh331
Member
 
Location: UK

Join Date: Apr 2016
Posts: 19
Default

@HESmith

Thanks very much. I highly appreciate help from all the experienced users.

For future reference, what can I do to avoid such high duplication levels in ChIP-seq samples? Is it better to start with more DNA and use fewer PCR cycles during library preparation? Any tips would make my life way easier!
Old 10-25-2016, 09:06 AM   #16
HESmith
Senior Member
 
Location: Washington DC

Join Date: Oct 2009
Posts: 486
Default

1) Optimize your chromatin immunoprecipitation (see this reference for guidance).

2) Optimize your library prep, using an amount of input chromatin equal to the amount recovered by ChIP.

Good luck!
Old 05-19-2017, 02:24 PM   #17
yjz0916@163.com
Junior Member
 
Location: china

Join Date: May 2017
Posts: 3
Default Problem about samtools sort and tview, Who can help me?

Hello everybody, I am a newbie. My SAM file is about 45,786 kB; however, after I use the samtools sort command, the .bai file produced is only 1 kB, and I wonder whether that is right. In addition, when I use the samtools tview command, the result shows a line of N and nothing else. I don't know where the problem is. Who can help me?
All times are GMT -8.