Hi All
Being a newbie to bioinformatics, once again I am here with what some would find a really silly question but if I don't ask how am I going to learn. So here goes:
We have done some ChIP-Seq experiments. Samples were run on HiSeqV4. We got PE reads. the sequencing facility aligns, process and give us data as cram files. I go from cram to bam using cramtools-3.0. And if I check stats of my bam using bamtools, all samples except input have very high level of duplication (over 80%). If I remove duplicates, I am left with very few reads and hence peak-callers (macs2 & homer) don't return enough peaks.
If I decompress cram to fastq (using cramtools) and then align it myself using bwa mem, I get 0% duplicates. Note that I am aligning to the same reference genome (obtained via cramtools getref ) as the sequencing facility. Using samtools view -H path/to/my/cram/file.cram | grep PG, I figured that the pipeline used to align and process the reads at the sequencing facility uses the same aligner (bwa) but processes the data further with other tools e.g. bamcollate2, bam12auxmerge, bamsormadup, AlignmentFilter, bamstreamingmarkduplicates, etc.
From different ChIP-Seq papers, I have never found the data being processed and filtered so much. My feeling is that these filters and processing might not be required for ChIP-Seq data.
My question is: am I alright to decompress the crams and align myself or is my data not that great after all?
Any tips, tricks, comments, remarks will be highly appreciated!!!
Thanks very much
fh
Being a newbie to bioinformatics, once again I am here with what some would find a really silly question but if I don't ask how am I going to learn. So here goes:
We have done some ChIP-Seq experiments. Samples were run on HiSeqV4. We got PE reads. the sequencing facility aligns, process and give us data as cram files. I go from cram to bam using cramtools-3.0. And if I check stats of my bam using bamtools, all samples except input have very high level of duplication (over 80%). If I remove duplicates, I am left with very few reads and hence peak-callers (macs2 & homer) don't return enough peaks.
If I decompress cram to fastq (using cramtools) and then align it myself using bwa mem, I get 0% duplicates. Note that I am aligning to the same reference genome (obtained via cramtools getref ) as the sequencing facility. Using samtools view -H path/to/my/cram/file.cram | grep PG, I figured that the pipeline used to align and process the reads at the sequencing facility uses the same aligner (bwa) but processes the data further with other tools e.g. bamcollate2, bam12auxmerge, bamsormadup, AlignmentFilter, bamstreamingmarkduplicates, etc.
From different ChIP-Seq papers, I have never found the data being processed and filtered so much. My feeling is that these filters and processing might not be required for ChIP-Seq data.
My question is: am I alright to decompress the crams and align myself or is my data not that great after all?
Any tips, tricks, comments, remarks will be highly appreciated!!!
Thanks very much
fh
Comment