Hi everyone
I've run into a predicament lately and I'm hoping to gather some advice. We've done paired-end Illumina whole-genome sequencing on a human sample.
I have 4 lanes of data for the sample, with 2 FASTQ files per lane (reads1.fastq and reads2.fastq), which I split into 8-9 files of 10 million reads each. When splitting, I made sure to split by 10 million × 4 lines (4 lines per read), and the script I wrote compares the first line of each split reads1 FASTQ file to its corresponding split reads2 FASTQ file. I then align each FASTQ file with BWA (trimming reads with the quality parameter q=30) to create a .sai file, generate alignments/BAM files for each pair of split reads1/reads2 files, sort and index the small BAM files, and finally merge them into one large BAM, which I'm trying to run MarkDuplicates on. Roughly, the pipeline looks like the sketch below.
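For concreteness, here's a minimal sketch of what I'm doing (file names and the reference ref.fa are placeholders, and the samtools sort syntax assumes a recent samtools):

```bash
# 1. Split each FASTQ into chunks of 10 million reads (4 lines per read).
split -l 40000000 -d reads1.fastq reads1_part_
split -l 40000000 -d reads2.fastq reads2_part_

# 2. Sanity-check that each reads1 chunk starts with the same read ID
#    as its reads2 counterpart (stripping the /1 and /2 mate suffixes).
for f in reads1_part_*; do
    g=${f/reads1/reads2}
    id1=$(head -n 1 "$f" | cut -d' ' -f1 | sed 's#/[12]$##')
    id2=$(head -n 1 "$g" | cut -d' ' -f1 | sed 's#/[12]$##')
    [ "$id1" = "$id2" ] || echo "MISMATCH: $f vs $g"
done

# 3. Align each chunk pair with BWA (quality-trimming at q=30),
#    then sort and index the per-chunk BAMs.
for f in reads1_part_*; do
    g=${f/reads1/reads2}
    bwa aln -q 30 ref.fa "$f" > "$f.sai"
    bwa aln -q 30 ref.fa "$g" > "$g.sai"
    bwa sampe ref.fa "$f.sai" "$g.sai" "$f" "$g" \
        | samtools view -bS - > "aln_${f#reads1_part_}.bam"
    samtools sort -o "aln_${f#reads1_part_}.sorted.bam" "aln_${f#reads1_part_}.bam"
    samtools index "aln_${f#reads1_part_}.sorted.bam"
done

# 4. Merge the sorted chunk BAMs into one BAM for MarkDuplicates.
samtools merge merged.bam aln_*.sorted.bam
```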
One thing I've noticed is that MarkDuplicates is reporting a ridiculously high number of unmatched pairs. I ran MarkDuplicates on the smaller BAM files too, and the same holds. For example, for one of the smaller BAM files:
INFO 2012-07-19 13:20:48 MarkDuplicates Read 37393628 records. 28252673 pairs never matched.
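For context, I'm invoking MarkDuplicates roughly like this (jar path and file names are placeholders):

```bash
# Roughly my Picard invocation; paths and memory setting are placeholders.
java -Xmx4g -jar MarkDuplicates.jar \
    INPUT=merged.bam \
    OUTPUT=merged.dedup.bam \
    METRICS_FILE=dup_metrics.txt
```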
Now, I'm relatively new to the whole world of NGS data analysis, but I can't imagine such a high number of unmatched pairs is a good thing.
Does anyone have advice, or has anyone encountered a similar problem? I'm wondering if I did something wrong when splitting and trimming/aligning the split FASTQ files.
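One sanity check I can think of (sketch below) is running samtools flagstat on one of the per-chunk BAMs to confirm the reads are actually flagged as paired and that the read1/read2 counts line up:

```bash
# Quick pairing check on one per-chunk BAM: the "read1" and "read2"
# counts should be equal, and "paired in sequencing" should cover all
# records if pairing survived the split.
samtools flagstat aln_00.sorted.bam
```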
I should note that this DNA was extracted from FFPE tissue so it will be of lower quality than the DNA you guys are used to working with. But I want to make sure this is not a technical error on my part before blaming DNA quality.
Thanks!