Dear seqanswers community,
unfortunately I can align only half of the reads (Illumina paired end sequencing od human exomes, 100 bp reads and the same problem with haloplex amplicon sequencing) using bwa. Test of one commercial software showed, that it is a pipeline problem and twice as much reads can be aligned. Commercial software is also based on bwa. Tuning of bwa parameter did not bring much improvement (brought an inscrease of 5-8% more reads to the existing alignment).
The following bwa parameter were tested:
bwa aln –e1…[-e10..–e100];
bwa aln –E1[ –E10…-E100];
bwa aln -k1..[ –k10…-k100],
bwa aln –R1…[-R10..–R100],
bwa aln –o1…[-o10..–o100],
bwa aln –O1…[-O10..–O100]
The generated statistics (samtools idxstat) of commercial (in green) and of our alignment (in red) is in the attached statistics.ppt file. I also attached a screen of two .bam files of amplicon sequencing (upper sample is a result of commercial software and lower is our pipeline, screenshot.ppt). My next idea is, that may be before performing bwa it is necessary to sort the sequences (but I do not know how?).
Do you have any ideas, what can be the problem? Here is the script of the pipeline:
I have been troubleshooting this issue for some time now and any help would be greatly appreciated.
Thank you very much!
Best Regards,
Anna
unfortunately I can align only half of the reads (Illumina paired end sequencing od human exomes, 100 bp reads and the same problem with haloplex amplicon sequencing) using bwa. Test of one commercial software showed, that it is a pipeline problem and twice as much reads can be aligned. Commercial software is also based on bwa. Tuning of bwa parameter did not bring much improvement (brought an inscrease of 5-8% more reads to the existing alignment).
The following bwa parameter were tested:
bwa aln –e1…[-e10..–e100];
bwa aln –E1[ –E10…-E100];
bwa aln -k1..[ –k10…-k100],
bwa aln –R1…[-R10..–R100],
bwa aln –o1…[-o10..–o100],
bwa aln –O1…[-O10..–O100]
The generated statistics (samtools idxstat) of commercial (in green) and of our alignment (in red) is in the attached statistics.ppt file. I also attached a screen of two .bam files of amplicon sequencing (upper sample is a result of commercial software and lower is our pipeline, screenshot.ppt). My next idea is, that may be before performing bwa it is necessary to sort the sequences (but I do not know how?).
Do you have any ideas, what can be the problem? Here is the script of the pipeline:
Code:
#!/bin/sh # input 1: fasta 1 # input 2: fasta 2 # input 3: sample prefix # input 4: number of cores # program directories NGS=/home/user/NGS java=/usr/local/jre1.7.0_21/bin picard=${NGS}/picard-tools-1.64/ samtools=${NGS}/samtools-0.1.18 BEDTools=${NGS}/bedtools-2.17.0 annovar=${NGS}/annovar/annovar GATK=${NGS}/GATK/GenomeAnalysisTK-2.3-9-ge5ebf34 # number of cores available for bwa nkern=$4 # reference genome ref=${NGS}/refgenome/GATK/ucsc.hg19.fasta # exon cover known gene exon_cover=${NGS}/refgenome/hg19/TruSeq-Exome-Targeted-Regions-BED-file # dbsnp dbsnp=${NGS}/refgenome/GATK/dbsnp_135.hg19.vcf mkdir -p LOG rm -f LOG/*.log seq1=$1 seq2=$2 sample=$3 ${NGS}/bwa-0.7.5a/bwa aln -t $nkern $ref $seq1 > ${sample}_R1.sai ${NGS}/bwa-0.7.5a/bwa aln -t $nkern $ref $seq2 > ${sample}_R2.sai ${NGS}/bwa-0.7.5a/bwa sampe -A -P \ -r "@RG\tID:Sample\tSM:Sample\tPL:illumina\tCN:exome" \ $ref \ ${sample}_R1.sai \ ${sample}_R2.sai \ $seq1 \ $seq2 > \ ${sample}.sam ${NGS}/samtools-0.1.18/samtools view -bS ${sample}.sam > ${sample}.bam ${NGS}/samtools-0.1.18/samtools sort ${sample}.bam ${sample}_sorted_w_dup ${NGS}/samtools-0.1.18/samtools index ${sample}_sorted_w_dup.bam ${NGS}/samtools-0.1.18/samtools idxstats ${sample}_sorted_w_dup.bam > idxstat.txt ${NGS}/samtools-0.1.18/samtools rmdup ${sample}_sorted_w_dup.bam ${sample}_preGATK_sorted.bam ${NGS}/samtools-0.1.18/samtools index ${sample}_preGATK_sorted.bam ${NGS}/samtools-0.1.18/samtools idxstats ${sample}_preGATK_sorted.bam > idxstat.txt
Thank you very much!
Best Regards,
Anna
Comment