Hi Everyone,
I am new to the community - learning to analyze whole exome data using bwa and samtools. First I will tell you what I have been doing, then point out the red flags, and then my question is at the bottom.
This is what I have done so far:
1. I analyzed the fastq files using fastQC and they look good. The reads are from Otogenetics.
2. I downloaded the most recent b37 folder from the GATK resource bundle.
3. I indexed the genome using bwa-0.6.2, and aligned 2 illumina fastq files and then merged them into a SAM file.
Here are the commands I used for steps 2 and 3:
./bwa index -a bwtsw human_g1k_v37.fasta
./bwa aln human_g1k_v37.fasta Ot6000_R1_001.fastq > Ot6000_R1.sai
./bwa aln human_g1k_v37.fasta Ot6000_R2_001.fastq > Ot6000_R2.sai
./bwa sampe -r "@RG\tID:6000\tSM:6000\tLB:ga\tPL:Illumina" human_g1k_v37.fasta Ot6000_R1.sai Ot6000_R2.sai Ot6000_R1_001.fastq Ot6000_R2_001.fastq > Ot6000.sam
4. I also made bam files, sorted the bam files and indexed the bam files.
5. I ran flagstat on the sorted bam files. Here is the output:
There are these 3 red flags in the output:
1. output during merging of 2 sai files into sam files:
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] fail to infer insert size: too few good pairs
[bwa_sai2sam_pe_core] time elapses: 1.00 sec
[bwa_sai2sam_pe_core] changing coordinates of 0 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_sai2sam_pe_core] time elapses: 0.39 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.01 sec
[bwa_sai2sam_pe_core] print alignments... 0.18 sec
[bwa_sai2sam_pe_core] 28604456 sequences have been processed.
2. bai files are empty
3. output from flagstat:
57208912 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
0 + 0 mapped (0.00%:nan%)
57208912 + 0 paired in sequencing
28604456 + 0 read1
28604456 + 0 read2
0 + 0 properly paired (0.00%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Something is going on with the pairing of the reads, based on the output while making the sam file and from the flagstat output.
I have re-downloaded the GATK resource bundle, re-indexed the genome, tried other fastq files, etc, and I'm getting the same result. I can't find the problem. Any ideas?
Thanks!
Allison
I am new to the community - learning to analyze whole exome data using bwa and samtools. First I will tell you what I have been doing, then point out the red flags, and then my question is at the bottom.
This is what I have done so far:
1. I analyzed the fastq files using fastQC and they look good. The reads are from Otogenetics.
2. I downloaded the most recent b37 folder from the GATK resource bundle.
3. I indexed the genome using bwa-0.6.2, and aligned 2 illumina fastq files and then merged them into a SAM file.
Here are the commands I used for steps 2 and 3:
./bwa index -a bwtsw human_g1k_v37.fasta
./bwa aln human_g1k_v37.fasta Ot6000_R1_001.fastq > Ot6000_R1.sai
./bwa aln human_g1k_v37.fasta Ot6000_R2_001.fastq > Ot6000_R2.sai
./bwa sampe -r "@RG\tID:6000\tSM:6000\tLB:ga\tPL:Illumina" human_g1k_v37.fasta Ot6000_R1.sai Ot6000_R2.sai Ot6000_R1_001.fastq Ot6000_R2_001.fastq > Ot6000.sam
4. I also made bam files, sorted the bam files and indexed the bam files.
5. I ran flagstat on the sorted bam files. Here is the output:
There are these 3 red flags in the output:
1. output during merging of 2 sai files into sam files:
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] fail to infer insert size: too few good pairs
[bwa_sai2sam_pe_core] time elapses: 1.00 sec
[bwa_sai2sam_pe_core] changing coordinates of 0 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_sai2sam_pe_core] time elapses: 0.39 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.01 sec
[bwa_sai2sam_pe_core] print alignments... 0.18 sec
[bwa_sai2sam_pe_core] 28604456 sequences have been processed.
2. bai files are empty
3. output from flagstat:
57208912 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
0 + 0 mapped (0.00%:nan%)
57208912 + 0 paired in sequencing
28604456 + 0 read1
28604456 + 0 read2
0 + 0 properly paired (0.00%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Something is going on with the pairing of the reads, based on the output while making the sam file and from the flagstat output.
I have re-downloaded the GATK resource bundle, re-indexed the genome, tried other fastq files, etc, and I'm getting the same result. I can't find the problem. Any ideas?
Thanks!
Allison
Comment