I recently started working bwa-mem on a 2X100 PE RNAseq dataset (ILMN cassava 1.8+) and ran into some problems with the sam/bam headers, as well as the -C option to include the barcode information in the BC:Z:<barcode> field. Spent the better part of a day searching SEQanswers, sourcefourge bwa page, etc. looking for a solution, and finally came up with what I believe is a simple workaround. I'm interested to see if others are having these problems, and if my workaround is sufficient.
I wanted to include the barcode "fastq comment field" in the sam, in case I ever wanted to go back to this data to do something like bam2fastq for a subset of reads in a bed region yada yada yada, so I added the -C option thinking that the entire field (e.g. 1:N:0:GTGGCC) would be included:
and after reading MacManes blog post on samtools flagstats for bwa output (here) I tried to get flagstats:
I should note that this was the sourceforge release (Version: 0.7.5a-r405), not the most recent git commit. I ignored the EOF problem and looked at line 109:
and figured that either 1) ILMN casava 1.8+ barcode comment does not conform to the SAM spec, per bwa.1, or 2) bwa had a bug. Either way, I just want the flagstats to work.
I used sed to fix the sam, and that seems to work:
(This is a subset of reads for testing purposes). My sam barcode field is now BC:Z:GTGGCC, but I decided I want the entire comment field so this would have the strand information as well, for orienting the reads properly in some future bed/bam intersect bam2fastq (instead of retrieving them from the original fastq, which I could also do).
That works for flagstat, which works for me. Does anyone know if having this entire field in the BC:Z field will cause problems downstream? I'm being lazy here with the regex since all of my reads are [1-2]:N:0:<barcode>.
The other problems were the @PG field in the header, which was resolved by getting the most recent version from github, Version: 0.7.5a-r428, and the EOF problem that hasn't been fixed yet, but whatever...I have my flagstats.
I wanted to include the barcode "fastq comment field" in the sam, in case I ever wanted to go back to this data to do something like bam2fastq for a subset of reads in a bed region yada yada yada, so I added the -C option thinking that the entire field (e.g. 1:N:0:GTGGCC) would be included:
Code:
bwa mem -pC -t8 genome.fa interleaved_pe_reads.fq > aln.sam
Code:
samtools view -@2 -Sbhu aln.sam | samtools flagstat - [bam_header_read] EOF marker is absent. The input is probably truncated. [samopen] SAM header is present: 107 sequences. Parse error at line 109: missing colon in auxiliary data 0 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 0 + 0 mapped (-nan%:-nan%) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (-nan%:-nan%) 0 + 0 with itself and mate mapped 0 + 0 singletons (-nan%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
Code:
sed -n '109{p;q}' aln.sam HWI-ST651:134:H0B1AADXX:1:1101:1602:1977 65 chrUextra 11015091 0 83M17S = 13575111 2560021 NGTTTGAGTGGGGCGGTACATCTCTCAAATAATAACGGAGGTGTCCCAAGGCCAGCTCAGTGCGGACAGAAACCACACATAGATCAAGTCAGCATTTGCC #0<BFFFFFFFFFFIIIIIFFFFFFFIIIFFFFIFFIFBFF<BBFFBBBFFBFFFFFFBBFFBFF7BFFBFFFFFFFFBBBBBFFFBBBBBBFFFFFFBB NM:i:2 AS:i:77 XS:i:77 1:N:0:CACCTC
I used sed to fix the sam, and that seems to work:
Code:
sed 's/[1-2]\:N\:0\:\([A-Z]*\)/BC\:Z\:\1/' <aln.sam >fixed.sam samtools view -@2 -Sbuh fixed.sam | samtools flagstat - [bam_header_read] EOF marker is absent. The input is probably truncated. [samopen] SAM header is present: 107 sequences. 105795 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 99264 + 0 mapped (93.83%:-nan%) 105795 + 0 paired in sequencing 52991 + 0 read1 52804 + 0 read2 78847 + 0 properly paired (74.53%:-nan%) 98266 + 0 with itself and mate mapped 998 + 0 singletons (0.94%:-nan%) 2154 + 0 with mate mapped to a different chr 420 + 0 with mate mapped to a different chr (mapQ>=5)
Code:
sed 's/\([1-2]\:N\:0\:[A-Z]*\)/BC\:Z\:\1/' <aln.sam >fixed.sam
The other problems were the @PG field in the header, which was resolved by getting the most recent version from github, Version: 0.7.5a-r428, and the EOF problem that hasn't been fixed yet, but whatever...I have my flagstats.
Comment