Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bwa-mem -C option problem with barcode field in SAM file

    I recently started working bwa-mem on a 2X100 PE RNAseq dataset (ILMN cassava 1.8+) and ran into some problems with the sam/bam headers, as well as the -C option to include the barcode information in the BC:Z:<barcode> field. Spent the better part of a day searching SEQanswers, sourcefourge bwa page, etc. looking for a solution, and finally came up with what I believe is a simple workaround. I'm interested to see if others are having these problems, and if my workaround is sufficient.

    I wanted to include the barcode "fastq comment field" in the sam, in case I ever wanted to go back to this data to do something like bam2fastq for a subset of reads in a bed region yada yada yada, so I added the -C option thinking that the entire field (e.g. 1:N:0:GTGGCC) would be included:

    Code:
    bwa mem -pC -t8 genome.fa interleaved_pe_reads.fq > aln.sam
    and after reading MacManes blog post on samtools flagstats for bwa output (here) I tried to get flagstats:

    Code:
    samtools view -@2 -Sbhu aln.sam | samtools flagstat -
    
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [samopen] SAM header is present: 107 sequences.
    Parse error at line 109: missing colon in auxiliary data
    0 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    0 + 0 mapped (-nan%:-nan%)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (-nan%:-nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (-nan%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    I should note that this was the sourceforge release (Version: 0.7.5a-r405), not the most recent git commit. I ignored the EOF problem and looked at line 109:

    Code:
    sed -n '109{p;q}' aln.sam
    
    HWI-ST651:134:H0B1AADXX:1:1101:1602:1977	65	chrUextra	11015091	0	83M17S	=	13575111	2560021	NGTTTGAGTGGGGCGGTACATCTCTCAAATAATAACGGAGGTGTCCCAAGGCCAGCTCAGTGCGGACAGAAACCACACATAGATCAAGTCAGCATTTGCC	#0<BFFFFFFFFFFIIIIIFFFFFFFIIIFFFFIFFIFBFF<BBFFBBBFFBFFFFFFBBFFBFF7BFFBFFFFFFFFBBBBBFFFBBBBBBFFFFFFBB	NM:i:2	AS:i:77	XS:i:77	1:N:0:CACCTC
    and figured that either 1) ILMN casava 1.8+ barcode comment does not conform to the SAM spec, per bwa.1, or 2) bwa had a bug. Either way, I just want the flagstats to work.

    I used sed to fix the sam, and that seems to work:

    Code:
    sed 's/[1-2]\:N\:0\:\([A-Z]*\)/BC\:Z\:\1/' <aln.sam >fixed.sam
    
    samtools view -@2 -Sbuh fixed.sam | samtools flagstat -
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [samopen] SAM header is present: 107 sequences.
    105795 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    99264 + 0 mapped (93.83%:-nan%)
    105795 + 0 paired in sequencing
    52991 + 0 read1
    52804 + 0 read2
    78847 + 0 properly paired (74.53%:-nan%)
    98266 + 0 with itself and mate mapped
    998 + 0 singletons (0.94%:-nan%)
    2154 + 0 with mate mapped to a different chr
    420 + 0 with mate mapped to a different chr (mapQ>=5)
    (This is a subset of reads for testing purposes). My sam barcode field is now BC:Z:GTGGCC, but I decided I want the entire comment field so this would have the strand information as well, for orienting the reads properly in some future bed/bam intersect bam2fastq (instead of retrieving them from the original fastq, which I could also do).

    Code:
    sed 's/\([1-2]\:N\:0\:[A-Z]*\)/BC\:Z\:\1/' <aln.sam >fixed.sam
    That works for flagstat, which works for me. Does anyone know if having this entire field in the BC:Z field will cause problems downstream? I'm being lazy here with the regex since all of my reads are [1-2]:N:0:<barcode>.

    The other problems were the @PG field in the header, which was resolved by getting the most recent version from github, Version: 0.7.5a-r428, and the EOF problem that hasn't been fixed yet, but whatever...I have my flagstats.

  • #2
    I have tried samtools flagstat on some bam file. I have run BWA without -C option, so I do not have BC:Z fields, and samtools flagstat does not work.
    I do not think BC:Z fields cause any problem.

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM
    • seqadmin
      Strategies for Sequencing Challenging Samples
      by seqadmin


      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
      03-22-2024, 06:39 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    31 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    32 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 09:21 AM
    0 responses
    28 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-04-2024, 09:00 AM
    0 responses
    53 views
    0 likes
    Last Post seqadmin  
    Working...
    X