Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bwa aligns one file but not another

    I am having a problem in which bwa aligns one .bam, but not another despite the fact that they are simply forward and reverse reads from the same sequencing run. The two .bam files and their reference are attached (files are very small: ~1000 reads per .bam file, reference is ~1600 bases)

    Background:
    PCR amplicon of the V1 hypervariable region of as single purified bacteria.
    Ion Torrent sequencing adaptors ligated, and sequenced on Ion Torrent PGM.
    Because of adaptor ligation, reads are in forward and reverse directions.
    By using forward and reverse primers as the "barcodes", the reads are seperated into 2 unaligned .bam files, one with forward (F.bam), and one with reverse reads (R.bam).
    Reference fasta file for alignment is the 16s sequence of the same bacterial species from NCBI. We will call it 28.fasta.

    Goal:
    Generate a consensus sequence from each of the .bam files.

    Approach:
    Align .bam files with bwa, pass to samtools mpileup to make consensus sequence

    Steps:
    1) index reference:
    bwa index -p 28 -a is 28.fasta

    2) find SA coodinates of input files:
    bwa aln 28 -b R.bam > R.sai
    bwa aln 28 -b F.bam > F.sai

    3) Generate alignments in SAM format:
    bwa samse 28 R.sai R.bam > R.aligned.sam
    bwa samse 28 F.sai F.bam > F.aligned.sam

    4) Convert .sam to .bam for use in samtools:
    samtools view -bS R.aligned.sam > R.aligned.bam
    samtools view -bS F.aligned.sam > F.aligned.bam

    5) Sort .bam files
    samtools sort R.aligned.bam R.aligned.sorted
    samtools sort F.aligned.bam F.aligned.sorted

    6) get the consensus sequence
    samtools mpileup -uf 28.fasta R.aligned.sorted.bam | bcftools view -cg - > R.vcf
    samtools mpileup -uf 28.fasta F.aligned.sorted.bam | bcftools view -cg - > F.vcf

    7) Convert vcf to fasta:
    perl vcfutils.pl vcf2fasta R.vcf > R.fasta
    perl vcfutils.pl vcf2fasta F.vcf > F.fasta

    Viewing the R.fasta consensus file, I see a consensus read at the start of the reference which is where we expect to see a V1 consensus sequence (I'll leave my questions about the gaps indicated by lowercase for later):

    $ more R.fasta
    >gi|7407118|gb|AF227164.1|
    nttgAACGCTAGCGGgatgctttacacatGCAAGTCGAACGGCAGCACAGAGGAACTTGT
    TCCTTGGGTGGCGAGTGGCGCACGGGTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnaccggtgcnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnccacggt
    a

    Viewing the F.fasta file, I see FAILURE!
    >gi|7407118|gb|AF227164.1|
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnntggatgatg


    Working backwards through the steps. If I take the aligned.sorted.bam files and index them and view them with samtools tview, I can see the alignment to the V1 region in R.aligned.sorted.bam, while there is no alignment whatsoever with F.aligned.sorted.bam. This points to the bwa samse step as the problem.

    Can anyone answer why one .bam file is aligning fine in bwa while while the other fails?
    Attached Files

  • #2
    Did you eyeball the second read? Did you eyeball its sam file?

    With a total failure, there is some assumption you are making about your files that is simply wrong, and you have to diagnose that yourself, because you have the files.

    Comment


    • #3
      Thanks for the prompt reply swbarnes2.

      >Did you eyeball the second read...sam file?

      Yes, I have. What I did was actually align the same two .bam files with tmap. I then opened the 4 alignments (R.bam and F.bam each aligned in bwa and tmap) in IGV. What I see is this:

      1) R.bam aligned with bwa: coverage in the reverse direction starting at 46 and dropping to 1 - indicating very few of the reads actually mapped. This is shown in the .sam file in the second column - most reads have a flag of 4.

      2) F.bam aligned with bwa: no coverage except for a very short read to mapped to the wrong place

      3) R.bam aligned with tmap: greater than 700x coverage in the reverse direction across the whole amplicon. This is the expected result.

      4) F.bam aligned with tmap: greater than 1000x coverage in the forward direction across the whole amplicon. This is the expected result.

      Here's a screen cap of the above IGV with teh two tmap alignments on top:



      From these results, it appears that both .bam files are "good" and produce the expected coverage based on the number of reads in the files when used with tmap. So my question remains, what is going so horribly wrong in bwa? Even the .bam that works only maps a tiny fraction of the reads. According to the (rather terse) bwa manual entry for bwa samse:

      "Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen."

      Could this last line mean that of all the full length reads, one is randomly chosen, and the rest are ignored? The great majority of the reads in my .bam file are high quality full length reads. This would explain why I see only one full length read in the bwa alignment of R.bam with all the other reads not being full length. It still doesn't explain why I get nothing for F.bam
      Last edited by stealthtoad; 03-06-2013, 03:54 PM. Reason: clarigy last sentence.

      Comment


      • #4
        As a follow up, it has occurred to me that perhaps bwa needs some fine tuning using the options available. The only option for bwa samse is [-n MaxOcc] which the manual explains:

        "Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written."

        While this sounds like its only applicable to paired end reads, which mine are not. I tried it anyway with -n set to 500. The results were the same.

        I suppose in the end the answer is "align Ion Torrent output with the Ion Torrent aligner (tmap) or don't be surprised when things don't go your way", nonetheless I would still like to understand bwa's behavior.
        Last edited by stealthtoad; 03-06-2013, 04:38 PM. Reason: I type badly.

        Comment


        • #5
          "Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen."

          Could this last line mean that of all the full length reads, one is randomly chosen, and the rest are ignored? The great majority of the reads in my .bam file are high quality full length reads. This would explain why I see only one full length read in the bwa alignment of R.bam with all the other reads not being full length. It still doesn't explain why I get nothing for F.bam
          No this means that when a reads can be mapped to many places and the mapping quality are same then one hit will be chosen randomly to output into the sam file.

          If you want more hits you should adjust parameters in BWA ALN step, other than mess with BWA SAMSE, SAMSE merely reads from .sai file to make the .sam, not much going on here.

          Best,

          Dong

          Comment


          • #6
            Dong,

            Thanks for clarifying the samse manual entry.

            Any idea why bwa is showing only a single full length alignment for R.bam and nothing for F.bam?

            As I have demonstrated, it is not something wrong with the .bam files as swbarnes2 suggested: tmap does just fine with them.

            I have also tested whether the proximity of the read alignment to the start of the reference fasta was an issue - no change in the results.

            As you suggested, I have looked at the bwa aln options and cannot see any settings in that would affect the number of reads produced in the alignment.

            I am going to try bwa mem....

            Comment


            • #7
              I tried your data, BWA ALN with -n 10 definitely changed the results. Also your reads seems in different lengths. For ALN there are so many options to play with, e.g. -l for seed length, -k for max diff in seed, etc.

              Comment


              • #8
                xied75,

                Thanks, I will try bwa aln -n 10 and check out the results.

                Update on bwa mem. The version of bwa I had was 0.6.1 which does not have mem. So I downloaded and installed bwa 0.7.0. and tried to align the R.bam and F.bam with bwa mem which didn't work. The manual says bwa mem input is fastq, so I had to download bam2fastq, install that, and then convert the .bam files to fastq. bwa mem then dutifully aligned the fastq files. The results of bwa mem match what I got with tmap - success!

                One issue with this approach (.bam > .fastq > bwa mem > aligned.sam) I am concerned that the quality scores may (are?) coded differently in a PGM .bam file than an Illumina file, so if bam2fastq assumes my .bam is Illumina, the quality scores in the resulting fastq will not be valid....

                Comment


                • #9
                  Ok, I've tried bwa aln with -n set to: 10, 50, and 500 in both bwa 0.6.1 and 0.7.0:

                  R.bam max coverage went from 46 without -n set, to 68 with n = 10, 72 when n = 50, and 73 when n=500, whereas alignment with bwa mem or tmap yielded max coverage of >700. There was no difference in results between bwa 0.6.1 and bwa 0.7.0.

                  F.bam max coverage went from 0 without -n set, to 0 with n = 10, 0 with n=50, and 0 with n=500, whereas alignment with bwa mem or tmap gave max coverage >1000.

                  So, I still don't know why bwa aln won't align the forward reads from F.bam, while both bwa mem and tmap will. In addition while R.bam reads align with bwa aln, I can't seem to get an optimized alignment fiddling with the -n option.

                  Why not just use tmap or bwa mem then? Well bwa mem requires a FASTQ input which leaves me worrying about Ion Torrent quality scores being converted correctly by bam2fastq, and tmap isn't installed on the server I am using, plus I'd like to understand why bwa isn't working on the F.bam reads so that I learn something...

                  I will try -l seed length and -k max dif in seed...
                  Last edited by stealthtoad; 03-07-2013, 12:05 PM.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Essential Discoveries and Tools in Epitranscriptomics
                    by seqadmin




                    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                    04-22-2024, 07:01 AM
                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  59 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  57 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  51 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  56 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X