Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Need help defining BWA parameters for human screening

    I'm working with metagenomic bacterial samples from various human body sites, and these samples have various levels of human contamination. I have a large number of 100mer Illumina reads and I'm trying to setup a pipeline that will use BWA to identify human contaminant from these metagenomic communities of reads. The reference I am using is basically NCBI build 36 human genome. I have build 37 as well, but its > 4Gb in size so for now I'm sticking with b36.

    I've built a control query file of reads that are of known bacterial origin, and into that I have spiked a set of known human reads. And I'm trying to assess various BWA parameters using that control set.

    BWA is doing surprisingly poorly in what I had thought would be a simple application of the aligner. My control set contains:

    40 million - bacterial reads
    4 million - human reads

    (about 1 lane's worth of data from a 50Gb Illumina run which is our current standard).

    My first thought was to simply try setting a simple mismatch/edit-distance cutoff. I've tried:

    (parameter) - (result)
    -n 15 - found 73.1% of human spike, falsely identified 3593/40mil bacteria
    -n 20 - found 76.3% of human spike, falsely identified 3628/40mil bacteria
    -n 25 - found 79.3% of human spike, falsely identified 3710/40mil bacteria

    This seems quite low. Then I found some cases where a read that clearly had only 3 mismatches (confirmed using BLASTN) was NOT being found by BWA. After some head scratching, I realized that those 3 mismatches were occuring within the first 30 bases of the read, and it looks like the default seeding behavior of BWA for 100mer reads is to allow only 2 mismatches in the first 30 bases. In order to catch these cases where > 2 mismatches occur early in the read, I tried setting the seed to the full length of my read, and then allowing the same number of mismatches in the seed as I do in the full read. I tried:

    -n 25 -k 25 -l 99 (for 100mer reads)
    found 62.1% of human spike, falsely identified 31/40mil bacteria

    but as you can see, those parameters did significantly worse than just using -n 25. This doesn't make sense to me because I had thought those parameters would push the seeding length up to 99bp, and allow 25 mismatches in the seed. And then allow those same 25 mismatches across the full read. So I had expected to find MORE of the human spike than did the -n 25 setting by itself.


    Basically I just need to come up with a set of parameters that can do this screening job. I'd love a deeper explanation of why the -n 25 -k 25 -l 99 parameter set found fewer hits than -n 25 does by itself. But I am really hoping that someone here can suggest parameters I can use to do this human screen. Any suggestions?


    Thanks,
    John Martin

  • #2
    Firstly, please DO NOT use the toplevel file from Ensembl. It has >1Gbp ambiguous bases as space holder. You may use the b37 reference file here: ftp://ftp.ncbi.nih.gov/1000genomes/f...cal/reference/

    Secondly, bwa-short would not work well for large -n. This is by design. You may consider disable seeding with -l 10000, but I guess it will be impractically slow. What is the sequencing error rate you are simulating? It seems very high. Typical Illumina sequencing error rate is only ~1% which means an 100bp read only has a couple of errors.

    Comment


    • #3
      The expected actual variation between individuals is < 1%, but the sequencing error we're seeing from the Illumina 100mer reads varies quite a bit. I used a collection of 100mer reads from a different project being done here at WashU for my human control data, I did not simulate errors for that. They were real 100mer Illumina reads from a different individual (I did not use the Ensembl/NCBI human build for the control).

      I am not sure what sequencing error rate I should expect to see in these Illumina 100mers. I do see some fraction of the reads containing a significant number of ambiguous bases (N). Maybe ~1-2% of the reads will have > 3 ambiguous bases. I was trying such a high -n setting to allow reads containing Ns to align to human.

      What is the largest value of -n that bwa-short can safely use in alignments?

      Comment


      • #4
        I usually recommend the default. For 100bp reads, the default allows 5 mismatches/gaps. You may use -n 6 or 7, but not more. If I were analyzing the data, I would throw those ~1-2% reads in the first place. In addition, are these purity filtered reads? Do the reads have poor-quality tail? You may try -q15 to trim the tail. In general, i guess your data is non-typical. For single-end 100bp reads, bwa should map more than 90% to human; with paired-end, even more.

        Comment


        • #5
          I will give your suggestion a try. I am not sure if the tail is poor quality or not for 100mer illumina reads, but I can use my control set to play around with the -q setting.

          Its very interesting to me to hear that the max suggested -n value is 7. The fact that we are doing alignments with settings as high as -n 25 may explain some of the weird behaviors we are seeing with the aligner.

          Thanks for the help.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          27 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          30 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          26 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          52 views
          0 likes
          Last Post seqadmin  
          Working...
          X