  • dp05yk
    Member
    • Dec 2010
    • 66

    #16
    Originally posted by earonesty
    Poor alignment quality slows down bwa because it has to work harder. One way to speed things up is to clean up your fastq file before feeding it to the aligner. (Removing N's, low quality sequence tails, adapter/primer reads, etc.)
    Correct me if I'm wrong, but don't N's speed up BWA? As soon as the sequence being processed passes a certain mismatch/indel threshold, the alignment process for that sequence is aborted, and BWA moves on to the next sequence.


    • earonesty
      Member
      • Mar 2011
      • 52

      #17
      Perhaps that's how it's supposed to work... but repeats, artifacts, primers, N's, etc. seem to slow things down for me. A lot of N's - I would agree. But imagine a few... these get matched as "A or T or G or C" causing lots of potential matches, and requiring each match to be more rigorously evaluated. If you trim them from the ends, it won't hurt your alignment at all (bwa may do this in newer versions?), but you can't trim them from the middle, and they necessitate exhaustive searching of the possibilities.
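A minimal sketch of the end-trimming idea, not taken from any particular tool: it strips N's from both ends of a read and drops a low-quality 3' tail, while leaving N's in the middle alone. The function name `trim_read`, the quality threshold of 20, and the Phred+33 offset are assumptions made for illustration.

```python
def trim_read(seq, qual, min_q=20, offset=33):
    """Trim N's from both ends and a low-quality 3' tail from one read.

    seq and qual are the sequence and quality strings of a FASTQ record.
    min_q and offset (Phred+33) are illustrative assumptions.
    """
    # Walk back from the 3' end while the base is N or below the threshold.
    end = len(seq)
    while end > 0 and (seq[end - 1] in "Nn" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    # Walk forward from the 5' end over leading N's only.
    start = 0
    while start < end and seq[start] in "Nn":
        start += 1
    # N's between start and end are kept: they cannot be trimmed
    # without splitting the read.
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    print(trim_read("NNACGTNACGNN", "IIIIIIIIII##"))
```

The internal N survives trimming, which is the case the post describes as unavoidable.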
      Last edited by earonesty; 04-16-2011, 11:36 AM.


      • dp05yk
        Member
        • Dec 2010
        • 66

        #18
        Originally posted by earonesty
        A lot of N's - I would agree. But imagine a few... these get matched as "A or T or G or C" causing lots of potential matches, and requiring each match to be more rigorously evaluated.
        They do not get matched as "A or T or G or C". An N in an input sequence is automatically treated as a mismatch.

        An N in the reference sequence is randomly assigned A/T/G/C.

        This is from the BWA paper; I also verified it in the code.


        • lh3
          Senior Member
          • Feb 2008
          • 686

          #19
          Liang Ping and Darren Peters have pointed out that the multithreading in bwa aln is not optimal. It is fine for ~10 threads, but when you push to ~20 threads, a lot of CPU time will be wasted on synchronization. They have provided a patch which has been applied to:

          Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment) - lh3/bwa


          • earonesty
            Member
            • Mar 2011
            • 52

            #20
            Originally posted by dp05yk
            They do not get matched as "A or T or G or C". An N in an input sequence is automatically treated as a mismatch.

            An N in the reference sequence is randomly assigned A/T/G/C.

            This is from the BWA paper; I also verified it in the code.
            Sorry, my explanation was a terse summary of the net effect, not meant as a literal explanation of what happens.

            Unlike a nucleotide mismatch, an "N" is not a "mismatch for some and a non-mismatch for others"... it's "equally a mismatch for all possible reference bases" ... which causes "more possible equal matches".

            Here's a concrete example:

            You can see how "ANGGCTGC" can match many more locations with equal likelihood as opposed to "ATGGCTGC" ... which will match, on average, around 4 times fewer locations... only those with T's in the second position. The first sequence would be a poor, but still passing match for many more locations. As long as there are not enough N's to discredit a whole alignment, the aligner has to consider more possible locations as potential (albeit poor) matches.
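One rough way to test this intuition on a toy scale is to count candidate locations by brute force. The sketch below assumes dp05yk's rule that an N in the read always counts as a mismatch, uses a made-up 24-base reference, and allows only substitutions (no indels). It is not how BWA searches (BWA uses a BWT index); `mismatches` and `hit_positions` are hypothetical helper names.

```python
def mismatches(read, window):
    """Count mismatches; an N in the read never matches (assumed rule)."""
    return sum(1 for r, g in zip(read, window) if r == "N" or r != g)


def hit_positions(read, ref, max_mm):
    """Start positions where the read fits the reference within max_mm mismatches."""
    n = len(read)
    return [
        i
        for i in range(len(ref) - n + 1)
        if mismatches(read, ref[i:i + n]) <= max_mm
    ]


if __name__ == "__main__":
    # Made-up toy reference: three 8-mers differing only in the second base.
    ref = "ATGGCTGCAAGGCTGCACGGCTGC"
    print(hit_positions("ATGGCTGC", ref, 0))  # [0]
    print(hit_positions("ANGGCTGC", ref, 1))  # [0, 8, 16]
    print(hit_positions("ANGGCTGC", ref, 0))  # []
```

Under this rule the N always consumes one unit of the mismatch budget, so how many locations pass depends on how large that budget is.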

            Or maybe you're right... maybe it's not the N's... maybe it's just a high error rate, or poor alignment quality, or something else. But if I ever try aligning with an "uncleaned" fastq... (adapters, skewing, or especially poor quality tails) ... it's always a lot slower (and a waste of that time aligning stuff no one wants to see).

            Perhaps (mostly to assure myself I'm not crazy) I'll cobble together an example that's easy to test if I have some time tomorrow.
            Last edited by earonesty; 04-16-2011, 05:30 PM.


            • earonesty
              Member
              • Mar 2011
              • 52

              #21
              Using wgsim, it seems I was definitely wrong about N's...

              Originally posted by earonesty
              You can see how "ANGGCTGC" can match many more locations with equal likelihood as opposed to "ATGGCTGC" ... which will match, on average, around 4 times fewer locations... only those with T's in the second position. The first sequence would be a poor, but still passing match for many more locations. As long as there are not enough N's to discredit a whole alignment, the aligner has to consider more possible locations as potential (albeit poor) matches.
              I ran a bunch of simulations with 100K reads on the yeast and human genomes; I tried seed lengths of 32 and 17. I ran each command twice and kept the better result (to avoid drive-cache issues).

              When running with a seed length of 32:

              1. Extra N's don't change things much. Too many extra N's speed things up, as sequences fail to align.

              2. Increasing the error rate from .02 to .04 drops performance by 10% and 20% for the 2 seed lengths, but going much higher causes performance to increase, as sequences fail to align.

              So the error rate is the issue I probably see when cleaning up data... not N's. N's are just an artifact within the lower-quality tiles that I thought were causing issues... but weren't. I have definitely seen bwa run terribly slowly on data that wasn't cleaned up... 10-20% in simulation isn't anywhere near the effect I've seen. Can't seem to reproduce it though.
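A sketch of how such a timing experiment could be scripted, not the exact commands used above. It assumes `wgsim` and `bwa` are on the PATH and relies on `wgsim -e` (base error rate), `wgsim -N` (number of read pairs), `bwa aln -l` (seed length) and `bwa aln -f` (output file). File names, the helper names, and treating "100K reads" as 100,000 pairs are assumptions.

```python
import subprocess
import time


def wgsim_cmd(ref_fa, out1, out2, error_rate, n_pairs=100_000):
    """argv for simulating paired reads with a given base error rate."""
    return ["wgsim", "-e", str(error_rate), "-N", str(n_pairs), ref_fa, out1, out2]


def bwa_aln_cmd(index_prefix, fastq, seed_len, sai_out):
    """argv for bwa aln with an explicit seed length, writing to sai_out."""
    return ["bwa", "aln", "-l", str(seed_len), "-f", sai_out, index_prefix, fastq]


def best_of(cmd, repeats=2, run=subprocess.run):
    """Run cmd `repeats` times and return the fastest wall-clock time in seconds.

    Keeping the better of two runs mirrors the post's way of reducing
    drive-cache effects. `run` is injectable so the helper can be dry-run.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run(cmd, check=True)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for err in (0.02, 0.04):
        print(" ".join(wgsim_cmd("ref.fa", "r1.fq", "r2.fq", err)))
        for seed in (32, 17):
            print(" ".join(bwa_aln_cmd("ref.fa", "r1.fq", seed, "r1.sai")))
```

Passing each `bwa_aln_cmd(...)` list to `best_of` would give one timing per error-rate and seed-length combination.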
              Last edited by earonesty; 04-18-2011, 07:37 AM.


              • robs
                Senior Member
                • May 2010
                • 116

                #22
                Originally posted by dp05yk
                They do not get matched as "A or T or G or C". An N in an input sequence is automatically treated as a mismatch.
                It depends on which algorithm in BWA you use. I checked bwasw, and there N's in the input sequence are replaced by a random base. Therefore, they might not be treated as a mismatch, and you will likely get different results when running BWA multiple times.
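To illustrate why random replacement can make results vary between runs, here is a small sketch. It is not bwasw's code; `replace_ns` is a hypothetical helper, and nothing here says how bwasw seeds its random number generator.

```python
import random


def replace_ns(read, rng):
    """Replace every N in the read with a base drawn at random from ACGT."""
    return "".join(rng.choice("ACGT") if base in "Nn" else base for base in read)


if __name__ == "__main__":
    read = "ANGGCTGCNN"
    # Unseeded generators: the substituted bases can differ from run to run.
    print(replace_ns(read, random.Random()))
    print(replace_ns(read, random.Random()))
    # With a fixed seed, this sketch gives the same substitution every time.
    print(replace_ns(read, random.Random(42)) == replace_ns(read, random.Random(42)))
```

Each substituted base has a one-in-four chance of agreeing with the reference at that position, which is why such an N may or may not end up counted as a mismatch.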
