Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA Mem Not Using All Reads

    Hi there,

    I have ~2,450,000 total reads per individual (1,225,000, paired end). When using bwa mem to align paired end reads to a de novo assembly, I get the following output

    [M::main_mem] read 79248 sequences (10000202 bp)...
    [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 5893, 0, 0)
    [M::mem_pestat] skip orientation FF as there are not enough pairs
    [M::mem_pestat] analyzing insert size distribution for orientation FR...
    [M::mem_pestat] (25, 50, 75) percentile: (75, 114, 156)
    [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 318)
    [M::mem_pestat] mean and std.dev: (114.68, 50.30)
    [M::mem_pestat] low and high boundaries for proper pairs: (1, 399)
    [M::mem_pestat] skip orientation RF as there are not enough pairs
    [M::mem_pestat] skip orientation RR as there are not enough pairs
    [M::mem_process_seqs] Processed 79248 reads in 8.541 CPU sec, 11.328 real sec.

    I have two questions:

    1. Why is bwa mem only processing 79,000 reads and
    2. Why are the FF, RF and RR orientations being skipped?

  • #2
    1) Make sure you're using the latest version of BWA and that the input files are not corrupt. The files may have been truncated when being transferred, or there may be a strange symbol in the read headers that is interpreted as end of file. Check their size with, for example, wc.
    2) In a normal fragment library there should only be a single orientation (FR). Since there were 0 reads that mapped with other orientations, in this case, they are being ignored.

    Comment


    • #3
      Hi Brian,

      Thanks for your fast reply!
      I'm using bwa 0.7.10, which I downloaded and recompiled today. During compilation, I did not get any "error" messages, but multiple "Wno-unused-function" flags. Do these indicate any problem with my version of bwa, or can they be ignored?

      Regarding input files being corrupt, I have already used them to build a de novo reference assembly in Velvet. If there are characters that are terminating input early because they are being read as end of file, which characters should I be finding and replacing?

      Tony

      Comment


      • #4
        Tony,

        The compilation messages you mention sound like they can be ignored. I have no idea about what specific character might be a problem; I'm just suggesting it as a possibility. If you've already used the reads for assembling, are you sure that Velvet used all of them, or just the first 79248?

        If Velvet didn't have a problem, then the input files should be fine. So I suggest you try a different aligner (such as BBMap ) or different version of BWA, or contact the author directly. I have encountered problems in the past with older versions of BWA stopping after some number of reads, but normally that was accompanied by a core dump. I never figured out what the problem was; it was nondeterministic.

        Comment


        • #5
          @TKTKTK: Have you modified the fastq files in any way (trimming)? What technology are these reads from?

          Comment


          • #6
            Does it just stop after one iteration like that? Normally you'll see a whole series of messages like that. If you use multiple threads, it looks like they sync then with a pthread_join before iterating again. Assuming you're using normal paired-end reads, they should all be FR, which is probably why it skips looking for discordant alignments.

            Comment


            • #7
              @BrianBushnell Thanks again for the input! I will try a couple of other aligners and see if I get different results. I also checked to make sure Velvet was using all of the reads provided for assembly and it is, so that rules out one problem.

              @GenoMax - These are 300bp, paired end reads from an Illumina MiSeq. I've used trimmomatic for quality score trimming and to remove all barcode and adapter readthrough.

              @dpryan After receiving that output I get a list of all of the nodes in my reference assembly, and then my sam alignment output. I don't get multiple iterations of that original output I pasted above, though. I'm working remotely on a cluster, and have specified the number of cpus through the scheduler, but have not specified threading in my bwa commands. I will try again and see if this makes a difference.

              Thanks for everyone's input!

              Tony

              Comment


              • #8
                You'll want to redirect the alignments to a file as well (i.e., "bwa ... > output.sam"). Alternatively, just pipe to samtools and then save the resulting BAM file.

                Comment


                • #9
                  Specifying additional threads has resulted in more reads being processed, so that is one mystery solved!

                  Comment


                  • #10
                    Originally posted by TKTKTK View Post
                    Specifying additional threads has resulted in more reads being processed, so that is one mystery solved!
                    That is not logical. Curious why that is happening.

                    Comment


                    • #11
                      More like "mystery somewhat worked around". You shouldn't have to use multiple threads to get the entire output.

                      Comment


                      • #12
                        I'm working remotely on a cluster, and have to schedule jobs - could there be some conflict with the scheduler and the way bwa mem is written?

                        Comment


                        • #13
                          How much time are you asking for when you schedule a job? My guess is that you aren't allocating enough time and the job is just being killed after hitting the time limit (presuming there's a log file, this is often mentioned).

                          Comment


                          • #14
                            I'm currently asking for 2h per job to avoid that problem - in the log file, the only output I receive is that the job completed successfully, and didn't use all of the time or memory allocated.

                            Comment


                            • #15
                              Perhaps you should do a "titration" to find out how many reads your cluster is processing per hour (there will be variables like limits on queue/your account/how loaded that node is with other jobs etc. that may have to be accounted for).

                              BTW: 2.5M reads per sample should have been processed within 2 hours by a CPU of recent vintage. Are you starting multiple samples in parallel as independent mapping jobs?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              15 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              55 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              70 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X