Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA MEM mate pair incorrect insert sizes?

    I have been trying to run BWA MEM on my recently generated Nextera Mate Pair data in order to directly compare it to Novoalign. However, the output immediately shows some problems with the orientation and insert size calculations - only showing FF & RR reads and with very small insert sizes (20bp). To test a few scenarios, I took only the first 100 read pairs. The library was size selected for 3-5kb.

    Code:
    $ bwa mem -t 1 -M -R "@RG\tID:ID\tSM:Sample\tLB:Nextera_3-5" ../../hg19.fa R1.temp.fastq R2.temp.fastq > ../bwamem/temp.sam
    [M::main_mem] read 200 sequences (20000 bp)...
    [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (42, 0, 0, 37)
    [M::mem_pestat] analyzing insert size distribution for orientation FF...
    [M::mem_pestat] (25, 50, 75) percentile: (15, 23, 30)
    [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 60)
    [M::mem_pestat] mean and std.dev: (21.64, 10.61)
    [M::mem_pestat] low and high boundaries for proper pairs: (1, 75)
    [M::mem_pestat] skip orientation FR as there are not enough pairs
    [M::mem_pestat] skip orientation RF as there are not enough pairs
    [M::mem_pestat] analyzing insert size distribution for orientation RR...
    [M::mem_pestat] (25, 50, 75) percentile: (18, 32, 64)
    [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 156)
    [M::mem_pestat] mean and std.dev: (41.49, 33.13)
    [M::mem_pestat] low and high boundaries for proper pairs: (1, 202)
    [M::worker2@0] performed mate-SW for 376 reads
    [main] Version: 0.7.5a-r405
    [main] CMD: bwa mem -t 1 -M -R @RG\tID:ID\tSM:Sample\tLB:Nextera_3-5 ../../hg19.fa R1.temp.fastq R2.temp.fastq
    I thought I remembered reading that BWA had a problem with the orientation of Illumina mate pair so I did a reverse complement of both reads using seqtk which gave identical results.

    For comparison, using novoalign I got the following:

    Code:
    $ novoalign -d hg19_novoindex -f R1.temp.fastq R2.temp.fastq -i MP 4000,2500 -oSAM "@RG\tID:ID\tSM:Sample\tLB:Nextera_3-5" > ../bwamem/temp.sam
    # novoalign (V2.08.05)
    #       Paired Reads:      100
    #      Pairs Aligned:       44
    #     Read Sequences:      200
    #            Aligned:      135
    #   Unique Alignment:      132
    #   Gapped Alignment:       25
    #     Quality Filter:        2
    # Homopolymer Filter:        0
    # Mean  4061, Std Dev 583.4
    Any ideas? I'm stumped. I've been happy with novoalign but since I found this potential bug, I thought I'd let the community know (or let me know the option I missed in BWA).
    Last edited by hartmaier; 01-06-2014, 08:51 AM.

  • #2
    what no reads?

    Hi

    You seem to have only 200 reads, not enough by a huge margin to make any sensible estimations.

    Comment


    • #3
      Originally posted by pongly View Post
      Hi

      You seem to have only 200 reads, not enough by a huge margin to make any sensible estimations.
      As I said in my post, I took a subset of my data to troubleshoot what was going on. The same input file was used in both programs, yet the calculated insert sizes were completely different. Yes, I agree that 100 paired reads is not a reasonable amount to get an accurate distribution of the true insert size, however, my point is that the two programs are giving extremely different outputs for the insert sizes for those 100 reads and it is obvious from the output that bwa is wrong.

      Comment


      • #4
        Is the mean and standard deviation the estimated insert size from novoalign?

        Comment


        • #5

          Originally posted by Heng Li
          If you have a library with two distinct insert size distributions, the better way is not to perform paired-end mapping at all.
          Maybe this library has a small-insert peak and a large-insert peak, which confuses BWA? Often LMP libraries have a substantial fraction of reads with a short insert size. I have not used Novoalign, but the "MP 4000,2500" could perhaps be forcing it to use the higher of two peaks and ignore short inserts. In other words, it is not at all clear that BWA-mem is wrong and Novoalign is right.

          Perhaps you should use a 3rd aligner as a tiebreaker. BBMap can plot the insert size distribution so you can see what's going on, though you'd need more like 100k reads to produce a nice smooth curve. e.g.

          bbmap.sh -Xmx24g in=reads.fq ref=ref.fa ihist=ihist.txt rcs=f reads=100000

          ...where the 'rcs=f' flag is for long-mate-pair libraries; otherwise it assumes fragment library orientation for pairs.

          Comment


          • #6
            It is normal to find paired end reads in mate pair libraries. This is more pronounced as the size of fragments used for mate pair library prep increases. There are at least two sources of paired end reads in mate pair libraries. One is binding non-biotinylated fragments (fragments lacking junction) to capture beads which make library fragments. The other is result of random shearing true mate pair fragments. The latter is explained here:http://res.illumina.com/documents/pr...processing.pdf

            Comment


            • #7
              First, thank you everyone for reviving this old post and getting some nice discussion here.

              Originally posted by N311V View Post
              Is the mean and standard deviation the estimated insert size from novoalign?
              Yes, size selection was done via gel cutting for 3.5-5 kb so novoalign fits perfectly. I also have some larger insert size libraries and novoalign insert sizes always match with what is expected.

              I re-ran bwa mem with 100k reads...same result...note the complete lack of R-F reads (correct Illumina mate pairs should be in this orientation).
              Code:
              [M::main_mem] read 50000 sequences (4242525 bp)...
              [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (9024, 0, 0, 9093)
              [M::mem_pestat] analyzing insert size distribution for orientation FF...
              [M::mem_pestat] (25, 50, 75) percentile: (10, 22, 43)
              [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 109)
              [M::mem_pestat] mean and std.dev: (27.99, 23.96)
              [M::mem_pestat] low and high boundaries for proper pairs: (1, 142)
              [M::mem_pestat] skip orientation FR as there are not enough pairs
              [M::mem_pestat] skip orientation RF as there are not enough pairs
              [M::mem_pestat] analyzing insert size distribution for orientation RR...
              [M::mem_pestat] (25, 50, 75) percentile: (10, 22, 45)
              [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 115)
              [M::mem_pestat] mean and std.dev: (29.08, 25.38)
              [M::mem_pestat] low and high boundaries for proper pairs: (1, 150)
              [M::mem_process_seqs] Processed 50000 reads in 8.093 CPU sec, 8.111 real sec
              [main] Version: 0.7.10-r789
              [main] CMD: bwa mem -M /Volumes/GenomicsIII/Reference_Genomes/hg19.fa test.R1.fastq test.R2.fastq
              [main] Real time: 188.272 sec; CPU: 14.480 sec
              It is normal to find paired end reads in mate pair libraries.
              Yup, I know and I expect some. After plotting insert sizes from Novoalign, for example, this is a very minor fraction (for the new Nextera based Illumina protocol). I've also aligned the old v2 illumina MP data with novoalign which shows a much larger fraction of contaminating 'inward' facing reads (the new nextera kit is way better with regards to elimination of the contaminating inward reads). So, novoalign can detect both library types if they are there. Together, this tells me, at least in its current form, bwa mem cannot be used with large insert mate pair data (unless someone has any additional suggestions).

              Perhaps you should use a 3rd aligner as a tiebreaker. BBMap can plot the insert size distribution so you can see what's going on, though you'd need more like 100k reads to produce a nice smooth curve.
              I've never run BBmap before but I got 'out of memory error' on my local machine with 16GB. I have access to more on a cluster but based on the above explanation I am convinced that bwa mem cannot work with large insert mate pair data.

              Comment


              • #8
                Originally posted by hartmaier View Post
                I've never run BBmap before but I got 'out of memory error' on my local machine with 16GB. I have access to more on a cluster but based on the above explanation I am convinced that bwa mem cannot work with large insert mate pair data.
                Yes, BBMap requires ~20GB for human reference normally, though you can run it on a 16G node with this command:

                bbmap.sh -Xmx12g in=reads.fq ref=ref.fa ihist=ihist.txt rcs=f reads=100000 usemodulo nodisk

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                66 views
                0 likes
                Last Post seqadmin  
                Working...
                X