SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
bwa - sampe - large insert sizes and slow Elsie Bioinformatics 16 10-31-2013 02:49 PM
mate pair insert size variation and de novo assembly Mark Introductions 2 10-13-2012 12:48 AM
BWA and mate pair bouhassi Bioinformatics 0 12-07-2011 07:33 AM
Maximum mate pair insert size? anar Sample Prep / Library Generation 2 06-14-2011 10:37 PM
bfast with illumina mate-pair and insert size estimation Protaeus Bioinformatics 3 01-19-2011 01:46 PM

Reply
 
Thread Tools
Old 01-06-2014, 07:34 AM   #1
hartmaier
Member
 
Location: Pittsburgh

Join Date: Dec 2012
Posts: 12
Default BWA MEM mate pair incorrect insert sizes?

I have been trying to run BWA MEM on my recently generated Nextera Mate Pair data in order to directly compare it to Novoalign. However, the output immediately shows some problems with the orientation and insert size calculations - only showing FF & RR reads and with very small insert sizes (20bp). To test a few scenarios, I took only the first 100 read pairs. The library was size selected for 3-5kb.

Code:
$ bwa mem -t 1 -M -R "@RG\tID:ID\tSM:Sample\tLB:Nextera_3-5" ../../hg19.fa R1.temp.fastq R2.temp.fastq > ../bwamem/temp.sam
[M::main_mem] read 200 sequences (20000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (42, 0, 0, 37)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (15, 23, 30)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 60)
[M::mem_pestat] mean and std.dev: (21.64, 10.61)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 75)
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (18, 32, 64)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 156)
[M::mem_pestat] mean and std.dev: (41.49, 33.13)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 202)
[M::worker2@0] performed mate-SW for 376 reads
[main] Version: 0.7.5a-r405
[main] CMD: bwa mem -t 1 -M -R @RG\tID:ID\tSM:Sample\tLB:Nextera_3-5 ../../hg19.fa R1.temp.fastq R2.temp.fastq
I thought I remembered reading that BWA had a problem with the orientation of Illumina mate pair so I did a reverse complement of both reads using seqtk which gave identical results.

For comparison, using novoalign I got the following:

Code:
$ novoalign -d hg19_novoindex -f R1.temp.fastq R2.temp.fastq -i MP 4000,2500 -oSAM "@RG\tID:ID\tSM:Sample\tLB:Nextera_3-5" > ../bwamem/temp.sam
# novoalign (V2.08.05)
#       Paired Reads:      100
#      Pairs Aligned:       44
#     Read Sequences:      200
#            Aligned:      135
#   Unique Alignment:      132
#   Gapped Alignment:       25
#     Quality Filter:        2
# Homopolymer Filter:        0
# Mean  4061, Std Dev 583.4
Any ideas? I'm stumped. I've been happy with novoalign but since I found this potential bug, I thought I'd let the community know (or let me know the option I missed in BWA).

Last edited by hartmaier; 01-06-2014 at 07:51 AM.
hartmaier is offline   Reply With Quote
Old 06-25-2014, 04:22 AM   #2
pongly
Junior Member
 
Location: London, UK

Join Date: Jun 2014
Posts: 1
Default what no reads?

Hi

You seem to have only 200 reads, not enough by a huge margin to make any sensible estimations.
pongly is offline   Reply With Quote
Old 07-23-2014, 06:01 AM   #3
hartmaier
Member
 
Location: Pittsburgh

Join Date: Dec 2012
Posts: 12
Default

Quote:
Originally Posted by pongly View Post
Hi

You seem to have only 200 reads, not enough by a huge margin to make any sensible estimations.
As I said in my post, I took a subset of my data to troubleshoot what was going on. The same input file was used in both programs, yet the calculated insert sizes were completely different. Yes, I agree that 100 paired reads is not a reasonable amount to get an accurate distribution of the true insert size, however, my point is that the two programs are giving extremely different outputs for the insert sizes for those 100 reads and it is obvious from the output that bwa is wrong.
hartmaier is offline   Reply With Quote
Old 07-23-2014, 05:23 PM   #4
N311V
Member
 
Location: Australia

Join Date: Jul 2013
Posts: 34
Default

Is the mean and standard deviation the estimated insert size from novoalign?
N311V is offline   Reply With Quote
Old 07-23-2014, 06:58 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

http://sourceforge.net/p/bio-bwa/mai...6594D4@hsr.it/
Quote:
Originally Posted by Heng Li
If you have a library with two distinct insert size distributions, the better way is not to perform paired-end mapping at all.
Maybe this library has a small-insert peak and a large-insert peak, which confuses BWA? Often LMP libraries have a substantial fraction of reads with a short insert size. I have not used Novoalign, but the "MP 4000,2500" could perhaps be forcing it to use the higher of two peaks and ignore short inserts. In other words, it is not at all clear that BWA-mem is wrong and Novoalign is right.

Perhaps you should use a 3rd aligner as a tiebreaker. BBMap can plot the insert size distribution so you can see what's going on, though you'd need more like 100k reads to produce a nice smooth curve. e.g.

bbmap.sh -Xmx24g in=reads.fq ref=ref.fa ihist=ihist.txt rcs=f reads=100000

...where the 'rcs=f' flag is for long-mate-pair libraries; otherwise it assumes fragment library orientation for pairs.
Brian Bushnell is offline   Reply With Quote
Old 07-23-2014, 08:08 PM   #6
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,232
Default

It is normal to find paired end reads in mate pair libraries. This is more pronounced as the size of fragments used for mate pair library prep increases. There are at least two sources of paired end reads in mate pair libraries. One is binding non-biotinylated fragments (fragments lacking junction) to capture beads which make library fragments. The other is result of random shearing true mate pair fragments. The latter is explained here:http://res.illumina.com/documents/pr...processing.pdf
nucacidhunter is offline   Reply With Quote
Old 07-24-2014, 11:04 AM   #7
hartmaier
Member
 
Location: Pittsburgh

Join Date: Dec 2012
Posts: 12
Default

First, thank you everyone for reviving this old post and getting some nice discussion here.

Quote:
Originally Posted by N311V View Post
Is the mean and standard deviation the estimated insert size from novoalign?
Yes, size selection was done via gel cutting for 3.5-5 kb so novoalign fits perfectly. I also have some larger insert size libraries and novoalign insert sizes always match with what is expected.

I re-ran bwa mem with 100k reads...same result...note the complete lack of R-F reads (correct Illumina mate pairs should be in this orientation).
Code:
[M::main_mem] read 50000 sequences (4242525 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (9024, 0, 0, 9093)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (10, 22, 43)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 109)
[M::mem_pestat] mean and std.dev: (27.99, 23.96)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 142)
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (10, 22, 45)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 115)
[M::mem_pestat] mean and std.dev: (29.08, 25.38)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 150)
[M::mem_process_seqs] Processed 50000 reads in 8.093 CPU sec, 8.111 real sec
[main] Version: 0.7.10-r789
[main] CMD: bwa mem -M /Volumes/GenomicsIII/Reference_Genomes/hg19.fa test.R1.fastq test.R2.fastq
[main] Real time: 188.272 sec; CPU: 14.480 sec
Quote:
It is normal to find paired end reads in mate pair libraries.
Yup, I know and I expect some. After plotting insert sizes from Novoalign, for example, this is a very minor fraction (for the new Nextera based Illumina protocol). I've also aligned the old v2 illumina MP data with novoalign which shows a much larger fraction of contaminating 'inward' facing reads (the new nextera kit is way better with regards to elimination of the contaminating inward reads). So, novoalign can detect both library types if they are there. Together, this tells me, at least in its current form, bwa mem cannot be used with large insert mate pair data (unless someone has any additional suggestions).

Quote:
Perhaps you should use a 3rd aligner as a tiebreaker. BBMap can plot the insert size distribution so you can see what's going on, though you'd need more like 100k reads to produce a nice smooth curve.
I've never run BBmap before but I got 'out of memory error' on my local machine with 16GB. I have access to more on a cluster but based on the above explanation I am convinced that bwa mem cannot work with large insert mate pair data.
hartmaier is offline   Reply With Quote
Old 07-24-2014, 11:38 AM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by hartmaier View Post
I've never run BBmap before but I got 'out of memory error' on my local machine with 16GB. I have access to more on a cluster but based on the above explanation I am convinced that bwa mem cannot work with large insert mate pair data.
Yes, BBMap requires ~20GB for human reference normally, though you can run it on a 16G node with this command:

bbmap.sh -Xmx12g in=reads.fq ref=ref.fa ihist=ihist.txt rcs=f reads=100000 usemodulo nodisk
Brian Bushnell is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:56 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO