SEQanswers

Old 09-03-2009, 06:05 PM   #1
mhmckm
Junior Member
 
Location: korea

Join Date: Aug 2009
Posts: 1
Parallelizing MAQ & BWA

I work in bioinformatics.
These days we are very interested in genome sequencing
and are considering several mapping programs, especially MAQ and BWA.

We are trying to run MAQ and BWA in parallel
and have some questions.

1. How to divide & distribute the data?
We have two possible options for this.
- divide the reads.
- divide the reference. (1 chromosome per machine)
For example, ABI Corona recommends dividing the genome.
Which option do MAQ and BWA recommend?

2. Does MAQ process color-space data efficiently?
We cannot find any documentation on how to process CS data with MAQ.
Could you point us to a tutorial on processing CS data?

We are looking forward to your valuable advice.
Old 09-03-2009, 10:58 PM   #2
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by mhmckm View Post
I work in bioinformatics.
These days we are very interested in genome sequencing
and are considering several mapping programs, especially MAQ and BWA.

We are trying to run MAQ and BWA in parallel
and have some questions.

1. How to divide & distribute the data?
We have two possible options for this.
- divide the reads.
- divide the reference. (1 chromosome per machine)
For example, ABI Corona recommends dividing the genome.
Which option do MAQ and BWA recommend?
Divide the reads, definitely. With each slide of a SOLiD machine generating 200M-500M paired-end reads, this is a must. As for the reference (say, human), given enough memory (~4 GB) you do not need to divide it for BWA or MAQ. Other programs may require it, depending on their RAM requirements (please let me know how much RAM you have available).

Quote:
2. Does MAQ process color-space data efficiently?
We cannot find any documentation on how to process CS data with MAQ.
Could you point us to a tutorial on processing CS data?

We are looking forward to your valuable advice.
MAQ and BWA are written by the same author, so they work similarly. BWA can find small indels (1-3 bp) whereas MAQ cannot. Both are meant for low-error data (<2% error rates), so you may either tune them for the intrinsically higher error rate of SOLiD data (use the -n option), which will dramatically affect the running times, or, for longer indels and higher sensitivity, consider BFAST, which is admittedly my own software.

What genomes do you plan to sequence (human)?
Old 09-04-2009, 12:50 AM   #3
henry
Member
 
Location: china

Join Date: Sep 2009
Posts: 36

Divide the reads; after mapping to the reference, MAQ can merge the different batches of reads together.
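The map-then-merge flow described above could be sketched as a small Python helper that only builds the command lines. This is a rough sketch, not a tested pipeline: the `maq map` and `maq mapmerge` argument order is as I recall it from the MAQ manual and should be checked against `maq` usage output, and every file name here is hypothetical.

```python
def maq_batch_commands(reference_bfa, read_batches, merged_map):
    """Return shell command lines: one `maq map` per read batch, then one `maq mapmerge`.

    Assumes the reference and reads are already in MAQ's binary formats
    (.bfa / .bfq). File names are placeholders chosen for illustration.
    """
    map_files = [f"{batch}.map" for batch in read_batches]
    commands = [
        f"maq map {map_file} {reference_bfa} {batch}"
        for map_file, batch in zip(map_files, read_batches)
    ]
    # The per-batch maps are combined into a single alignment file at the end.
    commands.append(f"maq mapmerge {merged_map} {' '.join(map_files)}")
    return commands


for line in maq_batch_commands("ref.bfa", ["reads.1.bfq", "reads.2.bfq"], "all.map"):
    print(line)
```

The `maq map` lines are independent of each other, so each can go to a different machine; only the final `maq mapmerge` has to wait for all of them.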

As for the second question, I'm also searching for open-source tools that can process ABI SOLiD data. BFAST may be worth a try if you are concerned about higher sensitivity.

Best
Old 09-04-2009, 12:54 AM   #4
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 258

Quote:
Originally Posted by mhmckm View Post
We are trying to run MAQ and BWA in parallel
and have some questions.

1. How to divide & distribute the data?
We have two possible options for this.
- divide the reads.
- divide the reference. (1 chromosome per machine)
For example, ABI Corona recommends dividing the genome.
Which option do MAQ and BWA recommend?
Hi, if you are using BWA and speed is your concern, you may try aligning with the "-t" option, which enables multithreaded analysis (i.e. you don't need to split the data).
About splitting the data: one should ask the algorithm's developer how it scales with the size of the genome and with the size of the input (and the read length). Suppose it scales as O(log(n)) with the genome and O(n) with the input; then I think you should split the input data, but in the end you should test both, just to see which is better (and results may differ for differently sized genomes).
On the other hand, splitting the genome in a "persistent" way is less flexible, as you will always need a certain machine for a certain chromosome (unless you redo the split each time).
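To make the scaling argument concrete, here is a toy cost model in Python. It simply takes the supposition above at face value (work linear in the number of reads, logarithmic in the genome size); the read count, genome size, and machine count are made-up illustrative numbers, and no real aligner is claimed to follow this model exactly.

```python
import math


def per_machine_cost(n_reads, genome_size, machines, split):
    """Relative work per machine under the assumed O(reads) x O(log genome) model."""
    if split == "reads":
        # Every machine sees the whole genome but only a share of the reads.
        return (n_reads / machines) * math.log2(genome_size)
    if split == "reference":
        # Every machine sees all the reads but only a share of the genome.
        return n_reads * math.log2(genome_size / machines)
    raise ValueError("split must be 'reads' or 'reference'")


reads = 200_000_000        # illustrative: one SOLiD slide
genome = 3_000_000_000     # illustrative: roughly human-sized
machines = 24

single_machine = reads * math.log2(genome)
for strategy in ("reads", "reference"):
    cost = per_machine_cost(reads, genome, machines, strategy)
    print(f"split {strategy}: speedup ~{single_machine / cost:.1f}x")
```

Under these assumptions, splitting the reads across 24 machines gives close to a 24x speedup, while splitting the reference gives only a little over 1x, because the logarithm barely shrinks. Real behavior depends on the aligner, so testing both remains the safer course.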
HTH
d
Old 09-04-2009, 12:55 AM   #5
henry
Member
 
Location: china

Join Date: Sep 2009
Posts: 36

By the way, I have also used ABI Corona Lite. In my experience it was a disaster; it is not a good tool and should be your last option.
Old 09-04-2009, 03:20 PM   #6
der_eiskern
Member
 
Location: California

Join Date: Jul 2009
Posts: 46

I get the impression that Corona Lite and MAQ can usually map the same amount of sequence for you, but MAQ is easier to use. There are differences in the number of reads mapped independently to a given location, though. I still haven't figured out how Corona Lite really works compared to BFAST and MAQ.

Changing -n with MAQ's map function doesn't affect the total percentage of mappable sequence on the data that I've run when color space is used.
Old 09-23-2009, 03:56 PM   #7
jperin
Member
 
Location: Philadelphia

Join Date: Feb 2009
Posts: 10

I'm sure you've found a solution by now, but we split FASTQ files either with MAQ's built-in tool, which makes splitting your reads quite easy, or by simply running the split command in Linux and creating batch jobs for BWA. With BWA it's also nice to use the -t multithreading option. I tend to stick to about 2 million reads per input file. Splitting the reference is not a good idea and may give somewhat inaccurate results, IMO. Merging the results is also quite easy. In MAQ you can do this automatically with merge; with BWA it's best to use samtools to merge your results once you've sorted your BAM files.
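For anyone who prefers a script to the Linux split command, the read-splitting step might look roughly like the Python sketch below. It assumes a plain FASTQ file with exactly four lines per record (no wrapped sequence lines); the function name and the output naming scheme are my own inventions, not part of MAQ or BWA.

```python
import itertools
from pathlib import Path


def split_fastq(fastq_path, out_dir, reads_per_chunk=2_000_000):
    """Split a 4-line-per-record FASTQ file into chunk files; return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(fastq_path).stem
    chunk_paths = []
    with open(fastq_path) as handle:
        for index in itertools.count():
            # Take the next block of lines without loading the whole file.
            lines = itertools.islice(handle, 4 * reads_per_chunk)
            first = next(lines, None)
            if first is None:
                break  # input exhausted
            chunk_path = out_dir / f"{stem}.part{index:04d}.fastq"
            with open(chunk_path, "w") as out:
                out.write(first)
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Calling `split_fastq("reads.fastq", "chunks")` would write `chunks/reads.part0000.fastq`, `chunks/reads.part0001.fastq`, and so on, each holding at most 2 million reads, ready to be submitted as separate batch jobs.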
Old 01-02-2010, 08:01 PM   #8
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203

I have a question as well; admittedly it's more BWA-related.
For BWA, how does the -t (multithread) option differ from splitting the reads FASTA file itself and running each part on a single core?
I.e., should I
a) split the reads into 20 parts and run BWA single-threaded on a PBS cluster,
or
b) split the reads into 5 parts, one per node, and run BWA with -t 4 on the same PBS cluster?
Old 01-03-2010, 02:44 AM   #9
jperin
Member
 
Location: Philadelphia

Join Date: Feb 2009
Posts: 10

I've found that a combination of both works best. Set -t to the number of cores on your node, and split your input reads into chunks of about 2 million reads per file, more if you have plenty of memory.
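A combined setup like this could be sketched in Python as a generator of per-chunk command lines. This assumes the single-end `bwa aln` / `bwa samse` workflow and `samtools merge`; the chunk and reference file names are hypothetical, and the SAM-to-sorted-BAM conversion is left out because its syntax differs between samtools versions.

```python
def bwa_job_commands(reference, chunks, threads):
    """Return, for each read chunk, the shell command lines that align it with BWA."""
    jobs = []
    for chunk in chunks:
        sai = f"{chunk}.sai"
        sam = f"{chunk}.sam"
        jobs.append([
            # Multithreaded alignment of one chunk against the whole reference.
            f"bwa aln -t {threads} {reference} {chunk} > {sai}",
            # Convert the alignment coordinates to SAM (single-end reads).
            f"bwa samse {reference} {sai} {chunk} > {sam}",
        ])
    return jobs


def samtools_merge_command(merged_bam, sorted_bams):
    """Return the command line that merges already-sorted per-chunk BAM files."""
    return f"samtools merge {merged_bam} {' '.join(sorted_bams)}"


# Option (b) from the question: 5 chunks, each aligned with -t 4 on its own node.
chunks = [f"reads.part{i:04d}.fastq" for i in range(5)]
for job in bwa_job_commands("ref.fa", chunks, threads=4):
    print(" && ".join(job))
print(samtools_merge_command("all.bam", [f"{c}.sorted.bam" for c in chunks]))
```

Each inner list is one batch job, so the printed lines can be pasted into separate PBS scripts; changing `threads` and the number of chunks switches between the two options in the question.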
Old 01-03-2010, 02:50 AM   #10
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203

Quote:
Originally Posted by jperin View Post
I've found that a combination of both works best. Set -t to the number of cores on your node, and split your input reads into chunks of about 2 million reads per file, more if you have plenty of memory.
I see.
If I have 2 GB of RAM per core, is that enough for 2 million reads?
Old 01-03-2010, 03:04 AM   #11
jperin
Member
 
Location: Philadelphia

Join Date: Feb 2009
Posts: 10

Yes, but I probably wouldn't run more than 2-4 threads on a node with 2 GB.
Old 01-11-2010, 11:36 PM   #12
geschickten
Member
 
Location: India

Join Date: Jul 2009
Posts: 31

Well, all I can say is that if you want to use MAQ on many cores or on a cluster (built from commodity machines), please try our version at http://www.geschickten.com/PaCGeE.html

Your suggestions and feedback are always appreciated. Thank you.