Heng Li's MAQ programs (maq.sourceforge.net) are a great set of tools for short read reference-guided assembly, SNP calling and indel detection. Improving on alignment may help the downstream process so we've adapted our aligners, novopaired and novoalign (www.novocraft.com) to use the maq binary .map format and hence be compatible with these tools.
The novopaired format -> maq's map format with qualities is ready for use and is released under the same conditions as MAQ (because we've slightly modified the code for this conversion utility).
We've tested it on some data for the following:
I used an E. coli genome to test the conversion utility by simulating 500K pairs of length 36 using a modified version of maq to do simulations
maqindel simulate -N 500000 -R 0.5 -r 0.01 -X 0.1 ecoli.fa raw-36.dat
In english this means use a mutation rate of 1% with 50% of these being indels. The -X is a probability that an indel will get extended
I nicknamed the E coli sequence "22". I also did assembly using "maq assemble..." using a mapping quality cutoff of 10.
__Results__:
Working
maq mapstat
maq mapview
maq assemble
maq indelsoa
maq pileup
maq cns2snp
maq mapcheck
maq cns2win
maq cns2ref
maq cns2fq
maqview graphical viewer
maq simucns
maq simustat
maq indelpe
Some examples of SNP calling:
maq cns2snp P0.01.novo.cns{cookie}
22 175 G R 1 4 1.00 50 34
22 243 C T 48 7 1.00 50 48
22 382 C M 78 9 0.81 50 46
22 383 G S 96 10 0.81 50 46
22 384 T K 105 11 0.88 50 46
22 388 C Y 91 10 0.81 50 58
22 389 T Y 71 9 0.81 50 58
22 390 G K 89 9 0.81 50 60
22 391 C S 84 9 0.81 50 62
22 392 G S 99 9 0.81 50 62
22 393 T K 96 9 0.81 50 54
22 394 G K 84 9 0.81 50 26
22 395 T K 82 9 0.81 50 26
22 397 G K 27 7 1.00 50 22
22 398 C S 27 7 1.00 50 22
Indel Detection:
maq indelpe ecoli.bfa P0.01.novo.map
22 382 . 2 1 0 1
22 383 * 1 1 2 2
22 789 + 0 1 4 0
22 895 * 2 1 3 1
22 1038 * 0 1 4 1
22 1050 + 1 1 2 0
22 1271 * 0 1 1 1
22 1436 - 6 1 1 0
22 1878 * 2 -1 1 4
22 2126 - 5 -1 0 1
maq mapcheck ecoli.bfa P0.01.novo.map
Number of reference sequences: 1
Length of reference sequences exlcuding gaps: 4639675
Length of gaps in the reference sequences: 0
Length of non-gap regions covered by reads: 4573482
Length of 24bp unique regions of the reference: 4358409
Reference nucleotide composition: A: 24.62%, C: 25.42%, G: 25.37%, T: 24.59%
Reads nucleotide composition: A: 24.59%, C: 25.41%, G: 25.34%, T: 24.65%
Average depth across all non-gap regions: 7.614
Average depth across 24bp unique regions: 7.580
... (much more output here)
__Comparison to Maq results for alignment stats:__
Novopaired 1.04:
Q SE_wrong / SE_tot PE_wrong / PE_tot 1 2 4 8 32
0x 18 / 108 18 / 108 0 0 0 0 0
1x 25 / 1420 25 / 1420 0 0 0 0 0
2x 33 / 1974 33 / 1974 0 0 0 0 0
3x 0 / 181 0 / 181 0 0 0 0 0
4x 2 / 246 2 / 246 0 0 0 0 0
5x 6252 / 977422 6252 / 977422 0 0 0 0 0
6x 0 / 0 0 / 0 0 0 0 0 0
7x 0 / 0 0 / 0 0 0 0 0 0
8x 0 / 0 0 / 0 0 0 0 0 0
9x 0 / 0 0 / 0 0 0 0 0 0
MAQ:
Q SE_wrong / SE_tot PE_wrong / PE_tot 1 2 4 8 32
0x 14321 / 67065 13356 / 47158 4 1 5 4 0
1x 23 / 23435 1306 / 47443 0 3 1 2 0
2x 1 / 1498 1056 / 22924 0 0 0 0 0
3x 36 / 39866 76 / 7585 2 2 0 0 0
4x 149 / 38333 20 / 4633 2 4 0 0 0
5x 38 / 103149 9 / 13089 0 0 0 0 0
6x 119 / 116994 22 / 13942 0 8 1 0 0
7x 2 / 17368 8 / 9486 0 2 0 0 0
8x 23 / 240203 15 / 32304 0 4 0 0 0
9x 1465 / 314990 309 / 764337 0 2 2 0 0
__Comparison of SNP calling results at a 1% mutation rate:__
Novopaired
cnsQ #called #wrong err_rate Correct
0x 150991 104401 6.914386e-01 46590
1x 56297 20788 3.692559e-01 35509
2x 72938 36757 5.039486e-01 36181
3x 641904 45256 7.050276e-02 596648
4x 1634151 16559 1.013309e-02 1617592
5x 1371020 19601 1.429665e-02 1351419
6x 472642 23583 4.989612e-02 449059
7x 45020 8748 1.943136e-01 36272
8x 11914 8540 7.168038e-01 3374
9x 18228 17337 9.511192e-01 891
MAQ
cnsQ #called #wrong err_rate Correct
0x 110166 103276 9.374580e-01 6890
1x 12163 1124 9.241141e-02 11039
2x 33644 881 2.618595e-02 32763
3x 947851 2060 2.173337e-03 945791
4x 1749680 605 3.457775e-04 1749075
5x 1223791 418 3.415616e-04 1223373
6x 367460 586 1.594731e-03 366874
7x 26832 259 9.652654e-03 26573
8x 2562 202 7.884465e-02 2360
9x 956 432 4.518828e-01 524
__General stats reporting:__
Novopaired
-- Total number of reads: 981401
-- Sum of read length: 35330436
-- Error rate: 0.022168
-- Average read length: 36.00
-- Mismatch statistics:
-- MM 0 435877
-- MM 1 370873
-- MM 2 124211
-- MM 3 39961
-- MM 4 8628
-- MM 5 1611
-- MM 6 220
-- MM 7 19
-- MM 8 1
-- Mapping quality statistics:
-- MQ 00-09 108 108
-- MQ 10-19 1420 1420
-- MQ 20-29 1974 1974
-- MQ 30-39 181 181
-- MQ 40-49 246 246
-- MQ 50-59 977472 977472
MAQ:
-- Total number of reads: 970668
-- Sum of read length: 34944048
-- Error rate: 0.035581
-- Average read length: 36.00
-- Mismatch statistics:
-- MM 0 317448
-- MM 1 306452
-- MM 2 190590
-- MM 3 95922
-- MM 4 40503
-- MM 5 14300
-- MM 6 4107
-- MM 7 1054
-- MM 8 234
-- MM 9 52
-- MM 10 3
-- MM 11 3
-- Mapping quality statistics:
-- MQ 00-09 68438 48405
-- MQ 10-19 24597 48768
-- MQ 20-29 1498 23034
-- MQ 30-39 40401 7652
-- MQ 40-49 40424 4872
-- MQ 50-59 103606 13169
-- MQ 60-69 117621 14046
-- MQ 70-79 17424 9592
-- MQ 80-89 240912 32522
-- MQ 90-99 315747 768608
The code will be made available to everybody according to the GPL bundled with the MAQ source code. If you're willing to try it out please contact me or sparks. The authors of MAQ will also receive a copy. It will also be put up for free download at www.novocraft.com
The novopaired format -> maq's map format with qualities is ready for use and is released under the same conditions as MAQ (because we've slightly modified the code for this conversion utility).
We've tested it on some data for the following:
I used an E. coli genome to test the conversion utility by simulating 500K pairs of length 36 using a modified version of maq to do simulations
maqindel simulate -N 500000 -R 0.5 -r 0.01 -X 0.1 ecoli.fa raw-36.dat
In english this means use a mutation rate of 1% with 50% of these being indels. The -X is a probability that an indel will get extended
I nicknamed the E coli sequence "22". I also did assembly using "maq assemble..." using a mapping quality cutoff of 10.
__Results__:
Working
maq mapstat
maq mapview
maq assemble
maq indelsoa
maq pileup
maq cns2snp
maq mapcheck
maq cns2win
maq cns2ref
maq cns2fq
maqview graphical viewer
maq simucns
maq simustat
maq indelpe
Some examples of SNP calling:
maq cns2snp P0.01.novo.cns{cookie}
22 175 G R 1 4 1.00 50 34
22 243 C T 48 7 1.00 50 48
22 382 C M 78 9 0.81 50 46
22 383 G S 96 10 0.81 50 46
22 384 T K 105 11 0.88 50 46
22 388 C Y 91 10 0.81 50 58
22 389 T Y 71 9 0.81 50 58
22 390 G K 89 9 0.81 50 60
22 391 C S 84 9 0.81 50 62
22 392 G S 99 9 0.81 50 62
22 393 T K 96 9 0.81 50 54
22 394 G K 84 9 0.81 50 26
22 395 T K 82 9 0.81 50 26
22 397 G K 27 7 1.00 50 22
22 398 C S 27 7 1.00 50 22
Indel Detection:
maq indelpe ecoli.bfa P0.01.novo.map
22 382 . 2 1 0 1
22 383 * 1 1 2 2
22 789 + 0 1 4 0
22 895 * 2 1 3 1
22 1038 * 0 1 4 1
22 1050 + 1 1 2 0
22 1271 * 0 1 1 1
22 1436 - 6 1 1 0
22 1878 * 2 -1 1 4
22 2126 - 5 -1 0 1
maq mapcheck ecoli.bfa P0.01.novo.map
Number of reference sequences: 1
Length of reference sequences exlcuding gaps: 4639675
Length of gaps in the reference sequences: 0
Length of non-gap regions covered by reads: 4573482
Length of 24bp unique regions of the reference: 4358409
Reference nucleotide composition: A: 24.62%, C: 25.42%, G: 25.37%, T: 24.59%
Reads nucleotide composition: A: 24.59%, C: 25.41%, G: 25.34%, T: 24.65%
Average depth across all non-gap regions: 7.614
Average depth across 24bp unique regions: 7.580
... (much more output here)
__Comparison to Maq results for alignment stats:__
Novopaired 1.04:
Q SE_wrong / SE_tot PE_wrong / PE_tot 1 2 4 8 32
0x 18 / 108 18 / 108 0 0 0 0 0
1x 25 / 1420 25 / 1420 0 0 0 0 0
2x 33 / 1974 33 / 1974 0 0 0 0 0
3x 0 / 181 0 / 181 0 0 0 0 0
4x 2 / 246 2 / 246 0 0 0 0 0
5x 6252 / 977422 6252 / 977422 0 0 0 0 0
6x 0 / 0 0 / 0 0 0 0 0 0
7x 0 / 0 0 / 0 0 0 0 0 0
8x 0 / 0 0 / 0 0 0 0 0 0
9x 0 / 0 0 / 0 0 0 0 0 0
MAQ:
Q SE_wrong / SE_tot PE_wrong / PE_tot 1 2 4 8 32
0x 14321 / 67065 13356 / 47158 4 1 5 4 0
1x 23 / 23435 1306 / 47443 0 3 1 2 0
2x 1 / 1498 1056 / 22924 0 0 0 0 0
3x 36 / 39866 76 / 7585 2 2 0 0 0
4x 149 / 38333 20 / 4633 2 4 0 0 0
5x 38 / 103149 9 / 13089 0 0 0 0 0
6x 119 / 116994 22 / 13942 0 8 1 0 0
7x 2 / 17368 8 / 9486 0 2 0 0 0
8x 23 / 240203 15 / 32304 0 4 0 0 0
9x 1465 / 314990 309 / 764337 0 2 2 0 0
__Comparison of SNP calling results at a 1% mutation rate:__
Novopaired
cnsQ #called #wrong err_rate Correct
0x 150991 104401 6.914386e-01 46590
1x 56297 20788 3.692559e-01 35509
2x 72938 36757 5.039486e-01 36181
3x 641904 45256 7.050276e-02 596648
4x 1634151 16559 1.013309e-02 1617592
5x 1371020 19601 1.429665e-02 1351419
6x 472642 23583 4.989612e-02 449059
7x 45020 8748 1.943136e-01 36272
8x 11914 8540 7.168038e-01 3374
9x 18228 17337 9.511192e-01 891
MAQ
cnsQ #called #wrong err_rate Correct
0x 110166 103276 9.374580e-01 6890
1x 12163 1124 9.241141e-02 11039
2x 33644 881 2.618595e-02 32763
3x 947851 2060 2.173337e-03 945791
4x 1749680 605 3.457775e-04 1749075
5x 1223791 418 3.415616e-04 1223373
6x 367460 586 1.594731e-03 366874
7x 26832 259 9.652654e-03 26573
8x 2562 202 7.884465e-02 2360
9x 956 432 4.518828e-01 524
__General stats reporting:__
Novopaired
-- Total number of reads: 981401
-- Sum of read length: 35330436
-- Error rate: 0.022168
-- Average read length: 36.00
-- Mismatch statistics:
-- MM 0 435877
-- MM 1 370873
-- MM 2 124211
-- MM 3 39961
-- MM 4 8628
-- MM 5 1611
-- MM 6 220
-- MM 7 19
-- MM 8 1
-- Mapping quality statistics:
-- MQ 00-09 108 108
-- MQ 10-19 1420 1420
-- MQ 20-29 1974 1974
-- MQ 30-39 181 181
-- MQ 40-49 246 246
-- MQ 50-59 977472 977472
MAQ:
-- Total number of reads: 970668
-- Sum of read length: 34944048
-- Error rate: 0.035581
-- Average read length: 36.00
-- Mismatch statistics:
-- MM 0 317448
-- MM 1 306452
-- MM 2 190590
-- MM 3 95922
-- MM 4 40503
-- MM 5 14300
-- MM 6 4107
-- MM 7 1054
-- MM 8 234
-- MM 9 52
-- MM 10 3
-- MM 11 3
-- Mapping quality statistics:
-- MQ 00-09 68438 48405
-- MQ 10-19 24597 48768
-- MQ 20-29 1498 23034
-- MQ 30-39 40401 7652
-- MQ 40-49 40424 4872
-- MQ 50-59 103606 13169
-- MQ 60-69 117621 14046
-- MQ 70-79 17424 9592
-- MQ 80-89 240912 32522
-- MQ 90-99 315747 768608
The code will be made available to everybody according to the GPL bundled with the MAQ source code. If you're willing to try it out please contact me or sparks. The authors of MAQ will also receive a copy. It will also be put up for free download at www.novocraft.com
Comment