Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • zee
    NGS specialist
    • Apr 2008
    • 249

    Running MAQ SNP/Indel detection/Assembly Tools on short aligners

    Heng Li's MAQ programs (maq.sourceforge.net) are a great set of tools for short read reference-guided assembly, SNP calling and indel detection. Improving on alignment may help the downstream process so we've adapted our aligners, novopaired and novoalign (www.novocraft.com) to use the maq binary .map format and hence be compatible with these tools.
    The novopaired format -> maq's map format with qualities is ready for use and is released under the same conditions as MAQ (because we've slightly modified the code for this conversion utility).
    We've tested it on some data for the following:

    I used an E. coli genome to test the conversion utility by simulating 500K pairs of length 36 using a modified version of maq to do simulations

    maqindel simulate -N 500000 -R 0.5 -r 0.01 -X 0.1 ecoli.fa raw-36.dat
    In english this means use a mutation rate of 1% with 50% of these being indels. The -X is a probability that an indel will get extended

    I nicknamed the E coli sequence "22". I also did assembly using "maq assemble..." using a mapping quality cutoff of 10.


    __Results__:
    Working
    maq mapstat
    maq mapview
    maq assemble
    maq indelsoa
    maq pileup
    maq cns2snp
    maq mapcheck
    maq cns2win
    maq cns2ref
    maq cns2fq
    maqview graphical viewer
    maq simucns
    maq simustat
    maq indelpe

    Some examples of SNP calling:

    maq cns2snp P0.01.novo.cns{cookie}
    22 175 G R 1 4 1.00 50 34
    22 243 C T 48 7 1.00 50 48
    22 382 C M 78 9 0.81 50 46
    22 383 G S 96 10 0.81 50 46
    22 384 T K 105 11 0.88 50 46
    22 388 C Y 91 10 0.81 50 58
    22 389 T Y 71 9 0.81 50 58
    22 390 G K 89 9 0.81 50 60
    22 391 C S 84 9 0.81 50 62
    22 392 G S 99 9 0.81 50 62
    22 393 T K 96 9 0.81 50 54
    22 394 G K 84 9 0.81 50 26
    22 395 T K 82 9 0.81 50 26
    22 397 G K 27 7 1.00 50 22
    22 398 C S 27 7 1.00 50 22

    Indel Detection:
    maq indelpe ecoli.bfa P0.01.novo.map
    22 382 . 2 1 0 1
    22 383 * 1 1 2 2
    22 789 + 0 1 4 0
    22 895 * 2 1 3 1
    22 1038 * 0 1 4 1
    22 1050 + 1 1 2 0
    22 1271 * 0 1 1 1
    22 1436 - 6 1 1 0
    22 1878 * 2 -1 1 4
    22 2126 - 5 -1 0 1



    maq mapcheck ecoli.bfa P0.01.novo.map
    Number of reference sequences: 1
    Length of reference sequences exlcuding gaps: 4639675
    Length of gaps in the reference sequences: 0
    Length of non-gap regions covered by reads: 4573482
    Length of 24bp unique regions of the reference: 4358409
    Reference nucleotide composition: A: 24.62%, C: 25.42%, G: 25.37%, T: 24.59%
    Reads nucleotide composition: A: 24.59%, C: 25.41%, G: 25.34%, T: 24.65%
    Average depth across all non-gap regions: 7.614
    Average depth across 24bp unique regions: 7.580
    ... (much more output here)


    __Comparison to Maq results for alignment stats:__

    Novopaired 1.04:

    Q SE_wrong / SE_tot PE_wrong / PE_tot 1 2 4 8 32
    0x 18 / 108 18 / 108 0 0 0 0 0
    1x 25 / 1420 25 / 1420 0 0 0 0 0
    2x 33 / 1974 33 / 1974 0 0 0 0 0
    3x 0 / 181 0 / 181 0 0 0 0 0
    4x 2 / 246 2 / 246 0 0 0 0 0
    5x 6252 / 977422 6252 / 977422 0 0 0 0 0
    6x 0 / 0 0 / 0 0 0 0 0 0
    7x 0 / 0 0 / 0 0 0 0 0 0
    8x 0 / 0 0 / 0 0 0 0 0 0
    9x 0 / 0 0 / 0 0 0 0 0 0

    MAQ:
    Q SE_wrong / SE_tot PE_wrong / PE_tot 1 2 4 8 32
    0x 14321 / 67065 13356 / 47158 4 1 5 4 0
    1x 23 / 23435 1306 / 47443 0 3 1 2 0
    2x 1 / 1498 1056 / 22924 0 0 0 0 0
    3x 36 / 39866 76 / 7585 2 2 0 0 0
    4x 149 / 38333 20 / 4633 2 4 0 0 0
    5x 38 / 103149 9 / 13089 0 0 0 0 0
    6x 119 / 116994 22 / 13942 0 8 1 0 0
    7x 2 / 17368 8 / 9486 0 2 0 0 0
    8x 23 / 240203 15 / 32304 0 4 0 0 0
    9x 1465 / 314990 309 / 764337 0 2 2 0 0


    __Comparison of SNP calling results at a 1% mutation rate:__

    Novopaired
    cnsQ #called #wrong err_rate Correct
    0x 150991 104401 6.914386e-01 46590
    1x 56297 20788 3.692559e-01 35509
    2x 72938 36757 5.039486e-01 36181
    3x 641904 45256 7.050276e-02 596648
    4x 1634151 16559 1.013309e-02 1617592
    5x 1371020 19601 1.429665e-02 1351419
    6x 472642 23583 4.989612e-02 449059
    7x 45020 8748 1.943136e-01 36272
    8x 11914 8540 7.168038e-01 3374
    9x 18228 17337 9.511192e-01 891

    MAQ
    cnsQ #called #wrong err_rate Correct
    0x 110166 103276 9.374580e-01 6890
    1x 12163 1124 9.241141e-02 11039
    2x 33644 881 2.618595e-02 32763
    3x 947851 2060 2.173337e-03 945791
    4x 1749680 605 3.457775e-04 1749075
    5x 1223791 418 3.415616e-04 1223373
    6x 367460 586 1.594731e-03 366874
    7x 26832 259 9.652654e-03 26573
    8x 2562 202 7.884465e-02 2360
    9x 956 432 4.518828e-01 524


    __General stats reporting:__

    Novopaired

    -- Total number of reads: 981401
    -- Sum of read length: 35330436
    -- Error rate: 0.022168
    -- Average read length: 36.00

    -- Mismatch statistics:

    -- MM 0 435877
    -- MM 1 370873
    -- MM 2 124211
    -- MM 3 39961
    -- MM 4 8628
    -- MM 5 1611
    -- MM 6 220
    -- MM 7 19
    -- MM 8 1

    -- Mapping quality statistics:

    -- MQ 00-09 108 108
    -- MQ 10-19 1420 1420
    -- MQ 20-29 1974 1974
    -- MQ 30-39 181 181
    -- MQ 40-49 246 246
    -- MQ 50-59 977472 977472

    MAQ:

    -- Total number of reads: 970668
    -- Sum of read length: 34944048
    -- Error rate: 0.035581
    -- Average read length: 36.00

    -- Mismatch statistics:

    -- MM 0 317448
    -- MM 1 306452
    -- MM 2 190590
    -- MM 3 95922
    -- MM 4 40503
    -- MM 5 14300
    -- MM 6 4107
    -- MM 7 1054
    -- MM 8 234
    -- MM 9 52
    -- MM 10 3
    -- MM 11 3

    -- Mapping quality statistics:

    -- MQ 00-09 68438 48405
    -- MQ 10-19 24597 48768
    -- MQ 20-29 1498 23034
    -- MQ 30-39 40401 7652
    -- MQ 40-49 40424 4872
    -- MQ 50-59 103606 13169
    -- MQ 60-69 117621 14046
    -- MQ 70-79 17424 9592
    -- MQ 80-89 240912 32522
    -- MQ 90-99 315747 768608


    The code will be made available to everybody according to the GPL bundled with the MAQ source code. If you're willing to try it out please contact me or sparks. The authors of MAQ will also receive a copy. It will also be put up for free download at www.novocraft.com
  • zee
    NGS specialist
    • Apr 2008
    • 249

    #2
    If anybody is interested in how this performs with real data on human alignments I will post some numbers shortly.

    Comment

    • lh3
      Senior Member
      • Feb 2008
      • 686

      #3
      A development version of maq, which supports novo2maq, is now available at maq website. The URL is:



      This version is almost identical to maq-0.6.8 and will finally evolve into maq-0.6.9 in the future.
      Last edited by lh3; 08-19-2008, 01:00 AM.

      Comment

      • zee
        NGS specialist
        • Apr 2008
        • 249

        #4
        I've put some results for simulated data and the novo2maq conversion here for anybody who is interested.
        However, some figures for real data should be more conclusive so these are just a preliminary.

        Comment

        • der_eiskern
          Member
          • Jul 2009
          • 46

          #5
          SE novoaligned data to MAQ pileup?

          Originally posted by zee View Post
          I've put some results for simulated data and the novo2maq conversion here for anybody who is interested.
          However, some figures for real data should be more conclusive so these are just a preliminary.
          So I aligned my Single-End illumina data with MAQ then took all the unmappable reads and aligned those with NovoAlign. Now I'm wondering if I should use the novo2maq function and combine the novo.map file with my maq.map file and then generate a consensus pile up from that

          OR

          if I should write my own python script to take the Novo Native output and add that to my pileup file from MAQ. (should be rather straightforward)

          My concern is how novo2maq transfers the information about indels within individual reads to MAQ and how MAQ outputs this indel data after aligning to the reference.

          I have not seen this concern raised elsewhere and am wondering what you all think about my situation. What are the benefits to using novo2maq and maq to create the composite pileup versus generating it myself?

          Thanks,
          der.

          Update: I ran "novo2maq" then "maq pileup" and you lose the indel data in the novoalign Native format output. it seem to just output everything just to relative to offset start position of the mapping. I think i might make my own conversion tool...
          Last edited by der_eiskern; 12-11-2009, 04:10 PM.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Pathogen Surveillance with Advanced Genomic Tools
            by seqadmin




            The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
            03-24-2025, 11:48 AM
          • seqadmin
            New Genomics Tools and Methods Shared at AGBT 2025
            by seqadmin


            This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

            The Headliner
            The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
            03-03-2025, 01:39 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 03-20-2025, 05:03 AM
          0 responses
          49 views
          0 reactions
          Last Post seqadmin  
          Started by seqadmin, 03-19-2025, 07:27 AM
          0 responses
          57 views
          0 reactions
          Last Post seqadmin  
          Started by seqadmin, 03-18-2025, 12:50 PM
          0 responses
          50 views
          0 reactions
          Last Post seqadmin  
          Started by seqadmin, 03-03-2025, 01:15 PM
          0 responses
          201 views
          0 reactions
          Last Post seqadmin  
          Working...