  • How to tell what is the 'best' assembly

    I'm assembling some bacterial genomes for which there are no close reference sequences available. Reads are 70-90 bp after cleaning and trimming with fastq-mcf, and coverage is about 100x.

    I'm using ABySS 1.3.7 and Ray 2.3.1 (others are planned) and use abyss-fac to get some metrics.

    How do I tell from these metrics which assembly is the 'best'? For example, with ABySS, k=31 and k=38 or k=39 all seem to be OK (the number before the FASTA file name is the k-mer value I used).
    The Ray stats suggest that the assemblies at k=31 and k=37, but also k=85, could be fine. Ray seems to scaffold better.

    I'm a bit lost on how to assess these metrics. I would appreciate any suggestions.

    Fenny

    Abyss run:
    n n:500 n:N50 min N80 N50 N20 E-size max sum name
    13926 1609 228 500 3470 9163 22002 15036 72141 8202130 k20/unknown-contigs.fa
    7381 899 120 505 7393 18774 43716 26836 107339 8500853 k21/unknown-contigs.fa
    4213 505 66 504 14988 39642 71511 49270 215678 8644984 k22/unknown-contigs.fa
    3341 429 52 505 18561 49387 92879 64247 292483 8710849 k23/unknown-contigs.fa
    2999 382 47 513 20123 54351 103578 68436 241664 8720215 k24/unknown-contigs.fa
    2373 298 37 578 29373 75091 133436 90121 309081 8765701 k25/unknown-contigs.fa
    2110 275 31 534 33500 82613 175193 104953 361349 8782006 k26/unknown-contigs.fa
    2043 275 35 517 38713 76396 143749 97326 309040 8774293 k27/unknown-contigs.fa
    1670 250 25 517 39917 93324 199540 130157 406193 8814319 k28/unknown-contigs.fa
    1514 248 27 519 42798 95525 190595 120848 362303 8824549 k29/unknown-contigs.fa
    1411 242 26 521 43444 99602 199159 126189 363974 8848575 k30/unknown-contigs.fa
    1283 231 21 523 43214 112193 302571 155113 414744 8869109 k31/unknown-contigs.fa
    1209 242 25 525 43617 95477 237887 137683 414746 8871337 k32/unknown-contigs.fa
    1183 240 24 527 42803 99289 237887 139646 414385 8866006 k33/unknown-contigs.fa
    1036 221 23 610 47707 112215 237887 144304 414279 8882417 k34/unknown-contigs.fa
    978 220 24 634 43232 99289 237887 141346 414279 8916246 k35/unknown-contigs.fa
    937 217 23 636 51036 112215 220762 155820 570462 8900832 k36/unknown-contigs.fa
    808 218 23 535 52818 112412 220814 149460 495193 8913342 k37/unknown-contigs.fa
    765 210 20 537 53472 118915 245992 173014 535873 8915544 k38/unknown-contigs.fa
    714 209 21 581 54738 118916 237887 160046 495233 8909314 k39/unknown-contigs.fa
    660 213 22 583 53763 117753 237887 154655 495223 8936400 k40/unknown-contigs.fa

    Ray run
    n n:500 n:N50 min N80 N50 N20 E-size max sum name
    178 121 15 770 62111 164112 363566 228332 760374 8782282 21/Scaffolds.fasta
    175 112 14 770 67656 176067 362164 233702 757332 8739663 23/Scaffolds.fasta
    177 113 15 683 72292 164206 389778 234641 757371 8696466 25/Scaffolds.fasta
    135 98 14 2178 91732 188176 389831 246831 757371 8782374 27/Scaffolds.fasta
    151 114 17 501 77399 153247 351688 214424 757371 8708387 29/Scaffolds.fasta
    145 110 13 509 91732 168523 833846 344303 1121085 8807879 31/Scaffolds.fasta
    149 114 13 509 92390 164110 716013 323769 1121085 8824925 33/Scaffolds.fasta
    143 111 13 509 91732 164135 833855 343321 1121085 8813200 35/Scaffolds.fasta
    144 109 13 509 91732 168523 833855 344524 1121085 8808399 37/Scaffolds.fasta
    147 112 13 509 91732 168523 833846 342379 1121085 8814542 39/Scaffolds.fasta
    151 115 14 509 91732 154479 716030 322057 1121085 8821523 41/Scaffolds.fasta
    146 111 13 509 92390 168523 716030 325379 1121085 8808165 43/Scaffolds.fasta
    148 112 13 509 91732 168523 833855 342560 1121085 8805456 45/Scaffolds.fasta
    148 112 13 509 91732 168523 833855 342554 1121085 8805822 47/Scaffolds.fasta
    149 113 13 509 92390 164110 716030 323939 1121085 8815467 49/Scaffolds.fasta
    149 113 13 509 91732 164110 833855 342692 1121085 8815976 51/Scaffolds.fasta
    142 109 12 509 92390 172838 833846 345337 1121085 8815343 53/Scaffolds.fasta
    144 109 13 509 91732 164110 833838 343961 1121085 8808473 55/Scaffolds.fasta
    146 111 13 509 91732 168523 833846 342407 1121085 8813776 57/Scaffolds.fasta
    145 110 13 509 92390 164110 833846 343310 1121085 8815638 59/Scaffolds.fasta
    145 110 13 509 91732 168523 833855 342783 1121085 8806084 61/Scaffolds.fasta
    143 108 13 509 91732 168523 833838 344559 1121085 8808222 63/Scaffolds.fasta
    148 112 13 509 92390 168523 716030 325211 1121085 8809623 65/Scaffolds.fasta
    148 113 13 509 91732 168523 833855 342435 1121085 8813376 67/Scaffolds.fasta
    140 107 12 509 95774 172838 833855 345732 1121085 8807698 69/Scaffolds.fasta
    148 113 13 509 92390 164110 716030 323923 1121085 8816039 71/Scaffolds.fasta
    142 109 12 509 95774 172838 833855 345409 1121085 8807219 73/Scaffolds.fasta
    144 109 13 509 91732 168523 833855 344530 1121085 8808208 75/Scaffolds.fasta
    144 111 12 509 91732 172838 833855 344861 1121085 8809073 77/Scaffolds.fasta
    146 111 13 509 92390 168523 833846 344770 1121085 8808362 79/Scaffolds.fasta
    145 110 13 509 92390 164110 833855 343305 1121085 8816026 81/Scaffolds.fasta
    147 112 13 509 91732 164110 833838 342556 1121085 8814178 83/Scaffolds.fasta
    143 108 13 509 91732 168523 833855 344556 1121085 8808270 85/Scaffolds.fasta
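
For readers unfamiliar with the columns above, here is a minimal sketch of how N50-style statistics and the E-size are commonly defined from a list of contig lengths. It is not abyss-fac's code: the function names are hypothetical, the 500 bp cutoff is assumed from the `n:500` column, and E-size is taken to be the sum of squared lengths divided by the total length. The real tool may differ in rounding and options.

```python
def nxx(lengths, fraction=0.5):
    """Length L such that contigs of length >= L hold at least
    `fraction` of the total bases (fraction=0.5 gives N50)."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0  # empty input


def e_size(lengths):
    """Expected size of the contig containing a randomly chosen base
    (assumed definition: sum of squares divided by the sum)."""
    total = sum(lengths)
    return sum(l * l for l in lengths) / total if total else 0.0


def summarize(lengths, min_len=500):
    """Collect the statistics on contigs of at least `min_len` bp
    (the cutoff is an assumption, see above)."""
    kept = [l for l in lengths if l >= min_len]
    return {
        "n": len(lengths),
        "n_kept": len(kept),
        "N80": nxx(kept, 0.8),
        "N50": nxx(kept, 0.5),
        "N20": nxx(kept, 0.2),
        "E-size": e_size(kept),
        "max": max(kept, default=0),
        "sum": sum(kept),
    }
```

Calling `summarize()` on the contig lengths of each FASTA file gives one row per assembly. All of these numbers measure contiguity only; none of them says whether the contigs are correct.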

  • #2
    "I'm assembling some bacterial genomes for which there are no close reference sequences available"

    In the absence of reference data to align your contigs against (plasmids, ESTs, mRNA), you could try finding ORFs and running BLAST on them to see what you have. Generating a dotplot of your assembly against itself will also tell you whether the assembly is highly repetitive. The "best" assembly is generally understood to be the correct one, but that is difficult to tell up front. The largest contig or the largest N50 is also a poor indicator of assembly quality.
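
As an illustration of the ORF-finding step, here is a naive six-frame scan, assuming the standard stop codons and ATG as the only start codon. The function names and the 300 nt minimum length are hypothetical choices. It is a sketch only: a dedicated prokaryotic gene finder would handle alternative start codons and partial genes at contig ends.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string; non-ACGT characters are kept as-is."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """Yield (strand, frame, start, end, orf) for every ATG..stop ORF of at
    least `min_len` nucleotides. Coordinates are 0-based, end-exclusive, and
    refer to the strand that was scanned (not converted back for '-')."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i          # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    end = i + 3        # include the stop codon
                    if end - start >= min_len:
                        yield strand, frame, start, end, s[start:end]
                    start = None       # look for the next ORF in this frame
```

The yielded sequences can be written to a FASTA file and searched with BLAST. ORFs with good hits over their full length suggest the underlying contig is at least locally intact.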

    Ideally, an assembly should have a minimum number of alternative paths: a dip in coverage, together with a large number of reads that also map to a different contig, may suggest that a repetitive region was improperly collapsed, creating two paths.
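
One way to look for such coverage anomalies is sketched below, assuming you have already mapped the reads back to the assembly and extracted per-base depth for a contig as a list of integers. The function name, the window size, and the 0.5x/1.5x thresholds are hypothetical and would need tuning for real data.

```python
from statistics import median


def flag_coverage_windows(depths, window=1000, low=0.5, high=1.5):
    """Return (start, end, mean_depth) for each non-overlapping window whose
    mean depth is below low * median or above high * median of the contig."""
    if not depths:
        return []
    baseline = median(depths)
    if baseline <= 0:
        return []  # no usable baseline for this contig
    flagged = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        mean = sum(chunk) / len(chunk)
        if mean < low * baseline or mean > high * baseline:
            flagged.append((start, start + len(chunk), mean))
    return flagged
```

Flagged windows are only candidates for manual inspection: coverage also varies with GC content and near contig ends, so a deviation alone does not prove a misassembly.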
    Last edited by ctseto; 04-07-2014, 01:41 PM.



    • #3
      Consult the lab people. They may have additional information about the bacteria (known genes, transposable elements, a restriction map, PCR fragments). There is no golden rule for validating a de novo assembly with computational methods alone.
