  • Strategy for genome assembly

    I am looking for some guidance on assembling a small bacterial genome. This is not working out to be as straightforward as I thought it would be.

    The expected genome size is approximately 1 Mb with a low (~30%) GC content (could this cause emulsion PCR bias, creating gaps?). This is based on 3 different published reference genomes. I have completed two 454 sequencing runs for one isolate: Run 1 produced 28 Mb with a 400 bp average read length, and Run 2 produced 42 Mb with a 445 bp average read length. I thought this would be more than enough to get a complete assembly, but that has not been the case!
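    As a back-of-envelope sanity check, expected coverage is simply total sequenced bases divided by genome size. A minimal sketch using the run totals quoted above (assuming all bases are usable and the 1 Mb genome estimate holds):

```python
# Rough coverage estimate from the two 454 runs described in the post.
genome_size = 1_000_000   # ~1 Mb expected genome
run1_bases = 28_000_000   # Run 1: 28 Mb, 400 bp average reads
run2_bases = 42_000_000   # Run 2: 42 Mb, 445 bp average reads

coverage = (run1_bases + run2_bases) / genome_size
print(f"Approximate coverage: {coverage:.0f}x")  # ~70x
```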

    So far I have only done analysis with the 454 software. I am getting really different results depending on how I go about it.

    Using GSReferenceMapper with a single reference genome, the best I can get is 42 large contigs (56 total contigs), average contig size 21,352 bp, longest contig 104,874 bp, 899,615 total bases. A few of these contigs contain very few reads.

    By comparison, using GSReferenceMapper with all 3 reference genomes I get 184 large contigs (252 total contigs), average contig size 2,153 bp, longest contig 31,324 bp, 416,772 total bases. I would have thought a better mapping would be achieved with more reference information. There is, however, a large inversion (covering almost half the genome) in one of the reference isolates.

    The de novo assembler gives me 91 large contigs (199 total contigs), average contig size 10,432 bp, longest contig 78,810 bp, 971,829 total bases. I am inclined to trust these contigs because they are not based on variable reference genomes.

    How do I go about deciding the best way to assemble this data? Software recommendations? I am limited to Windows for now, so I understand my choices are limited. Any help is appreciated.

  • #2
    With the 454 coverage you have (around 70x), Newbler is your best bet. You may even need to downsample, as 30-50x is usually enough. You could use the Mauve Contig Mover to order the contigs against a (single) reference genome, provided there are no large rearrangements (so don't use the reference genome with the inversion).

    Mapping to multiple references with gsMapper is of no use; you only confuse the mapper, as many reads can be placed in multiple locations.

    An alternative would be Celera, which seems to work well with 454 data but is somewhat more difficult to use.

    Comment


    • #3
      In addition to Newbler, with which I would probably go the de novo route, I can recommend MIRA. You might need to down-sample a bit, but MIRA should do a really good job. The only reason I am not using it all the time is that it has problems with larger genomes, but a bacterial genome is perfect for it. The support on the mailing lists is excellent, so there is always help to be had. It does not do scaffolding, though, so you would need a separate scaffolding program for that.

      In my fungal project, MIRA produced the longest contigs that I also felt I could trust. Celera also did well, but I spotted a few mis-assemblies, so I dropped that one.

      Comment


      • #4
        Thank you for the suggestions. Where is downsampling done within Newbler? Do you adjust it with the expected-coverage setting? I have been adjusting this depending on the input files I have been using.

        After playing around with Mauve I found that my de novo contigs from Newbler fit two of my reference genomes rather well (but not the one with the large rearrangement). Both gave me 11 locally collinear blocks (LCBs). Using Mauve to align the reference-mapper output (42 large, 56 total contigs) gives me 1 LCB. However, there are gaps in the reference genome at the junctions of most of the contigs in my assembled sequence.

        What are my options for confirming contig order and closing these gaps? It would be a lot of work (and money) to do it by PCR, but I don't think more coverage would be a cost-effective way to do it either.

        Comment


        • #5
          Originally posted by mbseq View Post
          Where is downsampling done within Newbler.
          You can use the sfffile command with the '-pick' option, giving the number of bases you want to try; it will randomly select reads up to that number of bases.
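          The random selection sfffile performs can be sketched in plain Python. This is a conceptual illustration only, not the actual sfffile implementation; the read data is made up:

```python
import random

def downsample(reads, target_bases, seed=0):
    """Randomly pick reads until roughly target_bases total bases are reached."""
    rng = random.Random(seed)   # fixed seed for a reproducible subset
    shuffled = list(reads)
    rng.shuffle(shuffled)
    picked, total = [], 0
    for read in shuffled:
        if total >= target_bases:
            break
        picked.append(read)
        total += len(read)
    return picked

# Toy example: 10 reads of 400 bases each, target 2,000 bases -> 5 reads kept.
reads = ["A" * 400 for _ in range(10)]
subset = downsample(reads, 2_000)
print(len(subset), sum(len(r) for r in subset))  # 5 2000
```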

          Originally posted by mbseq View Post
          What are my options for confirming contig order and closing these gaps? This would be a lot of work (and $) to do it by PCR. But I also don't think more coverage would be a cost-effective way to do it either.
          PCR is your choice, unless you want to spend money on, say, PacBio sequencing...

          Comment


          • #6
            I played around with downsampling from 60x coverage down to 10x and running the de novo assembler. There does not seem to be much difference in the Newbler metrics until coverage drops below 20x. After that, the average, N50, and largest contig sizes fall while the number of large contigs goes up (as expected). Is there something specific I should be looking for to validate the downsampling or to settle on a level of coverage?
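            For reference, the N50 metric used to compare these assemblies is the length of the shortest contig such that contigs at least that long cover half the assembly. A minimal sketch with made-up contig lengths:

```python
def n50(lengths):
    """Smallest contig length such that contigs at least that long
    account for >= 50% of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy assembly totalling 100 kb: the 40 kb and 25 kb contigs together
# reach 65 kb, crossing the 50 kb halfway point, so N50 = 25,000.
print(n50([40_000, 25_000, 20_000, 10_000, 5_000]))  # 25000
```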

            I compared one of the known isolates with the contigs from the de novo assembler (at both 70 Mb and 40 Mb of input) using Mauve. There were small differences in the results: 70 Mb gave 16 LCBs with a minimum weight of 202, while 40 Mb gave 12 LCBs with a minimum weight of 658. So it looks like using 40 Mb is somewhat better. Is trial and error the only way to determine the optimal coverage?

            I will be using PCR to close gaps. How much confidence can I put in the LCBs provided by Mauve? Closing 12 LCBs seems much more manageable than confirming the order of 85 large contigs by PCR.

            Comment
