  • Strategy for genome assembly

    I am looking for some guidance on assembling a small bacterial genome. This is not working out to be as straightforward as I thought it would be.

    The expected genome size is approximately 1 Mb, with a low (~30%) GC content (could this cause emulsion PCR bias, creating gaps?). This is based on 3 different published reference genomes. I have completed two 454 sequencing runs for one isolate: Run 1 gave 28 Mb at a 400 bp average read length, and Run 2 gave 42 Mb at a 445 bp average read length. I thought this would be more than enough to get a complete assembly. Not so!

    So far I have only done analysis with the 454 software. I am getting really different results depending on how I go about it.

    Using GSReferenceMapper with a single reference genome, the best I can get is 42 large contigs (56 total contigs), average contig size 21,352 bp, longest contig 104,874 bp, 899,615 bases total. A few of these contigs contain very few reads.

    By comparison, using GSReferenceMapper with all 3 reference genomes I get 184 large contigs (252 total contigs), average contig size 2,153 bp, longest contig 31,324 bp, 416,772 bases total. I would have thought more reference information would give a better mapping? There is, however, a large inversion (spanning almost half the genome) in one of the reference isolates.

    The de novo assembler gives me 91 large contigs (199 total contigs), average contig size 10,432 bp, longest contig 78,810 bp, 971,829 bases total. I am inclined to trust these contigs more because they are not based on variable reference genomes.

    How do I go about deciding the best way to assemble this data? Software recommendations? I am limited to Windows for now, so I understand my options are restricted. Any help is appreciated.

  • #2
    With the 454 coverage you have (roughly 70x: ~70 Mb of reads over a ~1 Mb genome), Newbler is your best bet. You may even need to downsample, as 30-50x is usually enough. You could use the Mauve contig mover to order the contigs against a (single) reference genome, provided there are no large rearrangements (so don't use the reference with the inversion).
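
    If it helps, the arithmetic is trivial to script. A minimal Python sketch (genome size and yields taken from your post; the 40x target is only an example):

    ```python
    # Back-of-the-envelope coverage check for the two 454 runs described above.
    genome_size = 1_000_000                  # ~1 Mb expected genome
    run_yields = [28_000_000, 42_000_000]    # bases from run 1 and run 2

    total_bases = sum(run_yields)
    coverage = total_bases / genome_size
    print(f"Total yield: {total_bases / 1e6:.0f} Mb -> ~{coverage:.0f}x coverage")

    # Bases to keep when downsampling to, say, 40x:
    target_coverage = 40
    target_bases = target_coverage * genome_size
    print(f"Keep ~{target_bases / 1e6:.0f} Mb of reads for ~{target_coverage}x")
    ```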

    Mapping with gsMapper against multiple references is of no use; you only confuse the mapper, since many reads can be placed in multiple locations.

    An alternative would be Celera, which seems to work well with 454 data but is somewhat more difficult to use.



    • #3
      In addition to Newbler, where I would probably go the de novo route, I can recommend MIRA. You might need to down-sample a bit, but MIRA should do a really good job. The only reason I am not using it all the time is that it has problems with larger genomes, but a bacterial genome is perfect for it. The support on the mailing list is excellent, so there is always help to be had. It does not do scaffolding, though, so you would need a separate scaffolding program for that.

      In my fungal project, MIRA produced the longest contigs that I also felt I could trust. Celera did well too, but I spotted a few mis-assemblies, so I dropped it.



      • #4
        Thank you for the suggestions. Where is downsampling done within Newbler? Is it adjusted through the expected coverage? I have been adjusting that depending on the input files I have been using.

        After playing around with Mauve I found that my de novo contigs from Newbler fit two of my reference genomes rather well (but not the one with the large rearrangement). Both gave me 11 locally collinear blocks (LCBs). Using Mauve to align the reference-mapper output (42 large contigs, 56 total) gives me 1 LCB. However, there are gaps against the reference genome at the junctions of most of the contigs in my assembly.

        What are my options for confirming contig order and closing these gaps? Doing it all by PCR would be a lot of work (and money), but I don't think adding more coverage would be a cost-effective way to close them either.



        • #5
          Originally posted by mbseq
          Where is downsampling done within Newbler?
          You can use the sfffile command with the '-pick' option, giving the number of bases you want to try; it will randomly select reads up to that number of bases.
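
          sfffile works on the SFF files directly. If you ever want to do the same thing on a FASTA export instead, the idea is just a random pick of reads up to a base budget. A rough Python sketch of that idea (the file names and the 40 Mb budget are placeholders; this is not a substitute for sfffile):

          ```python
          import random

          def downsample_fasta(in_path, out_path, target_bases, seed=1):
              """Randomly keep reads from a FASTA until ~target_bases are selected.
              Mimics the idea behind downsampling with sfffile, not the tool itself."""
              # Read all records into memory (fine for a ~70 Mb 454 dataset).
              records = []
              with open(in_path) as fh:
                  header, seq = None, []
                  for line in fh:
                      line = line.rstrip()
                      if line.startswith(">"):
                          if header is not None:
                              records.append((header, "".join(seq)))
                          header, seq = line, []
                      else:
                          seq.append(line)
                  if header is not None:
                      records.append((header, "".join(seq)))

              # Shuffle reproducibly, then keep reads until the budget is reached.
              random.Random(seed).shuffle(records)
              kept, total = [], 0
              for header, sequence in records:
                  if total >= target_bases:
                      break
                  kept.append((header, sequence))
                  total += len(sequence)

              with open(out_path, "w") as out:
                  for header, sequence in kept:
                      out.write(f"{header}\n{sequence}\n")
              return total

          # e.g. keep ~40 Mb of reads (~40x of a 1 Mb genome):
          # downsample_fasta("reads.fna", "reads_40x.fna", 40_000_000)
          ```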

          Originally posted by mbseq
          What are my options for confirming contig order and closing these gaps? Doing it all by PCR would be a lot of work (and money), but I don't think adding more coverage would be a cost-effective way to close them either.
          PCR is your best option, unless you want to spend money on, say, PacBio sequencing...



          • #6
            I played around with downsampling from 60x down to 10x coverage and running the de novo assembler. There does not seem to be much difference in the Newbler metrics until coverage drops below 20x; after that, the average, N50, and largest contig sizes fall while the number of large contigs goes up (as expected). Is there something specific I should be looking at to validate the downsampling, or to settle on a level of coverage?
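
            In case it is useful, these are the metrics I have been recomputing from the contig lengths to compare runs. A minimal Python sketch of the usual definitions (the 500 bp "large contig" cutoff is my assumption of what Newbler uses):

            ```python
            def contig_stats(lengths, large_cutoff=500):
                """Basic assembly metrics from a list of contig lengths.
                N50 is the contig length at which the running total of the
                sorted lengths first reaches half of the assembly size."""
                lengths = sorted(lengths, reverse=True)
                total = sum(lengths)
                running, n50 = 0, 0
                for length in lengths:
                    running += length
                    if running >= total / 2:
                        n50 = length
                        break
                return {
                    "contigs": len(lengths),
                    "large_contigs": sum(1 for n in lengths if n >= large_cutoff),
                    "total_bases": total,
                    "largest": lengths[0] if lengths else 0,
                    "average": total // len(lengths) if lengths else 0,
                    "N50": n50,
                }

            # e.g. contig_stats([104874, 78810, 31324]) for a toy set of lengths
            ```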

            I compared one of the known isolates with the contigs from the de novo assembler (using both 70 Mb and 40 Mb of input reads) in Mauve. There were small differences in the results: 70 Mb gave 16 LCBs with minimum weight 202, while 40 Mb gave 12 LCBs with minimum weight 658. So using 40 Mb looks somewhat better. Is trial and error the only way to determine the optimal coverage?

            I will be using PCR to close gaps. How much confidence can I put in the LCBs reported by Mauve? Closing 12 LCBs seems much more manageable than confirming the order of 85 large contigs by PCR.
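
            My current plan for the primer design step is to pull the flanking sequence on either side of each junction between consecutive ordered contigs and feed that to a primer-design tool. A rough Python sketch, assuming I already have the contigs as (name, sequence) tuples in their presumed order and orientation:

            ```python
            def junction_flanks(ordered_contigs, flank=500):
                """For contigs given in their presumed genomic order (and already
                oriented), return the sequence flanking each junction: the last
                `flank` bases of contig i and the first `flank` bases of contig
                i + 1. These are the regions to place gap-closing PCR primers in."""
                junctions = []
                for i in range(len(ordered_contigs) - 1):
                    left_name, left_seq = ordered_contigs[i]
                    right_name, right_seq = ordered_contigs[i + 1]
                    junctions.append({
                        "junction": f"{left_name}|{right_name}",
                        "left_flank": left_seq[-flank:],
                        "right_flank": right_seq[:flank],
                    })
                return junctions

            # ordered_contigs would be a list of (name, sequence) tuples parsed from
            # whichever ordered contig FASTA I end up trusting, e.g.:
            # for j in junction_flanks(ordered_contigs):
            #     print(j["junction"], len(j["left_flank"]), len(j["right_flank"]))
            ```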

