Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • 454 paired end bacterial whole genome assembly

    I have a paired end bacterial genome data set. With help from some of you I was able to liberate the sequence data from the 454 .sff.

    What assembly engine would work best? I want a "second opinion" from gsAssembler.

    This is a small bacterial genome: less than 1.2 million bases. The sequence coverage is a little difficult to measure. But I would estimate, but it is around 50X. It does also contain some contaminating eukaryotic DNA. Very low pass on that however (maybe 0.01X, at most).

    Directly using phrap on the .fasta and .qual files produced a very large number of contigs and scaffolds. I reduced all the quality values to 1/2 their original values and a single large scaffold emerged. But still 2x more contigs than gsAssembler gives.

    Any advice?

    --
    Phillip

  • #2
    My experience with assembling bacterial genomes from 454 data is that Newbler with default parameters will do better than any other assembly engine, when dealing purely with 454 reads.

    I don't know why this is, but I presume there are some built-in heuristics in Newbler that suit the error model for this technology (which is very different to Solexa or Sanger-ABI).

    However, if you really want to try a different assembler then I would suggest MIRA and CLC Genomics Workbench are possible options.

    How many contigs are you getting right now and is this consistent with the repeat profile (i.e. number of repeat regions >= read length) predicted from the genome in question?

    Comment


    • #3
      I'd second the suggestion to try MIRA 3 on the assembly. The manual takes you through preparing the SFF file using sff_extract etc. If you start using the tool seriously, do sign up to their mailing list.

      Comment


      • #4
        Originally posted by nickloman View Post
        How many contigs are you getting right now and is this consistent with the repeat profile (i.e. number of repeat regions >= read length) predicted from the genome in question?
        gsAssembler (aka "Newbler") gives me 52 contigs in a single scaffold, with lots of extraneous contigs--most or all likely deriving from contaminating eukaryotic DNA.

        I looks to me like there are very few repeats in this genome. The read lengths are standard Titanium read lengths (~400 mean) but in cases where the read contains the F/R mate split, each right/left reads will be shorter, of course. The pair-ends look to be 2-4 kb apart--consistent with 3kb paired-ends.

        --
        Phillip

        Comment


        • #5
          Originally posted by pmiguel View Post
          gsAssembler (aka "Newbler") gives me 52 contigs in a single scaffold, with lots of extraneous contigs--most or all likely deriving from contaminating eukaryotic DNA.

          I looks to me like there are very few repeats in this genome. The read lengths are standard Titanium read lengths (~400 mean) but in cases where the read contains the F/R mate split, each right/left reads will be shorter, of course. The pair-ends look to be 2-4 kb apart--consistent with 3kb paired-ends.

          --
          Phillip
          Oh right! Sounds like you have a pretty good result then if ended up with a single scaffold. Be cautious, doing another assembly with a different program may be more confounding than confirmatory! Perhaps it would be better, if you want to verify the genome order to make some confirmatory PCRs, perhaps primers to amplify the entire genome in 10kb sections.

          Comment


          • #6
            Originally posted by nickloman View Post
            Oh right! Sounds like you have a pretty good result then if ended up with a single scaffold. Be cautious, doing another assembly with a different program may be more confounding than confirmatory! Perhaps it would be better, if you want to verify the genome order to make some confirmatory PCRs, perhaps primers to amplify the entire genome in 10kb sections.
            Well, I neglected to add that a cursory examination of the optical map bore no similarity that I could perceive to the scaffold restriction map. But that might be for a trivial reason--like the optical map was done with some other restriction enzyme than the one we were told.

            It is the right bacterium--everything fits previous sequencing results.

            In part I'm just trying to transition or include the tools I've long used for eukaryotic BAC assembly (phred/phrap/consed) into this brave new next-generation world.

            Thanks for your advice,
            Phillip

            Comment


            • #7
              for bacteria genome, newbler is a good choice;
              for " 52 contigs in a single scaffold", you can try to close gaps with the pair-end information.

              Comment


              • #8
                Originally posted by bio-x View Post
                for bacteria genome, newbler is a good choice;
                for " 52 contigs in a single scaffold", you can try to close gaps with the pair-end information.
                Yes, that is the plan. But I am a little suspicious of the gaps.

                I have now done assemblies with both gsAssembler and phrap and examined the assemblies in consed.

                BTW, I had to do a perl -i -pe 's/_left/.f/ if /^>/;s/_right/.r/ if /^>/;' on the fasta and qual files before doing phrap to get consed to show the F/R paired ends in assembly view.

                As I mentioned earlier, the phrap assembly produced a larger number of contigs by far than gsAssembler. This was somewhat mitigated when I cut all the quality values for the reads in half. Nevertheless, the assemblies appear qualitatively different.

                Phrap contig consensus sequence extends fairly far into questionable areas towards the ends of contigs. This is normal for phrap. The bases at the ends of contigs tend to have very low quality values. Newbler, however, is far more parsimonious with how far it allows the consensus sequence to extend at the ends of contigs.

                While I can understand the purpose behind this more stringent consensus sequence generating behavior, I want to turn it off! As it is, none of the contigs show overlap at their ends--so I have no basis to join them. Anyone know a way to get gsAssembler to become less censorious?

                --
                Phillip

                Comment


                • #9
                  Phillip,

                  Here is something I did once when working on a herpes virus genome. I first performed the assembly with gsAssembler. I then took the ACE output and loaded that into consed. Once in consed you can use the use the reassembly and contig joining tools to try to bring them together. You will need to (re)run gsAssembler with the ACE output mode set to generate a full consed folder and you will need the latest version of consed to properly recognize a 454 project.

                  I also noted that consed ignores any scaffolding information from the gsAssembler; it calculates its own scaffolding based on the read pairing information in the contigs. I ran a number of independent assemblies of the viral genome in gsAssembler (I had vastly more raw data than needed for a single assembly). While the size and sequence of the contigs was consistent from one assembly to the next, the scaffolding produced by gsAssembler varied. When the assemblies were loaded into consed it recalculated the scaffolding and produced consistent arrangements for all the assemblies.

                  Comment


                  • #10
                    Originally posted by kmcarr View Post
                    Phillip,

                    Here is something I did once when working on a herpes virus genome. I first performed the assembly with gsAssembler. I then took the ACE output and loaded that into consed. Once in consed you can use the use the reassembly and contig joining tools to try to bring them together.
                    Yes, that is what I am doing as well. The caveat being that I normally use the sequences matches produced in consed assembly view to drive joins. But there are virtually no sequence matches among the contigs produced by gsAssembler.

                    But it looks to me as if gsAssembler has deliberately excluded sequence at the ends of contigs that might match the adjacent contigs. Lacking these regions I don't know how to drive joins.

                    --
                    Phillip

                    Comment


                    • #11
                      What I find very helpful is the 454ContigGraph.txt file which should give you the information you are looking for. However, unhelpfully this is not produced by default by Newbler. Run Newbler with the -g option set on to get this file. This will give you a contig graph which will allow you to make joins between contigs.

                      However, you must be aware that contigs that are not included in scaffolds are likely to be "repeat consensus" contigs, so they may not reflect a true biological sequence but a consensus of two or more repetitive regions.

                      You'd want to do confirmatory Sanger sequencing on these contigs to be sure you get the right sequence.

                      Hope that helps

                      Comment


                      • #12
                        Going along with Phillip's questions (since we are working on the same project), I am having problems running the Newbler assembly through amosvalidate. I get to the point (step 600+) where amosvalidate is looking at the singletons. nucmer/mummer then crashes because the singletons file is empty.

                        Has anyone run a newbler ace file through amosvalidate?

                        Thanks,

                        Comment


                        • #13
                          Originally posted by nickloman View Post
                          What I find very helpful is the 454ContigGraph.txt file which should give you the information you are looking for. However, unhelpfully this is not produced by default by Newbler. Run Newbler with the -g option set on to get this file. This will give you a contig graph which will allow you to make joins between contigs.

                          However, you must be aware that contigs that are not included in scaffolds are likely to be "repeat consensus" contigs, so they may not reflect a true biological sequence but a consensus of two or more repetitive regions.

                          You'd want to do confirmatory Sanger sequencing on these contigs to be sure you get the right sequence.

                          Hope that helps
                          With v. 2.3 "-g" is deprecated because the ContigGraph.txt file is created by default.

                          The issue I was describing--where Newbler appeared to be censoring the ends of contigs was more bizarre than I initially thought. Near as I can tell this censorship hides reads spanning contig junctions. That is, the contigs should be joined on the basis of reads spanning the pseudo-gap, but for a mysterious reason described below, a 0 base gap is created in the assembly.

                          The reason? These breaks apparently denote multiple branch points. Got that? Contiguous reads join a stretch of sequence, but newbler segments the stretch into multiple contigs because, if the reads were not contiguous, there would be multiple possible branch points there? Using a method I describe below I am able to visualize the "suppressed" overlapping sequence in Assembly View of Consed, using "What to Show" "Run cross_match". Very few of them appear to branch at all--there is only one place they match to--the contig adjacent to them!

                          How to see the suppressed sequence that shows that Newber mistakenly failed to join adjacent contigs?

                          Check the "reads limited to one Contig" box of the Parameters:Output pane of the GS de novo assembler GUI.

                          --
                          Phillip

                          Comment


                          • #14
                            Just to give some other recommendation .. as maubp has already mentioned, I'd give MIRA3 a try. It is (usually) doing a good job on this size of genome. It also handles repetitive sequence quite effective.

                            In the 1-100Mb class we successfully use Celera Assembler (currently v6beta), results can be converted to ACE/CAF ([1], still "beta").

                            Just my 2p,
                            Sven

                            [1]=https://sourceforge.net/projects/asm2ace/

                            Comment


                            • #15
                              As a postscript. The bacterial genome I was assembling, after examination and merging inside consed dropped from 52 contigs in a scaffold down to a single contig.

                              Still not sure what Newbler was doing breaking the contigs down like that.

                              --
                              Phillip

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X