SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Genome Res De novo bacterial genome sequencing: millions of very short reads assembly b_seite Literature Watch 1 10-05-2017 12:26 AM
Using Mosaik to assemble bacterial genome 454 sequencing jpearl01 Bioinformatics 2 03-27-2013 07:29 AM
454 paired-end or single-end? ps376 454 Pyrosequencing 2 11-02-2011 08:33 AM
paired-end reads mapped to genome.. gene with only one direction of paired-end reads? danwiththeplan Bioinformatics 2 09-22-2011 03:06 AM
Whole genome assembly of cDNA with Illumina paired-end sequencing mjouret Bioinformatics 4 04-15-2010 06:35 AM

Reply
 
Thread Tools
Old 02-23-2010, 05:26 AM   #1
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default 454 paired end bacterial whole genome assembly

I have a paired end bacterial genome data set. With help from some of you I was able to liberate the sequence data from the 454 .sff.

What assembly engine would work best? I want a "second opinion" from gsAssembler.

This is a small bacterial genome: less than 1.2 million bases. The sequence coverage is a little difficult to measure. But I would estimate, but it is around 50X. It does also contain some contaminating eukaryotic DNA. Very low pass on that however (maybe 0.01X, at most).

Directly using phrap on the .fasta and .qual files produced a very large number of contigs and scaffolds. I reduced all the quality values to 1/2 their original values and a single large scaffold emerged. But still 2x more contigs than gsAssembler gives.

Any advice?

--
Phillip
pmiguel is offline   Reply With Quote
Old 02-23-2010, 06:11 AM   #2
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

My experience with assembling bacterial genomes from 454 data is that Newbler with default parameters will do better than any other assembly engine, when dealing purely with 454 reads.

I don't know why this is, but I presume there are some built-in heuristics in Newbler that suit the error model for this technology (which is very different to Solexa or Sanger-ABI).

However, if you really want to try a different assembler then I would suggest MIRA and CLC Genomics Workbench are possible options.

How many contigs are you getting right now and is this consistent with the repeat profile (i.e. number of repeat regions >= read length) predicted from the genome in question?
nickloman is offline   Reply With Quote
Old 02-23-2010, 06:18 AM   #3
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

I'd second the suggestion to try MIRA 3 on the assembly. The manual takes you through preparing the SFF file using sff_extract etc. If you start using the tool seriously, do sign up to their mailing list.
maubp is offline   Reply With Quote
Old 02-23-2010, 08:59 AM   #4
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default

Quote:
Originally Posted by nickloman View Post
How many contigs are you getting right now and is this consistent with the repeat profile (i.e. number of repeat regions >= read length) predicted from the genome in question?
gsAssembler (aka "Newbler") gives me 52 contigs in a single scaffold, with lots of extraneous contigs--most or all likely deriving from contaminating eukaryotic DNA.

I looks to me like there are very few repeats in this genome. The read lengths are standard Titanium read lengths (~400 mean) but in cases where the read contains the F/R mate split, each right/left reads will be shorter, of course. The pair-ends look to be 2-4 kb apart--consistent with 3kb paired-ends.

--
Phillip
pmiguel is offline   Reply With Quote
Old 02-23-2010, 09:16 AM   #5
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Quote:
Originally Posted by pmiguel View Post
gsAssembler (aka "Newbler") gives me 52 contigs in a single scaffold, with lots of extraneous contigs--most or all likely deriving from contaminating eukaryotic DNA.

I looks to me like there are very few repeats in this genome. The read lengths are standard Titanium read lengths (~400 mean) but in cases where the read contains the F/R mate split, each right/left reads will be shorter, of course. The pair-ends look to be 2-4 kb apart--consistent with 3kb paired-ends.

--
Phillip
Oh right! Sounds like you have a pretty good result then if ended up with a single scaffold. Be cautious, doing another assembly with a different program may be more confounding than confirmatory! Perhaps it would be better, if you want to verify the genome order to make some confirmatory PCRs, perhaps primers to amplify the entire genome in 10kb sections.
nickloman is offline   Reply With Quote
Old 02-23-2010, 09:21 AM   #6
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default

Quote:
Originally Posted by nickloman View Post
Oh right! Sounds like you have a pretty good result then if ended up with a single scaffold. Be cautious, doing another assembly with a different program may be more confounding than confirmatory! Perhaps it would be better, if you want to verify the genome order to make some confirmatory PCRs, perhaps primers to amplify the entire genome in 10kb sections.
Well, I neglected to add that a cursory examination of the optical map bore no similarity that I could perceive to the scaffold restriction map. But that might be for a trivial reason--like the optical map was done with some other restriction enzyme than the one we were told.

It is the right bacterium--everything fits previous sequencing results.

In part I'm just trying to transition or include the tools I've long used for eukaryotic BAC assembly (phred/phrap/consed) into this brave new next-generation world.

Thanks for your advice,
Phillip
pmiguel is offline   Reply With Quote
Old 02-23-2010, 06:09 PM   #7
bio-x
Member
 
Location: China

Join Date: Nov 2008
Posts: 18
Default

for bacteria genome, newbler is a good choice;
for " 52 contigs in a single scaffold", you can try to close gaps with the pair-end information.
bio-x is offline   Reply With Quote
Old 02-24-2010, 04:37 AM   #8
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default

Quote:
Originally Posted by bio-x View Post
for bacteria genome, newbler is a good choice;
for " 52 contigs in a single scaffold", you can try to close gaps with the pair-end information.
Yes, that is the plan. But I am a little suspicious of the gaps.

I have now done assemblies with both gsAssembler and phrap and examined the assemblies in consed.

BTW, I had to do a perl -i -pe 's/_left/.f/ if /^>/;s/_right/.r/ if /^>/;' on the fasta and qual files before doing phrap to get consed to show the F/R paired ends in assembly view.

As I mentioned earlier, the phrap assembly produced a larger number of contigs by far than gsAssembler. This was somewhat mitigated when I cut all the quality values for the reads in half. Nevertheless, the assemblies appear qualitatively different.

Phrap contig consensus sequence extends fairly far into questionable areas towards the ends of contigs. This is normal for phrap. The bases at the ends of contigs tend to have very low quality values. Newbler, however, is far more parsimonious with how far it allows the consensus sequence to extend at the ends of contigs.

While I can understand the purpose behind this more stringent consensus sequence generating behavior, I want to turn it off! As it is, none of the contigs show overlap at their ends--so I have no basis to join them. Anyone know a way to get gsAssembler to become less censorious?

--
Phillip
pmiguel is offline   Reply With Quote
Old 02-24-2010, 06:08 AM   #9
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Phillip,

Here is something I did once when working on a herpes virus genome. I first performed the assembly with gsAssembler. I then took the ACE output and loaded that into consed. Once in consed you can use the use the reassembly and contig joining tools to try to bring them together. You will need to (re)run gsAssembler with the ACE output mode set to generate a full consed folder and you will need the latest version of consed to properly recognize a 454 project.

I also noted that consed ignores any scaffolding information from the gsAssembler; it calculates its own scaffolding based on the read pairing information in the contigs. I ran a number of independent assemblies of the viral genome in gsAssembler (I had vastly more raw data than needed for a single assembly). While the size and sequence of the contigs was consistent from one assembly to the next, the scaffolding produced by gsAssembler varied. When the assemblies were loaded into consed it recalculated the scaffolding and produced consistent arrangements for all the assemblies.
kmcarr is offline   Reply With Quote
Old 02-24-2010, 07:13 AM   #10
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default

Quote:
Originally Posted by kmcarr View Post
Phillip,

Here is something I did once when working on a herpes virus genome. I first performed the assembly with gsAssembler. I then took the ACE output and loaded that into consed. Once in consed you can use the use the reassembly and contig joining tools to try to bring them together.
Yes, that is what I am doing as well. The caveat being that I normally use the sequences matches produced in consed assembly view to drive joins. But there are virtually no sequence matches among the contigs produced by gsAssembler.

But it looks to me as if gsAssembler has deliberately excluded sequence at the ends of contigs that might match the adjacent contigs. Lacking these regions I don't know how to drive joins.

--
Phillip
pmiguel is offline   Reply With Quote
Old 02-24-2010, 08:20 AM   #11
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

What I find very helpful is the 454ContigGraph.txt file which should give you the information you are looking for. However, unhelpfully this is not produced by default by Newbler. Run Newbler with the -g option set on to get this file. This will give you a contig graph which will allow you to make joins between contigs.

However, you must be aware that contigs that are not included in scaffolds are likely to be "repeat consensus" contigs, so they may not reflect a true biological sequence but a consensus of two or more repetitive regions.

You'd want to do confirmatory Sanger sequencing on these contigs to be sure you get the right sequence.

Hope that helps
nickloman is offline   Reply With Quote
Old 02-24-2010, 09:36 AM   #12
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Going along with Phillip's questions (since we are working on the same project), I am having problems running the Newbler assembly through amosvalidate. I get to the point (step 600+) where amosvalidate is looking at the singletons. nucmer/mummer then crashes because the singletons file is empty.

Has anyone run a newbler ace file through amosvalidate?

Thanks,
westerman is offline   Reply With Quote
Old 02-25-2010, 08:17 AM   #13
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Lightbulb

Quote:
Originally Posted by nickloman View Post
What I find very helpful is the 454ContigGraph.txt file which should give you the information you are looking for. However, unhelpfully this is not produced by default by Newbler. Run Newbler with the -g option set on to get this file. This will give you a contig graph which will allow you to make joins between contigs.

However, you must be aware that contigs that are not included in scaffolds are likely to be "repeat consensus" contigs, so they may not reflect a true biological sequence but a consensus of two or more repetitive regions.

You'd want to do confirmatory Sanger sequencing on these contigs to be sure you get the right sequence.

Hope that helps
With v. 2.3 "-g" is deprecated because the ContigGraph.txt file is created by default.

The issue I was describing--where Newbler appeared to be censoring the ends of contigs was more bizarre than I initially thought. Near as I can tell this censorship hides reads spanning contig junctions. That is, the contigs should be joined on the basis of reads spanning the pseudo-gap, but for a mysterious reason described below, a 0 base gap is created in the assembly.

The reason? These breaks apparently denote multiple branch points. Got that? Contiguous reads join a stretch of sequence, but newbler segments the stretch into multiple contigs because, if the reads were not contiguous, there would be multiple possible branch points there? Using a method I describe below I am able to visualize the "suppressed" overlapping sequence in Assembly View of Consed, using "What to Show" "Run cross_match". Very few of them appear to branch at all--there is only one place they match to--the contig adjacent to them!

How to see the suppressed sequence that shows that Newber mistakenly failed to join adjacent contigs?

Check the "reads limited to one Contig" box of the Parameters:Output pane of the GS de novo assembler GUI.

--
Phillip
pmiguel is offline   Reply With Quote
Old 02-25-2010, 12:46 PM   #14
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Just to give some other recommendation .. as maubp has already mentioned, I'd give MIRA3 a try. It is (usually) doing a good job on this size of genome. It also handles repetitive sequence quite effective.

In the 1-100Mb class we successfully use Celera Assembler (currently v6beta), results can be converted to ACE/CAF ([1], still "beta").

Just my 2p,
Sven

[1]=https://sourceforge.net/projects/asm2ace/
sklages is offline   Reply With Quote
Old 02-26-2010, 10:55 AM   #15
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default

As a postscript. The bacterial genome I was assembling, after examination and merging inside consed dropped from 52 contigs in a scaffold down to a single contig.

Still not sure what Newbler was doing breaking the contigs down like that.

--
Phillip
pmiguel is offline   Reply With Quote
Old 03-11-2010, 05:50 AM   #16
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default

Quote:
Originally Posted by sklages View Post
Just to give some other recommendation .. as maubp has already mentioned, I'd give MIRA3 a try. It is (usually) doing a good job on this size of genome. It also handles repetitive sequence quite effective.

In the 1-100Mb class we successfully use Celera Assembler (currently v6beta), results can be converted to ACE/CAF ([1], still "beta").

Just my 2p,
Sven

[1]=https://sourceforge.net/projects/asm2ace/
I finally did try MIRA3. Yes, I like the results. 37 contigs in one main scaffold, 10 of the contigs larger than 1 kb. When I look at the .ace file with consed, the contig breaks seem reasonable. That is, they are in regions with a fairly complex repeat structure.

--
Phillip
pmiguel is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:42 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2022, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO