SEQanswers

Old 01-04-2010, 02:47 AM   #1
Bukowski
Senior Member
 
Location: Aberdeen, Scotland

Join Date: Jan 2010
Posts: 355
Combining 454FLX and SOLiD runs for de novo genome assembly

I have a project that has produced 1.5 plates' worth of 454 FLX data (mixed paired-end/single-read) and subsequently a SOLiD run.

The genome in question is ~11 Mbp and has no reference to assemble against, as it's a non-model organism.

The 454 runs have been assembled with Newbler, but I'm interested in strategies and packages for combining the 454 and SOLiD data together.

Any pitfalls, protocols or papers I should be aware of?

Bukowski
Old 01-05-2010, 03:17 AM   #2
Rao
Member
 
Location: India

Join Date: Oct 2008
Posts: 36

You can try the Velvet assembler; it accepts both long and short reads.
Old 01-06-2010, 12:23 AM   #3
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 197

Related question here.
SOLiD uses colorspace and Velvet is colorspace-aware, so should we assemble in color space?
I.e., convert 454 reads (or maybe even BAC clone reads from Sanger sequencing?) to color space and assemble?

If my RAM per core is only 2 GB, can I assemble subsets in Velvet (splitting the reads into sets of 20 million) and then reassemble the results from all of the reads using Velvet again?
Old 01-06-2010, 04:03 AM   #4
bio-x
Member
 
Location: China

Join Date: Nov 2008
Posts: 18

My suggestion is to assemble the 454 and SOLiD (Velvet) data separately, then combine the two assemblies. I have successfully assembled one genome using this method.
Old 09-14-2010, 04:57 AM   #5
Temima
Member
 
Location: Israel

Join Date: Sep 2010
Posts: 12
Has anyone had a chance to try velvet on 454 and SOLiD?

I too have Roche reads aligned by Newbler and would like to combine them with SOLiD reads. I'm working with the transcriptome of an organism with no reference.
If I do the assemblies separately, how can I combine them?
Does anyone have experience with translating 454 reads to colorspace and then using Velvet on them and the SOLiD reads together?

I'm new here and would love to learn from other people's mistakes.
Old 09-14-2010, 09:08 AM   #6
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 197

I think the Matz Lab did an excellent job with that combination on the coral transcriptome.
Check out
http://www.bio.utexas.edu/research/m...b/Methods.html

Coral Transcriptomics-a budget NGS approach?
http://kevin-gattaca.blogspot.com/20...udget-ngs.html
has a summary of the tools required for the pipeline.

Basically, the 454 data created a patchy transcriptome that could be annotated, and by adding the SOLiD reads a reasonable amount of data could be extracted.

The scripts are posted on the web site as well.

Do share your findings. This is an area I am keen to explore once I get my hands on the data as well.
Old 09-14-2010, 04:08 PM   #7
sbberes
Member
 
Location: Houston TX

Join Date: Jan 2009
Posts: 22

I have used MIRA to do de novo bacterial genome assemblies using 454 and Illumina sequencing data. I used MIRA because it is a true hybrid de novo assembler. Many of the assemblers that can use reads from different technologies perform an iterative assembly: they first assemble the 454 reads and then layer the Illumina data on top. MIRA uses all of the data equally in the assembly, irrespective of the technology. In my experience, rather large contigs were generated (up to a couple of hundred kb) with 100-fold coverage of ~350 nt 454 reads and 100-fold coverage of 36 nt Illumina reads. After the initial MIRA assembly I further assembled the contigs into larger supercontigs, as it was clear that many of the contigs overlapped but had been unable to coalesce because of junk reads assembled onto their ends.
Old 09-14-2010, 06:38 PM   #8
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 197

Quote:
Originally Posted by sbberes
I have used MIRA to do de novo bacterial genome assemblies using 454 and Illumina sequencing data. I used MIRA because it is a true hybrid de novo assembler. Many of the assemblers that can use reads from different technologies perform an iterative assembly: they first assemble the 454 reads and then layer the Illumina data on top. MIRA uses all of the data equally in the assembly, irrespective of the technology. In my experience, rather large contigs were generated (up to a couple of hundred kb) with 100-fold coverage of ~350 nt 454 reads and 100-fold coverage of 36 nt Illumina reads. After the initial MIRA assembly I further assembled the contigs into larger supercontigs, as it was clear that many of the contigs overlapped but had been unable to coalesce because of junk reads assembled onto their ends.
I do not have much experience with assembly, but I had the impression that 100x coverage is sufficient for de novo assemblies, and I had assumed that bacterial genomes should be a cinch.
Was the 200x coverage from 454 and Solexa really necessary?
(This is worrying, as it might mean I have to sequence 200x on SOLiD or possibly get sequence from a 454 somehow.)

PS: MIRA link: http://www.chevreux.org/projects_mira.html
Old 09-15-2010, 07:29 AM   #9
sbberes
Member
 
Location: Houston TX

Join Date: Jan 2009
Posts: 22

Kevin,
No, 100x coverage from both technologies was, I am sure, overkill. Using the stock protocols, both of these sequencing instruments really have too much capacity for bacterial-sized genomes (squirrel hunting with a bazooka), but at the time I did not yet have barcoding up and running so that I could multiplex my runs. That said, I have not gone back and run multiple assemblies using smaller portions of the data to determine what the minimal requirements are. I suspect that about 15-to-20x coverage with both technologies would suffice. Given pyrosequencing's difficulties with homopolymeric tracts, you really are much better off doing hybrid assemblies.
SBB
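
The fold-coverage figures discussed here are total sequenced bases divided by genome size. A minimal sketch of that arithmetic, assuming the read counts and lengths quoted elsewhere in this thread (about 0.5 million 454 reads averaging 350 nt and about 5 million 36 nt Illumina reads on a ~2 Mbp genome); the function names are made up for illustration:

```python
def fold_coverage(n_reads: int, mean_read_len: int, genome_size: int) -> float:
    """Expected depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * mean_read_len / genome_size


def reads_needed(target_cov: int, mean_read_len: int, genome_size: int) -> int:
    """Smallest number of reads that reaches target_cov (ceiling division)."""
    total_bases = target_cov * genome_size
    return -(-total_bases // mean_read_len)


GENOME = 2_000_000  # ~2 Mbp bacterial genome, as quoted in the thread

cov_454 = fold_coverage(500_000, 350, GENOME)
cov_illumina = fold_coverage(5_000_000, 36, GENOME)
print(f"454: {cov_454:.1f}x, Illumina: {cov_illumina:.1f}x")

# Reads required for the suggested ~20x from each technology
print(reads_needed(20, 350, GENOME), reads_needed(20, 36, GENOME))
```

This ignores filtered reads and uneven coverage, so real depth will come out somewhat lower than the estimate.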
Old 09-15-2010, 07:50 AM   #10
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116

Quote:
Originally Posted by sbberes
After the initial MIRA assembly I further assembled the contigs into larger supercontigs.
How exactly did you do this?
Old 09-15-2010, 11:01 AM   #11
sbberes
Member
 
Location: Houston TX

Join Date: Jan 2009
Posts: 22

Jeremy,
Our laboratory does bacterial genome de novo sequencing and lots of pathogenomic resequencing for comparative population genomic investigations (Staph, Strep, and TB). The most recent genomes being sequenced are ~2 Mbp in size. The most recent de novo assemblies were accomplished by combining data obtained from pyrosequencing on a 454 Titanium instrument (~0.5 million reads with an average read length of 350 nt) and from an Illumina GAII instrument (~5 million reads of 36 nt). Reads from these instruments were first preprocessed using the FASTX toolkit to filter out low-quality and redundant artifactual reads (primer-derived sequences). The filtered data were then fed into MIRA using the recommended protocol and parameters. MIRA was run on a desktop machine with 8 cores and 12 GB of RAM running Ubuntu. I think it took a couple of days to process; that is, it ran over a weekend. This process was run for two strains. The resultant FASTA file of contigs was then filtered to remove contigs of less than 0.5 kb, leaving ~40 contigs, the largest of which were in the 150 to 200 kbp range. The filtered contigs were aligned to the genome of a related strain to order them. The contigs were then fed into Sequencher, where they were trimmed on the ends if needed, and overlapping contigs were assembled into supercontigs (~10 per genome). Virtually all of the breakpoints remaining in the assembly were large repeated elements, such as rRNA operons, 1.5 kb transposons, and some phage lytic cassettes. The gaps were PCR amplified and walked using Sanger sequencing. Regions of overlap in the contigs where there were discrepant base calls were resolved with Sanger sequencing. After final assembly the ~5 million Illumina reads were compared to the genome using VAAL. ~20 polymorphisms were identified, virtually all of them in homopolymeric nt tracts. Most of the polymorphisms lay in coding sequences and shifted the reading frame, disrupting the gene.
This indicated that, despite using two different sequencing technologies and the hybrid assembly, a couple of handfuls of errors still likely occurred. These were again resolved by Sanger sequencing. 20 errors at the end of the process for a 2 Mbp genome is not too shabby. The smaller contigs, i.e. those less than 500 nt in size, were also compared to the assembly, and virtually all of them did assemble/overlap with the genome, so there was no indication that these smaller contigs represented sequence not present in the final assembly.
SBB
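
The contig length filter described above can be sketched in a few lines. This is an illustration under assumptions, not the script used in the workflow: the function names and the toy records are hypothetical, and the 500 nt cutoff is taken from the post. The short contigs are kept in a second list rather than discarded, since the post describes comparing them back to the final assembly.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records: Iterable[Record], min_len: int = 500):
    """Separate contigs of at least min_len nt from the shorter ones."""
    kept: List[Record] = []
    short: List[Record] = []
    for header, seq in records:
        (kept if len(seq) >= min_len else short).append((header, seq))
    return kept, short


# Toy input: one 800 nt contig and one 400 nt contig split over two lines
example = [">contig_a", "ACGT" * 200, ">contig_b", "ACGT" * 50, "ACGT" * 50]
kept, short = split_by_length(read_fasta(example))
print([h for h, _ in kept], [h for h, _ in short])
```

For a real file, passing an open file handle to `read_fasta` works the same way as the list of lines used here.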
Old 09-15-2010, 11:41 AM   #12
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116

Steve, thanks for detailing your approach. I find this thread pretty interesting.

We do a lot of big plant transcriptomes. I am reluctant to feed 100M+ reads to MIRA, so I have tried feeding it a Newbler assembly of 454 + Velvet assembly of Solexa, both masquerading as Sanger reads. The results are certainly better than either alone, but nothing spectacular.

It would be nice if MIRA had a setting that identifies certain long-read input as "homopolymer-prone", but maybe that is too controversial.
Old 09-16-2010, 02:09 AM   #13
Temima
Member
 
Location: Israel

Join Date: Sep 2010
Posts: 12
Thank you all for the input!

MIRA sounds interesting.

So far, I have tried double encoding my 454 reads and feeding them to velvet as colorspace reads. Unfortunately, this gives me a segmentation fault when I try running velvetg...
Old 10-07-2010, 04:33 AM   #14
Temima
Member
 
Location: Israel

Join Date: Sep 2010
Posts: 12
Update

OK, so the good news is that a de novo run of 454 and SOLiD reads really can be done. Of course, there's also bad news:

So I took my 454 reads and converted them to color space using a script I wrote (verified all was well). Then I fed them into the SOLiD preprocessor for Velvet. I preprocessed my SOLiD reads too and then fed them all into velvet_de. At this point I defined both groups of reads as 'short'. I got a bunch of contigs with a maximal length of 820 bp. Nice but not amazing. I ran these contigs through the post processor and denovoadp and all went smoothly.

Then I tried running velvet_de again, but this time I entered the 454 reads twice: once as long and once as short. I got amazingly long contigs (this is for a transcriptome) with a maximal length of 3.5 kb. Wonderful!
I ran the SOLiD post processor on them and all went well. Then I tried running denovoadp and was told:
'contig exceed maximum length or reads match to negative position'

So I have these gorgeous long contigs, but they're stuck in colorspace...
I'm trying to figure out if asid light can help me. I wish the documentation were better...
Any input would be welcome.
Old 10-10-2010, 10:05 PM   #15
ganga.jeena
Member
 
Location: INDIA

Join Date: Jun 2010
Posts: 15

Hello Temima,
Glad to hear the good news.
Could you forward me the script for the 454 to color space conversion?

I have also been trying to combine 454 and SOLiD data for assembly; however, my experimental approaches haven't helped so far.
Thanks and regards
Old 10-10-2010, 11:56 PM   #16
Temima
Member
 
Location: Israel

Join Date: Sep 2010
Posts: 12

Here's the script. Hope it works.
Just to clarify - I was given the 454 data in fasta format, not SFF.

I'd be interested in hearing how things go for you.
Attached Files
File Type: pl fasta2csfasta.pl (2.2 KB, 35 views)
Old 10-11-2010, 02:55 AM   #17
ganga.jeena
Member
 
Location: INDIA

Join Date: Jun 2010
Posts: 15
Thanks

Thanks for the script.
Old 10-11-2010, 03:15 AM   #18
ganga.jeena
Member
 
Location: INDIA

Join Date: Jun 2010
Posts: 15

>contig00001 length=1782 numreads=47
GAAcAAagaG

>contig00001 length=1782 numreads=47
200200202

The output doesn't seem to be correct. Can you check it against the short example above?
Old 10-11-2010, 04:11 AM   #19
Temima
Member
 
Location: Israel

Join Date: Sep 2010
Posts: 12
Default

The problem is with the lower-case bases. Change all bases to capital letters and then try again.
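
For reference, a minimal sketch of a base-space to color-space conversion that upper-cases its input first, so mixed-case FASTA does not trip it up. This is not the attached fasta2csfasta.pl; the function names are hypothetical. It uses the standard di-base code, where each color is the XOR of the 2-bit codes of two adjacent bases. Two assumptions to note: it returns only the base-to-base transitions and leaves out the leading primer base that real csfasta reads carry, and it writes any transition touching an ambiguous base as '.'. The double-encoding helper maps colors 0/1/2/3 to A/C/G/T, with '.' to N assumed by analogy.

```python
# 2-bit base codes; the color between two bases is the XOR of their codes
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

# Double encoding: colors written as letters ('.' -> 'N' is an assumption)
DOUBLE_ENCODING = str.maketrans("0123.", "ACGTN")


def to_colorspace(seq: str) -> str:
    """Return the color transitions for a base-space sequence (case-insensitive)."""
    seq = seq.upper()
    colors = []
    for left, right in zip(seq, seq[1:]):
        if left in BASE_CODE and right in BASE_CODE:
            colors.append(str(BASE_CODE[left] ^ BASE_CODE[right]))
        else:
            colors.append(".")  # ambiguous base on either side
    return "".join(colors)


def double_encode(colors: str) -> str:
    """Rewrite a color string using the letters A, C, G, T (and N)."""
    return colors.translate(DOUBLE_ENCODING)


colors = to_colorspace("GAAcAAagaG")
print(colors)                 # 201100222
print(double_encode(colors))  # GACCAAGGG
```

With the ten bases from the example in this thread, the mixed-case input gives 201100222 under this code, rather than the 200200202 reported above.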
Old 10-12-2010, 03:56 AM   #20
ganga.jeena
Member
 
Location: INDIA

Join Date: Jun 2010
Posts: 15
Regarding velvetg_de

Error:

velvetg: Can't calloc 18446744072095878219 Annotations totalling 18446744009162615736 bytes: Cannot allocate memory
Reading roadmap file /home/data/7_oct/777//Roadmaps

exit


What could this error mean?
I have a working system with 500 GB of RAM.
The Roadmaps file is 23 GB and the Sequences file is 9.6 GB.

I don't get it.
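
One way to read the two huge numbers in the error: both sit just below 2^64, which is what a negative value looks like when it is printed as an unsigned 64-bit integer. The sketch below only reinterprets the numbers from the message. The conclusion that a count of about 2.68 billion wrapped around a signed 32-bit integer is an inference from the arithmetic, not something confirmed in this thread or checked against Velvet's source.

```python
def as_signed(value: int, bits: int = 64) -> int:
    """Reinterpret an unsigned integer of the given width as two's-complement signed."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given width")
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


# The two numbers quoted in the calloc error message
count = as_signed(18446744072095878219)
total_bytes = as_signed(18446744009162615736)

print(count)                 # -1613673397
print(total_bytes)           # -64546935880
print(total_bytes // count)  # 40 bytes per item
print(count % (1 << 32))     # 2681293899, the count if it wrapped at 32 bits
```

If that reading is right, the failure reflects the size of the dataset relative to an integer width inside the program rather than the 500 GB of RAM, so a smaller read set (or a build with wider counters, if one exists) would be the thing to try.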