**Rao** · 01-05-2010, 03:17 AM

You can try the Velvet assembler; it accepts both long and short reads.

**KevinLam** · 01-06-2010, 12:23 AM

Related question here.
SOLiD uses color space and Velvet is color-space aware,
so should we assemble in color space?
I.e., convert 454 reads (or maybe even BAC clone reads from Sanger sequencing?) to color space and assemble?

If my RAM per core is only 2 GB, can I assemble subsets in Velvet (splitting the reads into sets of 20 million) and then reassemble again using Velvet for all of the reads?

**bio-x** · 01-06-2010, 04:03 AM

My suggestion is to assemble the 454 and SOLiD (Velvet) reads separately, then combine the two assemblies. I have successfully assembled one genome using this method.

**Temima** · 09-14-2010, 03:57 AM

Has anyone had a chance to try velvet on 454 and SOLiD?

I too have Roche reads aligned by newbler and would like to combine them with SOLiD reads. I'm working with the transcriptome of an organism with no reference.
If I do the assemblies separately - how can I combine them?
Does anyone have experience with translating 454 reads to color space and then using Velvet on them and the SOLiD reads together?

I'm new here and would love to learn from other people's mistakes

**KevinLam** · 09-14-2010, 08:08 AM

I think the Matz Lab did an excellent job with that combination on the coral transcriptome.
Check out:

http://www.bio.utexas.edu/research/matz_lab/matzlab/Methods.html

Coral Transcriptomics-a budget NGS approach?

http://kevin-gattaca.blogspot.com/2010/05/coral-transcriptomics-budget-ngs.html

has a summary of the tools required for the pipeline.

Basically, the 454 reads created a patchy transcriptome that could be annotated, and by adding the SOLiD reads a reasonable amount of data could be extracted.

The scripts are posted on the web site as well.

Do share your findings. This is an area I am keen to explore once I get my hands on the data as well.

**sbberes** · 09-14-2010, 03:08 PM

I have used MIRA to do de novo bacterial genome assemblies using 454 and Illumina sequencing data. I used MIRA because it is a true hybrid de novo assembler. That is, many of the assemblers that are capable of using reads from different technologies perform an iterative assembly: they first do an assembly using the 454 reads and then layer the data from the Illumina reads on top of that. MIRA uses all of the data equally in the assembly, irrespective of the technology. In my experience rather large contigs were generated (up to a couple hundred kb) with 100-fold depth of coverage from ~350 nt long 454 reads and 100-fold coverage from 36 nt Illumina reads. After the initial MIRA assembly I further assembled the contigs into larger supercontigs, as it was clear that many of the contigs were overlapping, but because junk reads had been assembled onto the ends of the contigs, the overlapping contigs were unable to coalesce.

**KevinLam** · 09-14-2010, 05:38 PM

I do not have much experience with assembly, but I had the impression that 100x coverage is sufficient for de novo assemblies,
and for bacterial genomes I had assumed that it should be a cinch.
Was the 200x coverage from 454 and Solexa really necessary?
(This is worrying, as it might mean I have to sequence 200x on SOLiD or possibly get sequence from a 454 somehow.)

ps. http://www.chevreux.org/projects_mira.html MIRA link

**sbberes** · 09-15-2010, 06:29 AM

Kevin,
No, 100x coverage from both technologies was, I am sure, overkill. Both of these sequencing instruments, using the stock protocols, are really too large in capacity for bacterial-sized genomes (squirrel hunting with a bazooka), but at the time I did not yet have barcoding up and running so that I could multiplex my runs. That said, I have not gone back and run multiple assemblies using smaller portions of the data to determine what the minimal requirements are. I suspect that about 15-to-20x coverage with both technologies would suffice. Given pyrosequencing’s difficulties with homopolymeric tracts, you really are much better off doing hybrid assemblies.
SBB
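
The coverage figures in this exchange follow from simple arithmetic: average depth is total sequenced bases divided by genome size. Below is a minimal Python sketch of that calculation, assuming a hypothetical 2 Mbp genome and the read lengths mentioned earlier in the thread (350 nt for 454, 36 nt for Illumina). The function names are illustrative only, and average depth says nothing about how evenly the reads are distributed.

```python
import math


def coverage(n_reads, read_len, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def reads_needed(target_cov, read_len, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_cov * genome_size / read_len)


if __name__ == "__main__":
    genome_size = 2_000_000  # assumed genome size, for illustration only
    for platform, read_len in (("454", 350), ("Illumina", 36)):
        for target in (15, 20):
            n = reads_needed(target, read_len, genome_size)
            print(f"{platform}: {target}x needs about {n:,} reads of {read_len} nt")
```

At these read lengths a 15-to-20x target corresponds to roughly a hundred thousand 454 reads or about a million 36 nt reads, far below the output of a full run, which is why multiplexing comes up.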

**Zigster** · 09-15-2010, 06:50 AM

Originally posted by sbberes:

... After the initial MIRA assembly I further assembled the contigs into larger supercontigs ...

how exactly did you do this?

**sbberes** · 09-15-2010, 10:01 AM

Jeremy,
Our laboratory does de novo bacterial genome sequencing and lots of pathogenomic resequencing for comparative population genomic investigations (Staph, Strep, and TB). The most recent genomes being sequenced are ~2 Mbp in size. The most recent de novo assemblies were accomplished by combining data obtained from pyrosequencing using a 454 Titanium instrument (~0.5 million reads with an average read length of 350 nt) and from an Illumina GAII instrument (~5 million reads of 36 nt).

Reads from these instruments were first preprocessed using the FASTX toolkit to filter out low-quality and redundant artifactual reads (primer-derived sequences). The filtered data was then fed into MIRA using the recommended protocol and parameters. MIRA was run on a desktop machine with 8 cores and 12 GB RAM running Ubuntu. I think it took a couple of days to process; that is, it ran over a weekend. This process was run for two strains.

The resultant FASTA file of contigs was then filtered to remove contigs of less than 0.5 kb (leaving ~40 contigs, the largest of which were in the 150 to 200 kbp range). The filtered contigs were aligned to the genome of a related strain to order the contigs. The contigs were then fed into Sequencher, where they were trimmed on the ends if needed, and then overlapping contigs were assembled into supercontigs (~10 per genome). Virtually all of the breakpoints remaining in the assembly were large repeated elements, such as rRNA operons, 1.5 kb transposons, and some phage lytic cassettes. The gaps were PCR amplified and walked using Sanger sequencing. Regions of overlap in the contigs where there were discrepant base calls were resolved with Sanger sequencing.

After final assembly the ~5 million Illumina reads were compared to the genome using VAAL. ~20 polymorphisms were identified, and virtually all of these polymorphisms were in homopolymeric nt tracts. Most of the polymorphisms lay in coding sequences and shifted the reading frames, disrupting the gene.
This indicated that despite using two different sequencing technologies and the hybrid assembly, a couple of handfuls of errors still likely occurred. This was again resolved by Sanger sequencing. 20 errors at the end of the process for a 2 Mbp genome is not too shabby. The smaller contigs, i.e. those less than 500 nt in size, were also compared to the assembly and virtually all of them did assemble/overlap with the genome, so there was no indication that these smaller contigs represented sequence not present in the final assembly.
SBB
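
The contig length filter described in this post (setting aside contigs shorter than 0.5 kb) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the procedure actually used: the FASTA parser is a bare-bones one written for this example, and the function names and the commented file name are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_len=500):
    """Split contigs into (kept, short) lists around the length cutoff."""
    kept, short = [], []
    for header, seq in records:
        (kept if len(seq) >= min_len else short).append((header, seq))
    return kept, short


if __name__ == "__main__":
    # Tiny in-memory example; for real data pass an open file handle instead,
    # e.g. open("contigs.fasta") (hypothetical file name).
    example = [">contig1", "ACGT" * 100, "ACGT" * 100, ">contig2", "ACGTACGT"]
    kept, short = split_by_length(read_fasta(example))
    print("kept:", [(h, len(s)) for h, s in kept])
    print("short:", [(h, len(s)) for h, s in short])
```

Keeping the short contigs in a separate list rather than discarding them makes it possible to compare them against the final assembly later, as was done here.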

**Zigster** · 09-15-2010, 10:41 AM

Steve, thanks for detailing your approach. I find this thread pretty interesting.

We do a lot of big plant transcriptomes. I am reluctant to feed 100M+ reads to MIRA, so I have tried feeding it a Newbler assembly of 454 + Velvet assembly of Solexa, both masquerading as Sanger reads. The results are certainly better than either alone, but nothing spectacular.

It would be nice if MIRA would have a setting that identifies certain long read input as "homopolymer-prone", but maybe that is too controversial.

**Temima** · 09-16-2010, 01:09 AM

Thank you all for the input!

MIRA sounds interesting.

So far, I have tried double encoding my 454 reads and feeding them to velvet as colorspace reads. Unfortunately, this gives me a segmentation fault when I try running velvetg...
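
For readers unfamiliar with the term, "double encoding" writes each color call (0 to 3) as a pseudo-base (A, C, G, T) so that tools expecting nucleotide letters can read color-space data. Here is a minimal Python sketch of the common mapping, under stated assumptions: the leading primer base is dropped if present, a missing color call (`.`) becomes `N`, and the function name is illustrative. Some pipelines also drop the first color, and this is not necessarily what the Velvet preprocessing tools do internally.

```python
# Common double-encoding table: color digit -> pseudo-base, '.' -> 'N'
DOUBLE_ENCODE = str.maketrans("0123.", "ACGTN")


def double_encode(cs_read):
    """Convert a color-space read (e.g. 'T0123') to pseudo-bases.

    Assumption: a leading letter is the primer base and is dropped;
    every remaining color is kept.
    """
    if cs_read and cs_read[0].isalpha():
        cs_read = cs_read[1:]
    unexpected = set(cs_read) - set("0123.")
    if unexpected:
        raise ValueError(f"unexpected characters in color read: {sorted(unexpected)}")
    return cs_read.translate(DOUBLE_ENCODE)


if __name__ == "__main__":
    print(double_encode("T0123"))
    print(double_encode("0123."))
```

The resulting letters are not real nucleotides; contigs assembled from them are still in color space and have to be translated back afterwards.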

**Temima** · 10-07-2010, 03:33 AM

Update

OK, so the good news is that a de novo run of 454 and SOLiD reads really can be done. Of course, there's also bad news:

So I took my 454 reads and converted them to color space using a script I wrote (verified all was well). Then I fed them into the SOLiD preprocessor for Velvet. I preprocessed my SOLiD reads too and then fed them all into velvet_de. At this point I defined both groups of reads as 'short'. I got a bunch of contigs with a maximal length of 820 bp. Nice but not amazing. Ran these contigs through the post processor and denovoadp and all went smoothly.

Then I tried running velvet_de again but this time I entered the 454 reads twice - once as long and once as short. I got amazingly long contigs (this is for transcriptome) with a maximal length of 3.5 kb. Wonderful!
I ran the solid post processor on them and all went well. Then I tried running denovoadp and was told:

'contig exceed maximum length or reads match to negative position'

So I have these gorgeous long contigs, but they're stuck in cs...
Trying to figure out if asid light can help me. Wish the documentation was better...
Any input would be welcome.
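
Since the conversion script itself is not posted in the thread, here is a minimal Python sketch of base-space to color-space encoding for anyone who wants to experiment. It is not the script used above. It assumes the standard SOLiD two-base code (each color is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3) and a hypothetical leading primer base `T` so that the first base can be recovered; all names are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def to_color_space(seq, primer="T"):
    """Encode a base-space read as primer base + colors ('.' for unknown)."""
    prev, colors = primer, []
    for base in seq.upper():
        if prev in CODE and base in CODE:
            colors.append(str(CODE[prev] ^ CODE[base]))
        else:
            colors.append(".")  # an N on either side gives an unknown color
        prev = base
    return primer + "".join(colors)


def from_color_space(cs_read):
    """Decode primer base + colors back to bases.

    After an unknown color, every later base is reported as N because
    each base depends on the one before it.
    """
    prev, bases = cs_read[0], []
    for color in cs_read[1:]:
        if color in "0123" and prev in CODE:
            prev = BASES[CODE[prev] ^ int(color)]
        else:
            prev = "N"
        bases.append(prev)
    return "".join(bases)


if __name__ == "__main__":
    cs = to_color_space("ACGT")
    print(cs)                    # T3131
    print(from_color_space(cs))  # ACGT
```

Because each decoded base depends on the previous one, a single wrong color in a contig corrupts everything downstream when translating back naively, which is one reason contigs are usually converted to base space with the original reads as support rather than by direct decoding.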

**ganga.jeena** · 10-10-2010, 09:05 PM

Hello Temima,
Glad to hear the good news.
Could you forward me the script for the 454 to color space conversion?

I have also been trying to combine 454 and SOLiD data for assembly; however, my experimental approach hasn't helped me yet.
Thanks and regards

Combining 454FLX and SOLiD runs for de novo genome assembly