pmart1 06-30-2014 11:34 AM

illumina 454 de novo hybrid
Hi all,

I am currently trying to assemble a 5 Mb bacterial genome. I have 43 bp single-end reads from an Illumina Genome Analyzer IIx and 500 bp double-end reads from a Roche 454 GS FLX, and was wondering if anyone has had any luck with hybrid de novo assembly for these library types. I have read that putting the Illumina data into Velvet, using EMBOSS to cut the resulting contigs down a bit, and then combining them with the 454 data in Newbler gets decent results, but that was two years ago, so I didn't know if there is a higher-quality pipeline to go with now.


martin2 06-30-2014 11:50 AM

Use the MIRA assembler and make sure you feed it untrimmed Illumina reads (do not do quality trimming on your own). It will remove Illumina adapters on its own. Regarding 454 data ... I cannot recommend any good adapter removal tool for it, except the one I wrote. ;)

GenoMax 06-30-2014 12:00 PM


Originally Posted by martin2 (Post 143902)
I cannot recommend any good adapter removal tool for it, except the one I wrote. ;)

Share the code then ;)

pmart1 06-30-2014 12:14 PM

Thank you for the suggestion, Martin! I will have to try that. I've never used MIRA before. Also, would you be able to share your code for the 454 trimming?

martin2 06-30-2014 01:12 PM

Hi, it so happened that I did all the development on my own, so currently I only offer data cleanup (or even assembly) as a service. It is not only the code (28k lines of Python code) but also a collection of artefacts which I found more 'manually' than by any 'computer-based' approach. They may not be abundant in one dataset, but you may hit them in some other dataset later on ...

I am a molecular biologist and with some datasets (transcriptomes) I had a lot of fun looking for the restriction sites and ligation results, and especially trying to work out how they emerged and how to generalize queries for them. To date I have developed/tested it on 2227 datasets, better not counting how many times I re-calculated all of them from scratch once I realized something had been escaping me. :(( You wouldn't believe that I am still finding datasets produced by yet another lab protocol with yet another batch of primers/adapters and associated issues.

It even works on at least some WGS IonTorrent datasets, as the lab protocols are just the same. If I am not mistaken, IonTorrent was started by people who left 454, so some ideas and issues are common to both.

Unfortunately, I cannot share the code or even the queries. You can find the URL in my profile.

For your particular case, I think it is better to get more sequencing data; 43 bp reads are too short these days and I doubt they are worth the effort.

martin2 06-30-2014 01:14 PM

> ... and 250bp double end reads from a Roche 454 GS FLX ...

What do you have? Isn't this Illumina instead?

pmart1 06-30-2014 01:35 PM

Pardon me, the 454 is 500 bp double-ended. I'm actually an intern in a lab and all of this (including Linux) is extremely new to me.

martin2 06-30-2014 01:38 PM

That sounds like the Illumina mate-pair protocol. What is the name of the file and what are the first one or two entries in it?

pmart1 06-30-2014 01:51 PM

The libraries are from an old strain that perished in a power outage so we cannot run any further sequencing.

pmart1 06-30-2014 01:55 PM

(Strain name).sff. I'm not sure how to open it.

martin2 06-30-2014 01:57 PM

sffinfo (Strain name).sff | head -n 100
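
If sffinfo is not at hand, the same header fields can be read straight from the file. What follows is only a minimal Python sketch, assuming the standard SFF common-header layout (big-endian, 31 fixed bytes followed by the flow characters and the key sequence). The function and dictionary key names are made up for illustration and are not part of any Roche tool.

```python
import struct

# Fixed part of the SFF common header, big-endian, 31 bytes in total:
# magic number, version, index offset, index length, number of reads,
# header length, key length, flows per read, flowgram format code.
SFF_FIXED = struct.Struct(">I4sQIIHHHB")
SFF_MAGIC = 0x2E736666  # the ASCII bytes ".sff"


def parse_sff_common_header(data):
    """Parse the common header from the leading bytes of an SFF file."""
    (magic, version, index_offset, index_length, n_reads,
     header_length, key_length, n_flows, flow_code) = SFF_FIXED.unpack_from(data, 0)
    if magic != SFF_MAGIC:
        raise ValueError("not an SFF file (bad magic number)")
    pos = SFF_FIXED.size
    flow_chars = data[pos:pos + n_flows].decode("ascii")
    pos += n_flows
    key_sequence = data[pos:pos + key_length].decode("ascii")
    return {
        "version": version,
        "index_offset": index_offset,
        "index_length": index_length,
        "number_of_reads": n_reads,
        "header_length": header_length,
        "key_length": key_length,
        "number_of_flows": n_flows,
        "flowgram_code": flow_code,
        "flow_chars": flow_chars,
        "key_sequence": key_sequence,
    }


def read_sff_common_header(path):
    """Read only as many bytes as the header declares, then parse them."""
    with open(path, "rb") as handle:
        fixed = handle.read(SFF_FIXED.size)
        header_length = SFF_FIXED.unpack(fixed)[5]
        rest = handle.read(header_length - SFF_FIXED.size)
    return parse_sff_common_header(fixed + rest)
```

The declared header length should equal 31 + number of flows + key length, rounded up to a multiple of 8, which is a quick check that the file is not truncated or mislabeled.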

pmart1 06-30-2014 02:09 PM

Magic Number: 0x2E736666
Version: 0001
Index Offset: 551942992
Index Length: 4131866
# of Reads: 206557
Header Length: 840
Key Length: 4
# of Flows: 800
Flowgram Code: 1
Flow Chars: (sequence data)
Key Sequence: TCAG

>(Strain name)
Run Prefix:
Region #:
XY Location:

Run Name:
Analysis Name:

Thank you so much for the patience and help.

martin2 06-30-2014 02:30 PM

OK, so this is likely Titanium sequencing, General Library Preparation protocol or Amplicon/paired-end ..., so read length is up to 500 nt. The best would be to feed it into Newbler:

runAssembly -o (Strain name) -mi 90 -ml 80 -consed -scaffold -cpu 2 (Strain name).sff GAIIxdata.fastq

For non-Roche assemblers you have to go with:

sffinfo -s (Strain name).sff > (Strain name).fasta
sffinfo -q (Strain name).sff > (Strain name).fasta.qual
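
Some assemblers want FASTQ rather than a separate FASTA and .qual pair. The two files written above can be merged with a few lines of Python. This is a rough sketch only: it assumes both files list the reads in the same order, that the .qual file holds space-separated integer Phred scores, and that Sanger-style (offset 33) FASTQ is wanted. The function names are hypothetical.

```python
from itertools import zip_longest


def read_records(lines):
    """Yield (name, body_lines) for each '>' record; name is the first word."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, chunks
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, chunks


def fasta_qual_to_fastq(fasta_lines, qual_lines, offset=33):
    """Yield one FASTQ record (as a string) per FASTA/QUAL record pair."""
    pairs = zip_longest(read_records(fasta_lines), read_records(qual_lines),
                        fillvalue=(None, []))
    for (name, seq_chunks), (qual_name, qual_chunks) in pairs:
        if name != qual_name:
            raise ValueError("record mismatch: %r vs %r" % (name, qual_name))
        seq = "".join(seq_chunks)
        scores = [int(q) for q in " ".join(qual_chunks).split()]
        if len(seq) != len(scores):
            raise ValueError("length mismatch in read %s" % name)
        quals = "".join(chr(q + offset) for q in scores)
        yield "@%s\n%s\n+\n%s\n" % (name, seq, quals)


def convert_files(fasta_path, qual_path, fastq_path):
    """Stream both input files and write the merged FASTQ."""
    with open(fasta_path) as fasta, open(qual_path) as qual, \
            open(fastq_path, "w") as out:
        for record in fasta_qual_to_fastq(fasta, qual):
            out.write(record)
```

Calling convert_files with the .fasta and .fasta.qual paths from the two commands above, plus an output path of your choice, writes the merged file. A name or length mismatch raises an error instead of silently producing a shifted file.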

pmart1 06-30-2014 02:35 PM

I do have Newbler on my other computer. When I get home, I will run it and post the results. Thank you very much for all of your help! It is much appreciated.

pmart1 06-30-2014 06:31 PM

I was able to run it and these were the resulting metrics.

numberOfScaffolds = 51;
numberOfBases = 145548;
avgScaffoldSize = 2853;
N50ScaffoldSize = 2594, 19;
largestScaffoldSize = 12532;
numberOfScaffoldContigs = 51;
numberOfScaffoldContigBases = 145548;
avgScaffoldContigSize = 2853;
N50ScaffoldContigSize = 2594, 19;
largestScaffoldContigSize = 12532;
NoEdges = 97, 95.1%;
OneEdge = 1, 1.0%;
TwoEdges = 4, 3.9%;
ManyEdges = 0, 0.0%;
BothNoEdges = 0, 0.0%;
OneNoEdges = 0, 0.0%;
BothOneEdge = 0, 0.0%;
MultiEdges = 0, 0.0%;
numberOfContigs = 2050;
numberOfBases = 1802379;
avgContigSize = 879;
N50ContigSize = 888;
largestContigSize = 12532;
Q40PlusBases = 1705911, 94.65%;
Q39MinusBases = 96468, 5.35%;
NoEdges = 4052, 98.8%;
OneEdge = 29, 0.7%;
TwoEdges = 16, 0.4%;
ManyEdges = 3, 0.1%;
numberOfContigs = 4866;
numberOfBases = 2738478;
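
For anyone unsure how to read the N50 lines above: N50 is the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of all assembled bases. A minimal Python sketch of that definition follows (not Newbler's own code, and the function name is only illustrative).

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running total to half is the N50.
        if running >= half:
            return length
    raise ValueError("no lengths given")
```

For example, lengths of 10, 8, 6, 4 and 2 sum to 30; the two longest contigs already cover 18, which is past the halfway mark of 15, so the N50 is 8.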

martin2 07-23-2014 03:45 PM

Hi pmart1,
sorry I somehow did not receive an email update.

The assembly you show is bad; 95% of reads having no edge means it hardly assembled at all. Was this only the 454 data, or 454 with Illumina together? Try to assemble just the 454 data alone.
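
A quick way to put a number on this is to compare the assembled bases with the expected genome size. The sketch below parses "key = value;" lines like the ones posted and divides the largest numberOfBases entry by the genome size. Treating the largest entry as the all-contigs total is an assumption, because the section headers of the metrics file were not pasted, and the function names are hypothetical.

```python
def parse_metrics(text):
    """Return a list of (key, [values]) from 'key = v1, v2;' lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip().rstrip(";")
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        entries.append((key.strip(), [v.strip() for v in value.split(",")]))
    return entries


def assembled_fraction(entries, genome_size, key="numberOfBases"):
    """Largest reported base count divided by the expected genome size."""
    totals = [int(values[0]) for name, values in entries if name == key]
    if not totals:
        raise ValueError("no %s entries found" % key)
    # Assumption: the largest value is the total over all contigs.
    return max(totals) / genome_size
```

With the numbers posted above and the roughly 5 Mb genome from the first post, 2738478 / 5000000 comes to about 0.55, so only a little over half of the expected genome is present even when every small contig is counted.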

Are you sure you removed the adapters? You should really go to the lab protocols to check what you have in the files. Or ask somebody who knows how to check the raw data. It is hard to tell remotely. If it is your special strain and top-secret ... I could help you debug the data via private email, depending on how much info you can disclose.


