SEQanswers

06-30-2014, 11:34 AM   #1
pmart1
Junior Member
Location: Nashville
Join Date: Jun 2014
Posts: 8
illumina 454 de novo hybrid

Hi all,

I am currently trying to assemble a 5 Mb bacterial genome. I have 43 bp single-end reads from an Illumina Genome Analyzer IIx and 500 bp double end reads from a Roche 454 GS FLX, and was wondering if anyone has had any luck with hybrid de novo assembly for these library types. I have read that putting the Illumina data into Velvet, using EMBOSS to cut the resulting contigs down a bit, and then combining them with the 454 data in Newbler gets decent results, but that was two years ago, so I don't know whether there is a higher-quality pipeline to go with now.
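A minimal sketch of the "cut the contigs down" step, assuming the goal is to turn long Velvet contigs into overlapping pseudo-reads that can be given to Newbler together with the 454 reads. The fragment length (1500 bp) and overlap (500 bp) are illustrative assumptions, not values from the pipeline described above, and `split_contig` is a hypothetical helper, not an EMBOSS function.

```python
def split_contig(seq, frag_len=1500, overlap=500):
    """Cut one contig into overlapping fragments (pseudo-reads).

    frag_len and overlap are assumed, illustrative defaults.
    """
    if frag_len <= overlap:
        raise ValueError("frag_len must be larger than overlap")
    if len(seq) <= frag_len:
        # Short contigs are passed through unchanged.
        return [seq]
    step = frag_len - overlap
    fragments = []
    start = 0
    while True:
        end = start + frag_len
        if end >= len(seq):
            # Anchor the last fragment at the contig end so it keeps full length.
            fragments.append(seq[-frag_len:])
            break
        fragments.append(seq[start:end])
        start += step
    return fragments
```

Each fragment could then be written out as its own FASTA record. Whether this helps at all depends on how good the Velvet contigs from 43 bp reads are in the first place.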

Thanks

Last edited by pmart1; 06-30-2014 at 01:35 PM.
06-30-2014, 11:50 AM   #2
martin2
Member
Location: Prague, Czech Republic
Join Date: Nov 2010
Posts: 40

Use the MIRA assembler and make sure you feed it untrimmed Illumina reads (do not do quality trimming on your own). It will remove Illumina adapters by itself. Regarding the 454 data ... I cannot recommend any good adapter removal tool for it, except the one I wrote.
06-30-2014, 12:00 PM   #3
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 7,062

Originally posted by martin2:
> I cannot recommend any good adapter removal tool for it, except the one I wrote.

Share the code then
06-30-2014, 12:14 PM   #4
pmart1

Thank you for the suggestion, Martin! I will have to try that. I've never used MIRA before. Also, would you be able to share your code for the 454 trimming?
06-30-2014, 01:12 PM   #5
martin2

Hi, as it happens I did all the development on my own, so currently I only offer data cleanup (or even assembly) as a service. It is not only the code (28k lines of Python) but also a collection of artefacts which I found more 'manually' than by any 'computer-based' approach. They may not be abundant in one dataset, but you may hit them in some other one later on ...

I am a molecular biologist, and with some datasets (transcriptomes) I had a lot of fun looking for the restriction sites and ligation results, and especially trying to work out how they emerged and how to generalize queries for them. To date I have developed/tested it on 2227 datasets, better not to count how many times I re-calculated all of them from scratch once I realized something had been escaping me. (You wouldn't believe that I am still finding datasets produced by yet another lab protocol with yet another batch of primers/adapters and associated issues.)

It even works on at least some WGS IonTorrent datasets, as the lab protocols are just the same. If I am not mistaken, Ion Torrent was started by people who left 454, so some ideas and issues are common to both.

Unfortunately, I cannot share the code or even the queries. You can find the URL in my profile.

--
For your particular case, I think it is better to get more sequencing data; 43 bp reads are too short these days and I doubt they are worth the effort.
06-30-2014, 01:14 PM   #6
martin2

> ... and 250bp double end reads from a Roche 454 GS FLX ...

What do you have? Isn't this Illumina instead?
06-30-2014, 01:35 PM   #7
pmart1

Pardon me, the 454 is 500 bp double ended. I'm actually an intern in a lab and all of this (including Linux) is extremely new to me.
06-30-2014, 01:38 PM   #8
martin2

That sounds like the Illumina mate-pair protocol. What is the name of the file, and what are the first one or two entries in it?
06-30-2014, 01:51 PM   #9
pmart1

The libraries are from an old strain that perished in a power outage so we cannot run any further sequencing.
06-30-2014, 01:55 PM   #10
pmart1

(Strain name).sff. I'm not sure how to open it.
06-30-2014, 01:57 PM   #11
martin2

sffinfo "(Strain name).sff" | head -n 100
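If sffinfo is not at hand, the fields it prints first (magic number, number of reads, number of flows, key sequence) can also be read straight from the file. The following is a minimal Python sketch, assuming the standard SFF common header layout (big-endian fixed fields, then the flow characters and the key sequence). `read_sff_common_header` is a hypothetical helper name, and the sketch does not read the per-read entries.

```python
import struct

SFF_MAGIC = 0x2E736666  # the four bytes ".sff"


def read_sff_common_header(handle):
    """Parse the SFF common header from a binary file handle."""
    # magic, version, index offset, index length, number of reads,
    # header length, key length, flows per read, flowgram format code
    fixed = struct.Struct(">I4sQIIHHHB")  # 31 bytes, big-endian
    (magic, version, index_offset, index_length, n_reads,
     header_len, key_len, n_flows, flow_code) = fixed.unpack(handle.read(fixed.size))
    if magic != SFF_MAGIC:
        raise ValueError("not an SFF file (bad magic number)")
    # The rest of the header holds the flow characters, the key, and padding.
    rest = handle.read(header_len - fixed.size)
    return {
        "version": "".join(str(b) for b in version),
        "index_offset": index_offset,
        "index_length": index_length,
        "number_of_reads": n_reads,
        "header_length": header_len,
        "number_of_flows": n_flows,
        "flowgram_code": flow_code,
        "flow_chars": rest[:n_flows].decode("ascii"),
        "key_sequence": rest[n_flows:n_flows + key_len].decode("ascii"),
    }
```

Usage would be along the lines of `with open(path, "rb") as fh: print(read_sff_common_header(fh))`, where `path` points at the .sff file.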
06-30-2014, 02:09 PM   #12
pmart1

Magic Number: 0x2E736666
Version: 0001
Index Offset: 551942992
Index Length: 4131866
# of Reads: 206557
Header Length: 840
Key Length: 4
# of Flows: 800
Flowgram Code: 1
Flow Chars: (sequence data)
Key Sequence: TCAG

>(Strain name)
Run Prefix:
Region #:
XY Location:

Run Name:
Analysis Name:


Thank you so much for the patience and help.
06-30-2014, 02:30 PM   #13
martin2

OK, so this is likely Titanium sequencing (General Library Preparation protocol or Amplicon/paired-end ...), so read lengths of up to 500 nt. The best option would be to feed it into Newbler:

runAssembly -o "(Strain name)" -mi 90 -ml 80 -consed -scaffold -cpu 2 "(Strain name).sff" GAIIxdata.fastq



For non-Roche assemblers you have to convert the SFF file to FASTA and quality files first:

sffinfo -s "(Strain name).sff" > "(Strain name).fasta"
sffinfo -q "(Strain name).sff" > "(Strain name).fasta.qual"
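Some assemblers expect FASTQ rather than a FASTA file plus a separate quality file. Below is a minimal Python sketch of merging the two sffinfo outputs, assuming both files list the reads in the same order and that Sanger (Phred+33) quality encoding is wanted. `read_records` and `fasta_qual_to_fastq` are hypothetical helper names, not part of any Roche tool.

```python
def read_records(lines):
    """Yield (name, payload) pairs from FASTA-style lines.

    Payload lines of one record are joined with single spaces.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, " ".join(chunks)
            # Keep only the read name, drop the rest of the header line.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, " ".join(chunks)


def fasta_qual_to_fastq(fasta_lines, qual_lines):
    """Merge FASTA and QUAL records (same order assumed) into FASTQ text."""
    out = []
    pairs = zip(read_records(fasta_lines), read_records(qual_lines))
    for (name, seq), (qual_name, qual) in pairs:
        seq = seq.replace(" ", "")
        scores = [int(q) for q in qual.split()]
        if name != qual_name or len(seq) != len(scores):
            raise ValueError("FASTA and QUAL records do not match for " + name)
        # Phred+33 (Sanger) encoding of the quality scores.
        encoded = "".join(chr(q + 33) for q in scores)
        out.append("@%s\n%s\n+\n%s\n" % (name, seq, encoded))
    return "".join(out)
```

Both arguments can be open file handles, for example the .fasta and .fasta.qual files written by the two sffinfo commands. Note that `zip` stops at the shorter input, so a truncated quality file would go unnoticed in this sketch.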

Last edited by martin2; 06-30-2014 at 02:31 PM. Reason: Typo
06-30-2014, 02:35 PM   #14
pmart1

I do have Newbler on my other computer. When I get home, I will run it and post the results. Thank you very much for all of your help! It is very appreciated.
06-30-2014, 06:31 PM   #15
pmart1

I was able to run it and these were the resulting metrics.

scaffoldMetrics
numberOfScaffolds = 51;
numberOfBases = 145548;
avgScaffoldSize = 2853;
N50ScaffoldSize = 2594, 19;
largestScaffoldSize = 12532;
numberOfScaffoldContigs = 51;
numberOfScaffoldContigBases = 145548;
avgScaffoldContigSize = 2853;
N50ScaffoldContigSize = 2594, 19;
largestScaffoldContigSize = 12532;
scaffoldEndMetrics
NoEdges = 97, 95.1%;
OneEdge = 1, 1.0%;
TwoEdges = 4, 3.9%;
ManyEdges = 0, 0.0%;
scaffoldGapMetrics
BothNoEdges = 0, 0.0%;
OneNoEdges = 0, 0.0%;
BothOneEdge = 0, 0.0%;
MultiEdges = 0, 0.0%;
largeContigMetrics
numberOfContigs = 2050;
numberOfBases = 1802379;
avgContigSize = 879;
N50ContigSize = 888;
largestContigSize = 12532;
Q40PlusBases = 1705911, 94.65%;
Q39MinusBases = 96468, 5.35%;
largeContigEndMetrics
NoEdges = 4052, 98.8%;
OneEdge = 29, 0.7%;
TwoEdges = 16, 0.4%;
ManyEdges = 3, 0.1%;
allContigMetrics
numberOfContigs = 4866;
numberOfBases = 2738478;
07-23-2014, 03:45 PM   #16
martin2

Hi pmart1,
sorry, I somehow did not receive an email update.

The assembly you show is bad: about 95% of scaffold ends (and almost 99% of large-contig ends) having no edge means it just did not assemble, almost at all. Was this only the 454 data, or 454 and Illumina together? Try to assemble the 454 data alone.
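One rough way to put numbers on this is to compare the assembled bases with the expected genome size. The sketch below is Python and assumes the genome is about 5 Mb, as stated in the first post; the base counts are taken from the posted metrics. The `n50` helper is a generic illustration for a list of contig lengths and is not how Newbler itself reports the value.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # First contig at which the running sum reaches half of all bases.
        if 2 * running >= total:
            return length
    return 0


def fraction_assembled(assembled_bases, expected_genome_size):
    """Assembled bases as a fraction of the expected genome size."""
    return assembled_bases / expected_genome_size


EXPECTED_GENOME = 5_000_000    # assumption: "5 Mb" from the first post
ALL_CONTIG_BASES = 2_738_478   # allContigMetrics numberOfBases
LARGE_CONTIG_BASES = 1_802_379  # largeContigMetrics numberOfBases

print("all contigs:   %.1f%%" % (100 * fraction_assembled(ALL_CONTIG_BASES, EXPECTED_GENOME)))
print("large contigs: %.1f%%" % (100 * fraction_assembled(LARGE_CONTIG_BASES, EXPECTED_GENOME)))
# prints 54.8% and 36.0%
```

So even counting every contig, only a little over half of the expected genome is covered, and the large contigs average under 1 kb.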

Are you sure you removed the adapters? You should really go to the lab protocols to check what you have in the files, or ask somebody who knows how to check the raw data. It is hard to tell remotely. Provided it is your special strain and top-secret ... I could help you debug the data via private email, depending on how much info you can disclose.

Martin

Tags
de novo, hybrid assembly, illumina & 454

All times are GMT -8.