SEQanswers

Old 10-13-2010, 02:23 PM   #1
Autotroph
Member
 
Location: Europe

Join Date: Oct 2010
Posts: 22
Default running Newbler with a lot of .sff files

I have around 100,000 .sff files that I want to use for an assembly with Newbler.

What would be the best strategy to use?

1) Combine the .sff files? Please let me know how I would do the combining.
2) Add each sff file individually with a batch script that uses addRun to add each file, and then run the assembler?

Note: I can't get these sff files as a single pre-combined file, so I would need to combine them myself, for example by converting them to FASTA and quality files and back again into sff files.

Could anybody please let me know if there is an easier way?

Thanks
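
Option 2 above can be sketched roughly as follows. This is only a minimal illustration, assuming the Newbler command-line tools newAssembly, addRun and runProject are on the PATH; the function name and the directory names are hypothetical, and the sketch only builds the command lines rather than running them.

```python
from pathlib import Path


def incremental_commands(project_dir, sff_dir):
    """Build (but do not run) the Newbler commands for adding files one by one."""
    # Sort so the order in which files are added is reproducible.
    sff_files = sorted(Path(sff_dir).glob("*.sff"))
    commands = [["newAssembly", str(project_dir)]]
    # One addRun call per file keeps every individual command line short.
    commands += [["addRun", str(project_dir), str(f)] for f in sff_files]
    commands.append(["runProject", str(project_dir)])
    return commands
```

Each returned list could be passed to subprocess.run(cmd, check=True) to actually execute it.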
Old 10-13-2010, 03:07 PM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,540
Default

Newbler will accept multiple SFF files but there are limits on things like command line length. You can also use the Roche tools to merge SFF files - either sfffile or possibly sffinfo.

Biopython can also be used to merge SFF files but doesn't really handle the undocumented Roche XML embedded manifest well. If you don't care about the manifest (or know how to merge these) it might be a useful alternative.

Note only SFF files from the same generation of Roche 454 can be merged - all the reads in a file must have the same number and pattern of flow data.
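
As a rough illustration of the command line length point, the sketch below splits a long list of file names into groups that each stay under a byte budget, so that every group could go to one command (an addRun or sfffile call, for example). The function names, the fallback value and the greedy grouping are assumptions made for this example, not anything Newbler itself provides.

```python
import os


def arg_limit(default=32768):
    """Best-effort lookup of the OS limit on total argument size, in bytes."""
    try:
        value = os.sysconf("SC_ARG_MAX")  # POSIX only
    except (AttributeError, ValueError, OSError):
        return default  # assumed conservative fallback
    return value if value > 0 else default


def batches(paths, budget):
    """Greedily group paths so each group's encoded size stays within budget."""
    group, used = [], 0
    for path in paths:
        cost = len(os.fsencode(path)) + 1  # bytes plus the terminating NUL
        if group and used + cost > budget:
            yield group
            group, used = [], 0
        group.append(path)
        used += cost
    if group:
        yield group
```

The environment variables count against the same limit on many systems, so it is probably safer to use a budget well below arg_limit(), for instance half of it.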
Old 10-13-2010, 03:34 PM   #3
Autotroph
Member
 
Location: Europe

Join Date: Oct 2010
Posts: 22
Default sfffile vs addRun

Thanks maubp

sfffile merges the sff files in a given directory. The performance of sfffile and addRun seems to be the same. However, addRun is better as it also adds the files to the project at the same time. Maybe the performance of Newbler is better with a single sff file than when adding multiple sff files.

Fortunately all data is from the same generation.
Old 10-13-2010, 10:58 PM   #4
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415
Default

You can just give newbler all sff files in one go, I don't think it will protest:

runAssembly -o yourproject /folder/*.sff

This would also fix any problems with files from multiple generations (not applicable to you).

Quote:
Originally Posted by Autotroph View Post
Maybe the performance of Newbler is better with a single sff file than when adding multiple sff files.
No, that should make no difference at all. The only thing that comes to mind here is incremental assembly with shotgun reads first, followed by paired end reads. I have not tested this, but I can vaguely remember somebody mentioning a difference in favor of incremental over all-in-one-go.

Just curious, 100 000 sff files? How did you manage that?
Old 10-13-2010, 11:50 PM   #5
Autotroph
Member
 
Location: Europe

Join Date: Oct 2010
Posts: 22
Default

Thanks flxlex

I tried this for about 500 files and it worked. Hope it works as the incremental procedure continues.

Well, the explanation is that the 100 000 sff files are not real data. They were generated using Flowsim, which simulates 454 reads (but only gives single-end reads).

These sff files have incremental coverage, e.g.:
First file - 100 reads
Second file - 200 reads
Third file - 300 reads
etc.

Then I do incremental assemblies with these files. The idea is to find incorrect assemblies at different coverage. Although a coverage of 10X should be good enough, how would it be affected by sequencing bias, uneven coverage of different regions, etc.? Please do let me know your thoughts about this approach.
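
To relate read counts like these to coverage, a minimal sketch follows, using the usual estimate coverage = number of reads × mean read length / genome size. The function names are hypothetical, and the read length and genome size in the usage note are made-up illustration values, not figures from this thread.

```python
import math


def expected_coverage(n_reads, mean_read_length, genome_size):
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * mean_read_length / genome_size


def reads_needed(target_coverage, mean_read_length, genome_size):
    """Smallest whole number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / mean_read_length)
```

With an assumed 400 bp mean read length and a 1 Mb genome, 100 reads give about 0.04X, and reads_needed(10, 400, 1_000_000) comes to 25000 reads for 10X.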
Old 10-16-2010, 06:21 AM   #6
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415
Default

Quote:
Originally Posted by Autotroph View Post
The idea is to find incorrect assemblies at different coverage. Although a coverage of 10X should be good enough, how would it be affected by sequencing bias, uneven coverage of different regions, etc.? Please do let me know your thoughts about this approach.
Interesting project. How are you going to detect sequencing bias by using simulated reads? Uneven coverage you will find, that is just plain stochastics (Poisson distributions and all that). Just not sure what you are looking for...
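
The Poisson remark can be made concrete with a small sketch. It assumes an idealized model in which read starts are uniformly random, so that per-base depth is roughly Poisson with mean equal to the average coverage; real or simulated data with bias will deviate from this. The function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k covering reads at a base."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_below(depth, mean_coverage):
    """Expected fraction of bases covered by fewer than `depth` reads."""
    return sum(poisson_pmf(k, mean_coverage) for k in range(depth))
```

For example, fraction_below(1, 10) equals exp(-10), so even at 10X a small fraction of bases is expected to have no coverage at all purely by chance.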
Old 10-16-2010, 08:10 AM   #7
Autotroph
Member
 
Location: Europe

Join Date: Oct 2010
Posts: 22
Default

Unfortunately runAssembly failed when I tried it with 1111 sff files. The assembly was going fine for smaller datasets. I didn't change the command, just added more files to the data directory.

Have I run out of memory, or is this a limit of Newbler? I am trying to combine the files using sfffile. However, I keep getting segmentation faults.

Given below is the error message:
Indexing lot of files....

Indexing 1111.sff...
-> 9 reads, 4286 bases.
Setting up long overlap detection...
-> 878 of 878, 867 reads to align
Building a tree for 4126 seeds...
Computing long overlap alignments...
-> 867 of 867
Setting up overlap detection...
-> 878 of 878, 867 reads to align
Building a tree for 32932 seeds...
Computing alignments...
-> 867 of 867
Checkpointing...
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check

Error: An internal error (assertion failure) has occurred in the computation.
Please report this error to your customer support representative.


To generate sequencing bias, I have created a huge "genome" which contains different possible sequences. Although it would not be possible to generate real sequencing bias this way, the effect the different sequences have on the assembly process may become clear. Flowsim is able to simulate homopolymer errors.

Finally the idea is to take up assembled genomes and check if such errors have occurred while assembling them.

Last edited by Autotroph; 10-16-2010 at 10:28 AM.
Old 10-21-2010, 02:07 AM   #8
Autotroph
Member
 
Location: Europe

Join Date: Oct 2010
Posts: 22
Default

The problem seemed to have several components:
1) Flowsim produced duplicate accession numbers when cutting at the same base twice.
2) Newbler does not accept duplicate accession numbers.
3) The files can be combined 'easily' 2 at a time using sfffile.

For a more detailed solution and the code I used, take a look at the link below:

http://nagarjunv.blogspot.com/2010/1...n-numbers.html
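
Independently of the linked code (not reproduced here), a duplicate check along these lines is a minimal sketch of how clashing accession numbers could be spotted before handing files to Newbler. The function name is hypothetical; the read names are assumed to have been collected already, for instance from record.id values when parsing each file with Biopython's SeqIO.parse(path, "sff").

```python
from collections import Counter


def duplicate_names(names):
    """Return {name: count} for every read name that occurs more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}
```

An empty result means all names are unique across the files that were scanned.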
Old 10-21-2010, 02:22 AM   #9
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,540
Default

Have you mentioned this to Ketil Malde? Maybe he can fix Flowsim to avoid duplicate accessions (this seems like a useful bug fix).
http://blog.malde.org/index.php/flowsim/
Old 10-21-2010, 07:08 AM   #10
Autotroph
Member
 
Location: Europe

Join Date: Oct 2010
Posts: 22
Default

Yes, I mailed him with a suggestion to include the read number at the end of the accession number. This should give unique accession numbers as long as the input accession numbers are unique.
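
One possible reading of that suggestion, sketched in Python: keep a running read counter per input accession and append it to the name. This is not Flowsim's actual code (Flowsim is not written in Python); the function name and the underscore separator are assumptions made for illustration.

```python
from collections import defaultdict


def number_reads(source_accessions, sep="_"):
    """Append a per-accession read number so every generated name is distinct."""
    seen = defaultdict(int)
    names = []
    for acc in source_accessions:
        seen[acc] += 1  # running count of reads drawn from this input sequence
        names.append(f"{acc}{sep}{seen[acc]}")
    return names
```

Two reads cut from the same input sequence then get different names even if they start at the same base.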
Old 02-12-2013, 12:20 AM   #11
Kaurh5
Junior Member
 
Location: India

Join Date: Feb 2013
Posts: 3
Default

Hi all!

I am trying to assemble low-coverage 454 data from a plant using Newbler/gsAssembler. I have two raw sff files from two different genotypes of my experimental plant. Newbler completes the assembly without any notable errors for the individual sff files. But when I try to assemble the sff files of both genotypes together (using incremental de novo assembly), it just adds up the total contigs and singletons, neglecting the possible common contigs between the two genotypes. To my understanding, Newbler is treating every read in both sff files as unique, which is very unlikely to be correct. My basic aim is to find the SNPs and repeats in the genome, so if Newbler is assembling every read into a unique contig, that is a matter of concern to me. Could anyone please explain this behaviour?
Old 10-31-2013, 05:04 PM   #12
martin2
Member
 
Location: Prague, Czech Republic

Join Date: Nov 2010
Posts: 40
Default

Quote:
Originally Posted by flxlex View Post
You can just give newbler all sff files in one go, I don't think it will protest:

runAssembly -o yourproject /folder/*.sff
+1 vote from me

Quote:
Originally Posted by flxlex View Post
This would also fix any problems with files from multiple generations (not applicable to you).
I would be strongly against merging SFF files together. We can only guess what newbler or other tools are doing while inspecting SFF data. I have a lot of experience with SFF files unpacked from SRA files and, in brief, I always split the merged SFF files back into separate files. The reason in my case is that reads from physically separated regions should be processed individually. Moreover, it saves you CPU and other resources in some cases if you do not mix different fruits together.


Quote:
Originally Posted by flxlex View Post
No, that should make no difference at all. The only thing that comes to mind here is incremental assembly with shotgun reads first, followed by paired end reads. I have not tested this, but I can vaguely remember somebody mentioning a difference in favor of incremental over all-in-one-go.
That is recommended in the Roche docs for newbler. I can never remember whether one should start with the shortest (shotgun) or the longest reads (20 kb paired-ends, 8 kb, 3 kb), but it is easy to look up in the docs on the web.