#1 |
Member
Location: Europe Join Date: Oct 2010
Posts: 22
|
I have around 100,000 .sff files that I want to assemble with Newbler. What would be the best strategy?
1) Combine the .sff files? If so, please let me know how to do the combining.
2) Add each .sff file individually with a batch script that calls addRun for each file, and then run the assembler?
Note: I can't get these .sff files as a single pre-combined file, so I would have to combine them myself, for example by converting them to FASTA and quality files and then back into .sff files. Could anybody please let me know if there is an easier way? Thanks
#2 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
Newbler will accept multiple SFF files, but there are limits on things like command line length. You can also use the Roche tools to merge SFF files: either sfffile or possibly sffinfo.
Biopython can also be used to merge SFF files, but it doesn't really handle the undocumented embedded Roche XML manifest well. If you don't care about the manifest (or know how to merge manifests yourself), it might be a useful alternative. Note that only SFF files from the same generation of Roche 454 can be merged: all the reads in a file must have the same number and pattern of flow data.
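As a rough illustration of the Biopython route, the sketch below concatenates the reads from every .sff file in a directory into one output file. It assumes Biopython is installed and that all inputs come from the same 454 generation (same flow pattern and key sequence). The function names, the `sff_input` directory and the `merged.sff` output name are hypothetical, and the embedded XML manifest is not merged.

```python
from pathlib import Path


def find_sff_files(directory):
    """Return the .sff files in `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.sff"))


def iter_reads(paths):
    """Yield every read from each SFF file in turn."""
    from Bio import SeqIO  # third-party: Biopython (imported lazily)

    for path in paths:
        with open(path, "rb") as handle:  # SFF is a binary format
            yield from SeqIO.parse(handle, "sff")


def merge_sff(paths, out_path):
    """Write all reads from `paths` into one SFF file; return the read count."""
    from Bio import SeqIO

    with open(out_path, "wb") as out_handle:
        return SeqIO.write(iter_reads(paths), out_handle, "sff")


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    inputs = find_sff_files("sff_input")
    print(merge_sff(inputs, "merged.sff"), "reads written")
```

The reads are streamed through a generator, so only one read needs to be held in memory at a time; keep the output file outside the input directory so a second run does not pick it up as an input.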
#3 |
Member
Location: Europe Join Date: Oct 2010
Posts: 22
|
Thanks maubp
sfffile merges the sff files in a given directory. The performance of sfffile and addRun seems to be the same. However, addRun is better, as it also adds the files to the project at the same time. Maybe Newbler performs better with a single sff file than with many separate sff files. Fortunately, all the data is from the same generation.
#4 | |
Moderator
Location: Oslo, Norway Join Date: Nov 2008
Posts: 415
|
You can just give newbler all sff files in one go, I don't think it will protest:
runAssembly -o yourproject /folder/*.sff
This would also fix any problems with files from multiple generations (not applicable to you).
Just curious, 100 000 sff files? How did you manage that?
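If the shell glob ever runs into the command-line length limits mentioned earlier in the thread, one alternative is the batch approach from the first post: add the files one at a time with addRun. The sketch below only builds and runs the command lists. It assumes that `newAssembly` and `runProject` are the matching Newbler project commands in your installed version (only `addRun` and `runAssembly` are named in this thread), and the function names are hypothetical.

```python
import glob
import subprocess


def build_commands(project_dir, sff_paths):
    """Return the commands for an incremental project as argument lists."""
    # Assumed Newbler project workflow: create, add each file, then assemble.
    commands = [["newAssembly", project_dir]]
    commands += [["addRun", project_dir, path] for path in sff_paths]
    commands.append(["runProject", project_dir])
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Paths taken from the example command above.
    files = sorted(glob.glob("/folder/*.sff"))
    run_all(build_commands("yourproject", files))
```

Because each addRun call receives a single path, no individual command line grows with the number of files.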
|
#5 |
Member
Location: Europe Join Date: Oct 2010
Posts: 22
|
Thanks flxlex
I tried this with about 500 files and it worked. I hope it keeps working as the incremental procedure continues. The explanation is that the 100 000 sff files are not real data. They were generated using Flowsim, which simulates 454 reads (but only gives single-end reads). The sff files have incremental coverage, e.g. first file: 100 reads, second file: 200 reads, third file: 300 reads, etc. I then do incremental assemblies with these files. The idea is to find incorrect assemblies at different coverage levels. Although a coverage of 10X should be good enough, how would it be affected by sequencing bias, uneven coverage of different regions, etc.? Please do let me know your thoughts about this approach.
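To relate the read counts per file to coverage, the usual estimate is coverage = reads × mean read length / genome size. A minimal sketch follows; the read length of 400 bases and the genome size of 40,000 bases are made-up values for illustration, not figures from this thread.

```python
import math


def mean_coverage(n_reads, mean_read_length, genome_size):
    """Expected mean coverage (the 'X' value) for a set of reads."""
    return n_reads * mean_read_length / genome_size


def reads_needed(target_coverage, mean_read_length, genome_size):
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / mean_read_length)


if __name__ == "__main__":
    READ_LENGTH = 400      # hypothetical mean read length
    GENOME_SIZE = 40_000   # hypothetical genome size
    for file_number in (1, 2, 3):
        n_reads = 100 * file_number  # 100, 200, 300 reads as in the post
        print(file_number, mean_coverage(n_reads, READ_LENGTH, GENOME_SIZE))
    print("reads for 10X:", reads_needed(10, READ_LENGTH, GENOME_SIZE))
```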
#6 |
Moderator
Location: Oslo, Norway Join Date: Nov 2008
Posts: 415
|
Interesting project. How are you going to detect sequencing bias by using simulated reads? Uneven coverage you will find, that is just plain stochastics (Poisson distributions and all that). Just not sure what you are looking for...
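The Poisson remark can be made concrete with a small sketch. It assumes the idealised case of reads placed uniformly at random, so that the depth at any base follows a Poisson distribution with the mean coverage as its parameter; the function names are hypothetical.

```python
import math


def poisson_pmf(k, mean_coverage):
    """Probability that a base is covered by exactly k reads."""
    return math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)


def fraction_below(depth, mean_coverage):
    """Expected fraction of bases covered by fewer than `depth` reads."""
    return sum(poisson_pmf(k, mean_coverage) for k in range(depth))


if __name__ == "__main__":
    for coverage in (1, 5, 10):
        print(
            f"{coverage}X: uncovered fraction {poisson_pmf(0, coverage):.2e}, "
            f"fraction with depth < 3: {fraction_below(3, coverage):.2e}"
        )
```

Even under this no-bias model some regions end up thin or uncovered at low mean coverage, which is the unevenness that comes purely from chance.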
|
#7 |
Member
Location: Europe Join Date: Oct 2010
Posts: 22
|
Unfortunately, runAssembly failed when I tried it with 1111 sff files. The assembly was going fine for smaller datasets. I didn't change the command, I just added more files to the data directory.
Have I run out of memory, or is this a limit of Newbler? I am trying to combine the files using sfffile. However, I keep getting segmentation faults. The error message is given below:
Indexing lot of files....
Indexing 1111.sff...
-> 9 reads, 4286 bases.
Setting up long overlap detection...
-> 878 of 878, 867 reads to align
Building a tree for 4126 seeds...
Computing long overlap alignments...
-> 867 of 867
Setting up overlap detection...
-> 878 of 878, 867 reads to align
Building a tree for 32932 seeds...
Computing alignments...
-> 867 of 867
Checkpointing...
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
Error: An internal error (assertion failure) has occurred in the computation. Please report this error to your customer support representative.
To mimic sequencing bias, I have created a huge "genome" that contains the different possible sequences. Although it will not be possible to generate real sequencing bias this way, the effect that different sequences have on the assembly process may become clear. Flowsim is able to simulate homopolymer errors. Finally, the idea is to take assembled genomes and check whether such errors occurred while assembling them.
Last edited by Autotroph; 10-16-2010 at 11:28 AM.
#8 |
Member
Location: Europe Join Date: Oct 2010
Posts: 22
|
The problem seemed to have several components:
1) Flowsim produced duplicate accession numbers when cutting at the same base twice.
2) Newbler does not accept duplicate accession numbers.
3) The files can be combined 'easily' two at a time using sfffile.
For a more detailed solution and the code I used, take a look at the link below: http://nagarjunv.blogspot.com/2010/1...n-numbers.html
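A quick way to check for the first problem before handing files to Newbler is to count how often each accession occurs. This is only a minimal sketch: it assumes the accessions have already been extracted into a list of strings by whatever tool you prefer, and the function name is hypothetical (it is not the code from the linked post).

```python
from collections import Counter


def duplicate_accessions(accessions):
    """Return {accession: count} for every accession seen more than once."""
    counts = Counter(accessions)
    return {acc: n for acc, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up accessions for illustration.
    example = ["read_A", "read_B", "read_A", "read_C"]
    for acc, n in duplicate_accessions(example).items():
        print(f"{acc} occurs {n} times")
```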
#9 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
Have you mentioned this to Ketil Malde? Maybe he can fix Flowsim to avoid duplicate accessions (this seems like a useful bug fix).
http://blog.malde.org/index.php/flowsim/ |
#10 |
Member
Location: Europe Join Date: Oct 2010
Posts: 22
|
Yes, I mailed him with a suggestion to include the read number at the end of the accession number. This should give unique accession numbers as long as the input accession numbers are unique.
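The suggested naming scheme can be sketched as follows. This illustrates the idea only and is not Flowsim's actual behaviour or output format: each read is given as the accession of the input sequence it was cut from, and the underscore separator and the function name are assumptions.

```python
from collections import defaultdict


def number_reads(source_accessions):
    """Append a per-source read number to each read's source accession."""
    reads_seen = defaultdict(int)
    names = []
    for acc in source_accessions:
        reads_seen[acc] += 1  # running read number for this input sequence
        names.append(f"{acc}_{reads_seen[acc]}")
    return names


if __name__ == "__main__":
    # Two reads cut from the same input sequence no longer share a name.
    print(number_reads(["seq1", "seq1", "seq2"]))
```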
#11 |
Junior Member
Location: India Join Date: Feb 2013
Posts: 3
|
Hi all!
I am trying to assemble low-coverage 454 data from a plant using Newbler/gsassembler. I have two raw sff files from two different genotypes of my experimental plant. Newbler completes the assembly without any notable errors for the individual sff files. But when I try to assemble the sff files of both genotypes together (using incremental de novo assembly), it just adds up the total contigs and singletons, ignoring the possible common contigs between the two genotypes. To my understanding, Newbler is treating every read in both sff files as unique, which is very unlikely. My basic aim is to find the SNPs and repeats in the genome, and if Newbler is assembling every read into a unique contig, this is a concern for me. Please provide an explanation for this behaviour.
#12 | |||
Member
Location: Prague, Czech Republic Join Date: Nov 2010
Posts: 40
|
![]() Quote:
Quote:
Quote:
|
|||
Tags |
assembly, newbler, sff |