  • running Newbler with a lot of .sff files

    I have around 100,000 .sff files that I want to use for an assembly with Newbler.

    What would be the best strategy to use?

    1) Combine the .sff files? Please let me know how I would do the combining.
    2) Add each SFF file individually via a batch script that uses addRun, then run the assembler?

    Note: I can't get these SFF files as a single pre-combined file; the only way I know to combine them is to convert them to FASTA and quality files and back again into SFF files.

    Could anybody please let me know if there is an easier way?

    Thanks

  • #2
    Newbler will accept multiple SFF files, but there are limits on things like command-line length. You can also use the Roche tools to merge SFF files - either sfffile or possibly sffinfo.

    Biopython can also be used to merge SFF files, but it doesn't really handle the undocumented Roche XML manifest embedded in the file. If you don't care about the manifest (or know how to merge these) it might be a useful alternative.

    Note that only SFF files from the same generation of Roche 454 can be merged - all the reads in a file must have the same number and pattern of flows.
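    With this many files, the command-line length limit makes a single sfffile call impractical, so one option is to merge in batches. A minimal sketch in Python (assuming Roche's sfffile is on the PATH; the batch size, intermediate file names, and helper name are my own, not from any Roche docs):

```python
def sfffile_merge_plan(sff_files, out="merged.sff", batch=500):
    """Build a list of sfffile commands that merge many SFF files
    in batches, keeping each individual command line short enough
    for the OS. Each step folds the previous intermediate result in."""
    cmds = []
    current = None  # name of the running merged file
    for step, i in enumerate(range(0, len(sff_files), batch)):
        chunk = sff_files[i:i + batch]
        nxt = f"merge_step{step}.sff"
        inputs = ([current] if current else []) + chunk
        cmds.append(["sfffile", "-o", nxt] + inputs)
        current = nxt
    cmds.append(["mv", current, out])  # final rename to the target name
    return cmds
```

    Each entry can then be executed with subprocess.run(cmd, check=True). Passing the arguments as a list (rather than one shell string) sidesteps shell quoting issues with odd file names.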



    • #3
      sfffile vs addRun

      Thanks maubp

      sfffile merges the SFF files in a given directory. The performance of sfffile and addRun seems to be the same, but addRun is more convenient because it also adds the file to the project at the same time. Maybe Newbler performs better with a single SFF file than with multiple SFF files added separately.

      Fortunately all data is from the same generation.



      • #4
        You can just give Newbler all the SFF files in one go; I don't think it will protest:

        runAssembly -o yourproject /folder/*.sff

        This would also fix any problems with files from multiple generations (not applicable to you).

        Originally posted by Autotroph View Post
        Maybe Newbler performs better with a single SFF file than with multiple SFF files added separately.
        No, that should make no difference at all. The only thing that comes to mind here is incremental assembly with shotgun reads first, followed by paired end reads. I have not tested this, but I can vaguely remember somebody mentioning a difference in favor of incremental over all-in-one-go.

        Just curious: 100,000 SFF files? How did you manage that?



        • #5
          Thanks flxlex

          I tried this for about 500 files and it worked. Hope it works as the incremental procedure continues.

          Well, the explanation is that the 100,000 SFF files are not real data. They were generated using Flowsim, which simulates 454 reads (but only gives single-end reads).

          These SFF files have incrementally increasing coverage, e.g.:
          first file - 100 reads
          second file - 200 reads
          third file - 300 reads
          etc.

          Then I do incremental assemblies with these files. The idea is to find incorrect assemblies at different coverage levels. Although a coverage of 10X should be good enough, how would the assembly be affected by sequencing bias, uneven coverage of different regions, etc.? Please do let me know your thoughts on this approach.
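          In case it helps anyone, the incremental runs themselves can be scripted. A sketch of the command sequence (assuming the standard Newbler CLI tools newAssembly, addRun and runProject; the project name and helper name are placeholders of my own):

```python
def incremental_commands(sff_files, project="incr_asm"):
    """Command sequence for incremental Newbler assemblies:
    create the project once, then add one SFF file at a time and
    re-run, producing one assembly per coverage level."""
    cmds = [["newAssembly", project]]
    for sff in sff_files:
        cmds.append(["addRun", project, sff])
        cmds.append(["runProject", project])
    return cmds
```

          Running each list entry with subprocess.run(cmd, check=True) stops the loop at the first failing step instead of silently continuing.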



          • #6
            Originally posted by Autotroph View Post
            Idea is to find incorrect assemblies at different coverage. Although a coverage of 10X should be good enough, how would it be affected by sequencing bias, uneven coverage of different regions etc. Please do let me know your thoughts about this approach.
            Interesting project. How are you going to detect sequencing bias using simulated reads? Uneven coverage you will find - that is just plain stochastics (Poisson distributions and all that). I'm just not sure what you are looking for...



            • #7
              Unfortunately runAssembly failed when I tried it with 1111 SFF files; the assembly was going fine for smaller datasets. I didn't change the command, just added more files to the data directory.

              Have I run out of memory, or is this a limit of Newbler? I am now trying to combine the files using sfffile; however, I keep getting segmentation faults.

              Given below is the error message:
              Indexing lot of files....

              Indexing 1111.sff...
              -> 9 reads, 4286 bases.
              Setting up long overlap detection...
              -> 878 of 878, 867 reads to align
              Building a tree for 4126 seeds...
              Computing long overlap alignments...
              -> 867 of 867
              Setting up overlap detection...
              -> 878 of 878, 867 reads to align
              Building a tree for 32932 seeds...
              Computing alignments...
              -> 867 of 867
              Checkpointing...
              terminate called after throwing an instance of 'std::out_of_range'
              what(): vector::_M_range_check

              Error: An internal error (assertion failure) has occurred in the computation.
              Please report this error to your customer support representative.


              To mimic sequencing bias, I have created a huge "genome" containing different possible sequences. Although it is not possible to generate true sequencing bias this way, the effect the different sequences have on the assembly process may become clear. Flowsim is able to simulate homopolymer errors.

              Finally, the idea is to take assembled genomes and check whether such errors occurred during their assembly.
              Last edited by Autotroph; 10-16-2010, 10:28 AM.



              • #8
                The problem turned out to have several components:
                1) Flowsim produced duplicate accession numbers when cutting at the same base twice.
                2) Newbler does not accept duplicate accession numbers.
                3) The files can be combined 'easily' two at a time using sfffile.

                For a more detailed solution and the code I used, take a look at the link below:

                Having simulated loads of 454 data with Flowsim, I wanted to assemble everything. Unfortunately Flowsim gives one output .sff file for one...
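                As a quick sanity check before merging, the duplicate accessions can be spotted from the read names alone (a minimal sketch of my own; extracting the names from the SFF files, e.g. with sffinfo, is assumed to have been done separately):

```python
from collections import Counter

def duplicate_accessions(read_names):
    """Return accession IDs that occur more than once. Newbler
    refuses input containing duplicate read names, so it pays to
    check before spending hours on a merge."""
    return sorted(name for name, n in Counter(read_names).items() if n > 1)
```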



                • #9
                  Have you mentioned this to Ketil Malde? Maybe he can fix Flowsim to avoid duplicate accessions (this seems like a useful bug fix).



                  • #10
                    Yes, I mailed him with a suggestion to include the read number at the end of the accession number. This should give unique accession numbers as long as the input accession numbers are unique.
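                    The suggested fix amounts to something like this (a hypothetical sketch of my own, not Flowsim's actual code):

```python
def uniquify_accessions(accessions):
    """Append a running read number to each accession so that the
    resulting names are unique across the whole output."""
    return [f"{acc}.{i}" for i, acc in enumerate(accessions, start=1)]
```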



                    • #11
                      Hi all!

                      I am trying to assemble low-coverage 454 data from a plant using Newbler/gsAssembler. I have two raw SFF files from two different genotypes of my experimental plant. Newbler completes the assembly step without any notable error for each SFF individually. But when I try to assemble the SFF files of both genotypes together (using incremental de novo assembly), it just adds up the total contigs and singletons, neglecting the possible common contigs between the two genotypes. To my understanding, Newbler is treating every read in both SFF files as unique, which is very unlikely to be true. My basic aim is to find the SNPs and repeats in the genome, and if Newbler is assembling every read into a unique contig, that is a matter of concern for me. Please provide an explanation for this behaviour.



                      • #12
                        Originally posted by flxlex View Post
                        You can just give Newbler all the SFF files in one go; I don't think it will protest:

                        runAssembly -o yourproject /folder/*.sff
                        +1 vote from me

                        Originally posted by flxlex View Post
                        This would also fix any problems with files from multiple generations (not applicable to you).
                        I would be strongly against merging SFF files together. We can only guess what Newbler or other tools are doing while inspecting SFF data. I have a lot of experience with SFF files unpacked from SRA files, and in brief, I always split the merged SFF files back into separate files. The reason in my case is that reads from physically separated regions should be processed individually. Moreover, it saves you CPU and other resources in some cases if you do not mix different fruits together.


                        Originally posted by flxlex View Post
                        No, that should make no difference at all. The only thing that comes to mind here is incremental assembly with shotgun reads first, followed by paired end reads. I have not tested this, but I can vaguely remember somebody mentioning a difference in favor of incremental over all-in-one-go.
                        That is recommended in the Roche docs for Newbler. I never remember whether one should start with the shortest (shotgun) or the longest reads (20 kb paired-end, 8 kb, 3 kb), but it is easy to look up in the docs on the web.

