  • Swift: Open source primary data analysis for Next-gen sequencers

    Right now that primary data is processed with closed source proprietary tools provided by the manufacturer. That's really unfortunate because the data is being used to draw scientific conclusions. It's difficult to trust your data and understand the artifacts in it if the data analysis algorithms are not open to peer review. Not only that but it means you can't easily change things and try out new methods.

    Until recently I was working at the Sanger Institute and in order to address this we have been developing a primary data analysis package for next-gen sequence data. At the moment our tools are aimed at Illumina data, but it should be possible to adapt them for processing SOLiD images as well.

    I've recently left Sanger to pursue a career in next-next-gen sequencing at Oxford Nanopore Technologies. I'm going to continue developing Swift, as will my colleagues at Sanger (particularly Tom Skelly, who's put a lot of work into Swift).

    While Swift is fully functional, it could do with more validation and testing. However, we've decided that we'd like to make it available to the wider community in the hope of gaining support and ideally attracting more developers.

    Right now, the post-image-analysis corrections (basecalling) in Swift work well and generally produce error rates lower than the Illumina pipeline. This part is probably ready for production usage, so feel free to try it out and let us know what you find.

    The native image analysis works but is more of a work in progress. We'd like people to try it out too and tell us what happens.

    Swift is available under LGPL3 at: http://swiftng.sourceforge.net

    You'll need to check it out of the subversion repository to run it, but it should be reasonably straightforward. Please email me if you have any trouble.

    I'm very interested in getting any feedback, positive or negative. You can either post here or contact me directly: new at sgenomics dot org.

  • #2
    cool

    I wonder if it can be put onto a boot DVD and run on the iPar computers, with data mirrored in real time using the Sanger mirroring scripts?



    • #3
      Originally posted by cgb
      I wonder if it can be put onto a boot DVD and run on the iPar computers, with data mirrored in real time using the Sanger mirroring scripts?
      Yes, this absolutely should be possible and is something we'd like to look into. Users interested in doing this are encouraged to make contact.



      • #4
        Could you maybe share some stats as to how Swift performs vs the current version of Bustard?
        E.g. amount of data/reads mapped, error rate for the same lane analysed both ways.
        thanks
        david



        • #5
          Originally posted by dvh
          Could you maybe share some stats as to how Swift performs vs the current version of Bustard?
          E.g. amount of data/reads mapped, error rate for the same lane analysed both ways.
          thanks
          david
          I'm still in the process of validating it on non-phiX data. For the phiX data I've looked at, against the 1.0 pipeline I've seen 20% more PF reads at a similar error rate.

          In terms of runtime, a GA1 single-end run takes around 10 minutes end to end. A GA2 37-cycle paired-end run takes around an hour end to end.
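For anyone wanting to run the same-lane comparison asked for above, the two headline figures are simple ratios. The sketch below is only an illustration: the function names are made up for this example, and the read and base counts are hypothetical placeholders chosen to give round numbers, not measurements from Swift or Bustard.

```python
def relative_gain(baseline_reads, new_reads):
    """Fractional increase in pass-filter (PF) reads relative to the baseline pipeline."""
    return (new_reads - baseline_reads) / baseline_reads


def error_rate(mismatched_bases, aligned_bases):
    """Per-base error rate: mismatches against the reference divided by aligned bases."""
    return mismatched_bases / aligned_bases


# Hypothetical counts for one phiX lane analysed both ways (NOT measured data).
bustard = {"pf_reads": 5_000_000, "aligned_bases": 180_000_000, "mismatches": 1_800_000}
swift = {"pf_reads": 6_000_000, "aligned_bases": 216_000_000, "mismatches": 2_160_000}

gain = relative_gain(bustard["pf_reads"], swift["pf_reads"])
print(f"PF read gain: {gain:.0%}")
for name, lane in (("Bustard", bustard), ("Swift", swift)):
    rate = error_rate(lane["mismatches"], lane["aligned_bases"])
    print(f"{name} error rate: {rate:.2%}")
```

In practice the mismatch and aligned-base counts would come from aligning each pipeline's PF reads to the phiX reference with whatever aligner you already use.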



          • #6
            In terms of memory usage we're trying to stay within a 2Gb limit. A 37Gb paired end peaks at around 1Gb.



            • #7
              BTW - the link: http://swiftng.sourceforge.net appears to be broken.

              The connection seems to be a problem only from my desktop at work (which is behind a US government firewall). From other locations I can get through OK.
              Last edited by timread; 11-18-2008, 12:44 PM. Reason: clarification of connection problem



              • #8
                works for me



                • #9
                  Is it normal to see different output when running the same binary version of Swift multiple times on the same computer, and when running it on different computers? I observed both. It looks like most of the differences in the FASTQ output are in the quality scores.



                  • #10
                    Originally posted by iris42
                    Is it normal to see different output when running the same binary version of Swift multiple times on the same computer, and when running it on different computers? I observed both. It looks like most of the differences in the FASTQ output are in the quality scores.
                    When running on different computers, it's quite likely that the output will vary slightly, as they are likely to have different floating point implementations.

                    Seeing differences on the same computer is a little odd. How different are the results? If it's a small difference, then this could be down to the FFTW implementation we are using, which sometimes employs a non-deterministic algorithm.
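One way to answer "how different are the results?" is to tally, record by record, whether two runs differ in the called bases or only in the quality strings. The following is a minimal sketch, not part of Swift: `read_fastq` and `compare_runs` are hypothetical helpers, and it assumes well-formed four-line FASTQ records with both files listing the same reads in the same order. If the two runs pass different sets of reads through filtering, the records fall out of step and show up as name mismatches.

```python
from collections import Counter
from itertools import islice, zip_longest


def read_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines, assuming 4 lines per record."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = record
        yield header[1:], seq, qual  # drop the leading '@' from the name


def compare_runs(lines_a, lines_b):
    """Classify each pair of records from two runs of the same lane."""
    counts = Counter()
    for rec_a, rec_b in zip_longest(read_fastq(lines_a), read_fastq(lines_b)):
        if rec_a is None or rec_b is None:
            counts["unpaired"] += 1          # one file has extra records
        elif rec_a[0] != rec_b[0]:
            counts["name_mismatch"] += 1     # files are out of step
        elif rec_a[1] != rec_b[1]:
            counts["sequence_differs"] += 1  # at least one base call changed
        elif rec_a[2] != rec_b[2]:
            counts["quality_only"] += 1      # same bases, different qualities
        else:
            counts["identical"] += 1
    return counts


# Tiny made-up example: the second run differs in one quality character.
run1 = ["@r1", "ACGT", "+", "IIII", "@r2", "GGCA", "+", "IIII"]
run2 = ["@r1", "ACGT", "+", "IIIH", "@r2", "GGCA", "+", "IIII"]
print(compare_runs(run1, run2))
```

Because both functions accept any iterable of lines, open file handles can be passed in directly, for example `compare_runs(open("run1.fastq"), open("run2.fastq"))` with your own file names.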



                    • #11
                      Originally posted by timread
                      BTW - the link: http://swiftng.sourceforge.net appears to be broken.

                      The connection seems to be a problem only from my desktop at work (which is behind a US government firewall). From other locations I can get through OK.
                      Odd. You can try: http://sgenomics.org/swift/ which should also work.



                      • #12
                        I'm quite interested in using open-source software for scientific work. We have recently acquired an Illumina GAII machine, and are trying to come up with data management solutions. Right now we are planning to throw away the images after the primary analysis (base-calling) is completed. We are saving the intensity and noise files, but not the images, which seems to be fairly common. However, it seems that this software requires the original images, which makes sense, but would limit our ability to use it on past experiments.

                        Would it be feasible to use swift on the Firecrest output (intensity and noise)?

                        Do many labs actually save the image files?

                        It seems like an ideal initial setup would be to process the images with both the Illumina pipeline and Swift. Has anyone yet set this up?



                        • #13
                          Sanger have it set up - talk to Tom Skelly.

                          Images are still very diagnostic of any issue with your sample, sequencer or run. Looking at images allowed Sanger to optimise their pipeline. For example: your flowcell quality goes down; an operator gets oil on the flowcell; your focusing is off and you suddenly get lots of strange new 'contaminants' in your output file as a result; your base qualities all drop halfway through your project; your data goes bad and, when you look, your clusters look weird because of an issue with your cluster station; there's stuff growing in your reagents that appears as blobs on the images (but is not visible to the naked eye); or your flowcell surface isn't there. You should keep the images for QC, then throw them away. Generally (but not in all cases), higher-throughput labs with big projects indulge in some image retention for some period.



                          • #14
                            Originally posted by lparsons
                            I'm quite interested in using open-source software for scientific work. We have recently acquired an Illumina GAII machine, and are trying to come up with data management solutions. Right now we are planning to throw away the images after the primary analysis (base-calling) is completed. We are saving the intensity and noise files, but not the images, which seems to be fairly common. However, it seems that this software requires the original images, which makes sense, but would limit our ability to use it on past experiments.
                            Are you using iPar to process the images and then mirroring off the intensity files? Swift will process from intensity files (as produced by the UNIX pipeline). I've heard the iPar intensity format is different from that used by the UNIX pipeline. If someone wants to send me a sample file, I'll write a parser for it.

                            Originally posted by lparsons
                            Would it be feasible to use swift on the Firecrest output (intensity and noise)?
                            Yes, it's feasible. I would hope the results would be comparable with the Illumina pipeline.

                            Originally posted by lparsons
                            Do many labs actually save the image files?

                            It seems like an ideal initial setup would be to process the images with both the Illumina pipeline and Swift. Has anyone yet set this up?
                            As mentioned, Sanger save the images while they do QC. The images are mirrored off as the run progresses and processed using the UNIX pipeline on a separate cluster.

                            If you're interested in trying out Swift, drop me an email at new at sgenomics dot org. It's in "active development" at the moment and I'm happy to work with people on any issues that come up.



                            • #15
                              Are there any updates on Swift? For example, data sizes, number of files generated, or a comparison with Illumina pipeline results?
                              --
                              bioinfosm
