  • Combining 454FLX and SOLiD runs for de novo genome assembly

    I have a project with 1.5 plates' worth of 454FLX data (mixed paired-end/single-read), followed by a SOLiD run.

    The genome in question is ~11 Mbp and has no reference to assemble against, as it's a non-model organism.

    The 454 runs have been assembled with Newbler, but I'm interested in strategies and packages for combining the 454 and SOLiD data together.

    Any pitfalls, protocols or papers I should be aware of?

    Bukwoski

  • #2
    You can try the Velvet assembler; it accepts both long and short reads.



    • #3
      Related question here.
      SOLiD uses colorspace and Velvet is colorspace-aware,
      so should we assemble in colorspace?
      I.e., convert the 454 reads (or maybe even BAC clone reads from Sanger sequencing?) to colorspace and assemble?


      If my RAM per core is only 2 GB, can I assemble subsets in Velvet (splitting the reads into sets of 20 million) and then run Velvet again to reassemble across all of the reads?
      http://kevin-gattaca.blogspot.com/



      • #4
        My suggestion is to assemble the 454 and SOLiD (Velvet) reads separately, then combine the two assemblies. I have successfully assembled one genome using this method.



        • #5
          Has anyone had a chance to try Velvet on 454 and SOLiD?

          I too have Roche reads aligned by Newbler and would like to combine them with SOLiD reads. I'm working with the transcriptome of an organism with no reference.
          If I do the assemblies separately, how can I combine them?
          Does anyone have experience with translating 454 reads to colorspace and then running Velvet on them together with the SOLiD reads?

          I'm new here and would love to learn from other people's mistakes.



          • #6
            I think the Matz Lab did an excellent job with that combination on the coral transcriptome.
            Check out:


            Coral Transcriptomics-a budget NGS approach?
            Was surprised I didn't blog about this earlier. Dr Mikhail Matz is a researcher in the field of coral genomics. His approach to doing de n...

            It has a summary of the tools required for the pipeline.

            Basically, the 454 reads produced a patchy transcriptome that could be annotated, and adding the SOLiD reads allowed a reasonable amount of data to be extracted.

            The scripts are posted on the web site as well.

            Do share your findings. This is an area I am keen to explore once I get my hands on the data as well.
            http://kevin-gattaca.blogspot.com/



            • #7
              I have used MIRA to do de novo bacterial genome assemblies using 454 and Illumina sequencing data. I used MIRA because it is a true hybrid de novo assembler. Many of the assemblers that can use reads from different technologies perform an iterative assembly: they first assemble the 454 reads and then layer the Illumina reads on top. MIRA uses all of the data equally in the assembly, irrespective of technology. In my experience, rather large contigs were generated (up to a couple hundred kb) from 100-fold coverage of ~350 nt 454 reads plus 100-fold coverage of 36 nt Illumina reads. After the initial MIRA assembly I further assembled the contigs into larger supercontigs, as it was clear that many of the contigs overlapped but could not coalesce because junk reads had been assembled onto their ends.



              • #8
                Originally posted by sbberes:
                I have used MIRA to do de novo bacterial genome assemblies using 454 and Illumina sequencing data. I used MIRA because it is a true hybrid de novo assembler. Many of the assemblers that can use reads from different technologies perform an iterative assembly: they first assemble the 454 reads and then layer the Illumina reads on top. MIRA uses all of the data equally in the assembly, irrespective of technology. In my experience, rather large contigs were generated (up to a couple hundred kb) from 100-fold coverage of ~350 nt 454 reads plus 100-fold coverage of 36 nt Illumina reads. After the initial MIRA assembly I further assembled the contigs into larger supercontigs, as it was clear that many of the contigs overlapped but could not coalesce because junk reads had been assembled onto their ends.
                I do not have much experience with assembly, but I had the impression that 100x coverage is sufficient for de novo assemblies,
                and for bacterial genomes I had assumed it should be a cinch.
                Was the 200x combined coverage from 454 and Solexa really necessary?
                (This is worrying, as it might mean I have to sequence to 200x on SOLiD or possibly get sequence from a 454 somehow.)

                P.S. MIRA link: http://www.chevreux.org/projects_mira.html
                http://kevin-gattaca.blogspot.com/



                • #9
                  Kevin,
                  No, 100x coverage from both technologies was, I am sure, overkill. Both of these sequencing instruments, using the stock protocols, are really too large in capacity for bacterial-sized genomes (squirrel hunting with a bazooka), but at the time I did not yet have barcoding up and running so that I could multiplex my runs. That said, I have not gone back and run multiple assemblies using smaller portions of the data to determine what the minimal requirements are. I suspect that about 15-20x coverage with both technologies would suffice. Given pyrosequencing's difficulties with homopolymeric tracts, you really are much better off doing hybrid assemblies.
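One way to probe the minimal coverage, as suggested above, would be to subsample the reads and re-run the assembly on each subset. A minimal Python sketch, assuming the reads are already loaded into a list and that coverage scales linearly with read count. The function names are hypothetical and not part of MIRA or any other assembler:

```python
import random


def subsample_fraction(current_cov, target_cov):
    """Fraction of reads to keep to go from current_cov down to target_cov."""
    if current_cov <= 0:
        raise ValueError("current coverage must be positive")
    return min(1.0, target_cov / current_cov)


def subsample_reads(reads, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of the reads."""
    rng = random.Random(seed)  # fixed seed so a subset can be regenerated
    keep = round(len(reads) * fraction)
    return rng.sample(reads, keep)


# Illustrative only: 1,000 placeholder reads standing in for a ~100x data set.
reads = [f"read_{i}" for i in range(1000)]
for target in (15, 20, 50):
    subset = subsample_reads(reads, subsample_fraction(100, target))
    print(target, len(subset))
```

For paired-end data the pairs would need to be sampled together, which this sketch does not handle.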
                  SBB



                  • #10
                    Originally posted by sbberes:
                    ... After the initial MIRA assembly I further assembled the contigs into larger supercontigs ...
                    How exactly did you do this?
                    --
                    Jeremy Leipzig
                    Bioinformatics Programmer



                    • #11
                      Jeremy,
                      Our laboratory does de novo bacterial genome sequencing and a lot of pathogenomic resequencing for comparative population genomic investigations (Staph, Strep, and TB). The most recent genomes being sequenced are ~2 Mbp in size. The most recent de novo assemblies were accomplished by combining data from pyrosequencing on a 454 Titanium instrument (~0.5 million reads with an average read length of 350 nt) and from an Illumina GAII instrument (~5 million reads of 36 nt). Reads from these instruments were first preprocessed using the FASTX toolkit to filter out low-quality and redundant artifactual reads (primer-derived sequences). The filtered data were then fed into MIRA using the recommended protocol and parameters. MIRA was run on a desktop machine with 8 cores and 12 GB RAM running Ubuntu. I think it took a couple of days to process; that is, it ran over a weekend. This process was run for two strains.

                      The resultant FASTA file of contigs was then filtered to remove contigs of less than 0.5 kb (leaving ~40 contigs, the largest of which were in the 150 to 200 kbp range). The filtered contigs were aligned to the genome of a related strain to order them. The contigs were then fed into Sequencher, where they were trimmed on the ends if needed, and overlapping contigs were assembled into supercontigs (~10 per genome). Virtually all of the breakpoints remaining in the assembly were large repeated elements, such as rRNA operons, 1.5 kb transposons, and some phage lytic cassettes. The gaps were PCR amplified and walked using Sanger sequencing. Regions of overlap in the contigs with discrepant base calls were resolved with Sanger sequencing.

                      After final assembly the ~5 million Illumina reads were compared to the genome using VAAL. ~20 polymorphisms were identified, virtually all of them in homopolymeric nt tracts. Most of the polymorphisms lay in coding sequences and shifted the reading frames, disrupting the gene. This indicated that despite using two different sequencing technologies and the hybrid assembly, a couple of handfuls of errors still likely occurred. These were again resolved by Sanger sequencing. 20 errors at the end of the process for a 2 Mbp genome is not too shabby. The smaller contigs, i.e. those less than 500 nt in size, were also compared to the assembly, and virtually all of them did assemble/overlap with the genome, so there was no indication that these smaller contigs represented sequence not present in the final assembly.
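As a back-of-the-envelope check, the depth implied by the read counts in this post can be estimated as total sequenced bases divided by genome size. A minimal Python sketch using the approximate figures quoted above. It ignores the reads removed by filtering, so the real depth fed to the assembler would be somewhat lower:

```python
def fold_coverage(n_reads, mean_read_len, genome_size):
    """Expected depth = total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


GENOME_SIZE = 2_000_000  # ~2 Mbp, as stated in the post

cov_454 = fold_coverage(500_000, 350, GENOME_SIZE)        # 454 Titanium reads
cov_illumina = fold_coverage(5_000_000, 36, GENOME_SIZE)  # Illumina GAII reads

print(f"454: {cov_454:.1f}x, Illumina: {cov_illumina:.1f}x")
# prints: 454: 87.5x, Illumina: 90.0x
```

Both values are in line with the roughly 100-fold coverage per technology mentioned earlier in the thread.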
                      SBB



                      • #12
                        Steve, thanks for detailing your approach. I find this thread pretty interesting.

                        We do a lot of big plant transcriptomes. I am reluctant to feed 100M+ reads to MIRA, so I have tried feeding it a Newbler assembly of 454 + Velvet assembly of Solexa, both masquerading as Sanger reads. The results are certainly better than either alone, but nothing spectacular.

                        It would be nice if MIRA had a setting that identifies certain long-read input as "homopolymer-prone", but maybe that is too controversial.
                        --
                        Jeremy Leipzig
                        Bioinformatics Programmer



                        • #13
                          Thank you all for the input!

                          MIRA sounds interesting.

                          So far, I have tried double encoding my 454 reads and feeding them to velvet as colorspace reads. Unfortunately, this gives me a segmentation fault when I try running velvetg...
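For readers unfamiliar with the term, double encoding rewrites each SOLiD color (0-3) as a pseudo-base so that base-space tools can read colorspace data. A minimal Python sketch of the idea, not the poster's script. Dropping the primer base and first color is an assumption about the convention in use, and the function names are hypothetical:

```python
# Double encoding writes each SOLiD color as a pseudo-base so that
# base-space tools can read colorspace data: 0->A, 1->C, 2->G, 3->T.
DOUBLE_ENCODE = str.maketrans("0123.", "ACGTN")


def double_encode(colors):
    """Translate a string of colors (0-3, '.' for missing) to pseudo-bases."""
    return colors.translate(DOUBLE_ENCODE)


def csfasta_to_double_encoded(read):
    """Double-encode one csfasta read such as 'T01230'.

    Assumption: the leading primer base and the first color (which spans
    the primer/read junction) are dropped before encoding.
    """
    if read and read[0].isalpha():
        read = read[2:]
    return double_encode(read)


print(csfasta_to_double_encoded("T01230"))  # prints: CGTA
```

The resulting letters are colors in disguise, so any contigs built from them still have to be translated back to base space afterwards.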



                          • #14
                            Update

                            OK, so the good news is that a de novo run of 454 and SOLiD reads really can be done. Of course, there's also bad news:

                            So I took my 454 reads and converted them to colorspace using a script I wrote (and verified all was well). Then I fed them into the SOLiD preprocessor for Velvet. I preprocessed my SOLiD reads too and then fed them all into velvet_de. At this point I defined both groups of reads as 'short'. I got a bunch of contigs with a maximal length of 820 bp. Nice but not amazing. I ran these contigs through the post processor and denovoadp and all went smoothly.
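The conversion script itself was not posted, so the following is only a minimal Python sketch of what a base-to-colorspace conversion might look like, using the standard SOLiD two-base encoding. The function name is hypothetical, and the primer base and any csfasta header handling are left out:

```python
# 2-bit codes chosen so that the color of a dinucleotide is the XOR of its
# two bases: AA/CC/GG/TT -> 0, AC/CA/GT/TG -> 1, AG/GA/CT/TC -> 2,
# AT/TA/CG/GC -> 3.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colorspace(seq):
    """Encode a base-space sequence as a string of colors (len(seq) - 1 long).

    Any dinucleotide containing a non-ACGT character is written as '.'.
    """
    seq = seq.upper()
    colors = []
    for first, second in zip(seq, seq[1:]):
        if first in BASE_CODE and second in BASE_CODE:
            colors.append(str(BASE_CODE[first] ^ BASE_CODE[second]))
        else:
            colors.append(".")
    return "".join(colors)


print(to_colorspace("ATGGCA"))  # prints: 31031
```

Because each color depends on two adjacent bases, a single homopolymer miscall in a 454 read shifts every downstream color, which is worth keeping in mind when mixing the two data types.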

                            Then I tried running velvet_de again, but this time I entered the 454 reads twice: once as long and once as short. I got amazingly long contigs (this is for a transcriptome) with a maximal length of 3.5 kb. Wonderful!
                            I ran the SOLiD post processor on them and all went well. Then I tried running denovoadp and was told:
                            'contig exceed maximum length or reads match to negative position'

                            So I have these gorgeous long contigs, but they're stuck in colorspace...
                            I'm trying to figure out if asid light can help me. I wish the documentation was better...
                            Any input would be welcome.



                            • #15
                              Hello Temima,
                              Glad to hear the good news.
                              Could you forward me the script for the 454-to-colorspace conversion?

                              I have also been trying to combine 454 and SOLiD data for assembly; however, my experimental approach hasn't helped me yet.
                              Thanks and regards

