  • De novo plant BAC sequencing

    We have sequenced a stretch of overlapping plant BACs with both Solexa and 454.

    Can anyone advise me on which assembler to use to get nice big contigs?

    So far we get only small contigs.

    Regards,
    JDL

  • #2
    What was your current assembly strategy?

    • #3
      Velvet is good for short reads, and it also works on long sequences.

      • #4
        Agreed.
        There are two reasons you may have many short contigs in a de novo assembly:

        1. Gaps in coverage.
        2. High repeat content.

        In most assembly projects I've seen, (1) is more likely to be the reason that you have many short contigs. This is counterintuitive since often the coverage is so high you would expect reads to start at every position in the genome many times. Unfortunately, for the GAI and to a lesser extent FLX sequencers, there are amplification biases that create "sampling deserts" that have considerably less coverage than expected, often causing gaps in the assembly.
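To see why gaps are surprising at high coverage, here is a minimal sketch of the idealized Lander-Waterman (Poisson) expectation. It assumes reads are sampled uniformly and ignores the minimum overlap an assembler needs, so it is a best case, not a model of real GAI or FLX data. The function names and the 150 kb target / 35 bp read values are illustrative assumptions, not numbers from this thread.

```python
import math

def mean_coverage(num_reads, read_length, target_length):
    """Average depth c = N * L / G."""
    return num_reads * read_length / target_length

def uncovered_fraction(coverage):
    """Poisson probability that a given base is covered by zero reads."""
    return math.exp(-coverage)

def expected_islands(num_reads, coverage):
    """Expected number of separate covered islands, N * e^(-c),
    when the required overlap between reads is ignored."""
    return num_reads * math.exp(-coverage)

if __name__ == "__main__":
    target = 150_000   # assumed BAC-sized target, in bp
    read_len = 35      # assumed short-read length, in bp
    for depth in (5, 10, 30):
        n = int(depth * target / read_len)
        c = mean_coverage(n, read_len, target)
        print(f"{depth}X: uncovered fraction {uncovered_fraction(c):.2e}, "
              f"expected islands {expected_islands(n, c):.2e}")
```

Under this uniform model almost nothing is left uncovered at 30X, so gaps that persist at that depth are more consistent with sampling bias than with insufficient sequencing.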

        If you are running paired-end assemblies, the latest releases of assemblers do some scaffolding where the short contigs are joined by mate pairs, which improves the N50 statistic greatly.
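For reference, N50 can be computed from a list of contig or scaffold lengths as sketched below. This follows the usual definition and is not code from any particular assembler; the example lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty input

if __name__ == "__main__":
    fragmented = [500] * 20             # hypothetical: many short contigs
    scaffolded = [5000, 3000, 2000]     # hypothetical: same bases after joining
    print("fragmented N50:", n50(fragmented))
    print("scaffolded N50:", n50(scaffolded))
```

Both example assemblies contain the same number of bases; only the joining of short contigs changes the statistic.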

        Also, I've found that hybrid 454/Illumina assemblies (with old 35 bp Illumina reads) complement coverage nicely.

        Regarding (2), if the repeat length is longer than the clone size (or the read length with an unpaired assembly), every instance of the repeat in the genome will split a contig.

        There are utilities in euler-sr that tell you if your coverage is fragmented, and I'll try and add a utility to the forthcoming release that gives statistics on the number of repeats in the genome that are unresolved with mate-pairs.

        cheers,
        -mark

        • #5
          used assemblers

          Thank you for your comments.

          We don't have the sequencers ourselves, so the first assemblies are made by the sequencing companies.

          The Solexa reads (5 pooled BACs forming a contig) are assembled with Velvet 0.5.0 and SSAKE 3.2.

          The 454 reads (5 tagged BACs, so individual sequences, plus a pool of 3 BACs; all 8 together form one contig) are assembled with the semiautomatic GS FLX Assembly.

          I think we have a nice amount of sequences which should deliver a better assembly than we had up till now.

          Mark, is it possible to put both the 454 and Solexa reads together into euler-sr?
          In which format should the sequences be then?

          JDL

          • #6
            Hi all,
            I'm performing similar experiments simulating all the data. I have noticed that the results are really influenced by the BAC overlaps and by the coverage of short reads.
            Tools like Velvet, Edena and Euler usually perform well with a short-read coverage greater than 30X.

            To understand what is not working in your experiment, I would suggest simulating the same experiment that you are performing with real data. If the results of the simulated experiment are similar to the ones you obtain with real data, start to vary the coverage of short reads to check whether deeper coverage improves your assembly.
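A simulation of this kind can start from something as small as the sketch below. It is a deliberately simplified, hypothetical simulator: reads are error-free, drawn uniformly, and taken from the forward strand only, none of which holds for real Solexa or 454 data. The function names and parameter values are assumptions for illustration.

```python
import random

def random_genome(length, rng):
    """Build a random reference sequence of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def simulate_reads(reference, read_length, coverage, rng):
    """Sample error-free, forward-strand reads at uniformly random
    start positions until the requested average depth is reached."""
    num_reads = int(coverage * len(reference) / read_length)
    last_start = len(reference) - read_length
    reads = []
    for _ in range(num_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(reference[start:start + read_length])
    return reads

if __name__ == "__main__":
    rng = random.Random(42)                  # fixed seed for repeatability
    reference = random_genome(10_000, rng)   # assumed toy reference
    for depth in (10, 30, 60):
        reads = simulate_reads(reference, 35, depth, rng)
        print(f"{depth}X -> {len(reads)} reads of 35 bp")
```

Assembling the read sets produced at each depth with the same assembler and settings gives a baseline for how contig sizes should respond to coverage when sampling is unbiased.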

            Francesco

            • #7
              Originally posted by JDL
              Thank you for your comments.

              We don't have the sequencers ourselves, so the first assemblies are made by the sequencing companies.

              The Solexa reads (5 pooled BACs forming a contig) are assembled with Velvet 0.5.0 and SSAKE 3.2.

              The 454 reads (5 tagged BACs, so individual sequences, plus a pool of 3 BACs; all 8 together form one contig) are assembled with the semiautomatic GS FLX Assembly.

              I think we have a nice amount of sequences which should deliver a better assembly than we had up till now.

              Mark, is it possible to put both the 454 and Solexa reads together into euler-sr?
              In which format should the sequences be then?

              JDL

              euler-sr will work with all types of reads as long as they are all placed in a single FASTA file.
              If you have a paired-end library, you will have to define some regular expressions that are used to define the mate type based on fasta titles.
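The two steps above (pooling all reads into one FASTA file and recognizing mates from their titles) can be sketched as follows. This does not reproduce euler-sr's own rule format, which is not shown in this thread; it only illustrates the idea. The title convention `name/1` and `name/2`, the pattern, and all function names are assumptions.

```python
import re

# Assumed convention: the two mates are titled "pairname/1" and "pairname/2".
MATE_PATTERN = re.compile(r"^(?P<pair>\S+)/(?P<end>[12])(\s|$)")

def read_fasta(lines):
    """Yield (title, sequence) tuples from an iterable of FASTA lines."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)

def merge_fasta(sources):
    """Combine several FASTA line iterables into one list of FASTA lines,
    with each sequence joined onto a single line."""
    merged = []
    for lines in sources:
        for title, seq in read_fasta(lines):
            merged.append(f">{title}")
            merged.append(seq)
    return merged

def mate_of(title):
    """Return (pair_name, end) if the title follows the assumed mate
    convention, or None for unpaired reads."""
    match = MATE_PATTERN.match(title)
    if match is None:
        return None
    return match.group("pair"), int(match.group("end"))

if __name__ == "__main__":
    solexa = [">frag7/1", "ACGT", ">frag7/2", "TTGA"]   # made-up paired reads
    flx = [">454_read_001", "ACGTAC", "GTTT"]           # made-up unpaired read
    for line in merge_fasta([solexa, flx]):
        print(line)
    print(mate_of("frag7/1"), mate_of("454_read_001"))
```

In practice the sources would be open file handles rather than lists, and the pattern would need to match whatever titles the sequencing provider actually used.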

              If you want to use euler-sr, you may want to wait a couple of days for the next release to be posted, I'm finishing up some modifications that speed it up greatly (especially on large genomes), and improve some results.

              -mark

              • #8
                mark,

                Are no quality files or the like needed to go with the sequences?
                Is each sequence considered to be of the same quality?

                JDL

                • #9
                  Yes, for now all sequences are considered the same quality. There are some modules I'm working on that may consider quality, but I doubt I will ever include that. More likely is the possibility that I'll write AMOS compatible output that will redo basecalling. If that uses quality values then they will be more useful.

                  For the most part, contigs are formed by reads with massive coverage, and so in non-repetitive regions there are very few base miscalls. Most of the miscalls that euler-sr produces are in repeats, since for now it simply outputs the repeat consensus as the sequence of a repeat. This *should* be fixed by the AMOS basecaller, and it will take a fair amount of coding to get it changed in euler if not.

                  -mark

                  • #10
                    We're working on making the pipeline published in the following paper more user friendly:

                    An international, peer-reviewed genome sciences journal featuring outstanding original research that offers novel insights into the biology of all organisms

                    It's currently based on the VCAKE algorithm, which is sort of a relic but which we developed in house for exactly this 454/Solexa co-assembly problem. We'll be trying to get a streamlined package out on our SourceForge site (http://sourceforge.net/projects/vcake) within the next two weeks.

                    I've been out of the loop on these things for a while, so it's quite possible that current versions of euler-sr and other assemblers have exceeded the abilities of our pipeline.

                    • #11
                      To compare your pipeline with the other assemblers (Edena, Velvet, ABySS, euler-sr), I suggest reading the following article: "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads" by Farrer et al.

                      Francesco

                      • #12
                        Very useful paper, thanks for the pointer
