  • how to assess a de novo assembly result?

    Hi, All,

    We recently sequenced a genome of around 800 Mb using an Illumina GAII, and the coverage reached almost 100x. We used SOAPdenovo for de novo assembly and the statistics are as below:

    N50 contig: 150 bp
    Max contig: ~3 kb
    Median contig: 130 bp

    Is this a good assembly for nearly 100x coverage? Could it be further improved?

    Thanks a million!

    CC

  • #2
    This is probably a good recent paper to read:

    http://www.nature.com/nmeth/journal/...meth.1527.html



    • #3
      Thanks a lot!

      I know it's too much to ask but could you email me a copy of the paper? I don't have access to it.



      Again, thanks!

      CC

      Originally posted by ECO View Post
      This is probably a good recent paper to read:

      http://www.nature.com/nmeth/journal/...meth.1527.html
      Last edited by CC_seqanswers; 01-22-2011, 10:38 AM.



      • #4
        No worries. I got it from a friend.

        Have a nice day!

        CC



        • #5
          Originally posted by CC_seqanswers View Post
          N50 contig:150 bp;
          Max contig: ~3K
          median contig: 130 bp
          In my humble opinion, these are not very encouraging numbers. If you look at recent genomes (panda!, turkey, apple, cacao, strawberry, you name it), these have much better metrics with lower coverage. Also, don't you have scaffolds?
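
For anyone who wants to recompute the metrics quoted above from their own contig file, here is a minimal sketch. It assumes you already have the contig lengths as a list of integers; the function names and the toy lengths are made up for illustration and are not from the original poster's assembly.

```python
from statistics import median


def n50(lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together hold at least half of the assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def summarize(lengths):
    """Collect the basic contig statistics discussed in this thread."""
    return {
        "n_contigs": len(lengths),
        "total_bp": sum(lengths),
        "n50": n50(lengths),
        "max": max(lengths),
        "median": median(lengths),
    }


# Hypothetical toy contig lengths (bp), for illustration only.
toy = [3000, 400, 150, 150, 130, 120, 100]
stats = summarize(toy)
print(stats)
```

Running the same summary on contigs and on scaffolds separately makes it easier to compare against published assemblies, which usually report both.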



          • #6
            I agree that it's not an encouraging number, though the problem most likely lies in the SOAPdenovo configuration rather than in the data itself. I have used SOAPdenovo several times and found it a bit hard to tweak; most of my assemblies from a simulated dataset produced a very short contig N50. I then redid them with the CLC assembler, and they turned out fine.

            My point is, you should try another assembler such as Velvet, MIRA, or CLC, or contact the authors and ask for a recommended configuration.



            • #7
              Can you give us some basic background about what was sequenced - for example is it inbred or likely to be highly heterozygous? This will also affect your assembly results and different assemblers behave differently or require tweaking to best handle such cases.

              The recent strawberry and cocoa genomes were both based on at least partially inbred lines. They both had decent amounts of 454 data and the cocoa project had some Sanger BAC sequences too.

              We've found that the SOAP developers simply don't reply to emails, so I would suggest trying Velvet (if you have enough RAM) or ABySS.

              Are your reads all shotgun?



              • #8
                We also tried ABySS, which gave only slightly longer contigs.

                Does CLC require even more memory?



                • #9
                  1. I believe it's heterozygous, and it's a plant, which is expected to have substantial repetitive sequence.

                  2. All the data are pure short reads: a 200 bp short-insert library and 2 kb/5 kb mate-pair reads.

                  3. We tried ABySS and it did not seem to help much.

                  4. Can anyone tell me whether, with currently available assemblers, it is possible/feasible to do de novo assembly at all from a dataset of pure short reads, such as Illumina data? If not, what else can help? Would 454 reads, which are a bit longer, help at all?



                  • #10
                    Sorry, another question. Is Velvet good only for small genomes? The one we are working on is estimated to be around 800 Mb.

                    Thanks so much!



                    • #11
                      I used it on a 36 GB machine, but I have never tried it with 100x data. There is a 30-day trial license at http://www.clcbio.com if you want to try it.
                      For the record, I don't get any incentive for recommending CLC.

                      Anyway, your case is quite interesting. There are several steps I might take if I were in your shoes:

                      1. Try reducing the reads to 60x or less, either by removing duplicates (I suspect you have done this) or by simple quality-based filtering. Some studies have shown that more coverage does not necessarily mean a better assembly, because an overabundance of reads might confuse the algorithm.

                      2. Try ALLPATHS-LG from the Broad. I have never used it, but it's the newest assembler out there (I think).

                      3. Try another assembler based on overlap-layout-consensus, e.g. Celera or Phrap, with strict overlap criteria. If the 100x data is evenly distributed, the assembly should be good, though it might fail to detect repeats.
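
As a rough illustration of step 1, here is a minimal sketch of one way to bring 100x down to about 60x. Note that it uses plain random subsampling of read pairs, which is a different technique from the duplicate removal or quality filtering mentioned above; it is only the simplest way to hit a target depth. The function names, the in-memory list of pairs, and the seed are assumptions for the example, not part of any assembler.

```python
import random


def keep_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so that depth drops to the target."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverage values must be positive")
    # Never ask for more reads than we have.
    return min(1.0, target_coverage / current_coverage)


def subsample_pairs(pairs, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    Both mates of a pair are kept or dropped together, so the
    paired-end / mate-pair information stays intact.
    """
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    return [pair for pair in pairs if rng.random() < fraction]


# Hypothetical example: placeholder read-pair identifiers.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(10_000)]
fraction = keep_fraction(current_coverage=100, target_coverage=60)
kept = subsample_pairs(pairs, fraction, seed=42)
print(f"kept {len(kept)} of {len(pairs)} pairs (fraction {fraction})")
```

For real FASTQ files you would stream the two mate files in lockstep instead of holding everything in memory, but the keep-or-drop decision per pair is the same.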






                      • #12
                        Originally posted by CC_seqanswers View Post
                        1. I believe it's heterozygous, and it's a plant, which is expected to have substantial repetitive sequence.
                        2. All the data are pure short reads: a 200 bp short-insert library and 2 kb/5 kb mate-pair reads.
                        3. We tried ABySS and it did not seem to help much.
                        4. Can anyone tell me whether, with currently available assemblers, it is possible/feasible to do de novo assembly at all from a dataset of pure short reads, such as Illumina data? If not, what else can help? Would 454 reads, which are a bit longer, help at all?
                        1. That will confuse most assemblers.

                        2. So you have 3 Illumina libraries: 200 bp PE, 2000 bp MP, and 5000 bp MP. I assume the 100x is the combined depth. What were the read lengths (100 bp?) and the yields per library?

                        3. Did you just use default parameters? Are you sure you set up SOAPdenovo properly?

                        4. Yes, you should be able to do this with Velvet, but you will need a machine with about 900 GB of RAM, e.g. a 1 TB Dell R910.
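
To answer the question in point 2, the per-library depth is simply (number of reads × read length) / genome size. Here is a small sketch of that arithmetic. The 800 Mb genome size comes from this thread; the read counts and the 100 bp read length are hypothetical placeholders chosen so the total lands at 100x, since the real yields were not posted.

```python
def depth(n_reads, read_length_bp, genome_size_bp):
    """Expected sequencing depth: total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


GENOME_SIZE_BP = 800_000_000  # ~800 Mb estimate from the thread

# Hypothetical yields: (individual reads, read length in bp) per library.
libraries = {
    "200 bp PE": (500_000_000, 100),
    "2 kb MP": (200_000_000, 100),
    "5 kb MP": (100_000_000, 100),
}

per_library = {
    name: depth(n_reads, read_len, GENOME_SIZE_BP)
    for name, (n_reads, read_len) in libraries.items()
}
total_depth = sum(per_library.values())

for name, cov in per_library.items():
    print(f"{name}: {cov:.1f}x")
print(f"combined: {total_depth:.1f}x")
```

Splitting the depth out this way shows how much of the combined figure comes from the short-insert library versus the mate-pair libraries.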



                        • #13
                          Hi,

                          Can you email me a copy of the paper that you mentioned regarding "how to assess a de novo assembly result?"?
                          Many thanks.

                          I can't access it either.
