  • Improving genome assembly - suggestions

    I have de novo assembled pre-processed Illumina paired-end reads (interleaved into a single file, 108 Gb file size) of a plant genome. The estimated genome size is around 2 Gbp. I used Minia to assemble the genome with a k-mer size of 47 and a minimum abundance of 3 (estimated with KmerGenie). Minia outputs contigs. Before I do scaffolding, I need your suggestions. I have evaluated the assembly with QUAST, and I have also mapped the paired-end reads back to the assembled genome and report the Qualimap results below. I understand the coverage is low; is it possible to make a publication with this genome assembly? Any recommendations to improve the assembly with the available data?

    Quast results:
    All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

    Assembly output_prefix.contigs
    # contigs (>= 0 bp) 9285711
    # contigs (>= 1000 bp) 88260
    Total length (>= 0 bp) 1590477304
    Total length (>= 1000 bp) 146312325
    # contigs 316519
    Largest contig 12582
    Total length 300873518
    GC (%) 34.45
    N50 977
    N75 677
    L50 92434
    L75 186047
    # N's per 100 kbp 0.00


    Qualimap- Mapping results:
    >>>>>>> Reference

    number of bases = 1,590,477,304 bp
    number of contigs = 9285711



    >>>>>>> Globals

    number of windows = 9286108

    number of reads = 102,927,571
    number of mapped reads = 102,654,076 (99.73%)

    >>>>>>> Mapping quality

    mean mapping quality = 37.18


    >>>>>>> ACTG content

    number of A's = 2,874,099,790 bp (31.76%)
    number of C's = 1,783,726,295 bp (19.71%)
    number of T's = 2,718,363,373 bp (30.04%)
    number of G's = 1,672,166,807 bp (18.48%)
    number of N's = 0 bp (0%)

    >>>>>>> Coverage

    mean coverageData = 5.69X
    std coverageData = 57.95X
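
    As a quick sanity check on these numbers: the mean coverage Qualimap reports is roughly the total aligned bases divided by the assembly length. A minimal sketch in Python, using only the figures from the report above and the ~2 Gbp genome-size estimate:

    Code:
    # Rough check of the Qualimap coverage figure from the numbers above.
    total_aligned_bases = 2_874_099_790 + 1_783_726_295 + 2_718_363_373 + 1_672_166_807  # A+C+T+G
    assembly_length = 1_590_477_304        # "number of bases" of the reference (the contigs)
    estimated_genome_size = 2_000_000_000  # ~2 Gbp estimate

    print(f"coverage over the assembly:         {total_aligned_bases / assembly_length:.2f}x")        # ~5.69x
    print(f"coverage over the estimated genome: {total_aligned_bases / estimated_genome_size:.2f}x")  # ~4.5x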

  • #2
    Any suggestion is appreciated.



    • #3
      I understand the coverage is low; is it possible to make a publication with this genome assembly?
      Well, it's not only that the coverage is low, but that you simply don't have most of your genome sequenced... So far I'm not aware of any published genome where contigs shorter than 1 kb were used to state the genome size. In this case you would only have ~146 Mb, so less than 10% of the estimated size... Even if you go down to 500 bp (I assume these are the values for that: # contigs 316519; Largest contig 12582; Total length 300873518), it doesn't work. The mean coverage from Qualimap is also not an accurate measure, because you are essentially mapping reads back onto read-sized sequences rather than onto "real" contigs.

      Any recommendations to improve the assembly with the available data?
      Even if you knew the order of the reads and all of them were adjacent to each other, you would still be missing ~20% of the estimated genome size. There is simply not enough sequence information available to make any improvements, sorry.

      You could, however, already start working with this data. Assembly of the plastid genomes might already be possible, and gene prediction too (of course only on contigs > 1 kb). Maybe there are some fancy new findings that would justify a publication.



      • #4
        Thanks for your comments. But I assume the QUAST contig count refers to contigs above 500 bp (# contigs 316519). Do you think an N50 of 977 bp (~1 kb) is not enough for publication?
        Last edited by bioman1; 08-12-2014, 07:48 AM.



        • #5
          Originally posted by bioman1
          Thanks for your comments. But I assume the QUAST contig count refers to contigs above 500 bp (# contigs 316519). Do you think an N50 of 977 bp (~1 kb) is not enough for publication?
          No, the numbers do not look good. Is the genome highly heterozygous? That might explain some of the difficulties.

          I would suggest running CEGMA to get an additional quality measure.



          • #6
            Originally posted by bioman1
            Thanks for your comments. But I assume the QUAST contig count refers to contigs above 500 bp (# contigs 316519). Do you think an N50 of 977 bp (~1 kb) is not enough for publication?
            1) The N50 is a quality metric for the assembly. However, it is calculated from the sequence you actually assembled, not from the estimated genome size. As your total assembly length is far below the estimated genome size, I would not consider it a good quality measure here at all.

            2) Even if reviewers would still accept it, 1 kb is indeed too low. Remember that this means that 50% of your assembled sequence is contained in "contigs" smaller than 977 bp (see the sketch below)! So that is a big pool of reads that effectively was not assembled. Have a look at this paper: Assembly of large genomes using second-generation sequencing (especially Figure 3).

            3) @luc made a good suggestion here. If you are really lucky and the genome size was just (greatly) overestimated, you could try to find core eukaryotic genes via CEGMA to estimate the completeness of the assembly (this is expected for a publication anyway).

            4) I just read that you did a de novo assembly. Is there a (more or less closely related) sequenced relative available? You could try a reference-guided assembly then.
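
            To make points 1) and 2) concrete, here is a minimal sketch (plain Python, with hypothetical contig lengths) of how N50 is computed, and of the NG50 variant that divides by the estimated genome size instead of the assembly size:

            Code:
            def n50(lengths, total=None):
                """Length of the contig at which the running total (largest contigs
                first) reaches half of `total`; `total` defaults to the assembly size
                (N50); pass the estimated genome size instead to get NG50."""
                if total is None:
                    total = sum(lengths)                   # plain N50: relative to what was assembled
                running = 0
                for length in sorted(lengths, reverse=True):
                    running += length
                    if running >= total / 2:
                        return length
                return 0                                   # the contigs cover less than half of `total`

            contigs = [12582, 5000, 2000, 977, 800, 600, 500]   # hypothetical contig lengths
            print(n50(contigs))                                  # N50 of the contigs themselves
            print(n50(contigs, total=2_000_000_000))             # NG50 against a 2 Gbp genome estimate -> 0 here

            With the 2 Gbp denominator, half the genome is never reached, which is why an N50 computed only on the ~0.3 Gbp that was actually assembled looks more flattering than it should.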



            • #7
              Thanks @WhatsOEver and @luc for your comments. I will do the CEGMA estimation after getting satisfactory assembly metrics (N50, largest contig, etc.). I need suggestions for improving the assembly with the dataset I have. I am working on a non-model plant and it is polyploid.

              1. Do I need to do any error correction before assembly? If so, can you recommend a good error-correction tool that is also memory efficient? I came across the SOAPdenovo and ALLPATHS error correctors.
              2. I tried Minia because it is memory efficient; is there any other good memory-efficient assembler for a large plant genome? Can the SOAPdenovo assembler run on a Dell Precision T7600 workstation (24 CPUs and 124 GB of memory)?
              3. Is it useful to combine assemblies from different k-mers with an assembly merger?



              • #8
                I did the CEGMA completeness test. I have a very low-coverage genome, around 5x. The CEGMA completeness is around 20%; is this metric OK for publication?

                # Statistics of the completeness of the genome based on 248 CEGs #

                #Prots %Completeness - #Total Average %Ortho

                Complete 51 20.56 - 63 1.24 23.53

                Group 1 12 18.18 - 15 1.25 25.00
                Group 2 9 16.07 - 9 1.00 0.00
                Group 3 12 19.67 - 15 1.25 25.00
                Group 4 18 27.69 - 24 1.33 33.33

                Partial 150 60.48 - 215 1.43 29.33

                Group 1 38 57.58 - 44 1.16 13.16
                Group 2 30 53.57 - 40 1.33 20.00
                Group 3 34 55.74 - 52 1.53 41.18
                Group 4 48 73.85 - 79 1.65 39.58

                # These results are based on the set of genes selected by Genis Parra #

                # Key: #
                # Prots = number of 248 ultra-conserved CEGs present in genome #
                # %Completeness = percentage of 248 ultra-conserved CEGs present #
                # Total = total number of CEGs present including putative orthologs #
                # Average = average number of orthologs per CEG #
                # %Ortho = percentage of detected CEGS that have more than 1 ortholog #



                • #9
                  Hi bioman,

                  Sorry; unless you can provide some reasoning why this genome is extraordinarily difficult, I do not think the CEGMA and assembly metrics give much hope for a publication. They indicate that the gene space (the more easily assemblable part of the genome) is only partly represented, although ~60% of the CEGs do seem to be at least partially present.

                  Error correction usually helps, but it also requires sufficient coverage and will not perform miracles; the new Blue read error-correction software promises to be usable without huge memory requirements (Musket is another good one). More sequencing data (including mate pairs) will be required.
                  I would suggest assembling your sequences with the trial version of CLCbio (after error correction). The CLC assembler is remarkably tolerant of weird genomes and sequence data.

                  BTW, did you perform adapter trimming on your reads? What insert sizes did they have?
                  Last edited by luc; 09-02-2014, 12:22 AM.



                  • #10
                    Hi bioman1,
                    The metrics are quite clear, aren't they? You just don't have a "genome" to publish. Sorry, but it's not about choosing a different assembler, different settings, or doing error correction; you simply do not have enough sequence information.

                    There are still some possibilities to work with your data:
                    - I would still go for a reference-guided assembly using a close relative (check scaffold_builder, for example: http://www.scfbm.org/content/8/1/23)
                    - If this doesn't work, I would focus on complete protein sequences only and try to find something interesting there.
                    - If you can provide reasoning for the extraordinary difficulty of sequencing your genome (as luc suggested), I could imagine you also have a good case for getting additional funding for another sequencing run.



                    • #11
                      Thanks luc & WhatsOEver for your comments. Regarding the difficulty of my genome assembly: my plant is a tetraploid crop and highly heterozygous. It is also a non-model crop, and my sequencing coverage is only around 5x.



                      • #12
                        OK, this makes things messy, or more or less impossible to resolve with Illumina sequencing alone.
                        Very likely your assembly contains quite a lot of chimeric contigs. As mentioned in another thread, the Platanus assembler could potentially be helpful in such cases, but I have little hope.

                        Is the coverage around 50x? Even that is certainly not a lot.

                        In your first post you wrote:
                        "... paired-end reads (interleaved into a single file, 108 Gb file size) of a plant genome. The estimated genome size is around 2 Gbp."



                        • #13
                          According to the Qualimap stats, there are only ~9 Gbp of sequence information available in the reads:

                          Originally posted by bioman1
                          Qualimap- Mapping results:
                          >>>>>>> ACTG content

                          number of A's = 2,874,099,790 bp (31.76%)
                          number of C's = 1,783,726,295 bp (19.71%)
                          number of T's = 2,718,363,373 bp (30.04%)
                          number of G's = 1,672,166,807 bp (18.48%)
                          number of N's = 0 bp (0%)
                          Thinking about it, this is indeed odd: with a file size of 108 Gb I would expect at least ~20-30 Gbp of sequence (depending on the length of the headers and whether the definition lines are empty). ~10 Gbp of sequence would fit the ~100M reads found by Qualimap (assuming 100 bp reads).

                          Have you actually checked whether all your reads went into the assembly? If quite a lot were discarded initially, there may be a chance to improve your assembly using different software/parameters. Do a line count of your FASTQ file (wc -l ./yourFile.fastq) for a quick check on the read numbers; the number of reads is the line count divided by four.
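
                          Or, to count reads and total bases in one pass, a minimal sketch in Python (assuming a plain, uncompressed 4-line-per-record FASTQ; the file name is a placeholder):

                          Code:
                          # Count reads and bases in a FASTQ file (4 lines per record assumed).
                          reads = 0
                          bases = 0
                          with open("yourFile.fastq") as fq:     # placeholder file name
                              for i, line in enumerate(fq):
                                  if i % 4 == 1:                 # second line of each record = the sequence
                                      reads += 1
                                      bases += len(line.strip())

                          print(f"{reads:,} reads, {bases:,} bases (~{bases / 2e9:.1f}x over a 2 Gbp genome)")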



                          • #14
                            My plant is an allotetraploid crop, sequenced on a HiSeq; Illumina paired-end reads of 101 bp length.

                            I tried a different k-mer size, 41 (estimated by SGA preqc), and assembled with the Minia genome assembler. Then I did scaffolding with default parameters using the SSPACE scaffolder, and the N50 improved from ~1 kbp to around 2 kbp. Then I used the post-processing "N50 booster" script available from http://www.acgt.me/blog/2014/3/31/a-...enome-assembly, and it amazingly improved my N50 from 2 kbp to ~400 Mbp. Is it logically sound to boost the assembly with this script? Even though it improved the assembly N50, I get a lot of N's per 100 kbp and the L50 comes down to 1. If this is not a good method, can I try PAGIT (https://www.sanger.ac.uk/resources/s...agit/#Download) instead?

                            Scaffolding metrics tested with QUAST:
                            All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

                            Assembly standard_output.final.scaffolds
                            # contigs (>= 0 bp) 7009924
                            # contigs (>= 1000 bp) 84213
                            Total length (>= 0 bp) 1491240084
                            Total length (>= 1000 bp) 359510617
                            # contigs 284770
                            Largest contig 216550792
                            Total length 495697477
                            GC (%) 34.36
                            N50 2699
                            N75 929
                            L50 8319
                            L75 96950
                            # N's per 100 kbp 43684.48


                            Metrics after the N50 booster script, tested with QUAST:
                            All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

                            Assembly standard_output.final.scaffolds.fasta.n50
                            # contigs (>= 0 bp) 5151562
                            # contigs (>= 1000 bp) 84213
                            Total length (>= 0 bp) 1491240084
                            Total length (>= 1000 bp) 544051890
                            # contigs 284770
                            Largest contig 401092065
                            Total length 680238750
                            GC (%) 34.36
                            N50 401092065
                            N75 1251
                            L50 1
                            L75 53780
                            # N's per 100 kbp 58962.26



                            I will try with different assembler and parameter as you suggested.



                            • #15
                              Having been posted on "April 01, 2014", the script is surely just a joke!

                              It just glues your sequences together; there is no real improvement to the assembly (see the sketch below).
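
                              To see why: concatenating all contigs with N spacers trivially produces one giant "scaffold", so L50 drops to 1 and N50 jumps to roughly the total assembly length, without adding a single base of real sequence. A minimal sketch with made-up contigs:

                              Code:
                              # What such a "booster" effectively does: join everything into one sequence.
                              contigs = ["ACGT" * 250, "GATTACA" * 100, "TTAGGC" * 50]   # made-up contigs
                              boosted = ("N" * 100).join(contigs)                         # one "scaffold" of N-glued contigs
                              print(len(boosted))   # single scaffold ~ total assembly length, so N50 = this value and L50 = 1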

