  • #16
    De novo genome assembly: beginner

    Hi folks!!

    I too am completely new to NGS, and I am struggling with a question:

    What determines the amount of data needed for a de novo genome assembly of a particular organism? What determines the insert size during library construction?

    How do I decide on important parameters such as coverage, size, accuracy, and sensitivity; library type (fragment or mate-pair); and read length?

    Can anyone please help me?

    many thanks in advance!!
    shal



    • #17
      Hi shal,

      It is very hard to say in advance how much data you will need for making a de novo assembly of a particular organism. It depends not only on the genomic content (especially repeats, polymorphism, GC content) and the assembly software you're using, but also on the properties of the reads you get (for example, their quality and distribution).

      I have assembled a 1 Gb genome and got the best result when I used just a subset of my data (less than 30X coverage); for others it works better with 50X or 80X. You can always start with a smaller amount (but probably never below 20X) and then sequence more if you are unsatisfied with the results.

      Also, I would say that once you have decided what coverage you would like to have, sequence at least 1.5 times more (or even 2 times more), since you will lose some in the filtering steps (some reads will be duplicates, some will have too poor quality, etc.).
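
      The coverage arithmetic above can be sketched as follows. This is only a rough illustration: coverage is taken as (number of reads x read length) / genome size, and the function name `reads_needed` and its `safety_factor` parameter are made up for this example, not taken from any tool.

```python
import math


def reads_needed(genome_size_bp, target_coverage, read_length_bp, safety_factor=1.5):
    """Estimate how many reads to sequence for a desired coverage.

    coverage = (number of reads * read length) / genome size
    safety_factor is the 1.5x-2x margin for reads lost to filtering
    (duplicates, low quality); it is an assumption, not a fixed rule.
    """
    bases_needed = genome_size_bp * target_coverage * safety_factor
    return math.ceil(bases_needed / read_length_bp)


# Example: 1 Gb genome, 30X target, 100 bp reads, 1.5x margin
print(reads_needed(1_000_000_000, 30, 100, 1.5))  # 450000000
```

      With paired reads, that would correspond to 225 million read pairs.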

      For the insert size, you should preferably have a mix of short and long libraries. The shorter paired-end libraries (insert size <1000 bp) are used for building contigs, and the longer mate-pair libraries are used for joining the contigs into scaffolds. For mate pairs I would say "the longer the better": a longer insert size in mate-pair libraries will certainly give you a larger N50 for the assembly. But it's usually the cost that sets the limit... Note that some assemblers (like Allpaths-LG) have specific recommendations for setting up the libraries.
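
      For reference, N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of that calculation, using hypothetical contig and scaffold lengths chosen only to show why joining contigs raises the N50:

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    Sort lengths from longest to shortest and return the length at which
    the running total first reaches half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical lengths in kb (not real data)
contigs = [80, 70, 50, 40, 30, 20]
# The same bases after mate pairs join contigs pairwise into scaffolds
scaffolds = [150, 90, 50]

print(n50(contigs), n50(scaffolds))  # 70 150
```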

      Paired-end/mate-pair reads are better than single-end reads for de novo assembly. With Illumina (I suppose you intend to use this since you chose this forum) the read length isn't very variable; reads go up to ~150 bp. Most of our libraries were 100 bp (which worked fine). When we tried longer reads, the read quality seemed much poorer in the last 50 bp, so we ended up trimming them anyway.
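
      Trimming a poor-quality read tail can be illustrated with a very simple sketch. It assumes Phred+33 quality encoding and an arbitrary cutoff of Q20, and it just drops trailing low-quality bases; dedicated trimming tools use more elaborate rules, so treat `trim_3prime` as a hypothetical helper rather than what any particular tool does.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred quality is below min_q.

    seq    : read sequence
    qual   : FASTQ quality string of the same length
    offset : ASCII offset of the encoding (33 assumes Phred+33)
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string differ in length")
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the cutoff
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40, '+' encodes Q10, '#' encodes Q2 under Phred+33
print(trim_3prime("ACGTACGTAC", "IIIIIII#+#"))  # ('ACGTACG', 'IIIIIII')
```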

      I'm not sure if I understood your questions regarding sensitivity and accuracy (in reads or assembly?), but hope this helps a bit!

      Good luck!



      • #18
        Originally posted by Linnea
        Dear Linnea,

        Thank you so much for sharing your knowledge and experience. Your reply was helpful for me and it answered my queries.

        Thanks again
        Shal



        • #19
          Hi! Me again.

          I was wondering about all the possible clues that are used for scaffolding. I know we can map contigs against reference genomes or use long paired-end reads, but are there other ways to find things like orientation, order, and distances?
