  • #16
    De novo genome assembly: beginner

    Hi folks!!

    I too am completely new to NGS, and I am struggling with a question:

    What decides the amount of data needed for a de novo genome assembly of a particular organism? What decides the insert size during library construction?

    How do I decide important parameters such as coverage, size, accuracy, and sensitivity; library type (fragment or mate-paired); and read length?


    Can anyone please help me?

    Many thanks in advance!
    shal



    • #17
      Hi shal,

      It is very hard to say in advance how much data you will need for making a de novo assembly of a particular organism. It depends not only on the genomic content (especially repeats, polymorphism, GC content) and the assembly software you're using, but also on the properties of the reads you get (for example, their quality and distribution).

      I have assembled a 1 Gb genome and got the best result when I used just a subset of my data (less than 30X coverage); for others it works better using 50X or 80X. You can always start with a smaller amount (but probably never below 20X) and then sequence more if you are unsatisfied with the results.

      Also, I would say that after you have decided what coverage you would like to have, sequence at least 1.5 times more (or even 2 times more), since you will lose some in the filtering steps (some reads will be duplicated, some will have too poor quality, etc.).
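
The coverage arithmetic above can be sketched in a few lines of Python. This is only a rough back-of-the-envelope estimate, not a method from the post: the function names `reads_needed` and `coverage` are made up for illustration, and the calculation assumes that coverage is simply total sequenced bases divided by genome size, with the 1.5x (or 2x) safety margin applied as a plain multiplier.

```python
import math


def reads_needed(genome_size_bp, target_coverage, read_length_bp, overhead=1.5):
    """Estimate the number of reads to sequence (hypothetical helper).

    overhead is the safety factor for reads lost in filtering
    (duplicates, poor quality); 1.5 to 2 is the range suggested above.
    """
    if genome_size_bp <= 0 or target_coverage <= 0 or read_length_bp <= 0:
        raise ValueError("genome size, coverage and read length must be positive")
    if overhead < 1:
        raise ValueError("overhead should be at least 1")
    total_bases = genome_size_bp * target_coverage * overhead
    return math.ceil(total_bases / read_length_bp)


def coverage(genome_size_bp, n_reads, read_length_bp):
    """Raw coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Assumed example: 1 Gb genome, 30X target, 100 bp reads, 1.5x margin.
    n = reads_needed(1_000_000_000, 30, 100, overhead=1.5)
    print(n, "reads, i.e.", n // 2, "read pairs for a paired-end library")
    print("raw coverage before filtering:", coverage(1_000_000_000, n, 100))
```

For a paired-end library, each pair contributes two reads, so the number of fragments to sequence is half the read count.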

      For the insert size, you should preferably have a mix of short and long libraries. The shorter paired-end libraries (insert <1000 bp) are used for building contigs, and the longer mate-pair libraries are used for joining the contigs into scaffolds. For mate pairs I would say "the longer the better": a longer insert size in mate-pair libraries will certainly give you a larger N50 for the assembly. But it's usually the cost that sets the limit... Note that some assemblers (like Allpaths-LG) have specific recommendations for setting up the libraries.
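
Since N50 is the number being compared here, a minimal sketch of how it is usually computed may help. N50 is the length L such that contigs (or scaffolds) of length L or longer together contain at least half of the total assembled bases. The function name `n50` and the example lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Made-up lengths: the same bases split into contigs vs joined scaffolds.
    contigs = [80, 70, 50, 40, 30, 20]
    scaffolds = [150, 90, 50]
    print("contig N50:", n50(contigs))
    print("scaffold N50:", n50(scaffolds))
```

Joining contigs into fewer, longer scaffolds raises N50 even though the total number of bases stays the same, which is why long-insert mate pairs matter for this statistic.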

      Paired-end and mate-pair reads are better than single-end reads for de novo assembly. With Illumina (I suppose you intend to use this since you chose this forum) the read length isn't very variable; reads go up to ~150 bp. Most of our libraries were 100 bp (which worked fine). When we tried longer reads, the read quality seemed much poorer in the last 50 bp, so we ended up trimming them anyway.

      I'm not sure if I understood your questions regarding sensitivity and accuracy (in reads or assembly?), but hope this helps a bit!

      Good luck!



      • #18
        Dear Linnea,

        Thank you so much for sharing your knowledge and experience. Your reply was helpful for me and it answered my queries.

        Thanks again
        Shal



        • #19
          Hi! Me again.

          I was just wondering what clues can be used for scaffolding. I know we can map contigs against reference genomes or use long-insert paired-end reads, but are there other ways to find things like contig orientation, order, and distances?
