  • lamz138138
    Member
    • Mar 2015
    • 10

    library design in de novo assembly, thanks!

    Hi, everyone!

    I have a question about choosing library sizes and coverage for de novo assembly. Usually there are short-fragment libraries (250 bp-800 bp) and long-fragment libraries (2K-40K), with a coverage ratio of about 3:1 between the 250 bp and 500 bp libraries and about 2:1 between the 2K and 5K libraries. Why are these ratios used? Can I just sequence every library to the same coverage, or do you have any other suggestions?

    Thanks in advance!

    Best wishes!

    XM Zhong
  • westerman
    Rick Westerman
    • Jun 2008
    • 1104

    #2
    This is for Illumina, of course. I am not sure what you mean by the 3:1 and 2:1 ratios.

    The short fragment libraries are most commonly used to create contigs. However, short fragments have problems dealing with repeats and duplicated regions of the genome, so even with high coverage the contigs will be truncated. The long fragment libraries are then used to deal with these problem areas by stringing the contigs together into scaffolds.

    You can sequence both libraries at the same coverage. However, since less information is needed to create scaffolds than to create contigs, it is usually better to put more effort into the short fragment libraries. We usually recommend about a 2:1 ratio. In other words, if it takes 1 HiSeq lane to produce the number of bases needed for a given species' short fragments (approximately a 1 Gb genome), then we would do 1/2 of a HiSeq lane for the long fragments. Multiple long fragment libraries can be useful.
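As a rough illustration of the 2:1 recommendation above, the sketch below converts a target coverage into lanes. The 50x short-library coverage and the 50 Gb-per-lane yield are placeholder assumptions chosen so that the ~1 Gb example works out to one lane; they are not figures from this thread, so substitute your own.

```python
def required_bases(genome_size_bp: float, coverage: float) -> float:
    """Total sequenced bases needed to reach a given mean coverage."""
    return genome_size_bp * coverage


def lanes_for(genome_size_bp: float, coverage: float, lane_yield_bp: float) -> float:
    """Lanes needed, given a per-lane yield in bases."""
    return required_bases(genome_size_bp, coverage) / lane_yield_bp


GENOME_SIZE_BP = 1e9     # ~1 Gb genome, as in the example above
SHORT_COVERAGE = 50      # assumption: target coverage for the short libraries
LANE_YIELD_BP = 50e9     # assumption: usable bases per lane (placeholder)
SHORT_TO_LONG_RATIO = 2  # the roughly 2:1 ratio recommended above

short_lanes = lanes_for(GENOME_SIZE_BP, SHORT_COVERAGE, LANE_YIELD_BP)
long_lanes = short_lanes / SHORT_TO_LONG_RATIO

print(f"short-fragment lanes: {short_lanes}")
print(f"long-fragment lanes:  {long_lanes}")
```

With these placeholder values the short libraries take one lane and the long libraries half a lane, matching the example in the post.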


    • lamz138138
      Member
      • Mar 2015
      • 10

      #3
      Hi westerman, thank you very much for your reply!

      Take the paper titled "Whole-genome sequencing of the snub-nosed monkey provides insights into folivory and evolutionary history" as an example: the sequencing coverage of the 180 bp, 500 bp, 2K and 5K libraries was 57.3, 22.9, 19.5 and 10.7, respectively. That is where I got the ratios of 3:1 and 2:1, which could also be written as 6:2:2:1.
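The ratios quoted above can be checked by normalising each library's coverage to the smallest one. This is only a quick sketch using the four coverage values given in the post; the library labels are just dictionary keys.

```python
# Coverage per library, as quoted in the post.
coverage = {"180bp": 57.3, "500bp": 22.9, "2K": 19.5, "5K": 10.7}

# Express every library relative to the one with the lowest coverage.
smallest = min(coverage.values())
relative = {lib: round(x / smallest, 2) for lib, x in coverage.items()}

# Pairwise ratios within the short and long groups.
short_pair = round(coverage["180bp"] / coverage["500bp"], 2)
long_pair = round(coverage["2K"] / coverage["5K"], 2)

print(relative)
print(f"180bp:500bp = {short_pair}:1, 2K:5K = {long_pair}:1")
```

The normalised values come out near 5.4 : 2.1 : 1.8 : 1, so 6:2:2:1 is a rounded description rather than an exact design ratio.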

      Other papers seem to use similar ratios, so I wonder whether this is the optimal ratio for de novo assembly or simply the coverage they could get at the time. In other words, if I want to perform de novo assembly with 200 bp, 500 bp, 2K and 5K libraries, and one paper has suggested that 45X is suitable, should I sequence all of these libraries to 45X, or to 45X, 16X, 16X and 8X respectively? Based on your suggestion above, I think it would be 45X for the 200 bp and 500 bp libraries and 22X for the 2K and 5K libraries. Am I right? Or could you give me other suggestions?

      Thanks for any suggestion!

      Best wishes!

      XM Zhong


      • westerman
        Rick Westerman
        • Jun 2008
        • 1104

        #4
        The answer is both "because of coverage at the moment" and "because someone came up with the ratios they used". I do not think that there is a universally accepted ratio between short libraries and long libraries. We do wish to have more short-library sequence than long-library sequence, but beyond that it becomes more of an "I like this ratio" statement than anything else.

        For technical reasons, libraries become harder to construct the longer they are, and the size error becomes worse, so saying "we need more 2K long library sequences than 20K sequences" is reasonable. Those 20K sequences may be mostly worthless anyway. Also be aware that upon filtering and QC, long (mate-pair) libraries will lose many more sequences/bases than short (paired-end) libraries, so you need to order correspondingly more lanes than expected. Hence my general 2:1 ratio.

        Looking at the paper, they did 41 lanes of sequencing (HiSeq 2000). Not surprisingly, they do not mention how many lanes they allocated to each library, so we cannot tell whether the drop-off in the number of reads from the 2K, 5K, 10K, and 20K libraries is due to sequencing loss or to some pre-set ratio of lanes ordered. Probably both. It may also have been affected by budget constraints and/or budget re-allocations mid-stream within the project.

        My suggestion is to order enough lanes to do 50x coverage for the short libraries (the contigs) and enough lanes to (at least in theory) do 25x coverage for the long libraries, and then just be satisfied with what you get from the long libraries (which will be less than 25x). The mate-pair libraries will tend to lose about 25% of their reads, so the final ratio will be closer to 50:19, or roughly 2.7:1.
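The effect of mate-pair read loss on the final ratio can be worked through in a few lines. This is a sketch under two assumptions: the loss is exactly 25%, and the short libraries lose nothing, which is a simplification.

```python
def effective_coverage(ordered_coverage: float, loss_fraction: float) -> float:
    """Coverage left after a given fraction of reads is lost to filtering/QC."""
    return ordered_coverage * (1.0 - loss_fraction)


SHORT_ORDERED = 50.0    # coverage ordered for the short (paired-end) libraries
LONG_ORDERED = 25.0     # coverage ordered for the long (mate-pair) libraries
MATE_PAIR_LOSS = 0.25   # assumption: exactly 25% of mate-pair reads are lost

long_effective = effective_coverage(LONG_ORDERED, MATE_PAIR_LOSS)
final_ratio = SHORT_ORDERED / long_effective  # short libraries assumed lossless

print(f"long-library coverage after loss: {long_effective}x")
print(f"final short:long ratio: {final_ratio:.2f}:1")
```

Under these assumptions the long libraries end up at 18.75x and the ordered 2:1 ratio drifts to about 2.7:1; a larger loss pushes it further.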

        But it does depend on your budget. The 41 lanes that they ordered would be somewhere on the order of USD $100,000. Most of the plant and animal projects I work with have a much smaller budget and thus skimp on the number of lanes ordered. Generally I am lucky to have 2 lanes of paired-end plus 1 lane of a single mate-pair library. It is possible to get by with fewer mate-pair library reads. Having multiple mate-pair libraries is wonderful but not at the cost of going less than 10x coverage per library. When in doubt order more of the short library.
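To put the budget point in numbers, the sketch below derives a per-lane cost from the rough figure quoted above (41 lanes for about USD $100,000) and applies it to the smaller 2 + 1 lane design. The per-lane cost is an order-of-magnitude estimate derived from that figure, not a quoted price.

```python
PAPER_LANES = 41
PAPER_COST_USD = 100_000  # "somewhere on the order of", per the post

# Rough per-lane cost implied by the two figures above.
cost_per_lane = PAPER_COST_USD / PAPER_LANES

# The smaller design described in the post: 2 paired-end lanes + 1 mate-pair lane.
paired_end_lanes = 2
mate_pair_lanes = 1
small_project_cost = (paired_end_lanes + mate_pair_lanes) * cost_per_lane

print(f"approx. cost per lane:      ${cost_per_lane:,.0f}")
print(f"approx. cost of 2 + 1 lanes: ${small_project_cost:,.0f}")
```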

        Bottom line: the number of lanes to order is not an exact science and, to a large extent, depends on your budget. If I were able to order up 41 lanes for a project ... gee ... I'd go hog wild in my ratios.


        • lamz138138
          Member
          • Mar 2015
          • 10

          #5
          Hi westerman, thank you very much for your detailed reply, especially for sharing your experience with sequencing coverage in de novo assembly, which has given me important guidance for my project!

          Thanks again!

          Best wishes!

          XM Zhong
