
  • library design in de novo assembly, thanks!

    Hi, everyone!

    I have a question about choosing the coverage for each library when performing de novo assembly. Usually there are short fragment libraries (250bp-800bp) and long fragment libraries (2K-40K), and the coverage ratio is about 3:1 between the 250bp and 500bp libraries and about 2:1 between the 2K and 5K libraries. Why are these ratios used? Can I just sequence every library to the same coverage, or do you have any other suggestion?

    Thanks in advance!

    Best wishes!

    XM Zhong

  • #2
    This is for Illumina, of course. I am not sure what you mean by the 3:1 and 2:1 ratios.

    The short fragment libraries are most commonly used to create contigs. However, short fragments have problems dealing with repeats and duplicated regions of the genome, so even with high coverage the contigs will be truncated. The long fragment libraries are then used to deal with these problem areas by stringing the contigs together into scaffolds.

    You can sequence both libraries at the same coverage. However, since less information is needed to create scaffolds than to create contigs, it is usually better to put more effort into the short fragment libraries. We usually recommend about a 2:1 ratio. In other words, if it takes 1 HiSeq lane to produce the number of bases needed for a given species' short fragments (a genome of approximately 1 Gb), then we would do 1/2 of a HiSeq lane for the long fragments. Multiple long fragment libraries can be useful.
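The lane arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the lane yield of 35 Gb is an assumed placeholder, not a figure from this thread. Substitute the yield your sequencing provider actually quotes.

```python
def lanes_needed(target_coverage, genome_size_gb, lane_yield_gb):
    """Lanes required to reach a raw coverage target.

    coverage = total bases sequenced / genome size, so
    lanes = target_coverage * genome_size / yield_per_lane.
    """
    if genome_size_gb <= 0 or lane_yield_gb <= 0:
        raise ValueError("genome size and lane yield must be positive")
    return target_coverage * genome_size_gb / lane_yield_gb


def plan_lanes(short_lanes, short_to_long_ratio=2.0):
    """Split sequencing effort using a short:long ratio (2:1 as suggested above)."""
    return {"short": short_lanes, "long": short_lanes / short_to_long_ratio}


# Assumed numbers: 1 Gb genome, 35x short-fragment target, 35 Gb per lane.
short = lanes_needed(target_coverage=35, genome_size_gb=1.0, lane_yield_gb=35.0)
plan = plan_lanes(short)
print(plan)
```

With these assumed inputs the short fragments need one lane and the long fragments get half a lane, matching the example in the post.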



    • #3
      Hi, westerman, thank you very much for your reply!

      Take the paper titled "Whole-genome sequencing of the snub-nosed monkey provides insights into folivory and evolutionary history" as an example: the sequencing coverage of the 180bp, 500bp, 2K and 5K libraries was 57.3X, 22.9X, 19.5X and 10.7X respectively. That is where I got the ratios of roughly 3:1 and 2:1, which could also be written as roughly 6:2:2:1.
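As a quick check of those ratios, here is a small Python sketch that normalizes the four coverage values quoted from the paper. The variable and function names are just for illustration.

```python
# Coverage per library as quoted from the snub-nosed monkey paper.
coverage = {"180bp": 57.3, "500bp": 22.9, "2K": 19.5, "5K": 10.7}


def normalize(values):
    """Express every coverage relative to the smallest one, rounded to 1 decimal."""
    smallest = min(values.values())
    return {name: round(cov / smallest, 1) for name, cov in values.items()}


relative = normalize(coverage)
short_pair = round(coverage["180bp"] / coverage["500bp"], 1)  # 180bp vs 500bp
long_pair = round(coverage["2K"] / coverage["5K"], 1)         # 2K vs 5K

print(relative)
print(short_pair, long_pair)
```

The exact values come out a little below the rounded 3:1, 2:1 and 6:2:2:1 figures, so those are best read as approximations.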

      I feel that other papers have similar ratios too, so I wonder whether this is the optimal ratio for de novo assembly or simply the coverage they were able to get at the time. In other words, if I want to perform de novo assembly with 200bp, 500bp, 2K and 5K libraries, and given that one paper suggested 45X is suitable, should I sequence all of these libraries to 45X, or to 45X, 16X, 16X and 8X respectively? Following your suggestion above, I think it would be 45X for the 200bp and 500bp libraries and 22X for the 2K and 5K libraries. Am I right? Or could you give me other suggestions?

      Thanks for any suggestion!

      Best wishes!

      XM Zhong



      • #4
        The answer is both "because of coverage at the moment" and "because someone came up with the ratios they used". I do not think that there is a universally accepted ratio between short libraries and long libraries. We do want more data from the short libraries than from the long ones, but beyond that it becomes more of an "I like this ratio" statement than anything else.

        For technical reasons, libraries become harder to construct the longer they are, and the size error becomes worse, so saying "we need more 2K long library sequences than 20K sequences" is reasonable. Those 20K sequences may be mostly worthless anyway. Also be aware that, upon filtering and QC, long (mate-pair) libraries will lose many more sequences/bases than short (paired-end) libraries, so you need to order correspondingly more lanes than expected. Hence my general 2:1 ratio.

        Looking at the paper, they did 41 lanes of sequencing (HiSeq 2000). Not surprisingly, they do not mention how many lanes they allocated to each library, so we cannot tell whether the drop-off in the number of reads from the 2K, 5K, 10K, and 20K libraries is due to sequencing loss or to some pre-set ratio of lanes ordered. Probably both. It may also have been affected by budget constraints and/or budget re-allocations mid-stream within the project.

        My suggestion is to order enough lanes to do 50x coverage for the short libraries (the contigs) and enough lanes to (at least in theory) do 25x coverage for the long libraries, and then just be satisfied with what you get from the long libraries (which will be less than 25x). The mate-pair libraries will tend to lose about 25% of their reads, so the final ratio will be closer to 50:19, or roughly 2.7:1.
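The effect of that read loss on the final ratio can be worked through with a short Python sketch. The 25% loss is the rough figure given above, the function name is illustrative, and any loss in the short libraries is ignored for simplicity.

```python
def retained_coverage(raw_coverage, loss_fraction):
    """Coverage left after filtering/QC removes a fraction of the reads."""
    if not 0 <= loss_fraction < 1:
        raise ValueError("loss_fraction must be in [0, 1)")
    return raw_coverage * (1 - loss_fraction)


short_cov = 50.0                            # ordered for the paired-end libraries
long_cov = retained_coverage(25.0, 0.25)    # mate-pair coverage after ~25% loss
ratio = short_cov / long_cov                # effective short:long ratio

print(long_cov, round(ratio, 2))
```

Under these assumptions the mate-pair libraries end up a little under 19x, and the effective short:long ratio lands between 2.5:1 and 3:1 rather than at the nominal 2:1.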

        But it does depend on your budget. The 41 lanes that they ordered would cost somewhere on the order of USD 100,000. Most of the plant and animal projects I work with have a much smaller budget and thus skimp on the number of lanes ordered. Generally I am lucky to have 2 lanes of paired-end plus 1 lane of a single mate-pair library. It is possible to get by with fewer mate-pair library reads. Having multiple mate-pair libraries is wonderful, but not at the cost of going below 10x coverage per library. When in doubt, order more of the short library.

        Bottom line: the number of lanes to order is not an exact science and, to a large extent, depends on your budget. If I were able to order up 41 lanes for a project ... gee ... I'd go hog wild in my ratios.



        • #5
          Hi, westerman, thank you very much for your detailed reply, especially for sharing your experience with sequencing coverage in de novo assembly. It has given me important guidance for my project!

          Thanks again!

          Best wishes!

          XM Zhong
