
  • PacBio assembly using sperm DNA

    Hi all,

    We're currently assembling a ~600 Mb genome from PacBio reads of sperm DNA. We're using three libraries with insert sizes of 10-13 kb and have ~100x coverage in total; even a read-length cut-off of 15 kb still leaves ~50x. Our current assembly, which includes reads down to 10 kb, more than doubled the contig N50 from ~200 kb to ~500 kb after scaffolding, but it also increased the total assembly size considerably (from 500 Mb to 690 Mb).
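For reference, the two statistics discussed above (contig N50 and the coverage left after a read-length cut-off) can be computed as in the minimal sketch below. The function names and the toy numbers are illustrative assumptions, not values from this project.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def coverage_above_cutoff(read_lengths, genome_size, min_length):
    """Return the fold coverage contributed by reads of at least min_length."""
    kept_bases = sum(length for length in read_lengths if length >= min_length)
    return kept_bases / genome_size


# Toy data (hypothetical): five contigs and five reads on a 10 kb "genome".
toy_contigs = [500, 400, 300, 200, 100]
toy_reads = [20_000, 16_000, 12_000, 9_000, 5_000]

print(n50(toy_contigs))
print(coverage_above_cutoff(toy_reads, genome_size=10_000, min_length=15_000))
```

Running the same two functions on the read lengths and contig lengths of each assembly makes it easy to see how a change in the read-length cut-off trades coverage against N50.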

    We have just started evaluating the assemblies, but I was expecting larger N50s given the sequencing depth. One thing I have wondered about from the beginning is whether recombination events present in the sperm DNA are frequent enough to interfere with the assembly and, if so, whether Falcon can resolve these conflicts based on coverage information. I'd assume that the overlap filtering settings would have problems removing these regions unless Falcon calculates coverage on a per-haplotype basis (i.e. coverage in a haplotype context).

    Unfortunately I couldn't find any information on this. Has anyone used sperm DNA for assembly before, or does anyone have information on how Falcon would deal with such "pseudo-chimeric" reads from recombined loci?

    Cheers,
    Zapp

  • #2
    The PacBio library and sequencing artefacts are the main cause of trouble.

    The frequency of the PacBio library and sequencing artefacts, chimeras (2%-10%) and siameras (1%-3%), would be 3-5 orders of magnitude higher than that of genuine meiotic recombination events (one every 10-100 Mbp of raw sequence).

    A high level of heterogeneity/polyploidy may also contribute to the problems.

    The PacBio library and sequencing artefacts are the cause of the trouble.

    To reliably filter those artefacts from a large eukaryotic genome you REQUIRE error correction of the PacBio datasets (see proovread), even if you later use only PacBio data for your de novo assembly (after splitting chimeras/siameras).

    For error correction you need a good-quality Illumina 2x250 or 2x300 bp dataset (50X-100X coverage from a PCR-free 350 bp library on a MiSeq or HiSeq 2500) and/or a PacBio CCS dataset at 30-40X coverage. The Illumina dataset can be pre-assembled using FLASH/PANDA and the merged overlapping reads used for error correction. The longer the HQ reads, the better the error correction results, especially in repetitive regions, so 2x100 or 2x150 datasets are of limited utility.
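To put the coverage figures above into sequencing terms, here is a back-of-the-envelope sketch of how many read pairs a given coverage implies. The ~600 Mb genome size is taken from the first post; the function name is hypothetical, and the calculation ignores adapter trimming, quality filtering and duplicate reads.

```python
import math


def read_pairs_needed(genome_size_bp, target_coverage, read_length_bp):
    """Number of paired-end read pairs needed to reach target_coverage,
    assuming both mates of every pair are fully usable."""
    bases_needed = genome_size_bp * target_coverage
    bases_per_pair = 2 * read_length_bp
    return math.ceil(bases_needed / bases_per_pair)


genome_size = 600_000_000  # ~600 Mb, as stated in the first post

for coverage in (50, 100):
    for read_length in (250, 300):
        pairs = read_pairs_needed(genome_size, coverage, read_length)
        print(f"{coverage}X with 2x{read_length}: {pairs:,} read pairs")
```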

    Also, error correction/k-mer counting is very sensitive to errors in the raw reads, so try to get reads of as high a quality as possible (slight underclustering on the MiSeq/HiSeq 2500 platforms is recommended).

    PS: Also give the Canu assembler a try on the uncorrected PacBio data.



    • #3
      Hi Zapp,

      I'm currently involved in a project where we are doing just that: using sperm DNA for de novo assembly in FALCON. It's not my project, so I can't go into the details; I'm simply helping on the assembly side.

      We are still in the preliminary stages with a highly heterozygous organism with an approximately 800 Mb haploid genome, and one of the goals is to identify possible recombinant reads. We went with a sperm sample in this particular case because tissue is difficult to work with due to a plethora of secondary metabolites in this organism.

      The high heterozygosity is limiting our contiguity here, giving us a contig N50 of ~600 kb, but with a maximum contig size of up to 4 Mb.

      Recombinant reads should occur at low frequency, and assuming there are no recombination hotspots (*this is a major assumption!!!), your falcon_sense_option and overlap_filtering_setting options should hopefully help weed out recombinant reads that do not have enough support. That being said, recombination hotspots certainly have the potential to throw off the algorithm and limit overall assembly contiguity.
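The idea of weeding out reads without enough support can be illustrated with a simplified sketch of coverage-based overlap filtering. The threshold names mirror the --min_cov, --max_cov and --max_diff flags commonly passed via overlap_filtering_setting, but the function, the overlap counts and the threshold values below are hypothetical, and this is not FALCON's actual implementation.

```python
def keep_read(five_prime_overlaps, three_prime_overlaps,
              min_cov=2, max_cov=100, max_diff=100):
    """Decide whether a read is kept, based on how many other reads
    overlap its 5' end and its 3' end (illustrative thresholds)."""
    if min(five_prime_overlaps, three_prime_overlaps) < min_cov:
        return False  # too little support at one end
    if max(five_prime_overlaps, three_prime_overlaps) > max_cov:
        return False  # suspiciously high coverage, e.g. a repeat
    if abs(five_prime_overlaps - three_prime_overlaps) > max_diff:
        return False  # strongly unbalanced support between the two ends
    return True


# Hypothetical (5' overlaps, 3' overlaps) counts per read.
overlap_counts = {
    "read_a": (30, 28),    # well supported at both ends
    "read_b": (31, 1),     # support collapses at one end
    "read_c": (250, 240),  # repeat-like coverage
}

kept = sorted(name for name, (five, three) in overlap_counts.items()
              if keep_read(five, three))
print(kept)  # ['read_a']
```

Under this kind of filter, a read is only removed when its support is low in absolute terms, which is why a recombination hotspot, where many reads share the same junction, could slip through.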

      We would have preferred starting from somatic tissue for this project, but for reasons I mentioned earlier, we went with a sperm sample. Can I ask why you decided to go with a sperm sample in your case? Is your organism highly heterozygous?

      Also, if you have enough PacBio data for assembly, then you also have enough for error correction. No need for extra short-read data. If you have a polyploid organism, you may benefit from FALCON_unzip and one or more subsequent rounds of polishing with PacBio raw data.
      Last edited by gconcepcion; 10-31-2017, 03:05 PM.



      • #4
        10 to 13 kb libraries sound a bit short. To what length were the samples sheared, and which cut-off did you use for the Pippin size selection?



        • #5
          Hi all,

          Thanks for the replies. I'll try to address them one by one.

          @Markiyan, yes, heterogeneity is likely a problem with our organisms; we have encountered this before in our short-read assemblies. I was hoping that PacBio would have fewer problems with it. At least it seems that recombination events should be a minor problem, so thanks for the info. As for the error correction, I was expecting 100x coverage to be enough for efficient error correction. We'll try a Canu assembly and see if it improves things.

          @gconcepcion, I also think that our coverage should be sufficient for error correction, but I might be overly optimistic. Unfortunately our organisms also show high levels of heterozygosity, and the final N50 of ~500 kb was achieved after additional scaffolding and one round of polishing. We're trying to see if we can further improve this using Falcon_unzip while testing alternative assemblers.

          As for the sample, we are dealing with a symbiotic organism and symbiont contamination is an issue, hence the decision to use sperm DNA. Unfortunately we cannot generate inbred lines so there's no alternative but to find ways to deal with heterozygosity on a bioinformatic level.

          @luc, unfortunately our facility doesn't offer 20 kb libraries. They tried several times but failed and therefore do not offer sizes above 15 kb. However, the 15 kb libraries we ordered ended up ranging between 10 and 13 kb. As I mentioned in my first post, we get ~50x coverage from reads >15 kb, which is not optimal but the best we can expect from our in-house facility at the moment. Do you think this is the main problem? I was considering throwing in some nanopore reads, but I am not impressed by the throughput and read-length distribution.

          Cheers,
          Zapp
