SEQanswers

Old 10-29-2017, 11:55 PM   #1
Zapp
Junior Member
 
Location: Saudi Arabia

Join Date: Mar 2011
Posts: 6
PacBio assembly using sperm DNA

Hi all,

We're currently working on assembling a ~600 Mb genome using PacBio sequences from sperm DNA. We're using 3 different libraries with insert sizes ranging from 10-13 kb and have ~100x coverage in total; even with a read length cut-off of 15 kb we still get ~50x coverage. Our current assembly, after including reads down to 10 kb, doubled the contig N50 from ~200 kb to ~500 kb after scaffolding but also increased the total assembly size considerably (500 Mb to 690 Mb).
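For anyone wanting to reproduce these two numbers on their own data, here is a minimal sketch of how N50 and coverage above a read length cut-off can be computed. The function names and the toy lengths are made up for illustration; they are not taken from Falcon or any other tool.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def coverage_above_cutoff(read_lengths, genome_size, min_length):
    """Fold coverage contributed by reads that are at least min_length long."""
    kept_bases = sum(l for l in read_lengths if l >= min_length)
    return kept_bases / genome_size


if __name__ == "__main__":
    # Toy values only; real input would be contig or read lengths
    # parsed from a FASTA/FASTQ file.
    toy_contigs = [10, 8, 6, 4, 2]
    toy_reads = [5000, 12000, 16000, 20000]
    print("N50:", n50(toy_contigs))
    print("coverage >= 15 kb:", coverage_above_cutoff(toy_reads, 1000, 15000))
```

Running the same two functions at each cut-off (10 kb, 15 kb) makes it easy to see how much coverage each threshold trades away.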

We have just started evaluating the assemblies, but I was expecting larger N50s given the sequencing depth. One thing I have been wondering since the beginning is whether recombination events present in the sperm DNA are frequent enough to mess with the assembly and, if so, whether Falcon is able to resolve these conflicts based on coverage information. I'd assume that the overlap filtering settings would have problems removing these regions unless Falcon calculates coverage on a per-haplotype basis (i.e. coverage in haplotype context).

Unfortunately I couldn't find any information on this. Has anyone used sperm DNA for assembly before, or does anyone have information on how Falcon would deal with such "pseudo-chimeric" reads from recombined loci?

Cheers,
Zapp
Old 10-30-2017, 06:59 AM   #2
Markiyan
Senior Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 109
The PacBio library & sequencing artefacts are the main cause of trouble.

PacBio library & sequencing artefacts, i.e. chimeras (2%-10%) and siameras (1%-3%), would be 3-5 orders of magnitude more frequent than genuine meiotic recombination events (one every 10-100 Mbp of raw sequence).

A high level of heterogeneity/polyploidy may also contribute to the problems.

So the PacBio library & sequencing artefacts are the likely cause of trouble.

To reliably filter those artefacts from a large eukaryotic genome you REQUIRE error correction of the PacBio datasets (see proovread), even if you use only PacBio data for your de novo assembly later on (after splitting chimeras/siameras).

For error correction you need a good-quality Illumina 2x250 or 2x300 bp dataset (50x-100x coverage from a PCR-free 350 bp library on a MiSeq or HiSeq 2500) and/or a PacBio CCS dataset at 30-40x coverage. The Illumina dataset can be pre-assembled using FLASH/PANDA and the merged overlapping reads used for error correction. The longer the HQ reads, the better the error correction results, especially in repetitive regions, so 2x100 or 2x150 datasets are of limited utility.
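As a rough guide to what that coverage means in sequencing volume, the arithmetic can be sketched as below. The 600 Mb genome size is the figure from the first post; the function name is made up for illustration, and the calculation ignores bases lost to trimming or to merging of overlapping pairs, so treat the result as a lower bound.

```python
import math


def read_pairs_needed(genome_size_bp, target_coverage, read_length_bp):
    """Number of paired-end read pairs needed to reach target_coverage,
    assuming every base of both mates counts (no trimming, no overlap loss)."""
    bases_per_pair = 2 * read_length_bp
    return math.ceil(genome_size_bp * target_coverage / bases_per_pair)


if __name__ == "__main__":
    genome = 600_000_000  # ~600 Mb, as stated in the first post
    for read_len in (250, 300):
        for cov in (50, 100):
            pairs = read_pairs_needed(genome, cov, read_len)
            print(f"2x{read_len} at {cov}x: {pairs:,} read pairs")
```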

Also, error correction/k-mer counting is very sensitive to errors in the raw reads, so try to get reads of as high a quality as possible (slight underclustering on the MiSeq/HiSeq 2500 platforms is recommended).

PS: Also give the Canu assembler a try on the uncorrected PacBio data.
Old 10-31-2017, 01:41 PM   #3
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 67

Hi Zapp,

I'm currently involved in a project where we are doing just that: using sperm DNA for de novo assembly in FALCON. It's not my project, so I can't go into the details; I'm simply helping on the assembly side.

We are still in the preliminary stages with a highly heterozygous organism with an approximately 800 Mb haploid genome, with one of the goals being to identify possible recombinant reads. We went with a sperm sample in this particular case because tissue is difficult to work with due to a plethora of secondary metabolites in this organism.

The high heterozygosity is limiting our contig N50 in this particular case, giving us an N50 of ~600kb, but with a maximum contig size up to 4Mb.

Recombinant reads should occur at low frequency, and assuming there are no recombination hotspots (this is a major assumption!), your falcon_sense_option and overlap_filtering_setting options should hopefully help weed out recombinant reads that do not have enough support. That being said, recombination hotspots certainly have the potential to throw off the algorithm and limit overall assembly contiguity.
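To illustrate the general idea of support-based filtering, here is a toy sketch. It is NOT FALCON's implementation: the function name, the parameter names min_support and max_support, and the flat overlap list are all assumptions made for illustration, and a real filter works on richer overlap records than simple read pairs.

```python
from collections import Counter


def reads_with_enough_support(overlaps, min_support, max_support):
    """Keep reads whose number of overlaps with other reads lies in
    [min_support, max_support].

    overlaps: iterable of (read_a, read_b) pairs, one per detected overlap.
    Reads with too few overlaps (poorly supported, e.g. rare artefacts)
    or too many (likely repeats) are dropped.
    """
    support = Counter()
    for read_a, read_b in overlaps:
        support[read_a] += 1
        support[read_b] += 1
    return {read for read, n in support.items() if min_support <= n <= max_support}


if __name__ == "__main__":
    # Hypothetical overlaps: r4 overlaps only one other read.
    toy_overlaps = [("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r3", "r4")]
    print(sorted(reads_with_enough_support(toy_overlaps, 2, 3)))
```

In this toy case a read seen in only one overlap falls below the minimum and is discarded, which is the behaviour one would hope for with a rare recombinant read; at a hotspot, many reads would share the same junction and pass the filter.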

We would have preferred starting from somatic tissue for this project, but for reasons I mentioned earlier, we went with a sperm sample. Can I ask why you decided to go with a sperm sample in your case? Is your organism highly heterozygous?

Also, if you have enough PacBio data for assembly, then you also have enough for error correction. No need for extra short-read data. If you have a polyploid organism, you may benefit from FALCON_unzip and one or more subsequent rounds of polishing with PacBio raw data.

Last edited by gconcepcion; 10-31-2017 at 03:05 PM.
Old 10-31-2017, 11:50 PM   #4
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 324

10 to 13 kb libraries sound a bit short. To what length were the samples sheared, and which cut-off did you use for the Pippin size selection?
Old 11-01-2017, 12:48 AM   #5
Zapp
Junior Member
 
Location: Saudi Arabia

Join Date: Mar 2011
Posts: 6

Hi all,

Thanks for the replies. I'll try to address them one by one.

@Markiyan, yes, heterogeneity is likely a problem with our organisms; we have encountered this before in our short-read assemblies. I was hoping that PacBio would have fewer problems with it. At least it seems that recombination events should be a minor problem, so thanks for the info. As for the error correction, I was expecting 100x coverage to be enough for efficient error correction. We'll try a Canu assembly and see if it improves the result.

@gconcepcion, I also think that our coverage should be sufficient for error correction, but I might be overly optimistic. Unfortunately our organisms also show high levels of heterozygosity, and the final assembly N50 of ~500 kb was achieved after additional scaffolding and 1 round of polishing. We're trying to see if we can further improve this using Falcon_unzip while testing alternative assemblers.

As for the sample, we are dealing with a symbiotic organism and symbiont contamination is an issue, hence the decision to use sperm DNA. Unfortunately we cannot generate inbred lines so there's no alternative but to find ways to deal with heterozygosity on a bioinformatic level.

@luc, unfortunately our facility doesn't offer 20 kb libraries. They tried several times but failed and therefore do not offer sizes above 15 kb. However, the 15 kb libraries we ordered ended up ranging between 10 and 13 kb. As I mentioned in my first post, we get ~50x coverage from reads >15 kb, which is not optimal but the best we can expect from our in-house facility at the moment. Do you think this is the main problem? I was considering throwing in some nanopore reads, but I am not impressed by the throughput and read length distribution.

Cheers,
Zapp
All times are GMT -8.

