SEQanswers

Old 10-17-2018, 03:57 PM   #1
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 70
PBJelly novice needs advice

I have a Supernova assembly of 10x Genomics data, for which I also have 4 SMRT cells of PacBio Sequel data. My general workflow so far has been:

Supernova (using 10x Genomics data)
SSPACE-LongRead (using PacBio Sequel data)
GapFiller (using 10x Genomics data)
PBJelly (using PacBio Sequel data)

I saw steady improvement in the assembly up through GapFiller, but when I ran PBJelly at default settings the output seemed to be in worse shape than the input. Our guiding metrics were total assembly length (which we expect to be 400 Mb) and BUSCO completeness. The GapFiller results looked good, at 414 Mb total length and 88.8% of core genes found by BUSCO. But the output of default PBJelly grew to 550 Mb and BUSCO completeness dropped to 82.8%.

I then tried running PBJelly restricted to internal gap filling only, to address the issue with the overall length. It performed better with this option set, but the assembly was still too long at 500 Mb, and the BUSCO result was still slightly worse than the input at 88.4% (1 core gene fewer than was found in the GapFiller results that served as the input).

So I could use some advice on how to tune PBJelly for my project. Are there certain input assembly metrics I can look at to guide my choice of parameters? Any advice would be greatly appreciated.
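
For anyone wanting to track the same length metric before and after each step, here is a minimal sketch of how total length, gap (N) bases, and scaffold N50 could be computed from an assembly FASTA. The function names (read_fasta, assembly_stats) are just illustrative, and this is not part of PBJelly or any of the tools above; BUSCO completeness still has to come from BUSCO itself.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the scaffold name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def assembly_stats(handle):
    """Return scaffold count, total length, N (gap) bases, and scaffold N50."""
    lengths = []
    gap_bases = 0
    for _, seq in read_fasta(handle):
        lengths.append(len(seq))
        gap_bases += seq.upper().count("N")
    total = sum(lengths)
    # N50: length of the scaffold at which the running sum of
    # longest-first scaffold lengths reaches half the total length.
    running, n50 = 0, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "scaffolds": len(lengths),
        "total_length": total,
        "gap_bases": gap_bases,
        "n50": n50,
    }


# Tiny in-memory demo; for a real assembly, pass open("assembly.fasta") instead.
demo = io.StringIO(">s1 example\nACGTNN\nACGT\n>s2\nACG\n")
print(assembly_stats(demo))
```

Comparing gap_bases and total_length between the GapFiller output and the PBJelly output would show how much of the growth comes from filled gaps versus extended or newly joined sequence.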

Thanks,
John
Old 10-20-2018, 06:53 AM   #2
Gopo
Member
 
Location: Louisiana

Join Date: Nov 2013
Posts: 30

For the PacBio Sequel data, were you using the raw subreads? That's what I would recommend. By default the PacBio subread BAM files give a quality score of Q0 to all bases, and PBJelly needs quality scores for the bases. When I ran PBJelly, I gave all PacBio Sequel subread sequences a quality score of Q30.
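
One way the quality values could be swapped, as a minimal sketch: rewrite each FASTQ record with a uniform quality string. It assumes plain 4-line FASTQ records and Phred+33 encoding (where Q30 is the character "?"); the function name set_uniform_quality is hypothetical and is not something PBJelly provides.

```python
import io

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) quality encoding


def set_uniform_quality(src, dst, q=30):
    """Copy 4-line FASTQ records from src to dst, replacing every
    quality string with a run of the character for quality q.
    Returns the number of records written."""
    qual_char = chr(PHRED_OFFSET + q)
    count = 0
    while True:
        header = src.readline()
        if not header:
            break  # end of input
        seq = src.readline().rstrip("\r\n")
        plus = src.readline()
        src.readline()  # original quality line, discarded
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("input does not look like 4-line FASTQ")
        dst.write(header.rstrip("\r\n") + "\n")
        dst.write(seq + "\n+\n")
        dst.write(qual_char * len(seq) + "\n")
        count += 1
    return count


# In-memory demo; for real data, open the input and output FASTQ files instead.
demo_in = io.StringIO("@read1\nACGT\n+\n!!!!\n")
demo_out = io.StringIO()
set_uniform_quality(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Reading record by record keeps memory use flat, which matters for several SMRT cells' worth of subreads.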

Last edited by Gopo; 10-20-2018 at 06:54 AM. Reason: clarity
Old 10-22-2018, 02:51 PM   #3
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 70

I am using subreads that were generated from SMRT Link using the 'bam to fastx' method, and they do all appear to be quality 0. Is there some way to force PBJelly to assume all quality values are 30? Or do I need to swap out the quality values myself in the input FASTQ?
Old 10-22-2018, 11:13 PM   #4
Gopo
Member
 
Location: Louisiana

Join Date: Nov 2013
Posts: 30

I have a gist that should help, with the instructions I used for running PBJelly and then correcting indels (not with Pilon but with a variant caller); see https://gist.github.com/jelber2/730f...3d5da2c97bedea
Old 10-26-2018, 03:38 PM   #5
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 70

I've got PBJelly running now using subreads for which I've swapped in Q30 values. I'm running one instance with all defaults and another using support -x "--capturedOnly" in the hope of minimizing the expansion of my input genome. I hadn't set a mapping quality filter, which you use in your workflow; I may launch another instance to try that out.

Your workflow is very interesting. I will look into variant calling (via BBMap) for error correction and compare it to Pilon (which had been my original plan). Thanks for all the suggestions and info!


John
Old 10-27-2018, 04:56 AM   #6
Gopo
Member
 
Location: Louisiana

Join Date: Nov 2013
Posts: 30

Here is a script for running Pilon twice quickly by splitting the genome to be corrected into parts, generating intervals for Pilon to speed up its processing, and combining the parts again.

It assumes you have a file called genome.pilon-0.fasta in whatever you call work-dir:
https://gist.github.com/jelber2/0c7f...cef40b5946a393

It is really only useful if you have many cores on your server (i.e., >4) and probably 100 GB of RAM.

It obviously also depends on the number of Illumina reads and the size of the genome being corrected (and the number of scaffolds in the file genome.pilon-0.fasta).
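
To illustrate the splitting idea only (this is not the code from the gist), here is a rough sketch that groups scaffolds into batches of roughly equal total size, using the name and length columns of a samtools faidx index (.fai). The function names and the batch count are assumptions for the example. Each batch of names could then be given to a separate Pilon run, for instance via its --targets option, and the corrected pieces concatenated afterwards.

```python
import heapq
import io


def read_fai(handle):
    """Read scaffold names and lengths from the first two columns of a .fai index."""
    lengths = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def batch_scaffolds(lengths, n_batches):
    """Greedy longest-first assignment: each scaffold goes to the batch
    that currently holds the fewest total bases."""
    heap = [(0, i) for i in range(n_batches)]  # (bases so far, batch index)
    heapq.heapify(heap)
    batches = [[] for _ in range(n_batches)]
    for name, length in sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        batches[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return batches


# In-memory demo; for real data, open genome.pilon-0.fasta.fai instead.
demo_fai = io.StringIO("scafA\t10\t7\t60\t61\nscafB\t7\t25\t60\t61\nscafC\t5\t40\t60\t61\n")
for i, batch in enumerate(batch_scaffolds(read_fai(demo_fai), 2)):
    print(i, ",".join(batch))
```

Balancing by total bases rather than by scaffold count keeps one batch from holding all the large scaffolds, so the parallel runs finish at about the same time.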