SEQanswers

09-27-2015, 08:34 PM   #1
ymc
Senior Member
Location: Hong Kong
Join Date: Mar 2010
Posts: 498

How do I calculate read score?

The SMRT pipeline can use "read score" to filter out reads. "Read Score" seems to be defined as the fraction of bases in a read that are expected to be correct, a number between 0 and 1.

But how do I calculate this metric? Can I do it from the filtered fastq file? Or do I need to read the bax.h5 files?
09-28-2015, 09:58 AM   #2
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

I don't really use the SMRT pipeline, but you can use BBMap's reformat.sh to filter the fastq files based on the expected error rate (phred-scaled). For example:

reformat.sh in=subreads.fq out=filtered.fq maq=20


...will keep only the reads with an expected error rate of at most 1%, i.e. an average base quality of at least Q20.
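
For reference, the maq cutoff and the error rate are related through the phred definition, P(error) = 10^(-Q/10). A minimal sketch of the conversion (Python; phred_to_error is just an illustrative helper, not part of BBMap):

def phred_to_error(q):
    """Expected per-base error probability at an average quality of q."""
    return 10 ** (-q / 10)

print(phred_to_error(20))  # 0.01  -> maq=20 keeps reads with <= 1% expected error
print(phred_to_error(30))  # 0.001 -> maq=30 would tighten this to 0.1%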
09-29-2015, 08:54 AM   #3
rhall
Senior Member
Location: San Francisco
Join Date: Aug 2012
Posts: 322

You can calculate a read score from the base QVs in the fastq, but it will not be exactly the same as the "Read Score" defined in the bax.h5.
I would be careful using either the Read Score or the fastq QVs for PacBio data, as they are not well calibrated. A Read Score cutoff of 0.75 or 0.8 filters out a lot of junk reads, but I would not have much confidence in it for distinguishing a 0.80 read from a 0.82 read.
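
As a concrete sketch of that calculation: convert each base QV to an error probability and take one minus the mean. This is only an approximation (as noted above, it will not reproduce the bax.h5 "Read Score" exactly), and the helper name is illustrative:

def read_accuracy(quality_string, offset=33):
    """Approximate read score: the expected fraction of correct bases,
    estimated from the per-base phred QVs of one fastq record."""
    errs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return 1.0 - sum(errs) / len(errs)

print(read_accuracy("+" * 100))  # 100 bases, all Q10 (10% error each) -> 0.9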
09-29-2015, 09:49 AM   #4
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

I mostly agree that, for raw data, the QVs are not particularly useful. But they are quite useful for consensus (CCS/reads-of-insert, or error-corrected) data.
10-03-2015, 05:44 AM   #5
ymc
Senior Member
Location: Hong Kong
Join Date: Mar 2010
Posts: 498

Quote:
Originally Posted by rhall
Thanks for your reply. I once tried increasing the filter from 0.8 to 0.83 for an 80x E. coli dataset, but it made no difference in the assembly.

I wanted to learn more about read score because I want to tweak parameters in my HGAP pipeline to improve the assembly. So far I have tried changing minReadScore, xCoverage, and minLongReadLength, but nothing changed. The default filter removes reads shorter than 500 bp. Do you think increasing that cutoff could improve the assembly?
10-05-2015, 08:47 AM   #6
rhall
Senior Member
Location: San Francisco
Join Date: Aug 2012
Posts: 322

From a lot of experience with HGAP and bacterial assembly, the one thing I can say is that it is extremely robust. Changing parameters will not have much of an effect on the quality of an assembly. Time is much better spent diagnosing why the assembly did not give the expected results with the default parameters.
The first things to check are the pre-assembled read N50, the pre-assembled coverage (# pre-assembled bases / genome size), and the pre-assembled yield. An N50 < 5 kb will cause problems for many bacterial assemblies, as you will not be able to span common repeats. A yield < 0.70 indicates that something is going wrong with the pre-assembly. The pre-assembled coverage should be > 15x.
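
(As a worked sketch of those three checks, assuming yield = pre-assembled bases / seed bases, which matches the reports later in this thread; the function and names are illustrative:)

def preassembly_checks(preassembled_bases, seed_bases, genome_size, n50):
    """Print the pre-assembly health checks described above."""
    yld = preassembled_bases / seed_bases   # pre-assembled yield
    cov = preassembled_bases / genome_size  # pre-assembled coverage
    print("yield    %.2f   (want > 0.70)" % yld)
    print("coverage %.1fx  (want > 15x)" % cov)
    print("N50      %d     (want > 5 kb for most bacteria)" % n50)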
10-07-2015, 11:16 PM   #7
ymc
Senior Member
Location: Hong Kong
Join Date: Mar 2010
Posts: 498

Quote:
Originally Posted by rhall
Thanks a lot for your reply. My Pre-assembler report for P6C4 is:

Polymerase Read Bases      779850648
Length Cutoff              21586
Seed Bases                 138056215
Pre-Assembled Bases        44327787
Pre-Assembled Yield        0.32108505220138045
Pre-Assembled Reads        4630
Pre-Assembled Read Length  9574
Pre-Assembled N50          16443

So the yield is 0.32 (< 0.7) and the pre-assembled coverage is 9.63x (< 15x). How do I fix this? I basically used the example HGAP3 params.xml included in the package, changing only genomeSize to 4600000.
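
(Checking that arithmetic against the report, with yield taken as pre-assembled bases / seed bases:)

print(44327787 / 138056215)  # 0.3211 -> the reported pre-assembled yield
print(44327787 / 4600000)    # 9.64   -> pre-assembled coverage, in x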
10-08-2015, 12:12 AM   #8
ymc
Senior Member
Location: Hong Kong
Join Date: Mar 2010
Posts: 498

The P5C3 stats look much better and satisfy all the criteria you mentioned:

Polymerase Read Bases      373874428
Length Cutoff              13715
Seed Bases                 139256931
Pre-Assembled Bases        113989293
Pre-Assembled Yield        0.8185538212098039
Pre-Assembled Reads        8887
Pre-Assembled Read Length  12826
Pre-Assembled N50          15326

I think increasing the pre-assembled bases would help my P6C4 assembly, but how can I do that?
10-08-2015, 10:05 AM   #9
rhall
Senior Member
Location: San Francisco
Join Date: Aug 2012
Posts: 322

Something looks amiss with the P6C4 data / assembly. How is the loading for the cell (% P0, P1, and P2)? Can you post the subread length histogram? Can you provide any more information on how the library was prepped?
10-08-2015, 05:18 PM   #10
ymc
Senior Member
Location: Hong Kong
Join Date: Mar 2010
Posts: 498

https://github.com/PacificBioscience...ary-with-P6-C4

I was using the data I downloaded from PacBio. It is a 20 kb size-selected library. Does that mean I should filter out reads significantly longer than 20 kb, e.g. reads > 30 kb?
10-08-2015, 05:21 PM   #11
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by ymc
No, those are the most valuable reads. Size selection is not perfect, and the goal here is mainly to aim for 20 kbp and longer anyway, not to prevent longer reads from being sequenced.
10-09-2015, 02:15 PM   #12
rhall
Senior Member
Location: San Francisco
Join Date: Aug 2012
Posts: 322

I just looked at the data with the latest version of SMRT Analysis (2.3.0, patch 4), and it looks like the subread length distribution is such that the automatic calculation of the seed read length is too aggressive; the resulting yield of pre-assembled reads is a little low to get a good assembly. If you uncheck "Compute Minimum Seed Read Length" and set the seed length to 15000, the result is a single contig with overlapping ends that can be circularized.
Note: although the E. coli is K-12, it is not base-for-base identical to the published reference, and some of the variants have been validated. You may want to check any differences between the assembly and the reference; in many cases the assembly is correct and the strain being sequenced has simply diverged a little from the reference.
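
For intuition about the automatic cutoff: the seed length appears to be chosen so that the reads above the cutoff total roughly xCoverage times the genome size; with this thread's 4.6 Mb genome and an xCoverage of 30, that target is ~138 Mb, matching the Seed Bases in the reports above. A long tail of very long subreads then pushes the cutoff up (here to 21586). A conceptual sketch of that selection, not HGAP's actual code:

def seed_length_cutoff(read_lengths, genome_size, x_coverage=30):
    """Pick a length cutoff so that reads above it total about
    x_coverage * genome_size bases (a conceptual model of the
    "Compute Minimum Seed Read Length" behavior)."""
    target = x_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target:
            return length  # reads at least this long become seeds
    return min(read_lengths)   # not enough data; seed with everything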
10-09-2015, 06:56 PM   #13
ymc
Senior Member
Location: Hong Kong
Join Date: Mar 2010
Posts: 498

Quote:
Originally Posted by rhall
Thank you for your reply. I will give that a try.

I suppose "Pre-Assembled Yield" is not what matters in itself; rather, the pre-assembled bases should be > 25x the genome size, correct?

Do you know whether, for the Quiver step in the HGAP pipeline, the reads I filtered out at the P_Filter step are still used? When I run Quiver after PBcR, I always polish with all the reads, even if the input fastq to PBcR is only what was left after filtering. It would be good to know whether HGAP corrects with all the reads or only the post-filter reads.
10-10-2015, 04:20 AM   #14
ymc
Senior Member
Location: Hong Kong
Join Date: Mar 2010
Posts: 498

I found that I can improve the pre-assembled yield for P6C4 just by changing targetChunks to 1 and setting splitBestn to the same value as totalBestn, which is 24.

Polymerase Read Bases      779850648
Length Cutoff              21553
Seed Bases                 139264087
Pre-Assembled Bases        50571157
Pre-Assembled Yield        0.36313135776346994
Pre-Assembled Reads        6001
Pre-Assembled Read Length  8427
Pre-Assembled N50          15684

But it ended up with broken contigs:

unitig_0|quiver 3838171
unitig_1|quiver 828945
unitig_2|quiver 33788

Why was that? What do these parameters actually do?
10-12-2015, 08:47 AM   #15
rhall
Senior Member
Location: San Francisco
Join Date: Aug 2012
Posts: 322

The alignment and Quiver correction will use the post-filter reads. P_Filter only happens once, and all downstream analyses use its output.
targetChunks will have a small effect on the results, but not in any significant way. It relates to splitting up the first alignment of all reads to the seed reads, in order to optimize the computation. The defaults should be reasonable for bacterial assemblies; if you increase the size of the genome you are assembling, the number of targetChunks will have to be increased. splitBestn is the number of alignments allowed for each read within an alignment chunk; totalBestn is the total allowed across all chunks.
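
To make the chunk interaction concrete, here is a conceptual sketch (not the actual blasr/HGAP logic) of per-chunk versus global best-n selection. When many of a read's best alignments fall into one chunk, a small splitBestn discards some of them, which is consistent with the yield change seen above:

def chunked_bestn(scores_by_chunk, split_bestn, total_bestn):
    """Keep at most split_bestn alignment scores per chunk,
    then the best total_bestn overall."""
    kept = []
    for chunk in scores_by_chunk:
        kept += sorted(chunk, reverse=True)[:split_bestn]
    return sorted(kept, reverse=True)[:total_bestn]

# One read's alignment scores; most of its strong hits land in chunk 0.
chunks = [list(range(80, 100)),  # 20 strong alignments
          list(range(0, 20)), list(range(20, 40)), list(range(40, 60))]

print(chunked_bestn(chunks, split_bestn=10, total_bestn=24))  # keeps only 10 strong hits
print(chunked_bestn([sum(chunks, [])], 24, 24))               # targetChunks=1 keeps all 20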