Seqanswers Leaderboard Ad

**metheuse** · 08-29-2013, 05:37 PM

hmm, I just saw this too, quiver only calls haploid variants. My data is human. The reason to use pacbio ccs instead of illumina is that we want to find some really low freq small indels. We think CCS should be much more accurate.

So can I use Plurality? Or do I have to go back to GATK? It seems that the gatk module in smrtanalysis pipeline cannot find indels.

So you suggest using compareSequences.py instead of blasr?

Once I get alignment from blasr as sam files, could I just leave smrtanalysis modules and use GATK or samtools to call variants? But still I don't know how different it will be compared to illumina reads.

**rhall** · 08-30-2013, 07:56 AM

How low frequency do you expect the variant? lower than 1/100 reads will be very difficult.
The last time I checked (~1 year ago) GATK will not find indels with PacBio data, the local realignment fails. Also I'm not sure GATK will find rare variants, I know it calls diploid, but I have never seen it used for rare variants, with PacBio data.

How long is your amplicon, and how many indels do you expect?
One option if the indel is more than a few bases, is to take high pass CCS data then look at the read length distribution.
Or:
If you know the expected variant, but want to calculate frequency, take high pass CCS data and map to multiple reference that account for the indel and measure coverage depth.
Or:
If the indel is novel and only a few bases, map high pass CCS reads to a reference, using pbalign (the replacement for compareSequences.py) mapping the CCS, not using the CCS to place the subread. Then simply count bases at each position in the alignment, applying a statistical test to the significance of variation. This last part may require writing some code, I'm not sure of any off the shelf solution, but there is possibly something that works with SAM files, maybe from the viral space.

**metheuse** · 08-30-2013, 07:59 AM

This is strange.

I used bash5tools.py on bas.h5 file and extracted the CCS reads that have at least 3 passes. I compared the output with the p0.ccs.fasta file. The numbers are different. There are fewer reads in bash5tools.py output.

And if I set --readType to "Raw" for bash5tools.py, the output has more reads than p0.fasta file. The p0.fasta file should contain all the raw reads, right?

**metheuse** · 08-30-2013, 08:09 AM

Originally posted by rhall View Post

How low frequency do you expect the variant? lower than 1/100 reads will be very difficult.
The last time I checked (~1 year ago) GATK will not find indels with PacBio data, the local realignment fails. Also I'm not sure GATK will find rare variants, I know it calls diploid, but I have never seen it used for rare variants, with PacBio data.

How long is your amplicon, and how many indels do you expect?
One option if the indel is more than a few bases, is to take high pass CCS data then look at the read length distribution.
Or:
If you know the expected variant, but want to calculate frequency, take high pass CCS data and map to multiple reference that account for the indel and measure coverage depth.
Or:
If the indel is novel and only a few bases, map high pass CCS reads to a reference, using pbalign (the replacement for compareSequences.py) mapping the CCS, not using the CCS to place the subread. Then simply count bases at each position in the alignment, applying a statistical test to the significance of variation. This last part may require writing some code, I'm not sure of any off the shelf solution, but there is possibly something that works with SAM files, maybe from the viral space.

Thanks! I think I'll simply try GATK first. If it doesn't work, I'll try the solution you suggested.

One question, for variant detection purpose, when I do alignment, should I align only ccs, or subreads? It seems that you suggested aligning only CCS.

**rhall** · 08-30-2013, 09:35 AM

For rare variants it is easy to use CCS reads, the statistics are simpler, as you are essentially reducing the amount of noise in the alignment.

For haploid variants I would use quiver and raw reads, the rich quality information in the raw reads together with the quiver algorithm produces great results, with the advantage that you can use all the data.

The subreads.fasta file and the ccs.fasta files output by primary will have different numbers of reads than bash5tools output as they are subject to a read quality filter, which is also a parameter in bash5tools. I'm not sure off the top of my head what the filter is for the primary produced files.

**metheuse** · 08-30-2013, 09:57 AM

Originally posted by rhall View Post

For rare variants it is easy to use CCS reads, the statistics are simpler, as you are essentially reducing the amount of noise in the alignment.

For haploid variants I would use quiver and raw reads, the rich quality information in the raw reads together with the quiver algorithm produces great results, with the advantage that you can use all the data.

The subreads.fasta file and the ccs.fasta files output by primary will have different numbers of reads than bash5tools output as they are subject to a read quality filter, which is also a parameter in bash5tools. I'm not sure off the top of my head what the filter is for the primary produced files.

Thanks!

Since my sample is human, I'll use ccs alignment only.

If there is another filter for ccs.fasta, then the number of reads in it should be smaller than the output of bash5tools.py, right? But the fact is the opposite.

Here are the parameters of bash5tools.py:

Code:

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            turn on progress monitoring to stdout [False]
  -d, --debug           turn on progress monitoring to stdout and keep temp files [False]
  -v, --version         show program's version number and exit
  --outFilePref OUTFILEPREF
                        output filename prefix []
  --readType READTYPE   read type (CCS or Raw) [Raw]
  --outType OUTTYPE     output file type (fasta, fastq, csv or detail) [fasta]
  --subReads            for fasta/fastq files, generate the subreads and not the raw reads [False]

read filtering arguments:
  --minLength MINLEN    min read length [0]
  --minReadscore MINRS  min read score, valid only when used with --readType=Raw [0]
  --minMeanQV MINMQV    min per read mean Quality Value [0]
  --minNumberOfPasses MINNP
                        min number of CCS passes, valid only when used with --readType=CCS [0]

I want to filter based on number of passes, so I should use "CCS" as the readType, right? I tried "raw", but it outputs even more reads than the long read fasta file.

My command is:

Code:

python bash5tools.py mydata.bas.h5 --readType CCS --outType fasta --minNumberOfPasses 6 --outFilePref filter6passesCCS_s1

Is there any recommended setting for the other thresholds?

Another question: how can I combine two bas.h5 files (s1 and s2 for each well)? I found how to merge cmp.h5, not bas.h5.
Or could I filter them separately by bash5tools.py, and then put their names in an fofn file, and use this fofn file as input for alignment, and similarly, use an fofn rgn.h5 file for --regionTable?

I hope I'm not bugging you too much, Richard.

**rhall** · 08-30-2013, 10:23 AM

No problem,
The subtleties of the filtering are complicated, it's something I'd have to look into, but the ccs.fasta file from primary is actually being removed in the next software release, I wouldn't use it.

I would also set --minMeanQV 25 (but make sure you don't loose too many reads), as a high number of passes could in theory produce a read with low QVs in unusual situations.

For multiple bas.h5 files, it should always be possible to give the program an input.fofn file for multiple bas.h5 files. It is not possible to merge bas.h5 files.

**metheuse** · 08-30-2013, 11:34 AM

Originally posted by rhall View Post

No problem,
The subtleties of the filtering are complicated, it's something I'd have to look into, but the ccs.fasta file from primary is actually being removed in the next software release, I wouldn't use it.

I would also set --minMeanQV 25 (but make sure you don't loose too many reads), as a high number of passes could in theory produce a read with low QVs in unusual situations.

For multiple bas.h5 files, it should always be possible to give the program an input.fofn file for multiple bas.h5 files. It is not possible to merge bas.h5 files.

Thanks!

By the way, the small indels we are aiming to find can be at a frequency as low as 0.0001.

**rhall** · 08-30-2013, 12:37 PM

At a frequency of 0.0001, that would be ~4-5 single reads showing the mutation in a single RS II run, assuming all reads produce CCS data.
If you need any more assistance let me know, we can also take this conversation over to email if you like.
Good Luck.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 23 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 21 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News