Unconfigured Ad

**krobison** · 10-16-2013, 09:10 AM

I'm not sure how to get this within the SMRT Portal environment, but there is a utility called length_Info.py which can be applied to your reads to get an idea of a good length cutoff. This is computed using Lander-Waterman statistics (which favor high coverage) and the ratio of data below the cutoff to above (which favors longer reads; 3X is the minimum recommended) -- if you have sufficient data there should be a range of values where both are in a desirable range.

I have a somewhat expanded version of the code, but ported to Perl, which I could post if there is interest; a few bells-and-whistles such as estimating how much more data your would need to meet the goals at different cutoffs.

**GenoMax** · 10-16-2013, 09:53 AM

PacBio has a video tutorial available for HGAP use via SMRTportal: http://www.pacificbiosciences.com/Tu...GAP/story.html

Make sure you check the "allow partial alignments" box in the HGAP settings when you set up your run that improves assemblies. There are not a lot of settings you can change for HGAP in SMRTportal.

**rhall** · 10-17-2013, 01:35 PM

The main parameter effecting HGAP results, assuming you have 'calculate seed automatically' switched on, is the estimate of genome size. The most common issue is that you don't have enough coverage for the auto seed calculation (30x), so it falls back to 6000, this can easily be checked by looking at the PreAssembly report for the seed length used.

**zhoufan** · 10-21-2013, 12:39 AM

Hi krobison,
I'm trying smrtpipe 2.0.1 right now and don't find the " length_Info.py ". Is there any other utility to calculate seed?

Thanks in advance!

**krobison** · 10-22-2013, 07:54 AM

The program reads the all_norm.fa file which contains all the reads without length filtering (at least if you are using HBAR_WF.py) -- it defaults to that name but otherwise takes an argument.

Default genome size is 10Mb, but you can override this with the -g=size argument, where size is in bp or can be in kbp, Mbp or Gbp if you use k, m or g as a suffix on the size (e.g. -g=8m would give 8 megabases as the genome size)

You want rows in which the first column reads 'rns' -- this means all three criteria are met (r=ratio; n & s have to do with the Lander-Waterman statistics it is calculating). For rows satisfying r but not n and/or s, the last column is an estimate of how much more data would be needed -- e.g. a 1.4 there would mean 1.4X more data is expected to satisfy all the criteria.

Enjoy!

Code:

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Getopt::Long;

### generate estimates for HGAP cutoff

my $firstCut=1000;
my $genomeSize=10*1000*1000;

&GetOptions('g=s'=>\$genomeSize,'f=i'=>\$firstCut);
if ($genomeSize=~/[a-z]\.[a-z]/)
{
    my $sum=0;
    my $rdr=new Bio::SeqIO(-file=>$genomeSize);
    while (my $rec=$rdr->next_seq)
    {
        $sum+=$rec->length;
    }
    $genomeSize=$sum;
}
if ($genomeSize=~(/([0-9\.]+)([kmg])/i))
{
    my ($num,$scale)=($1,$2);
    if ($scale=~/k/i)
    {
        $genomeSize=$num*1000;
        print STDERR "GenomeSize given as $num kilobases\n";
    }
    elsif ($scale=~/m/i)
    {
        print STDERR "GenomeSize given as $num megabases\n";
        $genomeSize=$num*1000*1000;
    }
    elsif ($scale=~/g/i)
    {
        print STDERR "GenomeSize given as $num gigabases\n";
        $genomeSize=$num*1000*1000*1000;
    }
}
else
{
    print STDERR "GenomeSize given as $genomeSize bases\n";
}
if ($genomeSize<50000)
{
    die "\tYou can't be serious about a genome size of $genomeSize!\n";
}
my @files=@ARGV;
@files=("all_norm.fa") if (scalar(@files)==0);

my @lengths=();
my $total=0;
foreach my $file(@files)
{
    my $rdr=new Bio::SeqIO(-file=>$file);
    while (my $rec=$rdr->next_seq)
    {
        $total+=$rec->length;
        push(@lengths,$rec->length);
    }
}
@lengths=sort {$a<=>$b} @lengths;
print join("\t","flags","cut","TotalBases","ratio","cover","nContig","conLen","gsize"),"\n";
my @cuts=();
for (my $cut=$firstCut; $cut<20000; $cut+=100)
{
    push(@cuts,$cut);
}

while (scalar(@cuts)>0)
{
    my $cut=shift(@cuts);
    my $psum=0;
    foreach my $length(grep($_>$cut,@lengths))
    {
        $psum+=$length;
    }
    if ($psum==0)
    {
        last;
    }
    my ($flagSet,$coverage,$contigCount,$contigLength,$ratio)=&computeStats($total,$psum,$genomeSize,$cut);
    my $boost=undef;
    if ($flagSet=~/r/)
    {
        unless ($flagSet=~/rns/)
        {
            for (my $mult=1.1; $mult<10; $mult+=0.1)
            {
                my ($flagSetB)=&computeStats(int($total*$mult),int($psum*$mult),$genomeSize,$cut);
                if ($flagSetB eq 'rns')
                {
                    $boost=$mult;
                    last;
                }
            }
        }
    }

    print join("\t",$flagSet,$cut,$psum,sprintf('%.2f',1.0*$ratio),
               $coverage,
               sprintf('%.2f',$contigCount),
               sprintf('%.3f',$contigLength),
               sprintf('%.2f',$contigLength/$genomeSize),
               $genomeSize,$boost),"\n";
    unless ($flagSet=~/[a-z]/ || $ratio<=4)
    {
        last;
    }
}

sub computeStats
{
    my ($total,$psum,$genomeSize,$cut)=@_;
    my $coverage =0.5 * $psum / $genomeSize;
    my $contigCount=$coverage * $genomeSize / $cut * exp(-1*$coverage);
    my $contigLength=(exp($coverage)-1)*$cut/$coverage;
    my @flags=(' ',' ',' ');
    my $ratio=$total/$psum;
    $flags[0]='r' if ($ratio>=3);
    $flags[1]='n' if ($contigCount<=0.25);
    $flags[2]='s' if ($contigLength/$genomeSize>=0.25);
    my $flagSet=join('',@flags);
    return ($flagSet,$coverage,$contigCount,$contigLength,$ratio);
}

**bonifera** · 10-24-2013, 01:09 PM

I'm still getting severely reduced outputs by HGAP. I'm getting 9 polished contigs with 42000 bases for a 4.9 Mb genomes. Not really sure why this is happening.

**rhall** · 10-24-2013, 01:10 PM

Can you post the results of PreAssembly? and your Post filter base count?

Topics	Statistics	Last Post
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 17 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 27 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 38 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 61 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM

Unconfigured Ad

HGAP Parameters in SMRT Analysis

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News