Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bonifera
    Member
    • May 2013
    • 10

    HGAP Parameters in SMRT Analysis

    Hello,

    We just got SMRT Analysis installed in our lab and are trying to use HGAP for the first time. We have uploaded a SMRT cell into the portal and the HGAP protocol successfully runs, but the polished assembly contains only one single contig of 1200 bp of a 4.6 Mb genome.

    I decided to tweak the parameters a bit and the next time I did the HGAP protocol, I returned a few more contigs. My question is this:

    How do you determine the parameters of HGAP? I've sifted through the XML file provided by our sequencing facility, but it doesn't seem to have much information.

    Any help you can give would be greatly appreciated!

    -Alec
  • krobison
    Senior Member
    • Nov 2007
    • 734

    #2
    I'm not sure how to get this within the SMRT Portal environment, but there is a utility called length_Info.py which can be applied to your reads to get an idea of a good length cutoff. This is computed using Lander-Waterman statistics (which favor high coverage) and the ratio of data below the cutoff to above (which favors longer reads; 3X is the minimum recommended) -- if you have sufficient data there should be a range of values where both are in a desirable range.

    I have a somewhat expanded version of the code, but ported to Perl, which I could post if there is interest; a few bells-and-whistles such as estimating how much more data your would need to meet the goals at different cutoffs.

    Comment

    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      PacBio has a video tutorial available for HGAP use via SMRTportal: http://www.pacificbiosciences.com/Tu...GAP/story.html

      Make sure you check the "allow partial alignments" box in the HGAP settings when you set up your run that improves assemblies. There are not a lot of settings you can change for HGAP in SMRTportal.

      Comment

      • rhall
        Senior Member
        • Aug 2012
        • 324

        #4
        The main parameter effecting HGAP results, assuming you have 'calculate seed automatically' switched on, is the estimate of genome size. The most common issue is that you don't have enough coverage for the auto seed calculation (30x), so it falls back to 6000, this can easily be checked by looking at the PreAssembly report for the seed length used.

        Comment

        • zhoufan
          Member
          • Feb 2009
          • 14

          #5
          Hi krobison,
          I'm trying smrtpipe 2.0.1 right now and don't find the " length_Info.py ". Is there any other utility to calculate seed?

          Thanks in advance!

          Comment

          • krobison
            Senior Member
            • Nov 2007
            • 734

            #6
            The program reads the all_norm.fa file which contains all the reads without length filtering (at least if you are using HBAR_WF.py) -- it defaults to that name but otherwise takes an argument.

            Default genome size is 10Mb, but you can override this with the -g=size argument, where size is in bp or can be in kbp, Mbp or Gbp if you use k, m or g as a suffix on the size (e.g. -g=8m would give 8 megabases as the genome size)

            You want rows in which the first column reads 'rns' -- this means all three criteria are met (r=ratio; n & s have to do with the Lander-Waterman statistics it is calculating). For rows satisfying r but not n and/or s, the last column is an estimate of how much more data would be needed -- e.g. a 1.4 there would mean 1.4X more data is expected to satisfy all the criteria.

            Enjoy!

            Code:
            #!/usr/bin/perl
            use strict;
            use Bio::SeqIO;
            use Getopt::Long;
            
            ### generate estimates for HGAP cutoff
            
            my $firstCut=1000;
            my $genomeSize=10*1000*1000;
            
            &GetOptions('g=s'=>\$genomeSize,'f=i'=>\$firstCut);
            if ($genomeSize=~/[a-z]\.[a-z]/)
            {
                my $sum=0;
                my $rdr=new Bio::SeqIO(-file=>$genomeSize);
                while (my $rec=$rdr->next_seq)
                {
                    $sum+=$rec->length;
                }
                $genomeSize=$sum;
            }
            if ($genomeSize=~(/([0-9\.]+)([kmg])/i))
            {
                my ($num,$scale)=($1,$2);
                if ($scale=~/k/i)
                {
                    $genomeSize=$num*1000;
                    print STDERR "GenomeSize given as $num kilobases\n";
                }
                elsif ($scale=~/m/i)
                {
                    print STDERR "GenomeSize given as $num megabases\n";
                    $genomeSize=$num*1000*1000;
                }
                elsif ($scale=~/g/i)
                {
                    print STDERR "GenomeSize given as $num gigabases\n";
                    $genomeSize=$num*1000*1000*1000;
                }
            }
            else
            {
                print STDERR "GenomeSize given as $genomeSize bases\n";
            }
            if ($genomeSize<50000)
            {
                die "\tYou can't be serious about a genome size of $genomeSize!\n";
            }
            my @files=@ARGV;
            @files=("all_norm.fa") if (scalar(@files)==0);
            
            my @lengths=();
            my $total=0;
            foreach my $file(@files)
            {
                my $rdr=new Bio::SeqIO(-file=>$file);
                while (my $rec=$rdr->next_seq)
                {
                    $total+=$rec->length;
                    push(@lengths,$rec->length);
                }
            }
            @lengths=sort {$a<=>$b} @lengths;
            print join("\t","flags","cut","TotalBases","ratio","cover","nContig","conLen","gsize"),"\n";
            my @cuts=();
            for (my $cut=$firstCut; $cut<20000; $cut+=100)
            {
                push(@cuts,$cut);
            }
            
            while (scalar(@cuts)>0)
            {
                my $cut=shift(@cuts);
                my $psum=0;
                foreach my $length(grep($_>$cut,@lengths))
                {
                    $psum+=$length;
                }
                if ($psum==0)
                {
                    last;
                }
                my ($flagSet,$coverage,$contigCount,$contigLength,$ratio)=&computeStats($total,$psum,$genomeSize,$cut);
                my $boost=undef;
                if ($flagSet=~/r/)
                {
                    unless ($flagSet=~/rns/)
                    {
                        for (my $mult=1.1; $mult<10; $mult+=0.1)
                        {
                            my ($flagSetB)=&computeStats(int($total*$mult),int($psum*$mult),$genomeSize,$cut);
                            if ($flagSetB eq 'rns')
                            {
                                $boost=$mult;
                                last;
                            }
                        }
                    }
                }
            
                print join("\t",$flagSet,$cut,$psum,sprintf('%.2f',1.0*$ratio),
                           $coverage,
                           sprintf('%.2f',$contigCount),
                           sprintf('%.3f',$contigLength),
                           sprintf('%.2f',$contigLength/$genomeSize),
                           $genomeSize,$boost),"\n";
                unless ($flagSet=~/[a-z]/ || $ratio<=4)
                {
                    last;
                }
            }
            
            sub computeStats
            {
                my ($total,$psum,$genomeSize,$cut)=@_;
                my $coverage =0.5 * $psum / $genomeSize;
                my $contigCount=$coverage * $genomeSize / $cut * exp(-1*$coverage);
                my $contigLength=(exp($coverage)-1)*$cut/$coverage;
                my @flags=(' ',' ',' ');
                my $ratio=$total/$psum;
                $flags[0]='r' if ($ratio>=3);
                $flags[1]='n' if ($contigCount<=0.25);
                $flags[2]='s' if ($contigLength/$genomeSize>=0.25);
                my $flagSet=join('',@flags);
                return ($flagSet,$coverage,$contigCount,$contigLength,$ratio);
            }

            Comment

            • bonifera
              Member
              • May 2013
              • 10

              #7
              I'm still getting severely reduced outputs by HGAP. I'm getting 9 polished contigs with 42000 bases for a 4.9 Mb genomes. Not really sure why this is happening.

              Comment

              • rhall
                Senior Member
                • Aug 2012
                • 324

                #8
                Can you post the results of PreAssembly? and your Post filter base count?

                Comment

                Latest Articles

                Collapse

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, 06-09-2026, 11:58 AM
                0 responses
                17 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-05-2026, 10:09 AM
                0 responses
                27 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-04-2026, 08:59 AM
                0 responses
                38 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 12:03 PM
                0 responses
                61 views
                0 reactions
                Last Post SEQadmin2  
                Working...