Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HGAP Parameters in SMRT Analysis

    Hello,

    We just got SMRT Analysis installed in our lab and are trying to use HGAP for the first time. We have uploaded a SMRT cell into the portal and the HGAP protocol successfully runs, but the polished assembly contains only one single contig of 1200 bp of a 4.6 Mb genome.

    I decided to tweak the parameters a bit and the next time I did the HGAP protocol, I returned a few more contigs. My question is this:

    How do you determine the parameters of HGAP? I've sifted through the XML file provided by our sequencing facility, but it doesn't seem to have much information.

    Any help you can give would be greatly appreciated!

    -Alec

  • #2
    I'm not sure how to get this within the SMRT Portal environment, but there is a utility called length_Info.py which can be applied to your reads to get an idea of a good length cutoff. This is computed using Lander-Waterman statistics (which favor high coverage) and the ratio of data below the cutoff to above (which favors longer reads; 3X is the minimum recommended) -- if you have sufficient data there should be a range of values where both are in a desirable range.

    I have a somewhat expanded version of the code, but ported to Perl, which I could post if there is interest; a few bells-and-whistles such as estimating how much more data your would need to meet the goals at different cutoffs.

    Comment


    • #3
      PacBio has a video tutorial available for HGAP use via SMRTportal: http://www.pacificbiosciences.com/Tu...GAP/story.html

      Make sure you check the "allow partial alignments" box in the HGAP settings when you set up your run that improves assemblies. There are not a lot of settings you can change for HGAP in SMRTportal.

      Comment


      • #4
        The main parameter effecting HGAP results, assuming you have 'calculate seed automatically' switched on, is the estimate of genome size. The most common issue is that you don't have enough coverage for the auto seed calculation (30x), so it falls back to 6000, this can easily be checked by looking at the PreAssembly report for the seed length used.

        Comment


        • #5
          Hi krobison,
          I'm trying smrtpipe 2.0.1 right now and don't find the " length_Info.py ". Is there any other utility to calculate seed?

          Thanks in advance!

          Comment


          • #6
            The program reads the all_norm.fa file which contains all the reads without length filtering (at least if you are using HBAR_WF.py) -- it defaults to that name but otherwise takes an argument.

            Default genome size is 10Mb, but you can override this with the -g=size argument, where size is in bp or can be in kbp, Mbp or Gbp if you use k, m or g as a suffix on the size (e.g. -g=8m would give 8 megabases as the genome size)

            You want rows in which the first column reads 'rns' -- this means all three criteria are met (r=ratio; n & s have to do with the Lander-Waterman statistics it is calculating). For rows satisfying r but not n and/or s, the last column is an estimate of how much more data would be needed -- e.g. a 1.4 there would mean 1.4X more data is expected to satisfy all the criteria.

            Enjoy!

            Code:
            #!/usr/bin/perl
            use strict;
            use Bio::SeqIO;
            use Getopt::Long;
            
            ### generate estimates for HGAP cutoff
            
            my $firstCut=1000;
            my $genomeSize=10*1000*1000;
            
            &GetOptions('g=s'=>\$genomeSize,'f=i'=>\$firstCut);
            if ($genomeSize=~/[a-z]\.[a-z]/)
            {
                my $sum=0;
                my $rdr=new Bio::SeqIO(-file=>$genomeSize);
                while (my $rec=$rdr->next_seq)
                {
                    $sum+=$rec->length;
                }
                $genomeSize=$sum;
            }
            if ($genomeSize=~(/([0-9\.]+)([kmg])/i))
            {
                my ($num,$scale)=($1,$2);
                if ($scale=~/k/i)
                {
                    $genomeSize=$num*1000;
                    print STDERR "GenomeSize given as $num kilobases\n";
                }
                elsif ($scale=~/m/i)
                {
                    print STDERR "GenomeSize given as $num megabases\n";
                    $genomeSize=$num*1000*1000;
                }
                elsif ($scale=~/g/i)
                {
                    print STDERR "GenomeSize given as $num gigabases\n";
                    $genomeSize=$num*1000*1000*1000;
                }
            }
            else
            {
                print STDERR "GenomeSize given as $genomeSize bases\n";
            }
            if ($genomeSize<50000)
            {
                die "\tYou can't be serious about a genome size of $genomeSize!\n";
            }
            my @files=@ARGV;
            @files=("all_norm.fa") if (scalar(@files)==0);
            
            my @lengths=();
            my $total=0;
            foreach my $file(@files)
            {
                my $rdr=new Bio::SeqIO(-file=>$file);
                while (my $rec=$rdr->next_seq)
                {
                    $total+=$rec->length;
                    push(@lengths,$rec->length);
                }
            }
            @lengths=sort {$a<=>$b} @lengths;
            print join("\t","flags","cut","TotalBases","ratio","cover","nContig","conLen","gsize"),"\n";
            my @cuts=();
            for (my $cut=$firstCut; $cut<20000; $cut+=100)
            {
                push(@cuts,$cut);
            }
            
            while (scalar(@cuts)>0)
            {
                my $cut=shift(@cuts);
                my $psum=0;
                foreach my $length(grep($_>$cut,@lengths))
                {
                    $psum+=$length;
                }
                if ($psum==0)
                {
                    last;
                }
                my ($flagSet,$coverage,$contigCount,$contigLength,$ratio)=&computeStats($total,$psum,$genomeSize,$cut);
                my $boost=undef;
                if ($flagSet=~/r/)
                {
                    unless ($flagSet=~/rns/)
                    {
                        for (my $mult=1.1; $mult<10; $mult+=0.1)
                        {
                            my ($flagSetB)=&computeStats(int($total*$mult),int($psum*$mult),$genomeSize,$cut);
                            if ($flagSetB eq 'rns')
                            {
                                $boost=$mult;
                                last;
                            }
                        }
                    }
                }
            
                print join("\t",$flagSet,$cut,$psum,sprintf('%.2f',1.0*$ratio),
                           $coverage,
                           sprintf('%.2f',$contigCount),
                           sprintf('%.3f',$contigLength),
                           sprintf('%.2f',$contigLength/$genomeSize),
                           $genomeSize,$boost),"\n";
                unless ($flagSet=~/[a-z]/ || $ratio<=4)
                {
                    last;
                }
            }
            
            sub computeStats
            {
                my ($total,$psum,$genomeSize,$cut)=@_;
                my $coverage =0.5 * $psum / $genomeSize;
                my $contigCount=$coverage * $genomeSize / $cut * exp(-1*$coverage);
                my $contigLength=(exp($coverage)-1)*$cut/$coverage;
                my @flags=(' ',' ',' ');
                my $ratio=$total/$psum;
                $flags[0]='r' if ($ratio>=3);
                $flags[1]='n' if ($contigCount<=0.25);
                $flags[2]='s' if ($contigLength/$genomeSize>=0.25);
                my $flagSet=join('',@flags);
                return ($flagSet,$coverage,$contigCount,$contigLength,$ratio);
            }

            Comment


            • #7
              I'm still getting severely reduced outputs by HGAP. I'm getting 9 polished contigs with 42000 bases for a 4.9 Mb genomes. Not really sure why this is happening.

              Comment


              • #8
                Can you post the results of PreAssembly? and your Post filter base count?

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                67 views
                0 likes
                Last Post seqadmin  
                Working...
                X