  • HGAP error caused by blasr memory limit

    Hi,
    I am using HGAP on PacBio data from a large genome.
    I do not have a cluster; our server has 64 cores and 500 GB of memory.
    In the SMRT pipeline, blasr fails with:
    ERROR! Reading fasta files greater than 4Gbytes is not supported.
    How can I change the parameters to avoid this error?

    Cheers,

  • #2
    What version of SMRT Analysis are you using? Early versions of HGAP were limited in genome size (~100 Mb) because of this blasr limitation. SMRT Analysis 2.2 and HGAP.3 do not have this problem, because the data can be split across multiple blasr alignments (parameter: number of seed read chunks).
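To illustrate the chunking idea mentioned above, here is a minimal Python sketch that distributes FASTA records over a fixed number of chunks so that each chunk holds roughly the same number of bases. This is not PacBio's implementation; the function names `read_fasta` and `split_into_chunks` and the greedy balancing strategy are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    seq_parts: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        else:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)


def split_into_chunks(records: Iterable[Record], n_chunks: int) -> List[List[Record]]:
    """Greedily assign each record to the chunk holding the fewest bases so far.

    Hypothetical helper: it only illustrates splitting reads into chunks;
    the real pipeline may split differently.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks
    for header, seq in records:
        smallest = sizes.index(min(sizes))  # first chunk with the fewest bases
        chunks[smallest].append((header, seq))
        sizes[smallest] += len(seq)
    return chunks
```

With a real file, `read_fasta` can be fed an open file handle, and each returned chunk can then be written to its own FASTA file.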


    • #3
      I am using SMRT Analysis 2.2 and HGAP.3, and I set the "targetChunks" parameter to 6, but I still receive the same error.
      Here are the parameters I set for the "P_PreAssemblerDagcon" module:

      <module id="P_PreAssemblerDagcon">
      <param name="computeLengthCutoff"><value>true</value></param>
      <param name="minLongReadLength"><value>9000</value></param>
      <param name="targetChunks"><value>6</value></param>
      <param name="splitBestn"><value>11</value></param>
      <param name="totalBestn"><value>24</value></param>
      <param name="blasrOpts"><value> -noSplitSubreads -minReadLength 200 -maxScore -1000 -maxLCPLength 16 -minMatch 14</value></param>
      </module>


      This is the error I see in smrtpipe.log:

      [ERROR] 2014-04-24 10:25:19,992 [smrtpipe.status refreshTargets 413] *** Failed task task://Anonymous/P_PreAssemblerDagcon/hgapAlignForCorrection

      I checked the "P_PreAssemblerDagcon/hgapAlignForCorrection.log" file, and it seems that the input file is not split for blasr:

      Successfully found /PacBio/4th_run/data/filtered_subreads.fasta
      Successfully found /PacBio/4th_run/filtered_longreads.fasta
      Successfully validated input files
      [INFO] 2014-04-24T10:25:15 [blasr] started.
      ERROR! Reading fasta files greater than 4Gbytes is not supported.


      FYI, "filtered_subreads.fasta" file is 14G and "filtered_longreads.fasta" is 11G.

      Any idea?
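As a rough check on the numbers above, the following Python sketch estimates how many chunks a file would need for every chunk to stay within the limit. It assumes that "4Gbytes" in the error message means 4 GiB (2^32 bytes) and that the "14G" and "11G" sizes are GiB; neither is confirmed by the log. The name `min_chunks` is hypothetical.

```python
FOUR_GIB = 4 * 1024**3  # assumed meaning of "4Gbytes" in the blasr error


def min_chunks(file_size_bytes: int, limit_bytes: int = FOUR_GIB) -> int:
    """Return the smallest chunk count that keeps every chunk at or below the limit.

    Assumes the file can be divided evenly by size, which is a simplification:
    real FASTA records cannot be cut at arbitrary byte offsets.
    """
    if file_size_bytes < 0 or limit_bytes <= 0:
        raise ValueError("sizes must be non-negative and the limit positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, (file_size_bytes + limit_bytes - 1) // limit_bytes)


subreads_chunks = min_chunks(14 * 1024**3)   # filtered_subreads.fasta, 14G
longreads_chunks = min_chunks(11 * 1024**3)  # filtered_longreads.fasta, 11G
```

Under these assumptions, 6 chunks would be enough for either file if the split were actually taking place, so the unchanged error suggests that the files are being passed to blasr whole.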


      • #4
        I haven't used blasr, but this error seems to be about reading the reference (rather than reading your reads).

        I did some googling and found this:


        Something that you may want to know.
        First, the maximum reference genome size that blasr supports is 4G.
        Second, blasr is designed to align PacBio reads to a genome, not genome to genome, and there is a limit on read length (e.g., <100K). If a read is too long, blasr may consume all the memory and cause problems.
        Hope this helps.


        • #5
          The splitting is not being carried out, I guess because you are running in non-distributed mode. HGAP for a genome of this size has to be run with the '--distribute' parameter to smrtpipe.py. This is not really intuitive when executing on a single server with a high number of cores, but it is the only way to get the workflow engine to split the assembly into manageable chunks. SMRT pipe command:
          Code:
          smrtpipe.py --distribute -D CLUSTER_MANAGER=BASH -D MAX_THREADS=6 -D NPROC=10 ...
          Last edited by rhall; 04-24-2014, 07:23 AM. Reason: Correction


          • #6
            tabotaab,

            So, did you figure this out?
            I have exactly the same problem.
