  • Trouble indexing reference db for BFAST

    I'm trying out the BFAST aligner for the first time today, and I'm having trouble getting the 'bfast index' step to run. I was able to successfully run 'bfast fasta2brg' on my fasta reference db, but each time I've tried running 'bfast index' on that file, I end up with 0-size .bif files.

    Here are some details about what I am doing:

    BFAST version: 0.6.3c
    Linux OS, 64bit cpu, 8+ Gb memory

    Reference db: Ensembl build 37 human genome (+ some novel regions from 2 of the BGI human genomes + some other stuff). Size ~4.5Gb. This fasta file is a little 'lopsided' in that 24 of the sequences (chr 1-22, X, Y) are VERY large, and then there are thousands of other, much smaller fasta sequences for the other components. Those 24 chromosomes make up ~4.4Gb of the total sequence in the file.


    What I'm trying to run is:

    bfast index -f <my fasta> -m 1111111111111111111111 -w 14 -i 1 -d 1

    *note: The 'fasta.brg' file is sitting right next to the fasta I enter in the command line


    When I run that, the program starts with no problems, but it doesn't seem to finish. Here is a snippet of the output I see on my screen:

    ************************************************************
    Checking input parameters supplied by the user ...
    Validating fastaFileName Human_build37_expanded_screening_db.100323.fna.
    Validating tmpDir path ./.
    Input arguments look good!
    ************************************************************
    ************************************************************
    Printing Program Parameters:
    programMode: [ExecuteProgram]
    fastaFileName: Human_build37_expanded_screening_db.100323.fna
    space: [NT Space]
    mask: 1111111111111111111111
    depth: 1
    hashWidth: 14
    indexNumber: 1
    repeatMasker: [Not Using]
    startContig: 0
    startPos: 0
    endContig: 2147483647
    endPos: 2147483647
    exonsFileName: [Not Using]
    numThreads: 1
    tmpDir: ./
    timing: [Not Using]
    ************************************************************
    ************************************************************
    Reading in reference genome from Human_build37_expanded_screening_db.100323.fna.nt.brg.
    In total read 21291 contigs for a total of 4691431462 bases
    ************************************************************
    Creating the index...
    ************************************************************
    Warning: startContig was less than zero.
    Defaulting to contig=1 and position=1.
    ************************************************************
    ************************************************************
    Warning: endContig was greater than the number of contigs in the reference genome.
    Defaulting to reference genome's end contig=21291 and position=1112.
    ************************************************************
    [---21291,-------1112]
    ************************************************************
    Creating index (bin 1/4)
    Sorting...
    0.263 percent complete



    At that point the program seems to halt. I notice a few unsettling warning messages in there, but not being familiar with BFAST, I didn't know if they were anything I should be worried about.

    Any advice would be appreciated.

  • #2
    You won't be able to make an index from a reference genome larger than 2^32 (4294967296) bases in length. This is true of all the aligners I can think of, or at least the most popular ones. hg18 does not have this problem, and if you search around this site, Heng Li (lh3) gave a link to the 1000 Genomes website, which has an hg19 reference that is < 2^32 bases in length.

    Also, for the human genome, you need 24GB of memory to create the indexes. If you have only 8GB, then use "-d 1". How much ram is "8+"?
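As a rough illustration of the size limit described above, the shell sketch below counts the bases in a FASTA file and compares the total with 2^32 = 4294967296. The function names count_bases and fits_index_limit are hypothetical helpers, not part of BFAST, and the check only mirrors the limit as quoted in this reply. With the totals reported in this thread, 4691431462 bases exceeds the limit by 396464166, while 3095693983 is under it.

```shell
# Hypothetical helpers (not part of BFAST): estimate whether a reference
# is small enough for the 2^32-base limit mentioned in this reply.

# Print the total number of sequence characters in a FASTA file.
# Header lines (starting with '>') are skipped; spaces, tabs and
# carriage returns inside sequence lines are not counted.
count_bases() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) }
         END   { printf "%.0f\n", n }' "$1"
}

# Succeed (exit status 0) when the given base count is at most 2^32.
fits_index_limit() {
    [ "$1" -le 4294967296 ]
}
```

It could be used as `fits_index_limit "$(count_bases my_reference.fa)" && echo "small enough"`, where my_reference.fa is a placeholder file name.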

    • #3
      I was trying -d 1. I do have access to machines with 32 GB of memory, but they are usually busy and difficult to reserve.

      But my problem must be my db size, thanks for the quick reply.

      • #4
        Let me know how it goes, I would be happy to help further.

        • #5
          Dear Nils,

          I have a similar problem, except that the hg19 I'm using is below 2^32 bases, so size shouldn't be the issue. But the .bif file stays empty after a few hours of running. We are working on a cluster with nodes that have 8 cores and 24 GB of RAM. Here is the script:

          #! /bin/bash
          #parameter for PBS
          #PBS -q smp
          #PBS -l walltime=10:00:00
          #PBS -l mem=24gb
          #PBS -M [email]
          #PBS -m abe
          #PBS -N indexhg19mask1

          #start of BFAST
          module load bfast-gcc
          cd $PBS_O_WORKDIR

          #create the index from the ref genome
          bfast index -f hg19.fa -n 8 -m 1111111111111111111111 -w 14 -i 1 -A 1


          And here is the job's output file:

          ************************************************************
          Checking input parameters supplied by the user ...
          Validating fastaFileName hg19.fa.
          Validating tmpDir path ./.
          Input arguments look good!
          ************************************************************
          ************************************************************
          Printing Program Parameters:
          programMode: [ExecuteProgram]
          fastaFileName: hg19.fa
          space: [Color Space]
          mask: 1111111111111111111111
          depth: 0
          hashWidth: 14
          indexNumber: 1
          repeatMasker: [Not Using]
          startContig: 0
          startPos: 0
          endContig: 2147483647
          endPos: 2147483647
          exonsFileName: [Not Using]
          numThreads: 8
          tmpDir: ./
          timing: [Not Using]
          ************************************************************
          ************************************************************
          Reading in reference genome from hg19.fa.cs.brg.
          In total read 25 contigs for a total of 3095693983 bases
          ************************************************************
          Creating the index...
          ************************************************************
          Warning: startContig was less than zero.
          Defaulting to contig=1 and position=1.
          ************************************************************
          ************************************************************
          Warning: endContig was greater than the number of contigs in the reference genome.
          Defaulting to reference genome's end contig=25 and position=16571.
          ************************************************************
          Currently on [contig,pos]:
          ^M[-------0,----------0]^M[-------1,----1000000]^M[-------1,----2000000]^M[-------1,----3000000]
          ^M[-------1,----4000000]^M[-------1,----5000000]^M[-------1,----6000000]^M[-------1,----7000000]
          ^M[-------1,----8000000]^M[-------1,----9000000]^M[-------1,---10000000]^M[-------1,---11000000]
          ^M[-------1,---12000000]^M[-------1,---13000000]^M[-------1,---14000000]^M[-------1,---15000000]
          ^M[-------1,---16000000]^M[-------1,---17000000]^M[-------1,---18000000]^M[-------1,---19000000]
          ^M[-------1,---20000000]^M[-------1,---21000000]^M[-------1,---22000000]^M[-------1,---23000000]
          ^M[-------1,---24000000]^M[-------1,---25000000]^M[-------1,---26000000]^M[-------1,---27000000]
          ^M[-------1,---28000000]^M[-------1,---29000000]^M[-------1,---30000000]^M[-------1,---31000000]
          ^M[-------1,---32000000]^M[-------1,---33000000]^M[-------1,---34000000]^M[-------1,---35000000]
          ^M[-------1,---36000000]^M[-------1,---37000000]^M[-------1,---38000000]^M[-------1,---39000000]
          ^M[-------1,---40000000]^M[-------1,---41000000]^M[-------1,---42000000]^M[-------1,---43000000]
          ^M[-------1,---44000000]^M[-------1,---45000000]^M[-------1,---46000000]^M[-------1,---47000000]
          ^M[-------1,---48000000]^M[-------1,---49000000]^M[-------1,---50000000]^M[-------1,---51000000]
          ^M[-------1,---52000000]^M[-------1,---53000000]^M[-------1,---54000000]^M[-------1,---55000000]
          ^M[-------1,---56000000]^M[-------1,---57000000]^M[-------1,---58000000]^M[-------1,---59000000]
          ^M[-------1,---60000000]^M[-------1,---61000000]^M[-------1,---62000000]^M[-------1,---63000000]
          ^M[-------1,---64000000]^M[-------1,---65000000]^M[-------1,---66000000]^M[-------1,---67000000]
          ^M[-------1,---68000000]^M[-------1,---69000000]^M[-------1,---70000000]^M[-------1,---71000000]
          ^M[-------1,---72000000]^M[-------1,---73000000]^M[-------1,---74000000]^M[-------1,---75000000]
          ^M[-------1,---76000000]^M[-------1,---77000000]^M[-------1,---78000000]^M[-------1,---79000000]
          ^M[-------1,---80000000]^M[-------1,---81000000]^M[-------1,---82000000]^M[-------1,---83000000]
          ^M[-------1,---84000000]^M[-------1,---85000000]^M[-------1,---86000000]^M[-------1,---87000000]
          ^M[-------1,---88000000]^M[-------1,---89000000]^M[-------1,---90000000]^M[-------1,---91000000]
          ^M[-------1,---92000000]^M[-------1,---93000000]^M[-------1,---94000000]^M[-------1,---95000000]
          ^M[-------1,---96000000]^M[-------1,---97000000]^M[-------1,---98000000]^M[-------1,---99000000]
          ^M[-------1,--100000000]^M[-------1,--101000000]^M[-------1,--102000000]^M[-------1,--103000000]
          ^M[-------1,--104000000]^M[-------1,--105000000]^M[-------1,--106000000]^M[-------1,--107000000]
          ^M[-------1,--108000000]^M[-------1,--109000000]^M[-------1,--110000000]^M[-------1,--111000000]
          ^M[-------1,--112000000]^M[-------1,--113000000]^M[-------1,--114000000]^M[-------1,--115000000]
          ^M[-------1,--116000000]^M[-------1,--117000000]^M[-------1,--118000000]^M[-------1,--119000000]
          ^M[-------1,--120000000]^M[-------1,--121000000]^M[-------1,--122000000]^M[-------1,--123000000]
          ^M[-------1,--124000000]^M[-------1,--125000000]^M[-------1,--126000000]^M[-------1,--127000000]
          ^M[-------1,--128000000]^M[-------1,--129000000]^M[-------1,--130000000]^M[-------1,--131000000]
          ^M[-------1,--132000000]^M[-------1,--133000000]^M[-------1,--134000000]^M[-------1,--135000000]
          ^M[-------1,--136000000]^M[-------1,--137000000]^M[-------1,--138000000]^M[-------1,--139000000]
          ^M[-------1,--140000000]^M[-------1,--141000000]^M[-------1,--142000000]^M[-------1,--143000000]
          ^M[-------1,--144000000]^M[-------1,--145000000]^M[-------1,--146000000]^M[-------1,--147000000]
          ^M[-------1,--148000000]^M[-------1,--149000000]^M[-------1,--150000000]^M[-------1,--151000000]
          ^M[-------1,--152000000]^M[-------1,--153000000]^M[-------1,--154000000]^M[-------1,--155000000]
          ^M[-------1,--156000000]^M[-------1,--157000000]^M[-------1,--158000000]^M[-------1,--159000000]
          ^M[-------1,--160000000]^M[-------1,--161000000]^M[-------1,--162000000]^M[-------1,--163000000]
          ^M[-------1,--164000000]^M[-------1,--165000000]^M[-------1,--166000000]^M[-------1,--167000000]
          ^M[-------1,--168000000]^M[-------1,--169000000]^M[-------1,--170000000]^M[-------1,--171000000]
          ^M[-------1,

          thanks in advance for your help!
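The progress counter in a log like the one above is redrawn with carriage returns (displayed as ^M), so the most recent complete [contig,pos] pair can be read straight from the job's output file. A minimal sketch, assuming the output was captured to a file; last_progress is a hypothetical helper name, not a BFAST command:

```shell
# Hypothetical helper (not part of BFAST): print the last complete
# "[contig,pos]" progress marker found in a captured log file.
last_progress() {
    # Turn carriage returns into newlines so each redraw becomes a line,
    # keep only complete markers such as "[-------1,--171000000]",
    # and print the final one. A truncated marker at the end is ignored.
    tr '\r' '\n' < "$1" \
        | grep -E '^\[-*[0-9]+,-*[0-9]+\]$' \
        | tail -n 1
}
```

Running `last_progress job_output.txt` a few minutes apart (job_output.txt is a placeholder name) shows whether the position is still advancing.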

          • #6
            problem solved!

            • #7
              Just to help others, what was the solution?

              • #8
                Yes, of course. The solution was patience :-).
                Let me elaborate a bit more. The first time I tested bfast I ran a demo on a small genome (E. coli) and could see my .bif file being created in real time (the size of the file kept increasing). But when running on the human genome the size stayed at 0, so I assumed that nothing was happening, and after 10 hours I stopped the process. I had also assumed that, since I was specifying 1 node with 8 cores (smp) in PBS, this information would be passed on to bfast... but no. So by setting the -n option to 8 and being patient it did work, even though I couldn't see my file being created. I assume that, because of the size, everything happens in a temp dir on the nodes...
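Since the scheduler does not hand the core count to bfast, the thread count has to be passed explicitly with -n. One way to avoid hard-coding it is sketched below. It assumes a Torque/PBS-style scheduler that sets PBS_NODEFILE to a file with one line per allocated core, and a job confined to a single node; pbs_thread_count is a hypothetical helper name.

```shell
# Hypothetical helper: derive a thread count from the PBS allocation.
# Assumes PBS_NODEFILE (set by Torque/PBS inside a job) lists one line
# per allocated core and that the job runs on a single node.
pbs_thread_count() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        # Strip the padding some wc implementations add.
        wc -l < "$PBS_NODEFILE" | tr -d ' '
    else
        # Outside a PBS job, fall back to a single thread.
        echo 1
    fi
}
```

In the job script above, the index command could then be written as `bfast index -f hg19.fa -n "$(pbs_thread_count)" -m 1111111111111111111111 -w 14 -i 1 -A 1`.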

                • #9
                  Thanks! I was having the same problem, and this helped a ton.
