
  • A question about BFAST indexing

    Hi guys,

    This is Carlos from Valencia, I am a new member (congratulations for the site is fantastic).

    Although I have some humble experience on bioinformatics, in next gene sequencing I am just starting and there are some things I still am unfamiliar. My first question is begginer´s question and it is about the BFAST indexing process if using the human genome hg19 as a case study.

    I downloaded this genome and joined all the chromosomes in a single file installed BFAST and run fasta2brg to create the gen reference. I did that without troubles.

    Then, i tried the indexing. First i read something here about how to do that and here is my question (in fact are two).

    A priori nothing was wrong. I tried two things regarding masking and I guess I was able to create the bif file in both cases. However I do no find apparent differences in the command outputs of both processes. In short, the indexing was succesful but i am not sure if it was right.

    First i tried with one mask

    ..assembling/tools/bfast/bfast-0.7.0a/bfast> ./bfast index -f hg19.fa -m 1111111111111111 -w 14 -i 1

    and then I tried the 10 masks recommended in another post:

    ..assembling/tools/bfast/bfast-0.7.0a/bfast> ./bfast index -f hg19.fa -m 1111111111111111 11111110011110011111 111111111001111111 1111000111111111111 111101000110010011111111 1111111111110001111 1111100100111110011111 1111111110011001100111 1100111001110011111111 11110011011110010011111 -w 14 -i 10

    Below is the output of the second case (10 masks). Apart from taking much longer than the run with one mask (I only used 1 thread), the output is more or less the same. The tool prints warning messages that apparently do not affect the run.

    Question 1) Are these messages a normal part of the output, or must I check/amend something in my genome file to avoid them?

    Question 2) The program parameters list (below) only shows the first mask, although this is the output of the 10-mask example. Does this mean that the tool only recognizes the first mask string, or does it just print the first string for simplicity's sake?

    Thank you in advance

    Carlos

    ................................................


    Checking input parameters supplied by the user ...
    Validating fastaFileName hg19.fa.
    Validating tmpDir path ./.
    Input arguments look good!
    ************************************************************
    ************************************************************
    Printing Program Parameters:
    programMode: [ExecuteProgram]
    fastaFileName: hg19.fa
    space: [NT Space]
    mask: 1111111111111111
    depth: 0
    hashWidth: 14
    indexNumber: 10
    repeatMasker: [Not Using]
    startContig: 0
    startPos: 0
    endContig: 2147483647
    endPos: 2147483647
    exonsFileName: [Not Using]
    numThreads: 1
    tmpDir: ./
    timing: [Not Using]
    ************************************************************
    ************************************************************
    Reading in reference genome from hg19.fa.nt.brg.
    In total read 93 contigs for a total of 3137161264 bases
    ************************************************************
    Creating the index...
    ************************************************************
    Warning: startContig was less than zero.
    Defaulting to contig=1 and position=1.
    ************************************************************
    ************************************************************
    Warning: endContig was greater than the number of contigs in the reference genome.
    Defaulting to reference genome's end contig=93 and position=59373566.
    ************************************************************
    Currently on [contig,pos]:
    [------93,---59373566]
    Sorting... 100.00 percent complete 100.000 percent complete
    Sorted.
    Creating a hash.
    Pass 1 out of 2. Out of 2897303003, currently on:
    2897303003
    Pass 2 of 2. Out of 268435456, currently on:
    268435456
    Hash created.
    Index created.
    Index size is 14.492GB.
    Terminating successfully!
    ************************************************************

  • #2
    You need to run bfast index independently for each mask. Each index file is about the same size, ~14 GB. In your example only the first mask was used to create index 10.
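
A quick way to confirm which mask a given run actually used is to look at the parameter block the tool prints. The sketch below assumes the output of a bfast index run was saved to a text file; the helper name show_index_params and the temporary log file are made up for illustration, and the sample lines are copied from the output posted above.

```sh
#!/bin/sh
# Hypothetical helper: print the mask and index number reported in a saved
# bfast index log (how the log was captured is left to the user).
show_index_params() {
    grep -E '^[[:space:]]*(mask|indexNumber):' "$1" | sed 's/^[[:space:]]*//'
}

# Build a small sample log from lines that appear in the posted output.
LOG=$(mktemp)
printf '%s\n' \
    'fastaFileName: hg19.fa' \
    'space: [NT Space]' \
    'mask: 1111111111111111' \
    'depth: 0' \
    'hashWidth: 14' \
    'indexNumber: 10' > "$LOG"

# Only one mask line is reported for this run.
PARAMS=$(show_index_params "$LOG")
echo "$PARAMS"
rm -f "$LOG"
```

If a run reports a single mask next to the index number, that index was built from that mask only, which matches the behaviour described in this reply.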



    • #3
      OK chipper, got it, thank you for the clarification. I will see whether the tool lets me write a small command file to automate this step for each mask and save the indexes in distinct files. If not, I'll do it independently for each mask, as you say.



      • #4
        Here's a shell script I hope works
        Code:
        #!/bin/sh
        bfast fasta2brg -f hg19.fasta
        I=1;
        for MASK in 1111111111111111 11111110011110011111 111111111001111111 1111000111111111111 111101000110010011111111 1111111111110001111 1111100100111110011111 1111111110011001100111 1100111001110011111111 11110011011110010011111
        do
            bfast index -f hg19.fasta -i ${I} -w 14 -m ${MASK} -n <num threads>;
            I=`echo ${I} + 1`;
        done



        • #5
          Nils
          It is still running, but it seems to work.
          Thank you



          • #6
            Hi Nils
            Just a correction for users with the same starting need. As written, the script above failed because the bif index file created with the first mask had the same name as the one the tool expected to create for the second mask. Since a file with that name already existed when the second index was about to be created, the script aborted with the message:
            file "blablabla" already exists.

            To solve this, I paste here an easy modification of nilshomer's script that lets the loop run to the end. At least on my computers it works. I hope it is useful to others.

            -----------------------

            #!/bin/bash
            bfast fasta2brg -f hg19.fasta;
            I=1;
            for MASK in 1111111111111111 11111110011110011111 111111111001111111 1111000111111111111 111101000110010011111111 1111111111110001111 1111100100111110011111 1111111110011001100111 1100111001110011111111 11110011011110010011111
            do
                bfast index -f hg19.fasta -i $I -w 14 -m $MASK -n 4;
                mv hg19.fasta.nt.1.1.bif hg19.fasta.nt.$I.bif
                let I=I+1;
            done

            -----------------
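
One possible contributor to the name clash is the counter line in the earlier script: a backtick `echo ${I} + 1` stores the text of the expression instead of adding anything, so the counter never becomes 2. How bfast then treats the extra words after -i is not something this thread shows, so take this only as a sketch of the shell side. The variable names J and K are just for illustration.

```sh
#!/bin/sh
# The counter line from the earlier script: echo just prints its arguments,
# so I ends up holding text, not the next number.
I=1
I=`echo ${I} + 1`
echo "echo form:       '${I}'"

# POSIX arithmetic expansion actually performs the addition.
J=1
J=$((J + 1))
echo "arithmetic form: '${J}'"

# expr is the older equivalent and works inside backticks.
K=1
K=`expr ${K} + 1`
echo "expr form:       '${K}'"
```

The first form leaves the string 1 + 1 in the variable, while the other two give 2. The `let I=I+1` used in the script above does the same job in bash; `$((I + 1))` also works under plain /bin/sh.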
