  • flacchy
    Member
    • Apr 2013
    • 33

    Tracking blastall

    Hi,
    I just started my PhD and I am working with a huge dataset (~7 million reads).
    I started blastall against nt in my Bio-Linux shell, and since it's going to take forever I wanted to ask for some help on how to keep track of the analysis.
    Using the less command I can see what's in the output file, but is there a way to get some numbers out of it, such as how many reads have been processed already?
    Could someone help?

    P.S. This is the command I used:
    blastall -d 'nt' -p 'blastn' -i contigs.fa -o contigs.fa.blastn -e 1e-06 -b 10 -v 10 -a 4

    Thanks
  • rhinoceros
    Senior Member
    • Apr 2013
    • 372

    #2
    How many months/years do you expect this query to take? Do you think you have enough HDD space for the output file? If it's impossible for you to run your query on a more powerful platform, at least split the input into smaller files.
    Last edited by rhinoceros; 05-08-2013, 01:26 AM.
    savetherhino.org

    Comment

    • flacchy
      Member
      • Apr 2013
      • 33

      #3
      We do have enough space for the output file. I know somebody tried this before and it took 6 months, which is why I was wondering about a way to keep track...
      Do you know if there is a way other than clustering the data? Or a free platform I could use?

      Comment

      • rhinoceros
        Senior Member
        • Apr 2013
        • 372

        #4
        If I were you, I'd run my blasts on Amazon EC2 or something similar. It's not that expensive..
        savetherhino.org

        Comment

        • maubp
          Peter (Biopython etc)
          • Jul 2009
          • 1544

          #5
          How many sequences are in your contig FASTA file?

          Are your contigs from a transcriptome assembly, meaning each is not that long (typical genes)? Or genomic meaning some could be very large? Either way, try smaller batches of 100 or 1000 sequences at a time - that should let you estimate how long the whole assembly will take.

          Does your computer have enough RAM for the NT database?

          Does your computer have multiple CPU cores? Have you tried running BLAST with multiple threads and/or multiple copies of BLAST on separate query files?

          Are you using the plain text output? If so what will you do with it - parse it? Perhaps a more compact and computer friendly output might be wiser, like the tabular output?
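
A minimal sketch of the batching idea above, assuming a plain multi-record FASTA file. The function names, the `batch` output prefix, and the default batch size are illustrative choices, not anything taken from BLAST itself. The second function only does a linear extrapolation from one timed batch, so it is a rough guide at best (query length and hit counts vary between batches).

```python
from pathlib import Path


def split_fasta(fasta_path, batch_size=1000, out_prefix="batch"):
    """Write consecutive groups of `batch_size` FASTA records to numbered files.

    Returns the list of files written. `out_prefix` is a hypothetical naming
    choice; files are named <out_prefix>_00001.fa, <out_prefix>_00002.fa, ...
    """
    out_paths = []
    out = None
    n_records = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Start a new output file every `batch_size` records.
                if n_records % batch_size == 0:
                    if out is not None:
                        out.close()
                    path = Path(f"{out_prefix}_{len(out_paths) + 1:05d}.fa")
                    out = open(path, "w")
                    out_paths.append(path)
                n_records += 1
            # Lines before the first header (if any) are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return out_paths


def estimate_total_seconds(batch_seconds, batch_records, total_records):
    """Linearly extrapolate the run time of one timed batch to the whole input."""
    return batch_seconds * total_records / batch_records
```

For example, if a batch of 1000 sequences took 120 seconds, `estimate_total_seconds(120.0, 1000, 7_000_000)` gives 840000 seconds, which is a little under 10 days on the same machine and settings.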

          Comment

          • flacchy
            Member
            • Apr 2013
            • 33

            #6
            Thanks maubp... so

            The metagenome has been sequenced with Illumina and we know that the read lengths range from 15 to 99 bp.

            Do you suggest using software such as CD-HIT to split the file into smaller batches?

            We installed the nt database on the NX machine and we have an 8-core CPU. Could you help me a little more with how to run BLAST with multiple threads and/or multiple copies of BLAST on separate query files? (Is there a link I can look at?)

            As output I set a fasta file (I saw that in some workshops), so I told the program to write its output to a file named contigs.fa.blastn

            Comment

            • rhinoceros
              Senior Member
              • Apr 2013
              • 372

              #7
              Originally posted by flacchy View Post
              The metagenome has been sequenced with Illumina and we know that the read lengths range from 15 to 99 bp.

              Do you suggest using software such as CD-HIT to split the file into smaller batches?
              Why not assemble before doing anything else, or alternatively send the reads for BLAST to MG-RAST or IMG/M or some other online pipeline? But really, you should assemble first. What do you hope to gain from blasting reads that are just 15 nt long?
              We installed the nt database on the NX machine and we have an 8-core CPU. Could you help me a little more with how to run BLAST with multiple threads and/or multiple copies of BLAST on separate query files? (Is there a link I can look at?)
              http://www.ncbi.nlm.nih.gov/books/NBK1762/ You had already set up 4 threads with the -a flag. In newer versions of BLAST (BLAST+), -num_threads replaces this flag, and really, for speed gains you should be using the latest version.
              Last edited by rhinoceros; 05-08-2013, 02:52 AM.
              savetherhino.org
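
As a rough sketch of the flag change mentioned above, here is how the original blastall command might look with BLAST+ `blastn`. The function name and the `.tsv` output name in the usage note are made up for illustration. Two assumptions are baked in: `-task blastn` is added because BLAST+ `blastn` defaults to megablast, and the output is switched to tabular (`-outfmt 6`), in which case `-max_target_seqs` takes the place of the legacy `-b`/`-v` limits.

```python
def blastn_plus_command(query, out, db="nt", evalue="1e-06", threads=4, max_hits=10):
    """Build a BLAST+ blastn argument list roughly matching the legacy command
    blastall -p blastn -d nt -i <query> -o <out> -e 1e-06 -b 10 -v 10 -a 4
    """
    return [
        "blastn",
        "-task", "blastn",                   # assumption: mimic legacy blastn, not megablast
        "-db", db,                           # legacy: -d
        "-query", query,                     # legacy: -i
        "-out", out,                         # legacy: -o
        "-evalue", evalue,                   # legacy: -e
        "-max_target_seqs", str(max_hits),   # stands in for -b/-v with tabular output
        "-outfmt", "6",                      # tabular output instead of plain text
        "-num_threads", str(threads),        # legacy: -a
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`, for example with `cmd = blastn_plus_command("contigs.fa", "contigs.fa.blastn.tsv")`. To run several copies on separate query files, the same function could be called once per file with a smaller `threads` value each.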

              Comment

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Originally posted by rhinoceros View Post
                Why not assemble before doing anything else, or alternatively send the reads for BLAST to MG-RAST or IMG/M or some other online pipeline? But really, you should assemble first. What do you hope to gain from blasting reads that are just 15 nt long?

                http://www.ncbi.nlm.nih.gov/books/NBK1762/ You had already set up 4 threads with the -a flag. In newer versions of BLAST (BLAST+), -num_threads replaces this flag.
                I assumed from the filename contigs.fa in your question that you had already assembled the data. If not, you should do that first.

                Comment

                • flacchy
                  Member
                  • Apr 2013
                  • 33

                  #9
                  I assembled these reads with Velvet, and now I am trying to set up MetaVelvet to get better contigs, since the contigs I obtained are still short (some of them 41 nt).

                  At the same time we are running a search on the reads to see what kind of 'organisms' to expect from the data. Does that make sense?
                  Last edited by flacchy; 05-08-2013, 05:17 AM.

                  Comment

                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    Wouldn't it be preferable to use a resource like MG-RAST (http://metagenomics.anl.gov/) for this type of analysis? Assuming that the sample here is metagenomic, of course.

                    Comment

                    • flacchy
                      Member
                      • Apr 2013
                      • 33

                      #11
                      Yes, it is a metagenome (specifically marine viromes). I'll have a look. Thank you so much, this was of great help!

                      Comment

                      • flacchy
                        Member
                        • Apr 2013
                        • 33

                        #12
                        If anyone is curious, there is a script to keep track of BLAST (if you are dealing with huge data)
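
This is not the script referred to above, just a minimal sketch of one way to get progress numbers out of a running job. It assumes the default plain-text BLAST report, where each query's section begins with a line starting with `Query=`, and compares that count with the number of `>` headers in the input FASTA. All function names are illustrative.

```python
def count_fasta_records(fasta_path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_queries_reported(blast_text_path):
    """Count queries that BLAST has started reporting in a plain-text report.

    Assumes each query section begins with a line starting with "Query=".
    """
    with open(blast_text_path, errors="replace") as handle:
        return sum(1 for line in handle if line.startswith("Query="))


def blast_progress(fasta_path, blast_text_path):
    """Return (queries reported so far, total queries, fraction done)."""
    total = count_fasta_records(fasta_path)
    done = count_queries_reported(blast_text_path)
    fraction = done / total if total else 0.0
    return done, total, fraction
```

Calling `blast_progress("contigs.fa", "contigs.fa.blastn")` while the job is running gives a snapshot; the most recent query may still be only partly written. With tabular output this approach would undercount, because queries without hits leave no line in the file.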

                        Comment

                        • kmcarr
                          Senior Member
                          • May 2008
                          • 1181

                          #13
                          Originally posted by flacchy View Post
                          Yes, it is a metagenome (specifically marine viromes). I'll have a look. Thank you so much, this was of great help!
                          DO NOT use nt!! If your query sequences are from marine viruses don't search against the entire universe of DNA sequences.

                          One of the very first things you should do when setting up a BLAST experiment (yes, think of running BLAST as an in silico experiment) is choosing a database appropriate to your experimental system and objective. The nt database has DNA from every branch of the taxonomic tree and every species from aardvark to zyzzyva. I am hard pressed to think of a time when nt is the correct database to use. Construct a target database focused on the experiment and it will greatly speed up your BLAST.
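
A minimal sketch of building such a focused database with BLAST+ `makeblastdb`, assuming you already have a FASTA file of the target sequences (for a virome, a set of viral nucleotide sequences). The function name and the file and database names in the usage note are hypothetical; which sequences belong in the file is a decision for the experiment, not something this snippet can make.

```python
def makeblastdb_command(fasta, db_name, title=None):
    """Build a makeblastdb argument list for a nucleotide database."""
    cmd = [
        "makeblastdb",
        "-in", fasta,        # FASTA file of target sequences
        "-dbtype", "nucl",   # nucleotide database, for use with blastn
        "-out", db_name,     # base name of the database files to create
    ]
    if title is not None:
        cmd += ["-title", title]
    return cmd
```

For example, `makeblastdb_command("viral_genomes.fna", "viral_db")` could be passed to `subprocess.run(..., check=True)`, and the resulting `viral_db` name would then be given to `-db` (BLAST+) in place of nt.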

                          Comment
