Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to run blastn against my data (and compare to ncbi nt library)?

    I am new to blast and using bioinformatics tools on the command line. I recently have cleaned the data I was given (groomed, filtered, trimmed, and converted into fasta). I have read the documentation of ncbi and the commandline quick guide, but still have questions.

    I have the following two requirements given me by my boss for blastn:
    • E-value has to be smaller than 10E-100 for a hit to be reporte
    • Use the whole NCBI nt as my database that I blast against (it would be good if I could filter out tick sequences, if possible)


    I have the following questions:
    • What parameter allows me to choose 10E-100 or smaller to be significant?
    • How can I reference the whole NCBI nt db?
    • How can I filter out certain tick sequences?
    • I have about 63 fasta files I need to blast, I think. Do I run this in parallel or is there a way I can define a directory as the input with blastn?


    Here is my current skeleton I have written for the blastn command, but I am stuck:

    Code:
    /homes/hlyates/ncbi-blast-2.2.28+/bin ./blastn -db ? -query ~/scripts/Sample_Index2/Sample1.fasta
    I deeply appreciate any comments or assistance that could be rendered.

  • #2
    Command line option tables: http://www.ncbi.nlm.nih.gov/books/NBK279675/

    BTW: If this was NGS data why are you using BLAST when you should be using a real NGS aligner?
    Last edited by GenoMax; 04-04-2015, 08:33 AM.

    Comment


    • #3
      Get all "nt*" files from: ftp://ftp.ncbi.nih.gov/blast/db/. Un-compress them and leave in the same directory.

      -db /path_to_directory/nt
      does a search against the entire "nt" db.
      -evalue put_your_limit_value_here
      Will give you results with e-values better than your limit.

      It may be easier to remove "tick" hits from your results afterwards. You could use the "filtering_db" option with the tick sequences (or use the "-negative_gilist" if you have a few sequences to exclude).

      What kind of hardware do you have access to? On a cluster, parallel jobs with your input files is the best option. Standalone computer you may need to go through the files sequentially.

      Comment


      • #4
        Geno,

        Thanks for your informative reply. I was given a very short window to accomplish this step. I plan on continuing to read the documentation to learn more on my own even after I accomplish this task.

        That being said, could you elaborate more on what you meant about using an NGS aligner? This was never discussed and I feel like I should bring this to my supervisors attention. I found this link: http://en.wikibooks.org/wiki/Next_Ge...NGS)/Alignment

        Last but not least, I do have access to a cluster and will run this job in parallel. This is something I already know how to do.

        Thanks,
        Heath

        Comment


        • #5
          Most sequencing libraries contain DNA from a single species, and so the fastest (by far) approach is to index the reference genome of interest and use short-read aligners to map each read to the reference genome. People use BBMap, bwa-mem, GSNAP, and many others to do this.

          If you have a mixed population of DNA, or DNA samples in which you suspect there are contaminating species present, it is a reasonable approach to take a sampling of your reads and use blastn against the non-redundant database just to see what is in your sample in order to choose some reference genomes for short-read read alignment. We check our samples this way and sometimes find useful information by doing so.

          blastn -db /path_to/ncbi_nt/nt -query sample.fasta -out sample_blast.txt
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

          Comment


          • #6
            @hlyates: NGS aligners are designed to map short reads efficiently against a reference. BBMap/Bowtie2/bwa can all align tens of millions of reads against a human genome sized reference per hour with an efficient memory footprint (on other hand regular blast would take significantly longer to do same task). So if you do have NGS data that you are analyzing then consider using a proper NGS aligner (with original fastq format data) to do the alignments.

            If you have metagenomics type dataset then blast may still be appropriate (tblastx or tblastn) for functional assignments.

            As @SNPsaurus indicated above blast is also the go to tool to diagnose problem runs with a sampling of reads.

            Blast will require ~20G of memory per job against nt DB so keep that in mind as you plan your search strategy. If you have a high memory node available keeping the db in memory (via a RAMdisk) can speed up the searches.

            Comment


            • #7
              Originally posted by GenoMax View Post
              Get all "nt*" files from: ftp://ftp.ncbi.nih.gov/blast/db/. Un-compress them and leave in the same directory.



              does a search against the entire "nt" db.


              Will give you results with e-values better than your limit.

              It may be easier to remove "tick" hits from your results afterwards. You could use the "filtering_db" option with the tick sequences (or use the "-negative_gilist" if you have a few sequences to exclude).

              What kind of hardware do you have access to? On a cluster, parallel jobs with your input files is the best option. Standalone computer you may need to go through the files sequentially.

              I totally understand the point of what I am doing now. Basically, the idea is to run blast on the converted fasta files first to see if I have contamination. If I do, I want to filter those out and then do an assembly.

              I ran this command

              Code:
              /homes/hlyates/ncbi-blast-2.2.29+/bin/blastn -db /homes/hlyates/blastdb/nt -query ~/tickdata/Sample_Index2/Index2_CGATGT_L008_R1_001_converted.fasta -out blast.txt -evalue 10E-100
              I have two questions:
              • How can I exclude certain sequences from this blast? I only want to have false positives. So is there a way to exclude tick sequences I expect to belong?
              • I am worried. I did 'grep 'No hits found' blast.txt | wc -l and 125796? What would it look like for hits? 'Hits found', right? I get 0. That has to be wrong? Any clues a more experienced analyst might offer?

              Comment


              • #8
                There is a better way to do this.

                I suggest that you use BBSplit (part of BBMap) to pull out the reads that you know are "contamination" (your wording appears to suggest that these are not of interest). Link for BBSplit: http://seqanswers.com/forums/showthread.php?t=41288

                This should leave the reads you are interested in a separate file.

                You could employ this tool in reverse as well.

                What length are you reads? Blast may not be the best tool to catch all "contaminants".

                Comment


                • #9
                  I do:
                  grep -A2 significant blast_results.txt

                  One good idea is to "less" the blast file yourself and just look at the formatting to see how to structure a search for hits and non-hits.

                  If you have a list of tick sequences that you want to remove from the fasta file you could grep the file with the tick sequence file (-f is the option to use a file as a search) and -v to keep the non-matchers. Something like
                  grep -v -f tick_list.fasta full_fasta.fasta
                  Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                  Comment


                  • #10
                    Originally posted by SNPsaurus View Post
                    I do:
                    grep -A2 significant blast_results.txt

                    One good idea is to "less" the blast file yourself and just look at the formatting to see how to structure a search for hits and non-hits.

                    If you have a list of tick sequences that you want to remove from the fasta file you could grep the file with the tick sequence file (-f is the option to use a file as a search) and -v to keep the non-matchers. Something like
                    grep -v -f tick_list.fasta full_fasta.fasta
                    Thanks for your reply. Your solution surprised me. That's weird you cannot "exclude" using the blastn command. I know the GUI tool allows for this. I also think something went wrong as I am getting a 0 line count for significant results.

                    Comment


                    • #11
                      You can "exclude" things. There were two options in post #3. I have not specifically used these myself on the command line but you can tell us if they work

                      You could use the "filtering_db" option with the tick sequences (or use the "-negative_gilist" if you have a few sequences to exclude).

                      Comment


                      • #12
                        Originally posted by GenoMax View Post
                        There is a better way to do this.

                        I suggest that you use BBSplit (part of BBMap) to pull out the reads that you know are "contamination" (your wording appears to suggest that these are not of interest). Link for BBSplit: http://seqanswers.com/forums/showthread.php?t=41288

                        This should leave the reads you are interested in a separate file.

                        You could employ this tool in reverse as well.

                        What length are you reads? Blast may not be the best tool to catch all "contaminants".
                        I don't know what reads are contaminated yet. What I am trying to do is filter out the reads I know that are good first (haven't figured out how to exclude with blastn yet, but geno just indicated it's done with the filtering_db or -negative_gilist) then run blast on the excluded file and see what is significant at 10E-100.


                        Hmmm, 10E-100 seems really stringent. Honestly, that seems like a really small E value and I am not getting anything with my preliminary blast on ANYTHING. Granted, it's just the first fasta file, but does that seem too stringent? Maybe I did something wrong in my cleaning steps? I only blasted my very first converted fasta file, but I would expect there to be something to come up on it?

                        Comment


                        • #13
                          It is unlikely that individual reads are contaminated (unless some are chimeras).

                          You haven't told us the details of what exactly you did to "clean" the reads (what trimming program, settings etc).

                          BTW: The options for exclusion I included are for the blastn command itself. You will include that db when you run the blast search.

                          Remove the e-value restriction to see what is the lowest e-value you get for hits before you go about adding that in. Sample your fasta file (5-10K sequences) so the blast can finish quickly in this investigative phase.

                          Comment


                          • #14
                            hlyates, e-100 might not be possible with short reads. I just glanced at a recent run and reads trimmed to 93 got e-40 with a perfect match. Genomax is right: start with a test file, examine the results manually and narrow in on appropriate parameters knowing more about what sort of results to expect. That's a more important skill in bioinformatics than knowing how to run a particular script!
                            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                            Comment


                            • #15
                              Originally posted by GenoMax View Post
                              It is unlikely that individual reads are contaminated (unless some are chimeras).

                              You haven't told us the details of what exactly you did to "clean" the reads (what trimming program, settings etc).

                              BTW: The options for exclusion I included are for the blastn command itself. You will include that db when you run the blast search.

                              Remove the e-value restriction to see what is the lowest e-value you get for hits before you go about adding that in. Sample your fasta file (5-10K sequences) so the blast can finish quickly in this investigative phase.
                              Originally posted by GenoMax View Post
                              It is unlikely that individual reads are contaminated (unless some are chimeras).

                              You haven't told us the details of what exactly you did to "clean" the reads (what trimming program, settings etc).

                              BTW: The options for exclusion I included are for the blastn command itself. You will include that db when you run the blast search.

                              Remove the e-value restriction to see what is the lowest e-value you get for hits before you go about adding that in. Sample your fasta file (5-10K sequences) so the blast can finish quickly in this investigative phase.
                              I was asked to use galaxy tools, but for the commandline. Thus, I did the following:

                              Specifically, I was instructed (told) to do the following:
                              Groom: fasta to fasta
                              Trimming: trim ends 5' to 3', window size 3, step size 1, max number of bases to exclude 0, aggregate action for window min score, trim until aggregate score is >= 30
                              Filtering: min size 40, max size 0, min qual 30, max qual 0, max number of bases allowed outside of quality range 5

                              This is what I am going to do in the meantime:
                              • Verify my cleaning met the parameters I gave above
                              • Take 5 to 10K first reads of a fasta file (any will do). Run blast on it and see what lowest E value produced is?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X