Seqanswers Leaderboard Ad

**GenoMax** · 04-04-2015, 08:05 AM

Command line option tables: http://www.ncbi.nlm.nih.gov/books/NBK279675/

BTW: If this was NGS data why are you using BLAST when you should be using a real NGS aligner?

**GenoMax** · 04-04-2015, 08:31 AM

Get all "nt*" files from: ftp://ftp.ncbi.nih.gov/blast/db/. Un-compress them and leave in the same directory.

-db /path_to_directory/nt

does a search against the entire "nt" db.

-evalue put_your_limit_value_here

Will give you results with e-values better than your limit.

It may be easier to remove "tick" hits from your results afterwards. You could use the "filtering_db" option with the tick sequences (or use the "-negative_gilist" if you have a few sequences to exclude).

What kind of hardware do you have access to? On a cluster, parallel jobs with your input files is the best option. Standalone computer you may need to go through the files sequentially.

**hlyates** · 04-04-2015, 09:16 AM

Geno,

Thanks for your informative reply. I was given a very short window to accomplish this step. I plan on continuing to read the documentation to learn more on my own even after I accomplish this task.

That being said, could you elaborate more on what you meant about using an NGS aligner? This was never discussed and I feel like I should bring this to my supervisors attention. I found this link: http://en.wikibooks.org/wiki/Next_Ge...NGS)/Alignment

Last but not least, I do have access to a cluster and will run this job in parallel. This is something I already know how to do.

Thanks,
Heath

**SNPsaurus** · 04-04-2015, 09:39 AM

Most sequencing libraries contain DNA from a single species, and so the fastest (by far) approach is to index the reference genome of interest and use short-read aligners to map each read to the reference genome. People use BBMap, bwa-mem, GSNAP, and many others to do this.

If you have a mixed population of DNA, or DNA samples in which you suspect there are contaminating species present, it is a reasonable approach to take a sampling of your reads and use blastn against the non-redundant database just to see what is in your sample in order to choose some reference genomes for short-read read alignment. We check our samples this way and sometimes find useful information by doing so.

blastn -db /path_to/ncbi_nt/nt -query sample.fasta -out sample_blast.txt

**GenoMax** · 04-04-2015, 12:05 PM

@hlyates: NGS aligners are designed to map short reads efficiently against a reference. BBMap/Bowtie2/bwa can all align tens of millions of reads against a human genome sized reference per hour with an efficient memory footprint (on other hand regular blast would take significantly longer to do same task). So if you do have NGS data that you are analyzing then consider using a proper NGS aligner (with original fastq format data) to do the alignments.

If you have metagenomics type dataset then blast may still be appropriate (tblastx or tblastn) for functional assignments.

As @SNPsaurus indicated above blast is also the go to tool to diagnose problem runs with a sampling of reads.

Blast will require ~20G of memory per job against nt DB so keep that in mind as you plan your search strategy. If you have a high memory node available keeping the db in memory (via a RAMdisk) can speed up the searches.

**hlyates** · 04-06-2015, 01:22 PM

Originally posted by GenoMax View Post

Get all "nt*" files from: ftp://ftp.ncbi.nih.gov/blast/db/. Un-compress them and leave in the same directory.

does a search against the entire "nt" db.

Will give you results with e-values better than your limit.

It may be easier to remove "tick" hits from your results afterwards. You could use the "filtering_db" option with the tick sequences (or use the "-negative_gilist" if you have a few sequences to exclude).

What kind of hardware do you have access to? On a cluster, parallel jobs with your input files is the best option. Standalone computer you may need to go through the files sequentially.

I totally understand the point of what I am doing now. Basically, the idea is to run blast on the converted fasta files first to see if I have contamination. If I do, I want to filter those out and then do an assembly.

I ran this command

Code:

/homes/hlyates/ncbi-blast-2.2.29+/bin/blastn -db /homes/hlyates/blastdb/nt -query ~/tickdata/Sample_Index2/Index2_CGATGT_L008_R1_001_converted.fasta -out blast.txt -evalue 10E-100

I have two questions:

How can I exclude certain sequences from this blast? I only want to have false positives. So is there a way to exclude tick sequences I expect to belong?
I am worried. I did 'grep 'No hits found' blast.txt | wc -l and 125796? What would it look like for hits? 'Hits found', right? I get 0. That has to be wrong? Any clues a more experienced analyst might offer?

**GenoMax** · 04-06-2015, 01:34 PM

There is a better way to do this.

I suggest that you use BBSplit (part of BBMap) to pull out the reads that you know are "contamination" (your wording appears to suggest that these are not of interest). Link for BBSplit: http://seqanswers.com/forums/showthread.php?t=41288

This should leave the reads you are interested in a separate file.

You could employ this tool in reverse as well.

What length are you reads? Blast may not be the best tool to catch all "contaminants".

**SNPsaurus** · 04-06-2015, 01:38 PM

I do:
grep -A2 significant blast_results.txt

One good idea is to "less" the blast file yourself and just look at the formatting to see how to structure a search for hits and non-hits.

If you have a list of tick sequences that you want to remove from the fasta file you could grep the file with the tick sequence file (-f is the option to use a file as a search) and -v to keep the non-matchers. Something like
grep -v -f tick_list.fasta full_fasta.fasta

**hlyates** · 04-06-2015, 01:43 PM

Originally posted by SNPsaurus View Post

I do:
grep -A2 significant blast_results.txt

One good idea is to "less" the blast file yourself and just look at the formatting to see how to structure a search for hits and non-hits.

If you have a list of tick sequences that you want to remove from the fasta file you could grep the file with the tick sequence file (-f is the option to use a file as a search) and -v to keep the non-matchers. Something like
grep -v -f tick_list.fasta full_fasta.fasta

Thanks for your reply. Your solution surprised me. That's weird you cannot "exclude" using the blastn command. I know the GUI tool allows for this. I also think something went wrong as I am getting a 0 line count for significant results.

**GenoMax** · 04-06-2015, 01:47 PM

You can "exclude" things. There were two options in post #3. I have not specifically used these myself on the command line but you can tell us if they work

You could use the "filtering_db" option with the tick sequences (or use the "-negative_gilist" if you have a few sequences to exclude).

**hlyates** · 04-06-2015, 01:49 PM

Originally posted by GenoMax View Post

There is a better way to do this.

I suggest that you use BBSplit (part of BBMap) to pull out the reads that you know are "contamination" (your wording appears to suggest that these are not of interest). Link for BBSplit: http://seqanswers.com/forums/showthread.php?t=41288

This should leave the reads you are interested in a separate file.

You could employ this tool in reverse as well.

What length are you reads? Blast may not be the best tool to catch all "contaminants".

I don't know what reads are contaminated yet. What I am trying to do is filter out the reads I know that are good first (haven't figured out how to exclude with blastn yet, but geno just indicated it's done with the filtering_db or -negative_gilist) then run blast on the excluded file and see what is significant at 10E-100.

Hmmm, 10E-100 seems really stringent. Honestly, that seems like a really small E value and I am not getting anything with my preliminary blast on ANYTHING. Granted, it's just the first fasta file, but does that seem too stringent? Maybe I did something wrong in my cleaning steps? I only blasted my very first converted fasta file, but I would expect there to be something to come up on it?

**GenoMax** · 04-06-2015, 02:18 PM

It is unlikely that individual reads are contaminated (unless some are chimeras).

You haven't told us the details of what exactly you did to "clean" the reads (what trimming program, settings etc).

BTW: The options for exclusion I included are for the blastn command itself. You will include that db when you run the blast search.

Remove the e-value restriction to see what is the lowest e-value you get for hits before you go about adding that in. Sample your fasta file (5-10K sequences) so the blast can finish quickly in this investigative phase.

**SNPsaurus** · 04-06-2015, 03:32 PM

hlyates, e-100 might not be possible with short reads. I just glanced at a recent run and reads trimmed to 93 got e-40 with a perfect match. Genomax is right: start with a test file, examine the results manually and narrow in on appropriate parameters knowing more about what sort of results to expect. That's a more important skill in bioinformatics than knowing how to run a particular script!

**hlyates** · 04-06-2015, 07:20 PM

Originally posted by GenoMax View Post

It is unlikely that individual reads are contaminated (unless some are chimeras).

You haven't told us the details of what exactly you did to "clean" the reads (what trimming program, settings etc).

BTW: The options for exclusion I included are for the blastn command itself. You will include that db when you run the blast search.

Remove the e-value restriction to see what is the lowest e-value you get for hits before you go about adding that in. Sample your fasta file (5-10K sequences) so the blast can finish quickly in this investigative phase.

Originally posted by GenoMax View Post

It is unlikely that individual reads are contaminated (unless some are chimeras).

You haven't told us the details of what exactly you did to "clean" the reads (what trimming program, settings etc).

BTW: The options for exclusion I included are for the blastn command itself. You will include that db when you run the blast search.

Remove the e-value restriction to see what is the lowest e-value you get for hits before you go about adding that in. Sample your fasta file (5-10K sequences) so the blast can finish quickly in this investigative phase.

I was asked to use galaxy tools, but for the commandline. Thus, I did the following:

Specifically, I was instructed (told) to do the following:
Groom: fasta to fasta
Trimming: trim ends 5' to 3', window size 3, step size 1, max number of bases to exclude 0, aggregate action for window min score, trim until aggregate score is >= 30
Filtering: min size 40, max size 0, min qual 30, max qual 0, max number of bases allowed outside of quality range 5

This is what I am going to do in the meantime:

Verify my cleaning met the parameters I gave above
Take 5 to 10K first reads of a fasta file (any will do). Run blast on it and see what lowest E value produced is?

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 39 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

How to run blastn against my data (and compare to ncbi nt library)?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News