Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Questions about Blast on Unix Shell

    Part of my Masters project involves the use of deep-amplicon sequencing of the 16S rRNA gene with the Illumina MiSeq platform to generate reads that can be annotated to OTUs. This way, I can characterize microbial communities in groundwater wells and determine whether there's bacteria that come from animal manure and human sewage.

    I'm also taking a bioinformatics course where I have to prepare a small-scale paper related to my Masters thesis using bioinformatics tools. As such, I wanted to focus on the Bacteroidales bacteria in the groundwater wells I'm working with. Bacteroidales bacteria are fecal markers that contain populations exclusively found in the feces of different warm-blooded animals, from cows to humans.

    I've just started getting my hands dirty on the bioinformatics pipeline. I've already learned how to create a BLAST nucleotide database on a Unix shell with the follow command:

    /usr/local/blast/2.2.29/bin/makeblastdb -in mydatabase.fasta -dbtype nucl

    Now I want to make an entire BLAST database that contains all fecal Bacteroidales sequences. I don't want to copy and paste all the Bacteroidales FASTA sequences one by one since it would be time-consuming. Is there a way to streamline that process using the BLAST program?

    Also, I'm working with qPCR assays that target these Bacteroidales populations. I've already done an in silico analysis of each primer pair I'm working with using Primer Blast. Is there a way to perform Primer Blast on UNIX and then paste all the matching sequences onto a fasta file?

    Thanks a ton!

  • #2
    Originally posted by Naphtap View Post
    Now I want to make an entire BLAST database that contains all fecal Bacteroidales sequences. I don't want to copy and paste all the Bacteroidales FASTA sequences one by one since it would be time-consuming. Is there a way to streamline that process using the BLAST program?
    Well, where are you getting your sequences from? Usually you can obtain all of them together in one FastA file which you then give to Blast.


    Also, I'm working with qPCR assays that target these Bacteroidales populations. I've already done an in silico analysis of each primer pair I'm working with using Primer Blast. Is there a way to perform Primer Blast on UNIX and then paste all the matching sequences onto a fasta file?
    I do not use this NCBI service but given that it is a combination of Primer3 and Blast I do not think that there is a standalone program available for it.

    Comment


    • #3
      @Naphtap: You should be able to subset the nt/nr databases using the blastdb_alias program that is part of the Blast+ suite. That will need gi# for all Bacteriodales you are interested in.

      Comment


      • #4
        Originally posted by GenoMax View Post
        @Naphtap: You should be able to subset the nt/nr databases using the blastdb_alias program that is part of the Blast+ suite. That will need gi# for all Bacteriodales you are interested in.
        I have the basic code for blasting my sequences (YourSequences.fasta will be replaced with the name of the fasta file I'm working with) which is shown below.

        /usr/local/blast/2.2.29/bin/blastn -query YourSequences.fasta -db yourDatabase -out blastn.outfmt6 -evalue 1e-20 -num_threads 6 -max_target_seqs 1 -outfmt 6

        I also referenced the BLAST manual and they do indeed give me the ability to subset the nt/nr databases, with the example command below:
        $ blastdb_aliastool -db nematode_mrna -gilist c_elegance_mrna.gi -dbtype \
        nucl -out c_elegance_mrna -title "C. elegans refseq mRNA entries"

        Now, where it says c_elegance_mrna.gi and nematode_mrna, I can replace that with the Bacteroidales gis I'm working with yes? Also, do I have to integrate this command into the blast command I placed above, or does running this command already limit the blast search to the subset I set beforehand?

        Originally posted by westerman View Post
        Well, where are you getting your sequences from? Usually you can obtain all of them together in one FastA file which you then give to Blast.
        I'm currently looking for the Bacteroidales fasta, but I can definitely look for those ones as well.

        Comment


        • #5
          blastdb_alias program will create a subset blast database that you will then use for your subsequent blast searches. You will run this command independently.

          You can grab a list of bacteriodales GI's by doing an entrez search such as this http://www.ncbi.nlm.nih.gov/nuccore/?term=Bacteroidales.

          Click on "Send to" and then choose "File" --> "Format" --> "gi list". That will download a file with all gi's which you can then feed blastdb_alis tool for the -gilist option.

          You can also get a fasta file for these sequences using blastdbcmd program and the gi list file.

          Comment


          • #6
            Originally posted by GenoMax View Post
            blastdb_alias program will create a subset blast database that you will then use for your subsequent blast searches. You will run this command independently.

            You can grab a list of bacteriodales GI's by doing an entrez search such as this http://www.ncbi.nlm.nih.gov/nuccore/?term=Bacteroidales.

            Click on "Send to" and then choose "File" --> "Format" --> "gi list". That will download a file with all gi's which you can then feed blastdb_alis tool for the -gilist option.

            You can also get a fasta file for these sequences using blastdbcmd program and the gi list file.
            Ok so I performed the following command for the blastdb_alias program as follows, since it can create a subset blast database I can use for future blast searches (this one's for all 16S rRNA genes in the BLAST database):

            blatsdb_aliastool -db not -gilist /home/gradstd5/16s.gi.txt -dbtype nucl -out 16S_rrna -title "16S_rRNA"

            After running the command, 16S_rrna.n.gil file was among the outputs. First, what is this binary format the program's telling me?

            "Converted 8847324 GIs from /home/gradstd5/16S.gi.txt to binary format in 16s_rrna.n.gil"

            Also, it gives me this error:

            "BLAST Database error: BLASTDB alias file creation failed. Some referenced files may be missing." Is it something wrong with the .gi.txt file I downloaded from BLAST after searching for all 16S_rrna genes, related to the way I wrote the command, or is it something else?

            Comment


            • #7
              "BLAST Database error: BLASTDB alias file creation failed. Some referenced files may be missing."

              sounds like just what its says, those GIs were missing from the DB. Depends on the relative ages of the DBs you use. It shouldn't be very may GIs so I probably wouldn't worry about it.

              Personally I don't like using the aliastool, it means you have to muck around with large DBs to get the tiny subset you actually want. I prefer to just do the enrez search and download all the fastas then make the db.

              Sorry, I don't know what the binary file is, or how to get the aliastool output into a db.. but you could try doing it the way I do. You can also use esearch and efetch from NCBI (google it to install on your system). For example the following search (simply copied search text from the box in Entrez web search) gives 5337 files.

              esearch -db nucleotide -query '("Bacteroides"[Organism] OR Bacteroides[All Fields]) AND (16S[All Fields] AND ribosomal[All Fields]) NOT genome[All Fields]' | efetch -format fasta > bact16s.fna

              S.

              Comment


              • #8
                Just an afterthought - my experience with 16S and taxonomic identification is that within Bacteroides you are unlikely to get better resolution than the genus level. Are you sure that 16S is the best approach. If you are looking at a higher taxonomic level to profile human Vs animal Faeces with e.g. some indicator species then it might work ok.

                S.

                Comment


                • #9
                  Originally posted by susanklein View Post
                  Just an afterthought - my experience with 16S and taxonomic identification is that within Bacteroides you are unlikely to get better resolution than the genus level. Are you sure that 16S is the best approach. If you are looking at a higher taxonomic level to profile human Vs animal Faeces with e.g. some indicator species then it might work ok.

                  S.
                  Since I'm working with 16S sequences, I also considered just obtaining all of the 16S sequences instead from the BLAST database using the commands I used today. I'm also going to get more sequencing data from the other water and fecal samples I'm working with, so I was also considering performing comparative analyses like PCA and clustering analysis with R after the sequencing annotations are complete.

                  The things I want to do with Bacteroidales (the order) isn't set in stone. It was just an idea I was thinking of because it's one of the fecal indicator markers I'm working with in my qPCR assays. Should I go this route, I would focus on seeing which types of Bacteroidales are in the water samples (human, chicken, pig, etc). Plus, I don't have to work with just Bacteroidales indicator. I also want to search for other indicators and perhaps waterborne pathogens if I go this way.

                  Regardless of the direction I take, I'd first trim the low-quality sequences from my Illumina Sequencing file for the forward and reverse reads (FASTQ format) and merge the corresponding reads together.

                  Comment


                  • #10
                    As an addendum, the server I'm working with already has edirect, but the unix shell is telling me they don't have esearch...

                    EDIT: Never mind, I solved it.
                    Last edited by Naphtap; 09-23-2015, 06:35 PM.

                    Comment


                    • #11
                      Originally posted by Naphtap View Post
                      Regardless of the direction I take, I'd first trim the low-quality sequences from my Illumina Sequencing file for the forward and reverse reads (FASTQ format) and merge the corresponding reads together.
                      It would be better to merge the data first and then trim. BBMerge (from BBMap) or FLASH are options to look at.

                      Comment


                      • #12
                        @Naphtap: If you are only interested in rRNA then you could download them from RNACentral: http://rnacentral.org/search?q=expert_db:%22RDP%22. Use the "Download" button on the right to get a FASTA format file.

                        Comment


                        • #13
                          I feel as though I'm really close to performing the blast search, but this error keeps getting in my way:

                          BLAST Database error: No alias or index file found for nucleotide database [16s.fasta] in search path [/home/gradstd5::]

                          I generated the 16s.fasta with the FASTA format file I got from rnacentral link posted by GenoMax. I then placed the following command, receiving the above error as a result:

                          /usr/local/blast/2.2.29/bin/blastn -query unibac.fasta -db 16s.fasta -out blastn.outfmt6 -evalue 1e-5 -num_threads 6 -max_target_seqs 1 -outfmt 6

                          Comment


                          • #14
                            Can you post a listing of files that start with "16s.fasta*"? If the blast index got created correctly then there should be multiple files with that basename.

                            Comment


                            • #15
                              Originally posted by GenoMax View Post
                              Can you post a listing of files that start with "16s.fasta*"? If the blast index got created correctly then there should be multiple files with that basename.
                              If I perform just ls 16s.fasta on its own on Unix, then I only have this one file.

                              However, I also tried formatdb -p F -i 16s.fasta -n 16sBLASTdb, which resulted in the creation of three files:

                              16sBLASTdb.nhr
                              16sBLASTdb.nin
                              16sBLASTdb.nsq

                              I know formatdb's used to construct a BLAST database from a fasta file, which is what I tried to do here.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 11:49 AM
                              0 responses
                              15 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-24-2024, 08:47 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              61 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X