  • Anders Myrvold Dahl
    Junior Member
    • Nov 2010
    • 6

    BLAST contamination search help

    Hi everyone.

    First of all, I'm new to the forum and new to the realm of bioinformatics.

    I'm currently working on a project where I'm set to analyse several sequenced E. coli strains.

    My first task is to check for contamination.
    Running web BLAST, I find many hits to bacteria other than E. coli with an E-value of 0 and 95-100% Max Ident.

    E. coli hits are, however, by far the most common.

    I need to get a general impression of contamination in all contigs for the 6 different E. coli strains I have, so I can decide whether I can do further analyses with the contigs unmodified or whether contamination needs to be removed.

    Since there are >100 contigs for each strain and web BLAST output is limited to one strain at a time, doing this through the web interface is not feasible.

    Therefore I've installed blast+ and blastall locally (unix) and downloaded the nr database.

    When running blastall -i trh9.fna -p blastn -d nr -o result.txt

    I get an almost empty result.txt file as output.

    Have I installed the nr database correctly, or is something wrong with my syntax?

    I've downloaded all the archives (nr.00, nr.01, etc.) and put them in a db directory.

    The input file is a standard(?) fasta formatted file.

    Tips, pointers, help would be greatly appreciated.


    Anders
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    Just a minor point, you can indeed run NCBI "legacy" standalone BLAST like this:

    Code:
    blastall -p blastn ...
    If you want to use the "new" standalone BLAST+ it would be:

    Code:
    blastn ...
    As to the fact you are getting an almost empty result file, this is probably due to using different settings compared to the web blast. Check things like the gap parameters, evalue threshold, and so on.
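
    For reference, a minimal sketch of how the legacy command from the first post maps onto BLAST+ options: -i becomes -query, -d becomes -db and -o becomes -out. The helper name blastplus_cmd is made up for illustration; it only prints the command line and does not run BLAST.

```shell
# Hypothetical helper: print the BLAST+ equivalent of
#   blastall -p PROGRAM -i QUERY -d DB -o OUT
# It only echoes the command line; nothing is executed.
blastplus_cmd() {
    program=$1
    query=$2
    db=$3
    out=$4
    printf '%s -query %s -db %s -out %s\n' "$program" "$query" "$db" "$out"
}

# The command from the first post, rewritten for BLAST+:
blastplus_cmd blastn trh9.fna nr result.txt
```

    Whether nr is the right database for blastn is a separate question, which comes up further down the thread.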

    • Anders Myrvold Dahl
      Junior Member
      • Nov 2010
      • 6

      #3
      To be more precise; the short output file that is produced only contains

      BLASTN 2.2.24 [Aug-08-2010]


      Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
      Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
      "Gapped BLAST and PSI-BLAST: a new generation of protein database search
      programs", Nucleic Acids Res. 25:3389-3402.

      Query= contig00001 length=3139 numreads=1128
      (3139 letters)


      So it seems the query is only the first contig of the fasta file, which contains >100 contigs. I need to get all the contigs to be processed.

      So basically there's zero output, and the computational time is very brief.
      Obviously, I'm doing something incorrectly.

      Don't know if adjusting the evalue or gap score would do anything here.

      Also should I go with blast+ instead of legacy?


      Sorry if I'm asking obvious questions, but I've googled my butt off lately, and there seems to be little info to be found.

      Also, am I using the right blast program?
      I'm supposed to run the nucleotide data against the nr database.
      Seeing that the nr database is a protein database, should I be running blastx?
      But when I did the search using web BLAST and got ample results, I was using nucleotide blast (i.e. blastn)...
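
      A quick sanity check for the "only the first contig" symptom is to compare the number of records in the FASTA file with the number of queries that made it into the report. This is a minimal sketch; the function names are made up, and it assumes the default pairwise report format, where each query starts a line with "Query=".

```shell
# Count FASTA records: every record starts with a '>' header line.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Count queries in a default (pairwise) BLAST report, where each
# query is introduced by a line beginning with "Query=".
count_blast_queries() {
    grep -c '^Query=' "$1"
}

# Example usage, with the file names from this thread:
#   count_fasta_records trh9.fna
#   count_blast_queries result.txt
```

      If the second number is smaller than the first, the run stopped early rather than finding no hits.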

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Originally posted by Anders Myrvold Dahl:
        To be more precise; the short output file that is produced only contains

        BLASTN 2.2.24 [Aug-08-2010]


        Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
        Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
        "Gapped BLAST and PSI-BLAST: a new generation of protein database search
        programs", Nucleic Acids Res. 25:3389-3402.

        Query= contig00001 length=3139 numreads=1128
        (3139 letters)
        That looks truncated - you'd normally then get some matches or it would say "no hits", then the next queries, and a footer at the end.

        There were no error messages? This is odd - but see below.

        Originally posted by Anders Myrvold Dahl:
        Also should I go with blast+ instead of legacy?
        I would certainly recommend you try it. The NCBI are (I think) currently still supporting legacy BLAST, but only in the short term. You'll have to switch to BLAST+ at some point, so it would be sensible to do it now.
        Originally posted by Anders Myrvold Dahl:
        Also, am I using the right blast program?
        I'm supposed to run the nucleotide data against the nr database.
        Seeing that the nr database is a protein database, should I be running blastx?
        But when I did the search using web BLAST and got ample results, I was using nucleotide blast (i.e. blastn)...
        Yes, use blastx -- blastn is for nucleotide query against nucleotide database. There is a nice summary of the different blast programs here:
        The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
        Last edited by maubp; 11-18-2010, 01:25 PM. Reason: typo

        • Anders Myrvold Dahl
          Junior Member
          • Nov 2010
          • 6

          #5
          I've tried running both blastn and blastx from the blast+ package against the nr database now.

          It seems blastn gives me an indexing error (because the database contains proteins?).

          Blastx executes, but nothing happens.

          I.e. I have to press Ctrl+C to stop the process. There is no output in either the command window or the output file.

          And yes, I have tried letting the process run for a while...
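
          One way to check the "because the database is proteins?" guess is to look at the file names in the database directory. Preformatted protein databases use .phr/.pin/.psq files (with a .pal alias for multi-volume sets), and nucleotide databases use .nhr/.nin/.nsq (with a .nal alias). The sketch below relies only on that naming convention; the function name blast_db_type is made up. As far as I know, the default nucleotide search on the NCBI web site runs against the nucleotide collection, which is distributed as nt rather than nr, and that would explain why web blastn gave plenty of hits.

```shell
# Guess the type of the BLAST database at prefix $1 (e.g. Blast/db/nr)
# from file names alone. It does not open or validate the files.
blast_db_type() {
    prefix=$1
    if [ -e "$prefix.pal" ] || [ -e "$prefix.pin" ] || [ -e "$prefix.00.pin" ]; then
        echo protein
    elif [ -e "$prefix.nal" ] || [ -e "$prefix.nin" ] || [ -e "$prefix.00.nin" ]; then
        echo nucleotide
    else
        echo unknown
    fi
}

# Example usage, with the path from this thread:
#   blast_db_type Blast/db/nr
```

          If I remember correctly, running blastdbcmd -db Blast/db/nr -info from the BLAST+ package reports the same information directly.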

          • cascoamarillo
            Senior Member
            • Oct 2010
            • 164

            #6
            Hi,

            Don't you get the database you're using listed in the output after the query, like this:


            Database: genome.fa
            139,530 sequences; 107,332,603 total letters

            Maybe it's the path to the database...

            Comment

            • Anders Myrvold Dahl
              Junior Member
              • Nov 2010
              • 6

              #7
              I'm pretty confident the database path is correct.

              The database should also be blast-formatted; i.e. I've downloaded the nr.00.tar.gz, etc. archives from the ftp://ftp.ncbi.nlm.nih.gov/blast/db/ site.

              I've run blastdbcheck and get the following output:


              Writing messages to file (test.txt) at verbosity (Summary)
              ISAM testing is ENABLED.
              Legacy testing is DISABLED.
              By default, testing 200 randomly sampled OIDs.

              Testing 5 volume(s).
              /home/andersmy/Blast/db/nr.00 / MetaData: [ERROR] caught exception.
              /home/andersmy/Blast/db/nr.01 / MetaData: [ERROR] caught exception.
              /home/andersmy/Blast/db/nr.02 / MetaData: [ERROR] caught exception.
              /home/andersmy/Blast/db/nr.03 / MetaData: [ERROR] caught exception.
              /home/andersmy/Blast/db/nr.04 / MetaData: [ERROR] caught exception.
              Result=FAILURE. 5 errors reported in 5 volume(s).
              Testing 1 alias(es).
              Result=SUCCESS. No errors reported for 1 alias(es).

              Total errors: 5

              Is there something wrong with the database that makes blastx crash?

              blastx -query Oppgave/trh52.fna -db Blast/db/nr -out result.txt

              This writes result.txt to disk, but there is no output in the command window, and the window freezes.

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Five errors from the five chunks of the NR database -- something is messed up.

                Can you also download the nr.*.md5 files and use the md5sum command line tool to verify the nr.*.tar.gz files downloaded correctly? They are just tiny little text files which contain a list of md5 checksums and filenames. e.g. "md5sum --check nr.00.tar.gz.md5" should calculate the md5 checksum for nr.00.tar.gz, and thus spot if it was corrupted on download.
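
                To check all the archives in one go, something like the following could work. It is a minimal sketch that assumes GNU md5sum and that each .md5 file sits next to its archive in the same directory; the function name verify_md5_dir is made up.

```shell
# Run "md5sum --check" on every *.md5 file in directory $1.
# Prints one OK/FAILED line per archive and returns non-zero
# if any checksum does not match.
verify_md5_dir() {
    dir=$1
    rc=0
    for sum in "$dir"/*.md5; do
        [ -e "$sum" ] || continue          # no .md5 files at all
        name=$(basename "$sum")
        # The .md5 file lists a relative file name, so check from inside $dir.
        if (cd "$dir" && md5sum --check "$name" >/dev/null 2>&1); then
            echo "OK      ${name%.md5}"
        else
            echo "FAILED  ${name%.md5}"
            rc=1
        fi
    done
    return $rc
}

# Example usage, with the directory from this thread:
#   verify_md5_dir /home/andersmy/Blast/db
```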

                • Anders Myrvold Dahl
                  Junior Member
                  • Nov 2010
                  • 6

                  #9
                  I've downloaded the nr.0*.tar.gz files once more as well as the md5 files, and reinstalled the database files.

                  I've performed the md5sum --check on all files and they're all ok.

                  Still I get the same error message from blastdbcheck after extracting these archives to my database directory.

                  And when I run blastx with the nr database, again the command interface just freezes.

                  I've tested downloading another nucleotide fasta file from NCBI, and blastx still freezes, so the input should not be to blame here. So somehow there's something funky with the database...

                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    Hmm. Have you tried another database? e.g. the NCBI vector nucleotide database is very small.

                    • Anders Myrvold Dahl
                      Junior Member
                      • Nov 2010
                      • 6

                      #11
                      I've run blastn successfully with my Fasta files using the vector database.

                      blastn checks all the contigs in my fasta file against the vector database and produces a smooth output file!


                      I've been told to use the non-redundant one though, and more importantly, I have to assess which of the hits are probable contamination rather than horizontal gene transfer.

                      I'm at a loss as to how to tell these two apart. But I was told that any eukaryotic matches would very likely be contamination of the E. coli strains.

                      Perhaps I should start a new thread regarding the contamination issue?

                      Or are there any good sources I should check out on the web?

                      Also, seeing there are >100 contigs in each file, is there an easy way to make a truncated list with only the best hit for each contig, based on some conditions, say only eukaryotic genomes?
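
                      For the "best hit per contig" part, one possible approach is to ask BLAST+ for tabular output (-outfmt 6) and reduce it with awk. The sketch below assumes the default 12-column tabular layout, where column 1 is the query ID and column 12 is the bit score, and it keeps the highest-scoring line for each query. The function name best_hit_per_query is made up. Restricting hits to eukaryotes would need taxonomy information that this default layout does not carry, so that part is not covered here.

```shell
# Keep only the highest bit-score hit for each query in a BLAST+
# tabular report (-outfmt 6). Column 1 = query ID, column 12 = bit score.
# Queries are printed in the order they first appear in the file.
best_hit_per_query() {
    awk -F '\t' '
        {
            if (!($1 in score)) {
                order[++n] = $1
                score[$1] = $12 + 0
                line[$1] = $0
            } else if ($12 + 0 > score[$1]) {
                score[$1] = $12 + 0
                line[$1] = $0
            }
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}

# Example usage (hits.tsv is a hypothetical file name):
#   blastn -query Oppgave/trh52.fna -db vector -outfmt 6 -out hits.tsv
#   best_hit_per_query hits.tsv
```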

                      • maubp
                        Peter (Biopython etc)
                        • Jul 2009
                        • 1544

                        #12
                        It is good that blastn worked with the small NCBI provided vector database. That seems to confirm your installation of BLAST+ is OK.

                        My guess is that your machine does not have enough RAM to do a search against a large database like NR.
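
                        This does not test the RAM guess directly, but one way to tell a search that is merely very slow from one that is truly stuck is to try a single contig against NR first. A minimal sketch for pulling the first few records out of a FASTA file follows; the function name fasta_head is made up. Comparing the output of du -sh on the database directory with free -m would give a rough idea of how the database size relates to the available memory.

```shell
# Print the first N records (default 1) of the FASTA file $1.
# A record is a '>' header line plus the sequence lines that follow it.
fasta_head() {
    awk -v n="${2:-1}" '
        /^>/ { seen++ }        # count header lines
        seen > n { exit }      # stop at the start of record N+1
        { print }
    ' "$1"
}

# Example usage (one_contig.fna is a hypothetical file name):
#   fasta_head Oppgave/trh52.fna 1 > one_contig.fna
#   blastx -query one_contig.fna -db Blast/db/nr -out one_contig.txt
```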
