  • Converting QIIME formatted reference database to be used with BLAST

    Hi there,

    I have a custom database of fungal reference sequences formatted to be used with the RDP classifier in QIIME.

    I would like to be able to merge them into a larger BLAST database for local use.

    This involves merging the FASTA file and the taxonomy map (after that I've got it under control).

    Any idea how to do this? I was hoping RDP had some options but it doesn't seem that way...

    Thanks!

  • #2
    It depends how you want to merge your files. Do you just want to concatenate them? For this you could simply use "cat" under Linux.


    • #3
      Thanks, I thought about cat, but they need to be merged in a more complicated way, i.e.:

      In the FASTA file, each sequence comes with a header that links it to an entry in the taxonomy map.

      Sequence file:

      >AB015711
      GTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTATAAGCAAGTATACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTATTTGATAGTGCCTTACTACTTGGATAACC...

      Links with taxonomy file:

      AB015711 Fungi;Glomeromycota;Archaeosporomycetes;Archaeosporales;Ambisporaceae;Ambispora_leptoticha

      So... I need to replace the headers in the FASTA file with the entries in the taxonomy file.
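
For readers without BioPerl, the header replacement described above could be sketched in plain Python. This is only an illustration under assumptions: the function names `load_taxonomy` and `merge_headers` are made up here, the taxonomy map is assumed to be split at the first run of whitespace, and the toy sequence is a shortened stand-in for the real one.

```python
import io

def load_taxonomy(handle):
    """Map each sequence ID to its taxonomy string (split at the first whitespace)."""
    taxonomy = {}
    for line in handle:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:              # skip blank or malformed lines
            taxonomy[parts[0]] = parts[1]
    return taxonomy

def merge_headers(fasta_handle, taxonomy):
    """Yield FASTA lines, appending the taxonomy string to each matching header."""
    for line in fasta_handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if seq_id in taxonomy:       # headers without an entry stay unchanged
                line = f">{seq_id} {taxonomy[seq_id]}"
        yield line

# Toy data modelled on the example above (sequence shortened for illustration).
fasta = io.StringIO(">AB015711\nGTAGTCATATGC\n")
tax_map = io.StringIO(
    "AB015711 Fungi;Glomeromycota;Archaeosporomycetes;Archaeosporales;"
    "Ambisporaceae;Ambispora_leptoticha\n"
)

merged = list(merge_headers(fasta, load_taxonomy(tax_map)))
print("\n".join(merged))
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles and the output redirected to a new FASTA file.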


      • #4
        So, in your example you want to end up with:

        >AB015711 Fungi;Glomeromycota;Archaeosporomycetes;Archaeosporales;Ambisporaceae;Ambispora_leptoticha
        GTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTATAAGCAAGTATACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTATTTGATAGTGCCTTACTACTTGGATAACC..

        In this case you have to be careful, as BLAST will probably truncate your header at the first space, leaving you with the same information you had before. I would rather suggest using an abbreviation of the species name, something like
        >AB015711_Amlep
        You will probably have to replace the IDs in your taxonomy file afterwards, too, but it will be the same amount of work as before.
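
One way to generate such space-free IDs automatically is sketched below. The abbreviation rule (first two letters of the genus plus first three of the species epithet) is only inferred from the `Amlep` example, and `short_id` is a hypothetical helper name. Different species can collapse to the same tag, but the accession prefix keeps the full ID unique.

```python
def short_id(accession, taxonomy):
    """Build an ID such as AB015711_Amlep from the last rank of the taxonomy string."""
    species = taxonomy.strip().split(";")[-1]      # e.g. Ambispora_leptoticha
    genus, _, epithet = species.partition("_")
    tag = genus[:2] + epithet[:3]                  # assumed rule: 2 + 3 letters
    return f"{accession}_{tag}" if tag else accession

line = ("AB015711 Fungi;Glomeromycota;Archaeosporomycetes;Archaeosporales;"
        "Ambisporaceae;Ambispora_leptoticha")
accession, taxonomy = line.split(None, 1)
new_id = short_id(accession, taxonomy)
print(new_id)   # AB015711_Amlep
```

Keeping a table of old and new IDs while renaming makes it straightforward to update the taxonomy file with the same mapping.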

        Concerning your original question:
        I have a small Java program that does almost exactly what you want. It would take just minutes to modify it to your needs (you would have to give me an example of your exact output first, of course). However, I'm currently at a conference and will not be able to send it before Tuesday.

        So, if no one has a faster solution, just give me a quick reminder.


        • #5
          If you just need to change the FASTA headers, you can do something like this. It requires a two-column, tab-delimited 'annotation' file, which is probably already the format of your taxonomy file. The script is essentially a find and replace for FASTA headers; I modified the Perl script from this thread: http://stackoverflow.com/questions/1...h-another-name . You will need to have BioPerl installed.

          Code:
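          #!/usr/bin/perl
          =usage

          reformat_fasta_headers.pl -f fasta_file -a annotation_file > renamed.fasta

          The annotation file has 2 tab-delimited columns: find col 1 and replace with col 2.

          =cut

          use strict;
          use warnings;
          use Bio::SeqIO;
          use Getopt::Long;

          # set command line arguments
          my ($fasta, $annot);
          my $version = "reformat_fasta_headers.pl\tv0.0.1";
          GetOptions(
          	'f|fasta=s' => \$fasta,
          	'a|annot=s' => \$annot,
          	'v|version' => sub { print $version."\n"; exit; },
          ) or die "Usage: $0 -f fasta_file -a annotation_file\n";

          # fall back to positional arguments if the options were not given
          $fasta //= shift @ARGV;
          $annot //= shift @ARGV;
          die "Usage: $0 -f fasta_file -a annotation_file\n" unless $fasta && $annot;

          # read the annotation file into a hash: old ID => new header
          # (lines that do not match "ID whitespace text" are skipped)
          open my $fh, '<', $annot or die "Cannot open $annot: $!";
          my %annot = map { /^(\S+)\s+(.+)/ ? ($1 => $2) : () } <$fh>;
          close $fh;

          my $in = Bio::SeqIO->new( -file => $fasta, -format => 'Fasta' );

          # print each record, swapping in column 2 where the ID has an entry
          while ( my $seq = $in->next_seq() ) {
              my $seqID = $annot{ $seq->id } // $seq->id;
              print ">$seqID\n" . $seq->seq . "\n";
          }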

          I have a question for you, jstrohm: how did you manage to create the custom fungal ITS database with the correct taxonomy information? I'm working on a similar problem now: we have several thousand ITS sequences from various projects and would like to get them into a format that can be used with QIIME, RDP or UTAX (usearch8).


          • #6
            Thanks for the script! I'll give it a shot. I have a feeling that even when I merge the two, BLAST won't like the header formats anyway...

            I'm afraid my answer to your question isn't helpful. I was given a QIIME-formatted version of the MaarjAM database, so I didn't have to create my own taxonomy file. http://maarjam.botany.ut.ee/

            I have actually been generating my own "libraries" for mapping reads to taxonomy assigned OTUs in usearch8.

            Briefly, I assign IDs to my OTUs using SILVA/SINA or BLAST/UNITE. Then I use MEGAN 5 to pick the "most correct" IDs.

            From MEGAN 5 I choose "Export -> DSV -> read names, taxon paths".

            Then, with some fiddling in Excel, I add the taxonomy paths to the headers in my OTU FASTA file.
            Last edited by jstrohm; 11-09-2014, 08:23 AM.
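
The Excel step described above could probably be scripted. The sketch below turns a two-column export into a tab-delimited "name, taxon path" file. It rests on assumptions: the export is taken to have the read name in column 1 and the taxon path in column 2, the delimiter must be set to whatever was chosen in MEGAN, spaces in taxon names are replaced with underscores to keep headers space-free, and the OTU names and paths are made up for illustration.

```python
import csv
import io

def dsv_to_annotation(dsv_handle, out_handle, delimiter="\t"):
    """Write 'read_name<TAB>taxon_path' lines; return how many rows were written."""
    written = 0
    for row in csv.reader(dsv_handle, delimiter=delimiter):
        if len(row) < 2 or not row[0].strip():
            continue                                 # skip blank or incomplete rows
        name = row[0].strip()
        path = row[1].strip().replace(" ", "_")      # assumption: no spaces wanted
        out_handle.write(f"{name}\t{path}\n")
        written += 1
    return written

# Hypothetical export rows (comma-delimited here).
export = io.StringIO(
    "OTU_1,Fungi;Glomeromycota;Glomeromycetes\n"
    "OTU_2,Fungi;Ascomycota\n"
)
annotation = io.StringIO()
count = dsv_to_annotation(export, annotation, delimiter=",")
print(annotation.getvalue(), end="")
```

The resulting file has the same two-column shape as the taxonomy map discussed earlier in the thread, so the same header-renaming approach can be applied to the OTU FASTA file.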
