Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • extract data from fasta-files with perl??

    Hello,

    I'm very knew to the subject and this is my first question ever:

    I'm working with 454 data. I translated my contigs to protein sequences, created a blastdatabase with theses and used a bunch of downloaded proteins as a query.

    Now I have an output file with "hit-contigs". My aim is to extract theses hit-contigs from the original fasta-file with all contigs.

    As far as I'm informed things like that can be done automatically by applying perl scripts. Unfortuneately I have no idea how.

    I'm very thankful for help an of course open to learn

    anna

  • #2
    I guess the general strategy would be:
    1. Read all of your hit contig names into a hash
    2. Read through your fasta sequences one at a time
    3. If the contig name matches one of your hit contigs then keep it, otherwise throw it away


    If you have the BioPerl modules installed then reading through a fasta file and writing out selected entries is pretty trivial.

    Comment


    • #3
      Originally posted by anna_ View Post
      Hello,

      I'm very knew to the subject and this is my first question ever:

      I'm working with 454 data. I translated my contigs to protein sequences, created a blastdatabase with theses and used a bunch of downloaded proteins as a query.

      Now I have an output file with "hit-contigs". My aim is to extract theses hit-contigs from the original fasta-file with all contigs.

      As far as I'm informed things like that can be done automatically by applying perl scripts. Unfortuneately I have no idea how.

      I'm very thankful for help an of course open to learn

      anna
      Anna,

      You can do this yourself and it would be a good learning exercise, but since you have already made a BLAST database of the contigs, NCBI has kindly provided tools for doing exactly what you want.

      Create a text file of the contig IDs you want to extract, one ID per line, no other information in the file. Be careful to use the same IDs as BLAST for your contigs. We'll call this file "myContigList.txt".

      The command to use depends on whether you are using the old school (C-Toolkit) BLAST or the new BLAST+. These ancillary commands should have been installed when you installed BLAST

      Old school use the command "fastacmd"

      Code:
      $ fastacmd -d myBlastDBName -p protein -i myContigList.txt -o myHitContigs.fasta
      You can omit the '-p protein' and let the command guess the DB type.

      For the new BLAST+ distribution use "blastdbcmd"

      Code:
      $ blastdbcmd -db myBlastDBName -dbtype prot -entry_batch myContigList.txt -outfmt %f -out myHitContigs.fasta
      Again you could omit the '-dbtype prot' and let the program guess. The -outfmt %f tells the program to output sequences in FASTA format; you could also omit this since this is the default output format.

      Comment


      • #4
        Thank you both,

        I'm trying and the NBCI tool solution:

        I use blast+. To get the text-file with the ID I used the command -outfmt "10 qgi". I when I use this output as the entry batch file I get the massage: "no valid entries to search". I added a ">" to every ID, but still there is the same error.

        I don't know what's wrong.

        Comment


        • #5
          Originally posted by anna_ View Post
          Thank you both,

          I'm trying and the NBCI tool solution:

          I use blast+. To get the text-file with the ID I used the command -outfmt "10 qgi". I when I use this output as the entry batch file I get the massage: "no valid entries to search". I added a ">" to every ID, but still there is the same error.

          I don't know what's wrong.
          Anna,

          You don't want to use outfmt 10 (comma separated values), you want outfmt 6 (tabular data w/o headers). Also you don't want to output the query gi (qgi), you want to report the accessions of the hits (subject) you find. (Your hits won't have gi's since you are working with a custom database, not an NCBI db.) You also want to capture all of the hits for your query so sallacc is the outfmt you want.

          Your blastx command should look like

          Code:
          $ blastx -db myDB -query myQuery -out myContigList.txt -outfmt "6 sallacc" [other blastx options]
          If you had multiple sequences in your query file the resultant myContigList.txt file may have the same contig ID listed more than once (more than one of your queries may have match the same contig). To eliminate the redundancies in this list you can use the sort command with the unique filter.

          Code:
          $ sort -u myContigList.txt > myContigList_uniq.txt
          You could now use the myContigList_uniq.txt file in the blastdbcmd described in my earlier post.

          Hope this helps.

          Comment


          • #6
            Thank you,

            it works. I feel quite a bit ashamed because of my stupidity but I'll go on with asking and learning (hopefully), because I have no other choice.

            thanks again!

            Comment


            • #7
              Hi everybody,

              Thank you, Anna, I wanted to ask exactly the same question !

              Thank you, kmcarr, for you detailed answer. But even I rigourously followed your wy, I get the "no valide entries to search" message....

              What could be the reason ? Do you have any idea ?

              Thank you for your help

              Alex

              Comment


              • #8
                Originally posted by aliealexandre View Post
                Hi everybody,

                Thank you, Anna, I wanted to ask exactly the same question !

                Thank you, kmcarr, for you detailed answer. But even I rigourously followed your wy, I get the "no valide entries to search" message....

                What could be the reason ? Do you have any idea ?

                Thank you for your help

                Alex
                Alex,

                Can you post a few lines of the list file containing the accessions you are trying to look up. Also post a few lines from the following command:

                Code:
                blastdbcmd -db <your_blast_db_name> -outfmt %a -entry all
                Does the format of the accessions in your query file look like the accessions as they appear in the BLAST database?

                Comment


                • #9
                  If I use formatdb command instead makeblastdb, I should use the -o T option, right? I didn't know why my sequences were'nt found in a certain database...

                  Comment


                  • #10
                    Originally posted by jordi View Post
                    If I use formatdb command instead makeblastdb, I should use the -o T option, right? I didn't know why my sequences were'nt found in a certain database...
                    Yes, I believe that is true. Without using "-o T" there will be no index of the SeqIDs (accessions) for fastacmd to look up accessions in. You would have to use gi numbers (if they existed in the original fasta file) to retrieve sequences.

                    Comment


                    • #11
                      kmcarr

                      These are some lines from the blastdbcmd command :

                      BL_ORD_ID:0
                      BL_ORD_ID:1
                      BL_ORD_ID:2
                      BL_ORD_ID:3
                      BL_ORD_ID:4
                      BL_ORD_ID:5

                      And these from the text file comtaining accessions :

                      jgi|Nemve1|24|gw.52.2.1
                      jgi|Nemve1|45|gw.188.2.1
                      jgi|Nemve1|45|gw.188.2.1
                      jgi|Nemve1|45|gw.188.2.1
                      jgi|Nemve1|45|gw.188.2.1
                      jgi|Nemve1|45|gw.188.2.1

                      Of course these ID are exactly the same than in my db

                      Thank you very much for your help

                      Alex

                      Comment


                      • #12
                        Originally posted by aliealexandre View Post
                        kmcarr

                        These are some lines from the blastdbcmd command :

                        BL_ORD_ID:0
                        BL_ORD_ID:1
                        BL_ORD_ID:2
                        BL_ORD_ID:3
                        BL_ORD_ID:4
                        BL_ORD_ID:5

                        And these from the text file comtaining accessions :

                        jgi|Nemve1|24|gw.52.2.1
                        jgi|Nemve1|45|gw.188.2.1
                        jgi|Nemve1|45|gw.188.2.1
                        jgi|Nemve1|45|gw.188.2.1
                        jgi|Nemve1|45|gw.188.2.1
                        jgi|Nemve1|45|gw.188.2.1

                        Of course these ID are exactly the same than in my db

                        Thank you very much for your help

                        Alex
                        Alex,

                        The format of the accessions in your query file (jgi|Nemve1|45|gw.188.2.1) differs from the accessions in your BLASTdb (BL_ORD_ID:5) which is why the search is failing.

                        Fortunately jordi's question above reminded me of an important point which explains why. When a BLASTdb is created the default behavior is to NOT parse and index the accessions (Seq-IDs in BLAST terminology). In that case generic seq_ids will be created by the program (e.g. BL_ORD_ID:0). Without this index of accessions you can't perform searches by accession. You need to make a new BLASTdb but add the option "-parse_seqids" to your makeblastdb command.

                        I noticed that you list the same accession 5 times in your query list (jgi|Nemve1|45|gw.188.2.1), why is that? See my post above about how to make your the entries in your query list unique using the sort command. Also, while it is acceptable to use the full id string (jgi|Nemve1|45|gw.188.2.1) for your query, you can use just the accession portion of the name, gw.188.2.1, to achieve the same result.

                        Comment


                        • #13
                          Great !! it works

                          Thank you, kmcarr, for your very helpfull and very understandable advices.

                          See you

                          Alex

                          Comment


                          • #14
                            kmcarr, another question about makeblastdb from anna again this time:

                            I should explain the procedure at first:

                            I'm searching for proteins that are related to drought stress in a huge conifer transcriptome.

                            I searched the UniProt and the NCBI databases for proteins with "drought" as a keyword and downloaded the hits in as fasta-files. I concatenated the two files with the unix cat command. I guess a lot of Proteins will appear many fold in the concatenated file and that will have consequences for my blast statistics. So I applied a perl script to remove redundancy. Since I cannot write perl myself (I'm learning), I took the script from the O'reilly BLAST book.

                            #!/usr/bin/perl
                            use Bio::SeqIO;

                            my %NR;
                            my $file = Bio::SeqIO->new(-fh => \*ARGV);
                            while (my $fasta = $file->next_seq){
                            my $def = $fasta->id ." ".$fasta->desc;
                            $NR{$fasta->seq}{$def} = 1;
                            }
                            for my $seq(keys %NR){
                            print ">", join(chr(1),keys
                            %{$NR{$seq}}),"\n",$seq,"\n";
                            }

                            After that the sequences were unique and the identifiers were all in one line after one another:

                            >gi|57908279|gb|AAW59263.1| water-stress inducible protein 3 [Pinus taeda]gi|57908277|gb|AAW59262.1| water-stress inducible protein 3 [Pinus taeda]gi|57908297|gb|AAW59272.1| water-stress inducible protein 3 [Pinus taeda]gi|57908325|gb|AAW59286.1| water-stress inducible protein 3 [Pinus taeda]gi|57908331|gb|AAW59289.1| water-stress inducible protein 3 [Pinus taeda]
                            DENDNPPSEVVYSETTTAYGDEVIQSSDVYATGNVNSDEYEKARKEEKHHKHMEEVGGLGTMATGAFALHEKHAEKKDPEHAHRHKIEEEIAAAAAVGEGG

                            I was happy and applied:

                            makeblastdb -in protein_drought_db_uniq -dbtype prot -parse_seqids -logfile protein_drought_db_uniq.logfile > protein_drought_db_uniq

                            and then I got:

                            Blast database creation error: Error: Duplicate seq_ids are found.

                            Do you or anybody else have an idea how solve this dilemma?

                            Comment


                            • #15
                              Very interesting question Anna!
                              I think it could be very useful to customize Uniprot data in order to obtain a column with the NCBI cross-reference ID, since Uniprot allows this option.. but I tried it and it doesn't work! I was unable to obtain the NCBI id's
                              Maybe, a Blast using one dataset as db and the other one as query could be useful parsing a 100% identity alignment with a given alignment length...
                              Or what about this??

                              I mean, you should obtain a list as follows:
                              From To
                              Q9M2S6 AL132975
                              Q5BPS3 AC00716
                              FROM Uniprot Id's you got TO NCBi's Id's.

                              hope this helps.
                              Last edited by jordi; 02-23-2011, 02:40 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              39 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              38 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X