  • merging data from velvet emboss and blastp analyses

    Hi, I hope someone could help me sort this out.

    I am working with a marine viral metagenome and I am trying to extract some information from the dataset.

    So I have three files:
    1) contigs.fa: the contigs assembled with Velvet (176502 contigs in total), for example: >NODE_77_length_69_cov_4.985507

    2) contigs.orf.fa: the ORFs found with EMBOSS, so the node above now has these IDs:
    >NODE_77_length_69_cov_4.985507_1 [2 - 97]
    >NODE_77_length_69_cov_4.985507_2 [3 - 98]
    >NODE_77_length_69_cov_4.985507_3 [1 - 99]
    >NODE_77_length_69_cov_4.985507_4 [98 - 3] (REVERSE SENSE)
    >NODE_77_length_69_cov_4.985507_5 [97 - 2] (REVERSE SENSE)
    >NODE_77_length_69_cov_4.985507_6 [99 - 1] (REVERSE SENSE)

    3) contigs.orf.fa.blastp: the blastp output.

    What I want to do is create a spreadsheet with the node ID from contigs.fa in the first column, the ORF ID in the next column (so every possible ORF for every single node), and finally the corresponding blastp result for each ORF that was used as a query.

    Is there a script or a way to do this?

    F.

  • #2
    It is very unlikely that there exists a script to do what you want!

    You will need a bespoke solution. It could probably be done using a series of Unix tools, but ultimately a custom Perl/Python script would be quickest.

    The fact is, a lot of basic bioinformatics is massaging data into the form you want.



    • #3
      man grep
      man cut
      man sort
      man paste

      should take you quite far..
      savetherhino.org



      • #4
        thank you I'll try... got tons of things to learn



        • #5
          Originally posted by flacchy View Post
          thank you I'll try... got tons of things to learn
          I've found cut to be an especially helpful command,

          e.g. cut -f 1,2,4 file.txt > output.txt would cut columns 1, 2 and 4 from file.txt into output.txt assuming tab separated fields. With your fasta files you'll first need to extract only the header lines, e.g. grep '>' file.fasta > output.fasta ('>' because unique to header lines).. and so on..
          savetherhino.org
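
For anyone who prefers a script to the grep/cut route above, here is a minimal Python sketch of the same header-extraction step. The function name `fasta_ids` is made up for illustration, and it assumes the ID is the first whitespace-delimited word of each header, which holds for the Velvet and EMBOSS headers quoted in this thread. Unlike `grep '>'`, it only looks at lines that start with '>'.

```python
def fasta_ids(lines):
    """Yield the ID (first word after '>') of every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            # Drop the leading '>' and keep only the first word, which discards
            # trailing annotations such as "[2 - 97] (REVERSE SENSE)".
            words = line[1:].split()
            if words:
                yield words[0]


# Headers taken from the thread; the sequence lines are placeholders.
example = [
    ">NODE_77_length_69_cov_4.985507_1 [2 - 97]",
    "MKV",
    ">NODE_77_length_69_cov_4.985507_4 [98 - 3] (REVERSE SENSE)",
    "LLS",
]
ids = list(fasta_ids(example))
print(ids)
```

With a real file you would pass an open file handle instead of the list, e.g. `fasta_ids(open("contigs.orf.fa"))`.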



          • #6
            So I know I can extract the IDs from contigs.fa and contigs.orf.fa.
            The problem is: how can I extract the information that I want from the blast output?
            I have a huge file like this:

            "BLASTP 2.2.25 [Feb-01-2011]


            Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
            Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
            "Gapped BLAST and PSI-BLAST: a new generation of protein database search
            programs", Nucleic Acids Res. 25:3389-3402.
            Reference for compositional score matrix adjustment: Altschul, Stephen F.,
            John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis,
            Alejandro A. Schaffer, and Yi-Kuo Yu (2005) "Protein database searches
            using compositionally adjusted substitution matrices", FEBS J. 272:5101-5109.

            Query= NODE_26_length_67_cov_1.000000_6 [97 - 2] (REVERSE SENSE)
            (32 letters)

            Database: MicroB3_Viral_proteins
            120,896 sequences; 31,818,017 total letters

            Searching..................................................done



            Score E
            Sequences producing significant alignments: (bits) Value

            ref|YP_214377.1|genome recombination endonuclease subunit [Proch... 60 5e-13
            ref|YP_004322284.1|genome gp47 gene product [Synechococcus phage... 47 2e-08 ""


            How do I extract the information I want??? I tried sed like this

            $ sed -n '/node*/,/searching/p' contigs.orf.fa.blastp > output.csv
            or

            $ sed -n '/node*/,/blastp/p' contigs.orf.fa.blastp > output.csv (because I want to extract information between node and after the searching is complete)

            but it outputs everything... I can't figure out why
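
A note on the sed attempt, plus a possible alternative. sed patterns are case sensitive, so `/searching/` does not match "Searching" and `/blastp/` does not match "BLASTP", and a range whose end pattern is never found runs to the end of the file. Also, `node*` means "nod" followed by zero or more "e", not "node followed by anything". That may explain why so much gets printed. If rerunning BLAST is not an option, the default text report can be parsed directly. Below is a rough Python sketch that assumes the report layout quoted above: a `Query=` line, then a "Sequences producing significant alignments" header, then one summary line per hit ending in score and E-value. The function name `parse_legacy_blast` is hypothetical.

```python
def parse_legacy_blast(lines):
    """Yield (query_id, hit_text, bit_score, e_value) from a default blastall report."""
    query = None
    in_hits = False
    seen_hit = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("Query="):
            query = line[len("Query="):].split()[0]
            in_hits = False
        elif line.startswith("Sequences producing significant alignments"):
            in_hits, seen_hit = True, False
        elif in_hits:
            if not line:
                # A blank line before the first hit is skipped;
                # a blank line after the hits ends the summary list.
                if seen_hit:
                    in_hits = False
                continue
            fields = line.rsplit(None, 2)  # description, score, E-value
            if len(fields) != 3:
                continue
            score, evalue = fields[1], fields[2]
            if evalue.startswith("e"):
                evalue = "1" + evalue  # legacy BLAST may print "e-100" for 1e-100
            try:
                yield query, fields[0], float(score), float(evalue)
                seen_hit = True
            except ValueError:
                in_hits = False  # not a summary line, so stop reading hits


# Lines copied from the report quoted in this thread.
report = """Query= NODE_26_length_67_cov_1.000000_6 [97 - 2] (REVERSE SENSE)
(32 letters)

Score E
Sequences producing significant alignments: (bits) Value

ref|YP_214377.1|genome recombination endonuclease subunit [Proch... 60 5e-13
ref|YP_004322284.1|genome gp47 gene product [Synechococcus phage... 47 2e-08

""".splitlines()

hits = list(parse_legacy_blast(report))
for hit in hits:
    print(hit)
```

Treat this as a starting point only: the text report was never meant to be machine readable, and a tabular output format is much easier to work with.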



            • #7
              Which output format is that? Maybe you can change it to tabular with blast_formatter (in your blast bin)?
              savetherhino.org



              • #8
                I ran the blast search this way:
                $ blastall -p blastp -d MicroB3_Viral_proteins.faa -i contigs.orf.fa -o contigs.orf.fa.blastp

                is that wrong???? should I try something different?? or can I just convert the file???



                • #9
                  Originally posted by flacchy View Post
                  I ran the blast search this way:
                  $ blastall -p blastp -d MicroB3_Viral_proteins.faa -i contigs.orf.fa -o contigs.orf.fa.blastp

                  is that wrong???? should I try something different?? or can I just convert the file???
                  I always use tabular output myself. If your output is in blast archive format, you can convert it with blast_formatter. However, if it's some other format, you either rerun your blast or learn how to parse your output..
                  savetherhino.org



                  • #10
                    how can I run blast in a tabular format??? can you tell me???



                    • #11
                      Originally posted by flacchy View Post
                      how can I run blast in a tabular format??? can you tell me???
                      In blast 2.2.28+ the flags are -outfmt 6 -out output.tsv

                      Read the manual. I think in your blast it's -m 8 instead of -outfmt 6, but I might be remembering wrong..
                      savetherhino.org
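
Once the output is tabular (`-m 8` in blastall, `-outfmt 6` in BLAST+), each hit is one tab-separated line with 12 standard columns, and reading it takes only a few lines. A small Python sketch follows; the function name `read_tabular` is hypothetical, and the example line is invented for illustration (only the query ID, hit accession and E-value are borrowed from the report quoted earlier).

```python
import csv
import io

# The 12 standard columns of BLAST tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_tabular(handle):
    """Yield one dict per hit line of tab-separated BLAST output."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and the '#' comment lines written by -m 9 / -outfmt 7.
        if row and not row[0].startswith("#"):
            yield dict(zip(COLUMNS, row))


# Invented example line, not real results.
sample = io.StringIO(
    "NODE_26_length_67_cov_1.000000_6\tref|YP_214377.1|\t90.62\t32\t3\t0"
    "\t1\t32\t10\t41\t5e-13\t60.1\n"
)
hits = list(read_tabular(sample))
print(hits[0]["qseqid"], hits[0]["sseqid"], hits[0]["evalue"])
```

All values come back as strings; convert `evalue` or `bitscore` with `float()` if you want to filter on them.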



                      • #12
                        ok thank you so much



                        • #13
                          I checked: it is -m 8 for my blastall version... it should be easier to extract the results now ^_^
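
For the spreadsheet asked about in the first post, one possible way to join the three sources is sketched below in Python. It assumes that an ORF ID is the contig ID plus a trailing `_<number>`, as in the EMBOSS headers quoted above, and that the BLAST results are already available as (query ID, hit ID) pairs, e.g. the first two columns of the tabular output. The names `contig_of` and `merge` are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def contig_of(orf_id):
    """'NODE_77_length_69_cov_4.985507_1' -> 'NODE_77_length_69_cov_4.985507'."""
    return orf_id.rsplit("_", 1)[0]


def merge(contig_ids, orf_ids, blast_pairs):
    """Return (contig_id, orf_id, hit_id) rows; empty strings mark missing ORFs or hits."""
    hits = defaultdict(list)
    for query_id, hit_id in blast_pairs:
        hits[query_id].append(hit_id)
    orfs = defaultdict(list)
    for orf_id in orf_ids:
        orfs[contig_of(orf_id)].append(orf_id)
    rows = []
    for contig in contig_ids:
        for orf in orfs.get(contig, [""]):
            for hit in hits.get(orf, [""]):
                rows.append((contig, orf, hit))
    return rows


# IDs taken from the thread; the pairing is only an illustration.
contigs = ["NODE_77_length_69_cov_4.985507", "NODE_26_length_67_cov_1.000000"]
orf_ids = ["NODE_77_length_69_cov_4.985507_1", "NODE_26_length_67_cov_1.000000_6"]
pairs = [
    ("NODE_26_length_67_cov_1.000000_6", "ref|YP_214377.1|"),
    ("NODE_26_length_67_cov_1.000000_6", "ref|YP_004322284.1|"),
]
rows = merge(contigs, orf_ids, pairs)

# Write a tab-separated table that any spreadsheet program can open.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["contig_id", "orf_id", "blast_hit"])
writer.writerows(rows)
print(out.getvalue())
```

An ORF with several hits gets one row per hit, and an ORF without hits keeps a row with an empty last column, so nothing from contigs.fa silently disappears. With 176502 contigs this fits comfortably in memory, but the row count grows with the number of hits kept per query.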



                          • #14
                            BLAST Manual

                            BLAST Command Line Applications User Manual
