  • merging data from velvet emboss and blastp analyses

    Hi, I hope someone could help me sort this out.

    I am working with a marine viral metagenome and am trying to extract some information from the dataset.

    I have three files:
    1) contigs.fa: the contigs assembled with Velvet (176502 contigs in total), with IDs such as >NODE_77_length_69_cov_4.985507

    2) contigs.orf.fa: the ORFs predicted with EMBOSS, so the node above now has these IDs:
    >NODE_77_length_69_cov_4.985507_1 [2 - 97]
    >NODE_77_length_69_cov_4.985507_2 [3 - 98]
    >NODE_77_length_69_cov_4.985507_3 [1 - 99]
    >NODE_77_length_69_cov_4.985507_4 [98 - 3] (REVERSE SENSE)
    >NODE_77_length_69_cov_4.985507_5 [97 - 2] (REVERSE SENSE)
    >NODE_77_length_69_cov_4.985507_6 [99 - 1] (REVERSE SENSE)

    3) contigs.orf.fa.blastp: the blastp output.

    What I want to do is create a spreadsheet with the node ID from contigs.fa in the first column, the ORF ID in the next column (so every possible ORF for every single node) and, in the last column, the corresponding blastp result for that ORF.

    Is there a script or a way to do this?

    F.

  • #2
    It is very unlikely that there exists a script to do what you want!

    You will need a bespoke solution. It could probably be done using a series of Unix tools, but ultimately a custom Perl/Python script would be quickest.

    The fact is, a lot of basic bioinformatics is massaging data into the form you want.



    • #3
      man grep
      man cut
      man sort
      man paste

      should take you quite far..
      savetherhino.org



      • #4
        thank you I'll try... got tons of things to learn



        • #5
          Originally posted by flacchy
          thank you I'll try... got tons of things to learn

          I've found cut to be an especially helpful command.

          For example, cut -f 1,2,4 file.txt > output.txt writes columns 1, 2 and 4 of file.txt to output.txt, assuming tab-separated fields. With your FASTA files you'll first need to extract only the header lines, e.g. grep '^>' file.fasta > headers.txt ('>' only starts header lines), and so on.
          savetherhino.org
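
For anyone who would rather script the header-extraction step than chain grep and cut, here is a minimal Python sketch. The function names `fasta_ids` and `node_of` are made up for illustration, and the sketch assumes each ORF ID is simply the contig ID with `_<number>` appended, as in the headers posted at the top of the thread.

```python
def fasta_ids(lines):
    """Yield the ID (first token after '>') of every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def node_of(orf_id):
    """Drop the trailing _<number> to recover the contig (node) ID.

    Assumption: ORF IDs are the contig ID plus "_<n>", as in the
    example headers posted in this thread.
    """
    return orf_id.rsplit("_", 1)[0]
```

Passing an open file handle for contigs.orf.fa to `fasta_ids` and calling `node_of` on each result gives the first two columns of the spreadsheet.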



          • #6
            So I know I can extract the IDs from contigs.fa and contigs.orf.fa.
            The problem is: how can I extract the information I want from the blast output?
            I have a huge file like this:

            "BLASTP 2.2.25 [Feb-01-2011]


            Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
            Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
            "Gapped BLAST and PSI-BLAST: a new generation of protein database search
            programs", Nucleic Acids Res. 25:3389-3402.
            Reference for compositional score matrix adjustment: Altschul, Stephen F.,
            John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis,
            Alejandro A. Schaffer, and Yi-Kuo Yu (2005) "Protein database searches
            using compositionally adjusted substitution matrices", FEBS J. 272:5101-5109.

            Query= NODE_26_length_67_cov_1.000000_6 [97 - 2] (REVERSE SENSE)
            (32 letters)

            Database: MicroB3_Viral_proteins
            120,896 sequences; 31,818,017 total letters

            Searching..................................................done



                                                                             Score    E
            Sequences producing significant alignments:                      (bits) Value

            ref|YP_214377.1|genome recombination endonuclease subunit [Proch...    60   5e-13
            ref|YP_004322284.1|genome gp47 gene product [Synechococcus phage...    47   2e-08"


            How do I extract the information I want? I tried sed like this:

            $ sed -n '/node*/,/searching/p' contigs.orf.fa.blastp > output.csv
            or

            $ sed -n '/node*/,/blastp/p' contigs.orf.fa.blastp > output.csv

            (because I want to extract everything between the node line and the point where the search is complete)

            but both print everything, and I can't figure out why.
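
A note on the sed attempts: sed patterns are case-sensitive, and `/node*/` means "nod" followed by zero or more "e", so it does not match `NODE`, and `/searching/` does not match the capitalised `Searching`. A range whose end pattern never matches runs to the end of the input, which may be part of what is happening here.

If the default text output has to be parsed as it is, the following Python sketch shows one possible approach. It is based only on the excerpt pasted above: it assumes each record has a `Query=` line, that the hit list follows the `Sequences producing significant alignments` line, and that hit IDs look like `ref|ACCESSION|...`. The function name `parse_blast_text` is hypothetical.

```python
def parse_blast_text(lines):
    """Yield (query_id, hit_accession, bit_score, e_value) tuples.

    Sketch for the default blastall text report; the layout is assumed
    from the excerpt in this thread, not from a format specification.
    """
    query = None
    in_hits = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("Query="):
            tokens = line[len("Query="):].split()
            query = tokens[0] if tokens else None
            in_hits = False
        elif line.startswith("Sequences producing significant alignments"):
            in_hits = True
        elif in_hits and line:
            if line.startswith(">"):
                # The alignment section starts here, so the hit list is over.
                in_hits = False
                continue
            fields = line.split()
            if len(fields) < 3:
                continue
            evalue_text = fields[-1]
            if evalue_text.startswith("e"):
                # Older reports may print very small E-values as "e-100".
                evalue_text = "1" + evalue_text
            try:
                bits = float(fields[-2])
                evalue = float(evalue_text)
            except ValueError:
                continue
            id_parts = fields[0].split("|")
            # Assumption: IDs look like ref|ACCESSION|..., as in the excerpt.
            hit = id_parts[1] if len(id_parts) > 1 else id_parts[0]
            yield (query, hit, bits, evalue)
```

Rerunning the search with tabular output avoids this kind of parsing entirely and is usually the less fragile route.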



            • #7
              Which output format is that? Maybe you can change it to tabular with blast_formatter (in your blast bin)?
              savetherhino.org



              • #8
                I ran the blast search this way:
                $ blastall -p blastp -d MicroB3_Viral_proteins.faa -i contigs.orf.fa -o contigs.orf.fa.blastp

                Is that wrong? Should I try something different, or can I just convert the file?



                • #9
                  Originally posted by flacchy
                  I ran the blast search this way:
                  $ blastall -p blastp -d MicroB3_Viral_proteins.faa -i contigs.orf.fa -o contigs.orf.fa.blastp

                  Is that wrong? Should I try something different, or can I just convert the file?

                  I always use tabular output myself. If your output is in BLAST archive format, you can convert it with blast_formatter. If it's in some other format, you'll either have to rerun your blast or learn how to parse the output.
                  savetherhino.org



                  • #10
                    How can I run blast with tabular output? Can you tell me?



                    • #11
                      Originally posted by flacchy
                      How can I run blast with tabular output? Can you tell me?

                      In BLAST 2.2.28+ the flags are -outfmt 6 -out output.tsv

                      Read the manual. I think in your blast it's -m 8 instead of -outfmt 6, but I might be remembering wrong.
                      savetherhino.org
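
Once the output is tabular, each hit is one tab-separated line with twelve fields by default. Here is a short Python sketch for reading such a file; the column names follow the BLAST+ naming for the default tabular fields, and the function name `read_tabular` is made up for illustration.

```python
import csv

# Default fields of tabular BLAST output (-m 8 / -outfmt 6), BLAST+ names.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_tabular(lines):
    """Yield one dict per hit line, keyed by the names in COLUMNS.

    Values are left as strings; blank lines and '#' comment lines
    (present in the commented tabular variants) are skipped.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(COLUMNS, row))
```

Only the first two fields (query ID and subject ID) are needed for the spreadsheet asked about in the first post; the rest are handy for filtering, for example on E-value.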



                      • #12
                        ok thank you so much



                        • #13
                          I checked: it is -m 8 for my blastall version. It should be easier to extract the results now ^_^
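
With tabular output in hand, the three-column table from the first post can be put together roughly as follows. This is only a sketch under stated assumptions: ORF IDs are the contig ID plus `_<number>`, the first two tab-separated fields of each blast line are the query ID and the hit ID, and `build_rows` is a hypothetical helper name.

```python
from collections import defaultdict


def build_rows(orf_ids, blast_lines):
    """Yield (node_id, orf_id, hit_id) rows.

    orf_ids     -- iterable of ORF IDs taken from the contigs.orf.fa headers
    blast_lines -- iterable of tab-separated lines from tabular blast output
    ORFs without any hit still get one row, with an empty hit column.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            hits[fields[0]].append(fields[1])
    for orf in orf_ids:
        # Assumption: the node ID is the ORF ID without its trailing _<n>.
        node = orf.rsplit("_", 1)[0]
        for hit in hits.get(orf) or [""]:
            yield (node, orf, hit)
```

The rows can be written out with `csv.writer` using a tab delimiter, which gives a file that opens directly in a spreadsheet program.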



                          • #14
                            BLAST Manual

                            BLAST Command Line Applications User Manual
