
  • Finding the Longest ORF for all sequences in EMBOSS

    Hi,

    I have a transcriptome file consisting of around 39,000 DNA sequences, and I would like to find the longest ORF for each of them. I used EMBOSS getorf through its web service to find all possible ORFs. Because I kept the default minimum of 30 amino acids per peptide, I got many ORFs longer than 30 amino acids for each transcript. I would now like to retain only the longest peptide (the one with the most amino acids) for each sequence.

    How can I achieve that? Is there an alternative way to get only the longest ORF for each transcript? Kindly guide me.
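
One possible way to post-process the existing getorf output is sketched below. It is only a minimal illustration, not something posted in the thread. It assumes the ORFs are in a FASTA file and that getorf named each ORF as the transcript ID followed by `_<number>` (for example `contig1_3 [12 - 140]`); if your headers look different, the regular expression needs adjusting. The function names and the file names in the usage note are hypothetical.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def longest_orf_per_transcript(records):
    """Return {transcript_id: (header, sequence)} keeping the longest ORF.

    Assumption: the first word of each header is "<transcript>_<n>",
    which is how getorf is assumed to label ORFs here.
    Ties are resolved in favour of the ORF seen first.
    """
    best = {}
    for header, seq in records:
        orf_id = header.split()[0]
        transcript = re.sub(r"_\d+$", "", orf_id)
        if transcript not in best or len(seq) > len(best[transcript][1]):
            best[transcript] = (header, seq)
    return best


def write_longest(in_path, out_path):
    """Read getorf FASTA from in_path, write one ORF per transcript."""
    with open(in_path) as handle:
        best = longest_orf_per_transcript(read_fasta(handle))
    with open(out_path, "w") as out:
        for header, seq in best.values():
            out.write(">%s\n%s\n" % (header, seq))
```

For example, `write_longest("getorf_output.fasta", "longest_orfs.fasta")` would write one record per transcript, in the order the transcripts first appear in the input.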

  • #2
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    • #3
      Hi genomax,

      I downloaded the Python script and the corresponding XML file from GitHub by right-clicking the link and saving it as is. I then installed Biopython.
      But when I ran the command, I got the following error:

      Code:
      File "get_orfs_or_cdss.py", line 4
          <!DOCTYPE html>
          ^
      SyntaxError: invalid syntax
      This is how I gave the command,
      Code:
      python get_orfs_or_cdss.py $input_fasta smed_dd_v4.fasta $input_format FASTA $table 1 $ftype CDS $ends open $min_len 30 $strand both $mode top $out_nuc_file dd_nucleotide.fasta $out_prot_file dd_prot.fasta
      Kindly guide me


      • #4
        You didn't download the Python script, but an HTML file showing the Python script with nice colours etc. You need to use the "raw" link on GitHub, i.e.


        The resulting get_orfs_or_cdss.py file should be plain text and start with:

        Code:
        #!/usr/bin/env python
        """Find ORFs in a nucleotide sequence file.
        
        ...
        If it was unclear, in place of $input_fasta you would put the filename of your input FASTA file (and so on). i.e.

        Code:
        python get_orfs_or_cdss.py smed_dd_v4.fasta FASTA 1 CDS open 30 both top dd_nucleotide.fasta dd_prot.fasta
        (And yes, I know this is not a very friendly command line interface - it was written primarily for use via Galaxy and I have not yet had reason/time to go back and make this more Unix-like. Sorry)
        Last edited by maubp; 02-25-2015, 12:19 PM. Reason: Adding usage example
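
As an illustration of the difference between the two kinds of link, the sketch below rewrites a GitHub file-page URL (the one with `/blob/` in it, which serves the coloured HTML view) into the corresponding `raw.githubusercontent.com` URL. The function name is hypothetical, and the page URL in the usage note is derived from the raw URL by GitHub's usual URL layout rather than quoted from the thread.

```python
from urllib.parse import urlsplit


def github_raw_url(page_url):
    """Convert https://github.com/USER/REPO/blob/BRANCH/PATH to its raw URL."""
    parts = urlsplit(page_url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: USER / REPO / "blob" / BRANCH / PATH...
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub file page URL: %r" % page_url)
    user, repo, _blob, *rest = segments
    return "https://raw.githubusercontent.com/%s/%s/%s" % (user, repo, "/".join(rest))
```

Calling `github_raw_url("https://github.com/peterjc/pico_galaxy/blob/master/tools/get_orfs_or_cdss/get_orfs_or_cdss.py")` gives the plain-text location of the script, which is the file that should be saved and run.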


        • #5
          Hi Maubp,

          Do you mean that I have to copy the code from the plain-text page into an editor, save it as a Python script, and then run it with Python? Am I right?


          • #6
            Right click on the link Peter provided and then choose "save as" (or "save link as"). That will save the script file locally. You can then run it.


            • #7
              Originally posted by GenoMax
              Right click on the link Peter provided and then choose "save as" (or "save link as"). That will save the script file locally. You can then run it.
              Hi Genomax,

              I tried exactly what you said, but it threw the same error I described above.


              • #8
                Did you modify/try the command as Peter showed?


                • #9
                  Originally posted by GenoMax
                  Did you modify/try the command as Peter showed?
                  No, I didn't modify the command. I just ran it after saving the link. What has to be modified?


                  • #10
                    Code:
                    $ python get_orfs_or_cdss.py smed_dd_v4.fasta FASTA 1 CDS open 30 both top dd_nucleotide.fasta dd_prot.fasta


                    • #11
                      Originally posted by GenoMax
                      Code:
                      $ python get_orfs_or_cdss.py smed_dd_v4.fasta FASTA 1 CDS open 30 both top dd_nucleotide.fasta dd_prot.fasta
                      I tried the above command, but it still shows a syntax error.


                      • #12
                        We will have to wait for Peter to chime in then.


                        • #13
                          Originally posted by dena.dinesh
                          Hi Maubp,

                          Do you mean that I have to copy the code from the plain-text page into an editor, save it as a Python script, and then run it with Python? Am I right?
                          That should work but is unnecessarily complicated. As GenoMax suggested, right-clicking on the link https://raw.githubusercontent.com/pe...rfs_or_cdss.py in your browser should give you a save option. I'm puzzled about what went wrong; perhaps this depends on your web browser?

                          The simplest approach would be to download it at the command line with:
                          Code:
                          $ wget https://raw.githubusercontent.com/peterjc/pico_galaxy/master/tools/get_orfs_or_cdss/get_orfs_or_cdss.py
                          Check this worked with:

                          Code:
                          $ head get_orfs_or_cdss.py 
                          #!/usr/bin/env python
                          """Find ORFs in a nucleotide sequence file.
                          
                          get_orfs_or_cdss.py $input_fasta $input_format $table $ftype $ends $mode $min_len $strand $out_nuc_file $out_prot_file
                          
                          Takes ten command line options, input sequence filename, format, genetic
                          code, CDS vs ORF, end type (open, closed), selection mode (all, top, one),
                          minimum length (in amino acids), strand (both, forward, reverse), output
                          nucleotide filename, and output protein filename.
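
If wget is not available, the same download and check could be done from Python itself. The following is a rough sketch using only the standard library; the helper names are hypothetical, and the HTML test is a simple heuristic based on the `<!DOCTYPE html>` line reported in the error earlier in the thread.

```python
import urllib.request

RAW_URL = ("https://raw.githubusercontent.com/peterjc/pico_galaxy/"
           "master/tools/get_orfs_or_cdss/get_orfs_or_cdss.py")


def looks_like_html(text):
    """Heuristic: does the content start like a web page rather than a script?"""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def fetch_script(url=RAW_URL, dest="get_orfs_or_cdss.py",
                 opener=urllib.request.urlopen):
    """Download url to dest, refusing to save an HTML page by mistake.

    The opener argument exists only so the function can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    with opener(url) as response:
        data = response.read()
    if looks_like_html(data.decode("utf-8", errors="replace")):
        raise ValueError("downloaded an HTML page, not the script; use the raw link")
    with open(dest, "wb") as handle:
        handle.write(data)
    return dest
```

Running `fetch_script()` in the working directory should leave a plain-text `get_orfs_or_cdss.py` whose first line is the `#!/usr/bin/env python` line shown above.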