  • Prokka - contig name too long even after --centre option

    I want to use Prokka to annotate a Velvet assembly of a Pseudomonas aeruginosa genome. When I try to run it, I get this error:

    Contig ID must <= 20 chars long: NODE_1_length_41061_cov_17.678381
    [11:20:10] Please rename your contigs or use --centre XXX to generate clean contig names.

    So I added the --centre XXX option to the command. But now I get this error:

    Contig ID must <= 20 chars long: gnl|XXX|PROKKA_contig000001
    [11:22:08] Please rename your contigs or use --centre XXX to generate clean contig names.

    So what do I do now? Is there a way to make the name shorter?

    I appreciate any help
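A quick way to see which contig IDs trip the limit is to measure them directly. The sketch below assumes the 20-character limit quoted in the error message and treats the first word after ">" as the contig ID; the function name `too_long_ids` is made up for illustration and is not part of Prokka.

```python
MAX_ID_LEN = 20  # limit quoted in the Prokka error message (assumption: same in your version)


def too_long_ids(fasta_lines, max_len=MAX_ID_LEN):
    """Return contig IDs (first word after '>') that are longer than max_len."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            contig_id = parts[0] if parts else ""
            if len(contig_id) > max_len:
                ids.append(contig_id)
    return ids


# The two IDs from the error messages, plus a short name for comparison.
example = [
    ">NODE_1_length_41061_cov_17.678381",
    "ACGT",
    ">gnl|XXX|PROKKA_contig000001",
    "ACGT",
    ">contig000001",
    "ACGT",
]

for contig_id in too_long_ids(example):
    print(len(contig_id), contig_id)
```

Both the original Velvet name (33 characters) and the generated gnl|XXX|PROKKA_contig000001 name (27 characters) exceed 20, which is consistent with the second error appearing even with --centre.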

  • #2
    I would try renaming all the contigs in the input FASTA file before calling Prokka,

    e.g. NODE_1_length_41061_cov_17.678381 --> contig000001
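The renaming suggested here can be done with a few lines of Python. This is only a sketch: the function name `rename_contigs`, the `contig` prefix and the six-digit padding are illustrative choices, and the file names in the usage note are placeholders.

```python
def rename_contigs(fasta_lines, prefix="contig", width=6):
    """Replace every FASTA header with a short sequential ID.

    Returns the renamed lines (without trailing newlines) and a dict
    mapping each new ID to the original header, so the old names are not lost.
    """
    renamed = []
    mapping = {}
    count = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            new_id = f"{prefix}{count:0{width}d}"
            mapping[new_id] = line[1:].strip()
            renamed.append(f">{new_id}")
        else:
            renamed.append(line)
    return renamed, mapping


demo = [
    ">NODE_1_length_41061_cov_17.678381\n",
    "ACGTACGT\n",
    ">NODE_2_length_100_cov_3.5\n",
    "GGCC\n",
]
new_lines, old_names = rename_contigs(demo)
print("\n".join(new_lines))
```

For a real file, read it with `open("contigs.fa")`, pass the file object to `rename_contigs`, and write the joined lines to a new file; keeping the returned mapping lets you trace annotations back to the original Velvet names.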


    • #3
      I also got an answer from the developer team.

      So maybe this helps other people who run into the same problem in the future:

      1) Try using the "--compliant" option (and do NOT use --centre)

      2) Or try "--compliant --prefix XX"

      3) Or try "--compliant --prefix XX --centre XX"


      • #4
        I tried using "--compliant" and "--centre XX" alone and in combination and it didn't work.


        • #5
          Try downloading BBMap and running this command:

          rename.sh in=contigs.fa out=renamed.fa prefix=contig


          • #6
            We use [strain#]_AS[AS#]_CO[contig#]

            It is extremely important to have a naming scheme for the contigs/scaffolds that is both descriptive and consistent, because all downstream analysis depends on it.

            Unfortunately, NCBI names are a bit too long and usually have whitespace before the contig#, which makes them unsuitable as a human- and machine-readable FASTA ID.

            In our case we usually use the following contig names:

            >[strain_name]_AS[AS#]_CO[contig#] {some optional description/orig name/etc}*

            like:

            >NRRL2338_AS1006_CO1

            For scaffolds we use SC instead of CO:

            >NRRL2338_AS1006_SC1

            So if you have assembled DH10B and the multi-FASTA file of your assembly #5 contains:
            >contig0001
            ...
            >contig0012

            you do a search and replace of
            >contig00
            with
            >DH10B_AS5_CO

            PS: in the case of long strain names they may need a bit of shortening.

            Contig renaming happens after assembly selection/polishing.

            For data downloaded from NCBI/EMBL, we do the renaming in Kate or a similar text editor that supports regular expressions. You can also use Perl one-liners (search for "perl search and replace") if you are familiar with Perl regular expressions.

            Being consistent in sequence naming has a huge impact on all types of downstream analysis (BLAST, mapping, annotation).
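The same search and replace can be scripted with a regular expression instead of a text editor. The sketch below is one possible way to do it, not the poster's actual workflow: `apply_scheme` is a hypothetical helper, it assumes the input headers look like `>contigNNNN`, and it drops leading zeros so the result matches the unpadded style of `>NRRL2338_AS1006_CO1` (a literal replace of `>contig00` would keep two digits instead).

```python
import re

# Assumed input header style: ">contig" followed by a zero-padded number.
HEADER = re.compile(r"^>contig0*(\d+)", flags=re.MULTILINE)


def apply_scheme(fasta_text, strain, assembly, kind="CO"):
    """Rewrite '>contigNNNN' headers as '>[strain]_AS[assembly]_[kind][number]'.

    kind is 'CO' for contigs or 'SC' for scaffolds, as in the scheme above.
    Anything after the contig ID on the header line is left untouched.
    """
    def replace(match):
        number = int(match.group(1))  # int() removes any leading zeros
        return f">{strain}_AS{assembly}_{kind}{number}"

    return HEADER.sub(replace, fasta_text)


text = ">contig0001\nACGT\n>contig0012 original description\nGGCC\n"
print(apply_scheme(text, "DH10B", 5))
```

Passing `kind="SC"` gives the scaffold variant of the same names.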
