  • extract fasta sequences from multifasta file using partial or gene names

    Hi All,

    I am working with PROKKA v1.12 files. I have a list of gene names such as

    sacX
    arcB
    metB
    sprT
    adrB_2
    fadD

    and my FASTA file looks like this:

    >BOKHJPML_00001 hypothetical protein
    ATGC
    >BOKHJPML_00002 hypothetical protein
    ATGC
    >BOKHJPML_00003 Protease HtpX
    ATGC
    >BOKHJPML_00006 ATP-dependent Clp protease ATP-binding subunit ClpC
    ATGC
    >BOKHJPML_00016 Inner membrane protein YfdC
    ATGC

    I want to extract the FASTA sequences for the genes in the list. I have tried following previous suggestions using faidx (https://www.biostars.org/p/126204/) and biopython (https://www.biostars.org/p/2822/), with no success. The faidx example is the closest I have come to success, but I get a string of warnings:

    warning: sacX not found in file
    warning: arcB not found in file
    warning: metB not found in file
    warning: sprT not found in file
    warning: adrB_2 not found in file
    warning: fadD not found in file

    Thanks in advance

  • #2
    One way would be to extract the full sequence headers from the sequence file using your IDs.
    Code:
    while read -r i; do grep -i -- "$i" sequence.fa >> ID_in_sequence_file; done < ./id_file
    Then use one of the methods you have found, or the faSomeRecords utility from Jim Kent, to extract the sequences.
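
The two-step idea above (first recover the full headers that mention each name, then pull out the matching records) can be sketched in plain Python. This is a minimal illustration, not the faSomeRecords implementation: the function names `parse_fasta`, `matching_headers`, and `extract_by_header` are hypothetical, and the gene names `htpX` and `yfdC` are used only because they occur in the example headers from the question. Like `grep -i`, the match is a case-insensitive substring test, restricted here to header lines.

```python
FASTA_TEXT = """\
>BOKHJPML_00001 hypothetical protein
ATGC
>BOKHJPML_00002 hypothetical protein
ATGC
>BOKHJPML_00003 Protease HtpX
ATGC
>BOKHJPML_00006 ATP-dependent Clp protease ATP-binding subunit ClpC
ATGC
>BOKHJPML_00016 Inner membrane protein YfdC
ATGC
"""


def parse_fasta(text):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def matching_headers(records, gene_names):
    """Step 1: full headers that contain any gene name, ignoring case."""
    wanted = [g.lower() for g in gene_names]
    return [h for h, _ in records if any(g in h.lower() for g in wanted)]


def extract_by_header(records, headers):
    """Step 2: keep only the records whose full header was selected."""
    keep = set(headers)
    return [(h, s) for h, s in records if h in keep]


records = list(parse_fasta(FASTA_TEXT))
headers = matching_headers(records, ["htpX", "yfdC"])  # illustrative names
selected = extract_by_header(records, headers)

for header, seq in selected:
    print(f">{header}\n{seq}")
```

Tools that look records up by sequence ID usually key on the text before the first space (the locus tag in Prokka output), which may be why the gene names in the question were reported as not found; matching against the whole header line sidesteps that.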

    • #3
      You can also use BBMap's filterbyname.sh tool, particularly if you have a long list of names:

      Code:
      filterbyname.sh in=file.fa out=filtered.fa include=t names=names.txt substring
      The "substring" flag allows partial matches.
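
To make the difference between exact and partial matching concrete, here is a small Python sketch of the filtering logic. It only illustrates the idea behind the `include` and `substring` options; it is not BBMap's code, the function `filter_records` is hypothetical, and the real tool has more options and may compare names to headers differently.

```python
def filter_records(records, names, include=True, substring=False):
    """Keep (include=True) or drop (include=False) records that match a name.

    With substring=False a name must equal the whole header; with
    substring=True it only has to occur somewhere inside the header.
    """
    names = list(names)

    def is_match(header):
        if substring:
            return any(name in header for name in names)
        return header in names

    return [(h, s) for h, s in records if is_match(h) == include]


# Example records taken from the question; NAMES is an illustrative list.
RECORDS = [
    ("BOKHJPML_00003 Protease HtpX", "ATGC"),
    ("BOKHJPML_00006 ATP-dependent Clp protease ATP-binding subunit ClpC", "ATGC"),
    ("BOKHJPML_00016 Inner membrane protein YfdC", "ATGC"),
]
NAMES = ["HtpX", "ClpC"]

exact = filter_records(RECORDS, NAMES)
partial = filter_records(RECORDS, NAMES, substring=True)
excluded = filter_records(RECORDS, NAMES, include=False, substring=True)

print(len(exact), len(partial), len(excluded))
```

With whole-header matching none of the short gene names match, which is why partial matching is needed when the list holds gene names rather than full header lines.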

      • #4
        Originally posted by Brian Bushnell
        You can also use BBMap's filterbyname.sh tool, particularly if you have a long list of names:

        Code:
        filterbyname.sh in=file.fa out=filtered.fa include=t names=names.txt substring
        The "substring" flag allows partial matches.
        Thanks so much Brian, this is very straightforward. Just a quick question about the tool. My gene name list contains some ambiguous names such as group_XXXX, because it is Roary output. Would setting substring=t or substring=name cause partial matches against the locus tags in the FASTA headers from the Prokka output? If so, is there a way to prevent this? I have been using the following command:

        Code:
        filterbyname.sh in=seqs.ffn out=test.fasta include=t names=list.txt substring=name casesensitive=f
        There are some sequences in my output that I feel should not be present, although I suspect the casesensitive flag is the most likely culprit.
        Thanks.

        • #5
          "substring=name" will consider a sequence to be a match if the sequence name contains any line in list.txt as a substring; in this case, it is also ignoring case. I suggest not ignoring case unless it's essential. Note that if you have any really short names in your file, like "A", they might match just about everything.

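
The pitfalls described in the reply above can be reproduced with a few lines of Python. The sketch below is an assumption-laden illustration, not a filterbyname.sh feature: `substring_hits` mimics plain substring matching with optional case folding, and `description_token_hits` is a hypothetical stricter check that ignores the locus tag (the first word of the header) and requires a name to equal a whole word of the description.

```python
HEADERS = [
    "BOKHJPML_00001 hypothetical protein",
    "BOKHJPML_00002 hypothetical protein",
    "BOKHJPML_00003 Protease HtpX",
    "BOKHJPML_00006 ATP-dependent Clp protease ATP-binding subunit ClpC",
    "BOKHJPML_00016 Inner membrane protein YfdC",
]


def substring_hits(headers, names, case_sensitive=True):
    """Headers that contain at least one name as a substring."""
    hits = []
    for header in headers:
        haystack = header if case_sensitive else header.lower()
        for name in names:
            needle = name if case_sensitive else name.lower()
            if needle in haystack:
                hits.append(header)
                break
    return hits


def description_token_hits(headers, names):
    """Stricter check: a name must equal a whole word of the description.

    The first word (the locus tag) is skipped, so names cannot match it.
    """
    wanted = set(names)
    hits = []
    for header in headers:
        parts = header.split(None, 1)
        description = parts[1] if len(parts) > 1 else ""
        if wanted & set(description.split()):
            hits.append(header)
    return hits


# A one-letter name matches every header once case is ignored.
print(len(substring_hits(HEADERS, ["A"], case_sensitive=False)))
# Keeping case sensitivity narrows the same name considerably.
print(len(substring_hits(HEADERS, ["A"])))
# A name that happens to occur in a locus tag matches by substring...
print(substring_hits(HEADERS, ["00003"]))
# ...but not when only whole words of the description are compared.
print(description_token_hits(HEADERS, ["00003"]))
```

Whether whole-word matching on the description is appropriate depends on how the names in the list relate to the header text, so treat it as one possible post-filter rather than a general fix.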
