  • How to split a FASTA file according to a list of gene IDs

    Hi ALL,

    I have a FASTA file and I want to split it into two FASTA files according to a list of sequence names in a text file (one sequence name per line). Sequences whose names match the list should go to one FASTA file and all the others to a second file.

    Could anybody provide a script or a program to do this? There are some online tools, but it would take a long time to upload my file.

    Thanks.

  • #2
    If your sequences aren't split to multiple lines you can do this with grep. I think:

    grep -A 1 -f yourSeqIDFile.txt yourFastaFile.fasta > SeqsFromIDList.fasta
    grep -A 1 -v -f yourSeqIDFile.txt yourFastaFile.fasta > TheOtherSeqs.fasta

    might remember wrong..


    If you have QIIME, you can do this with filter_fasta.py..
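A note on the "aren't split to multiple lines" condition: `grep -A 1` prints only the single line after each match, so it assumes every record is exactly one header line plus one sequence line. Also, `grep -f` matches each ID anywhere in a line, so an ID such as `seq1` would also hit `seq10`. If your file is wrapped, a minimal sketch like the one below (the function name `unwrap_fasta` is made up for illustration) joins each record's sequence onto one line first:

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


example = [">seq1 first\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]
for header, seq in unwrap_fasta(example):
    print(header)
    print(seq)
```

For a real file you would pass an open file handle instead of the example list and redirect the printed output to a new FASTA file.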
    Last edited by rhinoceros; 08-13-2013, 08:56 AM.
    savetherhino.org


    • #3
      Here is a script I wrote a while back that does almost what you want. It takes as input a FASTA file, a text file with a list of sequence IDs (one per line), and a mode argument to include or exclude the IDs in your list from the output. You could simply run the script twice, once in each mode, to get the two complementary outputs, or, if you feel like it, modify the code to generate two output files. As it works now, output is written to STDOUT, so you can only capture one output per run by redirecting STDOUT to a file.

      Code:
      Usage:
      
      % subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]
      
      Example:
      
      % subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
      % subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta
      If you do not specify a -mode argument the script defaults to the 'include' mode.

      A note about ID matching: the script bases a match on the first whitespace-delimited token on the defline. If your defline is:

      Code:
      >sequenceID sequence description follows
      The script will only attempt to match 'sequenceID', so make sure that is the text in your list file.
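For readers who cannot get the attachment, here is a rough Python sketch of the behaviour described above. It is not the attached Perl script; the names `record_id` and `subset_fasta` are invented for this example, and the only things taken from the post are the 'i'/'e' modes, the default to include, and matching on the first whitespace-delimited token of the defline.

```python
import io


def record_id(defline):
    """Return the first whitespace-delimited token after the leading '>'."""
    fields = defline[1:].split()
    return fields[0] if fields else ""


def subset_fasta(fasta_lines, ids, mode="i"):
    """Yield the lines of records whose ID is in ids ('i') or not in ids ('e')."""
    if mode not in ("i", "e"):
        raise ValueError("mode must be 'i' or 'e'")
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Decide once per record; sequence lines follow their header.
            keep = (record_id(line) in ids) == (mode == "i")
        if keep:
            yield line


fasta = io.StringIO(
    ">sequenceID sequence description follows\nACGT\nTTGA\n>other\nGGCC\n"
)
print("".join(subset_fasta(fasta, {"sequenceID"}, mode="i")), end="")
```

Because the decision is made per record rather than per line, this works on wrapped (multi-line) sequences as well.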
      Attached Files
      Last edited by kmcarr; 08-13-2013, 10:16 AM. Reason: Add note about default mode.


      • #4
        Originally posted by lran2008
        Hi ALL,

        I have a FASTA file and I want to split it into two FASTA files according to a list of sequence names in a text file (one sequence name per line). Sequences whose names match the list should go to one FASTA file and all the others to a second file.

        Could anybody provide a script or a program to do this? There are some online tools, but it would take a long time to upload my file.

        Thanks.
        Try this: https://code.google.com/p/nash-bioin...ta.pl&can=2&q=

        Hopefully it will do the job you need.

        J
        Last edited by JohnN; 08-13-2013, 10:19 AM. Reason: Wrong URL


        • #5
          Originally posted by rhinoceros
          If your sequences aren't split to multiple lines you can do this with grep. I think:

          grep -A 1 -f yourSeqIDFile.txt yourFastaFile.fasta > SeqsFromIDList.fasta
          grep -A 1 -v -f yourSeqIDFile.txt yourFastaFile.fasta > TheOtherSeqs.fasta

          might remember wrong..


          If you have QIIME, you can do this with filter_fasta.py..
          Thanks. The second command didn't work.
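A likely reason the second command fails: `-v` inverts the match line by line, and sequence lines never contain an ID, so the sequence lines of the listed records are selected too. On top of that, `-A 1` prints the line after each selected line as context, which can pull the listed headers back in. One way around this is to decide per record instead of per line. The sketch below is only an illustration (the function name `split_fasta` is hypothetical); it writes both output files in a single pass and matches on the first whitespace-delimited token of each header.

```python
import os
import tempfile


def split_fasta(fasta_path, id_path, matched_path, unmatched_path):
    """Write listed records to matched_path and all others to unmatched_path."""
    with open(id_path) as handle:
        ids = {line.strip() for line in handle if line.strip()}
    counts = {"matched": 0, "unmatched": 0}
    with open(fasta_path) as fasta, \
            open(matched_path, "w") as hit, \
            open(unmatched_path, "w") as miss:
        out = None  # lines before the first header are ignored
        for line in fasta:
            if line.startswith(">"):
                fields = line[1:].split()
                seq_id = fields[0] if fields else ""
                if seq_id in ids:
                    out, key = hit, "matched"
                else:
                    out, key = miss, "unmatched"
                counts[key] += 1
            if out is not None:
                out.write(line)
    return counts


# Small self-contained demo using throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    fasta_file = os.path.join(tmp, "yourFastaFile.fasta")
    id_file = os.path.join(tmp, "yourSeqIDFile.txt")
    with open(fasta_file, "w") as handle:
        handle.write(">a some description\nACGT\nTTGA\n>b\nGGCC\n")
    with open(id_file, "w") as handle:
        handle.write("a\n")
    print(split_fasta(
        fasta_file,
        id_file,
        os.path.join(tmp, "SeqsFromIDList.fasta"),
        os.path.join(tmp, "TheOtherSeqs.fasta"),
    ))
```

The file names in the demo simply reuse the ones from the grep commands; the returned counts are a quick sanity check that every record ended up in one of the two files.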


          • #6
            Originally posted by kmcarr
            Here is a script I wrote a while back to almost do what you want. It takes as input a FASTA file, a text file with a list of sequence IDs (one per line) and a mode argument to include or exclude the IDs in your list from the output. You could simply run the script twice, once in each mode to get the two complementary outputs, or if you feel like it modify the code to generate two output files. As it works now output is written to STDOUT so you can only capture one output by redirecting STDOUT to a file.

            Code:
            Usage:
            
            % subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]
            
            Example:
            
            % subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
            % subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta
            If you do not specify a -mode argument the script defaults to the 'include' mode.

            A note about ID matching: the script bases a match on the first non-white space delimited text on the defline. If your defline is:

            Code:
            >sequenceID sequence description follows
            The script will only attempt to match 'sequenceID' so make sure that is the text in list file.
            Thanks very much. The script works perfectly!


            • #7
              In case anyone needs more alternatives, you can also use fastq_select.tcl, which is bundled with MIRA. This was also discussed in an earlier thread, which might be useful.


              • #8
                If you want a Galaxy solution, try this:


                Or this related but subtly different tool which pulls out the reads in the ID order given
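To make the "ID order" difference concrete: a plain filter keeps records in the order they appear in the FASTA file, whereas an ordered selection follows the ID list. The sketch below is not the Galaxy tool's code, just an illustration of the idea under simple assumptions (the function name `select_in_id_order` is made up, the whole file is held in memory, and if an ID occurs twice in the FASTA file the later record wins).

```python
def select_in_id_order(fasta_lines, ordered_ids):
    """Return (defline, sequence) pairs in the order the IDs are listed."""
    records = {}
    current = None
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = [line, []]
        elif current is not None:
            records[current][1].append(line.strip())
    # IDs that are not present in the FASTA input are skipped silently.
    return [
        (records[seq_id][0], "".join(records[seq_id][1]))
        for seq_id in ordered_ids
        if seq_id in records
    ]


lines = [">a\n", "AA\n", ">b desc\n", "CC\n", "GG\n", ">c\n", "TT\n"]
for defline, seq in select_in_id_order(lines, ["c", "a", "missing"]):
    print(defline)
    print(seq)
```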


                • #9
                  Originally posted by maubp
                  If you want a Galaxy solution, try this:


                  Or this related but subtly different tool which pulls out the reads in the ID order given
                  http://toolshed.g2.bx.psu.edu/view/p...q_select_by_id
                  This should work. I haven't tried it, so I don't know whether it can output a FASTA file for the unmatched sequences.


                  • #10
                    Originally posted by lran2008
                    This should work. I haven't tried it, so I don't know whether it can output a FASTA file for the unmatched sequences.
                    Yes, my sequence filter tool can produce a FASTA file with matched IDs, a FASTA file with non-matching IDs, or both (two FASTA files):


                    There is a preview/mockup of the tool available to view within the Tool Shed which should help explain this.
