
  • Looking for a simple script to pull a subset of contigs from an assembly

    I'm sure this is a simple enough task, but I'm just an end user with no scripting experience at all.

    I'm looking for a script that pulls the contigs listed in a .txt file from assembly.fa and writes the results to a new .fa file.

    Any help would be much appreciated. Thanks.

  • #2
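    Use seqtk subseq (https://github.com/lh3/seqtk), for example: seqtk subseq assembly.fa contigs.txt > subset.fa (contigs.txt and subset.fa are placeholder names for your ID list and output file).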



    • #4
      Out of curiosity, GenoMax, how does the second perl function handle the searching of the sample name file?
      I wrote something similar in bash under Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the FASTA file, it would not remove this ID from the file containing the query IDs, and would then start the search again from the start. Either way, my scripting abilities produced something that seemed infeasible to run on a large query ID file and a large FASTA file.

      Cheers,

      J



      • #5
        Thanks all!



        • #7
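          awk 'BEGIN{while((getline x<ARGV[1])>0){a[i++]=x;}while((getline y<ARGV[2])>0){if(substr(y,1,1)==">"){m=0;for(j=0;j<i;j++){if(y==a[j])m=1;}}if(m==1)print y;}}' "$1" "$2"


          $1 is the match file: one header per line, written exactly as it appears in the FASTA file, including the leading ">"
          $2 is the FASTA file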
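
For large ID lists, the same single pass over the FASTA file can look each header up in a set instead of looping over every wanted ID. Below is a minimal Python sketch of that idea, not a replacement anyone in the thread posted. The function names filter_fasta and read_ids are made up for illustration, and it assumes IDs are matched on the first word of the header (with or without the leading ">") rather than on the whole header line as in the awk one-liner.

```python
def read_ids(lines):
    """Collect wanted IDs: the first word of each non-blank line, '>' stripped."""
    ids = set()
    for line in lines:
        fields = line.split()
        if fields:
            ids.add(fields[0].lstrip(">"))
    return ids


def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every FASTA record whose ID is in wanted_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Decide once per record; set membership is O(1) on average,
            # so the ID list is never rescanned for each header.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted_ids
        if keep:
            yield line
```

Both functions accept any iterable of lines, so open file handles can be passed straight in and the output written with sys.stdout.writelines(filter_fasta(fasta_handle, wanted)). Memory use then depends on the size of the ID list, not on the size of the FASTA file.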



          • #8
            Originally posted by JackieBadger
            Out of curiosity, GenoMax, how does the second perl function handle the searching of the sample name file?
            I wrote something similar in bash under Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the FASTA file, it would not remove this ID from the file containing the query IDs, and would then start the search again from the start. Either way, my scripting abilities produced something that seemed infeasible to run on a large query ID file and a large FASTA file.

            Cheers,

            J
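
            @JackieBadger: The second perl function uses the -n and -e switches. -e supplies the program text on the command line, and -n wraps a while (<>) { ... } loop around it, so the program runs once per input line with that line in $_. (-p does the same and also prints $_ after each iteration.)

            A nice example that illustrates this (equivalent to the unix 'cat' command):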

            Code:
            $ perl -ne 'print $_' filename
            or
            Code:
            $ perl -ne 'print' filename
            Last edited by GenoMax; 02-08-2014, 06:05 PM.
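
For readers more at home in Python, the implicit loop that -n adds can be written out explicitly. This is only a rough analogue, sketched under the assumption that fileinput.input() is an acceptable stand-in for perl's <> operator (both read the files named on the command line, or standard input if none are given). The names cat and main are made up for illustration.

```python
import fileinput
import sys


def cat(lines, out=sys.stdout):
    """Write every input line unchanged, like perl -ne 'print'."""
    for line in lines:   # plays the role of the implicit while (<>) loop
        out.write(line)  # line plays the role of $_


def main():
    # Reads the files named in sys.argv[1:], or stdin when there are none.
    cat(fileinput.input())
```

Calling main() from a script then behaves like the perl one-liners above: one pass over the input, one line in memory at a time.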



            • #9
              This little BioPython script will nicely do the job:

              Code:
              from Bio import SeqIO
              import sys

              # Usage: filter_fasta_per_ids.py input.fasta filter_ids.txt output.fasta

              input_file = sys.argv[1]
              id_file = sys.argv[2]
              output_file = sys.argv[3]

              # Take the first whitespace-delimited word of each non-blank line as an ID.
              with open(id_file) as handle:
                  wanted = set(line.split(None, 1)[0] for line in handle if line.strip())
              print("Found %i unique identifiers in %s" % (len(wanted), id_file))

              # Stream the records so the whole FASTA file is never held in memory.
              records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
              count = SeqIO.write(records, output_file, "fasta")
              print("Saved %i records from %s to %s" % (count, input_file, output_file))
              if count < len(wanted):
                  print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))
