  • looking for a simple script to pull a subset of contigs from an assembly

    I'm sure this is a simple enough task, but I'm just an end user with no scripting experience at all.

    I'm looking for a script to pull the contigs listed in a .txt file from assembly.fa and write the results to a new .fa file.

    Any help would be much appreciated. Thanks.

  • #2
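    Use seqtk subseq (https://github.com/lh3/seqtk). For example, with one contig name per line in a list file (here called contigs.txt): seqtk subseq assembly.fa contigs.txt > subset.fa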


    • #4
      Out of curiosity, GenoMax, how does the second Perl function handle searching the sample name file?
      I wrote something similar in bash under Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the FASTA file, it did not remove that ID from the file containing the query IDs, and then restarted the search from the beginning. Either way, my scripting abilities produced something that seemed infeasible to run on a large query ID file and a large FASTA file.

      Cheers,

      J
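The slowdown described above is typical of re-scanning one file for every entry in the other. A minimal sketch of the usual alternative follows: load the query IDs into a set once, then stream the FASTA file a single time. It assumes that an ID is the first whitespace-delimited token of a header line; the function names `read_ids` and `filter_fasta` are made up for illustration and do not come from any post in this thread.

```python
def read_ids(lines):
    """Return the query IDs as a set.

    Assumption: one ID per line, taken as the first whitespace-delimited
    token; a leading ">" is tolerated and blank lines are skipped.
    """
    return {line.split()[0].lstrip(">") for line in lines if line.strip()}


def filter_fasta(fasta_lines, wanted):
    """Yield the header and sequence lines of records whose ID is in `wanted`.

    The FASTA input is read once, line by line, so memory use depends on the
    size of the ID set and not on the size of the FASTA file.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # ID = text between ">" and the first whitespace character
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line
```

Both functions accept any iterable of lines, so open file handles can be passed directly, for example `out.writelines(filter_fasta(fasta_handle, read_ids(id_handle)))`. Each set lookup takes roughly constant time, so the run time grows with the size of the FASTA file rather than with the product of the two file sizes.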


      • #5
        Thanks all!


        • #7
          awk 'BEGIN{while((getline x<ARGV[1])>0)a[x]=1;while((getline y<ARGV[2])>0){if(substr(y,1,1)==">")m=(y in a);if(m)print y;}}' "$1" "$2"


          $1 is the match file (one full header line per entry, including the leading ">")
          $2 is the FASTA file


          • #8
            Originally posted by JackieBadger
            Out of curiosity, GenoMax, how does the second Perl function handle searching the sample name file?
            I wrote something similar in bash under Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the FASTA file, it did not remove that ID from the file containing the query IDs, and then restarted the search from the beginning. Either way, my scripting abilities produced something that seemed infeasible to run on a large query ID file and a large FASTA file.

            Cheers,

            J
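            @JackieBadger: The second Perl function uses the -n and -e switches. -n wraps a while (<>) loop around the program, so each input line is placed in $_ in turn, while -e supplies the program text on the command line. (-p does the same as -n but also prints $_ after each iteration.)

            A nice example that illustrates this (equivalent to the Unix 'cat' command):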

            Code:
            $ perl -ne 'print $_' filename
            or
            Code:
            $ perl -ne 'print' filename
            Last edited by GenoMax; 02-08-2014, 06:05 PM.
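For readers more at home in Python, the implicit loop that -n adds can be sketched with the standard-library `fileinput` module. This is only an analogy under stated assumptions, not a translation of the Perl one-liners from this thread, and the function name `each_line` is made up for illustration.

```python
import fileinput


def each_line(paths):
    """Yield every line of the named files in order.

    fileinput.input() plays the role of Perl's implicit `while (<>)` loop,
    and `line` corresponds to $_ inside that loop.
    """
    with fileinput.input(files=paths) as stream:
        for line in stream:
            # The body of the loop is where the -e program would run;
            # here it simply passes the line through, like `print`.
            yield line
```

Calling `sys.stdout.writelines(each_line(["filename"]))` then behaves like `perl -ne 'print' filename`. If `paths` is empty, `fileinput` falls back to the files named in `sys.argv[1:]`, or to standard input when there are none, which mirrors how Perl's `<>` chooses its input.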


            • #9
              This little Biopython script will do the job nicely:

              Code:
              # Usage: filter_fasta_per_ids.py input.fasta filter_ids.txt output.fasta
              import sys

              from Bio import SeqIO

              input_file = sys.argv[1]
              id_file = sys.argv[2]
              output_file = sys.argv[3]

              # Use the first whitespace-delimited token of each non-blank line as an ID.
              with open(id_file) as handle:
                  wanted = set(line.split(None, 1)[0] for line in handle if line.strip())
              print("Found %i unique identifiers in %s" % (len(wanted), id_file))

              # Generator expression: records are streamed rather than all held in memory.
              records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
              count = SeqIO.write(records, output_file, "fasta")
              print("Saved %i records from %s to %s" % (count, input_file, output_file))
              if count < len(wanted):
                  print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))
