  • lac302
    Member
    • Dec 2012
    • 64

    Looking for a simple script to pull a subset of contigs from an assembly

    I'm sure this is a simple enough task, but I'm just an end user with no scripting experience at all.

    I'm looking for a script that pulls the contigs listed in a .txt file from assembly.fa and writes the results to a new .fa file.

    Any help would be much appreciated. Thanks.
  • Monika_bioinf
    Junior Member
    • Sep 2011
    • 7

    #2
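    Use seqtk subseq (https://github.com/lh3/seqtk), for example: seqtk subseq assembly.fa ids.txt > subset.fa, where ids.txt (a placeholder name) lists one contig name per line.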


    • JackieBadger
      Senior Member
      • Mar 2009
      • 385

      #4
      Out of curiosity, GenoMax, how does the second perl function handle searching the sample name file?
      I wrote something similar in bash under Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the fasta file, it did not remove that ID from the file containing the query IDs, and then restarted the search from the top. Either way, my scripting abilities produced something that was infeasible to run on a large query ID file and a large fasta file.

      Cheers,

      J
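
One common way to avoid the repeated rescanning described above is to load the query IDs into a set once and then walk the fasta file a single time. The following is only a minimal sketch of that idea, not the perl function being asked about: `filter_fasta` is a made-up name, and it assumes the ID is the first whitespace-delimited token after ">" on each header line.

```python
import io


def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every FASTA record whose ID is in wanted_ids.

    Assumption: the record ID is the first whitespace-delimited token
    after '>' on the header line.
    """
    wanted = set(wanted_ids)  # set membership tests are constant time on average
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Decide once per record whether its lines should be emitted.
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


if __name__ == "__main__":
    # Tiny in-memory example instead of real files.
    demo = io.StringIO(">contig1 len=4\nACGT\n>contig2\nGGCC\nTTAA\n>contig3\nAAAA\n")
    print("".join(filter_fasta(demo, ["contig1", "contig3"])), end="")
```

Because the fasta file is read line by line and only the ID set is held in memory, memory use should depend on the number of query IDs rather than on the size of the assembly.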


      • lac302
        Member
        • Dec 2012
        • 64

        #5
        Thanks all!


        • Richard Finney
          Senior Member
          • Feb 2009
          • 701

          #7
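          awk 'BEGIN{while((getline x<ARGV[1])>0){a[i++]=x;}while((getline y<ARGV[2])>0){if(substr(y,1,1)==">"){m=0;for(j=0;j<i;j++){if(y==a[j])m=1;}}if(m==1)print y;}}' "$1" "$2"


          $1 is the match file (one full header line per entry, exactly as it appears in the fasta file, including the leading ">")
          $2 is the fasta file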


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #8
            Originally posted by JackieBadger:
            Out of curiosity, GenoMax, how does the second perl function handle searching the sample name file?
            I wrote something similar in bash under Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the fasta file, it did not remove that ID from the file containing the query IDs, and then restarted the search from the top. Either way, my scripting abilities produced something that was infeasible to run on a large query ID file and a large fasta file.

            Cheers,

            J
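            @JackieBadger: The second perl function uses the -n and -e switches. -n wraps a while (<>) { ... } loop around the program, so each input line is read into $_ in turn, and -e supplies the program text on the command line. (-p behaves like -n but also prints $_ after each iteration.)

            A nice example that illustrates this (equivalent to the unix 'cat' command):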

            Code:
            $ perl -ne 'print $_' filename
            or
            Code:
            $ perl -ne 'print' filename
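
For readers more at home in Python, the implicit loop that -n adds can be sketched roughly as below. This is an analogy only, not how perl implements the switch; `perl_n` is a hypothetical helper name and the input is an in-memory string rather than a real file.

```python
import io
import sys


def perl_n(body, stream):
    """Call body(line) once per input line, mimicking perl -n's implicit loop."""
    for line in stream:
        body(line)


if __name__ == "__main__":
    # Rough analogue of: perl -ne 'print' filename
    perl_n(sys.stdout.write, io.StringIO("line one\nline two\n"))
```

The point is that the one-liner's body only ever sees one line at a time, so memory use does not grow with the size of the input file.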
            Last edited by GenoMax; 02-08-2014, 06:05 PM.


            • Birdman
              Member
              • Jan 2014
              • 21

              #9
              This little BioPython script will nicely do the job:

              Code:
              from Bio import SeqIO
              import sys

              # Usage: filter_fasta_per_ids.py input.fasta filter_ids.txt output.fasta

              input_file = sys.argv[1]
              id_file = sys.argv[2]
              output_file = sys.argv[3]

              # Use the first whitespace-delimited token of each non-blank line as an ID.
              # Blank lines are skipped so they do not raise an IndexError.
              with open(id_file) as handle:
                  wanted = set(line.split(None, 1)[0] for line in handle if line.strip())
              print("Found %i unique identifiers in %s" % (len(wanted), id_file))

              # The generator streams records one at a time instead of loading the whole file.
              records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
              count = SeqIO.write(records, output_file, "fasta")
              print("Saved %i records from %s to %s" % (count, input_file, output_file))
              if count < len(wanted):
                  print("Warning: %i IDs not found in %s" % (len(wanted) - count, input_file))
