  • DrYak
    Member
    • Sep 2013
    • 13

    Subsetting a fasta file based on a set of BLAST results

    Hi,

    I have assembled a de novo transcriptome from my species of interest using Trinity. The species has an internal symbiont (algae), and its transcripts were inevitably sequenced along with the host's. I have run a BLAST search on my assembly, which has identified a number of contigs corresponding to the symbiont. I would like to use the BLAST results to move those contigs out of the assembly into a new file so I can deal with the two organisms separately. Is there an easy way to do this?

    I have cleaned up the Trinity fasta output so it has a single unique ID for each contig like so:
    >TR1-c0_g1_i1
    AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC (etc...)

    I have extracted the sequence IDs from the symbiont BLAST flatfile output using cut (id-symb). Using sort and uniq, I then made another file with the IDs of the symbiont BLAST hits removed from the full ID file, which therefore represents the "host" sequences (id-host).

    I therefore have the following.
    combined.fasta (Trinity output of all the contigs)
    id-symb (single-column text file of the IDs extracted from a BLAST search against the symbiont transcriptome)
    id-host (single-column text file of all the IDs from the combined.fasta file minus the id-symb IDs)
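For readers following along, the two ID lists described above could be built roughly like this. It is only a sketch under assumptions not stated in the post: the BLAST output is taken to be tab-separated with the contig (query) ID in column 1, the function name make_id_lists and the intermediate file id-all are made up for illustration, and comm is swapped in for the sort/uniq step.

```shell
# make_id_lists BLAST_TSV FASTA   (hypothetical helper, not from the thread)
# Writes id-symb, id-all and id-host into the current directory.
make_id_lists() {
  # Assumption: column 1 of the tab-separated BLAST output is the contig ID.
  # sort -u collapses contigs that have several hits.
  cut -f1 "$1" | LC_ALL=C sort -u > id-symb
  # Every contig ID in the assembly: header lines minus the leading ">".
  grep '^>' "$2" | sed 's/^>//' | LC_ALL=C sort -u > id-all
  # comm -23 keeps lines found only in the first file,
  # i.e. the IDs that have no symbiont hit.
  LC_ALL=C comm -23 id-all id-symb > id-host
}
```

With the file names from this post the call would be something like make_id_lists symb_blast.tsv combined.fasta, where symb_blast.tsv is a placeholder for the BLAST output file.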

    I would like to generate the following:
    symb.fasta = all those sequences in the combined.fasta from the id-symb list
    and
    host.fasta = the rest (i.e. combined.fasta - symb.fasta) aka all those sequences in the combined.fasta from the id-host list

    I've been trying to use a looped fasgrep (from the FAST Perl module), but that is far too slow: it has taken more than a day to get through less than 10% of the file. I'm sure there must be a better way.

    The assembly contains ~250,000 contigs.


    Thanks.
  • SylvainL
    Senior Member
    • Feb 2012
    • 180

    #2
    Using R, quite easy and fast...

    library(Biostrings)

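    ## Give the path to your combined.fasta file.
    ## readDNAStringSet() replaces read.DNAStringSet(), which is gone from current Biostrings.
    all_fasta <- readDNAStringSet("combined.fasta")

    ## One ID per line; the file is called id-symb in the question
    id_symb <- scan("id-symb", what="character", sep="\n")

    ## names() holds the full header line, so this relies on the cleaned single-ID headers
    symbFasta <- all_fasta[names(all_fasta) %in% id_symb]
    hostFasta <- all_fasta[! names(all_fasta) %in% id_symb]

    ## Write the two subsets back out as fasta
    writeXStringSet(symbFasta, "symb.fasta")
    writeXStringSet(hostFasta, "host.fasta")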


    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      Another option is Jim Kent's faSomeRecords utility. I am linking the Linux version, but source and OS X executables are available as well.



      faSomeRecords - Extract multiple fa records
      usage:
      faSomeRecords in.fa listFile out.fa
      options:
      -exclude - output sequences not in the list file.
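If the Kent utilities are not at hand, the same split can be sketched as a single pass with awk. This is not how faSomeRecords is implemented, just an illustration of the idea that makes it fast: load the ID list into a lookup table once and read the fasta once, instead of rescanning the file for every ID. The function name split_by_ids is made up, and the sketch assumes headers whose first word after ">" is the ID, as in the cleaned-up assembly above.

```shell
# Going by the usage text above, the faSomeRecords calls for this thread
# would presumably be:
#   faSomeRecords combined.fasta id-symb symb.fasta
#   faSomeRecords -exclude combined.fasta id-symb host.fasta

# split_by_ids FASTA IDLIST LISTED_OUT REST_OUT   (hypothetical helper)
# Records whose ID is in IDLIST go to LISTED_OUT, all others to REST_OUT.
split_by_ids() {
  : > "$3"; : > "$4"   # make sure both outputs exist even if one stays empty
  awk -v idfile="$2" -v keep="$3" -v rest="$4" '
    # Load the ID list into a lookup table once, before reading the fasta.
    BEGIN { while ((getline line < idfile) > 0) ids[line] = 1 }
    # On each header, pick the output file for this record.
    /^>/  { out = (substr($1, 2) in ids) ? keep : rest }
    # Header and sequence lines follow the current record to its file.
    out != "" { print > out }
  ' "$1"
}
```

For the files in this thread that would be called as split_by_ids combined.fasta id-symb symb.fasta host.fasta.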


      • DrYak
        Member
        • Sep 2013
        • 13

        #4
        faSomeRecords = lifesaver

        Thank you so much, GenoMax. That took less than 2 seconds for each file. The other way was still chugging away after two days...
