Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
extract full fasta file for local blast hits Oyster Bioinformatics 9 07-07-2019 08:39 AM
Different blast results between CLCBio and local blast arundurvasula Bioinformatics 0 07-01-2014 10:17 AM
Subsetting reads based on Alleles of Heterozygous SNPs DuncanSproul General 1 03-18-2014 07:56 AM
Shortlisting genes from various gene prediction methods based on Blast results. ap88 Bioinformatics 0 03-04-2014 12:25 PM
subsetting a bam file with specific alingment length joseph Bioinformatics 4 12-28-2011 08:39 PM

Thread Tools
Old 10-02-2015, 04:33 AM   #1
Location: South Africa

Join Date: Sep 2013
Posts: 12
Question Subsetting a fasta file based on a set of BLAST results


I have assembled a de novo transcriptome from my species of interest using trinity. The species has an internal symbiont (algae) and those sequences were obviously sequenced along with it. I have done a blast on my assembly which has ID'd a number of contigs corresponding to the symbiont. I would like to use the results of the blast to remove those contigs from the assembly into a new file so I can deal with the two organisms separately. It there an easy way to do this?

I have cleaned up the trinity fasta output so it has a single unique ID for each contig like so:

I have extracted the sequence IDs from the symbiont BLAST flatfile output using cut (id-symb) and then made another file with all the IDs of the symb blast hits removed from the full id file, thus representing the "host" sequences (id-host), using sort and uniq.

I therefore have the following.
combined.fasta (trinity output of all the contigs)
id-symb (single column text file of the IDs extracted from a blast search against the symbiont transcriptome)
id-host (single column text file of the all the IDs from the combined.fasta file minus the id-symb IDs)

I would like to generate the following:
symb.fasta = all those sequences in the combined.fasta from the id-symb list
host.fasta = the rest (i.e. combined.fasta - symb.fasta) aka all those sequences in the combined.fasta from the id-host list

I've been trying to use a looped fasgrep (from the FAST perl module) but that is far too slow (has taken more than a day to get through less than 10% of the file) so I'm sure there must be a better way.

The assembly contains ~250,000 contigs.

DrYak is offline   Reply With Quote
Old 10-02-2015, 04:51 AM   #2
Senior Member
Location: Geneva

Join Date: Feb 2012
Posts: 179

Using R, quite easy and fast...


all_fasta <- read.DNAStringSet("combined.fasta") ## You have to give the path to your file combined.fasta

id_symb <- scan("id_symb", what="character", sep="\n")

symbFasta <- all_fasta[names(all_fasta) %in% id_symb]
hostFasta <- all_fasta[! names(all_fasta) %in% id_symb]
SylvainL is offline   Reply With Quote
Old 10-02-2015, 05:11 AM   #3
Senior Member
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080

Another option is Jim Kent's faSomeRecords utility. I am linking the linux version but he has source/OS X executables available as well.

faSomeRecords - Extract multiple fa records
faSomeRecords in.fa listFile out.fa
-exclude - output sequences not in the list file.
GenoMax is offline   Reply With Quote
Old 10-02-2015, 05:39 AM   #4
Location: South Africa

Join Date: Sep 2013
Posts: 12
Talking faSomeRecords = lifesaver

Thank you so much genomax - that took less than 2 seconds for each file. The other way was still chugging away after two days...
DrYak is offline   Reply With Quote

blast, fasta extraction, rna-seq, subsetting

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 02:14 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO