SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Subsetting a fasta file based on a set of BLAST results (http://seqanswers.com/forums/showthread.php?t=63207)

DrYak 10-02-2015 04:33 AM

Subsetting a fasta file based on a set of BLAST results
 
Hi,

I have assembled a de novo transcriptome from my species of interest using trinity. The species has an internal symbiont (algae) and those sequences were obviously sequenced along with it. I have done a blast on my assembly which has ID'd a number of contigs corresponding to the symbiont. I would like to use the results of the blast to remove those contigs from the assembly into a new file so I can deal with the two organisms separately. It there an easy way to do this?

I have cleaned up the trinity fasta output so it has a single unique ID for each contig like so:
>TR1-c0_g1_i1
AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC (etc...)

I have extracted the sequence IDs from the symbiont BLAST flatfile output using cut (id-symb) and then made another file with all the IDs of the symb blast hits removed from the full id file, thus representing the "host" sequences (id-host), using sort and uniq.

I therefore have the following.
combined.fasta (trinity output of all the contigs)
id-symb (single column text file of the IDs extracted from a blast search against the symbiont transcriptome)
id-host (single column text file of the all the IDs from the combined.fasta file minus the id-symb IDs)

I would like to generate the following:
symb.fasta = all those sequences in the combined.fasta from the id-symb list
and
host.fasta = the rest (i.e. combined.fasta - symb.fasta) aka all those sequences in the combined.fasta from the id-host list

I've been trying to use a looped fasgrep (from the FAST perl module) but that is far too slow (has taken more than a day to get through less than 10% of the file) so I'm sure there must be a better way.

The assembly contains ~250,000 contigs.


Thanks.

SylvainL 10-02-2015 04:51 AM

Using R, quite easy and fast...

library(Biostrings)

all_fasta <- read.DNAStringSet("combined.fasta") ## You have to give the path to your file combined.fasta

id_symb <- scan("id_symb", what="character", sep="\n")

symbFasta <- all_fasta[names(all_fasta) %in% id_symb]
hostFasta <- all_fasta[! names(all_fasta) %in% id_symb]

GenoMax 10-02-2015 05:11 AM

Another option is Jim Kent's faSomeRecords utility. I am linking the linux version but he has source/OS X executables available as well.

http://hgdownload.soe.ucsc.edu/admin.../faSomeRecords

Quote:

faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
-exclude - output sequences not in the list file.

DrYak 10-02-2015 05:39 AM

faSomeRecords = lifesaver
 
Thank you so much genomax - that took less than 2 seconds for each file. The other way was still chugging away after two days...


All times are GMT -8. The time now is 01:49 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.