10-02-2015, 04:33 AM   #1
DrYak
Member
 
Location: South Africa

Join Date: Sep 2013
Posts: 12
Question Subsetting a fasta file based on a set of BLAST results

Hi,

I have assembled a de novo transcriptome for my species of interest using Trinity. The species has an internal symbiont (an alga), so the symbiont's transcripts were sequenced along with the host's. I have run BLAST on my assembly, which identified a number of contigs corresponding to the symbiont. I would like to use the BLAST results to move those contigs out of the assembly into a new file so I can deal with the two organisms separately. Is there an easy way to do this?

I have cleaned up the Trinity fasta output so that each contig has a single unique ID, like so:
>TR1-c0_g1_i1
AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC (etc...)

I have extracted the sequence IDs from the symbiont BLAST flatfile output using cut (id-symb). Using sort and uniq, I then made a second file containing the full list of IDs with the symbiont BLAST hits removed, which therefore represents the "host" sequences (id-host).
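For reference, the ID bookkeeping described above amounts to a set difference. A minimal Python sketch of it follows; the function names `read_ids` and `host_ids` are hypothetical, and it assumes one ID per line with no extra columns.

```python
def read_ids(lines):
    """Collect non-empty, whitespace-stripped IDs from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def host_ids(all_ids, symb_ids):
    """Return the IDs in all_ids that are not in symb_ids.

    Order of first appearance is kept and duplicates are dropped.
    """
    symb = set(symb_ids)
    seen = set()
    result = []
    for seq_id in all_ids:
        if seq_id not in symb and seq_id not in seen:
            seen.add(seq_id)
            result.append(seq_id)
    return result
```

With real files this would be called as something like `host_ids(all_id_list, read_ids(open("id-symb")))`, where `all_id_list` is an assumed list of every ID in combined.fasta.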

I therefore have the following:
combined.fasta (Trinity output of all the contigs)
id-symb (single-column text file of the IDs extracted from a BLAST search against the symbiont transcriptome)
id-host (single-column text file of all the IDs from the combined.fasta file minus the id-symb IDs)

I would like to generate the following:
symb.fasta = all the sequences in combined.fasta whose IDs are in the id-symb list
and
host.fasta = the rest (i.e. combined.fasta - symb.fasta), in other words all the sequences in combined.fasta whose IDs are in the id-host list

I've been trying to use a looped fasgrep (from the FAST Perl module), but that is far too slow (it has taken more than a day to get through less than 10% of the file), so I'm sure there must be a better way.

The assembly contains ~250,000 contigs.
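To make the goal concrete, here is a rough Python sketch of the kind of single-pass split I have in mind. It is not a tested solution: the function name `split_fasta` is hypothetical, and it assumes the sequence ID is the first whitespace-delimited token after ">" on each header line. The IDs are held in a set so that each header costs one lookup, rather than rescanning the fasta file once per ID as a looped grep does.

```python
def split_fasta(fasta_in, symb_ids, symb_out, host_out):
    """Stream a fasta file once, routing each record to one of two outputs.

    fasta_in  : iterable of lines (e.g. an open file)
    symb_ids  : collection of symbiont sequence IDs
    symb_out  : writable handle for records whose ID is in symb_ids
    host_out  : writable handle for all other records
    Returns (n_symb, n_host), the number of records written to each.
    """
    symb_ids = set(symb_ids)  # constant-time membership tests
    out = None                # handle for the record currently being read
    n_symb = n_host = 0
    for line in fasta_in:
        if line.startswith(">"):
            # Assumption: the ID is the first token after ">".
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if seq_id in symb_ids:
                out = symb_out
                n_symb += 1
            else:
                out = host_out
                n_host += 1
        if out is not None:   # skip any text before the first header
            out.write(line)
    return n_symb, n_host
```

In practice this would be called with combined.fasta opened for reading, the IDs read from id-symb, and symb.fasta and host.fasta opened for writing. Sequence lines that wrap over several lines follow their header automatically, because the output handle only changes when a new ">" line is seen.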


Thanks.