SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
extract full fasta file for local blast hits Oyster Bioinformatics 8 02-16-2016 12:34 PM
Different blast results between CLCBio and local blast arundurvasula Bioinformatics 0 07-01-2014 09:17 AM
Subsetting reads based on Alleles of Heterozygous SNPs DuncanSproul General 1 03-18-2014 06:56 AM
Shortlisting genes from various gene prediction methods based on Blast results. ap88 Bioinformatics 0 03-04-2014 11:25 AM
subsetting a bam file with specific alingment length joseph Bioinformatics 4 12-28-2011 07:39 PM

Reply
 
Thread Tools
Old 10-02-2015, 03:33 AM   #1
DrYak
Member
 
Location: South Africa

Join Date: Sep 2013
Posts: 12
Question Subsetting a fasta file based on a set of BLAST results

Hi,

I have assembled a de novo transcriptome from my species of interest using trinity. The species has an internal symbiont (algae) and those sequences were obviously sequenced along with it. I have done a blast on my assembly which has ID'd a number of contigs corresponding to the symbiont. I would like to use the results of the blast to remove those contigs from the assembly into a new file so I can deal with the two organisms separately. It there an easy way to do this?

I have cleaned up the trinity fasta output so it has a single unique ID for each contig like so:
>TR1-c0_g1_i1
AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC (etc...)

I have extracted the sequence IDs from the symbiont BLAST flatfile output using cut (id-symb) and then made another file with all the IDs of the symb blast hits removed from the full id file, thus representing the "host" sequences (id-host), using sort and uniq.

I therefore have the following.
combined.fasta (trinity output of all the contigs)
id-symb (single column text file of the IDs extracted from a blast search against the symbiont transcriptome)
id-host (single column text file of the all the IDs from the combined.fasta file minus the id-symb IDs)

I would like to generate the following:
symb.fasta = all those sequences in the combined.fasta from the id-symb list
and
host.fasta = the rest (i.e. combined.fasta - symb.fasta) aka all those sequences in the combined.fasta from the id-host list

I've been trying to use a looped fasgrep (from the FAST perl module) but that is far too slow (has taken more than a day to get through less than 10% of the file) so I'm sure there must be a better way.

The assembly contains ~250,000 contigs.


Thanks.
DrYak is offline   Reply With Quote
Old 10-02-2015, 03:51 AM   #2
SylvainL
Senior Member
 
Location: Geneva

Join Date: Feb 2012
Posts: 175
Default

Using R, quite easy and fast...

library(Biostrings)

all_fasta <- read.DNAStringSet("combined.fasta") ## You have to give the path to your file combined.fasta

id_symb <- scan("id_symb", what="character", sep="\n")

symbFasta <- all_fasta[names(all_fasta) %in% id_symb]
hostFasta <- all_fasta[! names(all_fasta) %in% id_symb]
SylvainL is offline   Reply With Quote
Old 10-02-2015, 04:11 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,745
Default

Another option is Jim Kent's faSomeRecords utility. I am linking the linux version but he has source/OS X executables available as well.

http://hgdownload.soe.ucsc.edu/admin.../faSomeRecords

Quote:
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
-exclude - output sequences not in the list file.
GenoMax is offline   Reply With Quote
Old 10-02-2015, 04:39 AM   #4
DrYak
Member
 
Location: South Africa

Join Date: Sep 2013
Posts: 12
Talking faSomeRecords = lifesaver

Thank you so much genomax - that took less than 2 seconds for each file. The other way was still chugging away after two days...
DrYak is offline   Reply With Quote
Reply

Tags
blast, fasta extraction, rna-seq, subsetting

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:05 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO