![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
extract full fasta file for local blast hits | Oyster | Bioinformatics | 9 | 07-07-2019 08:39 AM |
Different blast results between CLCBio and local blast | arundurvasula | Bioinformatics | 0 | 07-01-2014 10:17 AM |
Subsetting reads based on Alleles of Heterozygous SNPs | DuncanSproul | General | 1 | 03-18-2014 07:56 AM |
Shortlisting genes from various gene prediction methods based on Blast results. | ap88 | Bioinformatics | 0 | 03-04-2014 12:25 PM |
subsetting a bam file with specific alingment length | joseph | Bioinformatics | 4 | 12-28-2011 08:39 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: South Africa Join Date: Sep 2013
Posts: 12
|
![]()
Hi,
I have assembled a de novo transcriptome from my species of interest using trinity. The species has an internal symbiont (algae) and those sequences were obviously sequenced along with it. I have done a blast on my assembly which has ID'd a number of contigs corresponding to the symbiont. I would like to use the results of the blast to remove those contigs from the assembly into a new file so I can deal with the two organisms separately. It there an easy way to do this? I have cleaned up the trinity fasta output so it has a single unique ID for each contig like so: >TR1-c0_g1_i1 AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC (etc...) I have extracted the sequence IDs from the symbiont BLAST flatfile output using cut (id-symb) and then made another file with all the IDs of the symb blast hits removed from the full id file, thus representing the "host" sequences (id-host), using sort and uniq. I therefore have the following. combined.fasta (trinity output of all the contigs) id-symb (single column text file of the IDs extracted from a blast search against the symbiont transcriptome) id-host (single column text file of the all the IDs from the combined.fasta file minus the id-symb IDs) I would like to generate the following: symb.fasta = all those sequences in the combined.fasta from the id-symb list and host.fasta = the rest (i.e. combined.fasta - symb.fasta) aka all those sequences in the combined.fasta from the id-host list I've been trying to use a looped fasgrep (from the FAST perl module) but that is far too slow (has taken more than a day to get through less than 10% of the file) so I'm sure there must be a better way. The assembly contains ~250,000 contigs. Thanks. |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Geneva Join Date: Feb 2012
Posts: 179
|
![]()
Using R, quite easy and fast...
library(Biostrings) all_fasta <- read.DNAStringSet("combined.fasta") ## You have to give the path to your file combined.fasta id_symb <- scan("id_symb", what="character", sep="\n") symbFasta <- all_fasta[names(all_fasta) %in% id_symb] hostFasta <- all_fasta[! names(all_fasta) %in% id_symb] |
![]() |
![]() |
![]() |
#3 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,083
|
![]()
Another option is Jim Kent's faSomeRecords utility. I am linking the linux version but he has source/OS X executables available as well.
http://hgdownload.soe.ucsc.edu/admin.../faSomeRecords Quote:
|
|
![]() |
![]() |
![]() |
#4 |
Member
Location: South Africa Join Date: Sep 2013
Posts: 12
|
![]()
Thank you so much genomax - that took less than 2 seconds for each file. The other way was still chugging away after two days...
|
![]() |
![]() |
![]() |
Tags |
blast, fasta extraction, rna-seq, subsetting |
Thread Tools | |
|
|