![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
install local blast | shuang | Bioinformatics | 6 | 07-07-2019 08:53 AM |
Extract sequence from multi fasta file with PERL | andreitudor | Bioinformatics | 27 | 07-07-2019 08:45 AM |
how to extract unique hits from a sam file | gfmgfm | Bioinformatics | 10 | 07-07-2019 08:33 AM |
BLAST database error - when changing to new BLAST+ local program | biobio | Bioinformatics | 4 | 06-15-2011 06:20 AM |
Local Blast Error of index file | Shani | Bioinformatics | 0 | 02-12-2011 02:45 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Junior Member
Location: Rhode Island Join Date: Nov 2010
Posts: 4
|
![]()
Hi all,
I would like to take local blast hit sequences, i.e. hsp_hseq in blast xml format, and extract the full sequences of those hits from the original fasta file (as the xml output only provides those hsp's) and put them in a fasta file of all hits that I can use later. This should be a widely performed task but I can't find any scripts that directly address this issue. I have not acquired the skills to make my own scripts for this task. Thanks in advance |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
![]()
Oyster,
This thread (http://seqanswers.com/forums/showthread.php?t=8805) I think covers your question. Basically you construct a list of accessions or ids (in your case for your hits of interest) and use that list to extract the full sequences from the BLAST database using the accompanying BLAST command line tools. |
![]() |
![]() |
![]() |
#3 |
Junior Member
Location: Rhode Island Join Date: Nov 2010
Posts: 4
|
![]()
Thanks kmcarr. Indeed, that post was exactly what I needed, yet I remain with a couple of questions.
'makeblastdb' with the '-parse_seqids' option seems like way to go to do this task. However, I have 12 lanes of illumina data. Trying makeblastdb on one lane took me 3 days and I had to stop this because I need the RAM for some other work. So I thought I'd go about it in a different way using the list of hits from blastn blastn -db myDB -query myQuery -out myContigList.txt -outfmt "6 sallacc" I have a text file with all my hits. finding these hits in the original fasta file. I tried out a bunch of scripts to do this, like those I've attached, but they all have failed. I consistently get a "Segmentation fault" after the index builds to some 545 GB, taking up all of my hard drive's memory. Obviously there is something funny going on with the indexing. The fasta file is straightforward: >1 CTCAAATTATAGTAATATCACGATGCAGTTATATGATATTACATGTGATACGTAAAAAAAATTATTATTTGTGCTATAATCATCTTGTCATGTGAACATGGAATGAG >2 TCTTTGCAAAATCTGTGTTAAATTGACAGAATAGTTTGTATTGTTCAACCAGGTAAAAATATATCACTCATTTTAATTCACTAAAAGTCACAAAATTTACTTTTCAT >3 GACTTTTTGTTAACGTATTTAGCCACATAGAAACCAACAGCCATATAACTGGTAGCTTTAAGCGGCTCACCTTTAGCATCAACAGGCCACAACCAACCAGAACGTGA yet it runs to 35 million or so (The wrapping looks weird here but that is just due to the text box's properties.) I reformatted the fasta file using sreformat as I saw it mentioned elsewhere that this may help with indexing- still no luck. Does anyone have an strategy to index large fasta files (~3.5GB) composed of some 35 million short reads? Once indexed properly, I assume it will be no problem to extract the sequences of interests, which number in the thousands for me. Or maybe I should just have to stop cutting corners and let "makeblastdb" run for a few more days. Any suggestions? |
![]() |
![]() |
![]() |
#4 |
Member
Location: Oregon Join Date: Feb 2011
Posts: 29
|
![]()
if they are ordered like that then you know where in the file they are,
just go and get them from the command line maybe collapse the sequence onto the same line as the defline to avoid any arithmetic |
![]() |
![]() |
![]() |
#5 |
Junior Member
Location: Rhode Island Join Date: Nov 2010
Posts: 4
|
![]()
I have not succeeded in grabbing the +3000 sequences from my +35 million sequences.
If you know a way, please do share. Thanks |
![]() |
![]() |
![]() |
#6 | |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
![]() Quote:
I'm going to suggest that you may need to find an alternate approach to your problem. Creating and searching against a BLAST database of hundreds of millions of short reads is an impossible task, not to mention the output would be pretty uninformative. The short reads have no intrinsic information associated with them. What could you learn from this other than how many short reads are an approximate match to whatever query sequence you submit. Perhaps you could describe what you are trying to accomplish and we could see if there is a more efficient approach. |
|
![]() |
![]() |
![]() |
#7 |
Member
Location: Oregon Join Date: Feb 2011
Posts: 29
|
![]()
You somehow know the identity of 3000+ sequences if interest
in your 3.5G fasta file and you would like to isolate them. (more background information would help.) you do not seem to be able to index the fasta file because of some limitation. this is not encouraging. (information of the system you are using would help) You indicate the sequences in the fasta file have simple integers for their defline and that they are short reads. I am going to make a leap here and assume they continue to be ordered beyond the first three you show. and that each short read is all on its own line. (basically the defline) furthermore, that you are able to get your list of sequences of interest into sorted order one per line. maybe, Code:
sort -n random.list > sorted.list you could use your favorite scripting language to open the fasta file and carefully seek to each record and output it without any searching. or use a command line tool like awk and change what you are searching for as you go *note this relies on both files being ordered* Code:
awk 'BEGIN{i=0;while(getline a[i++]<"sorted.list">0){};i=0}a[i]==$1{i++;print;getline;print}' bigseqfile.fa and they are resource related, you are likely going to continue to have problems till you address that. |
![]() |
![]() |
![]() |
#8 |
Junior Member
Location: Buffalo Join Date: Feb 2016
Posts: 2
|
![]()
Hello-
(I posted this in another forum, as well, but this is similar): I'm trying to 'extract' nucleotide sequences from my results after running a local blastn against my local database. The output format right now is a standard "blast-looking" result page, but it's not easy to work with the results when I want to further compare sequences. (I have a gene- and will have genes- of interest. I want to see how they compare to the local database of my sequences, but then I want the results in FASTA format for further analyses and comparisons). The help manual for local blast doesn't seem to have the answers I am looking for Will gladly provide additional information if more is needed. I really thank anyone who is able to help! |
![]() |
![]() |
![]() |
#9 |
Member
Location: Oregon Join Date: Feb 2011
Posts: 29
|
![]()
Lets try saying it this way.
Don't extract sequences from your blast results. Extract a list of sequence identifiers (top hit perhaps) from your blast results. Then use that list of identifiers to extract the fasta sequence from the blast _database_ your blast results are from. That is: There are command line tools that given an identifier (or list of identifiers) will return those fasta that went into making your "local blast database". |
![]() |
![]() |
![]() |
#10 |
Member
Location: Bhopal Join Date: Jul 2019
Posts: 19
|
![]()
on the off chance that they are requested that way, at that point you know where in the record they are,
simply proceed to get them from the order line possibly breakdown the grouping onto a similar line as the defline to keep away from any number juggling |
![]() |
![]() |
![]() |
Thread Tools | |
|
|