Unconfigured Ad

**kmcarr** · 03-11-2011, 02:00 PM

Oyster,

This thread (http://seqanswers.com/forums/showthread.php?t=8805) I think covers your question. Basically you construct a list of accessions or ids (in your case for your hits of interest) and use that list to extract the full sequences from the BLAST database using the accompanying BLAST command line tools.

**Oyster** · 03-14-2011, 12:38 PM

Thanks kmcarr. Indeed, that post was exactly what I needed, yet I remain with a couple of questions.

'makeblastdb' with the '-parse_seqids' option seems like way to go to do this task.

However, I have 12 lanes of illumina data. Trying makeblastdb on one lane took me 3 days and I had to stop this because I need the RAM for some other work.

So I thought I'd go about it in a different way using the list of hits from blastn

blastn -db myDB -query myQuery -out myContigList.txt -outfmt "6 sallacc"

I have a text file with all my hits. finding these hits in the original fasta file. I tried out a bunch of scripts to do this, like those I've attached, but they all have failed.

I consistently get a "Segmentation fault" after the index builds to some 545 GB, taking up all of my hard drive's memory. Obviously there is something funny going on with the indexing.

The fasta file is straightforward:
>1
CTCAAATTATAGTAATATCACGATGCAGTTATATGATATTACATGTGATACGTAAAAAAAATTATTATTTGTGCTATAATCATCTTGTCATGTGAACATGGAATGAG
>2
TCTTTGCAAAATCTGTGTTAAATTGACAGAATAGTTTGTATTGTTCAACCAGGTAAAAATATATCACTCATTTTAATTCACTAAAAGTCACAAAATTTACTTTTCAT
>3
GACTTTTTGTTAACGTATTTAGCCACATAGAAACCAACAGCCATATAACTGGTAGCTTTAAGCGGCTCACCTTTAGCATCAACAGGCCACAACCAACCAGAACGTGA

yet it runs to 35 million or so

(The wrapping looks weird here but that is just due to the text box's properties.) I reformatted the fasta file using sreformat as I saw it mentioned elsewhere that this may help with indexing- still no luck.

Does anyone have an strategy to index large fasta files (~3.5GB) composed of some 35 million short reads? Once indexed properly, I assume it will be no problem to extract the sequences of interests, which number in the thousands for me.

Or maybe I should just have to stop cutting corners and let "makeblastdb" run for a few more days. Any suggestions?

Attached Files

**tomc** · 03-14-2011, 03:03 PM

you got their number ...

if they are ordered like that then you know where in the file they are,
just go and get them from the command line

maybe collapse the sequence onto the same line as the defline to avoid
any arithmetic

**Oyster** · 03-15-2011, 08:44 AM

yes i do have their number

I have not succeeded in grabbing the +3000 sequences from my +35 million sequences.

If you know a way, please do share.

Thanks

**kmcarr** · 03-15-2011, 02:01 PM

Originally posted by Oyster View Post

Thanks kmcarr. Indeed, that post was exactly what I needed, yet I remain with a couple of questions.

'makeblastdb' with the '-parse_seqids' option seems like way to go to do this task.

However, I have 12 lanes of illumina data. Trying makeblastdb on one lane took me 3 days and I had to stop this because I need the RAM for some other work.

So I thought I'd go about it in a different way using the list of hits from blastn

blastn -db myDB -query myQuery -out myContigList.txt -outfmt "6 sallacc"

I have a text file with all my hits. finding these hits in the original fasta file. I tried out a bunch of scripts to do this, like those I've attached, but they all have failed.

I consistently get a "Segmentation fault" after the index builds to some 545 GB, taking up all of my hard drive's memory. Obviously there is something funny going on with the indexing.

The fasta file is straightforward:
>1
CTCAAATTATAGTAATATCACGATGCAGTTATATGATATTACATGTGATACGTAAAAAAAATTATTATTTGTGCTATAATCATCTTGTCATGTGAACATGGAATGAG
>2
TCTTTGCAAAATCTGTGTTAAATTGACAGAATAGTTTGTATTGTTCAACCAGGTAAAAATATATCACTCATTTTAATTCACTAAAAGTCACAAAATTTACTTTTCAT
>3
GACTTTTTGTTAACGTATTTAGCCACATAGAAACCAACAGCCATATAACTGGTAGCTTTAAGCGGCTCACCTTTAGCATCAACAGGCCACAACCAACCAGAACGTGA

yet it runs to 35 million or so

(The wrapping looks weird here but that is just due to the text box's properties.) I reformatted the fasta file using sreformat as I saw it mentioned elsewhere that this may help with indexing- still no luck.

Does anyone have an strategy to index large fasta files (~3.5GB) composed of some 35 million short reads? Once indexed properly, I assume it will be no problem to extract the sequences of interests, which number in the thousands for me.

Or maybe I should just have to stop cutting corners and let "makeblastdb" run for a few more days. Any suggestions?

Oyster,

I'm going to suggest that you may need to find an alternate approach to your problem. Creating and searching against a BLAST database of hundreds of millions of short reads is an impossible task, not to mention the output would be pretty uninformative. The short reads have no intrinsic information associated with them. What could you learn from this other than how many short reads are an approximate match to whatever query sequence you submit.

Perhaps you could describe what you are trying to accomplish and we could see if there is a more efficient approach.

**tomc** · 03-15-2011, 10:59 PM

You somehow know the identity of 3000+ sequences if interest
in your 3.5G fasta file and you would like to isolate them.

(more background information would help.)

you do not seem to be able to index the fasta file
because of some limitation. this is not encouraging.

(information of the system you are using would help)

You indicate the sequences in the fasta file have simple integers
for their defline and that they are short reads.
I am going to make a leap here and assume they continue to be ordered
beyond the first three you show. and that each short read
is all on its own line. (basically the defline)

furthermore, that you are able to get your list of sequences of interest
into sorted order one per line. maybe,

Code:

 sort -n random.list > sorted.list

notice that each fasta record n is at lines 2n-1 and 2*n
you could use your favorite scripting language to open the
fasta file and carefully seek to each record and output it
without any searching.

or use a command line tool like awk and change what you are
searching for as you go
*note this relies on both files being ordered*

Code:

awk 'BEGIN{i=0;while(getline a[i++]<"sorted.list">0){};i=0}a[i]==$1{i++;print;getline;print}' bigseqfile.fa

but if you are having problems indexing the fasta file
and they are resource related, you are likely going to
continue to have problems till you address that.

**mgallo2** · 02-16-2016, 12:08 PM

Hello-

(I posted this in another forum, as well, but this is similar):
I'm trying to 'extract' nucleotide sequences from my results after running a local blastn against my local database. The output format right now is a standard "blast-looking" result page, but it's not easy to work with the results when I want to further compare sequences. (I have a gene- and will have genes- of interest. I want to see how they compare to the local database of my sequences, but then I want the results in FASTA format for further analyses and comparisons). The help manual for local blast doesn't seem to have the answers I am looking for

Will gladly provide additional information if more is needed. I really thank anyone who is able to help!

**tomc** · 02-16-2016, 01:34 PM

Sequence of blast results

Lets try saying it this way.

Don't extract sequences from your blast results.

Extract a list of sequence identifiers (top hit perhaps) from your blast results.
Then use that list of identifiers to extract the fasta sequence from the blast _database_ your blast results are from.

That is:
There are command line tools that given an identifier (or list of identifiers) will return those fasta that went into making your "local blast database".

Topics	Statistics	Last Post
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, Yesterday, 05:37 AM	0 responses 6 views 0 reactions	Last Post by SEQadmin2 Yesterday, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 51 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 110 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM

Unconfigured Ad

extract full fasta file for local blast hits

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News