Dear All
I am looking for a fast way to remove similar sequence reads or to find unique records.
I have 7 large FASTA files, each with about 2 × 10^6 short sequence reads. I have already collapsed each file (removed identical sequences) to reduce file size. Now I would like to compare two files at a time and find the sequences that are unique to each file. Removing identical records is easy, but in my case I would also like to detect (highly) similar sequences. I tried blastn -task blastn-short (because the sequences are <100 nt), but it takes a long time even with -num_threads 10. Megablast would be faster, but I think it will not work with short sequences. Any suggestions?
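For illustration, the exact-match part of this comparison (the easy part) could be sketched in Python roughly as follows. This is only a sketch under assumptions: the function names `read_fasta` and `unique_sequences` and the toy records are hypothetical, sequences are compared as uppercase strings, and both files are assumed to fit in memory. It does not address the similar-sequence case, which is the actual question.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def unique_sequences(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sets of exact sequence strings."""
    seqs_a = {seq for _, seq in read_fasta(lines_a)}
    seqs_b = {seq for _, seq in read_fasta(lines_b)}
    return seqs_a - seqs_b, seqs_b - seqs_a


if __name__ == "__main__":
    # Toy data standing in for two collapsed FASTA files (hypothetical).
    file_a = [">r1", "ACGT", ">r2", "GGGA"]
    file_b = [">x1", "ACGT", ">x2", "TTTC"]
    only_a, only_b = unique_sequences(file_a, file_b)
    print(sorted(only_a), sorted(only_b))
```

An open file handle can be passed in place of the toy lists, since the reader only iterates over lines. Set difference catches identical sequences only, so reads that differ by a single base are still reported as unique.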
Thank you very much. I am looking forward to your replies.