Hello,
I am using the R package bold to download COI sequences from BOLD, specifying the taxon of interest and the marker like this:
bold_seq(taxon='Chironomidae', marker='COI-5P', response=TRUE) as well as
bold_seq(taxon = 'Chironomidae', marker = "COX1", response = TRUE)
Both calls give me a FASTA file. When I tried to build a BLAST database from this file with makeblastdb, I got an error saying that duplicate IDs were present. I looked inside the FASTA file and found that the duplicates are sequences from the same specimen, one from COI-5P and another from CAD, for example:
>CHRSV056-08|Cricotopus glacialis|CAD
CTGGCGTCAAAAGTATAAGCTCGCTGAGTGGATGAAGAAGCACAACGTCGTTGGAATCAGTGGAATTGACACC...
>CHRSV056-08|Cricotopus glacialis|COI-5P|KC130785
AACATTATATTTTATTTTCGGGGCTTGATCAGGGATAGTAGGAACTTCCTTAAGAATCTTAATTCGAGCTGAA...
There are plenty of these, and I can't remove them one by one: the file contains about 160k sequences.
Any idea how I can remove the duplicated ID + sequence records from the FASTA file, or how to fix the R command so that it does not return sequences from other genes?
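To make clear what I mean by filtering the file, here is a rough base R sketch of what I am after. It is not a tested solution, and nothing in it comes from the bold package: `filter_fasta_by_marker` is a name I made up, and it assumes every header has the form `>processid|species|marker|accession`, as in the example above, with the marker always in the third `|`-separated field. It would not catch a process ID that appears twice for the same marker.

```r
# Hypothetical helper: keep only FASTA records whose header names the wanted marker.
# Assumption: headers are ">processid|species|marker[|accession]".
filter_fasta_by_marker <- function(lines, marker = "COI-5P") {
  is_header <- startsWith(lines, ">")
  # Number the records: every line gets the index of the header above it
  # (0 for any stray lines before the first header).
  record_id <- cumsum(is_header)

  # Pull the third "|"-separated field out of each header.
  fields <- strsplit(sub("^>", "", lines[is_header]), "|", fixed = TRUE)
  header_marker <- vapply(
    fields,
    function(f) if (length(f) >= 3) f[3] else NA_character_,
    character(1)
  )
  keep_record <- !is.na(header_marker) & header_marker == marker

  # Mark each line as kept if the record it belongs to is kept.
  keep_line <- rep(FALSE, length(lines))
  in_record <- record_id > 0
  keep_line[in_record] <- keep_record[record_id[in_record]]
  lines[keep_line]
}

# Intended use (file names are placeholders):
# fasta <- readLines("chironomidae.fasta")
# writeLines(filter_fasta_by_marker(fasta, "COI-5P"), "chironomidae_COI-5P.fasta")
```

Sequences that span several lines stay attached to their header, since each line is assigned to the most recent header above it.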
thanks