View Single Post
Old 12-04-2019, 12:06 PM   #1
sdmoore
Member
 
Location: Florida

Join Date: Jun 2014
Posts: 11
Default Recover only longest version of sequence from multiple sequence fasta file - help

Greetings,
I have a large list of fasta sequences (not paired end) from which I want to isolate the longest version of each "sequence" (obviously the longest one has a different sequence). For example:

input.fa:
>Seq_1
AAACCCGGGTTT

Seq_2
CCCGGG

output_keep.fa:
>Seq_1
AAACCCGGGTTT

output_ditch.fa:
>Seq_2
CCCGGG

I tried a number of approaches using a mock dataset containing mismatches, exact copies, and known truncations from either end of the long form - "test_dups.fa":

(1) dedupe.sh in SAMtools
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa

this didn't work: its output didn't make sense, and an exact duplicate pair was removed completely (didn't leave behind one of them, which is bad news).

(2) with the ac=f option
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa ac=f

that only removed the exact copy pair, again not leaving one copy in file.

(3) clustering by overlap
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa ac=f mo=50 c pc csf=stats.txt outbest=test_best.fa mcs=1

didn't work (retained were not longest examples).

Then I moved on to VSEARCH, to sort by size, then form clusters and output the "seed":

(4) vsearch clustering
--cluster_fast dups.fa --centroids test_unique_1.0.fa --id 1.0
oddly, this kept a truncation of the longest sequence, but was close...

(5) using the iddef = 0, to ignore end gaps
vsearch --cluster_fast dups.fa --centroids test_unique_1.0.fa --id 1.0 --iddef 0

this did not ignore the end gaps penalties as expected.

(6) Align two seqs and see what the perceived idenity is
vsearch --allpairs_global dups_9_10.fa --acceptall --alnout test_aln.fa --iddef 0

the resulting alignment shows 100% identity, as expected with the iddef 0 option.

Why are those sequences not in the same cluster then?

Searching for help on "remove duplicates" is a disaster here, so I hope I can get some help. One consideration is to merge the sequences with 100% identity for the "overlap", but those tools merge from one end of the read, as is used to combine paired end reads. The other consideration is spending a day writing a script. I reasoned I can not be the first person to do this.
sdmoore is offline   Reply With Quote