SEQanswers > Bioinformatics

04-07-2016, 05:58 AM   #1
Location: South Africa

Join Date: Sep 2013
Posts: 12
Dedupe on assembled RNA-Seq?


I am trying to get rid of "redundant" sequences from a Trinity assembly. I used dedupe to get rid of duplicates in the Illumina source files and got a very good result.

After assembling with Trinity I get 85497 output sequences. If I cluster these with cd-hit at 95% I get 69413 clusters (mostly with >99% identity).

How can I extract a single sequence from each cluster (the longest I assume)? I'm not sure how to go from the cd-hit clstr file to getting the largest sequence of each cluster out of my assembled fasta file...
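For anyone with the same question, here is a minimal Python sketch of one way to do this step, not a tested pipeline. It assumes the usual .clstr layout (a ">Cluster N" line followed by member lines such as "0	1523nt, >name... *") and that the names in the .clstr file were not truncated (e.g. cd-hit was run with -d 0). The function names and the sample records are made up for illustration. As far as I know, the FASTA file that cd-hit writes to the -o path already holds one representative per cluster, normally the longest member, so this is mainly useful if you want to re-derive that set yourself.

```python
import re

# Matches one member line of a cd-hit .clstr file, e.g.
# "0\t1523nt, >contig_A... *"  or  "1\t900nt, >contig_B... at +/99.50%"
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.")


def longest_per_cluster(clstr_lines):
    """Return the ID of the longest member of each cluster, in file order."""
    best = []  # one (length, seq_id) tuple per cluster
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            best.append((0, None))  # start a new cluster
            continue
        m = MEMBER_RE.match(line)
        if m is None or not best:
            continue  # skip blank or unrecognised lines
        length, seq_id = int(m.group(1)), m.group(2)
        if length > best[-1][0]:  # ties keep the first member seen
            best[-1] = (length, seq_id)
    return [seq_id for _, seq_id in best if seq_id is not None]


def filter_fasta(fasta_lines, keep_ids):
    """Yield the lines of FASTA records whose ID (first word of the header) is in keep_ids."""
    keep = set(keep_ids)
    writing = False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            writing = bool(parts) and parts[0] in keep
        if writing:
            yield line


# Tiny made-up example (hypothetical contig names, not real data).
clstr = """>Cluster 0
0\t1523nt, >contig_A... *
1\t900nt, >contig_B... at +/99.50%
>Cluster 1
0\t700nt, >contig_C... *
""".splitlines()

fasta = [">contig_A len=1523", "ACGT", ">contig_B", "GGCC", ">contig_C", "TTAA"]

keep_ids = longest_per_cluster(clstr)
kept = list(filter_fasta(fasta, keep_ids))
print(keep_ids)
print("\n".join(kept))
```

On real files, the same two functions can be fed open file handles (for example the .clstr file and the assembled FASTA), writing each yielded line to a new FASTA file.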

I tried to use dedupe on the assembled file but it only removed 2 sequences (which I assume were identical). What flag would I set to remove duplicates at the 99% identity level?

Thank you in advance.
DrYak
04-07-2016, 06:58 AM   #2
Location: South Africa

Join Date: Sep 2013
Posts: 12


Well, I found (to my chagrin) that cd-hit has an aux tools package containing the cd-hit-dup tool.

I do not, however, get the same results using cd-hit-est and cd-hit-dup.

If I use cd-hit-est with the following parameters:

cd-hit-est -i in.fasta -o out -c 0.95 -n 10 -d 0 -T 20

I get 85497 finished 69413 clusters

i.e. 69413 clusters from 85497 starting sequences.

If I use cd-hit-dup with the following parameters:

cd-hit-dup -i in.fasta -o out-nodupes.fasta -m false -e 0.05 -f true

As far as I know, this should apply the same similarity cut-off (95%) and remove smaller sequences (-m false) and chimeras. With it I get:

Number of reads: 85497
Number of clusters found: 82927
Number of chimeric clusters found: 6

i.e. 82921 clusters from 85497 starting sequences.
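To put the two runs side by side, here is a small Python sketch of the arithmetic. It uses only the counts quoted in this post; the helper name is made up for illustration.

```python
def removed(start, remaining):
    """Return (number removed, percent removed) for one clustering run."""
    n = start - remaining
    return n, 100.0 * n / start


# Counts taken from the two runs above.
est_removed, est_pct = removed(85497, 69413)  # cd-hit-est, -c 0.95
dup_removed, dup_pct = removed(85497, 82921)  # cd-hit-dup, chimeric clusters excluded

print(f"cd-hit-est: {est_removed} sequences removed ({est_pct:.1f}%)")
print(f"cd-hit-dup: {dup_removed} sequences removed ({dup_pct:.1f}%)")
```

So cd-hit-est collapses roughly 19% of the sequences, while cd-hit-dup collapses only about 3%.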

Can someone suggest an explanation for such a huge difference?

Thanks in advance.
DrYak
04-07-2016, 07:05 AM   #3
Senior Member
Location: uk

Join Date: Mar 2009
Posts: 667

I think what you want is software that calls a consensus sequence from each cluster, rather than dedupe.
mastal

bbmap, cd-hit, rna-seq advice
