**jollymrt** · 05-09-2012, 04:11 PM

The duplicate entry error can be removed by selecting only the distinct rows in the similarSequences table.

**guyleonard** · 01-07-2013, 03:11 AM

Originally posted by jollymrt:

The duplicate entry error can be removed by selecting only the distinct rows in the similarSequences table.

Never mind. There was a duplicate or two that I had missed. Sorting and then running 'uniq' works, but you have to use the -w option with a number (I used 40) to limit the comparison to just the accession. Sometimes the duplicates had different score values and so were effectively unique... Phew.

Any chance you could expand on that?

I have the same error as the OP; my file is 6.5 GB.

I've gone through the file and removed duplicates...or at least I thought I had.

I managed to find a list of duplicate accessions and removed them from similarsequences with AWK. I then sorted the file on the first column and used uniq to remove any adjacent duplicates...

Every time the same error:
Duplicate entry 'didi|DDB_G0279353-didi|DDB_G0283451' for key 'better_hit_ix' at /home/cs02gl/programs/orthomclSoftware-v2.0.3/bin/orthomclPairs line 693, <F> line 14.

Looking for those accessions in the file (using grep) reveals no duplicates for that pair.
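For anyone who wants to check the file itself rather than the database, here is a minimal Python sketch that counts (query, subject) pairs. It assumes the similarSequences file is tab-delimited with the query ID in column 1 and the subject ID in column 2. The function name `duplicate_pairs` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def duplicate_pairs(lines):
    """Return {(query_id, subject_id): count} for pairs that occur more than once."""
    reader = csv.reader(lines, delimiter="\t")
    # Only the first two columns are compared, so rows that differ
    # in their score columns still count as the same pair.
    counts = Counter((row[0], row[1]) for row in reader if len(row) >= 2)
    return {pair: n for pair, n in counts.items() if n > 1}


# Hypothetical rows: the two A/B hits differ only in their score columns.
sample = io.StringIO(
    "sp1|A\tsp1|B\tsp1\tsp1\t1\t-50\t98.0\t100\n"
    "sp1|A\tsp1|B\tsp1\tsp1\t3\t-48\t97.5\t100\n"
    "sp1|A\tsp1|C\tsp1\tsp1\t2\t-20\t60.0\t80\n"
)
dups = duplicate_pairs(sample)
print(dups)
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` sample. The sketch keeps every pair in memory, so for a multi-gigabyte file a sort-based approach may be more practical.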

**jollymrt** · 01-07-2013, 10:37 AM

To check whether you have duplicate entries, use the following command:

select * from similarSequences group by query_id,subject_id having count(*)>1;

This command will give you the rows that are duplicated.

Then you can create a new table that contains only the distinct rows:

create table holdup as select distinct * from similarSequences;
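A small, self-contained way to see what these two statements do is to run them against a toy table. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL, which is an assumption made so the example runs anywhere. The three-column table and its rows are invented; the real similarSequences table has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema for illustration only.
conn.execute(
    "CREATE TABLE similarSequences (query_id TEXT, subject_id TEXT, evalue_exp INTEGER)"
)
rows = [
    ("sp1|A", "sp1|B", -50),
    ("sp1|A", "sp1|B", -50),  # exact duplicate of the row above
    ("sp1|A", "sp1|C", -20),
    ("sp1|A", "sp1|C", -18),  # same pair, different score
]
conn.executemany("INSERT INTO similarSequences VALUES (?, ?, ?)", rows)

# Pairs that occur more than once, with their counts.
find_dups = (
    "SELECT query_id, subject_id, COUNT(*) FROM {table} "
    "GROUP BY query_id, subject_id HAVING COUNT(*) > 1 ORDER BY subject_id"
)
dup_pairs = conn.execute(find_dups.format(table="similarSequences")).fetchall()

# Keep only fully distinct rows in a new table, as suggested in the post.
conn.execute("CREATE TABLE holdup AS SELECT DISTINCT * FROM similarSequences")
remaining = conn.execute(find_dups.format(table="holdup")).fetchall()

print(dup_pairs)
print(remaining)
```

In this toy data the A/C pair is still listed after `select distinct *`, because its two rows differ in the score column. That is the same situation described earlier in the thread, where duplicates with different score values were "effectively unique".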

**robinvvelzen** · 09-24-2013, 01:20 AM

I am also having the same errors as the OP.

Apart from trying ways to fix it, I am wondering what causes the duplicate error in the orthoMCL pipeline. Given that duplicates are to be expected after an all-vs-all BLAST, I would expect the orthoMCL scripts to deal with them appropriately.

Is this a matter of orthomclBlastParser, orthomclLoadBlast or orthomclPairs not doing a proper job?

Originally posted by jollymrt:

To check whether you have duplicate entries, use the following command:

select * from similarSequences group by query_id,subject_id having count(*)>1;

This command will give you the rows that are duplicated.

Then you can create a new table that contains only the distinct rows:

create table holdup as select distinct * from similarSequences;

Thanks for the help, but I have a few questions:
1. Looking at those commands, I assume they are to be executed within MySQL. Is that correct?
2. Will they simply replace the similarSequences table with a copy of itself with duplicates removed, or do I need to do more to be able to continue the analysis?
3. Will changing the table midway through the orthoMCL pipeline not compromise the analysis?

Thanks again!

**jollymrt** · 09-24-2013, 06:42 AM

Yes, the commands have to be executed in MySQL.
First you need to select the distinct rows into a different table (the holdup table), then delete all the rows in the similarSequences table, and then insert the rows from the holdup table back into it. I hope this clears it up.
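The three steps can be sketched end to end as follows. Again this uses Python's `sqlite3` in place of MySQL so that it runs without a server, and the table layout, the rows, and the function name `dedupe_similar_sequences` are all made up for illustration.

```python
import sqlite3


def dedupe_similar_sequences(conn):
    """Copy distinct rows to holdup, empty similarSequences, then refill it."""
    count = "SELECT COUNT(*) FROM similarSequences"
    before = conn.execute(count).fetchone()[0]
    # Step 1: distinct rows go into the holdup table.
    conn.execute("CREATE TABLE holdup AS SELECT DISTINCT * FROM similarSequences")
    # Step 2: empty the original table (its definition stays in place).
    conn.execute("DELETE FROM similarSequences")
    # Step 3: copy the distinct rows back.
    conn.execute("INSERT INTO similarSequences SELECT * FROM holdup")
    conn.commit()
    after = conn.execute(count).fetchone()[0]
    return before, after


conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema and rows.
conn.execute(
    "CREATE TABLE similarSequences (query_id TEXT, subject_id TEXT, evalue_exp INTEGER)"
)
conn.executemany(
    "INSERT INTO similarSequences VALUES (?, ?, ?)",
    [("sp1|A", "sp1|B", -50), ("sp1|A", "sp1|B", -50), ("sp1|B", "sp1|A", -50)],
)
before, after = dedupe_similar_sequences(conn)
print(before, after)
```

Deleting and re-inserting, rather than renaming holdup, leaves the original table definition untouched, so later pipeline steps see the same table they expect. The sketch leaves the holdup table in place afterwards.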

**robinvvelzen** · 09-24-2013, 06:48 AM

Originally posted by jollymrt:

Yes, the commands have to be executed in MySQL.
First you need to select the distinct rows into a different table (the holdup table), then delete all the rows in the similarSequences table, and then insert the rows from the holdup table back into it. I hope this clears it up.

Yes, it does. Thanks very much!

In the meantime I had redone the whole orthoMCL pipeline and for some reason got no more duplicate errors. Possibly some table entries were accidentally copied the last time. This is just to inform others who may run into the same problem.

OrthoMCL duplicate entry error