  • CD-HIT Doesn't report the total number of sequences correctly, any known fix?

    I am using CD-HIT to reduce redundancy in a dataset of 20405 peptides. CD-HIT seems to work fine, but it reports only 18404 sequences, as shown in the output below:
    Code:
    Program: CD-HIT, V4.8.1 (+OpenMP), Nov 13 2019, 13:22:53
    Command: cd-hit -i BD_final_con_nombres.fasta -o BDpos.fa -c
             0.9 -g 1 -T 0 -M 0 -n 5
    
    Started: Mon Dec 16 16:51:57 2019
    ================================================================
                                Output                              
    ----------------------------------------------------------------
    total number of CPUs in the system is 12
    Actual number of CPUs to be used: 12
    
    total seq: 18404
    longest and shortest : 300 and 11
    Total letters: 737624
    Sequences have been sorted
    
    Approximated minimal memory consumption:
    Sequence        : 3M
    Buffer          : 12 X 10M = 129M
    Table           : 2 X 65M = 131M
    Miscellaneous   : 0M
    Total           : 263M
    
    Table limit with the given memory limit:
    Max number of representatives: 744016
    Max number of word counting entries: 14908239
    
    # comparing sequences from          0  to       1314
    .---------- new table with      840 representatives
    # comparing sequences from       1314  to       2534
    ----------    994 remaining sequences to the next cycle
    ---------- new table with      187 representatives
    # comparing sequences from       1540  to       2744
    ----------   1023 remaining sequences to the next cycle
    ---------- new table with      117 representatives
    # comparing sequences from       1721  to       2912
    ----------   1010 remaining sequences to the next cycle
    ---------- new table with      110 representatives
    # comparing sequences from       1902  to       3080
    ----------    996 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from       2084  to       3249
    ----------    962 remaining sequences to the next cycle
    ---------- new table with      123 representatives
    # comparing sequences from       2287  to       3438
    ----------    953 remaining sequences to the next cycle
    ---------- new table with      116 representatives
    # comparing sequences from       2485  to       3622
    ----------    958 remaining sequences to the next cycle
    ---------- new table with      117 representatives
    # comparing sequences from       2664  to       3788
    ----------    935 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from       2853  to       3963
    ----------    932 remaining sequences to the next cycle
    ---------- new table with      124 representatives
    # comparing sequences from       3031  to       4129
    ----------    891 remaining sequences to the next cycle
    ---------- new table with      113 representatives
    # comparing sequences from       3238  to       4321
    ----------    700 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from       3621  to       4676
    ----------    844 remaining sequences to the next cycle
    ---------- new table with      115 representatives
    # comparing sequences from       3832  to       4872
    ----------    822 remaining sequences to the next cycle
    ---------- new table with      154 representatives
    # comparing sequences from       4050  to       5075
    ----------    760 remaining sequences to the next cycle
    ---------- new table with      127 representatives
    # comparing sequences from       4315  to       5321
    ----------    768 remaining sequences to the next cycle
    ---------- new table with      138 representatives
    # comparing sequences from       4553  to       5542
    ----------    737 remaining sequences to the next cycle
    ---------- new table with      118 representatives
    # comparing sequences from       4805  to       5776
    ----------    727 remaining sequences to the next cycle
    ---------- new table with      111 representatives
    # comparing sequences from       5049  to       6002
    ----------    707 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from       5295  to       6231
    ----------    651 remaining sequences to the next cycle
    ---------- new table with      127 representatives
    # comparing sequences from       5580  to       6496
    ----------    629 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from       5867  to       6762
    ----------    563 remaining sequences to the next cycle
    ---------- new table with      115 representatives
    # comparing sequences from       6199  to       7070
    ----------    585 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from       6485  to       7336
    ----------    521 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from       6815  to       7642
    ----------    545 remaining sequences to the next cycle
    ---------- new table with      116 representatives
    # comparing sequences from       7097  to       7904
    ----------    514 remaining sequences to the next cycle
    ---------- new table with      127 representatives
    # comparing sequences from       7390  to       8176
    ----------    550 remaining sequences to the next cycle
    ---------- new table with      110 representatives
    # comparing sequences from       7626  to       8395
    ----------    551 remaining sequences to the next cycle
    ---------- new table with      123 representatives
    # comparing sequences from       7844  to       8598
    ----------    529 remaining sequences to the next cycle
    ---------- new table with      118 representatives
    # comparing sequences from       8069  to       8807
    ----------    465 remaining sequences to the next cycle
    ---------- new table with      139 representatives
    # comparing sequences from       8342  to       9060
    ----------    438 remaining sequences to the next cycle
    ---------- new table with      140 representatives
    # comparing sequences from       8622  to       9320
    ----------    431 remaining sequences to the next cycle
    ---------- new table with      130 representatives
    # comparing sequences from       8889  to       9568
    ----------    392 remaining sequences to the next cycle
    ---------- new table with      117 representatives
    # comparing sequences from       9176  to       9835
    ----------    377 remaining sequences to the next cycle
    ---------- new table with      114 representatives
    # comparing sequences from       9458  to      10097
    ----------    364 remaining sequences to the next cycle
    ---------- new table with      130 representatives
    # comparing sequences from       9733  to      10352
    ----------    373 remaining sequences to the next cycle
    ---------- new table with      122 representatives
    # comparing sequences from       9979  to      10580
    ..........    10000  finished       5044  clusters
    ----------    326 remaining sequences to the next cycle
    ---------- new table with      113 representatives
    # comparing sequences from      10254  to      10836
    ----------    296 remaining sequences to the next cycle
    ---------- new table with      124 representatives
    # comparing sequences from      10540  to      11101
    ----------    285 remaining sequences to the next cycle
    ---------- new table with      107 representatives
    # comparing sequences from      10816  to      11358
    ----------    260 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      11098  to      11619
    ----------    245 remaining sequences to the next cycle
    ---------- new table with      130 representatives
    # comparing sequences from      11374  to      11876
    ----------    277 remaining sequences to the next cycle
    ---------- new table with      157 representatives
    # comparing sequences from      11599  to      12085
    ----------    246 remaining sequences to the next cycle
    ---------- new table with      146 representatives
    # comparing sequences from      11839  to      12307
    ----------    223 remaining sequences to the next cycle
    ---------- new table with      146 representatives
    # comparing sequences from      12084  to      12535
    ----------    225 remaining sequences to the next cycle
    ---------- new table with      128 representatives
    # comparing sequences from      12310  to      12745
    ----------    225 remaining sequences to the next cycle
    ---------- new table with      117 representatives
    # comparing sequences from      12520  to      12940
    ----------    184 remaining sequences to the next cycle
    ---------- new table with      108 representatives
    # comparing sequences from      12756  to      13159
    ----------    190 remaining sequences to the next cycle
    ---------- new table with      131 representatives
    # comparing sequences from      12969  to      13357
    ----------    180 remaining sequences to the next cycle
    ---------- new table with      122 representatives
    # comparing sequences from      13177  to      13550
    ----------    154 remaining sequences to the next cycle
    ---------- new table with      129 representatives
    # comparing sequences from      13396  to      13753
    ----------    167 remaining sequences to the next cycle
    ---------- new table with      102 representatives
    # comparing sequences from      13586  to      13930
    ----------    149 remaining sequences to the next cycle
    ---------- new table with      115 representatives
    # comparing sequences from      13781  to      14111
    ----------    143 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      13968  to      14284
    ----------     99 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      14185  to      14486
    ----------    112 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      14374  to      14661
    ----------     69 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      14592  to      14864
    ----------     78 remaining sequences to the next cycle
    ---------- new table with      118 representatives
    # comparing sequences from      14786  to      15044
    ----------     76 remaining sequences to the next cycle
    ---------- new table with      115 representatives
    # comparing sequences from      14968  to      15213
    ----------     72 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      15141  to      15374
    ----------     51 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      15323  to      15543
    ----------     53 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      15490  to      15698
    ----------      9 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      15689  to      15882
    ....................---------- new table with       89 representatives
    # comparing sequences from      15882  to      16062
    ----------      1 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      16061  to      16228
    ----------      2 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      16226  to      16381
    ----------     11 remaining sequences to the next cycle
    ---------- new table with      100 representatives
    # comparing sequences from      16370  to      16515
    ..................---------- new table with       90 representatives
    # comparing sequences from      16515  to      16649
    ...................---------- new table with       77 representatives
    # comparing sequences from      16649  to      16774
    ..................---------- new table with       73 representatives
    # comparing sequences from      16774  to      16890
    ...................---------- new table with       57 representatives
    # comparing sequences from      16890  to      16998
    ..................---------- new table with       56 representatives
    # comparing sequences from      16998  to      17098
    ..................---------- new table with       59 representatives
    # comparing sequences from      17098  to      17191
    ...................---------- new table with       63 representatives
    # comparing sequences from      17191  to      17277
    .................---------- new table with       47 representatives
    # comparing sequences from      17277  to      17357
    ................---------- new table with       49 representatives
    # comparing sequences from      17357  to      17431
    ..................---------- new table with       42 representatives
    # comparing sequences from      17431  to      18404
    .....................---------- new table with      536 representatives
    
        18404  finished       9584  clusters
    
    Approximated maximum memory consumption: 265M
    writing new database
    writing clustering information
    program completed !
    I am sure that the FASTA file is correctly formatted, in the form:

    >header

    SEQUENCE

    Also the command:
    Code:
    grep -c '>' BD_final_con_nombres.fasta
    returns the correct number of peptides (20405).

    Does anybody know any way to fix this?
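
    One way to narrow this down beyond the grep count is to look at sequence lengths. The log reports the shortest sequence as 11, and CD-HIT has a `-l` option (length of throw-away sequences, default 10), so one possible explanation is that the missing records are short peptides dropped before clustering. This is not confirmed for this dataset. The sketch below counts FASTA records and splits them by a length cutoff; the function names are hypothetical, and the "length at or below the cutoff is dropped" rule is an assumption to check against `cd-hit -h` for your version.

```python
import sys
from typing import Iterable, List, Tuple

Record = Tuple[str, int]  # (header without '>', sequence length)


def read_fasta_lengths(lines: Iterable[str]) -> List[Record]:
    """Return (header, length) for every record, joining wrapped sequence lines."""
    records: List[Record] = []
    header = None
    length = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, length))
            header = line[1:]
            length = 0
        elif header is not None:
            length += len(line)
    if header is not None:
        records.append((header, length))
    return records


def split_by_cutoff(records: List[Record], cutoff: int = 10):
    """Split records into (kept, dropped).

    Assumption: records whose length is <= cutoff are the ones discarded.
    """
    kept = [r for r in records if r[1] > cutoff]
    dropped = [r for r in records if r[1] <= cutoff]
    return kept, dropped


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        recs = read_fasta_lengths(handle)
    kept, dropped = split_by_cutoff(recs, cutoff=10)
    print(f"total records : {len(recs)}")
    print(f"length > 10   : {len(kept)}")
    print(f"length <= 10  : {len(dropped)}")
```

    Saved under any file name and run with BD_final_con_nombres.fasta as its only argument, it prints the three counts. If the "length > 10" count matches the 18404 in the log, short peptides would account for the gap, and lowering `-l` (together with a word size `-n` suited to short sequences) would be the setting to experiment with.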
