  • DrYak
    Member
    • Sep 2013
    • 13

    Dedupe on assembled RNA-Seq?

    Hi,

    I am trying to remove "redundant" sequences from a Trinity assembly. I used dedupe.sh to remove duplicates from the Illumina source files and got a very good result.

    After assembling with trinity I get 85497 output sequences. If I cluster these with cd-hit at 95% I get 69413 clusters (mostly with >99% identity).

    How can I extract a single sequence from each cluster (the longest I assume)? I'm not sure how to go from the cd-hit clstr file to getting the largest sequence of each cluster out of my assembled fasta file...
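A minimal Python sketch of one way to do this, assuming the usual cd-hit .clstr layout: a `>Cluster N` header followed by member lines such as `0	2799nt, >name... *`. The function names are hypothetical helpers, not part of cd-hit. Note that cd-hit also writes one representative per cluster to the -o file, and that representative is normally the longest member, so it may already be the sequence wanted here.

```python
import re

# Member line in a .clstr file, e.g. "0\t2799nt, >seq_name... *"
# (assumed layout; "aa" replaces "nt" for protein runs)
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.")


def longest_per_cluster(clstr_lines):
    """Map each cluster header to the (name, length) of its longest member."""
    best = {}
    cluster = None
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            cluster = line[1:]
            continue
        match = MEMBER_RE.match(line)
        if match is None or cluster is None:
            continue  # skip anything that does not look like a member line
        length, name = int(match.group(1)), match.group(2)
        if cluster not in best or length > best[cluster][1]:
            best[cluster] = (name, length)
    return best


def read_fasta(lines):
    """Yield (name, sequence) pairs; name is the header up to the first space."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_fasta(fasta_lines, keep):
    """Return the FASTA records whose name is in the set `keep`."""
    return [(n, s) for n, s in read_fasta(fasta_lines) if n in keep]
```

To use it on real files, pass open file handles for the .clstr and the assembled FASTA, build `keep` from the names returned by `longest_per_cluster`, and write the filtered records back out. Names in the .clstr file are truncated unless cd-hit is run with -d 0, so the matching only works reliably with that setting.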

    I tried to use dedupe on the assembled file but it only removed 2 sequences (which I assume were identical). What flag would I set to remove duplicates at the 99% identity level?

    Thank you in advance.
  • DrYak
    Member
    • Sep 2013
    • 13

    #2
    Hi,

    Well, I found (to my chagrin) that cd-hit has an aux tools package containing the cd-hit-dup tool.

    I do not, however, get the same results using cd-hit-est and cd-hit-dup.

    If I use cd-hit-est with the following parameters:

    cd-hit-est -i in.fasta -o out -c 0.95 -n 10 -d 0 -T 20

    I get 85497 finished 69413 clusters

    i.e. 69413 clusters from 85497 starting sequences.

    If I use cd-hit-dup with the following parameters:

    cd-hit-dup -i in.fasta -o out-nodupes.fasta -m false -e 0.05 -f true

    which, as far as I know, should apply the same similarity cut-off (95%, via -e 0.05) and remove smaller sequences (-m false) and chimeras (-f true), I get:

    Number of reads: 85497
    Number of clusters found: 82927
    Number of chimeric clusters found: 6

    i.e. 82921 clusters (82927 minus the 6 chimeric clusters) from 85497 starting sequences.
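To put the two results side by side, here is a small Python sketch that computes how many sequences each run collapsed, using only the counts quoted in this post. The helper name `reduction` is hypothetical.

```python
def reduction(total, clusters):
    """Return (sequences removed, fraction of the input removed)."""
    removed = total - clusters
    return removed, removed / total


TOTAL = 85497  # sequences in the Trinity assembly

# Cluster counts reported above for each tool
for tool, clusters in [("cd-hit-est", 69413), ("cd-hit-dup", 82921)]:
    removed, fraction = reduction(TOTAL, clusters)
    print(f"{tool}: {removed} removed ({fraction:.1%})")
```

With these counts, cd-hit-est collapses 16084 sequences (about 18.8% of the input) while cd-hit-dup collapses 2576 (about 3.0%).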

    Can someone suggest an explanation for such a huge difference?

    Thanks in advance.


    • mastal
      Senior Member
      • Mar 2009
      • 666

      #3
      I think what you want is software that calls a consensus sequence from each cluster, rather than dedupe.
