  • Duplicate removal without alignment to reference genome

    Hi guys,

    So our HiSeq data is showing a large number of duplicate sequences. I've come across tools like Picard MarkDuplicates or samtools rmdup, which remove duplicates; however, they seem to require alignment to a reference genome and use position information to perform the removal.

    Is there some way of performing duplicate removal without alignment to a reference? (We don't have a reference!) A naive pairwise comparison of all sequences to each other would probably take too much time, and wouldn't account for localized errors either, correct? Should I store all the sequences in a hashtable and perform a constant-time lookup for each sequence? Or am I missing an easier way of doing this?


    Thanks!
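    The hashtable idea from the question can be sketched in a few lines. This is only a toy illustration, assuming reads are plain sequence strings (FASTQ parsing, read names, and qualities are ignored); it is not one of the tools discussed in this thread:

```python
# Exact-duplicate removal with a hash set: no reference alignment needed.
# The first occurrence of each sequence is kept; later identical copies
# are dropped. Average-case cost is O(1) per read for the set lookup.

def dedup_exact(reads):
    """Return reads with exact sequence duplicates removed (first kept)."""
    seen = set()
    unique = []
    for seq in reads:
        if seq not in seen:      # constant-time membership test on average
            seen.add(seq)
            unique.append(seq)
    return unique

reads = ["ACGTACGT", "TTGGCCAA", "ACGTACGT", "TTGGCCAA", "ACGTACGA"]
print(dedup_exact(reads))  # ['ACGTACGT', 'TTGGCCAA', 'ACGTACGA']
```

    Note this only catches 100% identical sequences, which is exactly the limitation of the collapser tools mentioned below; near-duplicates with 1-2 errors survive.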

  • #2
    You could try a tool like fastx_collapser. It collapses identical sequences into unique reads and gives the count of each sequence in the header.



    • #3
      I believe that the fastx toolkit can do that with the collapser subcommand, but it will only identify 100% matches. Mothur also has a similar command, unique.seqs, but again it only looks at 100% identity.



      • #4
        That was quick! Thanks Vivek, I'll try fastx collapser.

        From a quick Google search, it looks like this tool can only collapse completely identical reads. I'd still like to figure out how to deal with the situation where two reads differ by only 1-2 nucleotides due to sequencing error; I would want to collapse these 'almost-duplicates' as well.


        Also, just out of curiosity: in principle, fastx_collapser could do the duplicate-removal job for everyone, regardless of whether they have access to a reference genome, right? Why were tools like MarkDuplicates and rmdup created at all? They require an additional, time-consuming alignment step as well.



        • #5
          Originally posted by mcnelson.phd View Post
          I believe that the fastx toolkit can do that with the collapser subcommand, but it will only identify 100% matches. Mothur also has a similar command, unique.seqs, but again it only looks at 100% identity.
          Thanks mcnelson, I missed your reply. I'll have a go at mothur too, but like you said, I need a fix for the sub-100% match case as well.



          • #6
            Originally posted by curious.genome View Post
            I might still want to figure out how to deal with a situation where two reads differ only by 1-2 nucleotides due to error - I would want to collapse these 'almost-duplicates' as well.


            Also, just out of curiosity, in principle the fastx-collapser could do the duplicate removal job for everyone right, regardless of whether they have access to a reference genome or not ? Why were the tools MarkDups and rmdup created anyway ? They have an additional time consuming alignment step as well.
            The answer to both of those questions gets at the root of why you want to remove duplicate reads. There are two types of duplicate reads in sequencing libraries: true biological duplicates, and those that are due to PCR (or some other bias in library preparation). The reason some tools utilize a mapping step is to distinguish biological duplicates from PCR duplicates. By mapping to a reference, you can look at a coverage map and determine whether a sequence is represented multiple times in your genome/transcriptome/etc. (which is why you have duplicate reads), or whether it's simply b/c some reads were PCR-amplified more efficiently.

            On that basis, the question I'd pose to you is: why do you want to identify and remove reads that differ by 1-2 nt? Depending on what you plan on doing with your data, you may not want to remove those reads (or duplicates either, for that matter). If you really want to go that route, though, you can treat your data like amplicon data and perform sequence clustering with an identity cutoff that reflects 1 or 2 nt changes over the length of your reads (e.g. 0.99 for a 1-base difference in 100 bp reads). The best option I can recommend for doing that is probably usearch, since it's faster on HiSeq-sized datasets than other methods, but there are a number of programs (including mothur) that can do that for you.
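            To make the identity-cutoff idea concrete, here is a toy greedy sketch that collapses reads within a Hamming-distance cutoff. Real tools like usearch and mothur use much smarter indexing and heuristics; this is only an illustration of the concept, assuming equal-length, indel-free reads:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def collapse_near_duplicates(reads, max_mismatches=2):
    """Greedy clustering: assign each read to the first existing
    representative within max_mismatches; otherwise start a new cluster.
    Returns {representative: member_count}. Result is order-dependent,
    like most greedy clustering schemes."""
    clusters = {}
    for seq in reads:
        for rep in clusters:
            if len(rep) == len(seq) and hamming(rep, seq) <= max_mismatches:
                clusters[rep] += 1
                break
        else:  # no representative was close enough
            clusters[seq] = 1
    return clusters

reads = ["ACGTACGTAA", "ACGTACGTAT", "TTTTGGGGCC", "ACGTACGTAA"]
print(collapse_near_duplicates(reads))
# {'ACGTACGTAA': 3, 'TTTTGGGGCC': 1}
```

            Note the cost is O(n x c) comparisons for n reads and c clusters, which is why purpose-built clustering tools are preferable on HiSeq-sized datasets.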



            • #7
              Originally posted by mcnelson.phd View Post
              The answer to both of those questions gets at the root of why you want to remove duplicate reads. There are two types of duplicate reads in sequencing libraries: true biological duplicates, and those that are due to PCR (or some other bias in library preparation). The reason some tools utilize a mapping step is to distinguish biological duplicates from PCR duplicates. By mapping to a reference, you can look at a coverage map and determine whether a sequence is represented multiple times in your genome/transcriptome/etc. (which is why you have duplicate reads), or whether it's simply b/c some reads were PCR-amplified more efficiently.

              On that basis, the question I'd pose to you is: why do you want to identify and remove reads that differ by 1-2 nt? Depending on what you plan on doing with your data, you may not want to remove those reads (or duplicates either, for that matter). If you really want to go that route, though, you can treat your data like amplicon data and perform sequence clustering with an identity cutoff that reflects 1 or 2 nt changes over the length of your reads (e.g. 0.99 for a 1-base difference in 100 bp reads). The best option I can recommend for doing that is probably usearch, since it's faster on HiSeq-sized datasets than other methods, but there are a number of programs (including mothur) that can do that for you.


              Thanks mcnelson! For now, all I plan to do with the reads is assembly. I am not interested in variants, etc. I planned on removing the duplicates to help the assembler. Is this a bad idea?



              • #8
                I'd advise against removing duplicates for your first pass at the assembly, because the k-mer abundance profile may get skewed, which can cause your assemblies to be wrong. Most assemblers for Illumina data are very tolerant of errors, and they can do an amazing job with very poor input data. After you get your initial assembly, it might be worth seeing whether removing duplicate and/or erroneous reads improves things, but what often happens is that you get a poorer assembly b/c you've reduced the volume of data you have to work with, and thus the assembler has a harder time correctly resolving the graph.
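                A toy illustration of how dedup flattens the k-mer abundance signal that assemblers rely on (synthetic reads and a plain counter, not a real assembler's k-mer model):

```python
from collections import Counter

def kmer_counts(reads, k=4):
    """Count every k-mer across a collection of reads."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Three identical reads (could be real coverage OR PCR duplicates)
# plus one unrelated read.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTGGAACC"]

full = kmer_counts(reads)        # abundance profile before dedup
dedup = kmer_counts(set(reads))  # abundance profile after exact dedup

print(full["ACGT"], dedup["ACGT"])  # 6 2 -- dedup flattens the signal
```

                If those three copies were genuine coverage, the deduplicated profile now understates the abundance of that region, which is the skew being warned about above.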



                • #9
                  Thanks for the sound advice, mcnelson. I'll go ahead with the assembly for now.
