  • multiple reads having the same sequence...

    Hi guys, I have a question about the source of the multiple identical reads that are generated during Solexa sequencing. In our runs we currently get around 12 million clusters, which are then filtered (apparently by a read "purity" threshold as well as by their alignment to the corresponding genome, but I'm not 100% sure about that) to around 6 million reads aligning to unique sites in the studied genome. A further filtering step then removes reads containing more than 2 mismatches as well as multiple identical reads. When we look at the fraction of these (for me unexpected) multiple identical reads, we find that they are more frequent than the mismatched reads, but I don't understand where they come from. Since the fragmentation process for ChIP assays is completely random, it seems quite unlikely to me to get fragments with the same ends (I mean the DNA ends that are sequenced). Have you seen a similar problem, and do you know the source of these multiple identical reads? Furthermore, we noticed by accident that if the initial number of clusters is lower (around 7 million), the fraction of multiple identical reads drops significantly, although for the moment we don't know whether that is pure coincidence. Thanks for your hints.
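
To put a number on "quite unlikely", here is a minimal sketch of the duplicate fraction expected by chance alone. It assumes reads start uniformly at random over all possible sites, which ChIP enrichment certainly violates, so treat it as a lower bound rather than a model of a real library. The genome size (3e9 bp, two strands) and the read counts are illustrative assumptions, not values from this thread.

```python
import math

def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads that repeat an already-seen start site
    when n_reads are drawn uniformly at random from n_positions sites."""
    # Expected number of distinct sites hit at least once:
    # n_positions * (1 - exp(-n_reads / n_positions)).
    distinct = n_positions * -math.expm1(-n_reads / n_positions)
    return 1.0 - distinct / n_reads

# Assumed values for illustration only: 3e9 bp genome, both strands.
positions = 2 * 3e9
for reads in (6_000_000, 12_000_000):
    fraction = expected_duplicate_fraction(reads, positions)
    print(f"{reads} reads: {fraction:.5f} expected chance duplicates")
```

For small ratios of reads to sites the result is close to half that ratio, so chance duplicates grow with depth but stay rare under these assumptions. A much higher observed fraction points to something other than random coincidence.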

  • #2
    Interesting question. In ChIP-seq we often see "odd" stuff, including biases toward clearly unexpected regions of the genome. That often means large "peaks" in centromeres, or just large stacks of duplicates.

    However, while we don't know the sources of all of this "odd" stuff, we can account for most of it with good controls. (I doubt that the fragmentation is completely random, though, regardless of which method you use...)

    If you're looking for other sources: many groups do a PCR step on their DNA before sequencing, which might preferentially amplify some fragments. And of course you are isolating DNA from a large population of cells, so it's possible that you're just pulling down a lot of material from a whole collection of cells where that signal is strong.

    Anyhow, I would also suggest that how your pipeline handles the reads makes a difference. You don't specify the aligner or the filtering techniques being used, which makes it really hard to get to the bottom of what you're seeing.

    Good luck making sense of your data!
    The more you know, the more you know you don't know. —Aristotle


    • #3
      Seems we have a similar problem.
      I have several identical reads, and of course they map to the same position.
      When I analyze 454 data, I keep one and remove the others, because they are likely caused by some technical problem.
      But for Solexa data, I don't know of any reason that would justify removing them.


      • #4
        Another source: the current human genome sequence is imperfect. There are likely sequences that are in fact repeats but do not appear as such in the current genome assembly.

        If we see "read towers", we regard them as artefacts until proven otherwise.


        • #5
          During library construction (454, Illumina, etc.) almost all protocols include a PCR amplification stage, if only to get enough material to sequence. Unless you are expecting them, I would remove any exactly identical sequence reads if they are going to affect downstream analysis. Removing reads may sound like a bad thing, but we have found that the bias caused by keeping replicated reads can be huge (and muddies an already muddy pool!). So although it is conservative and may remove useful data, without any way to prove the reads come from independent sources, I would always remove them. You might consider barcoding your library when you amplify (easy to do); at least this way, any identical but independently produced sequences will be separable.
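
A minimal sketch of the "keep one copy, drop the rest" idea, for readers who want to try it on a list of reads. The function name and the toy sequences and barcodes are hypothetical. Real pipelines usually deduplicate on mapped coordinates rather than raw sequence, so this only illustrates the principle. Passing (barcode, sequence) tuples instead of bare sequences keeps identical sequences from different barcodes apart.

```python
from collections import Counter

def collapse_duplicates(records):
    """Keep the first copy of each distinct record.

    records: list of hashable keys, e.g. sequence strings or
    (barcode, sequence) tuples. Returns (unique_records, counts).
    """
    counts = Counter(records)
    unique = list(dict.fromkeys(records))  # preserves first-seen order
    return unique, counts

# Toy data (made up): three copies of one read, two singletons.
reads = ["ACGTACGT", "TTGACCAA", "ACGTACGT", "ACGTACGT", "GGCATTCA"]
unique, counts = collapse_duplicates(reads)
dup_fraction = 1 - len(unique) / len(reads)
print(unique, dup_fraction)

# With hypothetical barcodes, identical sequences stay separable.
tagged = [("BC1", "ACGTACGT"), ("BC2", "ACGTACGT"), ("BC1", "ACGTACGT")]
unique_tagged, _ = collapse_duplicates(tagged)
print(unique_tagged)
```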

          Sorry for the long post... ...


          • #6
            ieuanclay is correct. The duplication is caused by the library prep steps. We've found that lowering the number of PCR cycles, or doing a two-stage PCR instead, gives less duplication. So basically you get so much sequence that you end up sequencing two PCR products of the same original fragment.

            It only works for paired-end sequencing, but I judge library diversity by looking at the number of identical paired-end reads (exactly the same start and end for the pair). Whether you want to remove them or not is up to you; for a low-diversity library they can cause spurious SNP calls and the like, depending on the algorithm and the PCR fidelity.
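
The paired-end diversity check described above can be sketched roughly as follows. It assumes each aligned pair has already been reduced to a (chromosome, fragment start, fragment end) tuple; the function name and the coordinates are made up for illustration and do not come from any particular tool.

```python
from collections import Counter

def pair_duplication_rate(pairs):
    """Fraction of read pairs that repeat an already-seen fragment.

    pairs: iterable of (chrom, start, end) tuples, one per aligned pair.
    """
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Each distinct fragment counts once; every extra copy is a duplicate.
    return 1 - len(counts) / total

# Hypothetical fragment coordinates: one fragment seen three times.
pairs = [
    ("chr1", 1000, 1210),
    ("chr1", 1000, 1210),
    ("chr1", 1000, 1215),  # same start, different end: not a duplicate
    ("chr2", 5000, 5180),
    ("chr1", 1000, 1210),
]
rate = pair_duplication_rate(pairs)
print(f"duplicate pair rate: {rate:.2f}")
```

A low rate suggests a diverse library; a high rate suggests that many pairs are PCR copies of the same fragment.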

            And the purity filter doesn't work on alignment, just call quality. Think of it like trimming away the bad phred scores.


            • #7
              duplicates in ChIPSeq

              Hello,

              I have exactly the same problem but only found this thread just now.
              Please look at - http://seqanswers.com/forums/showthread.php?t=2592

              Many thanks for your help, it is much appreciated!!!

              tec


              • #8
                multiple reads having the same sequence...

                Hello all,

                The problem with duplicate reads still keeps me busy.
                We therefore performed a TOPO cloning and resequencing check of the library.
                Surprisingly, over 75% of the clones were unique, which doesn't agree with the sequencing run!

                Does anyone have an idea???

                Thanks! tec
