
Old 02-19-2009, 03:23 PM   #1
Junior Member
Location: strasbourg

Join Date: Oct 2008
Posts: 1
multiple reads having the same sequence...

Hi guys, I have a question about the source of the multiple identical reads generated during Solexa sequencing. In our runs we currently get around 12 million clusters, which are then filtered (apparently by a read "purity" threshold as well as by their alignment to the corresponding genome, but I'm not 100% sure about that) down to around 6 million reads aligning to unique sites in the studied genome. A further filtering step then removes reads containing more than 2 mismatches, as well as multiple identical reads.

When we look at the fraction of these (for me unexpected) multiple identical reads, we find that they are more frequent than the reads removed for mismatches, and I don't understand where they come from. Since fragmentation for ChIP assays is a completely random process, it seems quite unlikely to me that we would get fragments with the same tips (I mean the DNA ends that are sequenced). Have you seen a similar problem, and do you know the source of these multiple identical reads?

Furthermore, we noticed by accident that if the initial number of clusters is lower (around 7 million), the fraction of multiple identical reads drops significantly, although for the moment we don't know whether that is pure coincidence. Thanks for your hints.
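The intuition that purely random fragmentation should rarely produce identical read starts can be checked with a rough calculation. The sketch below assumes reads land uniformly at random on the available start positions, which real ChIP data does not do (enriched regions concentrate reads), so it only gives a baseline. The function name and the genome size of 3 billion bp are illustrative assumptions, not values from the post.

```python
import math

def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Expected fraction of reads that share a start position with an
    earlier read, assuming uniform random placement (an idealisation)."""
    # Expected number of distinct positions hit by at least one read:
    #   M * (1 - (1 - 1/M)^N)
    # computed with log1p/expm1 to stay accurate when 1/M is tiny.
    distinct = -n_positions * math.expm1(n_reads * math.log1p(-1.0 / n_positions))
    return 1.0 - distinct / n_reads

# Hypothetical numbers: 6 million uniquely aligned reads, a 3e9 bp genome,
# two strands, so 6e9 possible (position, strand) starts.
baseline = expected_duplicate_fraction(6_000_000, 6_000_000_000)
print(f"baseline duplicate fraction: {baseline:.4%}")
```

Under these assumptions the baseline is about 0.05% of reads. An observed duplicate fraction far above that would suggest something other than chance coincidence under uniform sampling, such as strong enrichment or amplification.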
carlitos
Old 02-19-2009, 03:31 PM   #2
Senior Member
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

Interesting question. In ChIP-seq, we often see "odd" stuff, including biases toward clearly unexpected regions of the genome. That often means large "peaks" in centromeres, or just large stacks of duplicates.

However, while we don't know the sources of all of this "odd" stuff, we can account for most of it with good controls. (I doubt that the fragmentation is completely random, though, regardless of which method you use...)

If you're looking for other sources: many groups do a PCR step on their DNA before sequencing, which might preferentially amplify some fragments. And of course you are isolating DNA from a large population of cells, so it's possible that you're just getting a lot of pulled-down material from a whole collection of cells where that signal is strong.

Anyhow, I would also suggest that how your pipeline handles the reads makes a difference. You don't specify the aligner or the filtering techniques being used, which makes it really hard to get to the bottom of what you're seeing.

Good luck making sense of your data!
The more you know, the more you know you don't know. —Aristotle
apfejes
Old 02-26-2009, 05:15 AM   #3
Location: Germany

Join Date: Jan 2009
Posts: 41

It seems we have a similar problem.
I have several identical reads, and of course they map to the same position.
When I analyze 454 data, I keep one and remove the others, because the duplication is likely caused by a technical problem.
But for Solexa data, I don't know of any reason that would justify removing them.
mingkunli
Old 02-26-2009, 09:59 AM   #4
Location: london, uk

Join Date: Jul 2008
Posts: 35

Another source: the current human genome sequence is imperfect. There are likely sequences that are in fact repeats but do not appear so in the current genome assembly.

If we see 'read-towers', we regard them as artefacts until proven otherwise.
dvh
Old 03-10-2009, 06:17 AM   #5
Location: Basel, Switzerland

Join Date: Feb 2009
Posts: 27

During library construction (454, Illumina, etc.) almost all protocols have a PCR amplification stage, if only to get enough material to sequence. Unless you are expecting them, I would remove any exactly identical sequence reads if they are going to affect downstream analysis. Removing reads may sound like a bad thing, but we have found that the bias caused by keeping replicated reads can be huge (and muddies an already muddy pool!). So although it is conservative and may remove useful data, without any way to prove the reads come from independent sources, I would always remove them. You might consider barcoding your library when you amplify (easy to do); that way, any identical but independently produced sequences will at least be separable.
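As a minimal sketch of the "keep one, remove the rest" approach described above, the snippet below collapses reads that are exactly identical in sequence and mapped position. The `AlignedRead` record and `collapse_identical` function are hypothetical names for illustration; they assume the reads have already been aligned and parsed into simple tuples, and they are not the API of any particular aligner or toolkit.

```python
from typing import Iterable, List, NamedTuple

class AlignedRead(NamedTuple):
    """Hypothetical minimal record for an aligned single-end read."""
    chrom: str
    start: int
    strand: str
    seq: str

def collapse_identical(reads: Iterable[AlignedRead]) -> List[AlignedRead]:
    """Keep the first occurrence of each (chrom, start, strand, seq)
    combination and drop later exact copies, preserving input order."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept

reads = [
    AlignedRead("chr1", 100, "+", "ACGT"),
    AlignedRead("chr1", 100, "+", "ACGT"),  # exact copy, dropped
    AlignedRead("chr1", 100, "-", "ACGT"),  # other strand, kept
    AlignedRead("chr2", 500, "+", "TTGA"),
]
unique_reads = collapse_identical(reads)
print(len(reads), "reads in,", len(unique_reads), "reads kept")
```

Keying on sequence plus position is deliberately strict: reads that start at the same place but differ by a sequencing error are kept as distinct. Whether that is the right rule depends on the downstream analysis.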

Sorry for the long post... ...
ieuanclay
Old 03-12-2009, 09:26 AM   #6
Location: San Diego

Join Date: Mar 2009
Posts: 26

ieuanclay is correct. The duplication is caused by the library prep steps. We've found that lowering the number of PCR cycles, or doing a two-stage PCR instead, gives less duplication. So basically you get so much sequence that you're seeing two products of the same PCR reaction sequenced.

It only works for paired-end sequencing, but I judge library diversity by looking at the number of identical paired-end reads (exactly the same start and end for the pair). Whether you want to remove them or not is up to you; for a low-diversity library they can cause spurious SNP calls and such, depending on the algorithm and the PCR fidelity.
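The diversity check described above can be sketched as follows, assuming each aligned pair has already been reduced to a (chromosome, fragment start, fragment end) tuple. The function name `pair_duplication_rate` and the tuple layout are assumptions for illustration, not taken from any specific pipeline.

```python
from collections import Counter
from typing import Iterable, Tuple

Pair = Tuple[str, int, int]  # (chrom, fragment_start, fragment_end)

def pair_duplication_rate(pairs: Iterable[Pair]) -> float:
    """Fraction of pairs that repeat a (chrom, start, end) combination
    already seen; 0.0 means every pair is unique."""
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total

pairs = [
    ("chr1", 100, 320),
    ("chr1", 100, 320),  # same start and end: counted as a duplicate
    ("chr1", 100, 335),  # same start, different end: treated as distinct
    ("chr3", 900, 1100),
]
print(f"duplication rate: {pair_duplication_rate(pairs):.2f}")
```

A high rate suggests a low-diversity library. Requiring both ends to match is what makes this stricter than a single-end check, since two independent fragments are less likely to share both coordinates.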

And the purity filter doesn't work on alignment, just base-call quality. Think of it like trimming away the bad Phred scores.
basickler
Old 09-16-2009, 05:56 AM   #7
Location: germany

Join Date: Apr 2008
Posts: 14
duplicates in ChIPSeq


I have exactly the same problem but only found this thread just now.
Please look at -

Many thanks for your help, it is much appreciated!!!

tec
Old 10-07-2009, 04:30 AM   #8
Location: germany

Join Date: Apr 2008
Posts: 14
multiple reads having the same sequence...

Hello all,

The problem with duplicate reads still keeps me busy.
Therefore we performed a TOPO cloning resequencing check of the library.
Surprisingly, over 75% of the clones were unique, which doesn't match what we see in the sequencing run!

Does anyone have an idea???

Thanks! tec
tec
