SEQanswers

Old 02-02-2012, 04:20 AM   #21
dingxiaofan1
Member
 
Location: Hong kong

Join Date: Jul 2010
Posts: 17
Default

Quote:
Originally Posted by foxyg View Post
I know samtools and Picard can remove duplicates. But is it really necessary? A duplicate could be a PCR artifact or the result of reading the same fragment twice; there is no way to tell.

Also, how do you define a duplicate? Why do both samtools and Picard take BAM files as input? In theory, you could remove duplicates from the raw data already. Is it because they only check the aligned location, not the actual read?
My case is quite similar to yours. How did you deal with your data in the end? Has a paper been published already?
Old 03-19-2012, 06:13 PM   #22
dongshenglulv
Member
 
Location: Shanghai

Join Date: May 2011
Posts: 15
Default

Why are the "recurrent sequencing errors" likely to be caused by PCR?
Old 03-19-2012, 06:15 PM   #23
dongshenglulv
Member
 
Location: Shanghai

Join Date: May 2011
Posts: 15
Default

Quote:
Originally Posted by lh3 View Post
When we looked at structural variation and SNP calls from that data set, we found many recurrent "sequencing" errors. Richard pointed out that this was likely to be caused by PCR. I then implemented the "rmdup" component in maq. When we applied that, we got much cleaner SNP/SV calls.

Why do the "recurrent sequencing errors" seem to be caused by PCR?
Old 03-19-2012, 07:37 PM   #24
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535
Default

Quote:
Originally Posted by dongshenglulv View Post
Why do the "recurrent sequencing errors" seem to be caused by PCR?
Obviously I'm not Heng, but most likely it is something like this: say you see evidence for a SNP on a few reads, but on each of those reads the SNP occurs at the 37th base of the read. That implies a PCR duplicate. In reality, a true SNP should show up on a bunch of different reads, on both strands, and at different base-calling cycles within those reads.
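
The check described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: reads are given as hypothetical `(start, length, strand)` tuples with 1-based leftmost start positions, alignments are ungapped, and the "position in the read" is measured from the leftmost aligned base rather than by sequencing cycle. The function names are made up for this example and do not come from samtools, Picard, or maq.

```python
def variant_read_offsets(reads, variant_pos):
    """Return (offset, strand) for every read that covers variant_pos.

    reads: iterable of (start, length, strand) tuples, where start is the
    1-based leftmost aligned position (hypothetical, ungapped alignments).
    The offset is 1-based and counted from the leftmost aligned base.
    """
    offsets = []
    for start, length, strand in reads:
        if start <= variant_pos < start + length:
            offsets.append((variant_pos - start + 1, strand))
    return offsets


def looks_like_pcr_duplicates(reads, variant_pos):
    """True if several reads support the site and all of them place it at
    the same offset on the same strand, the pattern described above."""
    offsets = variant_read_offsets(reads, variant_pos)
    return len(offsets) > 1 and len(set(offsets)) == 1


# Four identical reads: the variant is always the 37th base, same strand.
suspicious = [(964, 50, "+")] * 4
# Four independent reads: different offsets and both strands.
independent = [(964, 50, "+"), (980, 50, "-"), (955, 50, "+"), (999, 50, "-")]

print(looks_like_pcr_duplicates(suspicious, 1000))
print(looks_like_pcr_duplicates(independent, 1000))
```

A real pipeline would read these coordinates from a BAM file and would also have to handle soft clipping and indels, which this sketch ignores.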
Old 03-20-2012, 12:45 PM   #25
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Note the double quotation marks around "sequencing". They are not sequencing errors. They are errors introduced by PCR, which then get amplified in the following PCR cycles. When the duplicate rate is very high, you can get multiple reads containing the same PCR error.
Old 08-01-2012, 02:50 AM   #26
ahven
Junior Member
 
Location: Singapore

Join Date: Dec 2011
Posts: 1
Default

Hi,
We have done pooled-sample targeted sequencing, and according to rmdup and MarkDuplicates 90-95% of my reads are PCR duplicates. However, in my case, I think this is quite normal? We have a relatively small capture region (~400 kb) and we are sequencing it to very high coverage (2000x or more). Since we are trying to detect variants in the pooled samples, we need high coverage. However, if I now remove the potential PCR duplicates, I do not have sufficient depth. Please advise!

lh3, could you please clarify your formula for the theoretical false dedup rate, 0.28*m/s/L? I have 10-20M pairs for each pool and a targeted region of ~400 kb.

Thanks!
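
For readers who want to plug numbers into the expression quoted above, here is a minimal sketch. The meaning of the symbols is an assumption, since it is not spelled out in this part of the thread: `m` is taken to be the number of read pairs, `s` the standard deviation of the insert size distribution, and `L` the length of the sequenced region. The value used for `s` below is purely hypothetical; only `m` and `L` come from the post.

```python
def false_dedup_rate(m, s, L):
    """Evaluate 0.28*m/s/L as quoted in the thread.

    Assumed meanings (not confirmed here): m = number of read pairs,
    s = standard deviation of the insert size, L = region length in bp.
    """
    return 0.28 * m / s / L


# m and L from the post above; s = 30 bp is a made-up placeholder.
rate = false_dedup_rate(m=10_000_000, s=30, L=400_000)
print(f"estimated false dedup rate: {rate:.3f}")
```

The expression grows linearly with `m`, so for a small target sequenced very deeply it quickly becomes large; it is presumably only meaningful as an approximation while the result stays well below 1.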
Old 08-01-2012, 09:20 AM   #27
dfornika
Junior Member
 
Location: Vancouver, BC

Join Date: Aug 2009
Posts: 4
Default

Quote:
Originally Posted by ahven View Post
Hi,
We have done pooled-sample targeted sequencing, and according to rmdup and MarkDuplicates 90-95% of my reads are PCR duplicates. However, in my case, I think this is quite normal? We have a relatively small capture region (~400 kb) and we are sequencing it to very high coverage (2000x or more). Since we are trying to detect variants in the pooled samples, we need high coverage. However, if I now remove the potential PCR duplicates, I do not have sufficient depth. Please advise!
This is similar to my experiment, in which I did pooled sequencing of the mitochondrial genome (16.5 kb). It isn't appropriate to remove PCR duplicates in this situation, because you can't distinguish PCR duplicates from independent reads that map to exactly the same location.
Old 08-01-2012, 11:21 AM   #28
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Removing duplicates imposes a cap on sequencing coverage. For single-end data, that cap is 2x the read length. (For instance, with 50-mers, a base at position 100 can have at most 100 reads covering it once single-end duplicates have been removed: one forward read covering bp 51-100, another forward read covering bp 52-101, ... and then 50 more reads in the reverse direction.) For paired-end data, the cap is far higher, maybe several hundred x, depending on the tightness of the insert sizes. If your coverage is well below that ceiling, then any duplicates are likely PCR artifacts, and getting rid of PCR artifacts is good. If your coverage is well above that ceiling, then some of those duplicates are "real" (not from PCR, but really originating from different pieces of DNA that sheared in exactly the same way), and removing duplicates is going to get rid of some "real" data.
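
The single-end arithmetic above can be checked by brute force. The sketch below assumes that a single-end duplicate is defined purely by leftmost start position and strand, and that all reads have the same length; the function names are invented for this illustration. It does not attempt the paired-end case, where the cap also depends on the insert size distribution.

```python
def distinct_single_end_reads(position, read_length):
    """List every (start, strand) combination whose read covers `position`.

    Assumes fixed-length, ungapped reads and that two reads count as
    duplicates when they share the same leftmost start and strand.
    """
    starts = range(position - read_length + 1, position + 1)
    return [(start, strand) for strand in "+-" for start in starts]


def single_end_coverage_cap(read_length):
    """Maximum depth at one base after single-end duplicate removal."""
    return 2 * read_length


# The 50-mer example from the post: a base at position 100.
combos = distinct_single_end_reads(position=100, read_length=50)
print(len(combos), single_end_coverage_cap(50))
print(combos[0], combos[49])  # first and last forward-strand starts
```

Comparing the observed depth in a region against this cap gives a rough idea of whether duplicate removal is likely to discard independent fragments.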
Old 11-25-2012, 07:31 PM   #29
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi,

As far as I know, due to artifacts inherent in the sequencing technology, some reads will be exact copies of each other. They share the same sequence and the same alignment position, and they can cause trouble during SNP calling because an allele may be overrepresented due to amplification bias.

My question is whether removing or marking duplicates is necessary for transcriptome data before calling SNPs.
What I am doing now for my transcriptome data set is: align, remove duplicates, realign around indels, call SNPs.

Thanks for any advice.
Old 11-15-2016, 06:14 AM   #30
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Do you think it could be useful to remove those duplicates (with Picard MarkDuplicates) for a de novo genome assembly?
Old 11-15-2016, 12:45 PM   #31
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Not usually, and never if you are using unamplified reads.
Old 11-15-2016, 12:50 PM   #32
husamia
Member
 
Location: cinci

Join Date: Apr 2010
Posts: 66
Default

"It depends on the data in question." I suggest removing duplicates and seeing how it affects the overall quality. Then decide whether removing duplicates is useful.
Old 12-02-2016, 11:42 AM   #33
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

PCR duplicates are just redundant information, so removing them will not affect the final quality of the assembly, I guess?
Old 12-02-2016, 12:42 PM   #34
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Assemblers use coverage to determine things like how many copies of a repeat there are and whether a base is an error or not. Removing duplicates can affect this; it could increase or decrease the quality of the assembly. Are you using PCR-amplified data?
Old 12-02-2016, 01:17 PM   #35
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Quote:
Originally Posted by Brian Bushnell View Post
Assemblers use coverage to determine things like how many copies of a repeat there are and whether a base is an error or not. Removing duplicates can affect this; it could increase or decrease the quality of the assembly. Are you using PCR-amplified data?
Yes, it's PCR-amplified.
All times are GMT -8. The time now is 03:57 AM.