SEQanswers

Old 08-05-2011, 06:30 AM   #1
fabrice
Member
 
Location: paris

Join Date: Oct 2009
Posts: 86
Default Human Illumina Paired-end RNA-Seq remove duplication.

I am working with human Illumina paired-end RNA-Seq data. The purpose of my
analysis is to get isoform-level expression, not SNP calling.

When I used fastqc(0.94) to examine my RNA-seq data, I found a very high
duplication level: fastqc reports about 70% duplication. So I tried to use
Picard(1.50) to remove the duplicate reads.

The command is:

java -Xmx4g -jar ~/bin/picard/MarkDuplicates.jar REMOVE_DUPLICATES=true \
    INPUT=accepted_hits.bam OUTPUT=remove_accepted_hits.bam \
    METRICS_FILE=dup.txt

After running Picard, I used fastqc to check again. It is better, but
there is still a high duplication level (63% duplication). Does this mean
that Picard does not work well, or that the fastqc report has a problem?

I looked at the output from Picard. In the METRICS_FILE, the
PERCENT_DUPLICATION is 0.312927, but fastqc gives a duplication level of 70%.

Why is there this difference?


Thanks.
Old 08-05-2011, 06:52 AM   #2
Robby
Member
 
Location: Germany

Join Date: Mar 2011
Posts: 68
Default

Hi fabrice,
I am not sure, but I think fastqc counts all identical reads as duplicates. In comparison, Picard marks reads as duplicates only when the positions of both the forward and the reverse read are the same. So for Picard it is not enough that, for example, just the forward reads are the same.
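
As a toy illustration of the distinction described above, the sketch below counts duplicates two ways on made-up data: by read-1 sequence alone, and by the mapped positions of both mates. The record layout and all values are invented for illustration, and the code reproduces neither fastqc's nor Picard's actual implementation.

```python
from collections import Counter

# Hypothetical read pairs: (read-1 sequence, read-1 5' position, read-2 5' position)
pairs = [
    ("ACGTACGT", 100, 350),
    ("ACGTACGT", 100, 350),  # same sequence and same pair of positions
    ("ACGTACGT", 100, 410),  # same read 1, but the mate maps elsewhere
    ("ACGTACGT", 100, 395),
    ("TTGACCAT", 900, 1150),
]

def duplicate_fraction(keys):
    """Fraction of items whose key repeats one already seen."""
    counts = Counter(keys)
    return (len(keys) - len(counts)) / len(keys)

# Sequence-level view: only the read-1 sequence is compared.
by_sequence = duplicate_fraction([seq for seq, _, _ in pairs])

# Alignment-level view: both mates must start at the same positions.
by_pair_position = duplicate_fraction([(p1, p2) for _, p1, p2 in pairs])

print(f"sequence-level: {by_sequence:.0%}")
print(f"pair-position-level: {by_pair_position:.0%}")
```

On this toy input the sequence-level count is three times the pair-level count, which is the same direction as the gap between the two tools reported in the first post.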
Old 08-05-2011, 07:04 AM   #3
fabrice
Member
 
Location: paris

Join Date: Oct 2009
Posts: 86
Default

Thank you.

So does it mean that fastqc is closer to the truth?

According to the Picard FAQ, it works like this:

Q: How does MarkDuplicates work?
A: Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taken place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities at bases with Q >= 15.

If your reads have been divided into separate BAMs by chromosome, inter-chromosomal pairs will not be identified, but MarkDuplicates will not fail due to inability to find the mate pair for a read.
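
The FAQ answer above can be sketched roughly as follows. This is a simplified illustration, not Picard's code: the `ReadPair` type, its field names, and the `mark_duplicates` helper are hypothetical, and the 5' coordinates are assumed to have been corrected for clipping already.

```python
from collections import defaultdict
from typing import NamedTuple

class ReadPair(NamedTuple):
    """Hypothetical record for one aligned read pair."""
    name: str
    chrom1: str
    pos1: int      # 5' coordinate of the first end, clipping already undone
    strand1: str
    chrom2: str
    pos2: int      # 5' coordinate of the second end, clipping already undone
    strand2: str
    quals: list    # base qualities of both reads

MIN_Q = 15  # only bases at or above this quality count toward the score

def score(pair):
    """Sum of base qualities over bases with Q >= MIN_Q."""
    return sum(q for q in pair.quals if q >= MIN_Q)

def mark_duplicates(pairs):
    """Return the names of all pairs except the best one in each group."""
    groups = defaultdict(list)
    for p in pairs:
        key = (p.chrom1, p.pos1, p.strand1, p.chrom2, p.pos2, p.strand2)
        groups[key].append(p)
    duplicates = set()
    for members in groups.values():
        best = max(members, key=score)
        duplicates.update(m.name for m in members if m is not best)
    return duplicates

pairs = [
    ReadPair("a", "chr1", 100, "+", "chr1", 350, "-", [30, 30, 10, 30]),
    ReadPair("b", "chr1", 100, "+", "chr1", 350, "-", [20, 20, 20, 12]),
    ReadPair("c", "chr1", 100, "+", "chr1", 410, "-", [30, 30, 30, 30]),
]
print(sorted(mark_duplicates(pairs)))  # ['b']
```

Pairs "a" and "b" share both 5' coordinates and orientations, so only the higher-scoring one is kept; pair "c" has a different mate position and forms its own group.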
Old 08-05-2011, 07:23 AM   #4
Robby
Member
 
Location: Germany

Join Date: Mar 2011
Posts: 68
Default

I think Picard is closer to the truth. Even if two reads have the same sequence, they are not necessarily PCR duplicates. If the reverse reads of two identical forward reads are different, it is probably not a PCR duplicate. But if the position and orientation of both the forward and the reverse read are identical, it is likely a PCR duplicate. Of course, even the percentage calculated by Picard is an overestimate.
Old 08-05-2011, 07:39 AM   #5
fabrice
Member
 
Location: paris

Join Date: Oct 2009
Posts: 86
Default

Robby,
Thanks for your explanation.
You said that Picard overestimates. Does that mean Picard always gives an overestimate of the duplication level?
Because Picard takes BAM/SAM as input, I must map the reads first if I want to estimate the duplication level in my sequences. Is it possible to estimate the duplication level from the FASTQ files? Fastqc takes FASTQ files, but it does not seem very accurate.
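
One way to get a rough pair-aware estimate directly from FASTQ files, without mapping, is to count read pairs whose two sequences are both identical to an earlier pair. The sketch below is a minimal version under several assumptions: four-line FASTQ records, both files in the same read order, and enough memory to hold every distinct pair. The function names are hypothetical. Sequencing errors will hide some true duplicates, and highly expressed transcripts can produce identical pairs that are not PCR duplicates, so the number is only indicative.

```python
import io

def read_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()

def paired_duplicate_fraction(r1_handle, r2_handle):
    """Fraction of pairs whose (read 1, read 2) sequences were seen before."""
    seen = set()
    total = 0
    duplicates = 0
    for s1, s2 in zip(read_sequences(r1_handle), read_sequences(r2_handle)):
        total += 1
        key = (s1, s2)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / total if total else 0.0

# Made-up in-memory FASTQ data standing in for the read 1 and read 2 files.
r1 = io.StringIO("@r1\nAAAA\n+\nIIII\n@r2\nAAAA\n+\nIIII\n@r3\nAAAA\n+\nIIII\n")
r2 = io.StringIO("@r1\nCCCC\n+\nIIII\n@r2\nCCCC\n+\nIIII\n@r3\nGGGG\n+\nIIII\n")
print(f"{paired_duplicate_fraction(r1, r2):.1%}")
```

With real data the two handles would come from `open()` on the read 1 and read 2 files. All three read-1 sequences in the toy input are identical, yet only one pair counts as a duplicate because the third mate differs.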
Old 08-06-2011, 08:13 PM   #6
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi,

Fastqc and Picard work at different levels. Fastqc works at the read level: it takes the read sequences and estimates the duplication level. Picard works at the alignment level: as already explained, it also considers the location of the read (and its mate, if applicable). So it depends on what you are looking for. As far as removing PCR-introduced duplicate reads is concerned, Picard is definitely more relevant.
Old 09-27-2012, 01:50 AM   #7
ddaneels
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 19
Default

Your Picard value was around 0.13. Is this 0.13%, or do you still need to multiply it by 100, so that it is actually 13%?
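
On the units question: Picard documents PERCENT_DUPLICATION as a fraction between 0 and 1, so the value has to be multiplied by 100 to read it as a percentage. The sketch below pulls the value out of a metrics file and formats it. The `read_percent_duplication` helper is hypothetical, and the sample text is a trimmed, made-up stand-in for a real metrics file that keeps only two columns; the 0.312927 figure is the one quoted in the first post.

```python
def read_percent_duplication(metrics_text):
    """Return PERCENT_DUPLICATION (a 0-1 fraction) from Picard metrics text."""
    lines = metrics_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            # The tab-separated header row follows the METRICS CLASS line,
            # and the first data row follows the header.
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return float(dict(zip(header, values))["PERCENT_DUPLICATION"])
    raise ValueError("no METRICS CLASS section found")

# Trimmed, illustrative sample; a real file has more columns and sections.
sample = (
    "## METRICS CLASS\tnet.sf.picard.sam.DuplicationMetrics\n"
    "LIBRARY\tPERCENT_DUPLICATION\n"
    "Unknown Library\t0.312927\n"
)

fraction = read_percent_duplication(sample)
print(f"{fraction:.1%}")  # 31.3%
```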
Old 10-15-2012, 07:34 AM   #8
arkal
advancing one byte at a time!
 
Location: Bangalore, India

Join Date: Jun 2011
Posts: 56
Lightbulb

I had the same query. What I personally feel is that Picard is more accurate.

My reasoning is that fastqc calls around 70% duplication by looking only at the read 1 file, and a similar amount using the read 2 file, without using any information on the mapping.
Picard, on the other hand, looks at the mapping and uses both the start and end positions of the fragments to call duplicates, which logically makes more sense, right? Fragments starting at the same position but ending at different ones can't be seen as duplicates and discarded, IMHO!
Old 10-15-2012, 10:10 PM   #9
arkal
advancing one byte at a time!
 
Location: Bangalore, India

Join Date: Jun 2011
Posts: 56
Question

Quote:
Originally Posted by Robby View Post
I think Picard is closer to the truth. Even if two reads have the same sequence, they are not necessarily PCR duplicates. If the reverse reads of two identical forward reads are different, it is probably not a PCR duplicate. But if the position and orientation of both the forward and the reverse read are identical, it is likely a PCR duplicate. Of course, even the percentage calculated by Picard is an overestimate.
A trivial question: when you say overestimated, do you mean the duplication in his sample is LESS THAN 13% or MORE THAN 13%?
All times are GMT -8. The time now is 09:28 PM.