The example you are showing is not significant to determine if there was any problem with picard. You are showing the two ends of a pair. They are both in your sam file before and after running picard. That's the expected behavior.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by Michael.James.Clark View PostThere is a nice discussion of such things (including a post where Heng explains the answer to your question) as well as links to numerous other threads on the topic here: http://seqanswers.com/forums/showthread.php?t=6854
Hope that helps.
--
bioinfosm
Comment
-
Originally posted by swbarnes2 View PostI tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)
java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT
java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true
The resultant file is only a little smaller than the original. So I checked some known duplicates;
grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R
grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R
I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.
Picard's markdups program's supposed advantage is in its ability to remove duplicates across chromosomes. I ran markdups and samtools and samtools ended up removing more dups, but not across chromosomes of course. Either way, you can physically go through and filter out pairs and sequences with poor pairing or mapping QVs. I also remove reads across chromosomes now. Depends on what you're doing really.
Comment
-
Originally posted by JohnK View PostHi,
Picard's markdups program's supposed advantage is in its ability to remove duplicates across chromosomes. I ran markdups and samtools and samtools ended up removing more dups, but not across chromosomes of course. Either way, you can physically go through and filter out pairs and sequences with poor pairing or mapping QVs. I also remove reads across chromosomes now. Depends on what you're doing really.
But Picard doesn't seem to be removing dupes within chromosomes. I posted the .sam entries for two different clusters with the exact same sequence at each end; why didn't Picard remove one cluster or the other?
Comment
-
I'm not sure really. I actually use samtools now to remove the dups and I believe we're moving towards filtering out the singletons, cross-chromosome pairs, invalid insert sizes, and anything else under a QV of ten. Just for the sake of the most accurate data.
Comment
-
Originally posted by swbarnes2 View PostI posted the .sam entries for two different clusters with the exact same sequence at each end; why didn't Picard remove one cluster or the other?
Notice the MAPQ of those alignments. It is very low. Those alignments should be filtered out and should not contribute to the downstream analysis.-drd
Comment
-
Originally posted by drio View PostPicard looks for the mapping coordinates (and orientation), not the actual sequence.
Notice the MAPQ of those alignments. It is very low. Those alignments should be filtered out and should not contribute to the downstream analysis.
But I have other things that align uniquely, and still aren't being trimmed.
grep 2:78:8570:13184 full_genome_grepped_sorted_deduped.sam
100729:2:78:8570:13184:b: 129 chr1 149128373 37 50M = 149128649 276 CCTTTTCCTCTGCATGCAATAGATCATTGAGTGTGTGTGTGTGTGTGTGT hhghhhhhhhhhhghhhhhhchhhhghheafdfdfafcfcfcfafafafa X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
100729:2:78:8570:13184:a: 65 chr1 149128649 37 50M = 149128373 -276 ACTTTCAATCTCTGCTTTTCAAAATGTGAAAATGCCCAATTGATTACAGC hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghghhhhhhfhh X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
grep 2:33:3309:20360 full_genome_grepped_sorted_deduped.sam
100729:2:33:3309:20360:b: 129 chr1 149128373 37 50M = 149128649 276 CCTTTTCCTCTGCATGCAATAGATCATTGAGTGTGTGTGTGTGTGTGTGT ggggggggggdffffgggggggfagggggffafcfafacab_b]b_fa]a X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
100729:2:33:3309:20360:a: 65 chr1 149128649 37 50M = 149128373 -276 ACTTTCAATCTCTGCTTTTCAAAATGTGAAAATGCCCAATTGATTACAGC hhhhhhhhhhhhhhhhhhhhhhhg_hhhhhhhhahhhhhhfhfhheggcf X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
The coordinates for each cluster are identical, even the sequences are identical at both ends, and isn't a mapping quality of 37 fine? Why isn't one of these being removed by Picard?
Comment
-
Originally posted by swbarnes2 View PostI tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)
java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT
java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true
The resultant file is only a little smaller than the original. So I checked some known duplicates;
grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R
grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R
I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.
I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.
Chris
Comment
-
Originally posted by cjp View PostHi there,
I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.
Chris
Comment
-
Originally posted by cjp View PostHi there,
I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.
Chris
See the example.Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Comment
-
Originally posted by swbarnes2 View PostThanks for figuring it out. That makes sense, though I thought when I ran my file as it was, that it did trim a few entries out, just not as many as I thought should be trimmed. And samtools does gets that samples named like that are paired.
Before read names were changed to be the same:
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
A0600028 3172904 67528513 0 2607750 0 0 0.018865
After making the name change:
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
A0600028 3172904 67528513 0 2607750 38116791 0 0.570364 34131325
Chris
Comment
-
Nevermind, figured out my problem. The problem was I'm an idiot. Our paired end sequence data comes back with read one ending in '#0/1' and read two ending in '#0/3', and the 3 has to be converted to a 2 for MarkDuplicates to work. I naturally used a command that changed every 3 to a 2...Last edited by Heisman; 09-20-2011, 05:03 PM.
Comment
-
Mark duplicates post fastq trimming?
Hello, I am fairly new to bioinformatics and have a question about using Picard Mark Duplicates after trimming sequence reads.
(I could not find the exact answer to my question (so I apologize if this a repeat question), but if it exists please link to the answer.)
Does anyone know how Picard Mark Duplicates knows which bases were trimmed from a read? If you trim bases from the 5' end of the read, how does Picard know that? See quoted comment from xguo below.
Note, I am using trimmomatic to trim low quality bases and adapters from paired end whole genome DNA re-sequencing data. I am aligning this data to a reference then using the GATK to call snp variants for population genetic analyses... so I need remove duplicates to ensure accurate genotype/snp calls.
Thank you!
Amanda
Originally posted by xguo View PostThe following is what I found from samtools-help list:
"Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."
Comment
-
I think that MarkDuplicates is considering the bases that were "clipped" during the alignment process: those terminal parts of the sequences do belong to the read but are not reported as part of the alignments. This is not the same as the "trimming" process that removes low-quality and/or adapter sequences before the alignment. As you pointed out, those are definitely lost.
Comment
Latest Articles
Collapse
-
by seqadmin
Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...-
Channel: Articles
12-16-2024, 07:57 AM -
-
by seqadmin
Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.
Long-Read Sequencing
Long-read sequencing has seen remarkable advancements,...-
Channel: Articles
12-02-2024, 01:49 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 12-17-2024, 10:28 AM
|
0 responses
26 views
0 likes
|
Last Post
by seqadmin
12-17-2024, 10:28 AM
|
||
Started by seqadmin, 12-13-2024, 08:24 AM
|
0 responses
43 views
0 likes
|
Last Post
by seqadmin
12-13-2024, 08:24 AM
|
||
Started by seqadmin, 12-12-2024, 07:41 AM
|
0 responses
29 views
0 likes
|
Last Post
by seqadmin
12-12-2024, 07:41 AM
|
||
Started by seqadmin, 12-11-2024, 07:45 AM
|
0 responses
42 views
0 likes
|
Last Post
by seqadmin
12-11-2024, 07:45 AM
|
Comment