SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Removing duplicates is it really necessary? foxyg Bioinformatics 34 12-02-2016 02:17 PM
hello from Mark at Omixon mzpitman Introductions 0 11-08-2011 08:23 AM
Picard - MakeDuplicates (remove pcr duplicates) dmb Bioinformatics 2 03-16-2011 08:56 AM
Picard MarkDuplicates - How to identify duplicates in generated BAM file makarovv Bioinformatics 6 11-10-2010 09:02 AM
26% duplicates KevinLam Bioinformatics 4 08-19-2010 05:20 PM

Reply
 
Thread Tools
Old 09-01-2011, 02:14 AM   #1
hlwright
Member
 
Location: Liverpool, UK

Join Date: Feb 2011
Posts: 30
Default Picard Mark Duplicates help please

I am tying to use Picard MarkDuplicates to remove my pcr duplicates from a human rna-seq bam file. The run was paired-end but I only have about 30% properly paired (that is another story).

My command for picard was this:
PHP Code:
java -Xmx8 -jar /path/to/MarkDuplicates.jar INPUT=accepted_hits_sorted.bam OUTPUT=picard.bam METRICS_FILE=picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT 
The picard_info.txt file suggests I have 52% duplicates in my data. However, when I convert the output file to SAM I can see (by eye) that there are duplicates still there:

Original data:
PHP Code:
1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
1863_1224_411_F3        99      chr1    13484   3       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
1939_1338_752_F3        355     chr1    13484   1       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
1942_131_1549_F3        355     chr1    13484   3       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
486_302_1756_F3 99      chr1    13528   0       50M     
=       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
`^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
2219_1966_235_F3        355     chr1    13528   0       50M     
=       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
Post picard:
PHP Code:
1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
1863_1224_411_F3        99      chr1    13484   3       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
1939_1338_752_F3        355     chr1    13484   1       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
1942_131_1549_F3        355     chr1    13484   3       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
486_302_1756_F3 99      chr1    13528   0       50M     
=       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
`^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
2219_1966_235_F3        355     chr1    13528   0       50M     
=       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
I have also tried to set the REMOVE_DUPLICATES flag to false (I think this just flags them and leaves them in the new file rather than excluding them) but this gives me exactly the same result:
PHP Code:
1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
1863_1224_411_F3        99      chr1    13484   3       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
1939_1338_752_F3        355     chr1    13484   1       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
1942_131_1549_F3        355     chr1    13484   3       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
486_302_1756_F3 99      chr1    13528   0       50M     
=       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
`^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
2219_1966_235_F3        355     chr1    13528   0       50M     
=       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
On all three example SAM file extracts you can clearly see that five reads on chr1 at start position 13484 are duplicates, and are paired with reads starting at 13572, yet all have been left in and I can't see that any flag had changed.

Am I doing something wrong? Please help!

Thanks
Helen
hlwright is offline   Reply With Quote
Old 09-01-2011, 11:14 AM   #2
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by hlwright View Post

On all three example SAM file extracts you can clearly see that five reads on chr1 at start position 13484 are duplicates, and are paired with reads starting at 13572, yet all have been left in and I can't see that any flag had changed.

Am I doing something wrong? Please help!

Thanks
Helen
I think I see the problem, and it's the same one I had when I first tried to use Picard's MarkDuplicates. Whatever program you used to make the .sam file understood that read 1329_105_1480_F3 and 1329_105_1480_F5-BC were paired, even though their names are different. Someone on this board said that Picard won't work unless the names are identical. So try fixing the names, and then running Picard. I use samtools rmdup, which has a known issue of not removing duplicates that span chromosomes, but it does work if the names aren't identical.
swbarnes2 is offline   Reply With Quote
Old 09-01-2011, 11:49 AM   #3
hlwright
Member
 
Location: Liverpool, UK

Join Date: Feb 2011
Posts: 30
Default

Thank you for the reply, I will try your suggestion of changing the read names.

I have tried samtools rmdup but it removes >90% of my reads (as duplicates). I was hoping for a 'second opinion' from Picard that was less shocking!

Helen
hlwright is offline   Reply With Quote
Old 09-05-2011, 12:28 AM   #4
hlwright
Member
 
Location: Liverpool, UK

Join Date: Feb 2011
Posts: 30
Default

I have changed the read names as suggested and re-run picard, but am still not happy that it has worked properly.

I also included an extra command in the command line READ_NAME_REGEX="[0-9]_[0-9]_[0-9]". An example of the output can be seen below:

PHP Code:
1329_105_1480   355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
1863_1224_411   99      chr1    13484   3       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
1939_1338_752   355     chr1    13484   1       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
1942_131_1549   355     chr1    13484   3       50M     
=       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
486_302_1756    99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
667_379_431     355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1481_526_56     355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1631_628_1988   355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
2219_1966_235   355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
1329_105_1480   403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
1863_1224_411   147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
1939_1338_752   403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
1942_131_1549   403     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ))++//!#+,98-+9<2?DC9,./.6CBGO?21@@     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0 
Picard has removed ~60% of my reads as duplicates, but you can see that some of the reads that have been left in the file are duplicates (or at least I think they look like duplicates). Read 1329_105_1480 for example appears to have 4 copies which have all mapped as pairs to the same location.

I ran samtools rmdup on this Picard output and that removed a further 75% as duplicates! But why is picard not picking them up?

My full Picard command is:
HTML Code:
java -Xmx10g -jar /path/to/MarkDuplicates.jar INPUT=filename.bam OUTPUT=output.bam METRICS_FILE=output.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT READ_NAME_REGEX="[0-9]_[0-9]_[0-9]"
Thanks
Helen
hlwright is offline   Reply With Quote
Old 01-26-2012, 03:20 AM   #5
polarise
Member
 
Location: Ireland

Join Date: Jan 2011
Posts: 13
Default Might this work?

This might be helpful. It suggests that the problem has to do with underscores in the names so the underscore has to be part of the regex.

http://biostar.stackexchange.com/que...markduplicates

PK
polarise is offline   Reply With Quote
Old 01-26-2012, 05:36 PM   #6
jflowers
Member
 
Location: New York, NY

Join Date: Oct 2011
Posts: 41
Default

Your regular expression won't match your read names as you have them. The problem is that you need to include (1) capturing parentheses and (2) use + (to indicate one or more character matches).

To match identifiers like yours (e.g., 1240_122_334) use:
READ_NAME_REGEX="([0-9]+)_([0-9]+)_([0-9]+)"

Also be sure that the three number in your read names are (in this order) tile/region, x coordinate, y coordinate per the Picard tools documentation
jflowers is offline   Reply With Quote
Old 01-26-2012, 06:04 PM   #7
jflowers
Member
 
Location: New York, NY

Join Date: Oct 2011
Posts: 41
Default

One final thought, check the metrics file to determine how many of your reads are being detected as optical and PCR duplicates. If your regex is working, there should be > 0 optical duplicates
jflowers is offline   Reply With Quote
Old 01-26-2012, 10:40 PM   #8
lifeng.tian
Member
 
Location: Philadelphia

Join Date: Jul 2009
Posts: 16
Default

Helen,

It seems like you are using an aligner that reports multiple alignments, right?

One thing I noticed is that all these remaining reads are "not primary alignment" (I used http://picard.sourceforge.net/explain-flags.html to interpret the flag in col 2).
Picard might skip these reads for dup detection.
I've used samtools view -F 0x100 xx.bam to remove the non-primary-alignment.


Lifeng
lifeng.tian is offline   Reply With Quote
Reply

Tags
pcr duplicates, picard

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:08 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO