Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Picard Mark Duplicates help please

    I am tying to use Picard MarkDuplicates to remove my pcr duplicates from a human rna-seq bam file. The run was paired-end but I only have about 30% properly paired (that is another story).

    My command for picard was this:
    PHP Code:
    java -Xmx8 -jar /path/to/MarkDuplicates.jar INPUT=accepted_hits_sorted.bam OUTPUT=picard.bam METRICS_FILE=picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT 
    The picard_info.txt file suggests I have 52% duplicates in my data. However, when I convert the output file to SAM I can see (by eye) that there are duplicates still there:

    Original data:
    PHP Code:
    1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1863_1224_411_F3        99      chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    1939_1338_752_F3        355     chr1    13484   1       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1942_131_1549_F3        355     chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    486_302_1756_F3 99      chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
    ``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
    ``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
    `^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    2219_1966_235_F3        355     chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
    1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
    1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
    Post picard:
    PHP Code:
    1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1863_1224_411_F3        99      chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    1939_1338_752_F3        355     chr1    13484   1       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1942_131_1549_F3        355     chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    486_302_1756_F3 99      chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
    ``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
    ``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
    `^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    2219_1966_235_F3        355     chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
    1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
    1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
    I have also tried to set the REMOVE_DUPLICATES flag to false (I think this just flags them and leaves them in the new file rather than excluding them) but this gives me exactly the same result:
    PHP Code:
    1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1863_1224_411_F3        99      chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    1939_1338_752_F3        355     chr1    13484   1       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1942_131_1549_F3        355     chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    486_302_1756_F3 99      chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
    ``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
    ``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
    `^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    2219_1966_235_F3        355     chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
    1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
    1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
    On all three example SAM file extracts you can clearly see that five reads on chr1 at start position 13484 are duplicates, and are paired with reads starting at 13572, yet all have been left in and I can't see that any flag had changed.

    Am I doing something wrong? Please help!

    Thanks
    Helen

  • #2
    Originally posted by hlwright View Post

    On all three example SAM file extracts you can clearly see that five reads on chr1 at start position 13484 are duplicates, and are paired with reads starting at 13572, yet all have been left in and I can't see that any flag had changed.

    Am I doing something wrong? Please help!

    Thanks
    Helen
    I think I see the problem, and it's the same one I had when I first tried to use Picard's MarkDuplicates. Whatever program you used to make the .sam file understood that read 1329_105_1480_F3 and 1329_105_1480_F5-BC were paired, even though their names are different. Someone on this board said that Picard won't work unless the names are identical. So try fixing the names, and then running Picard. I use samtools rmdup, which has a known issue of not removing duplicates that span chromosomes, but it does work if the names aren't identical.

    Comment


    • #3
      Thank you for the reply, I will try your suggestion of changing the read names.

      I have tried samtools rmdup but it removes >90% of my reads (as duplicates). I was hoping for a 'second opinion' from Picard that was less shocking!

      Helen

      Comment


      • #4
        I have changed the read names as suggested and re-run picard, but am still not happy that it has worked properly.

        I also included an extra command in the command line READ_NAME_REGEX="[0-9]_[0-9]_[0-9]". An example of the output can be seen below:

        PHP Code:
        1329_105_1480   355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
        1863_1224_411   99      chr1    13484   3       50M     
        =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
        1939_1338_752   355     chr1    13484   1       50M     
        =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
        1942_131_1549   355     chr1    13484   3       50M     
        =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
        486_302_1756    99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        667_379_431     355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
        ``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        1481_526_56     355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        1631_628_1988   355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
        ``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        2219_1966_235   355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        1329_105_1480   403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
        1863_1224_411   147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
        1939_1338_752   403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
        1942_131_1549   403     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ))++//!#+,98-+9<2?DC9,./.6CBGO?21@@     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0 
        Picard has removed ~60% of my reads as duplicates, but you can see that some of the reads that have been left in the file are duplicates (or at least I think they look like duplicates). Read 1329_105_1480 for example appears to have 4 copies which have all mapped as pairs to the same location.

        I ran samtools rmdup on this Picard output and that removed a further 75% as duplicates! But why is picard not picking them up?

        My full Picard command is:
        HTML Code:
        java -Xmx10g -jar /path/to/MarkDuplicates.jar INPUT=filename.bam OUTPUT=output.bam METRICS_FILE=output.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT READ_NAME_REGEX="[0-9]_[0-9]_[0-9]"
        Thanks
        Helen

        Comment


        • #5
          Might this work?

          This might be helpful. It suggests that the problem has to do with underscores in the names so the underscore has to be part of the regex.



          PK

          Comment


          • #6
            Your regular expression won't match your read names as you have them. The problem is that you need to include (1) capturing parentheses and (2) use + (to indicate one or more character matches).

            To match identifiers like yours (e.g., 1240_122_334) use:
            READ_NAME_REGEX="([0-9]+)_([0-9]+)_([0-9]+)"

            Also be sure that the three number in your read names are (in this order) tile/region, x coordinate, y coordinate per the Picard tools documentation

            Comment


            • #7
              One final thought, check the metrics file to determine how many of your reads are being detected as optical and PCR duplicates. If your regex is working, there should be > 0 optical duplicates

              Comment


              • #8
                Helen,

                It seems like you are using an aligner that reports multiple alignments, right?

                One thing I noticed is that all these remaining reads are "not primary alignment" (I used http://picard.sourceforge.net/explain-flags.html to interpret the flag in col 2).
                Picard might skip these reads for dup detection.
                I've used samtools view -F 0x100 xx.bam to remove the non-primary-alignment.


                Lifeng

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Today, 08:47 AM
                0 responses
                10 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                60 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                57 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X