  • Picard MarkDuplicates - whole bam file marked as duplicates

    Hi,

    I aligned an RNA-Seq data set using the RUM pipeline. I then tried Picard's MarkDuplicates, and it tagged every single read in the bam files as a duplicate.
    I used the following parameters for MarkDuplicates:
    java -jar /private/software/packages/picard-tools-1.84/MarkDuplicates.jar I=RUM-sorted.bam O=RUM-sorted-dups_marked.bam METRICS_FILE=dups_metrics AS=true VALIDATION_STRINGENCY=SILENT

    (I had sorted the bam using samtools, but MarkDuplicates did not recognize it as sorted, which is why I used AS=true.)

    The dups_metrics file lists the percent_duplication at 43%.

    Any ideas?

    Thanks!
    Tirza
    --
    Tirza Doniger, Ph.D.
    Bioinformatics Unit
    The Mina and Everard Faculty of Life Sciences
    Bar Ilan University

  • #2
    Hey tdoniger

    Did you ever find the solution to your problem? I have the same problem and my metrics file only claims one percent!

    I'm currently trying 3 different solutions to the problem, and I'm waiting for the batches to finish:

    - use samtools rmdup instead
    - add VALIDATION_STRINGENCY=LENIENT to my command string for MarkDuplicates
    - don't bother to mark duplicates (I only did this because GATK requires it; my other pipeline will run happily without this step) and try to trick GATK into accepting my file by adding an @PG line to my header to say I ran MarkDuplicates. That is, use samtools reheader to take the header from the file in which MarkDuplicates marked every read as a duplicate and put it onto the file I would have used as input. (Yeah, I know this is probably not recommended; I just thought I'd see what happened.)

    My command was

    java -Xmx2G -jar MarkDuplicates.jar INPUT=infile.sorted.bam OUTPUT=outfile.sorted.dedupe.bam METRICS_FILE=myMetricsFile

    I'd also previously sorted the bam using samtools.

    • #3
      Because every line contained PG:Z:MarkDuplicates, I thought that all of these reads had been marked as duplicates. This is not the case. It is the flag in the second column that indicates whether a read is a duplicate or not.

      Try: samtools flagstat library_no_dups.bam
      You can look up which flag values represent duplicates here: http://picard.sourceforge.net/explain-flags.html
      --
      Tirza Doniger, Ph.D.
      Bioinformatics Unit
      The Mina and Everard Faculty of Life Sciences
      Bar Ilan University
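
The flag check described in the post above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original thread: it assumes SAM-formatted text lines (for example the output of `samtools view -h`), and the function names and the example records are made up for the demonstration. The only fact it relies on is that the SAM format uses bit 0x400 (decimal 1024) of the FLAG column to mark a PCR or optical duplicate.

```python
# Bit 0x400 (decimal 1024) of the SAM FLAG column means "PCR or optical duplicate".
DUPLICATE_FLAG = 0x400


def is_duplicate(sam_line: str) -> bool:
    """Return True if the alignment line has the duplicate bit set in its FLAG."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second tab-separated column
    return bool(flag & DUPLICATE_FLAG)


def count_duplicates(sam_lines):
    """Return (duplicates, total) over alignment lines, skipping '@' header lines."""
    total = 0
    dups = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        total += 1
        if is_duplicate(line):
            dups += 1
    return dups, total


# Hypothetical records: both carry PG:Z:MarkDuplicates, but only read2
# (FLAG 1123 = 99 + 1024) has the duplicate bit set.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*\tPG:Z:MarkDuplicates",
    "read2\t1123\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*\tPG:Z:MarkDuplicates",
]

print(count_duplicates(example))
```

The point of the example is that the PG:Z:MarkDuplicates tag only records which program processed the read; the duplicate status lives in the FLAG value. On a real BAM file, the same selection should be possible directly with samtools by filtering on flag 1024 (for example `samtools view -c -f 1024 file.bam`).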

      • #4
        I solved this and now I can't remember how. Really slack of me not to come back and post the solution, but tomorrow I'll check my pipelines and have a look.

        • #5
          Hi,

          Thanks! But I was trying to explain that I had already managed to solve it. I had thought that every line tagged with "PG:Z:MarkDuplicates" was a duplicate, but this was not the case. The duplicates are marked in the flag in the second column of the same file.

          Best,
          Tirza
          --
          Tirza Doniger, Ph.D.
          Bioinformatics Unit
          The Mina and Everard Faculty of Life Sciences
          Bar Ilan University

          • #6
            Going off-piste here, but 43% duplication is pretty high.
            How much PCR did you do?

            • #7
              Quite a bit. There was very little starting material.
              --
              Tirza Doniger, Ph.D.
              Bioinformatics Unit
              The Mina and Everard Faculty of Life Sciences
              Bar Ilan University
