  • Picard MarkDuplicates - whole bam file marked as duplicates

    Hi,

    I ran an RNA-Seq data set through the RUM pipeline to align the data. I then tried Picard's MarkDuplicates, and it tagged every single read in the BAM file as a duplicate.
    I used the following parameters for MarkDuplicates:
    java -jar /private/software/packages/picard-tools-1.84/MarkDuplicates.jar I=RUM-sorted.bam O=RUM-sorted-dups_marked.bam METRICS_FILE=dups_metrics AS=true VALIDATION_STRINGENCY=SILENT

    (I had sorted the BAM with samtools, but MarkDuplicates did not recognize it as sorted, which is why I used AS=true.)

    The dups_metrics file lists percent_duplication at 43%.

    Any ideas?

    Thanks!
    Tirza
    --
    Tirza Doniger, Ph.D.
    Bioinformatics Unit
    The Mina and Everard Faculty of Life Sciences
    Bar Ilan University

  • #2
    Hey tdoniger

    Did you ever find the solution to your problem? I have the same problem, and my metrics file only claims one percent!

    I'm currently running three different approaches to the problem and waiting for the batches to finish:

    - use samtools rmdup instead
    - add VALIDATION_STRINGENCY=LENIENT to my command string for MarkDuplicates
    - don't bother to mark duplicates (I only did this because GATK requires it; my other pipeline will run happily without this step) and try to trick GATK into accepting my file by adding an @PG line to my header to say I ran MarkDuplicates. Yes, I know this is probably not recommended; I just thought I'd see what happened. That is, use samtools reheader to take the header from the file in which MarkDuplicates marked every read as a duplicate and put it onto the file I would have used as input.

    My command was

    java -Xmx2G -jar MarkDuplicates.jar INPUT=infile.sorted.bam OUTPUT=outfile.sorted.dedupe.bam METRICS_FILE=myMetricsFile

    I'd also previously sorted the BAM using samtools.

    • #3
      I had assumed that, because every line contained PG:Z:MarkDuplicates, every read had been marked as a duplicate. This is not the case. It is the flag in the second column that indicates whether a read is a duplicate or not.

      Try: samtools flagstat library_no_dups.bam
      To find which flag values represent duplicates, see http://picard.sourceforge.net/explain-flags.html
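As a rough illustration of the point above, here is a minimal Python sketch that checks the FLAG column of SAM-formatted text lines for the duplicate bit (0x400, decimal 1024, in the SAM specification). The two sample records are hypothetical and made up for illustration; both carry the PG:Z:MarkDuplicates tag, but only the second has the duplicate bit set.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate" (decimal 1024)


def is_duplicate(sam_line):
    """Return True if the record's FLAG (second column) has the duplicate bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return bool(flag & DUPLICATE_FLAG)


def duplicate_fraction(sam_lines):
    """Fraction of alignment records flagged as duplicates; header lines are skipped."""
    total = 0
    dupes = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and @HD/@SQ/@PG header lines
        total += 1
        if is_duplicate(line):
            dupes += 1
    return dupes / total if total else 0.0


# Hypothetical records (not from the original data set).
# FLAG 99 has no duplicate bit; FLAG 1123 = 99 + 1024 does.
records = [
    "read1\t99\tchr1\t100\t255\t50M\t=\t300\t250\t*\t*\tPG:Z:MarkDuplicates",
    "read2\t1123\tchr1\t100\t255\t50M\t=\t300\t250\t*\t*\tPG:Z:MarkDuplicates",
]

for rec in records:
    print(rec.split("\t")[0], "duplicate" if is_duplicate(rec) else "not duplicate")
```

The same functions could be applied to the text output of samtools view on the marked BAM file; the PG tag only records which program touched the read, so it appears on every record regardless of the flag.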
      --
      Tirza Doniger, Ph.D.
      Bioinformatics Unit
      The Mina and Everard Faculty of Life Sciences
      Bar Ilan University

      • #4
        I solved this and now I can't remember how. Really slack of me not to come back and post the solution, but tomorrow I'll check my pipelines and have a look.

        • #5
          Hi,

          Thanks! But I was trying to explain that I managed to solve it. I had thought that every line marked by "PG:Z:MarkDuplicates" was a duplicate, but really this was not the case. The duplicates are marked in the flag in the second column of the same file.

          Best,
          Tirza
          --
          Tirza Doniger, Ph.D.
          Bioinformatics Unit
          The Mina and Everard Faculty of Life Sciences
          Bar Ilan University

          • #6
            Going off piste here, but 43% duplication is pretty high.
            How much PCR did you do?

            • #7
              Quite a bit. We had very little starting material.
              --
              Tirza Doniger, Ph.D.
              Bioinformatics Unit
              The Mina and Everard Faculty of Life Sciences
              Bar Ilan University
