  • picard MarkDuplicates READ_NAME_REGEX

    Hi, everyone.
    I have a problem with the READ_NAME_REGEX parameter of the Picard MarkDuplicates command. I tried the following steps:

    a. If I run: java -Xmx120g -jar MarkDuplicates.jar INPUT=ERR173172_unpaired.bam OUTPUT=ERR173172_unpaired_rmdup.bam METRICS_FILE=unpaired_duplicates.txt ASSUME_SORTED=true REMOVE_DUPLICATES=true
    I get the OUTPUT and METRICS_FILE, but while it runs there is a WARNING like this:
    WARNING 2014-07-16 14:11:40 AbstractDuplicateFindingAlgorithm Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'ERR173172.62049924'. You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates. Note that this message will not be emitted again even if other read names do not match the regex.

    b. If I run: java -Xmx120g -jar MarkDuplicates.jar INPUT=ERR173172_unpaired.bam OUTPUT=ERR173172_unpaired_rmdup.bam METRICS_FILE=unpaired_duplicates.txt ASSUME_SORTED=true REMOVE_DUPLICATES=true READ_NAME_REGEX=[a-zA-Z0-9]+\.[0-9]+
    I get no result, only this error:
    [Wed Jul 16 14:16:34 CST 2014] picard.sam.MarkDuplicates INPUT=[ERR173172_merged.bam] OUTPUT=ERR173172_rmdup1.bam METRICS_FILE=ERR173172_duplicates1.txt REMOVE_DUPLICATES=true READ_NAME_REGEX=[a-zA-Z0-9]+.[0-9]+ PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Wed Jul 16 14:16:34 CST 2014] Executing as lvlh@ubuntu on Linux 3.5.0-23-generic amd64; OpenJDK 64-Bit Server VM 1.6.0_31-b31; Picard version: 1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) JdkDeflater
    INFO 2014-07-16 14:16:34 MarkDuplicates Start of doWork freeMemory: 2046462440; totalMemory: 2058027008; maxMemory: 114532483072
    INFO 2014-07-16 14:16:34 MarkDuplicates Reading input file and constructing read end information.
    INFO 2014-07-16 14:16:34 MarkDuplicates Will retain up to 454493980 data points before spilling to disk.
    [Wed Jul 16 14:16:37 CST 2014] picard.sam.MarkDuplicates done. Elapsed time: 0.05 minutes.
    Runtime.totalMemory()=9330032640
    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
    Exception in thread "main" picard.PicardException: Input file /datapool/lvlh/pig_reseq/ERX149135/ERR173172/ERR173172_merged.bam is not coordinate sorted.
    at picard.sam.MarkDuplicates.openInputs(MarkDuplicates.java:359)
    at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:405)
    at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
    at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)


    c. If I set READ_NAME_REGEX=null,
    I get the OUTPUT and there is no warning. So it seems this is the right way to run the command. Why? Is something wrong with the READ_NAME_REGEX parameter?
    Last edited by Lv Ray; 07-15-2014, 09:41 PM.
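
The warning in (a) can be reproduced outside Picard. The following is a minimal sketch, assuming Picard matches the whole read name against the pattern and that the default pattern is the one quoted in the warning (with the `:(` sequences restored). The Illumina-style name `MACHINE1:3:1101:1234:5678` is hypothetical and only used for contrast. Python's `re` stands in for Java's regex engine here; the two agree on these simple patterns.

```python
import re

# Default pattern as quoted in the Picard warning (assumed reconstruction).
DEFAULT_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"
# Pattern tried in step (b) of the post.
CUSTOM_REGEX = r"[a-zA-Z0-9]+\.[0-9]+"


def matches(pattern, read_name):
    """Return True if the whole read name matches the pattern."""
    return re.fullmatch(pattern, read_name) is not None


def capture_group_count(pattern):
    """Number of capture groups, i.e. values the pattern can extract."""
    return re.compile(pattern).groups


sra_name = "ERR173172.62049924"              # read name from the warning
illumina_name = "MACHINE1:3:1101:1234:5678"  # hypothetical name, for contrast

print(matches(DEFAULT_REGEX, sra_name))       # False: no colon-separated fields
print(matches(DEFAULT_REGEX, illumina_name))  # True
print(matches(CUSTOM_REGEX, sra_name))        # True
print(capture_group_count(DEFAULT_REGEX))     # 3
print(capture_group_count(CUSTOM_REGEX))      # 0
```

The custom pattern matches the SRA-style names but has no capture groups, so it cannot supply the tile and x/y position that optical duplicate detection relies on. Names like `ERR173172.62049924` do not carry that information at all, which is consistent with READ_NAME_REGEX=null in (c) running without a warning.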

  • #2
    To add:
    Running: samtools view ERR173172_merged.bam | less -S
    shows:
    ERR173172.26885410 89 1 1629 1 100M = 1629 0 TTGGGT
    ERR173172.26885410 133 1 1629 0 * = 1629 0 CGGTAT
    ERR173172.8687716 89 1 1638 1 95M = 1638 0 GTTGGT
    ERR173172.8687716 133 1 1638 0 * = 1638 0 GTCTGA
    ERR173172.4507000 153 1 1648 1 92M8S = 1648 0 TCCCGT
    ERR173172.4507000 69 1 1648 0 * = 1648 0 TGTCTT
    ERR173172.53744916 89 1 4280 11 2S69M = 4280 0 GATGCC
    ERR173172.53744916 133 1 4280 0 * = 4280 0 GATAGT
    ERR173172.60595146 153 1 4308 11 100M = 4308 0 CCCCCC
    ERR173172.60595146 69 1 4308 0 * = 4308 0 TGGATA
    ERR173172.55733737 153 1 4310 11 100M = 4310 0 CTCTCC
    ERR173172.55733737 69 1 4310 0 * = 4310 0 TTATTT
    ERR173172.48676987 153 1 4313 11 100M = 4313 0 TCCCCC
    ERR173172.48676987 69 1 4313 0 * = 4313 0 ATTTGG
    ERR173172.8193734 89 1 4314 1 73M = 4314 0 CCCCCA
    ERR173172.8193734 133 1 4314 0 * = 4314 0 GATTTG



    • #3
      Look at the Picard error message in part b of your original message. It tells you exactly what the problem is.

      Exception in thread "main" picard.PicardException: Input file /datapool/lvlh/pig_reseq/ERX149135/ERR173172/ERR173172_merged.bam is not coordinate sorted.
      The samtools view output shown in your second message further confirms that your BAM file is sorted by read name, not coordinate. Go back and sort your BAM file to put it in the proper (coordinate sorted) order and then repeat the Picard command as in (b) above.



      • #4
        Thank you, kmcarr, but I think you are wrong.
        ERR173172.26885410 89 1 1629 1 100M = 1629 0 TTGGGT
        In lines like this from my second message, the 4th column (position) shows that my BAM file is sorted by coordinate, not by the first column (read name).
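
The claim about the 4th column can be checked mechanically. Below is a minimal sketch, assuming whitespace-separated SAM records and a known reference order (here just chromosome `1`, as in the excerpt). The function name `is_coordinate_sorted` is made up for illustration. It only tests the lines it is given, so it says nothing about the rest of the file or about what the header declares.

```python
def is_coordinate_sorted(sam_lines, reference_order):
    """Check that records are ordered by (reference index, position)."""
    rank = {name: i for i, name in enumerate(reference_order)}
    previous = None
    for line in sam_lines:
        fields = line.split()
        rname, pos = fields[2], int(fields[3])
        if rname == "*":
            # Records with no reference are skipped in this sketch.
            continue
        key = (rank[rname], pos)
        if previous is not None and key < previous:
            return False
        previous = key
    return True


# First rows of the excerpt posted in #2.
excerpt = [
    "ERR173172.26885410 89 1 1629 1 100M = 1629 0 TTGGGT",
    "ERR173172.26885410 133 1 1629 0 * = 1629 0 CGGTAT",
    "ERR173172.8687716 89 1 1638 1 95M = 1638 0 GTTGGT",
    "ERR173172.8687716 133 1 1638 0 * = 1638 0 GTCTGA",
    "ERR173172.4507000 153 1 1648 1 92M8S = 1648 0 TCCCGT",
    "ERR173172.53744916 89 1 4280 11 2S69M = 4280 0 GATGCC",
]

print(is_coordinate_sorted(excerpt, ["1"]))  # True
```

Positions never decrease in the excerpt even though the read names are not in order, which supports reading it as coordinate order rather than name order.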



        • #5
          Originally posted by Lv Ray
          Thank you, kmcarr, but I think you are wrong.
          ERR173172.26885410 89 1 1629 1 100M = 1629 0 TTGGGT
          In lines like this from my second message, the 4th column (position) shows that my BAM file is sorted by coordinate, not by the first column (read name).
          I see now that the fragment of the BAM file you copied is very unusual in that every read pair shown has only one mate mapped. This is why the sorting appeared at first glance to be by read name. However, that BAM output is from ERR173172_merged.bam, and you are trying to run MarkDuplicates on a different file, ERR173172_unpaired.bam. Are you sure the sort order of ERR173172_unpaired.bam is correct? What does the BAM file header look like (run "samtools view -H ERR173172_unpaired.bam")?

          Nonetheless, the error message still clearly indicates that Picard believes the BAM file is not properly sorted, and the problem has nothing to do with the read name regex. This may be caused by the unusual nature of this BAM file, i.e. that it only contains reads with one unmapped mate.
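
The "only one mate mapped" observation comes from the FLAG values in the second column (89, 133, 153, 69). A minimal sketch of decoding them, using the bit definitions from the SAM specification; the short labels and the function name `decode_flag` are chosen here for illustration:

```python
# Bit values from the SAM specification; labels are informal shorthand.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


for flag in (89, 133, 153, 69):
    print(flag, decode_flag(flag))
```

In each pair from the excerpt, one record carries `mate_unmapped` (89 or 153) and the other carries `unmapped` (133 or 69). The unmapped mate is placed at the position of its mapped partner, which is why the two records of a pair sit next to each other in a coordinate-ordered listing.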



          • #6
            I am sorry, kmcarr. I made a mistake in my question. However, I checked some of my datasets as you suggested ("samtools view -H *.bam"):
            samtools view -H ERR173172_unpaired.sorted.bam |less
            @HD VN:1.0 SO:unsorted
            @SQ SN:1 LN:315321322
            @SQ SN:10 LN:79102373
            @SQ SN:11 LN:87690581
            @SQ SN:12 LN:63588571
            @SQ SN:13 LN:218635234
            @SQ SN:14 LN:153851969
            @SQ SN:15 LN:157681621
            @SQ SN:16 LN:86898991
            @SQ SN:17 LN:69701581
            @SQ SN:18 LN:61220071
            @SQ SN:2 LN:162569375
            @SQ SN:3 LN:144787322
            @SQ SN:4 LN:143465943
            @SQ SN:5 LN:111506441
            @SQ SN:6 LN:157765593
            @SQ SN:7 LN:134764511
            @SQ SN:8 LN:148491826
            @SQ SN:9 LN:153670197
            @SQ SN:MT LN:16613
            @SQ SN:X LN:144288218
            @SQ SN:Y LN:1637716
            @SQ SN:JH118944.1 LN:594937
            @SQ SN:JH118636.1 LN:547643
            @SQ SN:JH118966.1 LN:497305
            @SQ SN:JH118951.1 LN:479775
            @SQ SN:JH118524.1 LN:477901
            @SQ SN:JH118901.1 LN:451395

            It seems that I sorted it, but the command "samtools view -H" reports it as unsorted (@HD VN:1.0 SO:unsorted).
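
The declared sort order can be pulled out of the header text directly. This is a minimal sketch, assuming the header lines have already been obtained (for example from "samtools view -H") and are whitespace- or tab-separated; the function name `declared_sort_order` is hypothetical. Per the SAM specification, a missing SO tag is treated as `unknown`.

```python
def declared_sort_order(header_lines):
    """Return the SO value from the @HD line, or 'unknown' if absent."""
    for line in header_lines:
        if not line.startswith("@HD"):
            continue
        for field in line.split()[1:]:
            tag, _, value = field.partition(":")
            if tag == "SO":
                return value
    return "unknown"


# First lines of the header posted above.
header = [
    "@HD VN:1.0 SO:unsorted",
    "@SQ SN:1 LN:315321322",
    "@SQ SN:10 LN:79102373",
]

print(declared_sort_order(header))  # unsorted
```

The SO tag records what the file claims, not the order the records are actually in, so a file can be coordinate-ordered and still be labelled unsorted. That may also explain the difference between the runs: the log in (b) shows ASSUME_SORTED=false, whereas the command in (a) passed ASSUME_SORTED=true.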
