Hi everyone,
I have a problem with the READ_NAME_REGEX parameter of the Picard MarkDuplicates command. Here is what I tried, step by step:
a. If I run: java -Xmx120g -jar MarkDuplicates.jar INPUT=ERR173172_unpaired.bam OUTPUT=ERR173172_unpaired_rmdup.bam METRICS_FILE=unpaired_duplicates.txt ASSUME_SORTED=true REMOVE_DUPLICATES=true
I get the OUTPUT and the METRICS_FILE, but a warning like this appears during the run:
WARNING 2014-07-16 14:11:40 AbstractDuplicateFindingAlgorithm Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'ERR173172.62049924'. You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates. Note that this message will not be emitted again even if other read names do not match the regex.
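To see why the warning fires, here is a minimal sketch that tests the read name against two patterns outside Picard. It assumes the default pattern is `[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*` and uses Python's `re` module in place of the Java regex engine Picard uses; for patterns this simple the two should behave the same, but that is an assumption. The helper name `matches` is only for illustration.

```python
import re

# Assumed default READ_NAME_REGEX: colon-separated fields with three
# capture groups (as I understand the docs: tile, x, y).
DEFAULT_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"
# The pattern I pass in step b.
MY_REGEX = r"[a-zA-Z0-9]+\.[0-9]+"

read_name = "ERR173172.62049924"


def matches(pattern, name):
    """Return True if the whole read name matches the pattern."""
    return re.fullmatch(pattern, name) is not None


default_ok = matches(DEFAULT_REGEX, read_name)  # name has '.', not ':' fields
mine_ok = matches(MY_REGEX, read_name)
mine_groups = re.compile(MY_REGEX).groups       # number of capture groups

print("default matches:", default_ok)
print("my regex matches:", mine_ok)
print("capture groups in my regex:", mine_groups)
```

The read name has the form accession.number, so it has no colon-separated fields for the default pattern to match. My own pattern matches the name but contains no capture groups, so even if Picard accepts it, I doubt it can pull tile and x/y coordinates out of these names.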
b. If I run: java -Xmx120g -jar MarkDuplicates.jar INPUT=ERR173172_unpaired.bam OUTPUT=ERR173172_unpaired_rmdup.bam METRICS_FILE=unpaired_duplicates.txt ASSUME_SORTED=true REMOVE_DUPLICATES=true READ_NAME_REGEX=[a-zA-Z0-9]+\.[0-9]+
I get no result, only this error:
[Wed Jul 16 14:16:34 CST 2014] picard.sam.MarkDuplicates INPUT=[ERR173172_merged.bam] OUTPUT=ERR173172_rmdup1.bam METRICS_FILE=ERR173172_duplicates1.txt REMOVE_DUPLICATES=true READ_NAME_REGEX=[a-zA-Z0-9]+.[0-9]+ PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jul 16 14:16:34 CST 2014] Executing as lvlh@ubuntu on Linux 3.5.0-23-generic amd64; OpenJDK 64-Bit Server VM 1.6.0_31-b31; Picard version: 1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) JdkDeflater
INFO 2014-07-16 14:16:34 MarkDuplicates Start of doWork freeMemory: 2046462440; totalMemory: 2058027008; maxMemory: 114532483072
INFO 2014-07-16 14:16:34 MarkDuplicates Reading input file and constructing read end information.
INFO 2014-07-16 14:16:34 MarkDuplicates Will retain up to 454493980 data points before spilling to disk.
[Wed Jul 16 14:16:37 CST 2014] picard.sam.MarkDuplicates done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=9330032640
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" picard.PicardException: Input file /datapool/lvlh/pig_reseq/ERX149135/ERR173172/ERR173172_merged.bam is not coordinate sorted.
at picard.sam.MarkDuplicates.openInputs(MarkDuplicates.java:359)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:405)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
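One thing I notice in the log above: the echoed command line shows READ_NAME_REGEX=[a-zA-Z0-9]+.[0-9]+, without the backslash I typed. My guess is that the shell strips an unquoted backslash before Picard sees the argument. The sketch below uses Python's `shlex.split` as a rough stand-in for POSIX shell word splitting (it does not model globbing or other shell features, so treat it as an approximation).

```python
import shlex

# The argument as typed on the command line, without quotes.
unquoted = r"READ_NAME_REGEX=[a-zA-Z0-9]+\.[0-9]+"
# The same argument wrapped in single quotes.
quoted = r"'READ_NAME_REGEX=[a-zA-Z0-9]+\.[0-9]+'"

# shlex.split follows POSIX-style quoting rules: an unquoted backslash
# escapes the next character and is then dropped; single quotes keep
# everything inside them literal.
unquoted_arg = shlex.split(unquoted)[0]
quoted_arg = shlex.split(quoted)[0]

print(unquoted_arg)
print(quoted_arg)
```

If this is what happens, putting the argument in single quotes should keep the `\.` intact, so the dot is matched literally instead of as "any character".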
c. If I set READ_NAME_REGEX=null
I get the OUTPUT and there is no warning. It seems this is the right way to run the command, but why? Is there something wrong with how I am using the READ_NAME_REGEX parameter?