Seqanswers Leaderboard Ad

**NGSfan** · 06-24-2010, 12:30 AM

I have used GATK successfully, but it looks like they've updated it since the last version I've been using (they separated mismatches from indels and merged them - now it looks like they do it all in one). I will try this new version you've shown.

SRMA I just tried once, but it crashed. But I really need to give it another fair chance since I really didn't look into the problem.

I was doing recalibration followed by local realignment, but I think perhaps it is better to do local realignment first? Seems to be the order recommended by GATK.

Anyone else have any thoughts on this?

**seq_GA** · 06-24-2010, 12:36 AM

Thx for your reply.

Are you referring to these steps under samtools protocol.

1.Collect regions around potential indels and clusters of mismatches:
2.Merge intervals
3.Realign reads overlapping specified regions:
4.Write the cleaned alignment file:

GATK manual suggests you to to basequality recalibration followed by local realignment. Its bit confusing in the steps involved.

**seq_GA** · 06-24-2010, 12:38 AM

whole testing smra also crashed

**NGSfan** · 06-24-2010, 01:20 AM

Originally posted by seq_GA View Post

Thx for your reply.

Are you referring to these steps under samtools protocol.

1.Collect regions around potential indels and clusters of mismatches:
2.Merge intervals
3.Realign reads overlapping specified regions:
4.Write the cleaned alignment file:

GATK manual suggests you to to basequality recalibration followed by local realignment. Its bit confusing in the steps involved.

Yes, exactly! I initially started from samtools protocols , which implies that you do:

1) recalibration with GATK
2) local (MSA) realignment with GATK

But if you look at the GATK powerpoint from Mark DePristo (great set of slides btw!) GATK ppt then you see in the flow chart
on slide #20 show that MSA realignment comes before recalibration. So I am going back to correct my previous runs.

The realignment before recalibration actually makes more sense - since you will fix the reads that had problem alignments, and then recalibrate based on these better alignments!

I don't know how big or small an effect the order of these two steps has, but I think now this is the correct way.

I will try the new version of GATK and get back to you.

**zee** · 06-24-2010, 01:30 AM

I have run gatk successfully, using version 1.0.3471

This is how my system is configured:

I have a script simply named "gatk" which contains

#!/bin/sh

SRC="/somepath/GenomeAnalysisTK-1.0.3471"
MEMORY="4096M"

java -Xmx$MEMORY -jar $SRC/GenomeAnalysisTK.jar $*

Then I just run it in the following way, assuming you got "genome.fasta" and "input.bam". Note that gatk assumes that the sequence dictionary for the genome exists as "genome.dict". I therefore add the Picard step to do the dictionary construction.

#create sequence dictionary with Picard

java -jar /opt/picard/dist/CreateSequenceDictionary.jar R=genome.fasta O=genome.dict

#find realigner targets
gatk -T RealignerTargetCreator -I input.bam -R genome.fasta -o output.intervals

# run realigner
gatk -T IndelRealigner -I input.bam -R genome.fasta -targetIntervals output.intervals --output realigned.bam

And it worked fine for me.

**nilshomer** · 06-24-2010, 08:36 AM

Originally posted by seq_GA View Post

whole testing smra also crashed

What version and could I get an example to debug? I would love this type of feedback.

Originally posted by NGSfan View Post

SRMA I just tried once, but it crashed. But I really need to give it another fair chance since I really didn't look into the problem.

Agian, what version and could I get an example to debug? Without your feedback I cannot support and improve the tool. I am very responsive solving these types of problems

**seq_GA** · 06-24-2010, 05:29 PM

Hi,

I am using srma-0.1.6 version from http://sourceforge.net/projects/srma/files/
I tried giving around 2G for MAX_HEAP_SIZE but still it went out of memory.

2. As explained in the manual, as my region of interest is around 19000 regions from the whole exome, I constructed a tab delimited file and then sorted it as below (sample)
RANGES=range_sorted.txt

Code:

chr10   100000807       100000927
chr10   100001260       100001500
chr10   100002097       100002217
chr10   100003241       100003601
chr10   100005284       100005519
chr10   100006490       100006719
chr10   100007353       100007593
chr10   100007699       100007939
chr10   100008745       100008985
chr10   100009096       100009216

I get the error as below.

Code:

Allele coverage cutoffs:
coverage:  1    minimum allele coverage: 0
coverage:  2    minimum allele coverage: 0
coverage:  3    minimum allele coverage: 0
coverage:  4    minimum allele coverage: 1
coverage:  5    minimum allele coverage: 1
coverage:  6    minimum allele coverage: 1
coverage:  7    minimum allele coverage: 2
coverage:  8    minimum allele coverage: 2
coverage:  9    minimum allele coverage: 3
coverage: >9    minimum allele coverage: 3
java.lang.Exception: Ranges must be in sorted order in RANGES (line numbers 271-272
        at srma.Ranges.addRange(Ranges.java:188)
        at srma.Ranges.<init>(Ranges.java:36)
        at srma.SRMA.doWork(SRMA.java:178)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
        at srma.SRMA.main(SRMA.java:94)

Let me know whether I am missing something here. Thanks.

**nilshomer** · 06-24-2010, 05:40 PM

Originally posted by seq_GA View Post

Hi,

I am using srma-0.1.6 version from http://sourceforge.net/projects/srma/files/
I tried giving around 2G for MAX_HEAP_SIZE but still it went out of memory.

Did you use the "-Xmx2g" option to Java ("java -Xmx2g -jar srma-0.1.6.jar [options]")? The "MAX_HEAP_SIZE" is a # used internally by SRMA, and is not passed into java. It shouldn't crash if the internal heap size is too small, only some regions will be ignored.

Originally posted by seq_GA View Post

2. As explained in the manual, as my region of interest is around 19000 regions from the whole exome, I constructed a tab delimited file and then sorted it as below (sample)
RANGES=range_sorted.txt

Code:

chr10   100000807       100000927
chr10   100001260       100001500
chr10   100002097       100002217
chr10   100003241       100003601
chr10   100005284       100005519
chr10   100006490       100006719
chr10   100007353       100007593
chr10   100007699       100007939
chr10   100008745       100008985
chr10   100009096       100009216

I get the error as below.

Code:

Allele coverage cutoffs:
coverage:  1    minimum allele coverage: 0
coverage:  2    minimum allele coverage: 0
coverage:  3    minimum allele coverage: 0
coverage:  4    minimum allele coverage: 1
coverage:  5    minimum allele coverage: 1
coverage:  6    minimum allele coverage: 1
coverage:  7    minimum allele coverage: 2
coverage:  8    minimum allele coverage: 2
coverage:  9    minimum allele coverage: 3
coverage: >9    minimum allele coverage: 3
java.lang.Exception: Ranges must be in sorted order in RANGES (line numbers 271-272
        at srma.Ranges.addRange(Ranges.java:188)
        at srma.Ranges.<init>(Ranges.java:36)
        at srma.SRMA.doWork(SRMA.java:178)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
        at srma.SRMA.main(SRMA.java:94)

Let me know whether I am missing something here. Thanks.

The ranges in your file should be in sorted order. It is complaining that the out-of-order line is 271-272. Can you check that line?

**NGSfan** · 06-25-2010, 05:34 AM

Hmm.. the new version of GATK realigner now gives me errors, complaining about read IDs whereas the old version (from April 2010) still works.

Also, GATK complains the "rod' file that I got from GATK is not in the right format for the -B option.

**NGSfan** · 06-25-2010, 06:30 AM

Hi Nils,

I am testing SRMA with a sorted bam file from a BFAST pair-end alignment of a whole exome data set from Brainbridge et al (http://genomebiology.com/2010/11/6/R62 , and SRA012615).

Just curious, is there a requirement for the number of threads (must be a power of 2?) I tried 8, 16, and got "unsynchronized addition":

java -Xmx128g -jar /net/chipdata/NGS.analysis/bin/srma/srma-0.1.6.jar I=s_1_paired_sequence.bfast.sorted.bam O=s_1_paired_sequence.bfast.srma.bam R=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=16 >&srma_log
[Fri Jun 25 16:24:59 CEST 2010] srma.SRMA INPUT=s_1_paired_sequence.bfast.sorted.bam OUTPUT=s_1_paired_sequence.bfast.srma.bam REFERENCE=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=22 OFFSET=20 MIN_MAPQ=0 MINIMUM_ALLELE_PROBABILITY=0.1 MINIMUM_A
LELE_COVERAGE=3 MAXIMUM_TOTAL_COVERAGE=100 CORRECT_BASES=false USE_SEQUENCE_QUALITIES=true QUIET_STDERR=false MAX_HEAP_SIZE=8192 MAX_QUEUE_SIZE=65536 TMP_DIR=/tmp/leparc VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECOR
S_IN_RAM=500000
Allele coverage cutoffs:
coverage: 1 minimum allele coverage: 0
coverage: 2 minimum allele coverage: 0
coverage: 3 minimum allele coverage: 0
coverage: 4 minimum allele coverage: 1
coverage: 5 minimum allele coverage: 1
coverage: 6 minimum allele coverage: 1
coverage: 7 minimum allele coverage: 2
coverage: 8 minimum allele coverage: 2
coverage: 9 minimum allele coverage: 3
coverage: >9 minimum allele coverage: 3
** Warning: option NUM_THREADS currently may not increase performance significantly. **
** Try running multiple processes with RANGES if the speed does not increase. **
java.lang.Exception: Unsynchronized addition
at srma.Graph.addSAMRecord(Graph.java:61)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]
java.lang.Exception: Unsynchronized addition
at srma.Graph.addSAMRecord(Graph.java:61)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]
java.lang.ArrayIndexOutOfBoundsException: -12
at java.util.ArrayList.elementData(ArrayList.java:338)
at java.util.ArrayList.get(ArrayList.java:351)
at srma.Graph.addNode(Graph.java:154)
at srma.Graph.addSAMRecord(Graph.java:109)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]
java.lang.Exception: Unsynchronized addition
at srma.Graph.addSAMRecord(Graph.java:61)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]
java.lang.Exception: Unsynchronized addition
at srma.Graph.addSAMRecord(Graph.java:61)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]

**nilshomer** · 06-25-2010, 08:10 AM

Originally posted by NGSfan View Post

Hi Nils,

I am testing SRMA with a sorted bam file from a BFAST pair-end alignment of a whole exome data set from Brainbridge et al (http://genomebiology.com/2010/11/6/R62 , and SRA012615).

Just curious, is there a requirement for the number of threads (must be a power of 2?) I tried 8, 16, and got "unsynchronized addition":

java -Xmx128g -jar /net/chipdata/NGS.analysis/bin/srma/srma-0.1.6.jar I=s_1_paired_sequence.bfast.sorted.bam O=s_1_paired_sequence.bfast.srma.bam R=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=16 >&srma_log
[Fri Jun 25 16:24:59 CEST 2010] srma.SRMA INPUT=s_1_paired_sequence.bfast.sorted.bam OUTPUT=s_1_paired_sequence.bfast.srma.bam REFERENCE=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=22 OFFSET=20 MIN_MAPQ=0 MINIMUM_ALLELE_PROBABILITY=0.1 MINIMUM_A
LELE_COVERAGE=3 MAXIMUM_TOTAL_COVERAGE=100 CORRECT_BASES=false USE_SEQUENCE_QUALITIES=true QUIET_STDERR=false MAX_HEAP_SIZE=8192 MAX_QUEUE_SIZE=65536 TMP_DIR=/tmp/leparc VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECOR
S_IN_RAM=500000
Allele coverage cutoffs:
coverage: 1 minimum allele coverage: 0
coverage: 2 minimum allele coverage: 0
coverage: 3 minimum allele coverage: 0
coverage: 4 minimum allele coverage: 1
coverage: 5 minimum allele coverage: 1
coverage: 6 minimum allele coverage: 1
coverage: 7 minimum allele coverage: 2
coverage: 8 minimum allele coverage: 2
coverage: 9 minimum allele coverage: 3
coverage: >9 minimum allele coverage: 3
** Warning: option NUM_THREADS currently may not increase performance significantly. **
** Try running multiple processes with RANGES if the speed does not increase. **
java.lang.Exception: Unsynchronized addition
at srma.Graph.addSAMRecord(Graph.java:61)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]
java.lang.Exception: Unsynchronized addition
at srma.Graph.addSAMRecord(Graph.java:61)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]
java.lang.ArrayIndexOutOfBoundsException: -12
at java.util.ArrayList.elementData(ArrayList.java:338)
at java.util.ArrayList.get(ArrayList.java:351)
at srma.Graph.addNode(Graph.java:154)
at srma.Graph.addSAMRecord(Graph.java:109)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]
java.lang.Exception: Unsynchronized addition
at srma.Graph.addSAMRecord(Graph.java:61)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]
java.lang.Exception: Unsynchronized addition
at srma.Graph.addSAMRecord(Graph.java:61)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]

Don't use threading, it isn't helping running time right now on most systems (not sure why, not a Java expert). I do wan the BAM file so I can debug the synchronization issue (crashing srma will make it stronger). Can you post the BAM file somewhere so I can take a look? For now, try without threads.

**NGSfan** · 06-28-2010, 01:05 AM

I will try to find a place to host the BAM file for you. It's about 2.6GB after sorting.

**NGSfan** · 06-28-2010, 05:22 AM

access to Bainbridge whole exome bam file

Hi Nils,

Due to some local issues (proxies/security) I had to chop the bam file up and place it on a file sharing service:

There are five files:

http://www.filedropper.com/bainbridgeexomeaa

http://www.filedropper.com/bainbridgeexomeab

http://www.filedropper.com/bainbridgeexomeac

http://www.filedropper.com/bainbridgeexomead

http://www.filedropper.com/bainbridgeexomeae

You can cat the files back together in the same order (aa,ab,ac,ad,ae) to get the complete bam.

The BAM file is already sorted and ready to go. If you want to have the index (.bai) already made, you can also download it here:

http://www.filedropper.com/s1pairedsequencebfastsortedbam

Here is the output from samtools "flagstat":

38641522 in total
0 QC failure
0 duplicates
37596789 mapped (97.30%)
38641522 paired in sequencing
19320761 read1
19320761 read2
38641522 properly paired (100.00%)
37188108 with itself and mate mapped
408681 singletons (1.06%)
115342 with mate mapped to a different chr
106661 with mate mapped to a different chr (mapQ>=5)

Please let me know if you have any problems! And thanks for looking into the SRMA issue.

**NGSfan** · 06-29-2010, 12:39 AM

Just some extra notes for you. I just completed the realignment of the reads from that bam file I posted, using GATK's local realignment tool (38 hours on 1 cpu! they do not allow multiple threads with this module). So no corruption or other issues with the bam file.

In parallel , I ran SRMA with a single thread, and looks like it was progressing really well until I got an error. Looks like I should increase the MAX_HEAP_SIZE ? I'll try that and report back

Would it be possible to split the processing by chromosome to simplify things?

Like running chromosomes 1,2,3, etc, to 9 individually, then process in one thread each on the sets: chr 10+11, 12+13, 14+15, 16+17+18, 19+20+21+22 , X+Y . Would thread out nicely on a 16-cpu machine...

[Fri Jun 25 16:33:11 CEST 2010] srma.SRMA INPUT=s_1_paired_sequence.bfast.sorted.bam OUTPUT=s_1_paired_sequence.bfast.srma2.bam REFERENCE=/net/chipdata/HumanGenome/hg18/hg18.fa OFFSET=20 MIN_MAPQ=0 MINIMUM_ALLELE_PROBABILITY=0.1 MINIMUM_ALLELE_COVERAGE
3 MAXIMUM_TOTAL_COVERAGE=100 CORRECT_BASES=false USE_SEQUENCE_QUALITIES=true QUIET_STDERR=false MAX_HEAP_SIZE=8192 MAX_QUEUE_SIZE=65536 NUM_THREADS=1 TMP_DIR=/tmp/leparc VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECOR
S_IN_RAM=500000
Allele coverage cutoffs:
coverage: 1 minimum allele coverage: 0
coverage: 2 minimum allele coverage: 0
coverage: 3 minimum allele coverage: 0
coverage: 4 minimum allele coverage: 1
coverage: 5 minimum allele coverage: 1
coverage: 6 minimum allele coverage: 1
coverage: 7 minimum allele coverage: 2
coverage: 8 minimum allele coverage: 2
coverage: 9 minimum allele coverage: 3
coverage: >9 minimum allele coverage: 3
Records processsed: 196556 (last chr1:4875295-4875369)
Records processsed: 327629 (last chr1:11802305-11802379)
Records processsed: 458668 (last chr1:16329237-16329311)
Records processsed: 589729 (last chr1:19505170-19505244)
........... output trimmed .............
Records processsed: 8580872 (last chr3:129300240-129300314)
Records processsed: 8717321 (last chr3:146329681-136329753)
Records processsed: 8848435 (last chr3:158581531-158581605)
Exception in thread "Thread-274" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2772)
at java.util.Arrays.copyOf(Arrays.java:2746)
at java.util.ArrayList.ensureCapacity(ArrayList.java:187)
at java.util.ArrayList.add(ArrayList.java:395)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
at java.util.ArrayList$SubList.add(ArrayList.java:928)
java.util.ConcurrentModificationException
at java.util.ArrayList$SubList.checkForComodification(ArrayList.java:1091)
at java.util.ArrayList$SubList.size(ArrayList.java:921)
at srma.Graph.contains(Graph.java:195)
at srma.Graph.addNode(Graph.java:143)
at srma.Graph.addSAMRecord(Graph.java:109)
at srma.SRMA$GraphThread.run(SRMA.java:691)
Please report bugs to [email protected]

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 55 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 51 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 45 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Local realignment using GATK and smra

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News