![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Local realignment around indels (454 technology) | desmo | Bioinformatics | 2 | 05-17-2012 06:31 AM |
GATK: VCF file for Local realignment around indels | jorge | Bioinformatics | 2 | 10-11-2011 12:15 AM |
Samtools mpileup creates extra large file after local realignment | Hkins552 | Bioinformatics | 7 | 07-21-2011 02:20 PM |
Interpreting Local Realignment results from GATK | alma | Bioinformatics | 0 | 07-07-2011 03:46 AM |
GATK recalibration score and local realignment: what is the first step? | m_elena_bioinfo | Bioinformatics | 0 | 01-24-2011 03:59 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Senior Member
Location: Asiana Join Date: Feb 2009
Posts: 124
|
![]()
Hi,
I am trying to use GATK to do base quality recalibration followed by local realignment to reduce the flase positive indels. I am finding problem in getting GATK to work as have few doubts and smra seems very easy in a sigle step to move ahead. Anyone has done the comparison before. I am struck with GATK where in the first step of realignment step, Code:
java -jar /path/to/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -I /path/to/input.bam \ -R /path/to/reference.fasta \ [-L intervals] \ -o /path/to/output.intervals \ [-B snps,VCF,/path/to/SNP_calls.vcf] \ [-B indels,VCF,/path/to/indel_calls.vcf] \ [-B dbsnp,dbsnp,/path/to/dbsnp.rod] I would appreciate very much if someone could through light on these queries. Thanks. |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]()
I have used GATK successfully, but it looks like they've updated it since the last version I've been using (they separated mismatches from indels and merged them - now it looks like they do it all in one). I will try this new version you've shown.
SRMA I just tried once, but it crashed. But I really need to give it another fair chance since I really didn't look into the problem. I was doing recalibration followed by local realignment, but I think perhaps it is better to do local realignment first? Seems to be the order recommended by GATK. Anyone else have any thoughts on this? |
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Asiana Join Date: Feb 2009
Posts: 124
|
![]()
Thx for your reply.
Are you referring to these steps under samtools protocol. 1.Collect regions around potential indels and clusters of mismatches: 2.Merge intervals 3.Realign reads overlapping specified regions: 4.Write the cleaned alignment file: GATK manual suggests you to to basequality recalibration followed by local realignment. Its bit confusing in the steps involved. |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Asiana Join Date: Feb 2009
Posts: 124
|
![]()
whole testing smra also crashed
![]() |
![]() |
![]() |
![]() |
#5 | |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]() Quote:
1) recalibration with GATK 2) local (MSA) realignment with GATK But if you look at the GATK powerpoint from Mark DePristo (great set of slides btw!) GATK ppt then you see in the flow chart on slide #20 show that MSA realignment comes before recalibration. So I am going back to correct my previous runs. The realignment before recalibration actually makes more sense - since you will fix the reads that had problem alignments, and then recalibrate based on these better alignments! I don't know how big or small an effect the order of these two steps has, but I think now this is the correct way. I will try the new version of GATK and get back to you. |
|
![]() |
![]() |
![]() |
#6 | ||
NGS specialist
Location: Malaysia Join Date: Apr 2008
Posts: 249
|
![]()
I have run gatk successfully, using version 1.0.3471
This is how my system is configured: I have a script simply named "gatk" which contains Quote:
Quote:
|
||
![]() |
![]() |
![]() |
#7 | |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
|
![]()
What version and could I get an example to debug? I would love this type of feedback.
Quote:
![]() |
|
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: Asiana Join Date: Feb 2009
Posts: 124
|
![]()
Hi,
I am using srma-0.1.6 version from http://sourceforge.net/projects/srma/files/ I tried giving around 2G for MAX_HEAP_SIZE but still it went out of memory. 2. As explained in the manual, as my region of interest is around 19000 regions from the whole exome, I constructed a tab delimited file and then sorted it as below (sample) RANGES=range_sorted.txt Code:
chr10 100000807 100000927 chr10 100001260 100001500 chr10 100002097 100002217 chr10 100003241 100003601 chr10 100005284 100005519 chr10 100006490 100006719 chr10 100007353 100007593 chr10 100007699 100007939 chr10 100008745 100008985 chr10 100009096 100009216 Code:
Allele coverage cutoffs: coverage: 1 minimum allele coverage: 0 coverage: 2 minimum allele coverage: 0 coverage: 3 minimum allele coverage: 0 coverage: 4 minimum allele coverage: 1 coverage: 5 minimum allele coverage: 1 coverage: 6 minimum allele coverage: 1 coverage: 7 minimum allele coverage: 2 coverage: 8 minimum allele coverage: 2 coverage: 9 minimum allele coverage: 3 coverage: >9 minimum allele coverage: 3 java.lang.Exception: Ranges must be in sorted order in RANGES (line numbers 271-272 at srma.Ranges.addRange(Ranges.java:188) at srma.Ranges.<init>(Ranges.java:36) at srma.SRMA.doWork(SRMA.java:178) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150) at srma.SRMA.main(SRMA.java:94) |
![]() |
![]() |
![]() |
#9 | ||
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
|
![]() Quote:
Quote:
|
||
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]()
Hmm.. the new version of GATK realigner now gives me errors, complaining about read IDs whereas the old version (from April 2010) still works.
Also, GATK complains the "rod' file that I got from GATK is not in the right format for the -B option. Last edited by NGSfan; 06-25-2010 at 07:19 AM. |
![]() |
![]() |
![]() |
#11 |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]()
Hi Nils,
I am testing SRMA with a sorted bam file from a BFAST pair-end alignment of a whole exome data set from Brainbridge et al (http://genomebiology.com/2010/11/6/R62 , and SRA012615). Just curious, is there a requirement for the number of threads (must be a power of 2?) I tried 8, 16, and got "unsynchronized addition": java -Xmx128g -jar /net/chipdata/NGS.analysis/bin/srma/srma-0.1.6.jar I=s_1_paired_sequence.bfast.sorted.bam O=s_1_paired_sequence.bfast.srma.bam R=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=16 >&srma_log [Fri Jun 25 16:24:59 CEST 2010] srma.SRMA INPUT=s_1_paired_sequence.bfast.sorted.bam OUTPUT=s_1_paired_sequence.bfast.srma.bam REFERENCE=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=22 OFFSET=20 MIN_MAPQ=0 MINIMUM_ALLELE_PROBABILITY=0.1 MINIMUM_A LELE_COVERAGE=3 MAXIMUM_TOTAL_COVERAGE=100 CORRECT_BASES=false USE_SEQUENCE_QUALITIES=true QUIET_STDERR=false MAX_HEAP_SIZE=8192 MAX_QUEUE_SIZE=65536 TMP_DIR=/tmp/leparc VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECOR S_IN_RAM=500000 Allele coverage cutoffs: coverage: 1 minimum allele coverage: 0 coverage: 2 minimum allele coverage: 0 coverage: 3 minimum allele coverage: 0 coverage: 4 minimum allele coverage: 1 coverage: 5 minimum allele coverage: 1 coverage: 6 minimum allele coverage: 1 coverage: 7 minimum allele coverage: 2 coverage: 8 minimum allele coverage: 2 coverage: 9 minimum allele coverage: 3 coverage: >9 minimum allele coverage: 3 ** Warning: option NUM_THREADS currently may not increase performance significantly. ** ** Try running multiple processes with RANGES if the speed does not increase. ** java.lang.Exception: Unsynchronized addition at srma.Graph.addSAMRecord(Graph.java:61) at srma.SRMA$GraphThread.run(SRMA.java:691) Please report bugs to srma-help@lists.sourceforge.net java.lang.Exception: Unsynchronized addition at srma.Graph.addSAMRecord(Graph.java:61) at srma.SRMA$GraphThread.run(SRMA.java:691) Please report bugs to srma-help@lists.sourceforge.net java.lang.ArrayIndexOutOfBoundsException: -12 at java.util.ArrayList.elementData(ArrayList.java:338) at java.util.ArrayList.get(ArrayList.java:351) at srma.Graph.addNode(Graph.java:154) at srma.Graph.addSAMRecord(Graph.java:109) at srma.SRMA$GraphThread.run(SRMA.java:691) Please report bugs to srma-help@lists.sourceforge.net java.lang.Exception: Unsynchronized addition at srma.Graph.addSAMRecord(Graph.java:61) at srma.SRMA$GraphThread.run(SRMA.java:691) Please report bugs to srma-help@lists.sourceforge.net java.lang.Exception: Unsynchronized addition at srma.Graph.addSAMRecord(Graph.java:61) at srma.SRMA$GraphThread.run(SRMA.java:691) Please report bugs to srma-help@lists.sourceforge.net Last edited by NGSfan; 06-25-2010 at 07:33 AM. |
![]() |
![]() |
![]() |
#12 | |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#13 |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]()
I will try to find a place to host the BAM file for you. It's about 2.6GB after sorting.
|
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]()
Hi Nils,
Due to some local issues (proxies/security) I had to chop the bam file up and place it on a file sharing service: There are five files: http://www.filedropper.com/bainbridgeexomeaa http://www.filedropper.com/bainbridgeexomeab http://www.filedropper.com/bainbridgeexomeac http://www.filedropper.com/bainbridgeexomead http://www.filedropper.com/bainbridgeexomeae You can cat the files back together in the same order (aa,ab,ac,ad,ae) to get the complete bam. The BAM file is already sorted and ready to go. If you want to have the index (.bai) already made, you can also download it here: http://www.filedropper.com/s1paireds...bfastsortedbam Here is the output from samtools "flagstat": 38641522 in total 0 QC failure 0 duplicates 37596789 mapped (97.30%) 38641522 paired in sequencing 19320761 read1 19320761 read2 38641522 properly paired (100.00%) 37188108 with itself and mate mapped 408681 singletons (1.06%) 115342 with mate mapped to a different chr 106661 with mate mapped to a different chr (mapQ>=5) Please let me know if you have any problems! And thanks for looking into the SRMA issue. |
![]() |
![]() |
![]() |
#15 |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]()
Just some extra notes for you. I just completed the realignment of the reads from that bam file I posted, using GATK's local realignment tool (38 hours on 1 cpu! they do not allow multiple threads with this module). So no corruption or other issues with the bam file.
In parallel , I ran SRMA with a single thread, and looks like it was progressing really well until I got an error. Looks like I should increase the MAX_HEAP_SIZE ? I'll try that and report back Would it be possible to split the processing by chromosome to simplify things? Like running chromosomes 1,2,3, etc, to 9 individually, then process in one thread each on the sets: chr 10+11, 12+13, 14+15, 16+17+18, 19+20+21+22 , X+Y . Would thread out nicely on a 16-cpu machine... [Fri Jun 25 16:33:11 CEST 2010] srma.SRMA INPUT=s_1_paired_sequence.bfast.sorted.bam OUTPUT=s_1_paired_sequence.bfast.srma2.bam REFERENCE=/net/chipdata/HumanGenome/hg18/hg18.fa OFFSET=20 MIN_MAPQ=0 MINIMUM_ALLELE_PROBABILITY=0.1 MINIMUM_ALLELE_COVERAGE 3 MAXIMUM_TOTAL_COVERAGE=100 CORRECT_BASES=false USE_SEQUENCE_QUALITIES=true QUIET_STDERR=false MAX_HEAP_SIZE=8192 MAX_QUEUE_SIZE=65536 NUM_THREADS=1 TMP_DIR=/tmp/leparc VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECOR S_IN_RAM=500000 Allele coverage cutoffs: coverage: 1 minimum allele coverage: 0 coverage: 2 minimum allele coverage: 0 coverage: 3 minimum allele coverage: 0 coverage: 4 minimum allele coverage: 1 coverage: 5 minimum allele coverage: 1 coverage: 6 minimum allele coverage: 1 coverage: 7 minimum allele coverage: 2 coverage: 8 minimum allele coverage: 2 coverage: 9 minimum allele coverage: 3 coverage: >9 minimum allele coverage: 3 Records processsed: 196556 (last chr1:4875295-4875369) Records processsed: 327629 (last chr1:11802305-11802379) Records processsed: 458668 (last chr1:16329237-16329311) Records processsed: 589729 (last chr1:19505170-19505244) ........... output trimmed ............. Records processsed: 8580872 (last chr3:129300240-129300314) Records processsed: 8717321 (last chr3:146329681-136329753) Records processsed: 8848435 (last chr3:158581531-158581605) Exception in thread "Thread-274" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2772) at java.util.Arrays.copyOf(Arrays.java:2746) at java.util.ArrayList.ensureCapacity(ArrayList.java:187) at java.util.ArrayList.add(ArrayList.java:395) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) at java.util.ArrayList$SubList.add(ArrayList.java:928) java.util.ConcurrentModificationException at java.util.ArrayList$SubList.checkForComodification(ArrayList.java:1091) at java.util.ArrayList$SubList.size(ArrayList.java:921) at srma.Graph.contains(Graph.java:195) at srma.Graph.addNode(Graph.java:143) at srma.Graph.addSAMRecord(Graph.java:109) at srma.SRMA$GraphThread.run(SRMA.java:691) Please report bugs to srma-help@lists.sourceforge.net |
![]() |
![]() |
![]() |
#16 | |
NGS specialist
Location: Malaysia Join Date: Apr 2008
Posts: 249
|
![]()
Splitting the data up is exactly what I did to simplify things. You can give srma a set of ranges using the "RANGES=ranges.txt" with
e.g. Quote:
So just make a "ranges.txt" for each chromosome and launch your separate threads ![]() |
|
![]() |
![]() |
![]() |
#17 | |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]() Quote:
Ahh ! Nice! Thanks for the tip. This makes life easier since I constantly have different target enrichment designs, so ranges will constantly change. Just having it at a chromosome level makes it more flexible. |
|
![]() |
![]() |
![]() |
#18 |
NGS specialist
Location: Malaysia Join Date: Apr 2008
Posts: 249
|
![]()
Just a one last note on that. Ensure that the ranges.txt is sorted by chromosomes and position.
|
![]() |
![]() |
![]() |
#19 |
Senior Member
Location: Asiana Join Date: Feb 2009
Posts: 124
|
![]()
Hi Zee,
I alomst tried teh same what u had said above. The first step in getting the intervals is ok. But I get the error during the second step.I am getting an error as below: Code:
$sh gatk.sh -T RealignerTargetCreator -I KLM.bam -R /Genome/hg18/hg18.fa -o output.intervals $ sh gatk.sh -T IndelRealigner -I KLM.bam -R /Genome/hg18/hg18.fa -targetIntervals output.intervals --output realigned_b.bam FATAL 13:53:37,074 CommandLineProgram - Exception caught by base Command Line Program. Stack trace is as follows: java.lang.RuntimeException: org.broadinstitute.sting.utils.StingException: Read G004_1:5:9:3288:12810 does not overlap the previous read in this interval; please ensure that you are using the same input bam that was used in the RealignerTargetCreator step at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:253) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:97) Caused by: org.broadinstitute.sting.utils.StingException: Read G004_1:5:9:3288:12810 does not overlap the previous read in this interval; please ensure that you are using the same input bam that was used in the RealignerTargetCreator step at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner$ReadBin.add(IndelRealigner.java:1286) at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:323) at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:55) at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:98) at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:73) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:175) at org.broadinstitute.sting.gatk.CommandLineExecutable.executeGATK(CommandLineExecutable.java:93) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:75) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:238) ... 1 more ------------------------------------------------------------------------------------------ The following error has occurred: org.broadinstitute.sting.utils.StingException: Read G004_1:5:9:3288:12810 does not overlap the previous read in this interval; please ensure that you are using the same input bam that was used in the RealignerTargetCreator step: Please check your command line arguments for any typos or inconsistencies. Please review our general documentation at http://www.broadinstitute.org/gsa/wiki or contact us via our support site at http://getsatisfaction.com/gsa to report bugs or get help resolving undocumented issues |
![]() |
![]() |
![]() |
#20 | |
NGS specialist
Location: Malaysia Join Date: Apr 2008
Posts: 249
|
![]()
I have not seen this error before
Quote:
Perhaps you could pull out the location of that read and compare it to the start and end of the interval in "output.intervals". It looks as though gatk does not like aligning areas where reads do not overlap but perhaps that is a question of algorithm design. I am using version 1.0.3471,. |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|