Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Local realignment using GATK and smra

    Hi,
    I am trying to use GATK to do base quality recalibration followed by local realignment to reduce the flase positive indels.

    I am finding problem in getting GATK to work as have few doubts and smra seems very easy in a sigle step to move ahead. Anyone has done the comparison before.

    I am struck with GATK where in the first step of realignment step,
    Code:
    java -jar /path/to/GenomeAnalysisTK.jar \
      -T RealignerTargetCreator \
      -I /path/to/input.bam \
      -R /path/to/reference.fasta \
      [-L intervals] \
      -o /path/to/output.intervals \
      [-B snps,VCF,/path/to/SNP_calls.vcf] \
      [-B indels,VCF,/path/to/indel_calls.vcf] \
      [-B dbsnp,dbsnp,/path/to/dbsnp.rod]
    for the option L, what is the format of the file. My region of interest would be around 19000 regions from hg18 (all chromosomes). Regardsing -B option, I have only hg18 rod file downloaded from GATK site. How to proceed these steps.

    I would appreciate very much if someone could through light on these queries. Thanks.

  • #2
    I have used GATK successfully, but it looks like they've updated it since the last version I've been using (they separated mismatches from indels and merged them - now it looks like they do it all in one). I will try this new version you've shown.

    SRMA I just tried once, but it crashed. But I really need to give it another fair chance since I really didn't look into the problem.

    I was doing recalibration followed by local realignment, but I think perhaps it is better to do local realignment first? Seems to be the order recommended by GATK.

    Anyone else have any thoughts on this?

    Comment


    • #3
      Thx for your reply.

      Are you referring to these steps under samtools protocol.

      1.Collect regions around potential indels and clusters of mismatches:
      2.Merge intervals
      3.Realign reads overlapping specified regions:
      4.Write the cleaned alignment file:

      GATK manual suggests you to to basequality recalibration followed by local realignment. Its bit confusing in the steps involved.

      Comment


      • #4
        whole testing smra also crashed

        Comment


        • #5
          Originally posted by seq_GA View Post
          Thx for your reply.

          Are you referring to these steps under samtools protocol.

          1.Collect regions around potential indels and clusters of mismatches:
          2.Merge intervals
          3.Realign reads overlapping specified regions:
          4.Write the cleaned alignment file:

          GATK manual suggests you to to basequality recalibration followed by local realignment. Its bit confusing in the steps involved.
          Yes, exactly! I initially started from samtools protocols , which implies that you do:

          1) recalibration with GATK
          2) local (MSA) realignment with GATK


          But if you look at the GATK powerpoint from Mark DePristo (great set of slides btw!) GATK ppt then you see in the flow chart
          on slide #20 show that MSA realignment comes before recalibration. So I am going back to correct my previous runs.

          The realignment before recalibration actually makes more sense - since you will fix the reads that had problem alignments, and then recalibrate based on these better alignments!

          I don't know how big or small an effect the order of these two steps has, but I think now this is the correct way.

          I will try the new version of GATK and get back to you.

          Comment


          • #6
            I have run gatk successfully, using version 1.0.3471

            This is how my system is configured:

            I have a script simply named "gatk" which contains

            #!/bin/sh

            SRC="/somepath/GenomeAnalysisTK-1.0.3471"
            MEMORY="4096M"

            java -Xmx$MEMORY -jar $SRC/GenomeAnalysisTK.jar $*
            Then I just run it in the following way, assuming you got "genome.fasta" and "input.bam". Note that gatk assumes that the sequence dictionary for the genome exists as "genome.dict". I therefore add the Picard step to do the dictionary construction.


            #create sequence dictionary with Picard

            java -jar /opt/picard/dist/CreateSequenceDictionary.jar R=genome.fasta O=genome.dict

            #find realigner targets
            gatk -T RealignerTargetCreator -I input.bam -R genome.fasta -o output.intervals

            # run realigner
            gatk -T IndelRealigner -I input.bam -R genome.fasta -targetIntervals output.intervals --output realigned.bam
            And it worked fine for me.

            Comment


            • #7
              Originally posted by seq_GA View Post
              whole testing smra also crashed
              What version and could I get an example to debug? I would love this type of feedback.

              Originally posted by NGSfan View Post
              SRMA I just tried once, but it crashed. But I really need to give it another fair chance since I really didn't look into the problem.
              Agian, what version and could I get an example to debug? Without your feedback I cannot support and improve the tool. I am very responsive solving these types of problems

              Comment


              • #8
                Hi,

                I am using srma-0.1.6 version from http://sourceforge.net/projects/srma/files/
                I tried giving around 2G for MAX_HEAP_SIZE but still it went out of memory.

                2. As explained in the manual, as my region of interest is around 19000 regions from the whole exome, I constructed a tab delimited file and then sorted it as below (sample)
                RANGES=range_sorted.txt
                Code:
                chr10   100000807       100000927
                chr10   100001260       100001500
                chr10   100002097       100002217
                chr10   100003241       100003601
                chr10   100005284       100005519
                chr10   100006490       100006719
                chr10   100007353       100007593
                chr10   100007699       100007939
                chr10   100008745       100008985
                chr10   100009096       100009216
                I get the error as below.

                Code:
                Allele coverage cutoffs:
                coverage:  1    minimum allele coverage: 0
                coverage:  2    minimum allele coverage: 0
                coverage:  3    minimum allele coverage: 0
                coverage:  4    minimum allele coverage: 1
                coverage:  5    minimum allele coverage: 1
                coverage:  6    minimum allele coverage: 1
                coverage:  7    minimum allele coverage: 2
                coverage:  8    minimum allele coverage: 2
                coverage:  9    minimum allele coverage: 3
                coverage: >9    minimum allele coverage: 3
                java.lang.Exception: Ranges must be in sorted order in RANGES (line numbers 271-272
                        at srma.Ranges.addRange(Ranges.java:188)
                        at srma.Ranges.<init>(Ranges.java:36)
                        at srma.SRMA.doWork(SRMA.java:178)
                        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
                        at srma.SRMA.main(SRMA.java:94)
                Let me know whether I am missing something here. Thanks.

                Comment


                • #9
                  Originally posted by seq_GA View Post
                  Hi,

                  I am using srma-0.1.6 version from http://sourceforge.net/projects/srma/files/
                  I tried giving around 2G for MAX_HEAP_SIZE but still it went out of memory.
                  Did you use the "-Xmx2g" option to Java ("java -Xmx2g -jar srma-0.1.6.jar [options]")? The "MAX_HEAP_SIZE" is a # used internally by SRMA, and is not passed into java. It shouldn't crash if the internal heap size is too small, only some regions will be ignored.

                  Originally posted by seq_GA View Post
                  2. As explained in the manual, as my region of interest is around 19000 regions from the whole exome, I constructed a tab delimited file and then sorted it as below (sample)
                  RANGES=range_sorted.txt
                  Code:
                  chr10   100000807       100000927
                  chr10   100001260       100001500
                  chr10   100002097       100002217
                  chr10   100003241       100003601
                  chr10   100005284       100005519
                  chr10   100006490       100006719
                  chr10   100007353       100007593
                  chr10   100007699       100007939
                  chr10   100008745       100008985
                  chr10   100009096       100009216
                  I get the error as below.

                  Code:
                  Allele coverage cutoffs:
                  coverage:  1    minimum allele coverage: 0
                  coverage:  2    minimum allele coverage: 0
                  coverage:  3    minimum allele coverage: 0
                  coverage:  4    minimum allele coverage: 1
                  coverage:  5    minimum allele coverage: 1
                  coverage:  6    minimum allele coverage: 1
                  coverage:  7    minimum allele coverage: 2
                  coverage:  8    minimum allele coverage: 2
                  coverage:  9    minimum allele coverage: 3
                  coverage: >9    minimum allele coverage: 3
                  java.lang.Exception: Ranges must be in sorted order in RANGES (line numbers 271-272
                          at srma.Ranges.addRange(Ranges.java:188)
                          at srma.Ranges.<init>(Ranges.java:36)
                          at srma.SRMA.doWork(SRMA.java:178)
                          at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
                          at srma.SRMA.main(SRMA.java:94)
                  Let me know whether I am missing something here. Thanks.
                  The ranges in your file should be in sorted order. It is complaining that the out-of-order line is 271-272. Can you check that line?

                  Comment


                  • #10
                    Hmm.. the new version of GATK realigner now gives me errors, complaining about read IDs whereas the old version (from April 2010) still works.

                    Also, GATK complains the "rod' file that I got from GATK is not in the right format for the -B option.
                    Last edited by NGSfan; 06-25-2010, 06:19 AM.

                    Comment


                    • #11
                      Hi Nils,

                      I am testing SRMA with a sorted bam file from a BFAST pair-end alignment of a whole exome data set from Brainbridge et al (http://genomebiology.com/2010/11/6/R62 , and SRA012615).

                      Just curious, is there a requirement for the number of threads (must be a power of 2?) I tried 8, 16, and got "unsynchronized addition":

                      java -Xmx128g -jar /net/chipdata/NGS.analysis/bin/srma/srma-0.1.6.jar I=s_1_paired_sequence.bfast.sorted.bam O=s_1_paired_sequence.bfast.srma.bam R=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=16 >&srma_log
                      [Fri Jun 25 16:24:59 CEST 2010] srma.SRMA INPUT=s_1_paired_sequence.bfast.sorted.bam OUTPUT=s_1_paired_sequence.bfast.srma.bam REFERENCE=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=22 OFFSET=20 MIN_MAPQ=0 MINIMUM_ALLELE_PROBABILITY=0.1 MINIMUM_A
                      LELE_COVERAGE=3 MAXIMUM_TOTAL_COVERAGE=100 CORRECT_BASES=false USE_SEQUENCE_QUALITIES=true QUIET_STDERR=false MAX_HEAP_SIZE=8192 MAX_QUEUE_SIZE=65536 TMP_DIR=/tmp/leparc VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECOR
                      S_IN_RAM=500000
                      Allele coverage cutoffs:
                      coverage: 1 minimum allele coverage: 0
                      coverage: 2 minimum allele coverage: 0
                      coverage: 3 minimum allele coverage: 0
                      coverage: 4 minimum allele coverage: 1
                      coverage: 5 minimum allele coverage: 1
                      coverage: 6 minimum allele coverage: 1
                      coverage: 7 minimum allele coverage: 2
                      coverage: 8 minimum allele coverage: 2
                      coverage: 9 minimum allele coverage: 3
                      coverage: >9 minimum allele coverage: 3
                      ** Warning: option NUM_THREADS currently may not increase performance significantly. **
                      ** Try running multiple processes with RANGES if the speed does not increase. **
                      java.lang.Exception: Unsynchronized addition
                      at srma.Graph.addSAMRecord(Graph.java:61)
                      at srma.SRMA$GraphThread.run(SRMA.java:691)
                      Please report bugs to [email protected]
                      java.lang.Exception: Unsynchronized addition
                      at srma.Graph.addSAMRecord(Graph.java:61)
                      at srma.SRMA$GraphThread.run(SRMA.java:691)
                      Please report bugs to [email protected]
                      java.lang.ArrayIndexOutOfBoundsException: -12
                      at java.util.ArrayList.elementData(ArrayList.java:338)
                      at java.util.ArrayList.get(ArrayList.java:351)
                      at srma.Graph.addNode(Graph.java:154)
                      at srma.Graph.addSAMRecord(Graph.java:109)
                      at srma.SRMA$GraphThread.run(SRMA.java:691)
                      Please report bugs to [email protected]
                      java.lang.Exception: Unsynchronized addition
                      at srma.Graph.addSAMRecord(Graph.java:61)
                      at srma.SRMA$GraphThread.run(SRMA.java:691)
                      Please report bugs to [email protected]
                      java.lang.Exception: Unsynchronized addition
                      at srma.Graph.addSAMRecord(Graph.java:61)
                      at srma.SRMA$GraphThread.run(SRMA.java:691)
                      Please report bugs to [email protected]
                      Last edited by NGSfan; 06-25-2010, 06:33 AM.

                      Comment


                      • #12
                        Originally posted by NGSfan View Post
                        Hi Nils,

                        I am testing SRMA with a sorted bam file from a BFAST pair-end alignment of a whole exome data set from Brainbridge et al (http://genomebiology.com/2010/11/6/R62 , and SRA012615).

                        Just curious, is there a requirement for the number of threads (must be a power of 2?) I tried 8, 16, and got "unsynchronized addition":

                        java -Xmx128g -jar /net/chipdata/NGS.analysis/bin/srma/srma-0.1.6.jar I=s_1_paired_sequence.bfast.sorted.bam O=s_1_paired_sequence.bfast.srma.bam R=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=16 >&srma_log
                        [Fri Jun 25 16:24:59 CEST 2010] srma.SRMA INPUT=s_1_paired_sequence.bfast.sorted.bam OUTPUT=s_1_paired_sequence.bfast.srma.bam REFERENCE=/net/chipdata/HumanGenome/hg18/hg18.fa NUM_THREADS=22 OFFSET=20 MIN_MAPQ=0 MINIMUM_ALLELE_PROBABILITY=0.1 MINIMUM_A
                        LELE_COVERAGE=3 MAXIMUM_TOTAL_COVERAGE=100 CORRECT_BASES=false USE_SEQUENCE_QUALITIES=true QUIET_STDERR=false MAX_HEAP_SIZE=8192 MAX_QUEUE_SIZE=65536 TMP_DIR=/tmp/leparc VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECOR
                        S_IN_RAM=500000
                        Allele coverage cutoffs:
                        coverage: 1 minimum allele coverage: 0
                        coverage: 2 minimum allele coverage: 0
                        coverage: 3 minimum allele coverage: 0
                        coverage: 4 minimum allele coverage: 1
                        coverage: 5 minimum allele coverage: 1
                        coverage: 6 minimum allele coverage: 1
                        coverage: 7 minimum allele coverage: 2
                        coverage: 8 minimum allele coverage: 2
                        coverage: 9 minimum allele coverage: 3
                        coverage: >9 minimum allele coverage: 3
                        ** Warning: option NUM_THREADS currently may not increase performance significantly. **
                        ** Try running multiple processes with RANGES if the speed does not increase. **
                        java.lang.Exception: Unsynchronized addition
                        at srma.Graph.addSAMRecord(Graph.java:61)
                        at srma.SRMA$GraphThread.run(SRMA.java:691)
                        Please report bugs to [email protected]
                        java.lang.Exception: Unsynchronized addition
                        at srma.Graph.addSAMRecord(Graph.java:61)
                        at srma.SRMA$GraphThread.run(SRMA.java:691)
                        Please report bugs to [email protected]
                        java.lang.ArrayIndexOutOfBoundsException: -12
                        at java.util.ArrayList.elementData(ArrayList.java:338)
                        at java.util.ArrayList.get(ArrayList.java:351)
                        at srma.Graph.addNode(Graph.java:154)
                        at srma.Graph.addSAMRecord(Graph.java:109)
                        at srma.SRMA$GraphThread.run(SRMA.java:691)
                        Please report bugs to [email protected]
                        java.lang.Exception: Unsynchronized addition
                        at srma.Graph.addSAMRecord(Graph.java:61)
                        at srma.SRMA$GraphThread.run(SRMA.java:691)
                        Please report bugs to [email protected]
                        java.lang.Exception: Unsynchronized addition
                        at srma.Graph.addSAMRecord(Graph.java:61)
                        at srma.SRMA$GraphThread.run(SRMA.java:691)
                        Please report bugs to [email protected]
                        Don't use threading, it isn't helping running time right now on most systems (not sure why, not a Java expert). I do wan the BAM file so I can debug the synchronization issue (crashing srma will make it stronger). Can you post the BAM file somewhere so I can take a look? For now, try without threads.

                        Comment


                        • #13
                          I will try to find a place to host the BAM file for you. It's about 2.6GB after sorting.

                          Comment


                          • #14
                            access to Bainbridge whole exome bam file

                            Hi Nils,

                            Due to some local issues (proxies/security) I had to chop the bam file up and place it on a file sharing service:

                            There are five files:







                            You can cat the files back together in the same order (aa,ab,ac,ad,ae) to get the complete bam.

                            The BAM file is already sorted and ready to go. If you want to have the index (.bai) already made, you can also download it here:




                            Here is the output from samtools "flagstat":

                            38641522 in total
                            0 QC failure
                            0 duplicates
                            37596789 mapped (97.30%)
                            38641522 paired in sequencing
                            19320761 read1
                            19320761 read2
                            38641522 properly paired (100.00%)
                            37188108 with itself and mate mapped
                            408681 singletons (1.06%)
                            115342 with mate mapped to a different chr
                            106661 with mate mapped to a different chr (mapQ>=5)


                            Please let me know if you have any problems! And thanks for looking into the SRMA issue.

                            Comment


                            • #15
                              Just some extra notes for you. I just completed the realignment of the reads from that bam file I posted, using GATK's local realignment tool (38 hours on 1 cpu! they do not allow multiple threads with this module). So no corruption or other issues with the bam file.

                              In parallel , I ran SRMA with a single thread, and looks like it was progressing really well until I got an error. Looks like I should increase the MAX_HEAP_SIZE ? I'll try that and report back

                              Would it be possible to split the processing by chromosome to simplify things?

                              Like running chromosomes 1,2,3, etc, to 9 individually, then process in one thread each on the sets: chr 10+11, 12+13, 14+15, 16+17+18, 19+20+21+22 , X+Y . Would thread out nicely on a 16-cpu machine...

                              [Fri Jun 25 16:33:11 CEST 2010] srma.SRMA INPUT=s_1_paired_sequence.bfast.sorted.bam OUTPUT=s_1_paired_sequence.bfast.srma2.bam REFERENCE=/net/chipdata/HumanGenome/hg18/hg18.fa OFFSET=20 MIN_MAPQ=0 MINIMUM_ALLELE_PROBABILITY=0.1 MINIMUM_ALLELE_COVERAGE
                              3 MAXIMUM_TOTAL_COVERAGE=100 CORRECT_BASES=false USE_SEQUENCE_QUALITIES=true QUIET_STDERR=false MAX_HEAP_SIZE=8192 MAX_QUEUE_SIZE=65536 NUM_THREADS=1 TMP_DIR=/tmp/leparc VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECOR
                              S_IN_RAM=500000
                              Allele coverage cutoffs:
                              coverage: 1 minimum allele coverage: 0
                              coverage: 2 minimum allele coverage: 0
                              coverage: 3 minimum allele coverage: 0
                              coverage: 4 minimum allele coverage: 1
                              coverage: 5 minimum allele coverage: 1
                              coverage: 6 minimum allele coverage: 1
                              coverage: 7 minimum allele coverage: 2
                              coverage: 8 minimum allele coverage: 2
                              coverage: 9 minimum allele coverage: 3
                              coverage: >9 minimum allele coverage: 3
                              Records processsed: 196556 (last chr1:4875295-4875369)
                              Records processsed: 327629 (last chr1:11802305-11802379)
                              Records processsed: 458668 (last chr1:16329237-16329311)
                              Records processsed: 589729 (last chr1:19505170-19505244)
                              ........... output trimmed .............
                              Records processsed: 8580872 (last chr3:129300240-129300314)
                              Records processsed: 8717321 (last chr3:146329681-136329753)
                              Records processsed: 8848435 (last chr3:158581531-158581605)
                              Exception in thread "Thread-274" java.lang.OutOfMemoryError: Java heap space
                              at java.util.Arrays.copyOf(Arrays.java:2772)
                              at java.util.Arrays.copyOf(Arrays.java:2746)
                              at java.util.ArrayList.ensureCapacity(ArrayList.java:187)
                              at java.util.ArrayList.add(ArrayList.java:395)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              at java.util.ArrayList$SubList.add(ArrayList.java:928)
                              java.util.ConcurrentModificationException
                              at java.util.ArrayList$SubList.checkForComodification(ArrayList.java:1091)
                              at java.util.ArrayList$SubList.size(ArrayList.java:921)
                              at srma.Graph.contains(Graph.java:195)
                              at srma.Graph.addNode(Graph.java:143)
                              at srma.Graph.addSAMRecord(Graph.java:109)
                              at srma.SRMA$GraphThread.run(SRMA.java:691)
                              Please report bugs to [email protected]

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin


                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                                Yesterday, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              55 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              45 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              55 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X