  • Any way to speed up GATK RealignerTargetCreator ?

    Hi

    I am using GATK to realign around known indels, and am finding it rather slow going.
    On the PC I am currently using (i5-750, 16 GB RAM), this stage takes approx. 40 min per sample (approx. 0.13 gigabases of reads per paired-end sample).

    Briefly my pipeline looks like this:
    demultiplex at FASTQ stage, then carry out the following on each demultiplexed sample individually:

    1) align (bwa)
    2) convert to sam (samtools)
    3) dup mark & coverage metrics (picard)
    4) Find areas to realign based on known indels vcf file from GATK (GATK RealignerTargetCreator) <- this step is long
    5) Realign around indels (GATK IndelRealigner)
    6) call variants (GATK UnifiedGenotyper)
    7) Apply variant filtration annotation to vcf (GATK VariantFiltration)
    8) Annotate variants
    snpEff
    Annovar <-takes a long time at dbSNP annotation step

    I have a small targeted resequencing assay with approx. 1000 targets, and I was wondering whether RealignerTargetCreator is so slow because it has to search through a whole genome's worth of indels in the VCF supplied by the Broad with GATK?

    If I filtered it just for the indels in my target region would that cause a problem?

    Would indexing the vcf do anything?

    Any other suggestion would be appreciated

    Thanks,

    Chris

  • #2
    okay, a bit of investigation..
    I'm running it like this:

    java -d64 -Xmx4g -jar ${GATK_path}/GenomeAnalysisTK.jar \
    -I $current_directory/${output_fileName}.sorted.bam \
    -R ${ref_genome_fasta} \
    -T RealignerTargetCreator \
    -o $current_directory/input.bam.list \
    --known ${Thousand_genomes_indels_VCFfile}

    I now realise that RealignerTargetCreator need only be run once, on one demultiplexed sample, and the result applied back to them all.
    However, the output regions list is not the same if I run it on different samples. A quick check shows that the number of rows in the output file differs (but not by much). Can I safely ignore the differences, adopt one targets file for realignment, and apply it to all sample BAM files individually?

    I could not find an option to direct RealignerTargetCreator to specific regions to start with, which would be helpful.

    Chris


    • #3
      Alright, two things:

      1. You can run the realigner target creator with or without specifying candidate indels AND limiting it to only look in your targeted regions of interest. This will make it MUCH faster if you have a small target size. Look into using the "-L" option with the GATK.

      2. If you update to the latest version of ANNOVAR, it is much faster than previous versions. However, if you have many samples, I suggest you wait until you have VCF files for all of them, then convert all of the VCF files to the annovar input files, and then attach a column to the end of each of these input files with a sample specific identifier. Then, concatenate all of these together and run the whole set through ANNOVAR. After this, you can then grep out each identifier to get individual files if you would like.

      Let me know if any of this is not clear.
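To make point 1 concrete, here is a minimal sketch of what the restricted call might look like. It only prints the command (a dry run) so it can be checked before use. The `rtc_cmd` helper and every file name (`targets.bed`, `known_indels.vcf`, `ref_genome.fasta`, `sample1.sorted.bam`) are placeholders assumed for illustration; the GATK arguments themselves are the ones from the command in #2 plus `-L`.

```shell
# Dry-run sketch: print a RealignerTargetCreator command restricted to target regions.
# All paths are placeholders (assumptions), not values taken from this thread.
GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"
REF="${REF:-ref_genome.fasta}"
KNOWN="${KNOWN:-known_indels.vcf}"
TARGETS="${TARGETS:-targets.bed}"   # hypothetical BED file listing the ~1000 targets

# rtc_cmd BAM OUT: build the command line for one sample.
#   $1 = sorted BAM to scan, $2 = intervals file to write
rtc_cmd() {
    printf 'java -Xmx4g -jar %s -T RealignerTargetCreator -R %s -I %s -L %s --known %s -o %s\n' \
        "$GATK_JAR" "$REF" "$1" "$TARGETS" "$KNOWN" "$2"
}

# Print the command for inspection before running it by hand.
rtc_cmd sample1.sorted.bam sample1.intervals
```

With a small panel, the walker should then only visit the listed intervals rather than the whole genome. Whether the targets need padding is a judgment call that depends on the assay.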


      • #4
        That's really helpful. It makes sense that RealignerTargetCreator can accept a regions file, as the other GATK tools seem to. I'll look into it.

        On the Annovar front, the version I'm using is fairly recent (from the last 8 weeks or so).
        The stage that seems to take forever is annotation with its dbSNP file.
        I was wondering: if I extracted only the dbSNP variants falling in my target regions, could I avoid presenting the entire dbSNP file to Annovar?


        • #5
          As long as you have the version from February 23 or later it should be fast: http://www.openbioinformatics.org/annovar/

          Did you make sure to download the newest index files? It should have a file size of 103885428 (for dbSNP 135).

          I don't think you can get away with just extracting the variants in your target regions because then the index file will be messed up.

          If it's a real issue, I would go with my idea of concatenating all samples' data together with sample-specific tags after they have all been analyzed individually, and then running the combined set through ANNOVAR. If you need help setting that up, let me know.
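A rough sketch of that tag, concatenate and split idea follows. It assumes each sample already has a tab-delimited ANNOVAR input file named `<sample>.avinput` (a naming convention assumed here), and that the appended tag is still the last column in the ANNOVAR output, which is worth verifying on a small test file first. The helper names `tag_and_merge` and `split_by_sample` are hypothetical, and the split uses an exact match on the last column with awk rather than a plain grep so that sample names which are substrings of each other do not collide.

```shell
# tag_and_merge OUT S1.avinput S2.avinput ...
# Appends the sample name (file name minus .avinput) as a final tab-separated
# column to every line, and concatenates all inputs into OUT.
tag_and_merge() {
    out="$1"; shift
    : > "$out"                      # start with an empty output file
    for f in "$@"; do
        sample="$(basename "$f" .avinput)"
        awk -v s="$sample" 'BEGIN { OFS = "\t" } { print $0, s }' "$f" >> "$out"
    done
}

# split_by_sample FILE SAMPLE
# Prints only the lines whose last tab-separated column equals SAMPLE.
split_by_sample() {
    awk -F '\t' -v s="$2" '$NF == s' "$1"
}

# Example (file names are placeholders):
#   tag_and_merge all_samples.avinput sampleA.avinput sampleB.avinput
#   ... run ANNOVAR once on all_samples.avinput ...
#   split_by_sample all_samples.annotated sampleA > sampleA.annotated
```

The point of the design is that the expensive database scan happens once for the whole set rather than once per sample.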


          • #6
            RE: ANNOVAR dbSNP annotation taking a long time.

            Try using "--batchsize 50m" in your annotate_variation.pl command.

            Dropped my run time on dbSNP135 from 20 hours to 20 minutes on 7.5M variants.


            • #7
              I am running Annovar from the summarize_annovar.pl script.

              My understanding of what you are suggesting is to edit this file at the step which annotates with dbsnp as follows:

              if ($valistep{7}) {
                  # Step 7: dbSNP filter annotation, with -batchsize added as suggested above
                  $sc = "annotate_variation.pl -filter -batchsize 50m -dbtype snp$verdbsnp -buildver $buildver -outfile $outfile $queryfile $dbloc";
                  print STDERR "\nNOTICE: Running step 7 with system command <$sc>\n";
                  system ($sc) and die "Error running system command: <$sc>\n";
              }

              However, this does not seem to change the run time.

              Am I doing it correctly?


              • #8
                If you are just annotating the VCF, you could maybe split the VCF by chromosome and run the jobs in parallel, if you have the facility to do that.
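One possible way to do the split, as a sketch only: the `split_vcf_by_chrom` helper below is hypothetical and uses plain awk rather than any VCF-specific tool. It assumes an uncompressed VCF and a modest number of chromosomes, since some awk implementations limit how many output files can be open at once.

```shell
# split_vcf_by_chrom IN.vcf OUTDIR
# Writes OUTDIR/<chrom>.vcf for every chromosome seen in column 1,
# each file starting with a copy of the full VCF header.
split_vcf_by_chrom() {
    vcf="$1"; outdir="$2"
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^#/ { header = header $0 "\n"; next }   # collect header lines
        {
            file = dir "/" $1 ".vcf"
            if (!(file in seen)) {               # first record for this chromosome
                printf "%s", header > file
                seen[file] = 1
            }
            print > file                         # same stream stays open, so this appends
        }
    ' "$vcf"
}

# Example (paths and the annotation script are placeholders):
#   split_vcf_by_chrom sample.vcf by_chrom
#   for f in by_chrom/*.vcf; do ./annotate_one.sh "$f" & done; wait
```

On a four-core desktop like the one described in the first post, the gain is bounded by the number of cores and by disk throughput, so it is worth timing on one sample first.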


                • #9
                  GATK RealignerTargetCreator error

                  Hello! I am working with some Illumina FASTQ files. I used bwa to align the data. I then sorted, indexed and marked the bam file for PCR duplicates using picard. And then in order to realign around the indels, I used the RealignerTargetCreator. Here is the error:

                  INFO 11:12:43,040 HelpFormatter - --------------------------------------------------------------------------------
                  INFO 11:12:43,043 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
                  INFO 11:12:43,043 HelpFormatter - Copyright (c) 2010 The Broad Institute
                  INFO 11:12:43,043 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
                  INFO 11:12:43,048 HelpFormatter - Program Args: -T RealignerTargetCreator -R NC_010473.fasta -I dedup_reads.bam -o target_intervals.list
                  INFO 11:12:43,049 HelpFormatter - Date/Time: 2013/12/26 11:12:43
                  INFO 11:12:43,049 HelpFormatter - --------------------------------------------------------------------------------
                  INFO 11:12:43,049 HelpFormatter - --------------------------------------------------------------------------------
                  INFO 11:12:43,708 GenomeAnalysisEngine - Strictness is SILENT
                  INFO 11:12:44,116 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
                  INFO 11:12:44,127 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
                  INFO 11:12:44,145 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
                  INFO 11:12:44,237 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
                  INFO 11:12:44,361 GenomeAnalysisEngine - Done preparing for traversal
                  INFO 11:12:44,361 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
                  INFO 11:12:44,361 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
                  INFO 11:12:48,747 GATKRunReport - Uploaded run statistics report to AWS S3
                  ##### ERROR ------------------------------------------------------------------------------------------
                  ##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
                  ##### ERROR
                  ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
                  ##### ERROR The error message below tells you what is the problem.
                  ##### ERROR
                  ##### ERROR If the problem is an invalid argument, please check the online documentation guide
                  ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
                  ##### ERROR
                  ##### ERROR Visit our website and forum for extensive documentation and answers to
                  ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
                  ##### ERROR
                  ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
                  ##### ERROR
                  ##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/home/lmbin/dedup_reads.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 63; please see the GATK --help documentation for options related to this error
                  ##### ERROR ------------------------------------------------------------------------------------------
                  I'm a beginner and I don't know how to solve this problem so that I can finish the realignment. Can you give me some advice?


                  • #10
                    Originally posted by binlangman
                    Can you post the full command that gave you this error?


                    • #11
                      You are probably aligning reads from Illumina 1.3 or 1.5 (phred+64).

                      GATK expects Illumina 1.8 encoding (phred+33).

                      The easiest workaround is to convert base qualities, take a look at:



                      There are also command-line examples for converting them.
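As an illustration of what such a conversion does (this is not the example from the link above), the sketch below shifts every quality character down by 31, the difference between the two offsets. The `phred64_to_33` helper is hypothetical. It assumes plain four-line FASTQ records and that the input really is phred+64, and it stops if it meets a quality character below that range. GATK 2.x also lists a `--fix_misencoded_quality_scores` engine option in its `--help` output, which may be worth checking for your version before rewriting the FASTQ and realigning.

```shell
# phred64_to_33 < in.fastq > out.fastq
# Rewrites the quality line (every 4th line) from phred+64 to phred+33.
phred64_to_33() {
    awk '
        BEGIN {
            # Lookup table: phred+64 character -> phred+33 character.
            for (i = 64; i <= 126; i++)
                map[sprintf("%c", i)] = sprintf("%c", i - 31)
        }
        NR % 4 == 0 {
            out = ""
            n = length($0)
            for (j = 1; j <= n; j++) {
                c = substr($0, j, 1)
                if (!(c in map)) {   # character below 64: input is not phred+64
                    print "unexpected quality character at line " NR > "/dev/stderr"
                    exit 1
                }
                out = out map[c]
            }
            print out
            next
        }
        { print }                    # header, sequence and "+" lines pass through
    '
}

# Example (file names are placeholders):
#   phred64_to_33 < reads_phred64.fastq > reads_phred33.fastq
```

Converting a file that is already phred+33 would corrupt it, which is why the sketch refuses input containing characters below the phred+64 range instead of guessing.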

                      cheers
