Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi, I have a doubt in GATK

    Hi,

    I'm having troubles with my comands and probably I only notice them while using GATK

    Let me show you what I'm doing:

    bwa aln -t 8 /GenoStorage/Genomas/hg19/hg19RefGenome.fasta sample_1.fastq > Sample_1_align.sai

    bwa aln -t 8 /GenoStorage/Genomas/hg19/hg19RefGenome.fasta Sample_2.fastq > Sample_2_align.sai

    bwa sampe /GenoStorage/Genomas/hg19/hg19RefGenome.fasta Sample_1_align.sai Sample_2_align.sai Sample_1.fastq Sample_2.fastq > Sample_align.sam



    # Converto to bam and sort (using samtools):

    samtools import /GenoStorage/Genomas/hg19/hg19RefGenome.fasta.fai Sample_align.sam Sample_align.bam


    samtools sort Sample_align.bam Sample_align_sorted.bam



    # From a sorted BAM alignment, raw SNP and indel calls are acquired by (see #more in http://samtools.sourceforge.net/cns0.shtml):

    samtools mpileup -uf /GenoStorage/Genomas/hg19/hg19RefGenome.fasta.fai sample02187A_align_sorted.bam.bam | bcftools view -bvcg - > var.raw.bcf

    bcftools view var.raw.bcf > vcf_sample

    vcfutils.pl varFilter -D100 > vcf_sample_filtered.flt.vcf

    java -Xmx4g -jar /GenoStorage/Software/GATK/GenomeAnalysisTK.jar -T TableRecalibration -R /GenoStorage/Genomas/hg19/hg19RefGenome.fasta -I sample_align_sorted.bam -recalFile SM.recal_data_sample.csv -l INFO -log SM.newQualSample.log -o SM.recalSample.bam

    Am I doing anything wrong?

    Thanks for your help

  • #2
    Would you mind posting the error message?

    What I notice: you can't do the table recalibration without generatin the recal_data.csv file through the CountCovariate utility from GATK..

    Comment


    • #3
      Humm, probably that's the problem.

      The error message is this

      ##### ERROR ------------------------------------------------------------------------------------------
      ##### ERROR A USER ERROR has occurred (version 1.3-21-gcb284ee):
      ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
      ##### ERROR Please do not post this error to the GATK forum
      ##### ERROR
      ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
      ##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
      ##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
      ##### ERROR
      ##### ERROR MESSAGE: SAM/BAM file sample02187A_align_sorted.bam.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups
      ##### ERROR ------------------------------------------------------------------------------------------

      Comment


      • #4
        Ahh. My favourite GATK error:
        In order to perform analyses correctly with the GATK you need to specify a read group tag when aligning your sequences with BWA. Just add the following argument within the bwa sampe step
        Code:
        -r "@RG\tID:sample\tLB:sample\tPL:ILLUMINA\tSM:sample"
        RG defines a ReadGroup, ID: specifies the name of the ReadGroup, LB defines the name of the library sequenced and SM defines the individual sample, PL defines the platform you used. You can leave all of them as sample (or anything else) in case you really got a single sample you wish to analyze.

        You might wonder what that could be good for: These tags are used when merging BAM files from different samples to distinguish between the different samples.

        However, although the error message states the thing I explained above, you should carry out the CountCovariate step before the TableRecalibration step, otherwise you have no recal_file (you specified it as SM.recal_data_sample.csv)

        Hope that helps...

        Comment


        • #5
          Thanks for your help.
          I made the change that you suggested in the sampe step.
          It work all the way until I try to CoutCovariates, he said that :

          ##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads.
          ##### ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
          ##### ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
          ##### ERROR You can use the ReorderSam utility to fix this problem: http://www.broadinstitute.org/gsa/wi...php/ReorderSam
          ##### ERROR reads contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr20, chr21, chr22, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
          ##### ERROR ------------------------------------------------------------------------------------------
          But I already ReOrderedSam

          java -jar /GenoStorage/Software/picard-tools/ReorderSam.jar I=Sample_align_sorted.bam O=Sample.karyotypic.bam REFERENCE=/GenoStorage/Genomas/hg19/hg19RefGenome.fasta

          Comment


          • #6
            Ok, that's something else. GATK wants to have the chromosomes ordered this way: chr1, chr2, chr3, ..., chrX, chrY, chrM. It seems your reference fasta file contains the chromosomes in a lexikographical ordering chr10 directly after chr1. When you reorder your SAM file ot orders the reads according to the order in your reference fasta file. You could download the single-chromosome fasta files from UCSC (http://hgdownload.cse.ucsc.edu/golde...9/chromosomes/) and order them using cat like that:
            Code:
            cat chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM > hg19.fa
            It might be working to just do the ReorderSam program again with the newly sorted reference fasta file, otherwise you'd need to repeat alignment. In case you're planning a pipeline you might want to have the reference file in order to save the ReorderSam step...

            Hope that helps,
            Peter

            Comment


            • #7
              I made your suggestion with the chromosomes.
              Now I'm doing the ReorderSam

              java -jar /GenoStorage/Software/picard-tools/ReorderSam.jar I=sample02187A_align.bam O=sample02187A.kayrotypic.bam REFERENCE=/GenoStorage/Genomas/hg19/hg19RefGenome.fa
              [Wed Feb 01 11:04:17 WET 2012] net.sf.picard.sam.ReorderSam INPUT=sample02187A_align.bam OUTPUT=sample02187A.kayrotypic.bam REFERENCE=/GenoStorage/Genomas/hg19/hg19RefGenome.fa ALLOW_INCOMPLETE_DICT_CONCORDANCE=false ALLOW_CONTIG_LENGTH_DISCORDANCE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
              [Wed Feb 01 11:04:17 WET 2012] Executing as meusebio@IMMGene1 on Linux 2.6.32-71.29.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.6.0_22-b22
              [Wed Feb 01 11:04:17 WET 2012] net.sf.picard.sam.ReorderSam done. Elapsed time: 0.00 minutes.
              Runtime.totalMemory()=1518993408
              Exception in thread "main" net.sf.picard.PicardException: Mismatch between sequence dictionary fasta index for /GenoStorage/Genomas/hg19/hg19RefGenome.fa, sequence 'chr10' != 'chr1'.
              at net.sf.picard.reference.IndexedFastaSequenceFile.sanityCheckDictionaryAgainstIndex(IndexedFastaSequenceFile.java:143)
              at net.sf.picard.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:86)
              at net.sf.picard.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:95)
              at net.sf.picard.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:75)
              at net.sf.picard.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:58)
              at net.sf.picard.sam.ReorderSam.doWork(ReorderSam.java:87)
              at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
              at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:118)
              at net.sf.picard.sam.ReorderSam.main(ReorderSam.java:77)

              I already tried to do a dictionary like they say, but it doesn't work also

              java -jar /GenoStorage/Software/picard-tools/CreateSequenceDictionary.jar REFERENCE=/GenoStorage/Genomas/hg19/hg19RefGenome.fa O=dictionary.bam
              [Wed Feb 01 11:06:04 WET 2012] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/GenoStorage/Genomas/hg19/hg19RefGenome.fa OUTPUT=dictionary.bam TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
              [Wed Feb 01 11:06:04 WET 2012] Executing as meusebio@IMMGene1 on Linux 2.6.32-71.29.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.6.0_22-b22
              [Wed Feb 01 11:06:04 WET 2012] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
              Runtime.totalMemory()=1518993408
              Exception in thread "main" net.sf.picard.PicardException: Mismatch between sequence dictionary fasta index for /GenoStorage/Genomas/hg19/hg19RefGenome.fa, sequence 'chr10' != 'chr1'.
              at net.sf.picard.reference.IndexedFastaSequenceFile.sanityCheckDictionaryAgainstIndex(IndexedFastaSequenceFile.java:143)
              at net.sf.picard.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:86)
              at net.sf.picard.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:95)
              at net.sf.picard.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:75)
              at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:128)
              at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:113)
              at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
              at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)

              Comment


              • #8
                ok. see this line:
                Code:
                Exception in thread "main" net.sf.picard.PicardException: Mismatch between sequence dictionary fasta index for /GenoStorage/Genomas/hg19/hg19RefGenome.fa, sequence 'chr10' != 'chr1'.
                He states you need to index your newly created reference fasta file, as the old index you've created (based on the wrong order) is not comaptible with the new order.

                So index your fasta file ( I think I used tabix for that but I'm not sure). However the GATK tells you how to prepare input reference files here:

                Comment


                • #9
                  I think I'm really bad at this

                  [meusebio@IMMGene1 meusebio]$ samtools faidx /GenoStorage/Genomas/hg19/hg19RefGenome.fa
                  [meusebio@IMMGene1 meusebio]$ java -jar /GenoStorage/Software/picard-tools/CreateSequenceDictionary.jar REFERENCE=/GenoStorage/Genomas/hg19/hg19RefGenome.fa O=hg19RefGenome.dict
                  [Wed Feb 01 11:34:44 WET 2012] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/GenoStorage/Genomas/hg19/hg19RefGenome.fa OUTPUT=hg19RefGenome.dict TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
                  [Wed Feb 01 11:34:44 WET 2012] Executing as meusebio@IMMGene1 on Linux 2.6.32-71.29.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.6.0_22-b22
                  [Wed Feb 01 11:34:44 WET 2012] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
                  Runtime.totalMemory()=1518993408
                  Exception in thread "main" net.sf.picard.PicardException: Mismatch between sequence dictionary fasta index for /GenoStorage/Genomas/hg19/hg19RefGenome.fa, sequence 'chr10' != 'chr1'.
                  at net.sf.picard.reference.IndexedFastaSequenceFile.sanityCheckDictionaryAgainstIndex(IndexedFastaSequenceFile.java:143)
                  at net.sf.picard.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:86)
                  at net.sf.picard.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:95)
                  at net.sf.picard.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:75)
                  at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:128)
                  at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:113)
                  at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
                  at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)

                  Comment


                  • #10
                    that's strange...
                    in case you don't want to give it another try you may download the GATK ressource
                    input files here:



                    I have no clue why the procedure you did, did not work...

                    Comment


                    • #11
                      Hi, again
                      I begin my work again using a Ref file in the correct order. I indexed it with bwa and after samtools.

                      Apparently everything worked fine, until the part I had to create a Sequence Dictionary.

                      java -jar /GenoStorage/Software/picard-tools/CreateSequenceDictionary.jar R=/GenoStorage/Genomas/hg19/hg19RefGenome.fa O=hg19RefGenome.dict
                      [Thu Feb 02 10:38:09 WET 2012] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/GenoStorage/Genomas/hg19/hg19RefGenome.fa OUTPUT=hg19RefGenome.dict TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
                      [Thu Feb 02 10:38:09 WET 2012] Executing as meusebio@IMMGene1 on Linux 2.6.32-71.29.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.6.0_22-b22
                      [Thu Feb 02 10:38:09 WET 2012] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
                      Runtime.totalMemory()=1518993408
                      Exception in thread "main" net.sf.picard.PicardException: Mismatch between sequence dictionary fasta index for /GenoStorage/Genomas/hg19/hg19RefGenome.fa, sequence 'chr10' != 'chr1'.
                      at net.sf.picard.reference.IndexedFastaSequenceFile.sanityCheckDictionaryAgainstIndex(IndexedFastaSequenceFile.java:143)
                      at net.sf.picard.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:86)
                      at net.sf.picard.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:95)
                      at net.sf.picard.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:75)
                      at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:128)
                      at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:113)
                      at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
                      at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)


                      If I use the old RefGeno it will work fine, but I think that's stupid, because is not in the correct order

                      Comment


                      • #12
                        maybe you should specify the absolute path of the dict output file. Maybe you've got the old dict file in the same path as the newly createed hg19RefGenome.fa file and he puts the new sequence dictionary somewhere else...
                        I'm afraid that's the only thing I can come up with

                        Comment


                        • #13
                          You are being great !

                          I'm already at GATK stage (I hope).

                          Another program, another problem

                          java -Xmx4g -jar /GenoStorage/Software/GATK/GenomeAnalysisTK.jar \ -R /GenoStorage/Genomas/hg19/hg19RefGenome.fa \ -knownSites /GenoStorage/BasesDados/ucsc_hg19/snp132CodingDbSnp.txt \ -I sample02187A_align_sorted.bam \ -T CountCovariates \ -cov ReadGroupCovariate \ -cov QualityScoreCovariate \ -cov CycleCovariate \ -cov DinucCovariate \ -recalFile sample02187A.recal_data.csv
                          INFO 14:32:42,831 HelpFormatter - ---------------------------------------------------------------------------------
                          INFO 14:32:42,833 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.3-21-gcb284ee, Compiled 2011/11/29 16:46:58
                          INFO 14:32:42,834 HelpFormatter - Copyright (c) 2010 The Broad Institute
                          INFO 14:32:42,834 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
                          INFO 14:32:42,834 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
                          INFO 14:32:42,835 HelpFormatter - Program Args: -R /GenoStorage/Genomas/hg19/hg19RefGenome.fa -knownSites /GenoStorage/BasesDados/ucsc_hg19/snp132CodingDbSnp.txt -I sample02187A_align_sorted.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile sample02187A.recal_data.csv
                          INFO 14:32:42,835 HelpFormatter - Date/Time: 2012/02/02 14:32:42
                          INFO 14:32:42,836 HelpFormatter - ---------------------------------------------------------------------------------
                          INFO 14:32:42,836 HelpFormatter - ---------------------------------------------------------------------------------
                          INFO 14:32:44,115 GATKRunReport - Uploaded run statistics report to AWS S3
                          ##### ERROR ------------------------------------------------------------------------------------------
                          ##### ERROR A USER ERROR has occurred (version 1.3-21-gcb284ee):
                          ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
                          ##### ERROR Please do not post this error to the GATK forum
                          ##### ERROR
                          ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
                          ##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
                          ##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
                          ##### ERROR
                          ##### ERROR MESSAGE: Invalid command line: Failed to parse value /GenoStorage/BasesDados/ucsc_hg19/snp132CodingDbSnp.txt for argument knownSites. Message: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
                          ##### ERROR Name FeatureType Documentation
                          ##### ERROR BEAGLE BeagleFeature http://www.broadinstitute.org/gsa/ga...agleCodec.html
                          ##### ERROR BED BEDFeature http://www.broadinstitute.org/gsa/ga..._BEDCodec.html
                          ##### ERROR BEDTABLE TableFeature http://www.broadinstitute.org/gsa/ga...ableCodec.html
                          ##### ERROR GELITEXT GeliTextFeature http://www.broadinstitute.org/gsa/ga...TextCodec.html
                          ##### ERROR OLDDBSNP OldDbSNPFeature http://www.broadinstitute.org/gsa/ga...bSNPCodec.html
                          ##### ERROR RAWHAPMAP RawHapMapFeature http://www.broadinstitute.org/gsa/ga...pMapCodec.html
                          ##### ERROR REFSEQ RefSeqFeature http://www.broadinstitute.org/gsa/ga...fSeqCodec.html
                          ##### ERROR SAMPILEUP SAMPileupFeature http://www.broadinstitute.org/gsa/ga...leupCodec.html
                          ##### ERROR SAMREAD SAMReadFeature http://www.broadinstitute.org/gsa/ga...ReadCodec.html
                          ##### ERROR TABLE TableFeature http://www.broadinstitute.org/gsa/ga...ableCodec.html
                          ##### ERROR VCF VariantContext http://www.broadinstitute.org/gsa/ga..._VCFCodec.html
                          ##### ERROR VCF3 VariantContext http://www.broadinstitute.org/gsa/ga...VCF3Codec.html
                          ##### ERROR ------------------------------------------------------------------------------------------

                          Comment


                          • #14
                            Hello,

                            I just had the same problem but found that the solution is the following:

                            Add name and file type after the -knownSites argument. If you file is in vcf format, use the following.

                            -knownSites:name,VCF snp132CodingDbSnp.txt

                            That is described at the bottom of page <http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK>

                            Comment


                            • #15
                              Moving this to the appropriate forum.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              59 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              57 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              56 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X