Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Questions about exome sequencing

    I've got exome data from 12 human samples that I'd like to analyze. For the most part, I've been following GATK's best practices for my pipeline, but I run into some problems and I can't seem to solve them by reading the forums here or at broadinstitute.org. I wanted to get all the way through the pipeline with one of the samples just to make sure that my script will work properly.

    The first real problem I had was when I used the BaseRecalibrator tool from GATK. From what I could tell, there was a problem (I think) with many of the CIGAR strings. There are about 1000 instances of deletions at the end of a read. Perhaps this is my the source of all my issues. Does this even make sense? It seems like a deletion should not occur at the beginning or end of a read, but I may be misunderstanding something. By reading the forums at broadinstitute, I learned that adding the "-rf BadCigar" option will make BaseRecalibrator ignore all cases of bad CIGAR strings. But then later on when I use HaplotypeCaller, I get an error with error message line:
    Code:
    ##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
    This reminded me of the 'bad' CIGAR strings that I had observed earlier.

    If this CIGAR string issue is my problem, can anyone help me fix it? I used bwa aln to align and then bwa sampe to generate the sam file. The odd CIGAR strings already appear in this sam file, so if deletions at the end of a read are bad, then the problem seems to be in the alignment/sampe stage. I can also post an abbreviated version of my script or the entire error message that I get from HaplotypeCaller if they might be helpful.

    Thank you,
    Blake

  • #2
    If you post up the script and error it will be easier to help, not sure why this is happening as I haven't dealt with the specific error but having used GATK 'best prcaitce' pipelines I can say it requires a lot of attention to get it running correctly! If you could attach a small sample of your SAM with reads that are generating error that would be really helpful too. Also the Broad/GATK forum people are very helpful if you try contacting them directly.

    Comment


    • #3
      Thank you for the reply. Here is an abbreviated version of my script so far (sampleL1_1 refers to the first mate on lane 1 of the paired end reads):

      Code:
      bwa aln -t 4 -f sampleL1_1.sai hg19 sampleL1_1.fastq
      bwa aln -t 4 -f sampleL1_2.sai hg19 sampleL1_2.fastq
      bwa sampe -f sampleL1.sam -r"RG@\tID:sampleL1\tLB:sampleL1\tSM:sampleL1\tPL:ILLUMINA" hg19 sampleL1_1.sai sampleL1_2.sai sampleL1_1.fastq sampleL1_2.fastq
      java -Xmx4g -Djava.io.tmpdir=/local -jar SortSam.jar SO=coordinate INPUT=sampleL1.sam OUTPUT=sampleL1.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
      java -Xmx4g -Djava.io.tmpdir=/local -jar MarkDuplicates.jar INPUT=sampleL1.bam OUTPUT=sampleL1.dedup.bam METRICS_FILE=sampleL1.dup.txt VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true
      samtools index sampleL1.dedup.bam
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fa -I sampleL1.dedup.bam -o sampleL1.intervals -known gold_standard.vcf -known 1000G_phase1.vcf
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T IndelRealigner -R hg19.fa -I sampleL1.dedup.bam -targetIntervals sample1.intervals -o sampleL1.dedup.realn.bam -known gold_standard.vcf -known 1000G_phase1.vcf
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T BaseRecalibrator -rf BadCigar -R hg19.fa -I sampleL1.dedup.realn.bam -knownSites dbsnp137.vcf -knownSites gold_standard.vcf -knownSites 1000G_phase1.vcf -o sampleL1.recal_data.table
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T PrintReads -R hg19.fa -I sampleL1.dedup.realn.bam -BQSR sampleL1.recal_data.table -o sampleL1.dedup.realn.recal.bam
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T ReduceReads -R hg19.fa -I sampleL1.dedup.realn.recal.bam -o sampleL1.dedup.realn.recal.reduced.bam
      I got an error earlier on the BaseRecalibrator step saying something like "START (101) > (100) STOP -- this should never happen -- call Mauricio!" After searching the broadinstitute.org forums, I saw a few posts claiming that adding the -rf BadCigar would make the tools ignore the reads with bad CIGAR strings. It seems to have worked. I do the previous set of commands on three lanes and then picard mergesamfiles, index and use GATK's HaplotypeCaller:

      Code:
      java -Xmx4g -jar MergeSamFiles.jar TMP_DIR=/local INPUT=sampleL1.dedup.realn.recal.reduced.bam INPUT=sampleL2.dedup.realn.recal.reduced.bam INPUT=sampleL3.dedup.realn.recal.reduced.bam OUTPUT=sample.merged.bam
      samtools index sample.merged.bam
      java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I sample.merged.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o sample_raw.vcf
      As I previously mentioned, I get an error on the last step:

      Code:
      ##### ERROR ------------------------------------------------------------------------------------------
      ##### ERROR stack trace
      org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions?
              at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:447)
              at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:396)
              at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:392)
              at org.broadinstitute.sting.gatk.walkers.annotator.DepthOfCoverage.annotate(DepthOfCoverage.java:56)
              at org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation.annotate(InfoFieldAnnotation.java:24)
              at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:223)
              at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoodsAndCallIndependentEvents(GenotypingEngine.java:196)
              at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:411)
              at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:107)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:285)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:230)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:205)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:131)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:28)
              at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:74)
              at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
              at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
              at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
              at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
              at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
      ##### ERROR ------------------------------------------------------------------------------------------
      ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
      ##### ERROR
      ##### ERROR Please visit the wiki to see if this is a known problem
      ##### ERROR If not, please post the error, with stack trace, to the GATK forum
      ##### ERROR Visit our website and forum for extensive documentation and answers to
      ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
      ##### ERROR
      ##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
      ##### ERROR ------------------------------------------------------------------------------------------
      Perhaps this is all more information than I needed to provide as I'm hoping that my problems originate from bad CIGAR strings at the alignment step. Both of the errors that I have encountered that were not silly typos seem to be referring to these bad CIGAR strings. Here is one of the odd looking reads that I get in my original sam file:

      Code:
      GWZHISEQ01:231:D2BL9ACXX:1:1101:8158:10808      73      chr9    80793844        37      95M4I2M3D       =       80793844        0       TGATATTCTCGAGTTCCAAAACGGGGTCCCTGTGAAGGTGACCAACGTCAAGGATGGCACCACCCACCAGACCTCCTTGGTACCTTGGTACCTCCAGGATT    CCCFFFFFHHHHHIIJJJJJJJJJJJGIIJJJHIJJJJHHIJJJJJJHIJJJJJJJJIJJGHHFFFFDDDDDDDDDDDDDCCDDDDDDDDDDDDDDDD?CC   RG:Z:JC06_L1    XT:A:U  NM:i:7  SM:i:37 AM:i:0  X0:i:1  X1:i:0  XM:i:4  XO:i:1  XG:i:1  MD:Z:97^TTC0
      This is only the first one that I see, and there are around 500 in my first sample's lane. Does it make any sense at all to have a CIGAR string "95M4I2M3D"? It's also strange because the read has 101 bases and 95 + 4 + 2 + 3 is not 101.

      Comment


      • #4
        Ok, my immediate response to issues like this is: what is the proportion of reads that have the error to total reads, i.e. can you just throw them out?! You say 500, I assume of millions, so I would get rid and see if your pipeline now runs. You can retain reads and see if they fall in same area, or share any other info which may direct you to the cause of the problem.

        As to the CIGAR, I am not an expert at all! But I agree, it does not make sense based on what CIGAR is meant to say about the read (104 bases when read is 101 as you state). Perhaps try the Picard ValidateSAMFile to test if there is anything really wrong, esp if you do go back and remove the 'weird' CIGAR-reads.

        Sorry I can't be of more help, good luck and post up what you did when solved as it will be useful+interesting to see.

        Comment


        • #5
          D operations don't contribute to read length, so 95+4+2=101.

          Having said that, Both the CIGAR string and MD tag make no sense. Talking about a deletion after the end of a read is just strange. I would have guessed that the IndelRealigner just screwed this one up, but if this was how things are in the original SAM file, then it was actually the aligner. This looks like a bug in BWA.

          Comment


          • #6
            Originally posted by dpryan View Post
            D operations don't contribute to read length, so 95+4+2=101.

            Having said that, Both the CIGAR string and MD tag make no sense. Talking about a deletion after the end of a read is just strange. I would have guessed that the IndelRealigner just screwed this one up, but if this was how things are in the original SAM file, then it was actually the aligner. This looks like a bug in BWA.
            Of course D's don't contribute to read length. Duh, what was I thinking? I think you're right. I aligned with bowtie and did
            Code:
            awk '$6 ~ /D$/ {print}' sampleL1_bowtie.sam
            and nothing was returned. I'll post what happens with my pipeline when I try aligning with bowtie and maybe some more QC.

            Comment


            • #7
              A reported error in GATK



              Update your version from version 2.3-9-ge5ebf34 to the current

              Comment


              • #8
                Reading through the GATK forums, it seems that blakeoft is correct and weird CIGAR strings cause this (see, for example this on-going thread). Even if GATK gets around the weird CIGAR string, I would still consider that to be a BWA bug, since you really shouldn't have to run with "-rf BadCigar".

                Comment


                • #9
                  I think my problem is that I was using bwa aln and not bwa mem. The latter produces alignments with no trailing D's with the same fastq input. But I think my main question (is it nonsensical for CIGAR strings to have D's at the end?) has been answered. Thanks to everyone.

                  Comment


                  • #10
                    I have another problem, but I want to post on GATK's forums. That leads to another issue, which is that I don't know how to post there... There doesn't seem to be a way to ask a new question or post in an existing thread. Does anyone know how to ask a new question on that forum?

                    Comment


                    • #11
                      First you need to register with GATK. At the top of the home page there is a 'Forum' option, and within there is the option to 'Register'.

                      Once registered and signed in, you can post a new question on the same 'Forum' page (top left), or you can add comments to a current thread by scrolling to the bottom of the thread and adding a comment.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin


                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                        Yesterday, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      39 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      41 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      35 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      55 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X