Seqanswers Leaderboard Ad

**bruce01** · 01-29-2014, 03:53 AM

If you post up the script and error it will be easier to help, not sure why this is happening as I haven't dealt with the specific error but having used GATK 'best prcaitce' pipelines I can say it requires a lot of attention to get it running correctly! If you could attach a small sample of your SAM with reads that are generating error that would be really helpful too. Also the Broad/GATK forum people are very helpful if you try contacting them directly.

**blakeoft** · 01-29-2014, 06:50 AM

Thank you for the reply. Here is an abbreviated version of my script so far (sampleL1_1 refers to the first mate on lane 1 of the paired end reads):

Code:

bwa aln -t 4 -f sampleL1_1.sai hg19 sampleL1_1.fastq
bwa aln -t 4 -f sampleL1_2.sai hg19 sampleL1_2.fastq
bwa sampe -f sampleL1.sam -r"RG@\tID:sampleL1\tLB:sampleL1\tSM:sampleL1\tPL:ILLUMINA" hg19 sampleL1_1.sai sampleL1_2.sai sampleL1_1.fastq sampleL1_2.fastq
java -Xmx4g -Djava.io.tmpdir=/local -jar SortSam.jar SO=coordinate INPUT=sampleL1.sam OUTPUT=sampleL1.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
java -Xmx4g -Djava.io.tmpdir=/local -jar MarkDuplicates.jar INPUT=sampleL1.bam OUTPUT=sampleL1.dedup.bam METRICS_FILE=sampleL1.dup.txt VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true
samtools index sampleL1.dedup.bam
java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fa -I sampleL1.dedup.bam -o sampleL1.intervals -known gold_standard.vcf -known 1000G_phase1.vcf
java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T IndelRealigner -R hg19.fa -I sampleL1.dedup.bam -targetIntervals sample1.intervals -o sampleL1.dedup.realn.bam -known gold_standard.vcf -known 1000G_phase1.vcf
java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T BaseRecalibrator -rf BadCigar -R hg19.fa -I sampleL1.dedup.realn.bam -knownSites dbsnp137.vcf -knownSites gold_standard.vcf -knownSites 1000G_phase1.vcf -o sampleL1.recal_data.table
java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T PrintReads -R hg19.fa -I sampleL1.dedup.realn.bam -BQSR sampleL1.recal_data.table -o sampleL1.dedup.realn.recal.bam
java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T ReduceReads -R hg19.fa -I sampleL1.dedup.realn.recal.bam -o sampleL1.dedup.realn.recal.reduced.bam

I got an error earlier on the BaseRecalibrator step saying something like "START (101) > (100) STOP -- this should never happen -- call Mauricio!" After searching the broadinstitute.org forums, I saw a few posts claiming that adding the -rf BadCigar would make the tools ignore the reads with bad CIGAR strings. It seems to have worked. I do the previous set of commands on three lanes and then picard mergesamfiles, index and use GATK's HaplotypeCaller:

Code:

java -Xmx4g -jar MergeSamFiles.jar TMP_DIR=/local INPUT=sampleL1.dedup.realn.recal.reduced.bam INPUT=sampleL2.dedup.realn.recal.reduced.bam INPUT=sampleL3.dedup.realn.recal.reduced.bam OUTPUT=sample.merged.bam
samtools index sample.merged.bam
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I sample.merged.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o sample_raw.vcf

As I previously mentioned, I get an error on the last step:

Code:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions?
        at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:447)
        at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:396)
        at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:392)
        at org.broadinstitute.sting.gatk.walkers.annotator.DepthOfCoverage.annotate(DepthOfCoverage.java:56)
        at org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation.annotate(InfoFieldAnnotation.java:24)
        at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:223)
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoodsAndCallIndependentEvents(GenotypingEngine.java:196)
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:411)
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:107)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:285)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:230)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:205)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:131)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:28)
        at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:74)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
##### ERROR ------------------------------------------------------------------------------------------

Perhaps this is all more information than I needed to provide as I'm hoping that my problems originate from bad CIGAR strings at the alignment step. Both of the errors that I have encountered that were not silly typos seem to be referring to these bad CIGAR strings. Here is one of the odd looking reads that I get in my original sam file:

Code:

GWZHISEQ01:231:D2BL9ACXX:1:1101:8158:10808      73      chr9    80793844        37      95M4I2M3D       =       80793844        0       TGATATTCTCGAGTTCCAAAACGGGGTCCCTGTGAAGGTGACCAACGTCAAGGATGGCACCACCCACCAGACCTCCTTGGTACCTTGGTACCTCCAGGATT    CCCFFFFFHHHHHIIJJJJJJJJJJJGIIJJJHIJJJJHHIJJJJJJHIJJJJJJJJIJJGHHFFFFDDDDDDDDDDDDDCCDDDDDDDDDDDDDDDD?CC   RG:Z:JC06_L1    XT:A:U  NM:i:7  SM:i:37 AM:i:0  X0:i:1  X1:i:0  XM:i:4  XO:i:1  XG:i:1  MD:Z:97^TTC0

This is only the first one that I see, and there are around 500 in my first sample's lane. Does it make any sense at all to have a CIGAR string "95M4I2M3D"? It's also strange because the read has 101 bases and 95 + 4 + 2 + 3 is not 101.

**bruce01** · 01-29-2014, 07:29 AM

Ok, my immediate response to issues like this is: what is the proportion of reads that have the error to total reads, i.e. can you just throw them out?! You say 500, I assume of millions, so I would get rid and see if your pipeline now runs. You can retain reads and see if they fall in same area, or share any other info which may direct you to the cause of the problem.

As to the CIGAR, I am not an expert at all! But I agree, it does not make sense based on what CIGAR is meant to say about the read (104 bases when read is 101 as you state). Perhaps try the Picard ValidateSAMFile to test if there is anything really wrong, esp if you do go back and remove the 'weird' CIGAR-reads.

Sorry I can't be of more help, good luck and post up what you did when solved as it will be useful+interesting to see.

**dpryan** · 01-29-2014, 07:42 AM

D operations don't contribute to read length, so 95+4+2=101.

Having said that, Both the CIGAR string and MD tag make no sense. Talking about a deletion after the end of a read is just strange. I would have guessed that the IndelRealigner just screwed this one up, but if this was how things are in the original SAM file, then it was actually the aligner. This looks like a bug in BWA.

**blakeoft** · 01-29-2014, 08:17 AM

Originally posted by dpryan View Post

D operations don't contribute to read length, so 95+4+2=101.

Having said that, Both the CIGAR string and MD tag make no sense. Talking about a deletion after the end of a read is just strange. I would have guessed that the IndelRealigner just screwed this one up, but if this was how things are in the original SAM file, then it was actually the aligner. This looks like a bug in BWA.

Of course D's don't contribute to read length. Duh, what was I thinking? I think you're right. I aligned with bowtie and did

Code:

awk '$6 ~ /D$/ {print}' sampleL1_bowtie.sam

and nothing was returned. I'll post what happens with my pipeline when I try aligning with bowtie and maybe some more QC.

**rbagnall** · 01-29-2014, 04:11 PM

A reported error in GATK

http://www.broadinstitute.org/gatk/search?q=Somehow+the+requested+coordinate+is+not+covered+by+the+read.+Too+many+deletions%3F

Update your version from version 2.3-9-ge5ebf34 to the current

**dpryan** · 01-29-2014, 05:04 PM

Reading through the GATK forums, it seems that blakeoft is correct and weird CIGAR strings cause this (see, for example this on-going thread). Even if GATK gets around the weird CIGAR string, I would still consider that to be a BWA bug, since you really shouldn't have to run with "-rf BadCigar".

**blakeoft** · 01-31-2014, 07:59 AM

I think my problem is that I was using bwa aln and not bwa mem. The latter produces alignments with no trailing D's with the same fastq input. But I think my main question (is it nonsensical for CIGAR strings to have D's at the end?) has been answered. Thanks to everyone.

**blakeoft** · 02-04-2014, 01:06 PM

I have another problem, but I want to post on GATK's forums. That leads to another issue, which is that I don't know how to post there... There doesn't seem to be a way to ask a new question or post in an existing thread. Does anyone know how to ask a new question on that forum?

**rbagnall** · 02-04-2014, 01:44 PM

First you need to register with GATK. At the top of the home page there is a 'Forum' option, and within there is the option to 'Register'.

Once registered and signed in, you can post a new question on the same 'Forum' page (top left), or you can add comments to a current thread by scrolling to the bottom of the thread and adding a comment.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 39 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Questions about exome sequencing

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News