SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
MAPQ must should be 0 for unmapped read KevinLam Bioinformatics 24 09-23-2013 06:05 AM
human genome short read data for "Do it yourself genetic testing" attilav General 2 08-22-2011 12:59 PM
tophat v1.1.1 - "warning: multiple closures found for read" cs Bioinformatics 1 11-02-2010 07:02 AM
What's the meaning of "*" in the read bases cols in pileup file skblazer Bioinformatics 1 07-13-2010 02:27 AM
Concentration of mismatches close to read ends in "Whole Transcriptome" protocol hingamp SOLiD 42 05-27-2010 04:43 AM

Reply
 
Thread Tools
Old 09-05-2011, 02:38 PM   #1
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default GATK "MAPQ should be 0 for unmapped read" complaint for mapped read

In running UnifiedGenotyper in GATK, I got the following error message:

Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 5248502, Read name FCD0CALABXX:1:1104:13631:110790#CTTGTAAT, MAPQ should be 0 for unmapped read.
at net.sf.samtools.SAMUtils.processValidationErrors(SAMUtils.java:334)
at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:469)

... and a bunch more of these error messages that start with "at". I interpret this message as saying that a read at line 5248502 named " FCD0CALABXX:1:1104:13631:110790#CTTGTAAT " is unmapped but has a mapping quality score of that is not 0. However, if I look at the sam file that corresponds to the bam file, I find that there are two reads named "FCD ....", one at line 5248502 and one at the previous line, both of them are mapped and both of them have mapping quality scores that are non-zero:

line 5248501:

FCD0CALABXX:1:1104:13631:110790#CTTGTAAT 83 chrM 226 29 90M chr9_gl000201_random 36145 0 TAGGACATAATAATAACAATTGAATGTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCAAACCCCCCCTCCCCC ^_cW[`UbacaecbeO]]TUababdbTa^[cGcc``c[adecagdgdgggegbeggdgggggfegeefeffXfffg[gggggggeggggf XT:A:U NM:i:2 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:84C0T4



line 5248502:

FCD0CALABXX:1:1104:13631:110790#CTTGTAAT 167 chr9_gl000201_random 36145 29 41S49M chrM 226 0 AGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCA gggggggggggggggggggggggggggggggg_gggeeggggagggegegfggggg\cfgdggggegfggggggggdfag[eefefeefe XT:A:M NM:i:1 SM:i:29 AM:i:29 XM:i:1 XO:i:0 XG:i:0 MD:Z:3C45



Both of them have somewhat weird chromosomes, but both of those chromosomes are in my fasta genome data base, so that should be fine. Does anyone see what I'm doing wrong?

Thanks.

Eric
efoss is offline   Reply With Quote
Old 09-05-2011, 04:39 PM   #2
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

A friend of mine solved this one for me: The 167 flag in column 2 of the offending line tells me that the read is, in fact, not mapped. If I enter "167" here, I can get that information:

http://picard.sourceforge.net/explain-flags.html

Now why it counts as unmapped when it appears to be mapped to chr9_gl000201_random 36145 is still mysterious to me, but in any event, I should be able to fix this problem by deleting the offending line.

Eric
efoss is offline   Reply With Quote
Old 09-05-2011, 11:17 PM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by efoss View Post
A friend of mine solved this one for me: The 167 flag in column 2 of the offending line tells me that the read is, in fact, not mapped. If I enter "167" here, I can get that information:

http://picard.sourceforge.net/explain-flags.html

Now why it counts as unmapped when it appears to be mapped to chr9_gl000201_random 36145 is still mysterious to me, but in any event, I should be able to fix this problem by deleting the offending line.

Eric
There are two classic reasons why unmapped reads can appear as if they are mapped.

The first is that sam file specs call for an unmapped mate of a mapped read to be given the position of the mapped read. This is so that the two reads will sort together, but that doesn't seem to be happening here.

The second happens if you align with bwa. bwa concatenates reference contigs together before aligning, so if a read hangs off the edge of one contig on to another, it'll have a mapping position, but since that's not quite right, it's also marked as unmapped. GATK is confused by this behavior, unless you tell it to ignore it.

Those binary flags are a little weird, as I'd think that the first should have a flag of 91, since its mate is unmapped.

But in general, believe the flags. If the flag has the 4 bit flagged, it doesn't matter what the rest of the entry says, don't count the read as mapped.

Rather than delete the line, you can set the stringency on GATK to be lenient, so it will skip that line.
swbarnes2 is offline   Reply With Quote
Old 09-06-2011, 10:12 AM   #4
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

Quote:
Originally Posted by swbarnes2 View Post
There are two classic reasons why unmapped reads can appear as if they are mapped.

The first is that sam file specs call for an unmapped mate of a mapped read to be given the position of the mapped read. This is so that the two reads will sort together, but that doesn't seem to be happening here.

The second happens if you align with bwa. bwa concatenates reference contigs together before aligning, so if a read hangs off the edge of one contig on to another, it'll have a mapping position, but since that's not quite right, it's also marked as unmapped. GATK is confused by this behavior, unless you tell it to ignore it.

Those binary flags are a little weird, as I'd think that the first should have a flag of 91, since its mate is unmapped.

But in general, believe the flags. If the flag has the 4 bit flagged, it doesn't matter what the rest of the entry says, don't count the read as mapped.

Rather than delete the line, you can set the stringency on GATK to be lenient, so it will skip that line.
Thanks very much. That makes sense - and this file was, in fact, made using bwa.

I searched around for a lenient option in GATK and didn't see one. Is the syntax like this?:

java -Xmx20G -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar VALIDATION_STRINGENCY=LENIENT ...

Thanks.

Eric
efoss is offline   Reply With Quote
Old 09-06-2011, 10:37 AM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by efoss View Post
Thanks very much. That makes sense - and this file was, in fact, made using bwa.

I searched around for a lenient option in GATK and didn't see one. Is the syntax like this?:

java -Xmx20G -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar VALIDATION_STRINGENCY=LENIENT ...

Thanks.

Eric
Yes, that's the one. It's probably better to keep all the reads in your file, and have GATK ignore the problem ones, than to delete them out.
swbarnes2 is offline   Reply With Quote
Old 09-06-2011, 11:01 AM   #6
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

Quote:
Originally Posted by swbarnes2 View Post
Yes, that's the one. It's probably better to keep all the reads in your file, and have GATK ignore the problem ones, than to delete them out.
GATK doesn't seem to like my VALIDATION_STRINGENCY=LENIENT argument:

##### ERROR MESSAGE: Invalid argument value 'VALIDATION_STRINGENCY=LENIENT' at position 0.

I took this from a Picard Tools command I found with Google. My complete command (broken up into separate lines for readability) is as follows:

java -Xmx20G -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar

VALIDATION_STRINGENCY=LENIENT

-R /mnt/fred/home/efoss/gene_databases/human_genome/wg.fa

-T UnifiedGenotyper

-I /mnt/fred/home/efoss/sequencing/MDS/patient1/recreate_patient1_sam_file_plus_header_090511_1.bam

-B:dbsnp,VCF /mnt/fred/home/efoss/gene_databases/GATK_resource_files/00-All.vcf -nt 6

-o /mnt/fred/home/efoss/sequencing/MDS/patient1/MDS_patient1_dcov500_UnifiedGenotyper_090611_1.raw.vcf

-dcov 500

Thanks.

Eric
efoss is offline   Reply With Quote
Old 09-06-2011, 12:31 PM   #7
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Maybe that command line is obsolete?

This is from the version 1.1.23 help entry:

Quote:
-S,--validation_strictness <validation_strictness> How strict should we be with validation (STRICT|LENIENT|
SILENT)
swbarnes2 is offline   Reply With Quote
Old 09-06-2011, 01:51 PM   #8
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

Quote:
Originally Posted by swbarnes2 View Post
Maybe that command line is obsolete?

This is from the version 1.1.23 help entry:
Perfect - thanks so much for the help.

Eric
efoss is offline   Reply With Quote
Old 09-07-2011, 04:25 AM   #9
raonyguimaraes
Member
 
Location: Belo Horizonte - Brazil

Join Date: Jun 2010
Posts: 38
Default

The best tip i can give you is to use the data from Broad in your analysis.


ftp://ftp.broadinstitute.org/pub/seq...sembly19.fasta
And
ftp://ftp.ncbi.nlm.nih.gov/snp/organ.../00-All.vcf.gz
raonyguimaraes is offline   Reply With Quote
Old 04-26-2012, 11:03 PM   #10
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

Quote:
Originally Posted by swbarnes2 View Post
The first is that sam file specs call for an unmapped mate of a mapped read to be given the position of the mapped read. This is so that the two reads will sort together, but that doesn't seem to be happening here.
swbarnes2, Thanks for explaining the reason of MAPQ error with regard to bwa. However, I don't follow this one. Could you please explain this part a bit more detailed? Thank you.
cedance is offline   Reply With Quote
Old 04-27-2012, 12:13 PM   #11
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by cedance View Post
swbarnes2, Thanks for explaining the reason of MAPQ error with regard to bwa. However, I don't follow this one. Could you please explain this part a bit more detailed? Thank you.
When you map single reads, if a read doesn't map, its chromosome and position entries in the .bam file are left blank, or they have a star, or something like that. The 4 flag is also set.

When you have paired reads, if one end maps, and the other doesn't, the mapped end has the chromosome and position in its bam entry, just like you would think it should. But the unmapped read also has that same chromosome and position filled out in its entry. It's not left blank like you might think it should be. The reason for this is so that if you sort by chromosome and position, the unmapped read will sort right next to its mapped mate.

Only if both ends of a pair fail to map are they given empty chromosome and position values.

Just because an entry in the .bam file has a chromosome and position doesn't mean that it really mapped there. If it has a 4 in the flag, then you can't necesarily believe anything else in the .bam entry (except the read name, and sequence, and quality).
swbarnes2 is offline   Reply With Quote
Old 04-27-2012, 02:38 PM   #12
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

Quote:
Originally Posted by swbarnes2 View Post
When you map single reads, if a read doesn't map, its chromosome and position entries in the .bam file are left blank, or they have a star, or something like that. The 4 flag is also set.

When you have paired reads, if one end maps, and the other doesn't, the mapped end has the chromosome and position in its bam entry, just like you would think it should. But the unmapped read also has that same chromosome and position filled out in its entry. It's not left blank like you might think it should be. The reason for this is so that if you sort by chromosome and position, the unmapped read will sort right next to its mapped mate.

Only if both ends of a pair fail to map are they given empty chromosome and position values.

Just because an entry in the .bam file has a chromosome and position doesn't mean that it really mapped there. If it has a 4 in the flag, then you can't necesarily believe anything else in the .bam entry (except the read name, and sequence, and quality).

swbarnes2, great! thank you.
Just 1 more question. How about the NH:i:1 and XT:A:U optional flags in case of tophat and bwa? Of course its easy to just check for the unmapped read by doing a "bitwise and" in perl as and($flag, 4) == 4. However, I was just wondering because, in bwa, I checked for a recent library and found out that 5 reads with unmapped read flag set had the optional flag XT:A:U as well... Isn't this a bug then? It concludes all the same though that one can't rely on any other info/optional flag to decide if the read is unmapped or not..
cedance is offline   Reply With Quote
Old 10-18-2012, 01:19 AM   #13
Sam64
Member
 
Location: Paris

Join Date: Jun 2011
Posts: 15
Default

Quote:
Originally Posted by swbarnes2 View Post
Yes, that's the one. It's probably better to keep all the reads in your file, and have GATK ignore the problem ones, than to delete them out.
Are you sure when you say it's probably better to keep all the reads....
Would anybody have an opinion about the problem ? Is it necessary to eliminate the reads with errors (MAPQ should be 0 for unmapped read) ? or can you use the option "VALIDATION_STRINGENCY=LENIENT" to avoid the problem and continue the analysis without removing these reads ?

Thank you,

Sam64
Sam64 is offline   Reply With Quote
Reply

Tags
error message, gatk, mapq should be 0, unifiedgenotyper, unmapped read

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:12 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO