Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GATK "MAPQ should be 0 for unmapped read" complaint for mapped read

    In running UnifiedGenotyper in GATK, I got the following error message:

    Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 5248502, Read name FCD0CALABXX:1:1104:13631:110790#CTTGTAAT, MAPQ should be 0 for unmapped read.
    at net.sf.samtools.SAMUtils.processValidationErrors(SAMUtils.java:334)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:469)

    ... and a bunch more of these error messages that start with "at". I interpret this message as saying that a read at line 5248502 named " FCD0CALABXX:1:1104:13631:110790#CTTGTAAT " is unmapped but has a mapping quality score of that is not 0. However, if I look at the sam file that corresponds to the bam file, I find that there are two reads named "FCD ....", one at line 5248502 and one at the previous line, both of them are mapped and both of them have mapping quality scores that are non-zero:

    line 5248501:

    FCD0CALABXX:1:1104:13631:110790#CTTGTAAT 83 chrM 226 29 90M chr9_gl000201_random 36145 0 TAGGACATAATAATAACAATTGAATGTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCAAACCCCCCCTCCCCC ^_cW[`UbacaecbeO]]TUababdbTa^[cGcc``c[adecagdgdgggegbeggdgggggfegeefeffXfffg[gggggggeggggf XT:A:U NM:i:2 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:84C0T4



    line 5248502:

    FCD0CALABXX:1:1104:13631:110790#CTTGTAAT 167 chr9_gl000201_random 36145 29 41S49M chrM 226 0 AGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCA gggggggggggggggggggggggggggggggg_gggeeggggagggegegfggggg\cfgdggggegfggggggggdfag[eefefeefe XT:A:M NM:i:1 SM:i:29 AM:i:29 XM:i:1 XO:i:0 XG:i:0 MD:Z:3C45



    Both of them have somewhat weird chromosomes, but both of those chromosomes are in my fasta genome data base, so that should be fine. Does anyone see what I'm doing wrong?

    Thanks.

    Eric

  • #2
    A friend of mine solved this one for me: The 167 flag in column 2 of the offending line tells me that the read is, in fact, not mapped. If I enter "167" here, I can get that information:



    Now why it counts as unmapped when it appears to be mapped to chr9_gl000201_random 36145 is still mysterious to me, but in any event, I should be able to fix this problem by deleting the offending line.

    Eric

    Comment


    • #3
      Originally posted by efoss View Post
      A friend of mine solved this one for me: The 167 flag in column 2 of the offending line tells me that the read is, in fact, not mapped. If I enter "167" here, I can get that information:



      Now why it counts as unmapped when it appears to be mapped to chr9_gl000201_random 36145 is still mysterious to me, but in any event, I should be able to fix this problem by deleting the offending line.

      Eric
      There are two classic reasons why unmapped reads can appear as if they are mapped.

      The first is that sam file specs call for an unmapped mate of a mapped read to be given the position of the mapped read. This is so that the two reads will sort together, but that doesn't seem to be happening here.

      The second happens if you align with bwa. bwa concatenates reference contigs together before aligning, so if a read hangs off the edge of one contig on to another, it'll have a mapping position, but since that's not quite right, it's also marked as unmapped. GATK is confused by this behavior, unless you tell it to ignore it.

      Those binary flags are a little weird, as I'd think that the first should have a flag of 91, since its mate is unmapped.

      But in general, believe the flags. If the flag has the 4 bit flagged, it doesn't matter what the rest of the entry says, don't count the read as mapped.

      Rather than delete the line, you can set the stringency on GATK to be lenient, so it will skip that line.

      Comment


      • #4
        Originally posted by swbarnes2 View Post
        There are two classic reasons why unmapped reads can appear as if they are mapped.

        The first is that sam file specs call for an unmapped mate of a mapped read to be given the position of the mapped read. This is so that the two reads will sort together, but that doesn't seem to be happening here.

        The second happens if you align with bwa. bwa concatenates reference contigs together before aligning, so if a read hangs off the edge of one contig on to another, it'll have a mapping position, but since that's not quite right, it's also marked as unmapped. GATK is confused by this behavior, unless you tell it to ignore it.

        Those binary flags are a little weird, as I'd think that the first should have a flag of 91, since its mate is unmapped.

        But in general, believe the flags. If the flag has the 4 bit flagged, it doesn't matter what the rest of the entry says, don't count the read as mapped.

        Rather than delete the line, you can set the stringency on GATK to be lenient, so it will skip that line.
        Thanks very much. That makes sense - and this file was, in fact, made using bwa.

        I searched around for a lenient option in GATK and didn't see one. Is the syntax like this?:

        java -Xmx20G -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar VALIDATION_STRINGENCY=LENIENT ...

        Thanks.

        Eric

        Comment


        • #5
          Originally posted by efoss View Post
          Thanks very much. That makes sense - and this file was, in fact, made using bwa.

          I searched around for a lenient option in GATK and didn't see one. Is the syntax like this?:

          java -Xmx20G -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar VALIDATION_STRINGENCY=LENIENT ...

          Thanks.

          Eric
          Yes, that's the one. It's probably better to keep all the reads in your file, and have GATK ignore the problem ones, than to delete them out.

          Comment


          • #6
            Originally posted by swbarnes2 View Post
            Yes, that's the one. It's probably better to keep all the reads in your file, and have GATK ignore the problem ones, than to delete them out.
            GATK doesn't seem to like my VALIDATION_STRINGENCY=LENIENT argument:

            ##### ERROR MESSAGE: Invalid argument value 'VALIDATION_STRINGENCY=LENIENT' at position 0.

            I took this from a Picard Tools command I found with Google. My complete command (broken up into separate lines for readability) is as follows:

            java -Xmx20G -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar

            VALIDATION_STRINGENCY=LENIENT

            -R /mnt/fred/home/efoss/gene_databases/human_genome/wg.fa

            -T UnifiedGenotyper

            -I /mnt/fred/home/efoss/sequencing/MDS/patient1/recreate_patient1_sam_file_plus_header_090511_1.bam

            -B:dbsnp,VCF /mnt/fred/home/efoss/gene_databases/GATK_resource_files/00-All.vcf -nt 6

            -o /mnt/fred/home/efoss/sequencing/MDS/patient1/MDS_patient1_dcov500_UnifiedGenotyper_090611_1.raw.vcf

            -dcov 500

            Thanks.

            Eric

            Comment


            • #7
              Maybe that command line is obsolete?

              This is from the version 1.1.23 help entry:

              -S,--validation_strictness <validation_strictness> How strict should we be with validation (STRICT|LENIENT|
              SILENT)

              Comment


              • #8
                Originally posted by swbarnes2 View Post
                Maybe that command line is obsolete?

                This is from the version 1.1.23 help entry:
                Perfect - thanks so much for the help.

                Eric

                Comment


                • #9
                  The best tip i can give you is to use the data from Broad in your analysis.


                  ftp://ftp.broadinstitute.org/pub/seq...sembly19.fasta
                  And
                  ftp://ftp.ncbi.nlm.nih.gov/snp/organ.../00-All.vcf.gz

                  Comment


                  • #10
                    Originally posted by swbarnes2 View Post
                    The first is that sam file specs call for an unmapped mate of a mapped read to be given the position of the mapped read. This is so that the two reads will sort together, but that doesn't seem to be happening here.
                    swbarnes2, Thanks for explaining the reason of MAPQ error with regard to bwa. However, I don't follow this one. Could you please explain this part a bit more detailed? Thank you.

                    Comment


                    • #11
                      Originally posted by cedance View Post
                      swbarnes2, Thanks for explaining the reason of MAPQ error with regard to bwa. However, I don't follow this one. Could you please explain this part a bit more detailed? Thank you.
                      When you map single reads, if a read doesn't map, its chromosome and position entries in the .bam file are left blank, or they have a star, or something like that. The 4 flag is also set.

                      When you have paired reads, if one end maps, and the other doesn't, the mapped end has the chromosome and position in its bam entry, just like you would think it should. But the unmapped read also has that same chromosome and position filled out in its entry. It's not left blank like you might think it should be. The reason for this is so that if you sort by chromosome and position, the unmapped read will sort right next to its mapped mate.

                      Only if both ends of a pair fail to map are they given empty chromosome and position values.

                      Just because an entry in the .bam file has a chromosome and position doesn't mean that it really mapped there. If it has a 4 in the flag, then you can't necesarily believe anything else in the .bam entry (except the read name, and sequence, and quality).

                      Comment


                      • #12
                        Originally posted by swbarnes2 View Post
                        When you map single reads, if a read doesn't map, its chromosome and position entries in the .bam file are left blank, or they have a star, or something like that. The 4 flag is also set.

                        When you have paired reads, if one end maps, and the other doesn't, the mapped end has the chromosome and position in its bam entry, just like you would think it should. But the unmapped read also has that same chromosome and position filled out in its entry. It's not left blank like you might think it should be. The reason for this is so that if you sort by chromosome and position, the unmapped read will sort right next to its mapped mate.

                        Only if both ends of a pair fail to map are they given empty chromosome and position values.

                        Just because an entry in the .bam file has a chromosome and position doesn't mean that it really mapped there. If it has a 4 in the flag, then you can't necesarily believe anything else in the .bam entry (except the read name, and sequence, and quality).

                        swbarnes2, great! thank you.
                        Just 1 more question. How about the NH:i:1 and XT:A:U optional flags in case of tophat and bwa? Of course its easy to just check for the unmapped read by doing a "bitwise and" in perl as and($flag, 4) == 4. However, I was just wondering because, in bwa, I checked for a recent library and found out that 5 reads with unmapped read flag set had the optional flag XT:A:U as well... Isn't this a bug then? It concludes all the same though that one can't rely on any other info/optional flag to decide if the read is unmapped or not..

                        Comment


                        • #13
                          Originally posted by swbarnes2 View Post
                          Yes, that's the one. It's probably better to keep all the reads in your file, and have GATK ignore the problem ones, than to delete them out.
                          Are you sure when you say it's probably better to keep all the reads....
                          Would anybody have an opinion about the problem ? Is it necessary to eliminate the reads with errors (MAPQ should be 0 for unmapped read) ? or can you use the option "VALIDATION_STRINGENCY=LENIENT" to avoid the problem and continue the analysis without removing these reads ?

                          Thank you,

                          Sam64

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Techniques and Challenges in Conservation Genomics
                            by seqadmin



                            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                            Avian Conservation
                            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                            03-08-2024, 10:41 AM
                          • seqadmin
                            The Impact of AI in Genomic Medicine
                            by seqadmin



                            Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                            02-26-2024, 02:07 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 03-14-2024, 06:13 AM
                          0 responses
                          34 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-08-2024, 08:03 AM
                          0 responses
                          72 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-07-2024, 08:13 AM
                          0 responses
                          81 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-06-2024, 09:51 AM
                          0 responses
                          68 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X