  • What's causing malformed reads

    Hello everyone,

    This is my first post here, so please excuse any etiquette mistakes. I'm working through a GATK pipeline for sequence data from multiple individuals. I have reached the local indel realignment phase, and midway through the realignment process (target locator already run) I get an error message that kills the process:

    ERROR MESSAGE: SAM/BAM file SAMFileReader{..file path} is malformed: BAM file has a read with mismatching number of bases and base qualities. Offender: T_SOLEXA-GA02:6:9:1538:8018 [1 bases] [0 quals]

    I have found a way around this using -filterMBQ, which skips malformed reads, but I am curious about the underlying cause of the problem. Is it most likely that I did something wrong with file formatting earlier in the pipeline and created a mismatch between bases and base qualities, or can these mismatches occur at low frequency as a normal part of the sequencing process? The fact that a malformed-read filter exists makes me think they can just occur 'naturally', but I have no idea why.

    If anyone has thoughts on this or has run into the same problem, I'd really appreciate hearing from you. I'm apprehensive about moving on with the pipeline without understanding the root of the problem.

    Best,

    Rubal7

  • #2
    That looks pretty strange: it found a read with only one base and no associated quality. Do you do any kind of adaptor sequence removal or quality trimming? Anyway, I've never seen that error...


    • #3
      Along the same line of inquiry as ulz_peter, have a look in the SAM/BAM file you used as input to see if the original read is malformed or if this is being introduced along the way. It's odd for a read to be only 1 base long.
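
      One way to run that check is sketched below. This is only a rough illustration, not part of GATK or samtools: the function name `find_mismatched_reads` is hypothetical, and it assumes the records arrive as tab-separated SAM text (for example from `samtools view`). It counts a lone `*` in SEQ or QUAL as zero characters, which mirrors the "[1 bases] [0 quals]" wording of the error message.

```python
def find_mismatched_reads(sam_lines):
    """Return (read_name, n_bases, n_quals) for records whose SEQ and QUAL lengths differ.

    Assumption: a lone '*' in SEQ or QUAL is counted as zero characters.
    """
    offenders = []
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, seq, qual = fields[0], fields[9], fields[10]
        n_bases = 0 if seq == "*" else len(seq)
        n_quals = 0 if qual == "*" else len(qual)
        if n_bases != n_quals:
            offenders.append((name, n_bases, n_quals))
    return offenders


# Made-up records for illustration: a header, a normal read, and a 1-base read with '*' quality.
example = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t528\tchr7\t111016499\t0\t1M\t*\t0\t0\tC\t*",
]
print(find_mismatched_reads(example))
```

      Running the same function over the raw aligner output and over each intermediate file should show whether the read was already like this at input or became so at a later step.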


      • #4
        Thanks guys, checking both these things now


        • #5
          The offending read:
          T_SOLEXA-GA01_r:6:9:1538:8018 528 chr7 111016499 0 1M * 0 0 C * XT:A:R NM:i:0 XN:i:1 X0:f:1.36217e+08 XM:i:0 XO:i:0 XG:i:0 MD:A:1 RG:Z:NR_49w XI:Z:AACTCCG YI:Z:.--/-2/ ZQ:A:L


          • #6
            I'm not surprised that the "doesn't pass QC" flag is set on that read. A * by itself in the QUAL field like that normally would mean "no quality stored", which would indeed be a malformed line. However, a single * is ambiguous in this case, since it's also a possible QUAL+33 score (for a crappy base call).
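
            For anyone who wants to verify both points, here is a small sketch. The helper names `decode_flag` and `phred33` are made up for illustration; the bit meanings follow the SAM flag definitions. It decodes the FLAG value 528 from the read above and shows what a `*` character would mean if it were read as a Phred+33 quality.

```python
# SAM FLAG bits and their meanings.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails QC",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if flag & bit]


def phred33(char):
    """Quality score encoded by a single Phred+33 character."""
    return ord(char) - 33


print(decode_flag(528))  # 528 = 512 + 16
print(phred33("*"))      # '*' is ASCII 42
```

            So 528 is "reverse strand" plus "fails QC", and a literal `*` read as Phred+33 would be a quality of 9, which is why a one-base read with `*` in QUAL cannot be told apart from "no quality stored".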

            Frankly, you'd be well off removing such short reads, since their mapping is going to be totally unreliable and they won't contribute anything meaningful to your results. Presumably whatever program you're using to do the adaptor trimming is capable of not returning reads below a certain size.
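
            If the trimmer can't enforce a minimum length, a filter over the SAM text is a possible stopgap. This is only a sketch: `drop_short_reads` is a hypothetical helper, the cutoff of 20 bases is an arbitrary placeholder rather than a recommendation, and it assumes tab-separated SAM records. Filtering at the FASTQ stage before alignment is the cleaner fix.

```python
def drop_short_reads(sam_lines, min_length=20):
    """Yield header lines unchanged and alignment records whose SEQ has at least min_length bases.

    Assumption: min_length=20 is an arbitrary placeholder; choose a cutoff that suits your data.
    Records with SEQ stored as '*' (sequence not stored) are passed through untouched.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 11 and fields[9] != "*" and len(fields[9]) < min_length:
            continue  # read is too short to map reliably, so drop it
        yield line


# Made-up records for illustration.
example = [
    "@HD\tVN:1.0",
    "long_read\t0\tchr1\t1\t60\t25M\t*\t0\t0\t" + "A" * 25 + "\t" + "I" * 25,
    "short_read\t528\tchr7\t111016499\t0\t1M\t*\t0\t0\tC\t*",
]
for record in drop_short_reads(example):
    print(record.split("\t")[0])
```

            With real data, the input lines could come from `samtools view -h` so that the header is preserved in the output.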


            • #7
              Thanks, I'll probably remove short reads like you suggest as they are likely to do more harm than good!


              • #8
                I too am seeing this error with GATK (v1.4-5-g253a07f) during indel realignment. I had never encountered it until today: 24 out of 28 files processed fine, but 4 of them fail prematurely with a 'malformed' BAM error on entries that are supposedly missing the quality score but have between 30 and 68 bases.
