  • Novoalign with GATK recalibration

    Hey everyone,

    I've been using Novoalign to map Illumina reads (TruSeq capture, HiSeq paired-end sequencing), then running GATK base quality recalibration in the hope of getting better results. Strangely, after GATK base quality recalibration both ends of the reads show very high reported quality scores. Novoalign has its own recalibration, which does not produce this effect by itself, but if GATK base quality recalibration is then applied, very high quality scores are once again observed at both ends.

    These quality scores, particularly at the 3' end, don't seem right for this data. In addition, the same effect is not seen with a BWA alignment of the same data. I have seen it on every dataset I have tried so far (all TruSeq, HiSeq). As the -H option was set in Novoalign, many reads were trimmed (some to as few as 16 bases), but the effect is still observed after removing all trimmed reads.

    Novoalign mapped reads (without Novoalign recalibration) before GATK recalibration


    Novoalign mapped reads (without Novoalign recalibration) after GATK recalibration


    Novoalign mapped reads (with Novoalign recalibration) before GATK recalibration


    Novoalign mapped reads (with Novoalign recalibration) after GATK recalibration


    BWA mapped reads before GATK recalibration


    BWA mapped reads after GATK recalibration


    My pipeline has been:
    alignment, sort/order, FastQC, duplicate removal (MarkDuplicates), GATK base quality recalibration, FastQC. I would have preferred to run the first FastQC step after duplicate removal, but this order was easier to implement in this case, and I don't think it is hiding anything (duplicate levels ~17%).

    Any enlightenment would be appreciated.
    Last edited by trickytank; 08-23-2011, 04:16 PM. Reason: labels wrong on figures

  • #2
    Hi Trickytank,

    We did a similar study but looked at dbSNP concordance rather than the FASTQC quality profile.

    I think your figure legends (from top to bottom) in figures 2 and 4 probably mean "After" calibration. Is that right?

    This is quite an interesting observation. I wonder why GATK would do this.


    • #3
      Originally posted by zee:
      I think your figure legends (from top to bottom) in figures 2 and 4 probably mean "After" calibration. Is that right?

      Thanks, I've fixed that now.

      Do you have a link/article for your study?


      • #4
        Trickytank,
        There could be a simple explanation. Novoalign can clip alignments, trimming them back to the best local alignment. This means a mismatch in the last few bases is likely to be clipped.
        Novoalign's quality calibration works on alignments before clipping, so it won't show this effect.
        Clipping is done to improve the accuracy of SNP calling. With dynamic programming algorithms like Smith-Waterman and Needleman-Wunsch there are often suboptimal alignments that differ only slightly in score from the optimal alignment. This happens especially near the ends of alignments. For example, a true indel of 1 bp in the last few bp of a read may be aligned as mismatches. The clipping ensures there are enough matching bases after a SNP or indel for the alignment to be optimal.
        Clipping can be turned off with the option -o FULLNW

        Colin
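
To make the clipping explanation above concrete, here is a minimal Python sketch of trimming an alignment back to its best-scoring local span. It is not Novoalign's actual algorithm: the function name `best_local_span`, the +1/-3 scores, and the toy sequences are all assumptions chosen for illustration.

```python
def best_local_span(is_match, match_score=1, mismatch_penalty=-3):
    """Return (start, end) of the best-scoring contiguous run, end exclusive.

    Scores are illustrative assumptions, not Novoalign's scoring scheme.
    """
    best_score, best_span = 0, (0, 0)
    run_score, run_start = 0, 0
    for i, matched in enumerate(is_match):
        if run_score <= 0:
            # A run that has dropped to zero or below cannot help; restart here.
            run_score, run_start = 0, i
        run_score += match_score if matched else mismatch_penalty
        if run_score > best_score:
            best_score, best_span = run_score, (run_start, i + 1)
    return best_span


# Toy reference; the sample is assumed to carry a 1 bp deletion at index 16.
ref = "ACGTTGCAAGCTTAGGCTACG"
read = ref[:16] + ref[17:]  # 20-base read spanning the deletion

# Ungapped comparison: every base after the deletion is shifted by one.
is_match = [r == g for r, g in zip(read, ref)]
start, end = best_local_span(is_match)
print(f"aligned span: [{start}, {end}); clipped 3' bases: {len(read) - end}")
```

In this toy case the four shifted bases after the deletion show up as mismatches in the ungapped comparison and are clipped, whereas a single mismatch in the middle of a read is kept because the matching bases on both sides outweigh its penalty.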


        • #5
          Hey thanks for that. By the sounds of it I would be better off not using:
          -o FULLNW

          I found that the number of SNP variants changes very little (fewer than 30 of ~100,000, conditioning on depth >4) when GATK recalibration is applied after Novoalign recalibration.
          With BWA, GATK recalibration changes the SNP variants by around 1,000 to 2,000. I'm thinking of just not using GATK recalibration on Novoalign runs.
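
The comparison above (how many SNP calls change, conditioning on depth >4) can be sketched as a set difference. This is only a minimal sketch: the dictionary layout of a call and the names `snp_keys` and `count_changed` are hypothetical, and the toy calls are made up for illustration, not taken from the data in this thread.

```python
def snp_keys(calls, depth_cutoff=4):
    """Set of (chrom, pos, ref, alt) for calls with depth above the cutoff."""
    return {
        (c["chrom"], c["pos"], c["ref"], c["alt"])
        for c in calls
        if c["depth"] > depth_cutoff
    }


def count_changed(calls_a, calls_b, depth_cutoff=4):
    """Number of SNPs present in exactly one of the two call sets."""
    return len(snp_keys(calls_a, depth_cutoff) ^ snp_keys(calls_b, depth_cutoff))


# Made-up calls, for illustration only.
without_gatk = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "depth": 12},
    {"chrom": "chr1", "pos": 250, "ref": "C", "alt": "T", "depth": 3},
    {"chrom": "chr2", "pos": 75, "ref": "G", "alt": "A", "depth": 8},
]
with_gatk = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "depth": 12},
    {"chrom": "chr2", "pos": 90, "ref": "T", "alt": "C", "depth": 6},
]
print(count_changed(without_gatk, with_gatk))
```

Counting the symmetric difference treats a SNP that disappears and a SNP that newly appears as one change each; the low-depth call at chr1:250 is ignored by the depth filter.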



            • #7
              I'm going to try the -o FULLNW option to see if it removes what I have observed. I'll post my results here.


              • #8
                To clarify: does this mean that, by default, Novoalign clips mismatches (bases not seen in the reference index) at the ends of reads?


                • #9
                  Yes, mismatches near the ends of the alignment will be clipped so that the best local alignment is reported. This wouldn't seem right if all we had were SNPs, but if our sample includes indels and structural variations and these occur near the ends of the read, they may get aligned as mismatches. This can then cause erroneous SNP calls. Clipping avoids this problem and improves the specificity of SNP and indel calls, but it may reduce sensitivity a bit.
                  It would be interesting to see the effect of clipping on dbSNP concordance; we haven't done this yet.
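
As a toy illustration of how clipping could produce the FastQC pattern discussed in this thread: if end mismatches are soft-clipped before recalibration, the remaining end bases look error-free, so an empirical quality estimate for those cycles rises. The sketch below assumes a simple per-cycle mismatch count with +1 smoothing; it is not GATK's actual recalibration model, and `phred` and `per_cycle_quality` are hypothetical names.

```python
import math


def phred(mismatches, observations):
    """Phred-scaled error estimate; the +1 smoothing is an assumption."""
    return -10 * math.log10((mismatches + 1) / (observations + 1))


def per_cycle_quality(alignments, read_length):
    """alignments: list of (is_match, (start, end)); bases outside
    [start, end) are treated as soft-clipped and ignored."""
    quals = []
    for cycle in range(read_length):
        obs = mis = 0
        for is_match, (start, end) in alignments:
            if start <= cycle < end:
                obs += 1
                mis += not is_match[cycle]
        quals.append(phred(mis, obs))
    return quals


# 100 toy reads of length 10; 10 of them mismatch at the last cycle.
good = [True] * 10
bad_end = [True] * 9 + [False]
unclipped = [(good, (0, 10))] * 90 + [(bad_end, (0, 10))] * 10
clipped = [(good, (0, 10))] * 90 + [(bad_end, (0, 9))] * 10

q_full = per_cycle_quality(unclipped, 10)
q_clip = per_cycle_quality(clipped, 10)
print(f"last cycle, unclipped: {q_full[-1]:.1f}; clipped: {q_clip[-1]:.1f}")
```

With the mismatching end bases clipped, the last cycle has fewer observations but no observed errors, so its estimated quality is higher than in the unclipped case, while the interior cycles are unchanged.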


                  • #10
                    Using the -o FULLNW option

                    With the -o FULLNW option, the FastQC plots are no longer worrying.

                    Novoalign with recalibration and -o FULLNW option, before GATK recalibration BAM file:

                    By trickytank at 2011-09-05

                    Novoalign with recalibration and -o FULLNW option, after GATK recalibration BAM file:

                    By trickytank at 2011-09-05

                    I was under the impression that BAQ, as implemented in SAMtools, is designed to overcome the problem of misalignments caused by indels near the ends of reads, and shouldn't affect sensitivity as much as clipping at the alignment stage? (Local realignment around indels seems like an alternative too.)


                    • #11
                      Local realignment should help if you use -o FullNW, though I haven't looked into this. It would be interesting to see the effect on dbSNP concordance.
                      We added soft clipping before these tools were readily available.
