Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GATK: read counts varying between programs

    Hi all,

    I am doing an exome analysis with BWA 0.6.1-r104 and GATK v2.2-8-gec077cd.
    I have paired end reads, my protocol until now is (in brief, omitting options etc.)

    bwa aln R1.fastq
    bwa aln R2.fastq
    bwa sampe R1.sai R2.sai
    picard/CleanSam.jar
    picard/SortSam.jar
    picard/MarkDuplicates.jar
    picard/AddOrReplaceReadGroups.jar
    picard/BuildBamIndex.jar
    GATK -T RealignerTargetCreator -known dbsnp.vcf
    GATK -T IndelRealigner -known dbsnp.vcf
    GATK -T BaseRecalibrator -knownSites dbsnp.vcf
    GATK -T PrintReads


    A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

    I have 85767226 paired end = 171534452 sequences in fastQ file

    BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.



    MarkDuplicates reports:

    Read 165619516 records. 2 pairs never matched.
    Marking 20272927 records as duplicates.
    Found 2919670 optical duplicate clusters.

    so nearly 6 million reads seem to miss.



    CreateTargets MicroScheduler reports

    35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
    -> 428072 reads (0.26% of total) failing BadMateFilter
    -> 16077607 reads (9.65% of total) failing DuplicateReadFilter
    -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

    so nearly 5 million reads seem to miss



    The Realigner MicroScheduler reports

    0 reads were filtered out during traversal out of 171551640 total (0.00%)

    which appears a miracle to me since
    1) there are even more reads now than input sequences,
    2) all those crappy reads reported by CreateTargets do not appear.



    From Base recalibration MicroScheduler, I get

    41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
    -> 16010068 reads (9.32% of total) failing DuplicateReadFilter
    -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

    ..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.


    I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?


    Thanks for any comments!

  • #2
    Originally posted by stephlo View Post
    Hi all,

    I am doing an exome analysis with BWA 0.6.1-r104 and GATK v2.2-8-gec077cd.
    I have paired end reads, my protocol until now is (in brief, omitting options etc.)

    bwa aln R1.fastq
    bwa aln R2.fastq
    bwa sampe R1.sai R2.sai
    picard/CleanSam.jar
    picard/SortSam.jar
    picard/MarkDuplicates.jar
    picard/AddOrReplaceReadGroups.jar
    picard/BuildBamIndex.jar
    GATK -T RealignerTargetCreator -known dbsnp.vcf
    GATK -T IndelRealigner -known dbsnp.vcf
    GATK -T BaseRecalibrator -knownSites dbsnp.vcf
    GATK -T PrintReads


    A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

    I have 85767226 paired end = 171534452 sequences in fastQ file

    BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.



    MarkDuplicates reports:

    Read 165619516 records. 2 pairs never matched.
    Marking 20272927 records as duplicates.
    Found 2919670 optical duplicate clusters.

    so nearly 6 million reads seem to miss.



    CreateTargets MicroScheduler reports

    35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
    -> 428072 reads (0.26% of total) failing BadMateFilter
    -> 16077607 reads (9.65% of total) failing DuplicateReadFilter
    -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

    so nearly 5 million reads seem to miss



    The Realigner MicroScheduler reports

    0 reads were filtered out during traversal out of 171551640 total (0.00%)

    which appears a miracle to me since
    1) there are even more reads now than input sequences,
    2) all those crappy reads reported by CreateTargets do not appear.



    From Base recalibration MicroScheduler, I get

    41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
    -> 16010068 reads (9.32% of total) failing DuplicateReadFilter
    -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

    ..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.


    I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?


    Thanks for any comments!
    Hi Stephio,

    Just a comment. I use a similar approach - no CleanSam step used - for exome capture analysis and got comparable results. IMHO, the protocol We are currently following is standard and sound. At the same time, I found no logical explanation to the variation in the no. of read counts. I am also concerned about the high no. of failing MappingQualityZeroFilter reads. Any logical/clear explanation for that?


    Cheers,

    Fernando

    Comment


    • #3
      Hi Fernando,

      I posted the same question on the GATK forum and was told not to worry as long as those counts are "roughly silimar" :




      Best,

      Stephan

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM
      • seqadmin
        Techniques and Challenges in Conservation Genomics
        by seqadmin



        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

        Avian Conservation
        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
        03-08-2024, 10:41 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 06:37 PM
      0 responses
      10 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, Yesterday, 06:07 PM
      0 responses
      9 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 03-22-2024, 10:03 AM
      0 responses
      51 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 03-21-2024, 07:32 AM
      0 responses
      67 views
      0 likes
      Last Post seqadmin  
      Working...
      X