Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bwa GATK4 read group issues

    Hello,
    I am trying to run GATK4 HaplotypeCaller on bam files from single individuals generated with bwa mem on Ubuntu 18 with
    Code:
    $gatk HaplotypeCaller -R Tse_SBAPGDGG_D.fa -I AHP2746_sorted_dedup.bam -ERC GVCF -O AHP2746.gvcf
    But I get the error:
    A USER ERROR has occurred: Argument --emit-ref-confidence has a bad value: Can only be used in single sample mode currently. Use the --sample-name argument to run on a single sample out of a multi-sample BAM file.
    I assume this is a read group error.
    if I run
    Code:
    samtools view -H AHP2746_sorted_dedup.bam | grep @RG
    I get nothing back.
    If I run
    Code:
    samtools view -H AHP2746_sorted_dedup.bam
    I get a long list of headers and the bottom look like this:
    Code:
    @SQ     LN:500  SN:scaffold_42026
    @SQ     SN:scaffold_42025       LN:500
    @SQ     SN:scaffold_42022       LN:500
    @SQ     SN:scaffold_42019       LN:500
    @SQ     SN:scaffold_42018       LN:500
    @SQ     LN:500  SN:scaffold_42017
    @SQ     SN:scaffold_42016       LN:500
    @SQ     SN:scaffold_42015       LN:500
    @SQ     SN:scaffold_42014       LN:500
    @SQ     SN:scaffold_42013       LN:500
    @SQ     SN:scaffold_42012       LN:500
    @SQ     SN:scaffold_42011       LN:500
    @SQ     SN:scaffold_42010       LN:500
    @SQ     SN:scaffold_42009       LN:500
    @SQ     SN:scaffold_42008       LN:500
    @SQ     SN:scaffold_42006       LN:500
    @SQ     SN:scaffold_42005       LN:500
    @SQ     SN:scaffold_42004       LN:500
    @SQ     SN:scaffold_42003       LN:500
    @SQ     SN:scaffold_42002       LN:500
    @SQ     SN:scaffold_42001       LN:500
    @SQ     SN:scaffold_42000       LN:500
    @SQ     SN:scaffold_41998       LN:500
    @SQ     SN:scaffold_41997       LN:500
    @SQ     SN:scaffold_41996       LN:500
    @SQ     SN:scaffold_41995       LN:500
    @PG     PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 32 Tse_SBAPGDGG_D AHP2746_L003_R1_001.fastq.gz AHP2746_L003_R2_001.fastq.gz ID:bwa
    @PG     CL:elprep filter /dev/stdin AHP2746_sorted_dedup.bam --filter-unmapped-reads --mark-duplicates --mark-optical-duplicates AHP2746_output.metrics --optical-duplicates-pixel-distance 100 --remove-duplicates --sorting-order coordinate    PP:bwa  ID:elprep 4.1.3 PN:elprep       VN:4.1.3        DS:http://github.com/exascience/elprep
    I assume my bam files have no read group defined (I did not use the bwa mem -R option).
    I then tried running bwa mem with -R
    Code:
    bwa mem -M -t 32 -R '@RG\tID:sample_1\tSM:AHP2746\tPL:illumina\tPU:lane1\tPI:420' Tse_SBAPGDGG_D AHP2746_L003_R1_001.fastq.gz AHP2746_L003_R2_001.fastq.gz > bwa-mem-Rtest_AHP2746.bam
    the analyses completed, but I am using elprep to dedupe and order the bam files and elprep throws the error "2019/07/23 10:07:46 gzip: invalid header in NewBGZFReader"
    my elprep command:
    Code:
    elprep filter bwa-mem-Rtest_AHP2746.bam bwa-mem-Rtest_AHP2746-elprep.bam --mark-duplicates --mark-optical-duplicates output.metrics --remove-duplicates --sorting-order coordinate --nr-of-threads 32

    Anyone know what I am doing wrong here?

    Additional info:
    reference genome was generated and assembled with linked reads (10X) sequenced on a NovaSeq and Hi-C data.
    the resequencing data were sequenced on HiSeq400 and some on NovaSeq, but all have same issue.
    I am using the latest software versions available with conda installs.

  • #2
    Hi,

    Quite a while since you posted this so if you reached a solution that'd be great to hear.
    Either way I'm having the same problem and thought I'd add some things to the
    discussion.

    I am however not sure this is related to the RG tag. According to SAM format
    specifications, the BC tag is where sample ID is stored so my guess is that is has to do
    with this.

    However, in my case I haven't specified any BC tag at all and I wonder if this could be
    the problem.

    SAM format specs: https://samtools.github.io/hts-specs/SAMv1.pdf

    Comment


    • #3
      Okay I managed to solve my issue; so anyone reading this can try it out and judge for themselves if it works for them.


      Problem:

      The problem is that gatk assumes you've set the RG tag in all your SAM/BAM/CRAM files (which no one but broad institute actually do for single sample libraries). Looking at the current state of gatk HaplotypCaller at their GitHub it seems it will count the number of RG tags in your SAM header and then return the error message from the original post if this number is not exactly equal to one (including cases where it is zero).

      Solution:

      Add an RG tag to you input file (for example with picardtools AddOrReplaceReadGroups).

      Solution complications:

      I had a problem where I got instead got "invalid uncompressedLength" when running but from running picard ValidateSamFile it warned me my index file was older than the original file so I re-indexed the RG-tagged file, where it finishes without problems.

      Commands used:

      NB: I'm using picard installed via conda, if you're using the .jar file directly just replace "picard" with java --jar path/to/picardtools.jar (or gatk with java --jar path/to/gatk.jar).

      Code:
      picard AddOrReplaceReadGroups I=reads.bam O=reads.tagged.bam RGLB=lib1 RGPU=unit1 RGSM=20 RGPL=ILLUMINA RGID=4c
      
      samtools index reads.tagged.bam 
      
      gatk HaplotypeCaller -R ref.fa -I reads.tagged.bam -ERC GVCF -O vars.vcf
      PS. I noticed the BC tag I mentioned earlier is defined under the RG tag so scratch what I wrote about that in the first post.

      Comment


      • #4
        Okay I managed to solve my issue which I'll post here if someone else finds it useful in the future.

        RL; DR:

        Add an arbitrary RG tag.

        Problem:

        From reading at their github it seems gatk HaplotypeCaller will call on a utils function to assert the number of samples, which it does by counting the number of RG entries in the header. HaplotypCaller then exits with the error message if this count is not equal to 1, which seemingly also includes when it the utils function returns zero.

        Solution:

        Add an RG tag to your header and tag all reads with the RG tag. The fastest solution is to do this with picardtools AddOrReplaceReadGroups.

        Solution complication:

        When running gatk HaplotypeCaller on my RG tagged file I got an error saying the uncompression length was invalid. I ran picardtools ValidateSamFile which warned that the index file was older than the BAM file, and after rerunning the indexing everything seemingly worked great (using samtools index).

        Code:

        NB: I'm running picardtools/gatk installed through conda, if you're using the .jar file directly just exchange picard/gatk with java --jar path/to/jar-file.jar

        Code:
        picard AddOrReplaceReadGroups I=reads.bam O=reads.tag.bam RGID=4 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=20 
        
        samtools index reads.tag.bam
        
        gatk HaplotypeCaller I=reads.tag.bam -R ref.fa -ERC GVCF -O vars.vcf
        PS. I noticed the BC tag is part of the RG tag, so disregard that part of my prior answer.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM
        • seqadmin
          Techniques and Challenges in Conservation Genomics
          by seqadmin



          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

          Avian Conservation
          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
          03-08-2024, 10:41 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Yesterday, 06:37 PM
        0 responses
        10 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, Yesterday, 06:07 PM
        0 responses
        9 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-22-2024, 10:03 AM
        0 responses
        49 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-21-2024, 07:32 AM
        0 responses
        67 views
        0 likes
        Last Post seqadmin  
        Working...
        X