Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • CIGAR error when running htseq-count after BBMap

    I am getting the following types of errors when running htseq-count on a SAM file generated with bbmap.sh

    Code:
    Error occured when reading beginning of SAM/BAM file.
      ("Illegal CIGAR string '66='", 'line 81726 of file bb_mapped.sam')
      [Exception type: ValueError, raised in _HTSeq.pyx:1116]
    The above SAM file was generated with a command like

    Code:
    bbmap.sh build=2 in=stdin.fq bamscript=bam.sh maxindel=200000 ambiguous=toss \
     usejni=t outu=bb_unm apped.sam outm=bb_mapped.sam rpkm=rpkm.txt -Xmx28g
    The htseq-count command I used was

    Code:
    htseq-count -i gene bb_mapped.sam genome.gff > ! counts.txt
    Any pointers greatly appreciated!

  • #2
    See this for a solution: https://www.biostars.org/p/182156/ You won't need to re-do the alignment but you would have to reformat your bam files.

    Comment


    • #3
      Originally posted by GenoMax View Post
      See this for a solution: https://www.biostars.org/p/182156/ You won't need to re-do the alignment but you would have to reformat your bam files.
      Perfect! Thank you. For some reason I thought 1.3 was the default and so I had only experimented with the 1.4 flag.

      Comment


      • #4
        I assume reformat.sh command will work (can you confirm, if you try it?). I have been using sam=1.3 to avoid this issue.

        Comment


        • #5
          Originally posted by GenoMax View Post
          I assume reformat.sh command will work (can you confirm, if you try it?). I have been using sam=1.3 to avoid this issue.
          Yes, reformat.sh does work.

          Comment


          • #6
            On a related note, I encountered this issue even after reformat.

            The sam file was generated by BBMap,
            Code:
            bbmap.sh -Xmx32g in1=r1.BBDuk.fastq.gz in2=r2.BBDuk.fastq.gz maxindel=2000 outm=mapped.sam outu=unmapped.sam ehist=ehist
            I used reformat to convert the CIGAR to samtools v1.3,
            Code:
            reformat.sh sam=1.3 in=in.sam out=reformat.sam
            However, all the reads were still unassigned when I ran featureCounts,
            Code:
            featureCounts -T 4 -p -C -O -a ref.gtf -R -o output reformat.sam
            Here's one example read from the BBMap-generated SAM,
            HTML Code:
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 1:N:0:CAGATC    83      Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02      17851283 44       101M    =       17851177        -207    TGCTGCAAATGCATTTGTTCCTCGTAAACGTCCCAATACGGCTGGTAGAGTTTCAGTGGAACATCCAAACGGTGAACATCCTTCAAGGACATTATTTGTTA   8:DDDECC@@>A=DDDDB=/@=;6BCFFEE?23CA@@@F@JIIEHFE3JJIGHF>HGGGGDCBIG?GEE?GIIIGIEJGGGIJGGJJJHFGGGFFFFFCBB     NM:i:0  AM:i:44
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 1:N:0:CAGATC    163     Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02      17851177 44       101M    =       17851283        207     TCGAAGAGTGTGATGTCTTTTGCACGGGAGGGGGTATGGAGTTGGATGTTGAATCCCAAGATAATCATGCTGTTGATGCGTCAGGGATGCAGATTTCTGAT   @;;DD;DDBDBFFEH>FGH>@HHHHHI<AGIIID7;@FHADGA4C=CEEHACFFFFFEECEC@;@@CCCCCCC@C:>>(88><BC<89??>>3>C>ADC:A     NM:i:0  AM:i:44
            To makre sure this read was not mapped to the unannotated region, I used to same featureCounts command with a HiSat2-generated BAM and that same read was assigned.

            HTML Code:
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 83      Chr2    17851283        60      101M    =       17851177        -207    TGCTGCAAATGCATTTGTTCCTCGTAAACGTCCCAATACGGCTGGTAGAGTTTCAGTGGAACATCCAAACGGTGAACATCCTTCAAGGACATTATTTGTTA     8:DDDECC@@>A=DDDDB=/@=;6BCFFEE?23CA@@@F@JIIEHFE3JJIGHF>HGGGGDCBIG?GEE?GIIIGIEJGGGIJGGJJJHFGGGFFFFFCBB     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:0  YT:Z:CP NH:i:1
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 163     Chr2    17851177        60      101M    =       17851283        207     TCGAAGAGTGTGATGTCTTTTGCACGGGAGGGGGTATGGAGTTGGATGTTGAATCCCAAGATAATCATGCTGTTGATGCGTCAGGGATGCAGATTTCTGAT     @;;DD;DDBDBFFEH>FGH>@HHHHHI<AGIIID7;@FHADGA4C=CEEHACFFFFFEECEC@;@@CCCCCCC@C:>>(88><BC<89??>>3>C>ADC:A     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:0  YT:Z:CP NH:i:1
            The CIGARs are identical (101M) in both SAM and BAM, so I wasn't sure what may be the cause of this issue. Has anyone encounted this before? Thank you for any advice and input.

            Comment


            • #7
              @chiyai: What version of BBMap are you using?

              Comment


              • #8
                @GenoMax
                BBMap version 36.62
                featureCounts v1.5.1
                Thank you.

                Comment


                • #9
                  Hi chiayi,

                  in your BBmap index, the naming of the chromosome is "Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02", whilst in the Hisat alignment it's just "Chr2". Either you rename each alignment in the bbamp results, or redo the alignment with a reduced fasta header, or rename the chromosmes in your gtf accordingly.

                  Cheers,

                  Michael

                  Comment


                  • #10
                    @Michael: Good catch. Not cleaning up fasta headers comes back and bites each time :-)

                    @Chiayi: Don't leave any spaces in fasta headers since they always cause problems with one thing or other.

                    Comment


                    • #11
                      @Michael: I used sed to clean up the chromosome name in each alignment in the SAM. featureCounts works fine after that. Thank you so much.
                      @GenoMax: Thanks for the suggestion.
                      BTW, does reformat.sh only clean fasta header, or also clean sam/bam like my case here?

                      Comment


                      • #12
                        Originally posted by chiayi View Post
                        BTW, does reformat.sh only clean fasta header, or also clean sam/bam like my case here?
                        reformat in your command iteration is not touching the headers it only converts CIGAR strings to SAM 1.3 format (bbmap uses SAM format 1.4 by default). featureCounts and HTSeq-count don't understand 1.4 format (as of last time I was looking at that). BTW: Samtools v.1.3.x is unrelated to issue of SAM format v.1.3.

                        Comment


                        • #13
                          Originally posted by GenoMax View Post
                          reformat in your command iteration is not touching the headers it only converts CIGAR strings to SAM 1.3 format (bbmap uses SAM format 1.4 by default). featureCounts and HTSeq-count don't understand 1.4 format (as of last time I was looking at that). BTW: Samtools v.1.3.x is unrelated to issue of SAM format v.1.3.
                          I was referring to the trd (trimreaddescription) option in reformat. I thought that would clean the fasta header after the first space, or did I misunderstand?

                          Comment


                          • #14
                            Originally posted by chiayi View Post
                            I was referring to the trd (trimreaddescription) option in reformat. I thought that would clean the fasta header after the first space, or did I misunderstand?
                            Discrepancy in your case was with the genome fasta headers and not read names. You could have done trd on fasta genome file to trim names after the first space and then made the index.

                            I suspect that option won't work when processing sam/bam files. Even if it did, it is for read names and not genome headers. @Brian will have to confirm.

                            Comment


                            • #15
                              Yep - "trd" can be used on the fasta file prior to mapping, or the parameter "trd" can be added during mapping. However, once the reads are mapped, Reformat will not change the "rname" field of the sam file, just the "qname". That's a good idea, though - I'll change it so that it can do that.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X