SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count error, AssertionError saskak Bioinformatics 7 07-06-2015 04:04 AM
Error in htseq-count diego diaz Bioinformatics 4 04-08-2015 01:02 AM
Error in htseq-count. Large gaps in CIGAR string usha Bioinformatics 1 09-19-2014 12:21 PM
Error when using HTSeq count shocker8786 Bioinformatics 2 02-19-2014 10:10 PM
htseq-count error sissi Bioinformatics 0 03-20-2012 11:40 PM

Reply
 
Thread Tools
Old 03-17-2016, 05:44 AM   #1
sunkid
Member
 
Location: Southern California

Join Date: Nov 2014
Posts: 16
Default CIGAR error when running htseq-count after BBMap

I am getting the following types of errors when running htseq-count on a SAM file generated with bbmap.sh

Code:
Error occured when reading beginning of SAM/BAM file.
  ("Illegal CIGAR string '66='", 'line 81726 of file bb_mapped.sam')
  [Exception type: ValueError, raised in _HTSeq.pyx:1116]
The above SAM file was generated with a command like

Code:
bbmap.sh build=2 in=stdin.fq bamscript=bam.sh maxindel=200000 ambiguous=toss \
 usejni=t outu=bb_unm apped.sam outm=bb_mapped.sam rpkm=rpkm.txt -Xmx28g
The htseq-count command I used was

Code:
htseq-count -i gene bb_mapped.sam genome.gff > ! counts.txt
Any pointers greatly appreciated!
sunkid is offline   Reply With Quote
Old 03-17-2016, 05:52 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

See this for a solution: https://www.biostars.org/p/182156/ You won't need to re-do the alignment but you would have to reformat your bam files.
GenoMax is offline   Reply With Quote
Old 03-17-2016, 06:22 AM   #3
sunkid
Member
 
Location: Southern California

Join Date: Nov 2014
Posts: 16
Default

Quote:
Originally Posted by GenoMax View Post
See this for a solution: https://www.biostars.org/p/182156/ You won't need to re-do the alignment but you would have to reformat your bam files.
Perfect! Thank you. For some reason I thought 1.3 was the default and so I had only experimented with the 1.4 flag.
sunkid is offline   Reply With Quote
Old 03-17-2016, 06:32 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

I assume reformat.sh command will work (can you confirm, if you try it?). I have been using sam=1.3 to avoid this issue.
GenoMax is offline   Reply With Quote
Old 03-17-2016, 01:09 PM   #5
sunkid
Member
 
Location: Southern California

Join Date: Nov 2014
Posts: 16
Default

Quote:
Originally Posted by GenoMax View Post
I assume reformat.sh command will work (can you confirm, if you try it?). I have been using sam=1.3 to avoid this issue.
Yes, reformat.sh does work.
sunkid is offline   Reply With Quote
Old 12-14-2016, 08:20 PM   #6
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

On a related note, I encountered this issue even after reformat.

The sam file was generated by BBMap,
Code:
bbmap.sh -Xmx32g in1=r1.BBDuk.fastq.gz in2=r2.BBDuk.fastq.gz maxindel=2000 outm=mapped.sam outu=unmapped.sam ehist=ehist
I used reformat to convert the CIGAR to samtools v1.3,
Code:
reformat.sh sam=1.3 in=in.sam out=reformat.sam
However, all the reads were still unassigned when I ran featureCounts,
Code:
featureCounts -T 4 -p -C -O -a ref.gtf -R -o output reformat.sam
Here's one example read from the BBMap-generated SAM,
HTML Code:
HWI-ST975:91:C49TLACXX:7:1101:2847:2191 1:N:0:CAGATC    83      Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02      17851283 44       101M    =       17851177        -207    TGCTGCAAATGCATTTGTTCCTCGTAAACGTCCCAATACGGCTGGTAGAGTTTCAGTGGAACATCCAAACGGTGAACATCCTTCAAGGACATTATTTGTTA   8:DDDECC@@>A=DDDDB=/@=;6BCFFEE?23CA@@@F@JIIEHFE3JJIGHF>HGGGGDCBIG?GEE?GIIIGIEJGGGIJGGJJJHFGGGFFFFFCBB     NM:i:0  AM:i:44
HWI-ST975:91:C49TLACXX:7:1101:2847:2191 1:N:0:CAGATC    163     Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02      17851177 44       101M    =       17851283        207     TCGAAGAGTGTGATGTCTTTTGCACGGGAGGGGGTATGGAGTTGGATGTTGAATCCCAAGATAATCATGCTGTTGATGCGTCAGGGATGCAGATTTCTGAT   @;;DD;DDBDBFFEH>FGH>@HHHHHI<AGIIID7;@FHADGA4C=CEEHACFFFFFEECEC@;@@CCCCCCC@C:>>(88><BC<89??>>3>C>ADC:A     NM:i:0  AM:i:44
To makre sure this read was not mapped to the unannotated region, I used to same featureCounts command with a HiSat2-generated BAM and that same read was assigned.

HTML Code:
HWI-ST975:91:C49TLACXX:7:1101:2847:2191 83      Chr2    17851283        60      101M    =       17851177        -207    TGCTGCAAATGCATTTGTTCCTCGTAAACGTCCCAATACGGCTGGTAGAGTTTCAGTGGAACATCCAAACGGTGAACATCCTTCAAGGACATTATTTGTTA     8:DDDECC@@>A=DDDDB=/@=;6BCFFEE?23CA@@@F@JIIEHFE3JJIGHF>HGGGGDCBIG?GEE?GIIIGIEJGGGIJGGJJJHFGGGFFFFFCBB     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:0  YT:Z:CP NH:i:1
HWI-ST975:91:C49TLACXX:7:1101:2847:2191 163     Chr2    17851177        60      101M    =       17851283        207     TCGAAGAGTGTGATGTCTTTTGCACGGGAGGGGGTATGGAGTTGGATGTTGAATCCCAAGATAATCATGCTGTTGATGCGTCAGGGATGCAGATTTCTGAT     @;;DD;DDBDBFFEH>FGH>@HHHHHI<AGIIID7;@FHADGA4C=CEEHACFFFFFEECEC@;@@CCCCCCC@C:>>(88><BC<89??>>3>C>ADC:A     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:0  YT:Z:CP NH:i:1
The CIGARs are identical (101M) in both SAM and BAM, so I wasn't sure what may be the cause of this issue. Has anyone encounted this before? Thank you for any advice and input.
chiayi is offline   Reply With Quote
Old 12-15-2016, 03:23 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

@chiyai: What version of BBMap are you using?
GenoMax is offline   Reply With Quote
Old 12-15-2016, 03:45 AM   #8
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

@GenoMax
BBMap version 36.62
featureCounts v1.5.1
Thank you.
chiayi is offline   Reply With Quote
Old 12-15-2016, 03:57 AM   #9
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 121
Default

Hi chiayi,

in your BBmap index, the naming of the chromosome is "Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02", whilst in the Hisat alignment it's just "Chr2". Either you rename each alignment in the bbamp results, or redo the alignment with a reduced fasta header, or rename the chromosmes in your gtf accordingly.

Cheers,

Michael
Michael.Ante is offline   Reply With Quote
Old 12-15-2016, 04:23 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

@Michael: Good catch. Not cleaning up fasta headers comes back and bites each time :-)

@Chiayi: Don't leave any spaces in fasta headers since they always cause problems with one thing or other.
GenoMax is offline   Reply With Quote
Old 12-15-2016, 05:39 AM   #11
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

@Michael: I used sed to clean up the chromosome name in each alignment in the SAM. featureCounts works fine after that. Thank you so much.
@GenoMax: Thanks for the suggestion.
BTW, does reformat.sh only clean fasta header, or also clean sam/bam like my case here?
chiayi is offline   Reply With Quote
Old 12-15-2016, 05:53 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

Quote:
Originally Posted by chiayi View Post
BTW, does reformat.sh only clean fasta header, or also clean sam/bam like my case here?
reformat in your command iteration is not touching the headers it only converts CIGAR strings to SAM 1.3 format (bbmap uses SAM format 1.4 by default). featureCounts and HTSeq-count don't understand 1.4 format (as of last time I was looking at that). BTW: Samtools v.1.3.x is unrelated to issue of SAM format v.1.3.
GenoMax is offline   Reply With Quote
Old 12-15-2016, 06:00 AM   #13
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Quote:
Originally Posted by GenoMax View Post
reformat in your command iteration is not touching the headers it only converts CIGAR strings to SAM 1.3 format (bbmap uses SAM format 1.4 by default). featureCounts and HTSeq-count don't understand 1.4 format (as of last time I was looking at that). BTW: Samtools v.1.3.x is unrelated to issue of SAM format v.1.3.
I was referring to the trd (trimreaddescription) option in reformat. I thought that would clean the fasta header after the first space, or did I misunderstand?
chiayi is offline   Reply With Quote
Old 12-15-2016, 06:17 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

Quote:
Originally Posted by chiayi View Post
I was referring to the trd (trimreaddescription) option in reformat. I thought that would clean the fasta header after the first space, or did I misunderstand?
Discrepancy in your case was with the genome fasta headers and not read names. You could have done trd on fasta genome file to trim names after the first space and then made the index.

I suspect that option won't work when processing sam/bam files. Even if it did, it is for read names and not genome headers. @Brian will have to confirm.
GenoMax is offline   Reply With Quote
Old 12-15-2016, 11:43 AM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Yep - "trd" can be used on the fasta file prior to mapping, or the parameter "trd" can be added during mapping. However, once the reads are mapped, Reformat will not change the "rname" field of the sam file, just the "qname". That's a good idea, though - I'll change it so that it can do that.
Brian Bushnell is offline   Reply With Quote
Old 12-15-2016, 11:52 AM   #16
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

Quote:
Originally Posted by Brian Bushnell View Post
Yep - "trd" can be used on the fasta file prior to mapping, or the parameter "trd" can be added during mapping. However, once the reads are mapped, Reformat will not change the "rname" field of the sam file, just the "qname". That's a good idea, though - I'll change it so that it can do that.
If you are going to add that is there a way to make it act only on the genome name (with a new option name for reformat).

I would not want to lose the R1/R2 information in read headers, if trd takes out all things after first space.

Last edited by GenoMax; 12-15-2016 at 11:54 AM.
GenoMax is offline   Reply With Quote
Old 12-15-2016, 01:16 PM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by GenoMax View Post
If you are going to add that is there a way to make it act only on the genome name (with a new option name for reformat).

I would not want to lose the R1/R2 information in read headers, if trd takes out all things after first space.
Unfortunately, the sam specification does not allow retaining R1/R2 information anyway, because R1 and R2 are required to have the same name. You can override this in BBMap with the flag "keepnames", which I find very helpful in a lot of situations, but bear in mind that the result is not a spec-compliant sam/bam file.

Currently, "trd" in BBMap trims after the whitespace for both reads and ref sequences. I guess I could add an option to do them independently, and a similar flag for Reformat.
Brian Bushnell is offline   Reply With Quote
Old 12-15-2016, 02:37 PM   #18
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

@Brian, @GenoMax,
Thanks for all the clarification. It's very helpful.
It turned out my original issue was solely caused by the descripancy in the genome name. The latest version of featureCounts can recognize CIGARs from samtools v1.4. I heared that from Wei and did a test run. It worked.
chiayi is offline   Reply With Quote
Old 12-15-2016, 02:39 PM   #19
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Great to hear; thanks!
Brian Bushnell is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:13 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO