Hello everybody,
I have aligned BAM files (using STAR version 2.6.1a_08-27) and want to quantify them with htseq-count (version 0.9.1). Some BAMs run through perfectly fine, but for others I get the following error:
>>>
....
13300000 SAM alignment record pairs processed.
13400000 SAM alignment record pairs processed.
Error occured when processing SAM input (record #26816395 in file /data/folder/file.bam):
'ascii' codec can't decode byte 0xe4 in position 4: ordinal not in range(128)
[Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]
>>>
When I go back to the BAM file, this specific record (26816395) contains a question mark in the quality string of the read (shown as either ? or ^?). I used to deal with this by simply removing the record, since it was only one or a few reads per file. Now I have 30 or more records with question marks, so I don't want to just remove all of those reads.
The problem is that the character is not REALLY a question mark (I know this because when I grep for '?', nothing comes up). So the question mark appears to be just a placeholder displayed for a character that cannot be shown.
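Since grep for a literal '?' finds nothing, one way to see what is actually there is to look at the raw bytes. Below is a minimal sketch, not a tested fix for these files. It assumes the odd bytes sit in the QUAL column (the 11th tab-separated field of a SAM record), as observed above, and that valid quality characters are the printable range '!' to '~' (bytes 33 to 126). The function names and the script name in the usage line are made up for illustration.

```python
import sys

# Printable range '!'..'~' that a SAM quality string is expected to stay within.
QUAL_MIN, QUAL_MAX = 33, 126


def bad_qual_bytes(sam_line: bytes):
    """Return (offset, byte value) pairs for QUAL bytes outside '!'..'~'.

    Header lines (starting with '@') and lines with fewer than 11 columns
    are skipped and yield an empty list.
    """
    if sam_line.startswith(b"@"):
        return []
    fields = sam_line.rstrip(b"\r\n").split(b"\t")
    if len(fields) < 11:
        return []
    qual = fields[10]
    # Iterating over bytes yields integers, so the comparison is numeric.
    return [(i, b) for i, b in enumerate(qual) if not QUAL_MIN <= b <= QUAL_MAX]


def scan(stream):
    """Yield (line number, read name, bad bytes) for every suspicious record."""
    for n, line in enumerate(stream, start=1):
        bad = bad_qual_bytes(line)
        if bad:
            name = line.split(b"\t", 1)[0].decode("ascii", "replace")
            yield n, name, bad


# Only scan when input is piped in, so the functions can also be imported.
if __name__ == "__main__" and not sys.stdin.isatty():
    for n, name, bad in scan(sys.stdin.buffer):
        shown = ", ".join(f"offset {i}: 0x{b:02x}" for i, b in bad)
        print(f"line {n}\t{name}\t{shown}")
```

Possible usage, with a hypothetical script name: `samtools view /data/folder/file.bam | python find_bad_qual.py`. Reading the input as bytes rather than text avoids the same decoding error that htseq-count runs into. A byte such as 0x7f is what many viewers display as ^?, and a byte such as 0xe4 has no ASCII meaning at all, which would match the error message.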
So my question is: how do I remove non-ASCII characters from the BAM file?
(alternatively: does somebody know where they come from in the first place? Can I redo the alignment using different parameters?)
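For the first question, one option that keeps the reads is to overwrite only the offending quality bytes instead of deleting whole records. The following is a rough sketch under several assumptions: the bad bytes are confined to the QUAL column, qualities are Phred+33 encoded, and replacing a bad byte with '!' (the lowest quality) is acceptable for the downstream counting step. The function name, the placeholder choice, and the script name in the usage line are illustrative, not part of any tool.

```python
import sys

# Assumption: '!' (lowest Phred+33 quality) is an acceptable stand-in value.
PLACEHOLDER = ord("!")


def sanitize_qual(sam_line: bytes) -> bytes:
    """Replace QUAL bytes outside '!'..'~' with PLACEHOLDER.

    Header lines and lines with fewer than 11 columns are returned unchanged.
    Each bad byte is replaced one for one, so the quality string keeps its
    length and still matches the length of the sequence column.
    """
    if sam_line.startswith(b"@"):
        return sam_line
    body = sam_line.rstrip(b"\r\n")
    ending = sam_line[len(body):]  # preserve the original line ending
    fields = body.split(b"\t")
    if len(fields) < 11:
        return sam_line
    fields[10] = bytes(b if 33 <= b <= 126 else PLACEHOLDER for b in fields[10])
    return b"\t".join(fields) + ending


# Only filter when input is piped in, so the function can also be imported.
if __name__ == "__main__" and not sys.stdin.isatty():
    out = sys.stdout.buffer
    for line in sys.stdin.buffer:
        out.write(sanitize_qual(line))
```

Possible usage, with hypothetical script and output names: `samtools view -h /data/folder/file.bam | python sanitize_qual.py | samtools view -b -o fixed.bam -`. Because no record is dropped, read pairs stay intact, which matters for paired-end counting. This only hides the symptom; it does not explain where the bytes come from, so checking the original FASTQ files for the same reads would still be worthwhile.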
I've been trying to find a solution for quite some time and would appreciate any help or workaround.
Thanks!