Hi,
I'm trying to create read count tables from my sam files with HTSeq (v0.6.1) and run into a problem I don't get.
When I run HTSeq-count (htseq-count -r name -s yes -t "Coding gene" -i "name" -f sam foo.sam bar.gff) I encounter two things: (i) a very low percentage of reads matching to features and (ii) half of the reads are without mate (Warning: 13666808 reads with missing mate encountered)
I wrote a simple script checking how many of my read names occur twice in the *.sam file (i.e., checking for mate names) and less than 5% are single mate reads. Additionally, I checked how many read pairs fall inside a feature and found that >40% do.
As many people used HTSeq-count without problems I'm pretty sure I'm doing something wrong and not HTSeq. But I'd like to understand what ...
This is a line from my sam file:
And this a line from my gff file (gff3 produced with bp_genbank2gff script):
I changed some of the settings for testing purposes,e.g., used -r pos instead of name and ran out of memory. This was kind of expected because HTSeq doesn't find mates and therefore keeps each read in memory. I also tried strand=no without any change at all.
Does anyone have a clue what I'm doing wrong?
Regards,
TimK
I'm trying to create read count tables from my sam files with HTSeq (v0.6.1) and run into a problem I don't get.
When I run HTSeq-count (htseq-count -r name -s yes -t "Coding gene" -i "name" -f sam foo.sam bar.gff) I encounter two things: (i) a very low percentage of reads matching to features and (ii) half of the reads are without mate (Warning: 13666808 reads with missing mate encountered)
I wrote a simple script checking how many of my read names occur twice in the *.sam file (i.e., checking for mate names) and less than 5% are single mate reads. Additionally, I checked how many read pairs fall inside a feature and found that >40% do.
As many people used HTSeq-count without problems I'm pretty sure I'm doing something wrong and not HTSeq. But I'd like to understand what ...
This is a line from my sam file:
HWI-ST218:183:H88NHADXX:1:1101:1250:2193 83 chr1 3950365 60 99M1S chr1 3950190 -274 CAACTTCAACGGCCAGAGATAAATACCGCCCTTCCCTTAATATGATTGAAATCAACCAGACCGGAACCCACAACACCACCCCATAAAGCAATACAAACTN DDDDDDDDDDDDDDEDDDEEEEDDBDDDDDDEDEEFFFFFFFHHHHHHJJJJJJJJJJHHJJJJJJJJJIJIIHGHGCJJJJJJJJJHGHHHFFFDD=1# AS:i:99
]
]
chr1 RefSeq Coding gene 286 609 . - . name=chr1_10001;product="transposase"
I changed some of the settings for testing purposes,e.g., used -r pos instead of name and ran out of memory. This was kind of expected because HTSeq doesn't find mates and therefore keeps each read in memory. I also tried strand=no without any change at all.
Does anyone have a clue what I'm doing wrong?
Regards,
TimK
Comment