This is my basic workflow:
DATA and ALIGNMENT:
50SE Illumina reads mapped with tophat2 (v 2.0.11; default setting) to a bowtie2 index of hg38 (all chromosomes plus mitochondria, but no alternate chromosomal assemblies)
outfile reports as follows:
READ COUNTING
1) Generated a "corrected" gtf file from UCSC (i.e., geneID and transcript IDs are different) using the method described here.
2) converted tophat alignment bam to sam (samtools)
3) counted reads with HTSeq using the general command:
My HTSeq output reports as follows:
The result is that I have ~4.5 million unique alignments as counted by HTSeq. That seems low and when I query the bam file,
the output reports 25707742 alignments as unique (NH:i:1)
Can anyone offer a simple explanation of what is happening here? The hand-wavy explanation that I have at the moment is that HTSeq count is very conservative, and if a read cannot be assigned unambiguously, and exclusively to a gene, it is not counted. However, the HTSeq output is secretly making me question the integrity of my pipeline and/or my input files.
Thank you for any advice or reassurance.
David
EDIT: and it occurs to me that it might be relevant that the library prep was RiboMinus. Since I'm used to working with poly-A purified samples, my expectations might be too high in this case.
DATA and ALIGNMENT:
50SE Illumina reads mapped with tophat2 (v 2.0.11; default setting) to a bowtie2 index of hg38 (all chromosomes plus mitochondria, but no alternate chromosomal assemblies)
outfile reports as follows:
Reads:
Input : 29397870
Mapped : 28589633 (97.3% of input)
of these: 5179700 (18.1%) have multiple alignments (612 have >20)
97.3% overall read mapping rate.
Input : 29397870
Mapped : 28589633 (97.3% of input)
of these: 5179700 (18.1%) have multiple alignments (612 have >20)
97.3% overall read mapping rate.
1) Generated a "corrected" gtf file from UCSC (i.e., geneID and transcript IDs are different) using the method described here.
2) converted tophat alignment bam to sam (samtools)
3) counted reads with HTSeq using the general command:
htseq-count --stranded=no accepted_hits.sam hg38_UCSC_alt.gtf >HTSeq.count
no_feature 13107880
ambiguous 5784082
too_low_aQual 0
not_aligned 0
alignment_not_unique 25653910
ambiguous 5784082
too_low_aQual 0
not_aligned 0
alignment_not_unique 25653910
samtools view accepted_hits.bam | grep "NH:i:1" -c
Can anyone offer a simple explanation of what is happening here? The hand-wavy explanation that I have at the moment is that HTSeq count is very conservative, and if a read cannot be assigned unambiguously, and exclusively to a gene, it is not counted. However, the HTSeq output is secretly making me question the integrity of my pipeline and/or my input files.
Thank you for any advice or reassurance.
David
EDIT: and it occurs to me that it might be relevant that the library prep was RiboMinus. Since I'm used to working with poly-A purified samples, my expectations might be too high in this case.
Comment