Here are the Align summary data from TopHat for single-end reads. These SE Reads are NOT stranded.
cat /PATH/align_summary.txt | grep Mapped
Mapped : 66096436 (97.2% of input)
Mapped : 59491205 (97.4% of input)
Mapped : 58752388 (97.4% of input)
Mapped : 61205947 (97.4% of input)
So an average number of reads of about ~60 million
The default options for htseq-count are ( i.e. Like I do for paired-end reads): htseq-count -f bam -r name accepted_hits_sorted_QN.bam (name sorted bam)
21237878 22854704 23350366 21377177 (sums for four conditions -2 control 2 experimental)
So ~21 million. Roughly one third of the reads are reported.
Just running DESeq2 on the UN-sorted bam (i.e. The original *bam file from TopHat) with default value for “s” (default is “yes” assumed stranded).
Gives
23794422
Unsorted bam with “s” = no " htseq-count -f bam -r name –s no accepted_hits.bam" (original bam from TopHat)
47138205
Name Sorted bam with “s” =no "htseq-count -f bam -r name –s no accepted_hits_sorted_QN.bam” (name sorted bam)
44074576
The closest I can get to number of reads reported by TopHat is 47/60 million.
Can someone explain why I might be seeing the low number of counts relative to the number of reported reads.
The issue does NOT appear for paired-end reads. For PE reads the counts reported TopHat matches closely the sum of counts reported by htseq-counts.
cat /PATH/align_summary.txt | grep Mapped
Mapped : 66096436 (97.2% of input)
Mapped : 59491205 (97.4% of input)
Mapped : 58752388 (97.4% of input)
Mapped : 61205947 (97.4% of input)
So an average number of reads of about ~60 million
The default options for htseq-count are ( i.e. Like I do for paired-end reads): htseq-count -f bam -r name accepted_hits_sorted_QN.bam (name sorted bam)
21237878 22854704 23350366 21377177 (sums for four conditions -2 control 2 experimental)
So ~21 million. Roughly one third of the reads are reported.
Just running DESeq2 on the UN-sorted bam (i.e. The original *bam file from TopHat) with default value for “s” (default is “yes” assumed stranded).
Gives
23794422
Unsorted bam with “s” = no " htseq-count -f bam -r name –s no accepted_hits.bam" (original bam from TopHat)
47138205
Name Sorted bam with “s” =no "htseq-count -f bam -r name –s no accepted_hits_sorted_QN.bam” (name sorted bam)
44074576
The closest I can get to number of reads reported by TopHat is 47/60 million.
Can someone explain why I might be seeing the low number of counts relative to the number of reported reads.
The issue does NOT appear for paired-end reads. For PE reads the counts reported TopHat matches closely the sum of counts reported by htseq-counts.
Comment