Hello all. I've been experimenting with htseq-count on SAM files that were output by hisat2. My reads are paired-end, and while going through the htseq-count manual I realized that I should sort my data beforehand using samtools sort (see the first code block below).
I then ran htseq-count with the command in the second code block below.
So now I have two htseq-count output files to compare: one where the data is sorted and one where the data is unsorted.
In the unsorted warning file I notice at the bottom: "4674733 SAM alignment pairs processed"
In the sorted warning file I notice at the bottom: "8572367 SAM alignment pairs processed"
That is about 1.8X as many pairs processed for the sorted data! The count txt files for both runs look fine, but I'm guessing the sorted one contains many more counts. When I compare counts for specific genes between the two, the sorted run appears to have around double. My question is: why is this, and what are the consequences of sorting? Does it lead to more precise counts, or is it not even worth it?
Code:
samtools sort
Code:
htseq-count -m intersection-nonempty -i Name -s reverse -r name Sample_1_hisat2_results_sorted.sam /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/9.1_Reference_Files/XENLA_UTAmayball_cdna_longest_CHRS2.gff3 >Sample_1_Hisat2_Counts.txt 2>Sample_1_Hisat2_Counts_OUTPUT_WARNINGS.txt
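For anyone who wants to put numbers on the per-gene comparison described above, here is a minimal sketch. It is not what was actually run for this post. It assumes the usual htseq-count output layout: one tab-separated feature/count pair per line, plus summary rows whose names start with `__` (such as `__no_feature`). The function names and the tiny in-memory example data are made up for illustration.

```python
import csv
import io


def read_counts(handle):
    """Parse htseq-count output into (gene counts, special '__' summary counts)."""
    counts, special = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        name, value = row[0], int(row[1])
        if name.startswith("__"):
            special[name] = value  # e.g. __no_feature, __ambiguous
        else:
            counts[name] = value
    return counts, special


def compare(unsorted_counts, sorted_counts):
    """Return (gene, unsorted, sorted, ratio) rows; ratio is None when unsorted is 0."""
    rows = []
    for gene in sorted(set(unsorted_counts) | set(sorted_counts)):
        u = unsorted_counts.get(gene, 0)
        s = sorted_counts.get(gene, 0)
        rows.append((gene, u, s, s / u if u else None))
    return rows


# Invented example data standing in for the two count files.
unsorted_text = "geneA\t10\ngeneB\t0\n__no_feature\t5\n"
sorted_text = "geneA\t20\ngeneB\t3\n__no_feature\t9\n"

unsorted_counts, unsorted_special = read_counts(io.StringIO(unsorted_text))
sorted_counts, sorted_special = read_counts(io.StringIO(sorted_text))

for gene, u, s, ratio in compare(unsorted_counts, sorted_counts):
    print(gene, u, s, ratio)
```

With real files, the two `io.StringIO(...)` objects would be replaced by `open(path)` on the unsorted and sorted count files. Comparing the `__` summary rows side by side may also help show where the extra pairs end up.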