View Single Post
Old 05-11-2016, 07:17 AM   #8
Jane M
Senior Member
Location: Paris

Join Date: Aug 2011
Posts: 239

Compared to how I ran htseq-count previously, I added -i option to get counts for each gene_name and not gene_ID. I could run htseq count on this new gtf without any problem
htseq-count --format=bam --order=name --mode=union -a 20 -i gene_name --stranded=reverse mybam.bam gencode.v19.chr_patch_hapl_scaff.annotation.gtf > mynewcounts.gencod.htseqCount
But I still have a couple of questions:
1. With the genecode annotation, we can find how many genes are attributed to each gene_type (30 gene types), like protein coding, lincRNA, pseudogene, rRNA,... Do you know where to find this information for the refFlat annotation?

2. I am a bit confused by gene_name and gene_ID:
When counting in gencode.v19.chr_patch_hapl_scaff.annotation.gtf the number of genes ("gene" in field 3), I get 63,568:
cat gencode.v19.chr_patch_hapl_scaff.annotation.gtf | cut -f3  | wc -l
but the number of unique gene_name is 56,629:
cat gencode.v19.chr_patch_hapl_scaff.annotation.gtf | cut -d';' -f5 | sort | uniq | wc -l
In the output file of htseq count, 56,629 are listed.

Does someone know what are those genes with same gene_name but different gene_ID?
cat gencode.v19.chr_patch_hapl_scaff.annotation.gene.gtf | cut -f9  | sort | uniq | wc -l
3. Finally, I noticed that the counting changed a lot between the 2 annotations, even for well known genes as TP53, the number of reads attributed to this gene - and others - double! I am surprised that the annotation of such very well known genes changes. Is it normal?

Any clarification would be greatly appreciated.

Last edited by Jane M; 05-11-2016 at 07:49 AM.
Jane M is offline   Reply With Quote