Seqanswers Leaderboard Ad

**Simon Anders** · 08-29-2013, 12:52 AM

yes, it's a bit low. Use a visualization tool such as IGV to see where all your other reads map to. And double-check that you use the right GTF file, and the right "stranded" argument.

**dvanic** · 09-01-2013, 11:33 PM

Hi Simon!
Thanks!

I'm sure I'm using the right strand [have posted my results of how I chose the strand here: http://seqanswers.com/forums/showpos...8&postcount=50 and I'm using the gencode gtf from their website (http://www.gencodegenes.org/releases/17.html).

Following your reply, I've had more of a look at all of my datasets /10+ experimental setups, sequencing in different labs, on HiSeq/miSeq, stranded/unstranded, 75-150 nt read lengths/, and the ratio of (reads counted to features) to (reads in proper pairs) or to (uniquely mapping reads) (assessed using rseqc bam_stat .py module [reports uniqueness based on mapping quality) http://dldcc-web.brc.bcm.edu/lilab/l...l/#bam-stat-py ) is always in the 40-50% ballpark, including for the Illumina body map (50bp PE polyA selected).

For example, for liver and brain, the numbers are:

Code:

library	Reads2features	Unique	MappedInProperPairs	Ratio2unique	Ratio2PP
liverPE	62644060	141129035	129483856	0.443877902	0.483798227
brainPE	53279529	131608236	110397776	0.404834307	0.482614152

The only datasets for which I see an ~80% number of reads being counted for features are single end unstranded 75bp libraries, including the brain/liver illumina body map:

Code:

Library	Reads2features	Unique			Ratio2unique
liver75	58800747	66603793		0.882843819
brain75	44451030	56174550		0.791301933

Do you normally see more than 50% of tags counted to features in paired end libraries???

**Simon Anders** · 09-01-2013, 11:51 PM

Yes, I do. Again: Have you checked with a genome browser like IGV?

**Simon Anders** · 09-01-2013, 11:53 PM

BTW, in the post you reference above, you have posted an excerpt of a count table which contains the gene ID "ENSG00000010295.14". What is this ".14"? This does not look like a valid gene ID. I hope you are not trying to count by transcripts.

**dvanic** · 09-03-2013, 03:45 PM

Thanks for your comments, Simon. My problem is a lot sillier than the wrong gtf - it's using the wrong denominator.

I did visualize the data with IGV and UCSC, and did see that most of my reads were mapping to features. I didn't leave a comment here about that since "but they do map to features! htseq is not counting them!" isn't a reasonable arguement, given that so many different researchers have used it in so many different systems, and no one has picked up such nonsense before.

The reason I was getting such low "numbers" was my denominator - reads in the original fastq OR that when reporting unique and properly paired reads I was reporting at a per-read basis, while htseq is intelligent enough to give a count of 1 to the gene on a per-pair basis.

Basically, when I've gone back and looked, I see that in my human data, the breakdown is:

70-80% of reads in the library are mapped uniquely. So automatically those 20-30% of non-uniquely mapped reads will not be considered by htseq
65- 75% of reads in the library are mapped in proper pairs.

Questions I still have:

1) Will htseq still count the reads for which one read in the pair is unmapped and the other is mapped to a feature?
2) In the weird scenario if a read pair is mapped across chromosomes (read 1 maps to chr 1, read 2 to chr 7) [I've got a few of those, including some that have the NH:i:1 tag] - they will be discarded as being mapped to more than one feature element (if both are in annotated features) - what about if only one is?

So the actual denominator should be uniquely mapping reads, and my numerator multiplied by 2x since the counts are counted on a per-pair basis.
If I do the math this way, I get a steady 80% of uniquely mapping reads counted to features [ or 88% of reads mapped in proper pairs being counted to features], which seems quite reasonable, given that I'm using random hexamer priming and no poly-A selection, so I expect to have reads mapping to poorly annotated genes/transcripts that give rise to non-polyadenylated RNA as well as a small number of reads mapping to unspliced pre-mRNA.

PS The decimals are part of the standard gencode gtf format, and I think they reflect their internal updates to annotations. Sample of gtf below:

Code:

chr12	HAVANA	gene	6647541	6665239	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENSG00000010295.15"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1"; level 2; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	transcript	6647541	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6660761	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 1;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6660564	6660669	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 2;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6660107	6660167	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 3;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6659861	6659956	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6659861	6659869	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	start_codon	6659867	6659869	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6658922	6659062	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 5;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6658922	6659062	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 5;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6658642	6658650	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 6;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6658642	6658650	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 6;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6657834	6657991	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 7;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6657834	6657991	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 7;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6657591	6657711	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 8;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6657591	6657711	.	-	1	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 8;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6657231	6657326	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 9;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6657231	6657326	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 9;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6650678	6650808	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 10;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6650678	6650808	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 10;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6648127	6649754	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6649652	6649754	.	-	1	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	stop_codon	6649649	6649651	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6647541	6647788	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 12;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6660761	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6660564	6660669	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6660107	6660167	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6659870	6659956	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6648127	6649651	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6647541	6647788	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";

**Simon Anders** · 09-03-2013, 10:32 PM

1.) If the mate is unmapped, the read is still counted.
2.) If the mate maps to an intergenic region, the read will be counted, unless you use "intersection_strict". I have not put in a test to kick out read pairs mapping to different chromosomes, so this won't be treated any different from a mate mapping just between the neighbouring genes.

**Simon Anders** · 09-03-2013, 10:34 PM

BTW: The original idea of HTSeq was to provide users with a framework to write their own scripts, and htseq-count was just meant as an example. If you want to change the rules of the counting scheme and know a bit of Python, you can write your own script, following the examples in the documentation. (Improving the explanations on counting in the documentation is on my to-do list)

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

HTSeq - how many reads map to features

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News