SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Tophat RNASeq mapping - right reads map and left reads map 50% less airad22 Bioinformatics 2 08-14-2013 08:26 AM
[HTSeq-Count] How are reads counted? syintel87 Bioinformatics 0 06-17-2013 02:18 PM
Reads Count - HTSeq alternative? NGSAwesome RNA Sequencing 1 10-15-2012 10:13 AM
htseq-count gets more reads? deepsea Bioinformatics 3 03-29-2012 12:27 PM
Best tool to map 454 reads onto sanger reads? dan Bioinformatics 3 07-27-2009 09:51 AM

Reply
 
Thread Tools
Old 08-28-2013, 11:43 PM   #1
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61
Default HTSeq - how many reads map to features

Out of curiosity, how many reads are normally counted to features in your data?

I'm working with Illumina Random hexamer primed human PE RNA-Seq and using the gencode 17 annotation, and have 20-50% of reads being counted towards features. I'm wondering if this is normal, or a bit low?

(I count this as [sum of all number in count table] / [ number of reads in original fastq] ; obviously can have denominator as "uniquely mapped reads" as well)
dvanic is offline   Reply With Quote
Old 08-29-2013, 01:52 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

yes, it's a bit low. Use a visualization tool such as IGV to see where all your other reads map to. And double-check that you use the right GTF file, and the right "stranded" argument.
Simon Anders is offline   Reply With Quote
Old 09-02-2013, 12:33 AM   #3
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61
Default

Hi Simon!
Thanks!

I'm sure I'm using the right strand [have posted my results of how I chose the strand here: http://seqanswers.com/forums/showpos...8&postcount=50 and I'm using the gencode gtf from their website (http://www.gencodegenes.org/releases/17.html).

Following your reply, I've had more of a look at all of my datasets /10+ experimental setups, sequencing in different labs, on HiSeq/miSeq, stranded/unstranded, 75-150 nt read lengths/, and the ratio of (reads counted to features) to (reads in proper pairs) or to (uniquely mapping reads) (assessed using rseqc bam_stat .py module [reports uniqueness based on mapping quality) http://dldcc-web.brc.bcm.edu/lilab/l...l/#bam-stat-py ) is always in the 40-50% ballpark, including for the Illumina body map (50bp PE polyA selected).

For example, for liver and brain, the numbers are:
Code:
library	Reads2features	Unique	MappedInProperPairs	Ratio2unique	Ratio2PP
liverPE	62644060	141129035	129483856	0.443877902	0.483798227
brainPE	53279529	131608236	110397776	0.404834307	0.482614152
The only datasets for which I see an ~80% number of reads being counted for features are single end unstranded 75bp libraries, including the brain/liver illumina body map:
Code:
Library	Reads2features	Unique			Ratio2unique
liver75	58800747	66603793		0.882843819
brain75	44451030	56174550		0.791301933
Do you normally see more than 50% of tags counted to features in paired end libraries???
dvanic is offline   Reply With Quote
Old 09-02-2013, 12:51 AM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Yes, I do. Again: Have you checked with a genome browser like IGV?
Simon Anders is offline   Reply With Quote
Old 09-02-2013, 12:53 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

BTW, in the post you reference above, you have posted an excerpt of a count table which contains the gene ID "ENSG00000010295.14". What is this ".14"? This does not look like a valid gene ID. I hope you are not trying to count by transcripts.
Simon Anders is offline   Reply With Quote
Old 09-03-2013, 04:45 PM   #6
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61
Smile

Thanks for your comments, Simon. My problem is a lot sillier than the wrong gtf - it's using the wrong denominator.

I did visualize the data with IGV and UCSC, and did see that most of my reads were mapping to features. I didn't leave a comment here about that since "but they do map to features! htseq is not counting them!" isn't a reasonable arguement, given that so many different researchers have used it in so many different systems, and no one has picked up such nonsense before.

The reason I was getting such low "numbers" was my denominator - reads in the original fastq OR that when reporting unique and properly paired reads I was reporting at a per-read basis, while htseq is intelligent enough to give a count of 1 to the gene on a per-pair basis.

Basically, when I've gone back and looked, I see that in my human data, the breakdown is:
  • 70-80% of reads in the library are mapped uniquely. So automatically those 20-30% of non-uniquely mapped reads will not be considered by htseq
  • 65- 75% of reads in the library are mapped in proper pairs.

Questions I still have:
Quote:
1) Will htseq still count the reads for which one read in the pair is unmapped and the other is mapped to a feature?
2) In the weird scenario if a read pair is mapped across chromosomes (read 1 maps to chr 1, read 2 to chr 7) [I've got a few of those, including some that have the NH:i:1 tag] - they will be discarded as being mapped to more than one feature element (if both are in annotated features) - what about if only one is?
So the actual denominator should be uniquely mapping reads, and my numerator multiplied by 2x since the counts are counted on a per-pair basis.
If I do the math this way, I get a steady 80% of uniquely mapping reads counted to features [ or 88% of reads mapped in proper pairs being counted to features], which seems quite reasonable, given that I'm using random hexamer priming and no poly-A selection, so I expect to have reads mapping to poorly annotated genes/transcripts that give rise to non-polyadenylated RNA as well as a small number of reads mapping to unspliced pre-mRNA.


PS The decimals are part of the standard gencode gtf format, and I think they reflect their internal updates to annotations. Sample of gtf below:
Code:
chr12	HAVANA	gene	6647541	6665239	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENSG00000010295.15"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1"; level 2; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	transcript	6647541	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6660761	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 1;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6660564	6660669	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 2;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6660107	6660167	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 3;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6659861	6659956	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6659861	6659869	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	start_codon	6659867	6659869	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6658922	6659062	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 5;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6658922	6659062	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 5;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6658642	6658650	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 6;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6658642	6658650	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 6;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6657834	6657991	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 7;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6657834	6657991	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 7;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6657591	6657711	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 8;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6657591	6657711	.	-	1	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 8;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6657231	6657326	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 9;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6657231	6657326	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 9;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6650678	6650808	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 10;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6650678	6650808	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 10;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6648127	6649754	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	CDS	6649652	6649754	.	-	1	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	stop_codon	6649649	6649651	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	exon	6647541	6647788	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 12;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6660761	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6660564	6660669	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6660107	6660167	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6659870	6659956	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6648127	6649651	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
chr12	ENSEMBL	UTR	6647541	6647788	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
dvanic is offline   Reply With Quote
Old 09-03-2013, 11:32 PM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

1.) If the mate is unmapped, the read is still counted.
2.) If the mate maps to an intergenic region, the read will be counted, unless you use "intersection_strict". I have not put in a test to kick out read pairs mapping to different chromosomes, so this won't be treated any different from a mate mapping just between the neighbouring genes.
Simon Anders is offline   Reply With Quote
Old 09-03-2013, 11:34 PM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

BTW: The original idea of HTSeq was to provide users with a framework to write their own scripts, and htseq-count was just meant as an example. If you want to change the rules of the counting scheme and know a bit of Python, you can write your own script, following the examples in the documentation. (Improving the explanations on counting in the documentation is on my to-do list)
Simon Anders is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:34 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO