Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq - how many reads map to features

    Out of curiosity, how many reads are normally counted to features in your data?

    I'm working with Illumina Random hexamer primed human PE RNA-Seq and using the gencode 17 annotation, and have 20-50% of reads being counted towards features. I'm wondering if this is normal, or a bit low?

    (I count this as [sum of all number in count table] / [ number of reads in original fastq] ; obviously can have denominator as "uniquely mapped reads" as well)

  • #2
    yes, it's a bit low. Use a visualization tool such as IGV to see where all your other reads map to. And double-check that you use the right GTF file, and the right "stranded" argument.

    Comment


    • #3
      Hi Simon!
      Thanks!

      I'm sure I'm using the right strand [have posted my results of how I chose the strand here: http://seqanswers.com/forums/showpos...8&postcount=50 and I'm using the gencode gtf from their website (http://www.gencodegenes.org/releases/17.html).

      Following your reply, I've had more of a look at all of my datasets /10+ experimental setups, sequencing in different labs, on HiSeq/miSeq, stranded/unstranded, 75-150 nt read lengths/, and the ratio of (reads counted to features) to (reads in proper pairs) or to (uniquely mapping reads) (assessed using rseqc bam_stat .py module [reports uniqueness based on mapping quality) http://dldcc-web.brc.bcm.edu/lilab/l...l/#bam-stat-py ) is always in the 40-50% ballpark, including for the Illumina body map (50bp PE polyA selected).

      For example, for liver and brain, the numbers are:
      Code:
      library	Reads2features	Unique	MappedInProperPairs	Ratio2unique	Ratio2PP
      liverPE	62644060	141129035	129483856	0.443877902	0.483798227
      brainPE	53279529	131608236	110397776	0.404834307	0.482614152
      The only datasets for which I see an ~80% number of reads being counted for features are single end unstranded 75bp libraries, including the brain/liver illumina body map:
      Code:
      Library	Reads2features	Unique			Ratio2unique
      liver75	58800747	66603793		0.882843819
      brain75	44451030	56174550		0.791301933
      Do you normally see more than 50% of tags counted to features in paired end libraries???

      Comment


      • #4
        Yes, I do. Again: Have you checked with a genome browser like IGV?

        Comment


        • #5
          BTW, in the post you reference above, you have posted an excerpt of a count table which contains the gene ID "ENSG00000010295.14". What is this ".14"? This does not look like a valid gene ID. I hope you are not trying to count by transcripts.

          Comment


          • #6
            Thanks for your comments, Simon. My problem is a lot sillier than the wrong gtf - it's using the wrong denominator.

            I did visualize the data with IGV and UCSC, and did see that most of my reads were mapping to features. I didn't leave a comment here about that since "but they do map to features! htseq is not counting them!" isn't a reasonable arguement, given that so many different researchers have used it in so many different systems, and no one has picked up such nonsense before.

            The reason I was getting such low "numbers" was my denominator - reads in the original fastq OR that when reporting unique and properly paired reads I was reporting at a per-read basis, while htseq is intelligent enough to give a count of 1 to the gene on a per-pair basis.

            Basically, when I've gone back and looked, I see that in my human data, the breakdown is:
            • 70-80% of reads in the library are mapped uniquely. So automatically those 20-30% of non-uniquely mapped reads will not be considered by htseq
            • 65- 75% of reads in the library are mapped in proper pairs.


            Questions I still have:
            1) Will htseq still count the reads for which one read in the pair is unmapped and the other is mapped to a feature?
            2) In the weird scenario if a read pair is mapped across chromosomes (read 1 maps to chr 1, read 2 to chr 7) [I've got a few of those, including some that have the NH:i:1 tag] - they will be discarded as being mapped to more than one feature element (if both are in annotated features) - what about if only one is?
            So the actual denominator should be uniquely mapping reads, and my numerator multiplied by 2x since the counts are counted on a per-pair basis.
            If I do the math this way, I get a steady 80% of uniquely mapping reads counted to features [ or 88% of reads mapped in proper pairs being counted to features], which seems quite reasonable, given that I'm using random hexamer priming and no poly-A selection, so I expect to have reads mapping to poorly annotated genes/transcripts that give rise to non-polyadenylated RNA as well as a small number of reads mapping to unspliced pre-mRNA.


            PS The decimals are part of the standard gencode gtf format, and I think they reflect their internal updates to annotations. Sample of gtf below:
            Code:
            chr12	HAVANA	gene	6647541	6665239	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENSG00000010295.15"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1"; level 2; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	transcript	6647541	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6660761	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 1;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6660564	6660669	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 2;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6660107	6660167	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 3;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6659861	6659956	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	CDS	6659861	6659869	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	start_codon	6659867	6659869	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 4;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6658922	6659062	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 5;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	CDS	6658922	6659062	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 5;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6658642	6658650	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 6;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	CDS	6658642	6658650	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 6;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6657834	6657991	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 7;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	CDS	6657834	6657991	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 7;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6657591	6657711	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 8;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	CDS	6657591	6657711	.	-	1	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 8;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6657231	6657326	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 9;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	CDS	6657231	6657326	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 9;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6650678	6650808	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 10;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	CDS	6650678	6650808	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 10;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6648127	6649754	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	CDS	6649652	6649754	.	-	1	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	stop_codon	6649649	6649651	.	-	0	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 11;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	exon	6647541	6647788	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; exon_number 12;  level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	UTR	6660761	6661066	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	UTR	6660564	6660669	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	UTR	6660107	6660167	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	UTR	6659870	6659956	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	UTR	6648127	6649651	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";
            chr12	ENSEMBL	UTR	6647541	6647788	.	-	.	gene_id "ENSG00000010295.15"; transcript_id "ENST00000436152.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IFFO1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IFFO1-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000141264.3";

            Comment


            • #7
              1.) If the mate is unmapped, the read is still counted.
              2.) If the mate maps to an intergenic region, the read will be counted, unless you use "intersection_strict". I have not put in a test to kick out read pairs mapping to different chromosomes, so this won't be treated any different from a mate mapping just between the neighbouring genes.

              Comment


              • #8
                BTW: The original idea of HTSeq was to provide users with a framework to write their own scripts, and htseq-count was just meant as an example. If you want to change the rules of the counting scheme and know a bit of Python, you can write your own script, following the examples in the documentation. (Improving the explanations on counting in the documentation is on my to-do list)

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                30 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                32 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X