Seqanswers Leaderboard Ad

**dpryan** · 08-25-2015, 12:57 AM

The first line of the output is just one number followed by a tab:
" 1851236"

That shouldn't happen, there's probably an error in the GTF file. BTW, just give htseq-count the BAM file.

I. The sum of all counts are merely 16 million, while the uniquely mapped reads are 43 million, why is the huge discrepancy?

Just because an alignment is "unique" to the genome doesn't mean that (A) it overlaps a gene or (B) that it overlaps only 1 gene.

II. For a few genes of interest, I got "0" in the count output, while I can see hundreds to thousands in IGV when looking at the bam file mapping.

Perhaps they're multimappers. You can always use the -o option and grep for a few read names to see why they're not being counted (it's incredibly unlikely that htseq-count is doing something wrong).

[QUOTE] The genes.gtf was the same UCSC source as the genome I used for tophat2 mapping, so the chromosome names are consistent (chrI chrII etc).[QUOTE]

Avoid using GTF files from UCSC, they tend to have a variety of problems. Stick to Ensembl for everything.

**Michael.Ante** · 08-25-2015, 01:03 AM

Hi Shaohe,

could you please provide an image of one of these cases, including the gtf-annotation in the IGV?

It seems to me, that the first line in your result file is due to the fact, that some gene_ids are ether empty or not not set properly (e.g. special chars?). Thus, your last gene with this issue counted 1.8 M reads.

Regarding your first question, you have to account the no_feature reads (intronic or inter-genic mapped), as well as the ambiguous reads (reads mapping in overlapping features). These would sum up to 10.8 M reads.
In the alignments_not_unique the number of alignments (not reads) are counted, you can leave this out.

BTW: With the new version of HTSeq-count, you are using, you don't need to convert your bam file.

Cheers,
Michael

**snownontrace** · 08-25-2015, 06:19 PM

Hi Ryan,

Thank you very much for your quick response.

Originally posted by dpryan View Post

That shouldn't happen, there's probably an error in the GTF file. BTW, just give htseq-count the BAM file.

Thanks for the heads up. I pulled out the numbers that were assigned to genes and they sum up to be ~30 million, much closer to the mapped reads number (43 million). However, it is quite unsatisfying to have this weird empty head thing that takes up 1.8 million counts.

It'll be really nice to bypass the large sam files. However, when I tried to use BAM file directly with "-f bam", an error occured saying I don't have pysam module. So I tried to install pysam by "pip install pysam", which failed with an error (downloading the pysam tarball and install from there has a similar error):

Code:

Command "/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c "import setuptools, tokenize;__file__='/private/var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-build-LF1tZX/pysam/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-xfBXDS-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-build-LF1tZX/pysam

Please help me if you know how to fix this problem!

Just because an alignment is "unique" to the genome doesn't mean that (A) it overlaps a gene or (B) that it overlaps only 1 gene.

The genes I was looking at are not that complicated, so I didn't except to get "0" for them. In fact, when I use cufflinks I got 96 fkpm for one gene -- a snapshot from IGV for this example gene is attached.

Perhaps they're multimappers. You can always use the -o option and grep for a few read names to see why they're not being counted (it's incredibly unlikely that htseq-count is doing something wrong).

The first five lines of the -o output is:

Code:

['HWI-D00611:139:C6VKWANXX:8:2203:1197:2158\t272\tchrI\t2241\t0\t50M\t*\t0\t0\tCGCCCACGAAACCGTGCCGCACGTGTGGGTTTACGAGCTGAATATTTTCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:9569203\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1314:14632:94054\t272\tchrI\t2241\t0\t50M\t*\t0\t0\tCGCCCACGAAACCGTGCCGCACGTGTGGGTTTACGAGCTGAATATTTTCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:9569203\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1101:12688:99667\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1108:19834:88650\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:2111:11430:84665\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\t<FBFFF/FFBBFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFFBBBB/\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n']

I have no idea what these are, but all five have the feature "__alignment_not_unique". I don't know how or if there is method to pinpoint the reads that aligned to particular genes.

Avoid using GTF files from UCSC, they tend to have a variety of problems. Stick to Ensembl for everything.

I'll try the Ensembl genome assembly and GTF files. BTW I am using C. elegans rather than human genome and I heard UCSC is pretty good. Apparently this is not the case since they have some empty gene IDs and names for lots of entries (see my reply to Michael below).

Many thanks!
Shaohe

Attached Files

klp-7_WT3_IGV.png (23.7 KB, 50 views)

**snownontrace** · 08-25-2015, 06:32 PM

Hi Michael,
Also thank you for your quick and kind reply!

Originally posted by Michael.Ante View Post

could you please provide an image of one of these cases, including the gtf-annotation in the IGV?

Does the image in my last post do the job? This is one of the genes I was curious about, which has many reads in the coding region, but the count table says "0" for it.

It seems to me, that the first line in your result file is due to the fact, that some gene_ids are ether empty or not not set properly (e.g. special chars?). Thus, your last gene with this issue counted 1.8 M reads.

That's likely the reason! When I inspect the gtf file, this is the example gene entries I posted the IGV picture for:

Code:

chrIII	unknown	CDS	10808247	10808360	.	+	0	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10808247	10808360	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	start_codon	10808247	10808249	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10808501	10808717	.	+	0	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10808501	10808717	.	+	0	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10808501	10808717	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10808501	10808717	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10808766	10809423	.	+	2	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10808766	10809423	.	+	2	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10808766	10809423	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10808766	10809423	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10809474	10809935	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10809474	10809935	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10809474	10809935	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10809474	10809935	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10809986	10810297	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10809986	10810297	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10809986	10810297	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10809986	10810297	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10810345	10810648	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10810345	10810648	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10810345	10810651	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10810345	10810651	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	stop_codon	10810649	10810651	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	stop_codon	10810649	10810651	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";

As you may notice, for many exons there is a duplicated entry with EMPTY gene_name and gene_id, likely explaining the 1.8 million counts assigned to "EMPTY" in line 1 of the output.

Regarding your first question, you have to account the no_feature reads (intronic or inter-genic mapped), as well as the ambiguous reads (reads mapping in overlapping features). These would sum up to 10.8 M reads.
In the alignments_not_unique the number of alignments (not reads) are counted, you can leave this out.

Thanks for pointing that out. I totally thought the first line of the output is the sum, so I just added that with the special features. In fact if I add all counts including the ones assigned to genes, the special features and 1.8 million from line 1, they add up to 43 million.

BTW: With the new version of HTSeq-count, you are using, you don't need to convert your bam file.

Thanks for the tip. As I was saying in the reply to Ryan, I got error when installing the pysam module, which prevented me from using bam file directly. I'd appreciate it if you have some insights into that.

Thanks!
Shaohe

**dpryan** · 08-25-2015, 11:18 PM

The example read was marked "not unique" because it was a multimapper (it had 7 alignments). Those should rightfully be excluded.

BTW, you might just use featureCounts instead. It's MUCH faster than HTSeq-count and you can just avoid the pysam installation problem (no clue what's going on there).

**Michael.Ante** · 08-26-2015, 12:43 AM

Having a look at your GTF explains, why you have a quite high number of ambiguous reads:
You have in your example a lot of exons with the same positions, but different gene_ids. Each read falling into those exons cannot be assigned unambiguously and is therefore not counted for your genes.

If you want to proceed with this data, you could use an awk command to clean the GTF:

Code:

awk 'length($10)>3{print $0}'  genes.gtf > clean_genes.gtf

This will discard al gene_ids containing only ""; .

**snownontrace** · 08-26-2015, 08:10 AM

Thanks! I'll try featureCounts. From the paper it looks decent!
Shaohe

**snownontrace** · 08-26-2015, 08:21 AM

Thanks Michael. I decided to stay away from this problematic gtf file and instead use the one from Ensembl. I wrote a python script to convert the chromosome names from (I, II, III, IV, V, X, MtDNA) to (chrI, chrII, chrIII, chrIV, chrV, chrX and chrM) to make it compatible with the sam files I already have.

Here is the script if someone happens to run into the same problem as mine and do not want to spend another day to re-do the mapping.

Code:

import re

folder = '/Users/snownontrace/Desktop/RNAseq/'
datafile = folder + "genes_Ensembl_WBcel235.gtf"

romanNumeral = re.compile('^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$') #this is strict Roman numerals regex that check order

newGTFfile = open(datafile.split('.')[0]+'_converted_chromosome.gtf','w')

with open(datafile) as f:
    for r in f:
        if romanNumeral.match(r.split()[0]):
            newGTFfile.write('chr'+r)
        else:
            newGTFfile.write('chrM'+'\t'+'\t'.join(r.split()[1:])+'\n')

Shaohe

**shi** · 08-26-2015, 03:14 PM

Hi Shaohe, featureCounts can automatically match up chr names with or without the 'chr' prefix in the annotation and bam/sam files (eg. 'chrI' is matched with 'I'). This wont work for your 'MtDNA' chr though, but you can use the '-A' option to provide chr aliases to the program. You do not need to redo the mapping.

Wei

**snownontrace** · 08-26-2015, 05:14 PM

Hi Wei,

The script I posted did the conversion of chromosome names and I finished my analysis. I am still very interested in trying featureCounts and will post questions here to get help if stuck.

Thanks!
Shaohe

**dpryan** · 08-26-2015, 11:48 PM

FYI, if you ever need to convert more complicated/annoying chromosome names, I have a github repo that has some of the more common mappings.

**snownontrace** · 08-27-2015, 04:22 AM

Thanks. Will certainly be useful someday!
Shaohe

Topics	Statistics	Last Post
The Role of Spliceosomes in RNA Splicing and Genome Evolution by seqadmin Started by seqadmin, Today, 07:03 AM	0 responses 10 views 0 likes	Last Post by seqadmin Today, 07:03 AM
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, 05-10-2024, 06:35 AM	0 responses 31 views 0 likes	Last Post by seqadmin 05-10-2024, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 41 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 33 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM

Seqanswers Leaderboard Ad

Announcement

IGV sees many counts, but HTSeq-count counts 0

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News