SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXSeq Using Counts File From htseq-count FuzzyCoder Bioinformatics 20 01-03-2016 11:18 PM
Zero counts in htseq-count output lanner Bioinformatics 2 05-05-2014 12:36 AM
htseq-count only not-unique counts moser Bioinformatics 7 08-09-2013 08:28 AM
htseq-count : 0 counts per miRNA JPerales RNA Sequencing 4 06-04-2013 06:00 AM
Discrepancy between HTSeq-count counts and total mapped reads M_staats Bioinformatics 1 03-21-2013 06:16 AM

Reply
 
Thread Tools
Old 08-24-2015, 08:34 PM   #1
snownontrace
Junior Member
 
Location: San Diego

Join Date: Aug 2015
Posts: 7
Question IGV sees many counts, but HTSeq-count counts 0

I appreciate any help to figure out this puzzle.

What I did:

1. I first used tophat2 to map the reads to the genome, get accepted.bam. This file is ~900 MB. Here is the alignment summary:
------------------------------------------------------------------------------------
Reads:
Input : 45593954
Mapped : 44546323 (97.7% of input)
of these: 1496005 ( 3.4%) have multiple alignments (7 have >20)
97.7% overall read mapping rate.
------------------------------------------------------------------------------------

2. I then used samtools to convert the bam file to sam file for htseq-count to handle:

samtools view -h -o WT3.sam WT3_th2out/accepted_hits.bam

The output sam file is quite large, is ~11 GB.

3. Finally I passed the sam file to htseq-count to generate a count table:

htseq-count --stranded=no WT3.sam genes.gtf > WT3_counts

The first line of the output is just one number followed by a tab:
" 1851236"
I assume this is the total counts for all genes (Is this right?).

At the end of the file it says:
------------------------------------------------------------------------------------
__no_feature 5542374
__ambiguous 5280500
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 3266554
------------------------------------------------------------------------------------

The questions:

I. The sum of all counts are merely 16 million, while the uniquely mapped reads are 43 million, why is the huge discrepancy?

II. For a few genes of interest, I got "0" in the count output, while I can see hundreds to thousands in IGV when looking at the bam file mapping.

What I have checked:

The genes.gtf was the same UCSC source as the genome I used for tophat2 mapping, so the chromosome names are consistent (chrI chrII etc).

The genes of interest I was looking at are not duplicated genes, so the mapped reads inside them should be mostly unique.

I am desparate to get some explanation for this.

Thanks!
Shaohe
snownontrace is offline   Reply With Quote
Old 08-25-2015, 12:57 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,474
Default

Quote:
The first line of the output is just one number followed by a tab:
" 1851236"
That shouldn't happen, there's probably an error in the GTF file. BTW, just give htseq-count the BAM file.

Quote:
I. The sum of all counts are merely 16 million, while the uniquely mapped reads are 43 million, why is the huge discrepancy?
Just because an alignment is "unique" to the genome doesn't mean that (A) it overlaps a gene or (B) that it overlaps only 1 gene.

Quote:
II. For a few genes of interest, I got "0" in the count output, while I can see hundreds to thousands in IGV when looking at the bam file mapping.
Perhaps they're multimappers. You can always use the -o option and grep for a few read names to see why they're not being counted (it's incredibly unlikely that htseq-count is doing something wrong).

[QUOTE] The genes.gtf was the same UCSC source as the genome I used for tophat2 mapping, so the chromosome names are consistent (chrI chrII etc).[QUOTE]

Avoid using GTF files from UCSC, they tend to have a variety of problems. Stick to Ensembl for everything.
dpryan is offline   Reply With Quote
Old 08-25-2015, 01:03 AM   #3
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 120
Default

Hi Shaohe,

could you please provide an image of one of these cases, including the gtf-annotation in the IGV?

It seems to me, that the first line in your result file is due to the fact, that some gene_ids are ether empty or not not set properly (e.g. special chars?). Thus, your last gene with this issue counted 1.8 M reads.

Regarding your first question, you have to account the no_feature reads (intronic or inter-genic mapped), as well as the ambiguous reads (reads mapping in overlapping features). These would sum up to 10.8 M reads.
In the alignments_not_unique the number of alignments (not reads) are counted, you can leave this out.

BTW: With the new version of HTSeq-count, you are using, you don't need to convert your bam file.

Cheers,
Michael
Michael.Ante is offline   Reply With Quote
Old 08-25-2015, 06:19 PM   #4
snownontrace
Junior Member
 
Location: San Diego

Join Date: Aug 2015
Posts: 7
Default

Hi Ryan,

Thank you very much for your quick response.

Quote:
Originally Posted by dpryan View Post
That shouldn't happen, there's probably an error in the GTF file. BTW, just give htseq-count the BAM file.
Thanks for the heads up. I pulled out the numbers that were assigned to genes and they sum up to be ~30 million, much closer to the mapped reads number (43 million). However, it is quite unsatisfying to have this weird empty head thing that takes up 1.8 million counts.

It'll be really nice to bypass the large sam files. However, when I tried to use BAM file directly with "-f bam", an error occured saying I don't have pysam module. So I tried to install pysam by "pip install pysam", which failed with an error (downloading the pysam tarball and install from there has a similar error):

Code:
Command "/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c "import setuptools, tokenize;__file__='/private/var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-build-LF1tZX/pysam/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-xfBXDS-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-build-LF1tZX/pysam
Please help me if you know how to fix this problem!

Quote:
Just because an alignment is "unique" to the genome doesn't mean that (A) it overlaps a gene or (B) that it overlaps only 1 gene.
The genes I was looking at are not that complicated, so I didn't except to get "0" for them. In fact, when I use cufflinks I got 96 fkpm for one gene -- a snapshot from IGV for this example gene is attached.

Quote:
Perhaps they're multimappers. You can always use the -o option and grep for a few read names to see why they're not being counted (it's incredibly unlikely that htseq-count is doing something wrong).
The first five lines of the -o output is:

Code:
['HWI-D00611:139:C6VKWANXX:8:2203:1197:2158\t272\tchrI\t2241\t0\t50M\t*\t0\t0\tCGCCCACGAAACCGTGCCGCACGTGTGGGTTTACGAGCTGAATATTTTCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:9569203\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1314:14632:94054\t272\tchrI\t2241\t0\t50M\t*\t0\t0\tCGCCCACGAAACCGTGCCGCACGTGTGGGTTTACGAGCTGAATATTTTCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:9569203\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1101:12688:99667\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1108:19834:88650\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:2111:11430:84665\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\t<FBFFF/FFBBFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFFBBBB/\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n']
I have no idea what these are, but all five have the feature "__alignment_not_unique". I don't know how or if there is method to pinpoint the reads that aligned to particular genes.

Quote:
Avoid using GTF files from UCSC, they tend to have a variety of problems. Stick to Ensembl for everything.
I'll try the Ensembl genome assembly and GTF files. BTW I am using C. elegans rather than human genome and I heard UCSC is pretty good. Apparently this is not the case since they have some empty gene IDs and names for lots of entries (see my reply to Michael below).

Many thanks!
Shaohe
Attached Images
File Type: png klp-7_WT3_IGV.png (23.7 KB, 13 views)

Last edited by snownontrace; 08-25-2015 at 06:58 PM.
snownontrace is offline   Reply With Quote
Old 08-25-2015, 06:32 PM   #5
snownontrace
Junior Member
 
Location: San Diego

Join Date: Aug 2015
Posts: 7
Default

Hi Michael,
Also thank you for your quick and kind reply!

Quote:
Originally Posted by Michael.Ante View Post
could you please provide an image of one of these cases, including the gtf-annotation in the IGV?
Does the image in my last post do the job? This is one of the genes I was curious about, which has many reads in the coding region, but the count table says "0" for it.

Quote:
It seems to me, that the first line in your result file is due to the fact, that some gene_ids are ether empty or not not set properly (e.g. special chars?). Thus, your last gene with this issue counted 1.8 M reads.
That's likely the reason! When I inspect the gtf file, this is the example gene entries I posted the IGV picture for:
Code:
chrIII	unknown	CDS	10808247	10808360	.	+	0	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10808247	10808360	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	start_codon	10808247	10808249	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10808501	10808717	.	+	0	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10808501	10808717	.	+	0	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10808501	10808717	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10808501	10808717	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10808766	10809423	.	+	2	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10808766	10809423	.	+	2	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10808766	10809423	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10808766	10809423	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10809474	10809935	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10809474	10809935	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10809474	10809935	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10809474	10809935	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10809986	10810297	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10809986	10810297	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10809986	10810297	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10809986	10810297	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	CDS	10810345	10810648	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	CDS	10810345	10810648	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	exon	10810345	10810651	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	exon	10810345	10810651	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
chrIII	unknown	stop_codon	10810649	10810651	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
chrIII	unknown	stop_codon	10810649	10810651	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
As you may notice, for many exons there is a duplicated entry with EMPTY gene_name and gene_id, likely explaining the 1.8 million counts assigned to "EMPTY" in line 1 of the output.

Quote:
Regarding your first question, you have to account the no_feature reads (intronic or inter-genic mapped), as well as the ambiguous reads (reads mapping in overlapping features). These would sum up to 10.8 M reads.
In the alignments_not_unique the number of alignments (not reads) are counted, you can leave this out.
Thanks for pointing that out. I totally thought the first line of the output is the sum, so I just added that with the special features. In fact if I add all counts including the ones assigned to genes, the special features and 1.8 million from line 1, they add up to 43 million.

Quote:
BTW: With the new version of HTSeq-count, you are using, you don't need to convert your bam file.
Thanks for the tip. As I was saying in the reply to Ryan, I got error when installing the pysam module, which prevented me from using bam file directly. I'd appreciate it if you have some insights into that.

Thanks!
Shaohe

Last edited by snownontrace; 08-25-2015 at 06:56 PM.
snownontrace is offline   Reply With Quote
Old 08-25-2015, 11:18 PM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,474
Default

The example read was marked "not unique" because it was a multimapper (it had 7 alignments). Those should rightfully be excluded.

BTW, you might just use featureCounts instead. It's MUCH faster than HTSeq-count and you can just avoid the pysam installation problem (no clue what's going on there).
dpryan is offline   Reply With Quote
Old 08-26-2015, 12:43 AM   #7
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 120
Default

Having a look at your GTF explains, why you have a quite high number of ambiguous reads:
You have in your example a lot of exons with the same positions, but different gene_ids. Each read falling into those exons cannot be assigned unambiguously and is therefore not counted for your genes.

If you want to proceed with this data, you could use an awk command to clean the GTF:
Code:
awk 'length($10)>3{print $0}'  genes.gtf > clean_genes.gtf
This will discard al gene_ids containing only ""; .
Michael.Ante is offline   Reply With Quote
Old 08-26-2015, 08:10 AM   #8
snownontrace
Junior Member
 
Location: San Diego

Join Date: Aug 2015
Posts: 7
Default

Thanks! I'll try featureCounts. From the paper it looks decent!
Shaohe
snownontrace is offline   Reply With Quote
Old 08-26-2015, 08:21 AM   #9
snownontrace
Junior Member
 
Location: San Diego

Join Date: Aug 2015
Posts: 7
Default

Thanks Michael. I decided to stay away from this problematic gtf file and instead use the one from Ensembl. I wrote a python script to convert the chromosome names from (I, II, III, IV, V, X, MtDNA) to (chrI, chrII, chrIII, chrIV, chrV, chrX and chrM) to make it compatible with the sam files I already have.

Here is the script if someone happens to run into the same problem as mine and do not want to spend another day to re-do the mapping.

Code:
import re

folder = '/Users/snownontrace/Desktop/RNAseq/'
datafile = folder + "genes_Ensembl_WBcel235.gtf"

romanNumeral = re.compile('^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$') #this is strict Roman numerals regex that check order

newGTFfile = open(datafile.split('.')[0]+'_converted_chromosome.gtf','w')

with open(datafile) as f:
    for r in f:
        if romanNumeral.match(r.split()[0]):
            newGTFfile.write('chr'+r)
        else:
            newGTFfile.write('chrM'+'\t'+'\t'.join(r.split()[1:])+'\n')
Shaohe
snownontrace is offline   Reply With Quote
Old 08-26-2015, 03:14 PM   #10
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Hi Shaohe, featureCounts can automatically match up chr names with or without the 'chr' prefix in the annotation and bam/sam files (eg. 'chrI' is matched with 'I'). This wont work for your 'MtDNA' chr though, but you can use the '-A' option to provide chr aliases to the program. You do not need to redo the mapping.

Wei
shi is offline   Reply With Quote
Old 08-26-2015, 05:14 PM   #11
snownontrace
Junior Member
 
Location: San Diego

Join Date: Aug 2015
Posts: 7
Default

Hi Wei,

The script I posted did the conversion of chromosome names and I finished my analysis. I am still very interested in trying featureCounts and will post questions here to get help if stuck.

Thanks!
Shaohe
snownontrace is offline   Reply With Quote
Old 08-26-2015, 11:48 PM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,474
Default

FYI, if you ever need to convert more complicated/annoying chromosome names, I have a github repo that has some of the more common mappings.
dpryan is offline   Reply With Quote
Old 08-27-2015, 04:22 AM   #13
snownontrace
Junior Member
 
Location: San Diego

Join Date: Aug 2015
Posts: 7
Default

Thanks. Will certainly be useful someday!
Shaohe
snownontrace is offline   Reply With Quote
Reply

Tags
htseq-count, pysam install

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:55 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO