![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
DEXSeq Using Counts File From htseq-count | FuzzyCoder | Bioinformatics | 20 | 01-04-2016 12:18 AM |
Error with GTF file when using htseq-count | MDonlin | Bioinformatics | 13 | 01-13-2015 09:29 AM |
Isolating non coding RNA using bedtools/samtools/htseq-count | tirohia | Bioinformatics | 3 | 09-30-2014 12:13 AM |
HTseq-count feature type choice for RNAseq | shangzhong0619 | Bioinformatics | 3 | 08-07-2014 01:29 PM |
multiBamCov or htseq-count to count read per feature ? | NicoBxl | Bioinformatics | 1 | 07-03-2012 03:05 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]()
Dear all,
Sorry if this question has been asked before, I have not found similar topics on the web, but there are probably... Until now, I used a gtf downloaded from UCSC table (genome: human, assembly: hg19, group:genes and gene predictions, track: RefSeq genes, table: refFlat) as a feature file, with gene name, in GFF format for counting reads in genes with htseq-count. I have one experiment with ribodepleted samples (TruSeq total RNA Stranded). I first counted the reads and performed differential expression analysis using such a gtf. In this gtf downloaded few weeks ago, there are 26688 genes, including 913 "LINC*". This number of lincRNA seems low, so I wonder if this table is comprehensive for ribodepleted experiment. Can you please tell me which reference you use when interesting in non coding elements of ribodepleted experiments? Thank you for your feedback, Jane |
![]() |
![]() |
![]() |
#2 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
UCSC references...kind of suck. Use Gencode/Ensembl and you'll get more complete coding and non-coding transcripts. Having said that, for lincRNAs you might want to check out lincrnadb or RNAcentral.
|
![]() |
![]() |
![]() |
#3 |
Junior Member
Location: korea Join Date: Dec 2015
Posts: 2
|
![]()
I used gencode. but no counts.
perhaps what column used htseq-count ? |
![]() |
![]() |
![]() |
#4 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
Presumably the lack of counts is due to the difference in chromosome names.
|
![]() |
![]() |
![]() |
#5 | |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]()
Thank you dpryan for your answer.
Don't you thing this refFlat reference is sufficient for experiments with polyA RNA selection? Quote:
I am currently looking at what this file contains: number of genes (name, ID, status), transcripts, ... It has not the same format like the refFlat file, so I expect some issues for read counting. |
|
![]() |
![]() |
![]() |
#6 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
The UCSC file annotations are never that good. The file you downloaded is fine, though again the chromosome names probably differ, which is what's causing problems. I think featureCounts can handle the chromosome naming difference internally, so try using that instead (it's much faster anyway).
|
![]() |
![]() |
![]() |
#7 | |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]() Quote:
chr1->22, chrX, chrY have the same names, but the additional chromosomes have indeed different names: chr17_ctg5_hap1, chr17_gl000205_random in refFlat and GL000191.1, GL000192.1. I start the tests right now! |
|
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]()
Compared to how I ran htseq-count previously, I added -i option to get counts for each gene_name and not gene_ID. I could run htseq count on this new gtf without any problem
![]() Code:
htseq-count --format=bam --order=name --mode=union -a 20 -i gene_name --stranded=reverse mybam.bam gencode.v19.chr_patch_hapl_scaff.annotation.gtf > mynewcounts.gencod.htseqCount 1. With the genecode annotation, we can find how many genes are attributed to each gene_type (30 gene types), like protein coding, lincRNA, pseudogene, rRNA,... Do you know where to find this information for the refFlat annotation? 2. I am a bit confused by gene_name and gene_ID: When counting in gencode.v19.chr_patch_hapl_scaff.annotation.gtf the number of genes ("gene" in field 3), I get 63,568: Code:
cat gencode.v19.chr_patch_hapl_scaff.annotation.gtf | cut -f3 | wc -l 63568 Code:
cat gencode.v19.chr_patch_hapl_scaff.annotation.gtf | cut -d';' -f5 | sort | uniq | wc -l 56629 Does someone know what are those genes with same gene_name but different gene_ID? Code:
cat gencode.v19.chr_patch_hapl_scaff.annotation.gene.gtf | cut -f9 | sort | uniq | wc -l 63568 Any clarification would be greatly appreciated. Last edited by Jane M; 05-11-2016 at 07:49 AM. |
![]() |
![]() |
![]() |
#9 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
1. Use things like "cut" and "uniq" to determine this. This isn't something you need to look up, just determine it yourself.
2. How does one define a gene? Is it a location, a sequence, something else? If you have essentially the same sequence on different chromosomes and both are expressed are they the same gene or different ones? In such cases, gencode/ensembl will give each instance a unique ID. UCSC will give each instance the same ID in such cases, which is a good way to completely break a LOT of programs.This is why one should normally quantify by gene ID. You can add gene names after everything is analysed. 3. UCSC annotations are rather minimalistic. |
![]() |
![]() |
![]() |
#10 | ||
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]() Quote:
refFlat file (RefSeq), that I used until now: Code:
chr1 hg19_refFlat exon 11874 12227 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1"; chr1 hg19_refFlat exon 12613 12721 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1"; chr1 hg19_refFlat exon 13221 14409 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1"; chr1 hg19_refFlat exon 14362 14829 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P"; chr1 hg19_refFlat exon 14970 15038 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P"; chr1 hg19_refFlat exon 15796 15947 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P"; refGene file (RefSeq): Code:
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene exon 66999639 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291"; Code:
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; chr1 hg19_knownGene exon 12646 12697 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene CDS 12595 12721 0.000000 + 1 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene exon 12595 12721 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene CDS 13403 13636 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene stop_codon 13637 13639 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene exon 13403 14409 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; Quote:
Ok, but I am very surprised for annotation of well known genes. |
||
![]() |
![]() |
![]() |
#11 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
|
![]()
Why are you not using the GTF file from UCSC instead of all those other files. You could create that from Table Browser. (iGenomes bundle does not have non-coding genes).
Edit: @Vikas Bansal has a solution to get the non-coding elements (you can choose GTF output in table browser) for hg19 here. Last edited by GenoMax; 05-12-2016 at 06:54 AM. |
![]() |
![]() |
![]() |
#12 | ||
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]() Quote:
I added the refGene and knownGenes gtf files to show that these files do not contain description neither. There seem to be some non coding elements in refFlat gtf: ~900 LINC, ... Quote:
|
||
![]() |
![]() |
![]() |
Thread Tools | |
|
|