![]() |
Which feature file for htseq-count for non coding elements of ribodepleted samples?
Dear all,
Sorry if this question has been asked before, I have not found similar topics on the web, but there are probably... Until now, I used a gtf downloaded from UCSC table (genome: human, assembly: hg19, group:genes and gene predictions, track: RefSeq genes, table: refFlat) as a feature file, with gene name, in GFF format for counting reads in genes with htseq-count. I have one experiment with ribodepleted samples (TruSeq total RNA Stranded). I first counted the reads and performed differential expression analysis using such a gtf. In this gtf downloaded few weeks ago, there are 26688 genes, including 913 "LINC*". This number of lincRNA seems low, so I wonder if this table is comprehensive for ribodepleted experiment. Can you please tell me which reference you use when interesting in non coding elements of ribodepleted experiments? Thank you for your feedback, Jane |
UCSC references...kind of suck. Use Gencode/Ensembl and you'll get more complete coding and non-coding transcripts. Having said that, for lincRNAs you might want to check out lincrnadb or RNAcentral.
|
I used gencode. but no counts.
perhaps what column used htseq-count ? |
Presumably the lack of counts is due to the difference in chromosome names.
|
Thank you dpryan for your answer.
Quote:
Quote:
I am currently looking at what this file contains: number of genes (name, ID, status), transcripts, ... It has not the same format like the refFlat file, so I expect some issues for read counting. |
The UCSC file annotations are never that good. The file you downloaded is fine, though again the chromosome names probably differ, which is what's causing problems. I think featureCounts can handle the chromosome naming difference internally, so try using that instead (it's much faster anyway).
|
Quote:
chr1->22, chrX, chrY have the same names, but the additional chromosomes have indeed different names: chr17_ctg5_hap1, chr17_gl000205_random in refFlat and GL000191.1, GL000192.1. I start the tests right now! |
Compared to how I ran htseq-count previously, I added -i option to get counts for each gene_name and not gene_ID. I could run htseq count on this new gtf without any problem :D
Code:
htseq-count --format=bam --order=name --mode=union -a 20 -i gene_name --stranded=reverse mybam.bam gencode.v19.chr_patch_hapl_scaff.annotation.gtf > mynewcounts.gencod.htseqCount 1. With the genecode annotation, we can find how many genes are attributed to each gene_type (30 gene types), like protein coding, lincRNA, pseudogene, rRNA,... Do you know where to find this information for the refFlat annotation? 2. I am a bit confused by gene_name and gene_ID: When counting in gencode.v19.chr_patch_hapl_scaff.annotation.gtf the number of genes ("gene" in field 3), I get 63,568: Code:
cat gencode.v19.chr_patch_hapl_scaff.annotation.gtf | cut -f3 | wc -l Code:
cat gencode.v19.chr_patch_hapl_scaff.annotation.gtf | cut -d';' -f5 | sort | uniq | wc -l Does someone know what are those genes with same gene_name but different gene_ID? Code:
cat gencode.v19.chr_patch_hapl_scaff.annotation.gene.gtf | cut -f9 | sort | uniq | wc -l Any clarification would be greatly appreciated. |
1. Use things like "cut" and "uniq" to determine this. This isn't something you need to look up, just determine it yourself.
2. How does one define a gene? Is it a location, a sequence, something else? If you have essentially the same sequence on different chromosomes and both are expressed are they the same gene or different ones? In such cases, gencode/ensembl will give each instance a unique ID. UCSC will give each instance the same ID in such cases, which is a good way to completely break a LOT of programs.This is why one should normally quantify by gene ID. You can add gene names after everything is analysed. 3. UCSC annotations are rather minimalistic. |
Quote:
refFlat file (RefSeq), that I used until now: Code:
chr1 hg19_refFlat exon 11874 12227 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1"; refGene file (RefSeq): Code:
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291"; Code:
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; Quote:
Quote:
|
Why are you not using the GTF file from UCSC instead of all those other files. You could create that from Table Browser. (iGenomes bundle does not have non-coding genes).
Edit: @Vikas Bansal has a solution to get the non-coding elements (you can choose GTF output in table browser) for hg19 here. |
Quote:
I added the refGene and knownGenes gtf files to show that these files do not contain description neither. There seem to be some non coding elements in refFlat gtf: ~900 LINC, ... Quote:
|
All times are GMT -8. The time now is 06:29 AM. |
Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.