**dpryan** · 06-17-2014, 04:18 AM

Why did you count exons if you want gene-level metrics? htseq-count can deal with overlapping genes without issue.

**Minty** · 06-17-2014, 04:24 AM

I'm actually just a summer worker. My supervisor did the gene-level counts, but the most important gene didn't show any expression in the data. It is located in the promoter area of another gene. With exon counts it shows the expected expression, and a collaborator used the exon data, but I can't get his code before he comes back from his holiday, so I'm stuck.

**dpryan** · 06-17-2014, 04:29 AM

It sounds like your supervisor did it wrong. In any case, you might just want to redo the counting, either with htseq-count or with featureCounts (which is much faster).

**Minty** · 06-17-2014, 04:34 AM

She used this code for the gene counts:
python -m HTSeq.scripts.count -m union -s no -a 10 -t gene -i ID $name.n-sorted.q10.sam $ANNOT/$gf_gene > $name.n-sorted.q10.$out_gene

$ANNOT is a directory path, and $name is set by additional wrapper code that runs the command over the whole list of samples. What should I change to get the gene counts right?

**dpryan** · 06-17-2014, 04:40 AM

Without seeing the annotation file, my guess is that whatever gene you're getting 0 counts for slightly overlaps this other gene, so using "-m union" is losing a lot of counts. So, try "-m intersection_nonempty" instead.

If one gene completely overlaps the other, then without a stranded dataset there's no way to say from which of the overlapping genes a given read originates, thus saying that it has 0 counts would be appropriate. Granted, you might be able to guess from the coverage profile that most/all of them actually come from one or the other gene, but then you're using an expectation maximization method rather than direct counting. Alternatively, just remove the overlapping portion of one of the genes and then mention that whenever the data gets published. You'd likely get hammered by the reviewers, but that would be completely appropriate.
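
To make the difference between the modes concrete, here is a toy sketch of the overlap-resolution rules as described in the htseq-count documentation. It is not HTSeq's actual code: the function name `assign_read`, the gene names, and the per-position sets are made up for illustration.

```python
def assign_read(position_sets, mode="union"):
    """Toy model of htseq-count's overlap resolution (not the real implementation).

    position_sets: one set of feature IDs per aligned position of the read.
    Returns a feature ID, "no_feature", or "ambiguous".
    """
    if not position_sets:
        return "no_feature"

    if mode == "union":
        # Every feature touched by any position of the read
        candidates = set().union(*position_sets)
    elif mode == "intersection_strict":
        # Only features covering every position of the read
        candidates = set.intersection(*position_sets)
    elif mode == "intersection_nonempty":
        # Like intersection_strict, but positions with no feature are ignored
        non_empty = [s for s in position_sets if s]
        candidates = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"
    if len(candidates) == 1:
        return next(iter(candidates))
    return "ambiguous"


# Hypothetical read that lies mostly in geneA and only clips geneB at its end
partial = [{"geneA"}, {"geneA"}, {"geneA", "geneB"}]
# Hypothetical read in a region where geneA and geneB overlap completely
complete = [{"geneA", "geneB"}] * 3

for mode in ("union", "intersection_strict", "intersection_nonempty"):
    print(mode, assign_read(partial, mode), assign_read(complete, mode))
```

With a partial overlap, "union" reports the read as ambiguous while both intersection modes assign it to geneA. With a complete overlap, every mode reports it as ambiguous, which is the situation described above.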

**Minty** · 06-17-2014, 04:49 AM

The overlap is complete; that really is the problem. According to the manual (which I checked before even running the exon counts), all three options for -m just report such reads as ambiguous. That's why I came here (after intense googling).

Edit.
As I know a little programming, I would assume that the exons could be combined with some script. My skills just aren't good enough yet, so any help with just transforming a gff file from:
14-3-3epsilon:1 0
14-3-3epsilon:2 0
14-3-3epsilon:3 3527
14-3-3epsilon:4 1343
14-3-3epsilon:5 1
14-3-3epsilon:6 0
14-3-3epsilon:7 57
14-3-3epsilon:8 0

to:
14-3-3epsilon 4928

is appreciated.

**dpryan** · 06-17-2014, 05:03 AM

No count-based method will ever return anything other than 0 in that circumstance, since doing otherwise would be wrong.

If you look at the read distribution in a genome browser (IGV or similar) and it's very obvious in all samples that the reads aligning to this region only originate from one of the genes then you might be able to get away with just removing the overlapping portion of one of the genes, which would result in the other not having 0 counts. This is, of course, also a good way to produce results that don't match the underlying biology. One would hope that your collaborator, who did something similar, thought about this issue (I wouldn't hold my breath).
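
For what "removing the overlapping portion" amounts to, here is a minimal sketch of interval subtraction on 1-based, inclusive coordinates as used in GFF. The function name `subtract_interval` and all coordinates are hypothetical, and a real annotation would also need its exon and transcript records adjusted to match.

```python
def subtract_interval(gene, other):
    """Return the parts of `gene` that are not covered by `other`.

    Both arguments are (start, end) tuples, 1-based and inclusive.
    The result is a list of zero, one, or two (start, end) tuples.
    """
    g_start, g_end = gene
    o_start, o_end = other

    # No overlap at all: the gene is left untouched
    if o_end < g_start or o_start > g_end:
        return [gene]

    remaining = []
    if o_start > g_start:
        # Piece of the gene to the left of the other feature
        remaining.append((g_start, o_start - 1))
    if o_end < g_end:
        # Piece of the gene to the right of the other feature
        remaining.append((o_end + 1, g_end))
    return remaining


# Hypothetical coordinates: a small gene sitting entirely inside a larger one
big_gene = (100, 1000)
small_gene = (300, 400)
print(subtract_interval(big_gene, small_gene))  # [(100, 299), (401, 1000)]
```

Cutting the small gene's span out of the large one leaves two pieces of the large gene, so reads falling in 300-400 would no longer be ambiguous. All of the caveats above about matching the underlying biology still apply.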

**Minty** · 06-17-2014, 05:08 AM

The gene of interest has its exon in an area where the other gene doesn't have any exons, so removing the overlapping part of the other gene could be an option if nothing else works.

**dpryan** · 06-17-2014, 05:52 AM

Just recounting by genes would produce similar results.

Keeping in mind that counting by exons will over/under count things (depending on how spliced reads are treated):

Code:

cat foo.txt | sed -r 's/:[0-9]+//' | awk '{if($1 == gene) {count+=$2} else {if(gene != "") printf("%s\t%i\n",gene,count);gene=$1;count=$2}}END{printf("%s\t%i\n",gene,count)}' > foo.combined.txt
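
For anyone more comfortable with Python, here is a rough equivalent of the one-liner above, assuming the same two-column "gene:exon_number count" layout. The function name `sum_exon_counts` is made up for this sketch. Unlike the awk version, it strips only a trailing ":number" and does not require the exons of a gene to be on consecutive lines. The same over/under-counting caveat applies.

```python
import re


def sum_exon_counts(lines):
    """Sum per-exon counts into per-gene totals.

    lines: iterable of strings such as "14-3-3epsilon:3\t3527".
    Returns a dict mapping gene ID to summed count, in first-seen order.
    """
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The count is the last whitespace-separated field
        feature, count = line.rsplit(None, 1)
        # Drop the trailing ":<exon number>" to recover the gene ID
        gene = re.sub(r":[0-9]+$", "", feature)
        totals[gene] = totals.get(gene, 0) + int(count)
    return totals


if __name__ == "__main__":
    example = [
        "14-3-3epsilon:1\t0",
        "14-3-3epsilon:2\t0",
        "14-3-3epsilon:3\t3527",
        "14-3-3epsilon:4\t1343",
        "14-3-3epsilon:5\t1",
        "14-3-3epsilon:6\t0",
        "14-3-3epsilon:7\t57",
        "14-3-3epsilon:8\t0",
    ]
    for gene, count in sum_exon_counts(example).items():
        print(f"{gene}\t{count}")  # 14-3-3epsilon	4928
```

To run it on a real file, pass an open file handle, e.g. `sum_exon_counts(open("foo.txt"))`.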

**Minty** · 06-18-2014, 04:48 AM

Thank you for the code. I'll keep in mind the effect of spliced reads.

HTSeq exon counts to gene level
