SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
[DEXSeq] exon counts to "PSI" (exon inclusion level) yerbol Bioinformatics 3 11-23-2015 05:32 PM
edgeR spliceVariants: gene- and exon-level dispersion bstephenwhite@gmail.com Bioinformatics 3 09-11-2014 03:47 PM
exon expression level Ohad RNA Sequencing 6 06-10-2014 05:59 AM
DEXSeq - Counting with HT-seq at the exon level Yohann Bioinformatics 1 05-09-2014 11:35 AM
DEXSeq gene level counts Julien Roux Bioinformatics 3 11-28-2012 12:31 AM

Reply
 
Thread Tools
Old 06-17-2014, 04:15 AM   #1
Minty
Junior Member
 
Location: Finland

Join Date: Jun 2014
Posts: 6
Default HTSeq exon counts to gene level

Hi,

I'm working with drosophila melanogaster RNA-Seq data and at least one of my genes of interest is overlapping an other gene so I used HTSeq with exon counts but now I don't know how to combine the individual exon counts to gene level (so all the exon counts of a gene are counted for that gene for further analysis). Is it possible (more easily than with manually counting them in an excel file or something like that)? I have 18 samples and overs 50000 exons so there should be an easier way. Can the combining be done in R or Unix environment?

If the HTSeq exon counts can't be combined, what should be used instead of HTSeq when the exon has to be the starting point to show the expression of overlapping genes?
Minty is offline   Reply With Quote
Old 06-17-2014, 04:18 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Why did you count exons if you want gene-level metrics. Htseq-count can deal with overlapping genes without issue.
dpryan is offline   Reply With Quote
Old 06-17-2014, 04:24 AM   #3
Minty
Junior Member
 
Location: Finland

Join Date: Jun 2014
Posts: 6
Default

I'm actually just a summer worker now and my supervisor did the gene level counts but the most important gene didn't show any expression in the data. It is located in a promoter area of an other gene. With exon counts it shows the expected expression and some collaborator used the exon data but I can't get his code before he comes back from his holiday and I'm stuck.
Minty is offline   Reply With Quote
Old 06-17-2014, 04:29 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

It sounds like your supervisor did it wrong. In any case, you might just want to redo the counting, either with htseq-count or with featureCounts (which is much faster).
dpryan is offline   Reply With Quote
Old 06-17-2014, 04:34 AM   #5
Minty
Junior Member
 
Location: Finland

Join Date: Jun 2014
Posts: 6
Default

She used this code for the gene counts:
python -m HTSeq.scripts.count -m union -s no -a 10 -t gene -i ID $name.n-sorted.q10.sam $ANNOT/$gf_gene > $name.n-sorted.q10.$out_gene

The ANNOT is a directory path and $name is used to run the whole list with additional running code. What should I change to get the gene counts right?
Minty is offline   Reply With Quote
Old 06-17-2014, 04:40 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Without seeing the annotation file, my guess is that whatever gene you're getting 0 counts for overlaps slightly this other gene, so using "-m union" is losing a lot of counts. So, try "-m intersection_nonempty" instead.

If one gene completely overlaps the other, then without a stranded dataset there's no way to say from which of the overlapping genes a given read originates, thus saying that it has 0 counts would be appropriate. Granted, you might be able to guess from the coverage profile that most/all of them actually come from one or the other gene, but then you're using an expectation maximization method rather than direct counting. Alternatively, just remove the overlapping portion of one of the genes and then mention that whenever the data gets published. You'd likely get hammered by the reviewers, but that would be completely appropriate.
dpryan is offline   Reply With Quote
Old 06-17-2014, 04:49 AM   #7
Minty
Junior Member
 
Location: Finland

Join Date: Jun 2014
Posts: 6
Default

The overlap is complete, that really is the problem. Checking from the manual (before even running the exon counts) all three options for -m just say ambiguous. That's why I came here (after intense googling).

Edit.
As I know a little of programming I would assume that the exons could be combined with some script. My skills just aren't good enough yet, so any help with just transforming a gff file from:
14-3-3epsilon:1 0
14-3-3epsilon:2 0
14-3-3epsilon:3 3527
14-3-3epsilon:4 1343
14-3-3epsilon:5 1
14-3-3epsilon:6 0
14-3-3epsilon:7 57
14-3-3epsilon:8 0

to:
14-3-3epsilon 4928

is appreciated.

Last edited by Minty; 06-17-2014 at 05:02 AM.
Minty is offline   Reply With Quote
Old 06-17-2014, 05:03 AM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

No count-based method will ever return anything other than 0 in that circumstance, since doing otherwise would be wrong.

If you look at the read distribution in a genome browser (IGV or similar) and it's very obvious in all samples that the reads aligning to this region only originate from one of the genes then you might be able to get away with just removing the overlapping portion of one of the genes, which would result in the other not having 0 counts. This is, of course, also a good way to produce results that don't match the underlying biology. One would hope that your collaborator, who did something similar, thought about this issue (I wouldn't hold my breath).
dpryan is offline   Reply With Quote
Old 06-17-2014, 05:08 AM   #9
Minty
Junior Member
 
Location: Finland

Join Date: Jun 2014
Posts: 6
Default

The gene of interest has it's exon in an area that the other gene doesn't have any exons so the removal of the overlapping part of the other gene could be an options if nothing else works.
Minty is offline   Reply With Quote
Old 06-17-2014, 05:52 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Just recounting by genes would produce similar results.

Keeping in mind that counting by exons will over/under count things (depending on how spliced reads are treated):
Code:
cat foo.txt | sed -r 's/:[0-9]+//' | awk '{if($1 == gene) {count+=$2} else {if(gene != "") printf("%s\t%i\n",gene,count);gene=$1;count=$2}}END{printf("%s\t%i\n",gene,count)}' > foo.combined.txt
dpryan is offline   Reply With Quote
Old 06-18-2014, 04:48 AM   #11
Minty
Junior Member
 
Location: Finland

Join Date: Jun 2014
Posts: 6
Default

Thank you for the code. I'll keep in mind the affect of spliced reads.
Minty is offline   Reply With Quote
Reply

Tags
count, exon, gene, htseq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:10 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO