SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXSeq Using Counts File From htseq-count FuzzyCoder Bioinformatics 20 01-03-2016 11:18 PM
Problem with HTSeq dexseq_count.py script SanderEST Bioinformatics 7 04-01-2013 08:54 AM
singletons in dexseq_count.py Jetse Bioinformatics 1 12-14-2012 12:05 AM
DEXSeq gene level counts Julien Roux Bioinformatics 3 11-28-2012 12:31 AM
DEXseq_count.py error newbietonextgen Bioinformatics 1 06-27-2012 04:02 PM

Reply
 
Thread Tools
Old 02-08-2013, 01:54 PM   #1
sszelinger
Junior Member
 
Location: arizona

Join Date: Apr 2009
Posts: 3
Default DEXSEQ dexseq_count.py counts

Hi,
I am writing because am trying to figure out the concept behind the python script that counts reads per exon for DEXSEQ analysis.
Seems like the count output does not explain what I see in IGV and so the statistics can be off in DEXSEQ and render my analysis pointless.

This is what I have done:
1. created flattened gff from ensembl gtf
python ~/bin/DEXSeq/python_scripts/dexseq_prepare_annotation.py ensembl.63.genes.gtf ensembl.63.genes.gff

2. counted reads in query sorted bam file:
samtools view sorted.bam | python ~/bin/DEXSeq/python_scripts/dexseq_count.py --paired=yes -s no -a 10 ensembl.63.genes.gff - dexseq.counts.txt

3. created samples object for DEXSEQ in R.
I am using DEXSeq_1.4.0. in R Studio 0.97.248

samples = data.frame(
condition = factor(c("unaffected","affected")),
replicate = factor(c(1:1,1:1)),
row.names = factor(c("A","B")),
stringsAsFactors = TRUE,
check.names = FALSE
)

annotationfile = "ensembl.63.genes.gff"
fullFilenames<- list.files("/Users/sszelinger/Documents/PROJECTS/DEXSEQ",full.names=TRUE,pattern="counts.txt")

ecs<- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)

Looking up counts for gene of interest
countTableForGene(ecs, "ENSG00000XXXXXX", normalized=FALSE, withDispersion=FALSE)

Exon count for sample A
E001 141
E002 66
E003 45
E004 5
E005 48
E006 37
E007 31
E008 36
E009 64
E010 35

Exon Count for sample B
E001 187
E002 95
E003 66
E004 2
E005 38
E006 29
E007 7
E008 32
E009 55
E010 27

You can see in "A" for exon 4 "E004" there is count=5 and for "B" for exon 4 count = 2. A is unaffected and B is affected.
When I pull up IGV, it seems like that "A" has a lot more reads than "B". A has an average coverage of 20 and B has about 3. In my view B is a great case for exon skipping between case and control, and the integrated DNA-RNA analysis suggest this but the counts do not pick it up. So I have to wonder what are the rules for counting the reads in dexseq? Is it based on HTSeq? If so is dexseq_count.py uses union, intersection_strict, or intersection_non_empty. From the IGV trace it looks like to me A should have at least 10 times the coverage as B at same position.

Can someone explain this or give me hints.
If the python script does not count split reads or those that do not map more than X bases within the exon than I understand, but should be documented with the python script so we can make more educated decisions.
I feel that there a real problem here in those situations where we want to detect exon skipping and the low count numbers based on the python script might be misleading. Especially if the counts biased in both case and control subjects into a direction where no significant difference might be conculded between them.
I appreciate any comment. Hope the IGV trace loads up fine..
Thanks very much,
Szabi
Attached Images
File Type: jpg A_vs_B.jpg (48.4 KB, 31 views)
sszelinger is offline   Reply With Quote
Old 02-11-2013, 01:56 PM   #2
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi @sszelinger,

Thanks for your interest in DEXSeq!

It sounds strange. First of all, I would double check in the gtf that the E004 corresponds to the genomic region that you are observing in the browser. Note that the python script assigns new exon ID's to each unique non-overlapping exonic region based on all its annotated transcripts, so it might not correspond to the exon number 4 of the gene model from the browser.

My second guess would be regarding the quality of the alignments, the python script to count reads discards those reads with quality below 10 (this includes multiple hits). Maybe some of the alignments seen in the browser were filtered by the script.

Alejandro
areyes is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:15 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO