Hi,
I am writing because am trying to figure out the concept behind the python script that counts reads per exon for DEXSEQ analysis.
Seems like the count output does not explain what I see in IGV and so the statistics can be off in DEXSEQ and render my analysis pointless.
This is what I have done:
1. created flattened gff from ensembl gtf
python ~/bin/DEXSeq/python_scripts/dexseq_prepare_annotation.py ensembl.63.genes.gtf ensembl.63.genes.gff
2. counted reads in query sorted bam file:
samtools view sorted.bam | python ~/bin/DEXSeq/python_scripts/dexseq_count.py --paired=yes -s no -a 10 ensembl.63.genes.gff - dexseq.counts.txt
3. created samples object for DEXSEQ in R.
I am using DEXSeq_1.4.0. in R Studio 0.97.248
samples = data.frame(
condition = factor(c("unaffected","affected")),
replicate = factor(c(1:1,1:1)),
row.names = factor(c("A","B")),
stringsAsFactors = TRUE,
check.names = FALSE
)
annotationfile = "ensembl.63.genes.gff"
fullFilenames<- list.files("/Users/sszelinger/Documents/PROJECTS/DEXSEQ",full.names=TRUE,pattern="counts.txt")
ecs<- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)
Looking up counts for gene of interest
countTableForGene(ecs, "ENSG00000XXXXXX", normalized=FALSE, withDispersion=FALSE)
Exon count for sample A
E001 141
E002 66
E003 45
E004 5
E005 48
E006 37
E007 31
E008 36
E009 64
E010 35
Exon Count for sample B
E001 187
E002 95
E003 66
E004 2
E005 38
E006 29
E007 7
E008 32
E009 55
E010 27
You can see in "A" for exon 4 "E004" there is count=5 and for "B" for exon 4 count = 2. A is unaffected and B is affected.
When I pull up IGV, it seems like that "A" has a lot more reads than "B". A has an average coverage of 20 and B has about 3. In my view B is a great case for exon skipping between case and control, and the integrated DNA-RNA analysis suggest this but the counts do not pick it up. So I have to wonder what are the rules for counting the reads in dexseq? Is it based on HTSeq? If so is dexseq_count.py uses union, intersection_strict, or intersection_non_empty. From the IGV trace it looks like to me A should have at least 10 times the coverage as B at same position.
Can someone explain this or give me hints.
If the python script does not count split reads or those that do not map more than X bases within the exon than I understand, but should be documented with the python script so we can make more educated decisions.
I feel that there a real problem here in those situations where we want to detect exon skipping and the low count numbers based on the python script might be misleading. Especially if the counts biased in both case and control subjects into a direction where no significant difference might be conculded between them.
I appreciate any comment. Hope the IGV trace loads up fine..
Thanks very much,
Szabi
I am writing because am trying to figure out the concept behind the python script that counts reads per exon for DEXSEQ analysis.
Seems like the count output does not explain what I see in IGV and so the statistics can be off in DEXSEQ and render my analysis pointless.
This is what I have done:
1. created flattened gff from ensembl gtf
python ~/bin/DEXSeq/python_scripts/dexseq_prepare_annotation.py ensembl.63.genes.gtf ensembl.63.genes.gff
2. counted reads in query sorted bam file:
samtools view sorted.bam | python ~/bin/DEXSeq/python_scripts/dexseq_count.py --paired=yes -s no -a 10 ensembl.63.genes.gff - dexseq.counts.txt
3. created samples object for DEXSEQ in R.
I am using DEXSeq_1.4.0. in R Studio 0.97.248
samples = data.frame(
condition = factor(c("unaffected","affected")),
replicate = factor(c(1:1,1:1)),
row.names = factor(c("A","B")),
stringsAsFactors = TRUE,
check.names = FALSE
)
annotationfile = "ensembl.63.genes.gff"
fullFilenames<- list.files("/Users/sszelinger/Documents/PROJECTS/DEXSEQ",full.names=TRUE,pattern="counts.txt")
ecs<- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)
Looking up counts for gene of interest
countTableForGene(ecs, "ENSG00000XXXXXX", normalized=FALSE, withDispersion=FALSE)
Exon count for sample A
E001 141
E002 66
E003 45
E004 5
E005 48
E006 37
E007 31
E008 36
E009 64
E010 35
Exon Count for sample B
E001 187
E002 95
E003 66
E004 2
E005 38
E006 29
E007 7
E008 32
E009 55
E010 27
You can see in "A" for exon 4 "E004" there is count=5 and for "B" for exon 4 count = 2. A is unaffected and B is affected.
When I pull up IGV, it seems like that "A" has a lot more reads than "B". A has an average coverage of 20 and B has about 3. In my view B is a great case for exon skipping between case and control, and the integrated DNA-RNA analysis suggest this but the counts do not pick it up. So I have to wonder what are the rules for counting the reads in dexseq? Is it based on HTSeq? If so is dexseq_count.py uses union, intersection_strict, or intersection_non_empty. From the IGV trace it looks like to me A should have at least 10 times the coverage as B at same position.
Can someone explain this or give me hints.
If the python script does not count split reads or those that do not map more than X bases within the exon than I understand, but should be documented with the python script so we can make more educated decisions.
I feel that there a real problem here in those situations where we want to detect exon skipping and the low count numbers based on the python script might be misleading. Especially if the counts biased in both case and control subjects into a direction where no significant difference might be conculded between them.
I appreciate any comment. Hope the IGV trace loads up fine..
Thanks very much,
Szabi
Comment