SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXSeq - more than two different conditions yylilly RNA Sequencing 22 04-25-2019 09:00 PM
dexseq simulation Pedrissimo RNA Sequencing 1 06-08-2012 02:47 AM
DEXSeq exonic_part_number paolo.kunder Bioinformatics 1 04-12-2012 08:40 AM
DEXSeq error xguo Bioinformatics 4 12-06-2011 06:31 AM
DEXSeq and Biobase MerFer Bioinformatics 1 10-14-2011 04:50 AM

Reply
 
Thread Tools
Old 10-20-2012, 07:26 PM   #1
mathew
Member
 
Location: australia

Join Date: Jan 2011
Posts: 81
Default DEXseq

I have a 50X PE RNA-seq data and want to use DExseq. I understand that I need to get read counts from HTseq and normalize and count the expression. I have two questions:
1. in case of no replicate how differential expression is estimated?
2. For transcript expression do I have to go back and run cufflink first and then run DEseq, that will not make sense as cufflink will give me normalized count RPKM which cannot be used in DESEq. In summary how DExSeq estimate transcript expression. Any pointers will be great.


Thanks

Mathew
mathew is offline   Reply With Quote
Old 10-22-2012, 02:30 AM   #2
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Dear Matthew,

1. in case of no replicate how differential expression is estimated?

The short answer is: no differential expression can be estimated without biological replicates. You can have a look around SEQAnswers about the necessity of estimating biological variability to determine differential expression.

2. For transcript expression do I have to go back and run cufflink first and then run DEseq, that will not make sense as cufflink will give me normalized count RPKM which cannot be used in DESEq. In summary how DExSeq estimate transcript expression. Any pointers will be great.

DEXSeq does not estimate transcript expression. It tests for differential usage of exons between conditions.

For more details of discussion for both of your questions you can have a look into the DEXSeq paper:
http://genome.cshlp.org/content/22/10/2008.long

Hope it is useful!
Alejandro

Last edited by areyes; 10-22-2012 at 02:31 AM. Reason: wording
areyes is offline   Reply With Quote
Old 03-13-2013, 12:48 PM   #3
marianaboroni
Junior Member
 
Location: BH- Brasil

Join Date: Feb 2011
Posts: 5
Default

Hi!
I'm trying to use the pipeline tophat2 -> HTSEq -> DEXSeq, but I'm stuck on the DEXSEq workflow!

Here is what I did:

1) I ran TopHat2 using a junc file:
Quote:
tophat2 -o tophat_out -i 10 -I 30000 -p 8 --library-type fr-unstranded -j combined.juncs -G my.gff --transcriptomeindex=transcriptome_data chr.fa myreads_1.fastq myreads_2.fastq
2) I converted bam to the sam file:
Quote:
samtools sort -n accepted_hits.bam accepted_hits.nsorted
samtools view accepted_hits.nsorted.bam > ./accepted_hits.nsorted.sam
3) Then I produced the count table using htseq-count
Quote:
htseq-count -i ID accepted_hits.nsorted.sam my.gff -q --stranded=yes > count_HTSeq_exons.txt
4)Now I'm trying to use the DEXSeq package to test differential use of exons

Quote:
#My code in R
library("DEXSeq")
samples = data.frame(
condition = factor(c("sl","all","all","all","all")),
replicate = factor(c(1:1,1:4)),
type = c("single-read", "paired-end", "paired-end", "paired-end", "paired-end"),
row.names = factor(c("cercaria_SL_Trapping_count_HTSeq_exons","Todos_cercaria_ERR022872_count_HTSeq_exons", "Todos_cercaria_ERR022875_count_HTSeq_exons", "Todos_cercaria_ERR022877_count_HTSeq_exons", "Todos_cercaria_ERR022878_count_HTSeq_exons")),
stringsAsFactors = TRUE,
check.names = FALSE
)
annotationfile = file.path("./GFF/v5.07.08.12.chado.HTSeq.gff")
ecs = read.HTSeqCounts(countfiles = paste(rownames(samples), "txt", sep=".") , design = samples, flattenedfile = annotationfile)
sampleNames(ecs) = rownames(samples)
But then I get an error: Erro em x[[i]] : índice fora de limites

If I try the last command without the GFF file, the object ecs is created without problems, what makes me think that the problem is with my GFF file, the same I used with the HTSeq-count.

Here is a piece of my gff file:
Quote:
##gff-version 3
##sequence-region Schisto_mansoni.Chr_1 1 65476681
Schisto_mansoni.Chr_1 chado contig 1 19825 . + . ID=contig_14209;Name=null;
Schisto_mansoni.Chr_1 chado gene 11159 12750 . + . ID=Smp_186980;
Schisto_mansoni.Chr_1 chado exon 11159 11220 . + 0 ID=Smp_186980.1:exon:1;Parent=Smp_186980.1;
Schisto_mansoni.Chr_1 chado exon 12411 12750 . + 0 ID=Smp_186980.1:exon:2;Parent=Smp_186980.1;
Schisto_mansoni.Chr_1 chado mRNA 11159 12750 . + . ID=Smp_186980.1;Parent=Smp_186980;
Schisto_mansoni.Chr_1 chado polypeptide 11159 12750 . + . ID=Smp_186980.1ep;Derives_from=Smp_186980.1;timelastmodified=14.10.2011+12:49:04+BST;feature_id=23195294;
Schisto_mansoni.Chr_1 chado gene 16927 17315 . + . ID=Smp_197050;
Schisto_mansoni.Chr_1 chado exon 16927 17082 . + 0 ID=Smp_197050.1:exon:1;Parent=Smp_197050.1;
Schisto_mansoni.Chr_1 chado exon 17286 17315 . + 0 ID=Smp_197050.1:exon:2;Parent=Smp_197050.1;
Schisto_mansoni.Chr_1 chado polypeptide 16927 17315 . + . ID=Smp_197050.1ep;Derives_from=Smp_197050.1;colour=2;timelastmodified=14.10.2011+12:50:09+BST;feature_id=23195325;
Schisto_mansoni.Chr_1 chado mRNA 16927 17315 . + . ID=Smp_197050.1;Parent=Smp_197050;
Schisto_mansoni.Chr_1 chado gap 19826 20025 . + . ID=Schisto_mansoni.Chr_1.embl:gap:19825-20025;
Schisto_mansoni.Chr_1 chado contig 20026 60973 . + . ID=contig_14210;Name=null;
Schisto_mansoni.Chr_1 chado gene 51849 114890 . + . ID=Smp_160500;
Schisto_mansoni.Chr_1 chado exon 51849 51861 . + 0 ID=Smp_160500.1:exon:1;Parent=Smp_160500.1;
Schisto_mansoni.Chr_1 chado exon 51901 51987 . + 0 ID=Smp_160500.1:exon:2;Parent=Smp_160500.1;
Schisto_mansoni.Chr_1 chado exon 52024 52416 . + 0 ID=Smp_160500.1:exon:3;Parent=Smp_160500.1;
Schisto_mansoni.Chr_1 chado exon 52448 52679 . + 0 ID=Smp_160500.1:exon:4;Parent=Smp_160500.1;
Schisto_mansoni.Chr_1 chado exon 52715 52802 . + 0 ID=Smp_160500.1:exon:5;Parent=Smp_160500.1;

Any sugestion??

Thanks!
marianaboroni is offline   Reply With Quote
Old 03-13-2013, 02:37 PM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Admittedly, the error message is not very helpful, but your problem is that you used htseq-count to get your count table. However, you are supposed to use the two Python scripts that come with DEXSeq instead. Please see the vignette (last section) for details.
Simon Anders is offline   Reply With Quote
Old 05-15-2013, 11:45 AM   #5
wfan
Junior Member
 
Location: Seattle

Join Date: Oct 2010
Posts: 4
Default

Hi
I have used DEXSeq R package successfully to do alternative splicing detection (at least it gives output without error)
My question is about the novel junction/splicing patterns. Since I notice DEXSeq uses annotation file, does DEXSeq detects novel splicing pattern, for example a transcript contains some of the intron, or any scenario it can be arise in the cancer abnormal genome?

I did use TopHat with annotation, maybe in order to detect novel splicing junction I ought to use the de novo aligner to start with, instead of normal TopHat. Lastly now one can run TopHat 2 without annotation.

Maybe my question is really an aligner question rather a DEXSeq question, in order to detect novel splicing one ought to use a de novo aligner. Any suggestion for a de novo aligner?
Thanks
Wenhong
wfan is offline   Reply With Quote
Old 05-15-2013, 09:35 PM   #6
jmw86069
Member
 
Location: RTP, NC, USA

Join Date: Jun 2009
Posts: 28
Default

The counts script that comes with DEXseq that Dr. Anders referenced will produce counts per exon, which are then linked to annotated genes. No de novo detection is accomplished unless you provide a GTF file with de novo exons. Further, the junctions are not used, only the relative exon abundances are used.

However, some people here have described positive results using the junctions output from TopHat2, as if they were exons. They get fewer overall reads (since reads wholly contained within an exon, and not spanning the splice boundary, are not used), but described getting consistently reliable, and verifiable results. And with 100-base PE reads, it seems more likely to get reads spanning junctions. I'm intending to try it soon, I'm interested in the outcome. It perhaps more directly addresses the question about differential isoforms, than using relative exon abundances which I feel only indirectly infers the alternative splicing. Still, for exons that go from zero to a zillion, it works great! :-)

I have not created a GTF file from the junctions.bed output of TopHat2 -- I presume it's reasonable once you figure out the GTF details, linking junctions to parent genes. If you want to enable truly de novo work, you'd run CuffMerge (which runs CuffCompare) then use the merged.gtf file to define genes -- then link the junctions.bed entries together using those genes. That way if there is a completely novel gene locus, you'd be able to stitch together its junctions.bed entries appropriately. DEXseq will only test a gene if it has more than one child GTF feature, so you can't just feed it junctions.bed without grouping them by gene.
jmw86069 is offline   Reply With Quote
Old 05-06-2014, 07:41 AM   #7
krespim
Member
 
Location: Dresden

Join Date: Jul 2012
Posts: 49
Default

Quote:
Originally Posted by jmw86069 View Post
The counts script that comes with DEXseq that Dr. Anders referenced will produce counts per exon, which are then linked to annotated genes. No de novo detection is accomplished unless you provide a GTF file with de novo exons. Further, the junctions are not used, only the relative exon abundances are used.

However, some people here have described positive results using the junctions output from TopHat2, as if they were exons. They get fewer overall reads (since reads wholly contained within an exon, and not spanning the splice boundary, are not used), but described getting consistently reliable, and verifiable results. And with 100-base PE reads, it seems more likely to get reads spanning junctions. I'm intending to try it soon, I'm interested in the outcome. It perhaps more directly addresses the question about differential isoforms, than using relative exon abundances which I feel only indirectly infers the alternative splicing. Still, for exons that go from zero to a zillion, it works great! :-)

I have not created a GTF file from the junctions.bed output of TopHat2 -- I presume it's reasonable once you figure out the GTF details, linking junctions to parent genes. If you want to enable truly de novo work, you'd run CuffMerge (which runs CuffCompare) then use the merged.gtf file to define genes -- then link the junctions.bed entries together using those genes. That way if there is a completely novel gene locus, you'd be able to stitch together its junctions.bed entries appropriately. DEXseq will only test a gene if it has more than one child GTF feature, so you can't just feed it junctions.bed without grouping them by gene.

Hi jmw86069. did you eventually try to run DEXseq with tophat junction.bed counts? I would be interested in knowing if it worked. I tried it with DESeq2, using each junction as separate identity and it sort of worked.
krespim is offline   Reply With Quote
Old 05-08-2014, 06:51 AM   #8
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi all, I just saw a poster in a conference in which they used the exon-exon junction counts from the STAR aligner as input to DEXSeq. It seemed to work nicely for human cancer genomes...
areyes is offline   Reply With Quote
Old 05-08-2014, 07:54 AM   #9
krespim
Member
 
Location: Dresden

Join Date: Jul 2012
Posts: 49
Default de novo?

Quote:
Originally Posted by areyes View Post
Hi all, I just saw a poster in a conference in which they used the exon-exon junction counts from the STAR aligner as input to DEXSeq. It seemed to work nicely for human cancer genomes...

Great! I never used STAR, but do you remember if they used de novo junctions detected from the data?

Maybe is just my data that is too funky. The dispersion estimates are completely out of whack:

krespim is offline   Reply With Quote
Old 05-09-2014, 08:16 AM   #10
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

I think STAR does looks for de novo junctions!

Yes, the plot look funky! There seems to be three clouds of points, the lowest one is what is found in the "normal" cases. The upper clouds are the funny ones... they even seem to have the opposite mean-variance relation to the lower cloud...

I am a bit out of ideas, maybe Mike or Simon could give more feedback... does the matrix count seem a bit sparsed? What kind of data is this?

Alejandro
areyes is offline   Reply With Quote
Old 05-09-2014, 12:11 PM   #11
krespim
Member
 
Location: Dresden

Join Date: Jul 2012
Posts: 49
Default

Quote:
Originally Posted by areyes View Post
What kind of data is this?
The data comes from the junctions.bed files from tophat. I gave each junction a unique name (a combination of locus and junctions coordinates), created a matrix of counts by merging these junctions.bed for all my conditions and ran DESeq with it.

Quote:
does the matrix count seem a bit sparsed?
A lot yes. I see that many junctions have counts for only 2 or 3 of the conditions. I do get ~400 significantly expressed junctions (FDR < 0.1), but these are *all* like:

c1 c1 c2 c2
0 0 100 80
900 850 0 0

Mock numbers but it gives you an idea. Other junctions with low FDR are similar.

So, yes, funky indeed. I could use only counts for annotated junctions, but since I am interested in global mis-regulation of splicing it would sort of defy the point.
krespim is offline   Reply With Quote
Old 05-20-2014, 06:35 AM   #12
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

It seems that your junctions fall into two classes, those which behave soemwhat normal and those which make up this strange upper cloud.

I guess the upper cloud is made by junctions with zero counts in all but one sample, and in this one sample, you get a lot of counts. The higher the counts in the one non-zero sample, the larger are both dispersion and mean, and this is why the cloud curves upwards rather than downwards.

Now, if you see counts only in a single sample, and not in all replicate samples of the same condition, you should not infer much from it. And if this happens a lot, this means that the differences between your samples are driven by something which does not go along sample groups (conditions), i.e., you have a huge non-reproducibilty in your experiment.

Maybe tell us more about your experimental design and biology.
Simon Anders is offline   Reply With Quote
Old 05-20-2014, 11:49 AM   #13
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

If I was working with count data that produced the above plot, I would split out groups for the curve fitting and shrinkage of dispersion estimates, because the underlying assumption that the dispersions belong to a single cloud is violated.

Code:
dds= estimateSizeFactors(dds)
dds = estimateDispersionsGeneEst(dds)
# put here an empty vector for now
dispersions(dds) = numeric(nrow(dds))

# define a cutoff by eye
cutoff = 1
upper = which(mcols(dds)$dispGeneEst > cutoff)
lower = which(mcols(dds)$dispGeneEst <= cutoff)

ddsLower = dds[lower,]
ddsLower = estimateDispersionsFit(ddsLower)
ddsLower = estimateDispersionsMAP(ddsLower)

# just use the MLE for the upper ones
dispersions(dds)[upper] = mcols(dds[upper,])$dispGeneEst

# use the final estimates for the lower ones
dispersions(dds)[lower] = dispersions(ddsLower)

# this is just necessary for the correct labeling in the plot
mcols(dds)$dispOutlier = FALSE
mcols(dds)$dispOutlier[mcols(dds)$dispersion == mcols(dds)$dispGeneEst] = TRUE

plotDispEsts(dds)
Michael Love is offline   Reply With Quote
Old 05-23-2014, 04:54 AM   #14
krespim
Member
 
Location: Dresden

Join Date: Jul 2012
Posts: 49
Default

Thanks for the reply Simon.

Quote:
Originally Posted by Simon Anders View Post
Now, if you see counts only in a single sample, and not in all replicate samples of the same condition, you should not infer much from it. And if this happens a lot, this means that the differences between your samples are driven by something which does not go along sample groups (conditions), i.e., you have a huge non-reproducibilty in your experiment.
Yep, this seems to be true. The samples are not mine, I am just the guy trying to make sense of the data.

Quote:
Originally Posted by Simon Anders View Post
Maybe tell us more about your experimental design and biology.
In brief, experimentally, my collaborators are trying to get a very small number of cells from a very defined population. This involves a very long protocol in the bench, and my opinion the longer it is the more prone to errors and sample degradation, and in the end, they need to pool several samples together. So the replicates are more technical then biological. To solve this issue, they are now changing the experimental design. This will involve fewer steps, but a more heterogeneous tissue composition.

Their hypothesis does make sense though. They are looking at a protein expressed only a in cell niche, that interacts with splicing factors. Together with other evidence, they hypothesised that it might be regulating AS (or mis-regulating it in the kd mice).
krespim is offline   Reply With Quote
Old 05-23-2014, 04:59 AM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Well, if they see such strong differences between samples that are essentially technical replicates, then they are probably indeed well advised to try to improve their bench protocol.
Simon Anders is offline   Reply With Quote
Old 05-23-2014, 05:11 AM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

By the way: The effect of seeing a large count for a gene in one sample and zero counts in all other samples is quite common when working with sample derived from very few cells.

In such cases, it is not too uncommon that one only gets maybe around a million mRNA molecules per sample, and after reverse transcription, this is down to possibly less than a hundred thousand cDNA molecules, which then go through many rounds of PCR. If one then sequences this to a depth of a few million reads per sample, i.e., much more reads than there were initial cDNA molecules, then one is bound to see most transcript molecules several times and some individual cDNA molecules, which fared especially well in the PCR, can easily produce hundreds of reads, all showing fragments of the very same mRNA molecule.

Hence, in a setting with very low amounts of starting RNA (typically, below ~1 µg of total RNA per sample), it is quite possible to see a huge difference, say hundreds of reads in one sample and zero reads for the same gene in all other samples, and in reality, this reflects a difference of only a single lone mRNA molecule.

See also our paper on this issue:
Brenencke, Anders, Kim et al., Accounting for technical noise in single-cell RNA-seq experiments
Nature Methods, 10, 1093–1095 (2013)
doi:10.1038/nmeth.2645

Last edited by Simon Anders; 05-23-2014 at 05:15 AM.
Simon Anders is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:58 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO