SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
First 6 genes missing from HTSeq when read into edgeR sindrle Bioinformatics 3 01-24-2014 04:26 AM
EdgeR heatmap specific genes claire5 Bioinformatics 1 10-25-2013 05:56 AM
How to rescue multi-reads when using htseq to generate edgeR/DESeq counts? Hilary April Smith Bioinformatics 3 05-06-2013 12:07 PM
DESeq: problem in viewing heatmap.2 output coutellec RNA Sequencing 0 02-03-2013 08:53 AM
help with a heatmap with deseq - legend? vebaev Bioinformatics 3 03-03-2012 10:39 AM

Reply
 
Thread Tools
Old 03-31-2014, 12:12 PM   #61
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
Have a look at my answer over on biostars to a similar question. That should tell you most of what you want to know (particularly given the included links and replies from others).
Hi dpryan
If I only have one sample in each condition, how do I do the DE analysis?
Cause I think I need at least 2 replicates in each condition and then doing the DE statistical by DESeq/edgeR/Cuffdiff. Am I right?

If I don't run Cufflinks but running Htseq-counts,
Could I use this kind of method to measure it?

gene1: (a1/s1)-(a2/s2)

a1 is the counts (i.e. the number of mapped reads) of this gene in sample1
s1 is the total mapped reads in sample 1
a2 is the counts (i.e. the number of mapped reads) of this gene in sample2
s2 is the total mapped reads in sample 2

Or if I use FPKM calculated by Cifflinks to measure it ? i.e. FPKM1-FPKM2?
super0925 is offline   Reply With Quote
Old 03-31-2014, 12:15 PM   #62
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You don't do a DE analysis without replicates. The best you can do is look at the ranked fold-changes (have a look at the GFold package, which tries to do this in a somewhat more useful way). Personally, I wouldn't waste my time without good reason.
dpryan is offline   Reply With Quote
Old 03-31-2014, 03:02 PM   #63
Zapages
Member
 
Location: NJ

Join Date: Oct 2012
Posts: 97
Default

Quote:
Originally Posted by dpryan View Post
You don't do a DE analysis without replicates. The best you can do is look at the ranked fold-changes (have a look at the GFold package, which tries to do this in a somewhat more useful way). Personally, I wouldn't waste my time without good reason.
I would recommend Gfold as well. Also I made Wrapper/GUI for it on iPlant Collaborative. I am going to test it out this or next week to see if it works properly on it.
Zapages is offline   Reply With Quote
Old 04-01-2014, 06:00 AM   #64
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by Zapages View Post
I would recommend Gfold as well. Also I made Wrapper/GUI for it on iPlant Collaborative. I am going to test it out this or next week to see if it works properly on it.
Hi, dpryan and zapages, thank you both !
super0925 is offline   Reply With Quote
Old 04-01-2014, 08:14 AM   #65
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
You don't do a DE analysis without replicates. The best you can do is look at the ranked fold-changes (have a look at the GFold package, which tries to do this in a somewhat more useful way). Personally, I wouldn't waste my time without good reason.
I found that the ranking of DE genes list (e.g. Top 1000 up/down regulated genes) are different in DESeq,edgeR,Cuffdiff,baySeq,...... (Of course they are different), now including this Gfold method, which is not reply on P-value but ''generalized fold change''.
How do I final make the decision by ranking Significant genes? Each methods has their advantage and claimed their most reliable.

I remember last time I told you I am thinking to calculate the overlap over Top 1000 genes list among the 3-5 methods. And then decide to use the method with biggest overlap. Am I right? Or is this measurement too rude and naive ?Or any other methods from your suggestion?

Last edited by super0925; 04-01-2014 at 08:27 AM.
super0925 is offline   Reply With Quote
Old 04-01-2014, 09:37 AM   #66
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

It doesn't matter which one you pick, they'll all be only vaguely meaningful at best...that's the problem with not having replicates.
dpryan is offline   Reply With Quote
Old 04-02-2014, 01:59 AM   #67
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
It doesn't matter which one you pick, they'll all be only vaguely meaningful at best...that's the problem with not having replicates.
Sorry, I didn't say it clearly, what I mean is the real dataset with replicate. Of course the ranking of TOP1000 DE genes are different from cuffdiff/Deseq/edgeR/Gfold/limma...... So I said do I need to do 3-5 simultaneously and then compare the overlap? The method with the biggest overlap is the best one? Cheers.
super0925 is offline   Reply With Quote
Old 04-02-2014, 02:05 AM   #68
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

If you have replicates then there's no point in using GFold, so just exclude it. Aside from that, the best way to go forward is to make a relatively small list of candidates that aren't shared between the different packages and then see if they validate using other means (qPCR or whatever you prefer). Then you have an idea which package is better modeling your particular dataset. Try to pick candidates with relatively similar characteristics (i.e., fold-change and adjusted p-value) just to make it a more apples-to-apples comparison.

BTW, usually the most significant genes overlap pretty heavily among the various tools. It's normally at the margins of significance where you see differences (unsurprisingly).
dpryan is offline   Reply With Quote
Old 04-04-2014, 09:34 PM   #69
nbahlis
Member
 
Location: Canada

Join Date: May 2013
Posts: 25
Default DEseq and Heatmap

Can someone please advise what is the best script to create a heatmap with only differentially expressed genes? I've tried the script below and I am not sure this is correct way to do it

> select <- order(p.adjust( res$pvalue[keep], method="BH" ) < .1 )
> heatmap.2( assay(rld) [ select, ], scale="row", cexRow=0.5, cexCol=0.75,
+ trace="none", dendrogram="column",
+ col = rev(heat.colors(25)))

thank you
nbahlis is offline   Reply With Quote
Old 04-05-2014, 02:49 AM   #70
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The general concept of that seems reasonable. My only concern would be if you're first subsetting "res$pvalue" and if you haven't already done this with "rld" (or the object you used to make it" then the "select" indices may not be comparable. Otherwise, that seems reasonable.
dpryan is offline   Reply With Quote
Old 04-05-2014, 07:06 AM   #71
nbahlis
Member
 
Location: Canada

Join Date: May 2013
Posts: 25
Default

thank you Ryan, do you mean I need to transform my data with rld before I select a subset?
nbahlis is offline   Reply With Quote
Old 04-05-2014, 10:34 AM   #72
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

No, my concern is whether the genes from " res$pvalue[keep]" correspond to those from "assay(rld)" or not, since in one case you're subsetting an object and the other not. Without knowing what you did in the previous steps, it's impossible to say of the indices in "select" are appropriate or not. Presumably you did things such that this makes sense, but this is unclear.
dpryan is offline   Reply With Quote
Old 04-06-2014, 06:13 PM   #73
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
You don't do a DE analysis without replicates. The best you can do is look at the ranked fold-changes (have a look at the GFold package, which tries to do this in a somewhat more useful way). Personally, I wouldn't waste my time without good reason.
I have tried Gfold on my dataset . It seems works well. However, my samples are strand. I use strand (Tophat setting) and unstrand (the defaulf of Tophat) to map my samples and get the sam/bam file separately. However, the Top 100 genes ranking by GFold is almost same between strand and unstrand.Is it normal?
super0925 is offline   Reply With Quote
Old 04-07-2014, 05:15 AM   #74
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

That would seem normal to me. If you used a stranded protocol then all of the counts and mappings should be very similar (they'd be identical in a perfect world).
dpryan is offline   Reply With Quote
Old 04-07-2014, 11:53 PM   #75
nbahlis
Member
 
Location: Canada

Join Date: May 2013
Posts: 25
Default

Hi Ryan. The script I used does not seem to filter the genes based on padj < 0.1. I have followed the DESeq2 vignette and the only change I am introducing is the selection step for the genes based on the padj
nbahlis is offline   Reply With Quote
Old 04-08-2014, 01:49 AM   #76
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Then presumably that should work fine.
dpryan is offline   Reply With Quote
Old 04-08-2014, 06:20 AM   #77
nbahlis
Member
 
Location: Canada

Join Date: May 2013
Posts: 25
Default

Quote:
Originally Posted by dpryan View Post
Then presumably that should work fine.
Hi Ryan,

What is the right way to draw the heat map based on genes with padj < 0.1 instead of selecting "Rowmeans" as in the vignette? The script I have written does NOT select genes based on their padj < 0.1

thank you for your help
nbahlis is offline   Reply With Quote
Old 04-08-2014, 06:27 AM   #78
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You would just use a different index, such as:

Code:
select <- which(p.adjust( res$pvalue[keep], method="BH" ) < .1 )
dpryan is offline   Reply With Quote
Old 04-08-2014, 06:28 AM   #79
nbahlis
Member
 
Location: Canada

Join Date: May 2013
Posts: 25
Default

Quote:
Originally Posted by dpryan View Post
Then presumably that should work fine.
here is the script I've written:
> sampleFiles <- list.files(path="~/Desktop/HTseq_counts/", pattern="*.counts")
> Table3 <- data.frame(
+ sampleName = sampleFiles, fileName = sampleFiles,
+ condition = c( "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "post", "post", "post", "post", "post", "post", "post", "post", "post", "post", "post", "post"),
+ libType = c("pair8", "pair10", "pair9", "pair1", "pair7", "pair11", "pair2", "pair3", "pair4", "pair12", "pair13", "pair6", "pair7", "pair1", "pair3", "pair4", "pair11", "pair2", "pair6", "pair10", "pair9", "pair8", "pair12", "pair13"))
> directory <- c("~/Desktop/HTseq_counts/")
> design <- formula(~ libType + condition)
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable= Table3, directory= directory, design= design)
Usage note: the following factors have 3 or more levels:

libType

For DESeq2 versions < 1.3, if you plan on extracting results for
these factors, we recommend using betaPrior=FALSE as an argument
when calling DESeq().
As currently implemented in version 1.2, the log2 fold changes can
vary if the base level is changed, when extracting results for a
factor with 3 or more levels. A solution will be implemented in
version 1.3 which allows for the use of a beta prior and symmetric
log2 fold change estimates regardless of the selection of base level.
> Table3
sampleName fileName condition libType
1 P110.counts P110.counts pre pair8
2 P124.counts P124.counts pre pair10
3 P149.counts P149.counts pre pair9
4 P185.counts P185.counts pre pair1
5 P189.counts P189.counts pre pair7
6 P192.counts P192.counts pre pair11
7 P218.counts P218.counts pre pair2
8 P227.counts P227.counts pre pair3
9 P235.counts P235.counts pre pair4
10 P274.counts P274.counts pre pair12
11 P321.counts P321.counts pre pair13
12 P351.counts P351.counts pre pair6
13 P357.counts P357.counts post pair7
14 P367.counts P367.counts post pair1
15 P377.counts P377.counts post pair3
16 P384.counts P384.counts post pair4
17 P426.counts P426.counts post pair11
18 P543.counts P543.counts post pair2
19 P584.counts P584.counts post pair6
20 P590.counts P590.counts post pair10
21 P594.counts P594.counts post pair9
22 P610.counts P610.counts post pair8
23 P653.counts P653.counts post pair12
24 P654.counts P654.counts post pair13
> ddsHTSeq
class: DESeqDataSet
dim: 51911 24
exptData(0):
assays(1): counts
rownames(51911): ENSG00000000003 ENSG00000000005 ... ENSG00000259170
ENSG00000259171
rowData metadata column names(0):
colnames(24): P110.counts P124.counts ... P653.counts P654.counts
colData names(2): condition libType
> dds <- DESeq(ddsHTSeq)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
1515 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
> res <- results(dds)
> res <- res[order(res$padj),]
> plotMA(dds, main="DESeq2", ylim=c(-2,2))
> plotDispEsts(dds)
> hist( res$pvalue, breaks=100)
> filterThreshold <- 5
> keep <- rowMeans( counts( dds, normalized=TRUE)) > filterThreshold
> table(p.adjust(res$pvalue, method="BH") < 0.1)

FALSE TRUE
44016 1036
> table(p.adjust(res$pvalue[keep], method="BH") < 0.1)

FALSE TRUE
17543 1684
> hist( res$pvalue[keep], breaks=100)
> reskeep <- results(dds[keep])
> reskeep <- reskeep[order(reskeep$padj),]
> library( "org.Hs.eg.db" )
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DBI

> cols(org.Hs.eg.db)
[1] "ENTREZID" "PFAM" "IPI" "PROSITE" "ACCNUM"
[6] "ALIAS" "CHR" "CHRLOC" "CHRLOCEND" "ENZYME"
[11] "MAP" "PATH" "PMID" "REFSEQ" "SYMBOL"
[16] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME"
[21] "UNIPROT" "GO" "EVIDENCE" "ONTOLOGY" "GOALL"
[26] "EVIDENCEALL" "ONTOLOGYALL" "OMIM" "UCSCKG"
Warning message:
In cols(org.Hs.eg.db) :
'cols' has been deprecated and replaced by 'columns' for versions of
Bioc that are higher than 2.13. Please use 'columns' anywhere that
you previously used 'cols'
> convertIDs <- function( ids, fromKey, toKey, db, ifMultiple=c( "putNA", "useFirst" ) ) {
+ stopifnot( inherits( db, "AnnotationDb" ) )
+ ifMultiple <- match.arg( ifMultiple )
+ suppressWarnings( selRes <- AnnotationDbi::select(
+ db, keys=ids, keytype=fromKey, cols=c(fromKey,toKey) ) )
+ if( ifMultiple == "putNA" ) {
+ duplicatedIds <- selRes[ duplicated( selRes[,1] ), 1 ]
+ selRes <- selRes[ ! selRes[,1] %in% duplicatedIds, ] }
+ return( selRes[ match( ids, selRes[,1] ), 2 ] ) }
> reskeep$symbol <- convertIDs( row.names(reskeep), "ENSEMBL", "SYMBOL", org.Hs.eg.db )
> head(reskeep)
DataFrame with 6 rows and 7 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000118193 352.79648 -1.3133381 0.1805654 -7.273477 3.503503e-13
ENSG00000090889 486.81411 -1.5000754 0.2177141 -6.890116 5.574692e-12
ENSG00000166851 591.71901 -1.4069394 0.2097238 -6.708535 1.965874e-11
ENSG00000213401 45.86075 -3.6684941 0.5520076 -6.645731 3.017157e-11
ENSG00000144554 1137.21388 -0.8662289 0.1334898 -6.489104 8.634842e-11
ENSG00000204592 27901.79642 0.6212955 0.0959501 6.475194 9.469050e-11
padj symbol
<numeric> <character>
ENSG00000118193 7.130679e-09 KIF14
ENSG00000090889 5.673085e-08 KIF4A
ENSG00000166851 1.333714e-07 PLK1
ENSG00000213401 1.535205e-07 NA
ENSG00000144554 3.212059e-07 FANCD2
ENSG00000204592 3.212059e-07 HLA-E

> select <- order(p.adjust( reskeep$pvalue, method="BH" ) < .0005)
> heatmap.2( assay(rld) [ select, ], scale="row", cexRow=0.5, cexCol=0.75,
+ trace="none", dendrogram="column",
+ col = redgreen(75))
> select <- order(p.adjust( reskeep$pvalue[keep], method="BH" ) < .00005)
> heatmap.2(assay(rld)[select,],col = redgreen(75),trace="none", cexRow=0.75, cexCol=0.75)
> select <- order(p.adjust( reskeep$pvalue[keep], method="BH" ) < .00000005)
> heatmap.2( assay(rld) [ select, ], scale="row", cexRow=0.5, cexCol=0.75,
+ trace="none", dendrogram="column",
+ col = redgreen(75))
> select <- order(p.adjust( reskeep$padj[keep], method="BH" ) < .00000005)
> heatmap.2( assay(rld) [ select, ], scale="row", cexRow=0.5, cexCol=0.75,
+ trace="none", dendrogram="column",
+ col = redgreen(75))
nbahlis is offline   Reply With Quote
Old 04-08-2014, 07:26 AM   #80
nbahlis
Member
 
Location: Canada

Join Date: May 2013
Posts: 25
Default

thanks Ryan. For whatever reason "select <- which(p.adjust( res$pvalue[keep], method="BH" ) < .1 )" does not select in the heatmap for the genes with padj<0.1. The list of the genes seems to be randomly picked up
nbahlis is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:11 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO