SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
First 6 genes missing from HTSeq when read into edgeR sindrle Bioinformatics 3 01-24-2014 04:26 AM
EdgeR heatmap specific genes claire5 Bioinformatics 1 10-25-2013 05:56 AM
How to rescue multi-reads when using htseq to generate edgeR/DESeq counts? Hilary April Smith Bioinformatics 3 05-06-2013 12:07 PM
DESeq: problem in viewing heatmap.2 output coutellec RNA Sequencing 0 02-03-2013 08:53 AM
help with a heatmap with deseq - legend? vebaev Bioinformatics 3 03-03-2012 10:39 AM

Reply
 
Thread Tools
Old 04-08-2014, 07:28 AM   #81
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

FYI, the newer versions of DESeq2 do independent filtering for you, so you needn't normally filter things yourself (i.e., you can skip the rowMeans() step).

Code:
select <- order(p.adjust( reskeep$pvalue, method="BH" ) < .0005)
I have no idea why you're trying to do it this way, that simply won't do what you want. You're ordering a set of TRUE and FALSE boolean values, which will really not do what you want.

Code:
select <- which(p.adjust(reskeep$pvalue, method="BH")<0.1)
Also:
Code:
p.adjust( reskeep$pvalue[keep])
Aside from using order() instead of which(), this won't do what you want since you've already subset "res" to get "reskeep" (from which I presume you're making "rld", though you don't show that). You're thing selecting things based on a further subset of reskeep that is no longer meaningful and then applying that to "rld".
The above line will do what you want and use a more typical threshold of significance.
dpryan is offline   Reply With Quote
Old 04-10-2014, 05:20 PM   #82
pecanton
Member
 
Location: Florida, USA

Join Date: Jun 2011
Posts: 13
Default

Hi,

First off, I'm sorry to hijack this dicussion, but my question is very related to the original post so I didn't want to start a new thread.

I have a time course RNA seq experiment, with a control condition and 4 time points, each with 3 biological replicates. I've done TopHat2-HTseq-DESeq2 analysis to find the differentially expressed genes in each time point vs the control condition. (curiously, with my data DEseq2 predicted all but 2 or 3 of the genes predicted by EdgeR, out of tens to thousands of predictions, so that's why I've been using only DEseq2)

I now want to do some clustering analysis to find genes with more or less similar expression patterns across the time points, and yes, get a pretty heatmap by the end of the process. However, I have some doubts on how to proceed:

1) Should I do the clustering analysis with the variance stabilization of the counts table as in the DEseq2 vignette or should I use the table I've compiled with the log2 fold changes for all genes in all the time points after extracting the results of the differential expression tests? Related to this, I'd like to reduce the number genes used in the clustering. Out of ~18000 genes in my organism, at most ~5% of genes are predicted to have a significant change in expression (P-adjusted value < 0.05), so I'm curious if it's not best to reduce the genes to cluster only those that are significant in at least one of the time points. I feel all the other genes would just contribute a lot of noise and huge, unmanageable clustering.

2) I've been testing the Cluster 3-TreeView tools to get a feel of how to proceed with the heirarchical clustering analysis, but there are many parameters to put into the algorithm. Reading up papers on the subject has left me even more confused as to what is actually proper versus what produces the results I'd like to see. For example, should I center data or not, and use median or mean? Should I use Spearman correlation or euclidian distances between genes? What is better, centroid linkage or average linkage?

Any pointers on this matter would be greatly appreciated.
pecanton is offline   Reply With Quote
Old 04-11-2014, 02:51 AM   #83
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

It's generally a better idea to use the variance stabilized data when making heatmaps. How you want to filter things is up to you. Randomly subsampling things, using only the DE genes, or filtering based on variance are probably the most common choices. The method to choose depends on what you hope to gain. I personally do clustering mostly for quality control, since it's largely not that informative otherwise.

BTW, you might be interested in using WGCNA, which can be thought of as using clustering to do more rigourous groupings of genes/probes. It's an R package available on CRAN and is probably not the most user-friendly package ever, but the methods might be really appealing for the questions you have in mind.

Regarding centering, usually you'll want to center the values. After that there's no perfect answer to things. Lior Pachter wrote a blog post a while back regarding making heatmaps that you should read. I also wrote a bit about this in an answer a while back over on biostars that you might find helpful. The short answer is that there is usually no single best answer.
dpryan is offline   Reply With Quote
Old 04-11-2014, 04:16 AM   #84
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
It's generally a better idea to use the variance stabilized data when making heatmaps. How you want to filter things is up to you. Randomly subsampling things, using only the DE genes, or filtering based on variance are probably the most common choices. The method to choose depends on what you hope to gain. I personally do clustering mostly for quality control, since it's largely not that informative otherwise.

BTW, you might be interested in using WGCNA, which can be thought of as using clustering to do more rigourous groupings of genes/probes. It's an R package available on CRAN and is probably not the most user-friendly package ever, but the methods might be really appealing for the questions you have in mind.

Regarding centering, usually you'll want to center the values. After that there's no perfect answer to things. Lior Pachter wrote a blog post a while back regarding making heatmaps that you should read. I also wrote a bit about this in an answer a while back over on biostars that you might find helpful. The short answer is that there is usually no single best answer.
Hi Ryan
on you post on http://www.biostars.org/p/91978/#91984 , you mentioned that it is a really really bad idea to use Euclidian distance to draw heatmap on RNA-seq data, so what do command you prefer to draw heatmap clustering?
I have just followed the http://bioconductor.org/packages/dev.../doc/DESeq.pdf
Thank you!

Last edited by super0925; 04-11-2014 at 04:32 AM.
super0925 is offline   Reply With Quote
Old 04-11-2014, 04:40 AM   #85
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

@super0925: Lior mentioned the issues with euclidean distance in his post (I was just referring to that in my answer) so I would recommend just reading that and seeing what he recommended there (Mahalanobis distance if I remember correctly, though I recall him not being completely enthralled with any particular method).
dpryan is offline   Reply With Quote
Old 04-11-2014, 05:44 AM   #86
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
It's mostly a question of how aggressively to quality trim after removing adapter sequences. I fall into the the "trim gently" camp, so I trim off adapters and bases with a phred score of 5 or below. While more aggressive trimming can probably increase the mapping rate, it will generally drastically decrease the overall number of mapped reads (and generally not improve accuracy that much). This is at least the case for RNAseq, other experiment types may be different.
Hi Ryan so how about the duplicates?I have seen someone on blog said better not to remove the duplicate?
super0925 is offline   Reply With Quote
Old 04-11-2014, 05:48 AM   #87
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

That someone was correct. Since you'll often have highly expressed short genes, marking duplicates would artificially deflate read counts and decrease your ability to detect DE genes. What might make sense is to look for sudden jumps in depth along genes and perhaps marking duplicates according to that, though I've never seen anyone do that and I kind of doubt it'd be that beneficial (though I could be wrong there).

BTW, my name is Devon (I know, Ryan is also a common first name).
dpryan is offline   Reply With Quote
Old 04-11-2014, 06:05 AM   #88
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
That someone was correct. Since you'll often have highly expressed short genes, marking duplicates would artificially deflate read counts and decrease your ability to detect DE genes. What might make sense is to look for sudden jumps in depth along genes and perhaps marking duplicates according to that, though I've never seen anyone do that and I kind of doubt it'd be that beneficial (though I could be wrong there).

BTW, my name is Devon (I know, Ryan is also a common first name).
Thank you Devon!
super0925 is offline   Reply With Quote
Old 04-11-2014, 08:42 AM   #89
pecanton
Member
 
Location: Florida, USA

Join Date: Jun 2011
Posts: 13
Default

Quote:
Originally Posted by dpryan View Post
It's generally a better idea to use the variance stabilized data when making heatmaps. How you want to filter things is up to you. Randomly subsampling things, using only the DE genes, or filtering based on variance are probably the most common choices. The method to choose depends on what you hope to gain. I personally do clustering mostly for quality control, since it's largely not that informative otherwise.

BTW, you might be interested in using WGCNA, which can be thought of as using clustering to do more rigourous groupings of genes/probes. It's an R package available on CRAN and is probably not the most user-friendly package ever, but the methods might be really appealing for the questions you have in mind.

Regarding centering, usually you'll want to center the values. After that there's no perfect answer to things. Lior Pachter wrote a blog post a while back regarding making heatmaps that you should read. I also wrote a bit about this in an answer a while back over on biostars that you might find helpful. The short answer is that there is usually no single best answer.
That post by lior Pachter was really informative. Thank you for that. I also read your post at Biostars. I think I may settle on filtering the variance stabilized matrix by the DE genes and NOT using the euclidian distance between expression data. However, I'm wondering how it will look. In the matrix I have 15 columns, 3 replicates for each of the 4 time points and the control condition. The table after the DE analysis and the log 2 changes only has 4, since DEseq2 has already evaluated the replicates into one result, and it implicitly has the comparison to the control condition. I'd think longer vectors for each gene would allow better clustering through any algorithm, though I still get the impression that it's for debate which one is more accurate, just that one should shy away from those that cluster mainly on min or max values.
pecanton is offline   Reply With Quote
Old 04-20-2014, 01:38 PM   #90
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
You don't do a DE analysis without replicates. The best you can do is look at the ranked fold-changes (have a look at the GFold package, which tries to do this in a somewhat more useful way). Personally, I wouldn't waste my time without good reason.
Hi Devon I remember that last time you are talking biological or technological replicates.
I think that how many minimum or maximum replicates in library preparation is based on the experiment design, but could we decide it from Deseq/edgeR results (e.g. outliers or expression levels) or we only decide it just based on the experience ? because as you know, the more replicates requires the more money for sequencing.
My understanding, the more replicates is more useful theoretically, am I right?
In library preparation, what is the impact of reads length and depth? are they the more the better ? and if the depth is more important than length?
Thank you!
super0925 is offline   Reply With Quote
Old 04-21-2014, 02:41 AM   #91
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Biological replicates are the most important. In an ideal world, you'd do a pilot experiment and then use a tool like scotty to determine how many replicates you would need for the power you want.
dpryan is offline   Reply With Quote
Old 05-11-2014, 05:23 PM   #92
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
Biological replicates are the most important. In an ideal world, you'd do a pilot experiment and then use a tool like scotty to determine how many replicates you would need for the power you want.
Hi how do you uniform the gene name between different methods (Texudo vs Htseq+edgeR)?
Now I am doing RNA-Seq on the bovine, my reference genome use Ensembl database.
As you know, the Texudo pipeline: Tophat2-Cufflinks-Cuffmerge-Cuffdiff.
After the Cuffmerge and get the merged.gtf which could give you gene name output only but not the Ensembl ID. However, the gene name of Cuffdiff, some are like "Q865A2_BOVIN", which made me confused, cause it is not official gene name.
Is there any methods to uniform gene names (whatsoever Ensembl ID, gene name or UCSC ID ,.etc.) across different methods?
I think I could do Tophat2-Cuffdiff without Cufflinks+Cuffmerge to do this pipeline. However, I am not sure I am right and is there any other way to uniform them?
Thank you!
super0925 is offline   Reply With Quote
Old 05-11-2014, 05:54 PM   #93
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Originally Posted by dpryan
....FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly.
...
Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.
...
.
Could you please elaborte on these two points please?

When Im analysing expression profiles I export one table with RPKM, one with CPM and also one with normalised counts from DEseq2.

However I don't know which one is the best? So far I have used CPM values, because the have correlated best with RT-PCR delta CT values. Also with corresponding protein values.

How do these three methods relate? What are pros and cons?
sindrle is offline   Reply With Quote
Old 05-12-2014, 01:55 AM   #94
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

CPM is basically a non-length-normalized variant of RPKM, so it has many of the same pros and cons. In an ideal world, RPKM should be the most accurate of the three for many things, since if you have differences in major isoform length between groups then it can better represent that. Having said that, it's a bit more difficult to calculate accurately and then use well statistically (look at the evolution of cufflinks for an example). The normalized counts from DESeq2 (or edgeR) are mostly used for visualization and not much more. In most cases, there's not one "best" thing. These are all metrics that are more/less useful for a variety of things. If CPM has better predictive power in your samples then I wouldn't argue much with that. All the theoretical discussions in the world have to take a back seat to what validates experimentally.
dpryan is offline   Reply With Quote
Old 05-12-2014, 04:21 AM   #95
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Hi Devon and Zapages,
Have you seen my questions on post #92 ? Thank you!
I am confused on how to uniform the gene name or ID between Tuxedo and count based pipelines. The different gene annotation file is not convenient especially I don't know how to deal with sth like "Q865A2_BOVIN"

Last edited by super0925; 05-12-2014 at 04:23 AM.
super0925 is offline   Reply With Quote
Old 05-12-2014, 04:27 AM   #96
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by super0925 View Post
Hi how do you uniform the gene name between different methods (Texudo vs Htseq+edgeR)?
Now I am doing RNA-Seq on the bovine, my reference genome use Ensembl database.
As you know, the Texudo pipeline: Tophat2-Cufflinks-Cuffmerge-Cuffdiff.
After the Cuffmerge and get the merged.gtf which could give you gene name output only but not the Ensembl ID. However, the gene name of Cuffdiff, some are like "Q865A2_BOVIN", which made me confused, cause it is not official gene name.
Is there any methods to uniform gene names (whatsoever Ensembl ID, gene name or UCSC ID ,.etc.) across different methods?
I think I could do Tophat2-Cuffdiff without Cufflinks+Cuffmerge to do this pipeline. However, I am not sure I am right and is there any other way to uniform them?
Thank you!
Q865A2_BOVIN looks like a uniprot ID (with a "_BOVIN" suffix, perhaps), so just convert that with Biomart or R (there's probably a bovine annotation package). BTW, this is usually not an issue since the IDs output by cufflinks and those used by htseq-count can be the same (they're both sourced from the annotation file you provide, so you can just tell htseq-count to count according to the same ID).
dpryan is offline   Reply With Quote
Old 05-12-2014, 04:42 AM   #97
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
Q865A2_BOVIN looks like a uniprot ID (with a "_BOVIN" suffix, perhaps), so just convert that with Biomart or R (there's probably a bovine annotation package). BTW, this is usually not an issue since the IDs output by cufflinks and those used by htseq-count can be the same (they're both sourced from the annotation file you provide, so you can just tell htseq-count to count according to the same ID).
Which R package could do this? I know how to convert the Ensembl gene ID list to gene names list on Biomart website http://www.ensembl.org/biomart/martview/ ,but it is manually (I mean I need to copy and pasted one list by one list), not automatically.
super0925 is offline   Reply With Quote
Old 05-12-2014, 04:47 AM   #98
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

org.Bt.eg.db from Bioconductor. It has a uniprot to ensembl conversion map. BTW, you can also script biomart with the biomaRt package in R (this is very useful for things like ID conversions and fetching sequence information).
dpryan is offline   Reply With Quote
Old 05-12-2014, 05:58 AM   #99
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 206
Default

Quote:
Originally Posted by dpryan View Post
org.Bt.eg.db from Bioconductor. It has a uniprot to ensembl conversion map. BTW, you can also script biomart with the biomaRt package in R (this is very useful for things like ID conversions and fetching sequence information).
I have two questions now:
1. I have tried the Biomart(http://www.ensembl.org/biomart/martview/) on Q865A2_BOVIN but I didn't find the result...
2. I also try bioDBnet (http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php) and it is not work as well.
Could you please try it? (may take you 1 minute) I think may be I wrongly choose the catalog (e.g. UnigeneID, Gene ID) of the input or output.

The result of Q865A2_BOVIN is OAS1 (please see (http://www.uniprot.org/uniprot/Q865A2), btw what is '2'-5'-oligoadenylate synthetase', could it be transfer to from Biomart or bioDBnet?

Thank you!

Last edited by super0925; 05-12-2014 at 06:05 AM.
super0925 is offline   Reply With Quote
Old 05-12-2014, 06:10 AM   #100
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Perhaps the uniprot IDs haven't been merged into biomart yet. Since you have an annotation file, the conversions are contained in it. So just load it into R with GenomicRanges and then get the conversions from the mcols() of the resulting GRanges object.

A "2'-5'-oligoadenylate synthetase" would add ATP to oligos (if you google this, you'll find that apparently this particular method ends up activating ribonucleases and leading to RNA degradation).
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:56 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2022, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO