SEQanswers

Old 09-15-2014, 06:18 AM   #1
rpauly
Member
 
Location: Atlanta

Join Date: Apr 2011
Posts: 32
Default SOM (Self-organising Maps) to identify expression trends in time-course data

Hi,

I am curious to know if anyone has employed SOM (Self-organising Maps) to identify expression trends in time-course data (RNASeq/Exome seq)?

~Thanks,
Rini
Old 09-15-2014, 06:33 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

There are a few papers that use SOM with RNAseq in general (a time-course exome-seq experiment would rarely make any sense), though I don't recall that they use it in the context of a time-course experiment (there's no reason that wouldn't work though). Just search PubMed for them if all you need are some papers.
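
For readers unfamiliar with the method, here is a minimal sketch of how a SOM could group time-course profiles. It is a plain NumPy toy and is not taken from any of the papers mentioned. The function names, grid size, and the learning-rate and neighbourhood schedules are all assumptions chosen for illustration, and the input is assumed to be a genes x time-points matrix of already normalised expression values.

```python
import numpy as np

def train_som(profiles, grid=(3, 3), n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Fit a small rectangular SOM to row-wise expression profiles (toy sketch)."""
    profiles = np.asarray(profiles, dtype=float)
    rng = np.random.default_rng(seed)
    n_genes = profiles.shape[0]
    rows, cols = grid
    # Position of every node on the 2-D grid, shape (rows * cols, 2).
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Start each node's weight vector at a randomly chosen gene profile.
    weights = profiles[rng.integers(0, n_genes, size=rows * cols)].copy()
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                 # learning rate decays to zero
        sigma = sigma0 * (1.0 - frac) + 0.3     # neighbourhood shrinks but stays > 0
        x = profiles[rng.integers(n_genes)]     # pick one gene at random
        # Best-matching unit: node whose weights are closest to this profile.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Gaussian neighbourhood on the grid, centred on the BMU.
        grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        # Pull the BMU and its grid neighbours towards the profile.
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign_nodes(profiles, weights):
    """Return, for each gene, the index of its closest SOM node."""
    profiles = np.asarray(profiles, dtype=float)
    d2 = ((profiles[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

Each row of the returned weights is a prototype expression trend, and genes assigned to the same node share that trend. Replicates are not modelled here; in this sketch they would have to be averaged beforehand.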
Old 09-16-2014, 08:06 AM   #3
rpauly
Member
 
Location: Atlanta

Join Date: Apr 2011
Posts: 32
Default

Thank you for the quick reply!
But I do not see so many papers with SOM and RNASEQ (maybe 3?).

Do you have specific references that you have come across?

~Thanks!
Old 09-16-2014, 08:08 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

There aren't that many since it's not a terribly popular method. Also, there are probably some papers using it on microarrays (the same concepts will apply).
Old 02-02-2015, 06:00 PM   #5
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166
Default

There are better statistical modelling approaches available, such as GPclust. It allows biological replicates to be properly used. It's made for normally distributed data, but you could transform your counts to be normally distributed.
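
As a rough sketch of the count transformation mentioned above: log2 counts-per-million is one simple option. It is not part of GPclust, it only makes the data roughly (not exactly) normal, and dedicated variance-stabilising transforms may behave better at low counts. The function name and the pseudocount are assumptions for illustration, and every sample is assumed to have at least one read.

```python
import numpy as np

def log_cpm(counts, pseudocount=1.0):
    """log2 counts-per-million for a genes x samples matrix of raw counts."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)       # total reads per sample (column)
    cpm = counts / lib_sizes * 1e6       # scale each sample to one million reads
    return np.log2(cpm + pseudocount)    # pseudocount avoids log2(0)
```

Scaling by library size first means two samples that differ only in sequencing depth end up with the same transformed values.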
Old 02-04-2015, 04:47 AM   #6
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

Quote:
Originally Posted by Dario1984
There are better statistical modelling approaches available, such as GPclust. It allows biological replicates to be properly used. It's made for normally distributed data, but you could transform your counts to be normally distributed.
Not wanting to hijack the thread, but I'd be interested in trying GPclust on one large RNASeq dataset I'm working on currently. Did you use it, and if so, is it worth spending time to try it out?
Old 02-05-2015, 12:27 AM   #7
JamesHensman
Junior Member
 
Location: UK

Join Date: May 2013
Posts: 2
Default GPclust

GPclust works well for me, but then I'm the author.

You can find the code and some demo IPython notebooks here http://staffwww.dcs.sheffield.ac.uk/...n/gpclust.html

If your data is cleanish, GPclust can provide nice results like this one
https://drive.google.com/file/d/0Bz7...ew?usp=sharing

Apologies for the self promotion -- I'm happy to help if there are other questions.
Old 02-05-2015, 02:27 AM   #8
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

Quote:
Originally Posted by JamesHensman
GPclust works well for me, but then I'm the author.

You can find the code and some demo IPython notebooks here http://staffwww.dcs.sheffield.ac.uk/...n/gpclust.html

If your data is cleanish, GPclust can provide nice results like this one
https://drive.google.com/file/d/0Bz7...ew?usp=sharing

Apologies for the self promotion -- I'm happy to help if there are other questions.
I'll try it out then. How would you define "cleanish"? My data is a large (200+ samples) and rather deep (30 Million+ reads per sample) expression set aligned to a draft transcriptome assembly, which is not completely cleaned up (i.e. there will be some redundancy in the transcontigs and some contaminating species transcripts). Also, some of the sample points/replicates had problems (degraded RNA, contamination from other species etc.) so they will introduce some errors (the worst samples were selected out, however).

No offense, but my experience is that "clean" data is the exception, and is mostly encountered as example datasets in bioinformatic publications describing analysis methods. I'd be interested in a method which is robust to the problems described above.
Old 02-05-2015, 03:45 AM   #9
JamesHensman
Junior Member
 
Location: UK

Join Date: May 2013
Posts: 2
Default

I'd say your data needs to be free of outliers and other nasty behaviour.

I've taken to filtering signals for signal-to-noise ratio, by dividing the variance of the replicate means by the mean of the replicate variances. You can still cluster 1000s of genes with gpclust, but too many genes which are just noise will confuse it.

I would say that a method that deals with the problems you describe probably depends mostly on good data munging, rather than the method itself.
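
The signal-to-noise filter described above could be sketched roughly as follows. This is one reading of the description, not GPclust code: "replicate means" is taken to be the mean over replicates at each time point, with its variance computed across time points. The array layout (genes x time points x replicates), the function name, and any cut-off applied to the result are assumptions.

```python
import numpy as np

def replicate_snr(expr):
    """Per-gene signal-to-noise ratio.

    expr: array of shape (n_genes, n_timepoints, n_replicates).
    """
    expr = np.asarray(expr, dtype=float)
    rep_means = expr.mean(axis=2)            # mean over replicates, per time point
    rep_vars = expr.var(axis=2, ddof=1)      # variance over replicates, per time point
    signal = rep_means.var(axis=1, ddof=1)   # variance of the replicate means over time
    noise = rep_vars.mean(axis=1)            # mean of the replicate variances
    with np.errstate(divide="ignore", invalid="ignore"):
        return signal / noise                # inf or nan if a gene has zero noise
```

Genes could then be kept with something like `expr[replicate_snr(expr) > threshold]`, where the threshold has to be chosen per dataset.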
Old 02-05-2015, 04:19 AM   #10
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

Quote:
Originally Posted by JamesHensman
I'd say your data needs to be free of outliers and other nasty behaviour.
By (partly manually) looking (PCA, time plots...) at some 20 genes with known behaviour, we've removed obvious outlier samples.

Quote:
Originally Posted by JamesHensman
I've taken to filtering signals for signal-to-noise ratio, by dividing the variance of the replicate means by the mean of the replicate variances. You can still cluster 1000s of genes with gpclust, but too many genes which are just noise will confuse it.
By "too many genes which are just noise" do you mean that they show erratic behaviour over time or over replicate - or both? I've looked at a lot of expression data over the last few years, and typically most genes show erratic behaviour in either respect - be it due to uncontrolled biological or environmental variation, imperfect replication or whatever technical difficulties thereafter.

Quote:
Originally Posted by JamesHensman
I would say that a method that deals with the problems you describe probably depends mostly on good data munging, rather than the method itself.
Agreed - but be careful in distancing your method too far from the application of it. Most "successful" (widely used) bioinformatic packages (e.g. for variant detection or expression analysis) include documentation and hands-on examples (using published, real datasets with all kinds of biases and noise, not only simulated data) on recommended practices for raw data pre-processing, normalization, filtering of "noisy" samples, filtering of "noisy" genes etc... While I understand that researchers do not always have the time for maintaining such documentation or providing support to users, IMO this is a key to "success" (in the sense of getting a well cited paper).
I must admit I haven't read your IPython notebooks yet, so I'd better do that now...
Old 02-05-2015, 05:18 AM   #11
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

So I've checked your IPython notebooks, and the examples make sense to me. The Kalinka dataset is microarray data on log2-scale - correct? So I should be able to use RNA-Seq count data processed with DESeq2's rlog or vst, right? (http://www.bioconductor.org/packages....pdf#section.2) I guess I should skip your normalization step then, however...

You might find that more novices (at least here on SEQanswers) would want a walkthrough for data from an RNA-Seq experiment, e.g. from an unprocessed count table from HTSeq-count.

I could try this the hard way, but would it be feasible to cluster >=10 000 genes in reasonable time (given that they are well filtered)? If not, are there steps in the algorithm that could be parallelized to achieve that?

Last edited by sarvidsson; 02-05-2015 at 05:30 AM. Reason: grammar
Old 02-08-2015, 05:00 PM   #12
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166
Default

Quote:
Originally Posted by sarvidsson
Did you use it, and if so, is it worth spending time to try it out?
I haven't tried it.

Tags
bioinformatics, expression trends, som, time-course
