SEQanswers

Old 05-25-2010, 07:40 AM   #1
zorph
Member
 
Location: FL

Join Date: May 2010
Posts: 40
Default RNA-seq output

Hello,
I've analyzed my data and have a bunch of Wig files. The overall goal of my project is to compare different transcriptomes. Is there a user-friendly program that will allow me to do this?

FYI: I ran my samples on SOLiD and analyzed them using BioScope.
Old 05-26-2010, 08:31 PM   #2
Davis McC
Member
 
Location: Melbourne

Join Date: May 2010
Posts: 16
Default DE Analysis Pipeline with edgeR

Hi zorph

I am one of the developers of the Bioconductor package edgeR, which is designed for carrying out differential expression analysis of count data (like RNA-seq). Check out the User's Guide for more details and for case studies that show how to use the package.

I'm not familiar with Wig files and can't tell what sort of analysis you've carried out already, but colleagues of mine suggest the following steps to go from raw RNA-seq short-read data (the raw fasta files) through to GO category testing. You may find this sort of analysis pipeline useful.

Steps with required tools & files

To perform the entire analysis, the following steps and tools will be needed:

1. Get some short read RNA-seq data, for at least two different experimental conditions you wish to compare

2. Choose a reference to map against, and map your data using a short read mapper that outputs in SAM format. We tend to use bowtie. Other options are bwa, SOAP2, novoalign, shrimp.

3. Use SAMtools to convert SAM output into the binary BAM format, which is both smaller on disk and allows for fast indexing.

4. Summarize reads on the gene/transcript/exon level. We use the R platform with the Rsamtools and GenomicFeatures packages.

5. Calculate DE genes from counts summarized on the gene level. We use the R package edgeR, which we have developed, although there are other tools out there. edgeR can account for biological variation in the data (using a negative binomial model), separate biological from technical variation, produce an MDS plot, and conduct exact testing procedures.

6. Perform GO category testing on the results of the differential expression analysis, using the R package goseq.
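To make step 4 more concrete, here is a minimal Python sketch of what "summarizing reads on the gene level" means. It is not how Rsamtools, GenomicFeatures or htseq-count work internally: the gene coordinates and SAM lines are made up, and a read is assigned by its leftmost mapped position only, ignoring CIGAR strings, strand, and reads that overlap several genes.

```python
from collections import Counter

# Hypothetical gene annotation: name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

def count_reads_per_gene(sam_lines, genes):
    """Count mapped reads whose leftmost position falls inside a gene."""
    counts = Counter({name: 0 for name in genes})
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 4:                      # flag bit 0x4 marks an unmapped read
            continue
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return dict(counts)

# Made-up alignments: two mapped to geneB, one to geneA, one unmapped.
SAM_LINES = [
    "@SQ\tSN:chr1\tLN:2000",
    "r1\t0\tchr1\t150\t255\t35M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t900\t255\t35M\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t950\t255\t35M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

gene_counts = count_reads_per_gene(SAM_LINES, GENES)
print(gene_counts)
```

The resulting table of raw counts per gene (one such column per sample) is the kind of input that step 5 works on.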

Considerations for DE Analysis

Extra-Poisson variation (or overdispersion) is typical of RNA-seq data, especially if there is biological replication amongst your samples. If you only have technical replicates then this may not be an issue, but I would recommend running your data through edgeR to get some idea of the inter-library variability. If you have overdispersed data, then using a Poisson model will *drastically* overestimate the levels of differential expression in your data. Using an NB model like the one in edgeR can account for this extra variation in the data and give a much better assessment of DE.

edgeR can deal with overdispersion in the data, investigate inter-library (incl. biological) variability and get exact p-values for DE based on the NB model.
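A rough numeric illustration of the overdispersion point, as a Python sketch. This is not edgeR's exact test; it is a crude z statistic that plugs the observed counts in as means, and the dispersion value of 0.1 is an assumption chosen only for illustration. The NB variance is taken as mean + dispersion * mean^2, which reduces to the Poisson variance when the dispersion is 0.

```python
import math

def nb_variance(mean, dispersion):
    """Negative binomial variance; equals the Poisson variance when dispersion is 0."""
    return mean + dispersion * mean ** 2

def naive_z(count_a, count_b, dispersion):
    """Crude z statistic for the difference of two counts (observed counts used as means)."""
    variance = nb_variance(count_a, dispersion) + nb_variance(count_b, dispersion)
    return (count_b - count_a) / math.sqrt(variance)

# Same pair of counts (100 vs 150 reads), two different noise models.
z_poisson = naive_z(100, 150, dispersion=0.0)   # Poisson: variance = mean
z_nb = naive_z(100, 150, dispersion=0.1)        # NB with an assumed dispersion of 0.1

print(round(z_poisson, 2), round(z_nb, 2))
```

Under the Poisson assumption the 50% increase looks clearly significant (z above 3), while with the assumed extra variation the same counts give a z below 1.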

Hope that is helpful and good luck with your data analysis. Please ask if you have any more questions I might be able to help with.

Best regards
Davis
Old 05-27-2010, 12:06 AM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi

Davis gave a nice summary of how to do it.

Two additional points (which I mainly put to advertise our software):

- An alternative to edgeR is our package, DESeq. DESeq's method is based on edgeR's, but different in a number of points (and we think, of course, that this makes it better). See our paper for the exact differences.

The main point, however, is that you get a proper analysis only if you have a method that can, as Davis writes, "deal with overdispersion in the data, investigate inter-library (incl. biological) variability", and to my knowledge, edgeR and its derivative, DESeq, are the only tools currently available that do this properly.

- While both edgeR and DESeq are easy enough to use that even users unfamiliar with R will manage, the summarization might be a bit more tricky. An alternative is htseq-count.

Simon
Old 05-31-2010, 10:25 AM   #4
Livi81
Member
 
Location: BC, Canada

Join Date: Apr 2010
Posts: 21
Default

I'm also looking for a biologist-friendly way to analyse RNA-seq output files. I saw that Partek seems to have some nice software. How does it compare to edgeR and DESeq?
Thanks
Old 05-31-2010, 01:02 PM   #5
jgibbons1
Senior Member
 
Location: Worcester, MA

Join Date: Oct 2009
Posts: 133
Default

Hi Livi81,
I've played around with the trial version of the Partek software quite a bit and was not thrilled with it. My major problem was speed. I was working with Illumina data sets consisting of 25 million reads each. Even with 4 GB of RAM, the software stalled and froze my computer a few times. You may be able to bypass this by bringing mapped output in, rather than doing the mapping in the Partek software. I found a UNIX or R environment to be much better for me. It's worth calling the company for a free trial, though.
Old 07-14-2010, 09:15 AM   #6
townway
Member
 
Location: Rockville

Join Date: May 2009
Posts: 40
Default

Before moving to step 4 (summarize reads on the gene/transcript/exon level):

Do you think it is necessary to remove the reads mapping to rRNA and pseudogene regions? And do you know how to do this from a SAM file?

Thank you
Old 07-14-2010, 09:41 AM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by townway View Post
Do you think it is necessary to remove the reads mapping to rRNA and pseudogene regions? And do you know how to do this from a SAM file?
No, why should this be necessary? After step 4, you have a table of counts, with one row for each gene. Provided the rRNA and pseudogenes were in your annotation, there will also be rows for these, and you can conveniently kick out these rows if you don't like them. You can also leave them in. After all, if you get counts for a pseudogene, the gene may not be that 'pseudo' and you may want to look at it. And the counts for rRNA may be informative for judging the effectiveness of the rRNA removal step of your sample prep. Of course, any differential expression that edgeR or DESeq may report for them will be biologically meaningless.
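As a small illustration of "kicking out rows" after counting, here is a Python sketch on a made-up count table. The gene IDs, the counts, and the set of unwanted IDs are all hypothetical; in practice the IDs would come from whatever annotation you counted against.

```python
# Hypothetical gene-level count table: gene ID -> raw counts in three samples.
count_table = {
    "GENE1": [120, 98, 143],
    "RRNA_18S": [900, 760, 1010],
    "PSEUDO7": [3, 0, 5],
    "GENE2": [15, 22, 19],
}

# Hypothetical IDs flagged as rRNA or pseudogene in the annotation.
unwanted = {"RRNA_18S", "PSEUDO7"}

def drop_rows(table, ids_to_drop):
    """Return a copy of the table without the given gene IDs."""
    return {gene: counts for gene, counts in table.items() if gene not in ids_to_drop}

def fraction_of_reads(table, gene_ids, sample_index):
    """Fraction of one sample's counted reads that fall on the given genes."""
    total = sum(counts[sample_index] for counts in table.values())
    selected = sum(table[g][sample_index] for g in gene_ids if g in table)
    return selected / total

# Look at the rRNA share first (a check on the sample prep), then filter.
rrna_fraction = fraction_of_reads(count_table, {"RRNA_18S"}, sample_index=0)
filtered = drop_rows(count_table, unwanted)

print(round(rrna_fraction, 3), sorted(filtered))
```

Checking the rRNA fraction before dropping the rows keeps the quality-control information that would be lost by filtering reads out of the SAM file beforehand.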

Simon
Old 07-15-2010, 12:11 PM   #8
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Default

Hopefully this question is allowed on the RNA-seq forum.
I'm interested to get opinions from the developers of edgeR and DESeq (and others) about whether the statistical analyses in these packages are appropriate for a couple of other types of biological count data. Specifically, I work on metagenomic analyses of complex microbial communities (in soils, plants, water etc.). The type of data I'm working with is sequencing reads, typically produced on 454, that are then annotated through various pipelines. The outcome is a set of counts of genes with particular annotations or in specific functional categories. The genome space of the community is certainly greatly undersampled, as in many RNA-seq experiments, but the magnitude of the differences in counts is smaller. Could I apply one of these packages to my data? I have biological but not technical replicates at the moment. The second type of data is counts of the number of organisms in particular phylogenetic categories, and this data is closer to RNA-seq data in that there are a few highly abundant categories and a long tail of low-abundance types. Again, I have biological but not technical replicates.
Old 07-20-2010, 09:39 AM   #9
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi

in principle both edgeR and DESeq are suitable for any kind of count data for which the model fits. The assumption of a negative binomial distribution is quite robust; the crucial question is the variance-mean relation.

To do proper statistics, you need to have a reasonable estimate of the variance for each gene (or gene category, or species, or clade, or whatever it is you count in meta-genomics). As one typically has only few replicates, one needs to assume that genes of similar expression strength (or clades of similar abundance, or whatever) have similar variance.

In case of DESeq, there are diagnostics (the variance residuals, visualized with 'residualEcdfPlot' and also used to find 'variance outliers') that allow you to check how well this model fits, so that you know whether you can put trust in your results.
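To show what "the variance-mean relation" refers to, here is a Python sketch that estimates a per-category dispersion by the method of moments from made-up replicate counts. It is not DESeq's or edgeR's estimation procedure (both share information across genes rather than estimating each one alone); it assumes equal library sizes, and the clade names and counts are hypothetical.

```python
from statistics import mean, variance

# Hypothetical counts for three biological replicates of one condition.
replicate_counts = {
    "cladeA": [10, 14, 6],
    "cladeB": [100, 160, 70],
    "cladeC": [1000, 1500, 800],
}

def moment_dispersion(counts):
    """Method-of-moments estimate of phi in var = mu + phi * mu^2, floored at 0."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (var - mu) / mu ** 2)

dispersions = {name: moment_dispersion(c) for name, c in replicate_counts.items()}

for name, phi in dispersions.items():
    print(name, round(mean(replicate_counts[name]), 1), round(phi, 3))
```

With only three replicates each single estimate is very noisy, which is why the assumption that categories of similar abundance have similar variance is needed.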

So, yes, it is worth a try, and I'd be very interested to hear how it goes.

Cheers
Simon
Old 07-20-2010, 02:32 PM   #10
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Default

Thank you, Simon. I will try out DESeq on my data and let you know how it goes. BTW I would also be very interested in a way to compare multiple treatments. In the present project I have 3 treatments each with 3 biological replicates.
Old 08-20-2010, 04:01 AM   #11
crh
Member
 
Location: tx

Join Date: Dec 2009
Posts: 46
Default

Simon and Davis,

I have 4 sets of SOLiD reads (control & 3 experimental) that I'd like to generate DE for. There are no replicates for these samples.

I was initially planning to simply normalize against the control (RPKM), but this no longer seems to be the way to go. Will either edgeR or DESeq generate DE for non-replicated data sets?

thanks

Charles
Old 08-20-2010, 04:24 AM   #12
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi Charles

Short answer: no. You cannot get useful results from an experiment without replication, no matter what tool you use. (Why do people keep wasting their time and money on producing such data?)

Longer answer: DESeq has a mode to work with data without replicates that can give you at least those genes which really stick out by having way larger fold-changes than the rest. However, you might see only a small part of your potential hits.
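For intuition only, here is a Python sketch of what "genes that stick out by fold-change" can look like without replicates. It is not DESeq's no-replicate mode; it simply ranks genes by absolute log2 fold change. The counts are made up, the two libraries are assumed to be of equal size, and the pseudocount of 1 is an arbitrary choice to avoid dividing by zero.

```python
import math

# Hypothetical raw counts for one control and one treated library.
control = {"g1": 100, "g2": 50, "g3": 0, "g4": 2000}
treated = {"g1": 110, "g2": 400, "g3": 3, "g4": 1900}

def log2_fold_changes(ctrl, trt, pseudocount=1.0):
    """log2((treated + pseudocount) / (control + pseudocount)) for every gene."""
    return {g: math.log2((trt[g] + pseudocount) / (ctrl[g] + pseudocount)) for g in ctrl}

lfc = log2_fold_changes(control, treated)

# Rank genes by the size of the change, largest first.
ranked = sorted(lfc, key=lambda g: abs(lfc[g]), reverse=True)

for gene in ranked:
    print(gene, round(lfc[gene], 2))
```

Note that g3 ranks second on the strength of only three reads. Without replicates there is no way to tell whether such a change exceeds the normal sample-to-sample variation.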

Simon
Old 08-23-2010, 11:18 AM   #13
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

Simon, I am curious, what kind and how many replicates are you suggesting?
The best I have seen is tumor-normal pairs for RNA-seq data, but by replicates do you mean redundant lanes of data per sample which are somehow averaged to get a more accurate read-out?
Old 08-24-2010, 12:57 AM   #14
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by bioinfosm View Post
Simon, I am curious, what kind and how many replicates are you suggesting?
The best I have seen is tumor-normal pairs for RNA-seq data, but by replicates do you mean redundant lanes of data per sample which are somehow averaged to get a more accurate read-out?
Not quite. Imagine Charles finds a couple of genes which are, in one of his treatments, upregulated by 50% in comparison to the value in the controls, and he writes in his paper that these genes are obviously responding to the treatment.

Somebody else performs the same control experiment but does it twice, with two independent samples, and notices that Charles' genes differ between the two control samples by around 50%, too. This invalidates the initial conclusion that the genes' upregulation is due to the treatment, as it happens without treatment as well. Without replicates, you would never know.

So, all I am talking about is the old-fashioned rule that you should do every experiment several times in order to see how much the measured quantities change even if you don't change anything. While this is considered absolutely required in most subfields of biology, for some reason, people forget about it once they use high-throughput sequencing.

What you suggested, i.e., spreading a given sample over several lanes (called "technical replicates" by some), will not help at all with this; nevertheless, it might be necessary in addition if you work with organisms with large exomes.

Tumor-normal sample pairs are proper replicates, of course, if you have several pairs. The specific issue with paired samples is that DESeq cannot deal with them at the moment (and neither can edgeR) but we are working on it.

Simon
Old 08-24-2010, 07:33 PM   #15
quix
Junior Member
 
Location: New York

Join Date: Aug 2010
Posts: 6
Default

Thanks to Davis for the great advice! Such information is really useful for beginners like me. I learned a lot from the discussion here.

I am fine with steps 1-3 and 6. However, I am not very clear about the Rsamtools software mentioned in step 4 or the DE gene calculation in step 5. Can anybody give a few more details about these?

Is it possible to run this software on my PC?

One more question: how do I assess the quality of RNA-seq output data?

I don't major in bioinformatics and I know these questions look naive... Thanks for your answers.

Quix

Old 08-25-2010, 01:23 AM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by quix View Post
I am fine with steps 1-3 and 6. However, I am not very clear about the Rsamtools software mentioned in step 4 or the DE gene calculation in step 5. Can anybody give a few more details about these?
To my knowledge, there is no good explanation yet on how to use Rsamtools for this task. The ones found on the web have a few issues (see this post).

Hence, maybe you want to give my htseq-count tool a try.

Quote:
Is it possible to run this software on my PC?
Usually yes.

Quote:
One more question: how do I assess the quality of RNA-seq output data?
For a first look, use htseq-qa or FastQC. Once you have counts, compare your replicates to see how well they agree. (I plan to add a section to the DESeq documentation on how to do that.)

Simon

Old 08-25-2010, 08:03 AM   #17
quix
Junior Member
 
Location: New York

Join Date: Aug 2010
Posts: 6
Default

Thanks Simon for your kind reply,

About the replicates, I have a further question.
I have submitted my samples for RNA-seq (1, control; 2, protein treatment for 1 hr; 3, protein treatment for 2 hrs). What I have done is pool the RNA samples from three independent experiments (ctrl x3, 1 hr x3, 2 hrs x3). For each experiment, I have verified that the protein works on my cells.
Is this biological replication?

Thanks
Quix
Old 08-25-2010, 11:35 AM   #18
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by quix View Post
About the replicates, I have a further question.
I have submitted my samples for RNA-seq (1, control; 2, protein treatment for 1 hr; 3, protein treatment for 2 hrs). What I have done is pool the RNA samples from three independent experiments (ctrl x3, 1 hr x3, 2 hrs x3). For each experiment, I have verified that the protein works on my cells.

Is this biological replication?
Not quite. Maybe re-read my post #14. You will only see the average of your three replicates. How would you know that the spread within replicates (the within-group variance, in the terminology of ANOVA) is not as large as the differences that you observe between conditions (the between-groups variance)? Without this, you cannot calculate a p value and can only make wild guesses about the statistical significance of your findings.
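The within-group versus between-group comparison can be sketched in a few lines of Python. This is a plain one-way ANOVA F statistic on made-up expression values for a single gene, shown only to illustrate what pooling throws away; it is not the count-based test that edgeR or DESeq perform, and all numbers are hypothetical.

```python
from statistics import mean

# Hypothetical normalized expression of one gene: 3 conditions x 3 separate replicates.
groups = {
    "control": [100.0, 140.0, 60.0],
    "1hr": [150.0, 110.0, 190.0],
    "2hr": [160.0, 120.0, 200.0],
}

def anova_f(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    values = [v for g in groups.values() for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within

# Pooling the RNA before sequencing leaves you with (roughly) only these averages.
pooled_means = {name: mean(vals) for name, vals in groups.items()}
f_stat = anova_f(groups)

print(pooled_means, round(f_stat, 4))
```

The pooled averages (100, 150, 160) suggest a 50 to 60% increase, yet with the replicates kept separate the F statistic is only about 1.94, well below the 5% critical value of roughly 5.14 for 2 and 6 degrees of freedom. From the pooled values alone, this could not be seen.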

Why didn't you use multiplexing (i.e., bar-coding tags next to the sequencing primer) to keep your samples separable before pooling them into a sequencing lane?

Simon
Old 08-29-2010, 04:15 PM   #19
yh253
Member
 
Location: Ireland

Join Date: Jul 2009
Posts: 16
Default

Quote:
Originally Posted by Simon Anders View Post
Tumor-normal sample pairs are proper replicates, of course, if you have several pairs. The specific issue with paired samples is that DESeq cannot deal with them at the moment (and neither can edgeR) but we are working on it.

Simon
Hi Simon,

I've been a bit confused by this 'sample pairs' concept. I have RNA-seq data from two samples, wild-type and knock-down, with 4 'biological replicates' for each. All four replicates of each sample were sequenced on different lanes of a single flow cell, with two samples per lane using multiplexing. Is my data regarded as 'pairs', which can't be analyzed by DESeq or edgeR? So far, I have gene-level read counts (RPKM values) from ERANGE and am going to proceed to DE analysis. If I can't use either of the two packages, can you suggest other tools for this purpose?
Old 08-30-2010, 01:25 AM   #20
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by yh253 View Post
I've been a bit confused by this 'sample pairs' concept. I have RNA-seq data from two samples, wild-type and knock-down, with 4 'biological replicates' for each. All four replicates of each sample were sequenced on different lanes of a single flow cell, with two samples per lane using multiplexing. Is my data regarded as 'pairs', which can't be analyzed by DESeq or edgeR? So far, I have gene-level read counts (RPKM values) from ERANGE and am going to proceed to DE analysis. If I can't use either of the two packages, can you suggest other tools for this purpose?
"Sample pairs" means that your samples come in pairs, each pair containing one treatment and one control, such that the two samples within a pair might be more similar than two control samples or two treatment samples. For example, if you have several patients, and from each patient you have one sample of normal tissue and one of tumor tissue, the differences between the patients might obscure the differences between tumor and normal, and you drastically lose power to make statistical discoveries if your method is not informed about which healthy sample is paired with which tumor sample.
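The loss of power from ignoring the pairing can be seen in a toy Python example. It uses ordinary t statistics on made-up log2 expression values for one gene in four hypothetical patients; this is only an illustration of the pairing idea, not a method offered by DESeq or edgeR.

```python
import math
from statistics import mean, stdev

# Hypothetical log2 expression of one gene; position i is the same patient in both lists.
normal = [5.0, 8.0, 11.0, 14.0]
tumor = [6.0, 9.2, 11.9, 15.1]

def unpaired_t(a, b):
    """Welch t statistic, ignoring which samples belong to the same patient."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(b) - mean(a)) / se

def paired_t(a, b):
    """Paired t statistic computed on the within-patient differences."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

t_unpaired = unpaired_t(normal, tumor)
t_paired = paired_t(normal, tumor)

print(round(t_unpaired, 2), round(t_paired, 2))
```

Every patient shows an increase of about 1 in the tumor, but the patients differ from each other far more than that. The unpaired statistic is about 0.4, while the paired one is above 16.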

BTW: You need raw, unnormalized counts to use edgeR or DESeq. RPKM values are not suitable.
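One way to see why RPKM values are unsuitable is that the normalization hides how many reads were actually observed. A short Python sketch with made-up numbers (the gene lengths, counts, and library size are hypothetical):

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical library of 10 million mapped reads.
TOTAL_READS = 10_000_000

# A short gene with 5 reads and a long gene with 100 reads.
short_gene = rpkm(count=5, gene_length_bp=500, total_mapped_reads=TOTAL_READS)
long_gene = rpkm(count=100, gene_length_bp=10_000, total_mapped_reads=TOTAL_READS)

print(short_gene, long_gene)
```

Both genes end up with the same RPKM, although a value based on 5 reads is far noisier than one based on 100. The count-based models need the raw counts to judge that noise.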

Simon