SEQanswers

09-28-2010, 10:08 PM   #1
MDY
Junior Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 7
Guide/tutorial for the analysis of RNA-seq data

UPDATE

This guide is now available in wiki form on the seqanswers wiki. http://seqanswers.com/wiki/How-to/RNASeq_analysis In addition, I would urge everyone to look at the other resources available on the wiki.
It has been nearly a year since I first wrote this guide and it is already starting to show its age. The only way this will continue to be a useful resource is if we as a community take the time to keep it up to date. Already, minor things like syntax changes introduced in software updates are causing some errors to creep in. I have received many emails from people wanting to know how to fix these problems; some I have been able to answer, and some people have worked out for themselves. If you are one of these people, I would strongly urge you to add the correction to the wiki (however minor it may be) so that future readers can benefit. I will do my best to change things that people bring to my attention. However, I am no longer working in the field of RNA-seq analysis, so my knowledge of the topic will become less and less useful and the time I am able to spend on it will shrink. I am glad this guide has been useful to so many people and hope that with your help it will continue to be useful in the future.

Kind Regards,

Matt


Hello,

I've written a guide to the analysis of RNA-seq data, for the purpose of differential expression analysis. It currently lives on our internal wiki that can't be viewed outside of our division, although printouts have been used at workshops. It is by no means perfect and very much a work in progress, but a number of people have found it helpful, so I thought it would be useful to have it somewhere more publicly accessible.

I've attached a pdf version of the guide, although really what I was hoping was that someone here could suggest somewhere where it could be publicly hosted as a wiki. This area is so multifaceted and fast-moving that the only way such a guide can remain useful is if it can be constantly extended and updated.

If anyone has any suggestions about potential hosting, they can contact me at myoung@wehi.edu.au

Cheers

Matt

Update: I've put a few extra things on our local wiki, and seeing as people here seem to be finding this useful, I thought I'd post an updated version. I'm also an author on a review paper on differential expression using RNA-seq, which people who find the guide useful might also find relevant...

RNA-seq Review

Last edited by MDY; 08-16-2011 at 07:31 AM. Reason: Updated version

09-29-2010, 07:11 AM   #2
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151

I think this is the right place, and the guide is very useful.

10-26-2010, 02:03 AM   #3
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83

The guide is really useful, thanks. In it you use the data from Li et al. 2008 as an example dataset. Can you point me to where I could download the fasta files you detail?

10-26-2010, 03:41 AM   #4
poisson200
Member
 
Location: united kingdom

Join Date: Feb 2010
Posts: 63

Dear Matt,
It is a very good place for a document like this. Someone asked me to detail how to perform RNA-seq gene diff-ex analyses on short read data; this document is an excellent example. I think it will really help a lot of people and save a lot of time (I would have done a couple of things differently but that is just personal experience/preference).

Thank you for the contribution.

Actually, a Next Generation Sequencing wiki, if it does not exist already, is a great idea.

John.

10-28-2010, 12:35 AM   #5
hanifk
Member
 
Location: China

Join Date: Oct 2010
Posts: 18

I have spent a lot of time trying to find such a tutorial,
but it seems that very little material is available.
Thanks for your help.

11-11-2010, 06:30 PM   #6
huyvuong
Member
 
Location: michigan

Join Date: Oct 2010
Posts: 10

Hi Matt,
Thank you very much for sharing your guide. Would you please let me know the link to download the Li prostate cancer dataset you mentioned in the guide, i.e. the 7 fa files? I couldn't find them in the publication's supporting information. Thanks

11-15-2010, 09:24 AM   #7
diya
Member
 
Location: TN

Join Date: Nov 2010
Posts: 11
Very useful document for beginners in deep-sequencing

Hi Matt,

I have been searching a lot for this kind of tutorial. The tutorial is very helpful.

Thanks,

Diya

11-15-2010, 07:50 PM   #8
MDY
Junior Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 7

Hi everyone,

Sorry for the slow reply; I somehow managed to miss the replies. For those asking where to get the seven fasta files used in this guide, they come from the data used in the referenced paper, Li et al. 2008 ( http://www.ncbi.nlm.nih.gov/sites/en...,f1000m,isrctn ). As far as I know, the files aren't stored on GEO, but the authors were happy to send the data when contacted by email. The 7 files are 3 treated and 4 untreated lanes of RNA-seq.

Cheers,

Matt

12-03-2010, 06:03 AM   #9
flyyuan
Junior Member
 
Location: China

Join Date: Nov 2010
Posts: 3

Thanks Matt for this nice guide. I am now trying to analyse some soybean RNA-seq data following this article. However, I am very new to this work; could anybody give me some suggestions for solving the following problems:

1. I tried to use makeTranscriptDbFromBiomart to get the soybean information from the phytozome database, but it seems there are many organisms in the phytozome database. How can I select G. max, which is the one I need?

2. Bowtie maps the RNA-seq tags to the reference genes. What is the criterion for deciding whether a read matches or does not match?

thanks in advance!

12-07-2010, 07:33 AM   #10
nancyelatimer
Junior Member
 
Location: Western TN, USA

Join Date: May 2010
Posts: 1

Matt - Awesome super-polished resource for those with or without experience in NGS or RNA-seq! Please feel free to share any other resources you have created. Thank you.

12-09-2010, 07:39 PM   #11
Optimistix
Junior Member
 
Location: New York

Join Date: Jun 2010
Posts: 3

Thanks a lot for the nice guide and sharing it with all of us, Matt!

12-13-2010, 06:54 PM   #12
MDY
Junior Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 7

flyyuan - I'm not sure what the answer is to your first question about biomart. A detailed description of how bowtie decides on a valid match can be found on the bowtie webpage and in particular the manual. You might want to look at this http://bowtie-bio.sourceforge.net/ma...alignment-mode

In brief, in the default mode bowtie will report a read as matching if it has no more than -n mismatches from the reference in the seed and the sum of the quality scores over ALL mismatching bases anywhere in the read (not just in the seed) is no more than -e.
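
The two checks can be sketched in a few lines. This is only an illustration of the rule as the bowtie manual words it ("no more than" for both limits), not bowtie's actual code. The function name and the example read are made up, and the defaults of 2, 28 and 70 are what the bowtie 1 manual lists for -n, -l and -e, so check them against your version. The manual also describes rounding of quality values before they are summed, which this sketch leaves out.

```python
def is_valid_alignment(mismatch_positions, qualities, n=2, seed_len=28, e=70):
    """Return True if an alignment passes the two checks described above.

    mismatch_positions: 0-based offsets in the read that differ from the reference
    qualities: Phred quality value for every base of the read
    n, seed_len, e: stand-ins for bowtie's -n, -l and -e options (assumed defaults)
    """
    # Check 1: count mismatches that fall inside the seed (the first seed_len bases).
    seed_mismatches = sum(1 for pos in mismatch_positions if pos < seed_len)
    # Check 2: sum the qualities at every mismatch, inside or outside the seed.
    mismatch_quality_sum = sum(qualities[pos] for pos in mismatch_positions)
    return seed_mismatches <= n and mismatch_quality_sum <= e


# Hypothetical 36 bp read with quality 30 at every base.
read_quals = [30] * 36
print(is_valid_alignment([3, 30], read_quals))      # one seed mismatch, quality sum 60
print(is_valid_alignment([3, 10, 20], read_quals))  # three mismatches in the seed
print(is_valid_alignment([1, 30, 35], read_quals))  # one seed mismatch, quality sum 90
```

Note that a mismatch outside the seed does not count towards -n but still counts towards -e, which is why the last example can fail even with a single seed mismatch.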

12-18-2010, 02:04 AM   #13
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203

Excellent. This should be a sticky!

01-25-2011, 04:07 AM   #14
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415

Excellent introductory guide, thank you!

01-26-2011, 08:47 PM   #15
Azazel
Member
 
Location: Japan

Join Date: Oct 2010
Posts: 52

Hi Matt,

thanks for putting up this excellent tutorial.

I have one constructive criticism or discussion point though: as I understand it, when checking for differential expression (DE) you only consider reads "overlapping some annotation object, which is usually something like a collection of genes downloaded from the UCSC."

So you suggest checking DE only for something like RefSeq, and taking the number of reads within each RefSeq (or other object) as the expression level.
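
To make the counting step concrete, here is a minimal sketch of "number of reads within each annotated gene". It is not the guide's code. The function name, the gene coordinates and the reads are all invented for illustration, and intervals are assumed to be closed (both ends included).

```python
def count_reads_per_gene(reads, genes):
    """Count reads overlapping each annotated gene.

    reads: list of (chrom, start, end) tuples, one per aligned read
    genes: dict mapping gene name -> (chrom, start, end)
    Returns (counts, unassigned), where unassigned is the number of
    reads that overlap no gene in the annotation.
    """
    counts = {name: 0 for name in genes}
    unassigned = 0
    for chrom, start, end in reads:
        hit = False
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Two closed intervals overlap if each starts before the other ends.
            if chrom == g_chrom and start <= g_end and end >= g_start:
                counts[name] += 1
                hit = True
        if not hit:
            unassigned += 1
    return counts, unassigned


# Made-up annotation and reads.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr2", 1000, 2000)}
reads = [
    ("chr1", 120, 155),
    ("chr1", 480, 515),
    ("chr2", 5000, 5035),
    ("chr1", 900, 935),
]
counts, unassigned = count_reads_per_gene(reads, genes)
print(counts, unassigned)
```

The unassigned tally is the part of the data that never enters the DE test under this approach.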

I think this discards not only much of the information gained by RNA-seq, but also some of the most important information: the most interesting genes are often among the non-annotated genes. Consider for example two cellular states: a very interesting gene might be expressed only in the very unusual state B, and be very highly expressed there, while in state A it is not expressed, or so lowly expressed that it didn't make it into the annotation. So with this approach a researcher would miss this gene and others like it entirely because it's not in the annotation, although these might be the very genes which explain the biological question at hand.

If I used RNA-seq just to identify DE genes which are already annotated in UCSC, I might almost as well have used a tiling array spanning the annotated genes only. (Sure, RNA-seq is "digital", but the point I'm trying to make is that with UCSC or similar annotation one would ignore 90%+ of the RNA-seq data elsewhere in the genome!)

So I think a better approach would be first to use the RNA-seq data to produce an ad hoc annotation, including information from all sequenced conditions, then check DE against this annotation.

Now the question is of course, what is a very good way to create an annotation, i.e. how to identify the regions spanned by genes, from RNA-seq?

Last edited by Azazel; 01-27-2011 at 04:59 AM. Reason: typo

01-26-2011, 08:52 PM   #16
Optimistix
Junior Member
 
Location: New York

Join Date: Jun 2010
Posts: 3

Hi Azazel, you might want to check out tools from the Salzberg lab, in particular cufflinks, cuffdiff and cuffcompare:

http://cufflinks.cbcb.umd.edu/index.html

I'm new to RNASeq data analysis as well, but those three tools do the kind of thing you seem to have in mind.

02-06-2011, 07:17 PM   #17
MDY
Junior Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 7

Hi Azazel,

I agree with you that the "differential expression of whole genes taken from UCSC" approach does ignore some important information provided by RNA-seq. However, I do not agree that this means the approach is without value.

The alternative you suggest is to use the data itself to produce an annotation, against which your analysis can proceed. There will be circumstances where this is indeed a superior approach, but this will not always be the case. Firstly, the ability to annotate a gene depends on the level of coverage, which for RNA-seq depends on the level of transcription. By relying on de novo annotation routines you will ignore, or at the very least bias against, lowly expressed genes. There are certainly situations where the accurate identification of differential expression in lowly expressed genes is vital to the biological question driving the experiment. The example I am familiar with is knock-down experiments of polycomb group proteins, where small expression changes in lowly expressed genes are important, but I am sure there are many others. A related point is that smaller experiments may lack the depth for accurate de novo annotation, even for highly expressed genes.

Furthermore, there is obviously extra information that is still ignored by doing differential expression against de-novo annotated genes, such as differential splicing or allele specific expression.

Ultimately, there are many different things that can be done with RNA-seq data and it seems to me that different analysis techniques offer complementary information rather than opposing, mutually exclusive viewpoints. How important different aspects of RNA-seq analysis are will depend on the biological question you are trying to answer.

The point of this guide is not to be a "one size fits all" guide to analyzing RNA-seq, but to provide a step by step introduction to one of the simpler (and possibly more well understood) analysis methods available. It was my original hope (and it still is) that this guide could form the basis for some kind of "RNA-seq analysis wiki", where people with more expertise with other areas of the analysis could add to it. For example, I think a section describing how to create an annotation using a reference genome and some RNA-seq data would be extremely useful, but I don't have the expertise to write it myself.

02-16-2011, 09:40 AM   #18
gunzip
Junior Member
 
Location: United Kingdom

Join Date: Feb 2011
Posts: 1

Hi Matt,

Really really useful. Thank you.

I was wondering if you, or anyone else reading this, knew of a guide/workflow on novel transcripts/RNA editing at a similar level of excellence to this guide.

02-25-2011, 12:42 PM   #19
JueFish
Member
 
Location: Connecticut

Join Date: May 2010
Posts: 42

Been checking out the tutorial that was posted on this thread. Can anyone comment on what the command:

new_read_chr_names=gsub("(.*)[T]*\\..*","chr\\1",rname(reads))

is doing? I get that it makes a new list of chromosome names and that the example uses gsub to do the substitutions, but I don't understand what's going on in the first two fields of that command:

"(.*)[T]*\\..*"

and

"chr\\1"

In other words, I have no idea how to make sure that my chromosomal names will match up with ones in the genome (NCBI headers put in a bunch of noise) because of my naivete about syntax. Any thoughts? Thanks.
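
For anyone puzzling over the same line, here is a rough reading of the two fields. The first field is a regular expression: (.*) captures any run of characters, [T]* allows zero or more letter T, \\. is a literal dot (the backslash is doubled because it sits inside an R string), and the final .* matches whatever follows the dot. The second field is the replacement: the text "chr" followed by \\1, which stands for whatever the parentheses captured. One way to see the effect is to run the same pattern outside R. The sketch below uses Python's re.sub, which handles this particular pattern the same way; the reference names are invented for illustration and are not taken from the guide's data.

```python
import re

# Same pattern and replacement as the gsub call, written as Python raw strings.
PATTERN = r"(.*)[T]*\..*"
REPLACEMENT = r"chr\1"


def rename(read_name):
    """Apply the substitution to one reference name."""
    return re.sub(PATTERN, REPLACEMENT, read_name)


# Hypothetical reference names, made up for illustration.
for name in ["10.1-135374737", "MT.1-16571", "chrX"]:
    print(name, "->", rename(name))
```

Two things stand out when running it. A name with no dot is left untouched, because the pattern cannot match at all. And because (.*) is greedy, it swallows any trailing T itself, so in this run [T]* never removes one. To adapt the idea to NCBI-style headers, the usual approach is to print a few of your own rname values, write a pattern whose parentheses capture only the chromosome part, and compare the result against the names used in the annotation.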

02-28-2011, 12:54 AM   #20
m.nyine
Junior Member
 
Location: Uganda

Join Date: Feb 2011
Posts: 4

Hi,

Thanks a lot

This is really helpful for a fresher in bioinformatics like me.

I hope our seniors will take MDY's idea of developing this further.
I know many are working hard to develop such material but there is limited access to it.

Please keep us informed of any developments.

Nyine