SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Help! RNA-seq data for groupwise comparison with isoform expression validation Xi Wang Bioinformatics 2 08-23-2012 08:40 AM
RNA-Seq: Isoform-level microRNA-155 target prediction using RNA-seq. Newsbot! Literature Watch 0 02-15-2011 02:00 AM
RNA-Seq: Using non-uniform read distribution models to improve isoform expression inf Newsbot! Literature Watch 0 12-21-2010 02:00 AM

Reply
 
Thread Tools
Old 01-14-2013, 02:20 PM   #21
EduEyras
Member
 
Location: Barcelona, Spain

Join Date: Dec 2012
Posts: 17
Default

Thanks.

The list is back up
http://regulatorygenomics.upf.edu/So..._and_splicing/
EduEyras is offline   Reply With Quote
Old 01-14-2013, 02:54 PM   #22
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

Hi Eduardo, thanks for all the work.

The categorization is till a bit unwieldy though. For example, the approaches in DSGseq, DEXseq, and DiffSplice are very similar (they all bypass transcriptome reconstruction and isoform quantification and focus only on differential exon expression), but they're in different categories.

Also, you've included tools that are purely qualitative/visual (like spliceSeq and JuncBase) in the same categories as tools that have statistical outputs... I don't know, maybe a "qualitative" section would be good?
Maayanster is offline   Reply With Quote
Old 01-14-2013, 03:13 PM   #23
EduEyras
Member
 
Location: Barcelona, Spain

Join Date: Dec 2012
Posts: 17
Default

Hi,

thanks for the comments. You're right, DSGSeq and DEXSeq are quite similar. However, as I understood, DSGSeq provides abundances for isoforms and calculates changes at the isoform level, whereas DEXSeq does that only at exon level.

DEXSeq and DiffSplice are in the same category already, since they provide the differential expression at the "event level", where that is an exon for DEXSeq and an "alternative splicing module" for DiffSplice.

SpliceSeq and JuncBASE provide a simple statistical test to compare the read content of each event under two conditions. It is much simpler than other methods as they do not model the read densities, but still they provide a measure of differential splicing. That could be a totally valid approach in some contexts.

I agree with you that this organization is not right from the point of view of the methodology,
would you think that could be a better way of grouping the methods?
I thought that perhaps it will be more useful for the end-user if they were grouped according to what they aim to calculate, e.g. regulated splicing events, quantify isoforms from the annotation, etc... and at the same time taking into account what sort of input you would need: annotations, a genome, events, junctions.... It's not easy, as many of the papers are not clear enough even about what they themselves do.

Thanks for your comments

E.
EduEyras is offline   Reply With Quote
Old 01-14-2013, 03:26 PM   #24
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

I'm quite sure that DSGseq, like DEXseq and DiffSplice does not supply expression at the isoform level. The output is basically a file with the genes ranked by their NB statistic, each of which has an NB statistic per exon as well.

From the paper:
Quote:
As discussed above, for the purpose of inferring differential splicing, we do not necessarily need to estimate the expression of isoforms as differential splicing can be reflected from read distributions across the exons composing the isoforms. In this way, we also don't need to know isoform structures or even the existence of multiple isoforms. We use the negative binomial (NB) distribution to model read-counts on all exons of a gene. It considers over-dispersion in read-count distribution and borrows information across samples to get better estimation of the signal of each exon

...

On the other hand, one can also perceive that identifying differences in splicing patterns or isoform proportions is relatively an easier task than inferring splicing isoforms and estimating their expressions
As for suggestions for categorization, it's tricky and I think you're doing an awesome job! I might have more suggestions as I continue to test out the various options and learn more about them!
Maayanster is offline   Reply With Quote
Old 01-15-2013, 01:36 AM   #25
EduEyras
Member
 
Location: Barcelona, Spain

Join Date: Dec 2012
Posts: 17
Default

Hi,

thanks a lot for the clarification. Even though the abstract seems clear about what they do, the paper contains some sentences that led me to think that they produce differential expression of isoforms:

..."Therefore, we can detect differences in isoform proportion from the information at all exons without having to estimate the expression of all isoforms."...

..."For studying differential splicing, we are comparing the read count vector Yi = {Yij} conditional on Mi. That is, we focus on the proportion of isoform expression instead of the overall expression of the whole gene." ...

Thanks again for your comments and suggestions

Eduardo
EduEyras is offline   Reply With Quote
Old 01-15-2013, 02:43 AM   #26
EduEyras
Member
 
Location: Barcelona, Spain

Join Date: Dec 2012
Posts: 17
Default

... although, looking again at the output given in:
http://bioinfo.au.tsinghua.edu.cn/software/DSGseq/

There it does provide an NB statistics per isoform (line corresponding to column 7).
And this is reported per NM_ ... hence they must report the changes per isoform after all.
EduEyras is offline   Reply With Quote
Old 01-15-2013, 08:17 AM   #27
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

The complete results file from running their test data has only 20804 lines, and each gene symbol appears once, so it is per gene. The refseq transcript id (NM_XXX) is probably just the longest transcript or something. Worth asking the authors.
Maayanster is offline   Reply With Quote
Old 01-15-2013, 08:46 AM   #28
EduEyras
Member
 
Location: Barcelona, Spain

Join Date: Dec 2012
Posts: 17
Default

Yes, I already asked them this morning. No answer yet.
EduEyras is offline   Reply With Quote
Old 01-16-2013, 09:57 AM   #29
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

DEXseq summary (so far)
This is similar to DSGseq and Diffsplice insofar as the isoform reconstruction and quantification are skipped and differential exon expression is carried out. Whereas the other two tools say that they don't need an annotation for their statistics, this program is based on only annotated exons, and uses the supplied transcript annotation in the form of a GFF file.
It also needs at least two replicates per group.
I found the usage of this program extremely tedious (as a matlab person). To install it you need to also install numpy and HTSeq. For preparing the data (similarly to DSGseq) you need to do a preparation step on the annotations, and another preparation step for every sample separately which collects the counts (both using python scripts). Then you switch to R, where you need to prepare something called an ExonCountSet object. To do this you need to first make a data.frame R object with the files that come out of the counting step. Yo also need to define a bunch of parameters in the R console. Then you can finally run the analysis. Despite the long instructional PDF, all this is not especially clear, and it's a rather tedious process compared to the others I've tried so far. In the end, I ran makeCompleteDEUAnalysis, printed out a table of the results, and called it a day. I tried to plot some graphics too, but couldn't because "semi-transparency is not supported on this device". If anyone wants a copy of the workflow I used, send me a message, trying to figure it out might take weeks.


DiffSplice summary
This is a similar approach for exon-centric differential expression to DEXseq and DSGseq (no attempt to reconstruct or quantify specific isoforms). Also supports groups of treatments, minimum 2 samples per group. The SAM inputs and various rather detailed parameters are supplied in two config files. I found this very convenient. In the data config file you can specify treatment group ID, individual IDs, and sample IDs, which determine how the shuffling in their permuation test is done. It was unclear to me what the sample IDs are (as opposed to the individual ID).
DiffSplice prefers alignments that come from TopHat or MapSplice because it looks for the XS (strand) tag which BWA doesn't create. There's no need to do a separate preparation step on the alignments. However, if you want you can separate the three steps of the analysis using parameters for selective re-running. This program is user friendly and the doc page makes sense.
On the downside, when the program has bad inputs or stops in the middle there's no errors or warnings - it just completes in an unreasonably short time and you get no results.
Diffsplice appears to be sensitive to rare deviations from the SAM spec, because while I'm able to successfully run it on mini datasets, the whole datasets are crashing it. I ran Picard's FixMateInformation and ValidateSamFile tools to see if they will make my data acceptable (mates are fine, and sam files are valid! woot), but no dice. It definitely isn't due to the presence of unaligned reads.

Last edited by Maayanster; 04-25-2013 at 12:51 PM.
Maayanster is offline   Reply With Quote
Old 01-18-2013, 02:24 PM   #30
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

eXpress, BitSeq, and RSEM are all based on the approach of aligning RNA-seq reads to the transcriptome, not the genome.

eXpress summary:
This program can take a BAM file in a stream, or a complete SAM or BAM file.
It produces a set of isoforms and a quantification of said isoforms. There is no built in differential expression function (yet) so they recommend inputting the rounded effective counts that eXpress produces into EdgeR or DEGSeq
I used bowtie2 for the alignments to the transcriptome. Once you have those, using eXpress is extremely simple and fun. There's also a cloud version available on Galaxy, though running from the command line is so simple in this case I don't see any advantage to that. Definite favorite!

RSEM +EBSeq summary:
This also generates isoforms and quantifies them. It also needs to be follwed by an external DE tool - they recommend EBSeq, which is actually included in the latest RSEM release.
RSEM can't tolerate any gaps in your transcriptome alignment, including the indels bowtie2 supports. Hence, you either need to align ahead of time with bowtie and input a SAM/BAM, or use the bowtie that's built into the RSEM call and input a fsta/fastq. For me this was unfortunate because we don't keep fastq files on hand (only illumina qseq files) which bowtie doesn't take as inputs. So for us I think this will be too slow. However, it does work! I successfully followed the instructions to execute EBSeq, which is conveniently included as an RSEM function, and gives intelligible results. Together, this workflow is complete.

BitSeq summary
This, like DEXSeq, is and R bioconductor package. I found the manual a lot easier to understand than DEXSeq.
They recalculate the probability of each alignment, come up with a set of isoforms, quantify them, and also provide a DE function. In this way, it is the most complete tool I've tried so far, since all the other tools have assumed, skipped, or left out at least one of these stages. Also, BitSeq automatically generates results file, which is useful for people that don't know R. However, I only successfully ran up to the isoform quantification stage, the DE stage threw an error that I have yet to resolve (google come up empty and I didn't get a reply from the developer).
For running BitSeq I used the same bowtie2 alignments to the transcriptome as for eXpress. You need to run the function getExpression on each sample separately. Then you make a list of the result objects in each treatment group and run the function getDE on those (this is where I got the error).

Last edited by Maayanster; 04-25-2013 at 09:13 AM.
Maayanster is offline   Reply With Quote
Old 01-21-2013, 01:48 PM   #31
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

iReckon summary:
iReckon generates isoforms and quantifies them. However, this is based on gapped alignment to the genome (unlike eXpress, RSEM and BitSeq which are based on alignments to the transcriptome). It doesn't have a built in DE function, so each sample is run separately.
This tool is a little curious because it requires both a gapped alignment to the genome, and the unaligned reads in fastq or fasta format with a reference genome. Since it requires a BWA executable, it's doing some re-alignment. iReckon claims to generate novel isoforms with low false positives by taking into consideration a whole slew of biological and technical biases.
One irritating thing in getting the program running is that you need to re-format your refgene annotation file using an esoteric indexing tool from the Savant genome browser package. If you happen to use IGV, this is a bit tedious. Apparently, this will change in the next version. Also, iReckon takes up an enormous amount of memory and scratch space. For a library with 350 million reads, you would need about 800 G scratch space. Apparently everything (run time, RAM, and space) is linear to the number of reads, so this program would be a alright for a subset of the library or for lower coverage libraries.

Last edited by Maayanster; 04-25-2013 at 09:19 AM.
Maayanster is offline   Reply With Quote
Old 03-12-2013, 03:42 PM   #32
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

Cufflinks + cuffdiff2 summary:
This pipeline, like iReckon, is based on gapped alignment to the genome. It requires the XS tag, so if you're not using tophat to align your RNA, you need to add that tag. I also found out that our gapped aligner introduces some pesky 0M and 0N's in the cigars, since cufflinks doesn't tolerate these. But with these matters sorted out, it's pretty easy to use.
So far I'm really liking the versatility. You can run cufflinks for transcriptome reconstruction and isoform quantification in a variety of different modes. For example, with annotations and novel transcript discovery, with annotations and no novel discovery, with no annotations, and with annotations to be ignored in the output. For differential expression, cuffdiff 2 can be run with the results of the transcript quantification from cufflinks to include novel transcripts, or, it can be run directly from the alignment bam files with an annotation. Unlike the exon-based approaches, you don't need to have more than one library in each treatment group, though if you do it's better to keep them separate than to merge them.

I'm posting a slide I made that shows the appraches used by the 8 programs I tried. The winners so far in each rough category are:
exon-based DE: DSGseq
transcript quantification and DE from genome alignments: cufflinks+cuffdiff2
transcript quantification from transcriptome alignments: eXpress

These winners are mostly based on how practical these programs are to use. A couple (DEXseq and DiffSplice) I gave up on, the former due to undue complication and the latter due to an undiscovered formatting issue with our bam files which crashes the program. RSEM and iReckon worked fine, but both require re-alignment so they're resource-intense. I enjoyed BitSeq, but it's run from the R console which is impractical for us.

Last edited by Maayanster; 03-12-2013 at 03:50 PM.
Maayanster is offline   Reply With Quote
Old 03-13-2013, 01:45 AM   #33
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

This is a useful overview, thanks. I thought RSEM also had a mode for quantification from genome alignments, though (where it maps with Bowtie) in addition to the transcriptome alignments?
kopi-o is offline   Reply With Quote
Old 03-13-2013, 04:25 PM   #34
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

Quote:
Originally Posted by kopi-o View Post
This is a useful overview, thanks. I thought RSEM also had a mode for quantification from genome alignments, though (where it maps with Bowtie) in addition to the transcriptome alignments?
As far as I know it doesn't, but please post a link if I'm wrong.

from the RSEM website:
Quote:
However, please note that RSEM does * not * support gapped alignments. So make sure that your aligner does not produce alignments with intersions/deletions
Maayanster is offline   Reply With Quote
Old 03-14-2013, 12:06 AM   #35
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi @Maayanster,

Regarding DEXSeq, have you seen the pasilla vignette in R? It indicates how to generate an ExonCountSet object. Alternatively, in the newest version of DEXSeq in the svn (1.5.9) you will find the code to generate an ExonCountSet object from transcriptDb objects and bam files.

Are you getting any error message? If so, could you add the output of your sessionInfo()? and a reproducible code?

Alejandro
areyes is offline   Reply With Quote
Old 03-14-2013, 02:59 PM   #36
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

Hi Alejandro,
I was trying to follow the instructions sections 7 and 8 of the manual (which are sort of in the wrong logical order) with my own data, not with the pasilla. This was a few months ago, so I can't exactly remember, but I believe I managed to create the ExonCountSet object after reading a bunch of forum posts about it. But then I got stuck because some parameter in the makeCompleteDEUAnalysis step was missing and I had no idea what it was supposed to be.

I guess I'll go and install the pasilla dataset and try again.
Maayanster is offline   Reply With Quote
Old 03-20-2013, 12:36 PM   #37
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

Hi again Alejandro,
So the pasilla vignette does not indicate how to create an exonCoutSet object. That object is already in the pasilla package. To figure out how to get the data organized into the object, I found this thread much more useful than the documentation:
http://grokbase.com/t/r/bioconductor...ountset-object

Please take this as constructive criticism, because I've spent quite a lot of time struggling with your software, and maybe nobody has said this to you before. The documentation for DEXseq needs an overhaul. The manual is so long, but it starts with an over-complicated example that's way beyond the level of detail most people are even interested in. The section that is supposed to explain how to run an analysis in one step includes parameters that aren't explained, and that most users don't need to use. The section covering the very first steps of how to set up the data come later in the manual, which makes no sense. Even worse, there are no actual code examples for how to make the count files, annotations, design dataframe, and exonCountSet object.

You're really missing a simple step by step workflow for getting the data from a bam file into R, and running the basic functionalities. I think that for non-R users (actually, even for R users) the manual is extremely cumbersome to use, totally in the wrong order, and full of both too much and too little detail in all the wrong places. And judging from the number of threads on technical problems in dexseq, I'm not the only one struggling. It's great that your team is so responsive to people's issues on seqanswers, but you would save yourself a lot of time answering questions by having a logical one-pager workflow.

here are the headers that i made for myself with the code for each step that I ran/am trying to run:
-reference preparation step
-counting step call (on each library separately)
-R data.frame preparation
-R ExonCountSet object preparation
-R analysis step

right now I'm still stuck:
Quote:
>samples = data.frame(
+ condition = c("gr3","gr3","gr3","gr4","gr4","gr4"),
+ replicate = c(1:3,1:3),
+ row.names = c("A08586","A08606","A08616","A08473", "A08485", "A08500"),
+ stringsAsFactors = TRUE,
+ check.names = FALSE
+ )

> annotationfile = '/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq/Homo_sapiens.GRCh37.69_dexseq.gtf'
> fullFilenames<- list.files("/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq",full.names=TRUE,pattern="dexseq.count")

exonCountSet <- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)
Error: all(unlist(lapply(design, class)) == "factor") is not TRUE
Maayanster is offline   Reply With Quote
Old 03-20-2013, 11:31 PM   #38
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi @Maayanster,

Thanks for your feedback. To be honest, I am surprised that you mention that there are no examples of how to generate the input files and objects since there is a whole section (3. Exon count files) in the pasilla vignette of how to generate these files. The section 4 of the pasilla vignette and the help of the function "read.HTSeqCounts" explain how to generate an ExonCountSet object from these. However, I am very glad that you mention this, and its clear evidence that we need to restructure our vignettes. So thanks again!

Regarding the error message, in your "samples" object, just remove the "replicate" row. I mean instead of what you have, just do:

Code:
samples = data.frame(
condition = c("gr3","gr3","gr3","gr4","gr4","gr4"),
row.names = c("A08586","A08606","A08616","A08473", "A08485", "A08500"),
stringsAsFactors = TRUE,
check.names = FALSE
)

annotationfile = '/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq/Homo_sapiens.GRCh37.69_dexseq.gtf'
fullFilenames<- list.files("/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq",full.names=TRUE,pattern="dexseq.count")

exonCountSet <- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)
Let me know if you have any additional questions!
Alejandro
areyes is offline   Reply With Quote
Old 03-20-2013, 11:34 PM   #39
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Actually since you have only a single variable design, there is no need to create a data frame.

Just do:

Code:
annotationfile = '/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq/Homo_sapiens.GRCh37.69_dexseq.gtf'
fullFilenames<- list.files("/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq",full.names=TRUE,pattern="dexseq.count")

exonCountSet <- read.HTSeqCounts(countfiles = fullFilenames,design =  c("gr3","gr3","gr3","gr4","gr4","gr4"),flattenedfile = annotationfile)
That should work!
areyes is offline   Reply With Quote
Old 03-21-2013, 08:46 AM   #40
Maayanster
Member
 
Location: Vancouver, BC

Join Date: Dec 2012
Posts: 30
Default

Thanks Alejandro, that worked.
I must be looking at the wrong manual or something... this is the one I have.
http://watson.nci.nih.gov/bioc_mirro...doc/DEXSeq.pdf
Maayanster is offline   Reply With Quote
Reply

Tags
expression, isoform, rna seq, tools, transcript

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:16 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO