SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bacterial RNASeq Expression profiles and genome size Arvind Bhagwat Introductions 1 04-26-2012 04:32 AM
using NGS, what is the best miRNA expression normalization method? Giorgio C Bioinformatics 4 12-07-2011 06:32 AM
comparing samples with varying no. of reads for differential expression harshinamdar Bioinformatics 1 08-24-2011 05:16 AM
splitting 454 reads into kmers for diff expression Jeremy RNA Sequencing 0 01-18-2011 06:17 PM
A scaling normalization method for differential expression analysis of RNA-seq data severin Literature Watch 1 09-09-2010 11:09 PM

Reply
 
Thread Tools
Old 05-22-2012, 10:52 AM   #1
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245
Default Diff. expression with RNAseq - varying results by method

I have a dose response experiment using Rats, with Controls and 5 treatment doses, each with 5 biological replicates.

The data is single read RNAseq, 50bp reads, run on an ABI 5500xl, and I mapped the reads to the Ensembl rel. 66 Rat genome using ABI's LifeScope 2.5.1.

For mapped reads, I have an average of 12,216,885 reads on exons per sample (the numbers range from a low of 5,171,445 for a Dose group 4 animal, and a high of 17,323,22 for a Dose group 3 animal)

I've been analyzing this data for differential gene expression with a couple of approaches and am getting radically different results.

DESeq using raw counts summarized on genes, genes with FDR < 0.05:

Dose 1 = n/a (have not run this one yet)
Dose 2 = 45
Dose 3 = 146
Dose 4 = 520
Dose 5 = 540

Now with DESeq, I did analyze each dose as a single pair of Controls (N=5) and Treatment (N=5)

Similarly, I only have the highest dose done using cuffdiff v2.0.0 (3384), but again, genes with q-value < 0.05:
Dose 5 = 64

Then, I pulled the mapped BAM files into Partek Genome Suite 6.6, re-indexed them to Refseq genes, got gene RPKM values and did a 1way ANOVA on all 30 samples, with linear contrasts for individual Dose fold change and p-values.

genes significant by dose at FDR < 0.05 = 1524

genes significant by linear contrast at FDR < 0.05
Dose 1 = 3
Dose 2 = 3
Dose 3 = 280
Dose 4 = 1722
Dose 5 = 3665

Even if I break these down and run each Dose as a discrete ANOVA (n=10) I get, genes significant by dose at FDR < 0.05:

Dose 1 = 0
Dose 2 = 1
Dose 3 = 70
Dose 4 = 1043
Dose 5 = 1848

And just for comparison, I have these exact same animals run on Affy HT_Rat30_PM (Titan) arrays, again as a single ANOVA (N=30) of RMA normalized data, with linear contrasts for each Dose

genes significant by dose at FDR < 0.05 = 2375

genes significant by linear contrast at FDR < 0.05
Dose 1 = 2
Dose 2 = 67
Dose 3 = 1053
Dose 4 = 2223
Dose 5 = 3073

I have not yet looked at these lists of genes for overlap and so forth yet, although just casually looking at the top ranked (by FDR) genes in Partek and DESeq, I see a lot of the same ones. But what I am wondering about is the large discrepancy in the numbers of significant genes.

Any ideas why the different RNAseq methods would be so different? Is this purely a function of the differences between count data and RPKM data and the different variance models? I was not expecting the results to be identical, but at least more consistent than this?
mbblack is offline   Reply With Quote
Old 05-23-2012, 05:22 PM   #2
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default similar varying results

i have compared different methods for differential gene expression with a 12 versus 12 marched pairs design and got between 2 (cuffdiff) and ~5600 (SAMseq) significant genes (FDR 5%). inbetween were NOISeq, DESseq, baySeq, limma/voom, and edgeR (ascendning order).

for your design probably SAMseq (R-package SAM) with quantitative outcome (use you doses as outcome) would be a good choice. it is a non-parametric approach with permutation, and with each five biological replicates you will have enough samples for stable results.

count with HTseq-count.
dietmar13 is offline   Reply With Quote
Old 05-23-2012, 10:29 PM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

12 replicates is a large number, and that gives a non-parametric method as SAMSeq an edge. I wonder if this is already true for 5 samples.

How did you do the comparison, by the way? It sounds a bit as if you simply ranked by number of detected genes, but with would make sense only if you somehow were able to verify that all methods control false discovery rate as advertised. Being sure of this is usually the hard part.
Simon Anders is offline   Reply With Quote
Old 05-23-2012, 11:28 PM   #4
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default @simon

i am no statistician, but what i understand is that as SAMseq can use the quantitative outcome as dependent variable and will find genes which are deregulated over the complete dose range, it is not completely relevant that there are only 5 replicates for each dose. all 30 samples will be considered together.

method comparison:
all of the programs state explicitly, that they provide FDRs, either by permutation or by benjamini-hochberg ...

i have also tried to interpret (overlap, pathway, compare to microarray data) the results, and found all gene lists more or less plausible. by the way, microarray in the same design (12 v 12) found nearly exactly the same number of sig. genes (~ 5600) with significance analysis of microarrays (SAM) -

notably: the RNAseq data were only median 2.6 mio paired reads (or median ~ 460 k mapped and counted reads)., there fore a very low sequencing depth!

very interesting was, that for all methods except baySeq a log-linear correlation of call-percentages (which proportion of genes are called for specific sum read counts = sum of count over all 24 samples). this call-percentage is increasing (log-linear) constantly over the complete range.
dietmar13 is offline   Reply With Quote
Old 05-24-2012, 12:09 AM   #5
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

Quote:
Originally Posted by dietmar13 View Post
notably: the RNAseq data were only median 2.6 mio paired reads (or median ~ 460 k mapped and counted reads)., there fore a very low sequencing depth!
I'd really be interested in seeing comparisons between DE methods for differing depths. Your depths seem to be more shallow than the typical datasets, that's why I'm wondering whether the modeling of the overdispersion in DESeq and edgeR is really "kicking in" at these depths...
Did you try to pull your data through BitSeq as well (available on BioC)? I'd be interested to hear how that package compares, it seems promising (to me).
arvid is offline   Reply With Quote
Old 05-24-2012, 12:45 AM   #6
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default @arvid

i will try BitSeq - as I want compare all relevant methods...

is a matched pairs design possible with BitSeq?

if I run into problems, perheps you could assist me?

or do you have a typical script for analysis with BitSeq, starting from count data in a data.frame. it is easier to adapt a script...
dietmar13 is offline   Reply With Quote
Old 05-24-2012, 01:08 AM   #7
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

Quote:
Originally Posted by dietmar13 View Post
i will try BitSeq - as I want compare all relevant methods...

is a matched pairs design possible with BitSeq?

if I run into problems, perheps you could assist me?

or do you have a typical script for analysis with BitSeq, starting from count data in a data.frame. it is easier to adapt a script...
BitSeq is improving on the ideas in RSEM/IsoEM and is doing DE as well. It needs a SAM/BAM (reads aligned to the transcriptome) with all valid alignments reported (not only unique ones), as it computes probabilities of the alignments. They are doing a lot of MCMC, so the whole process can be quite slow for big datasets.

Unfortunately, matched pair designs seem not to be supported (yet) - at least not documented; the DE part is still somewhat new and not completely mature yet. The data the authors presented (I was talking to them on a workshop a couple of months ago) looked interesting, but I haven't seen a critical review of the software yet, that's why I suggested it.

I'll be trying it out on some data here, for which we do RT-qPCR (for more biological replicates than we can afford to do the RNA-Seq) as well. I'll let you know how it performs there...
arvid is offline   Reply With Quote
Old 05-24-2012, 01:19 AM   #8
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default @arvid

the fasta-file you have to provide is the multifasta-file with chromosomes of the genome or with transcripts?

i have mapped with RUM, STAR and tophat.

RUM does a transcript as well as a genome mapping...
dietmar13 is offline   Reply With Quote
Old 05-24-2012, 01:21 AM   #9
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default @dietmar13

Quote:
Originally Posted by dietmar13 View Post
the fasta-file you have to provide is the multifasta-file with chromosomes of the genome or with transcripts?

i have mapped with RUM, STAR and tophat.

RUM does a transcript as well as a genome mapping...
You should do the alignment on transcripts, in the sense of multifasta with one entry for each transcript. I'm not familiar with RUM; does it report all valid alignments for each read/pair?
arvid is offline   Reply With Quote
Old 05-24-2012, 05:26 AM   #10
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245
Default

Quote:
Originally Posted by dietmar13 View Post
i have compared different methods for differential gene expression with a 12 versus 12 marched pairs design and got between 2 (cuffdiff) and ~5600 (SAMseq) significant genes (FDR 5%). inbetween were NOISeq, DESseq, baySeq, limma/voom, and edgeR (ascendning order).

for your design probably SAMseq (R-package SAM) with quantitative outcome (use you doses as outcome) would be a good choice. it is a non-parametric approach with permutation, and with each five biological replicates you will have enough samples for stable results.

count with HTseq-count.
I will look at SAM, thanks. These are toxicology dose response experiments (the current data is just a trial run for a much larger series of chemical exposures), so we are far less interested in the genes significant across all samples as we are in the genes being dysregulated at each dose as concentration (and/or time of exposure) increases.

BTW, my count data is taken directly from LifeScope, which maps to the transcriptome (in my case, using Ensembl's default GTF file for rel. 66 of Rn4). LifeScope then summarized counts on exons, introns and intergenic based on whole library read sets (i.e. summarized across all barcoded read sets for each sample), although I could group those barcodes any way I want for summary stats.

Partek actually reads in its data from the mapped BAM files. Since LifeScope maps each barcode read set independently (and then summarizes data however it has been told to group those), one gets one BAM file for each barcode. I then had to merge those as desired to get a single BAM file per sample. Partek then reads those directly, and re-indexes the mapped reads to its downloaded annotation files for either Refseq or Ensembl and computes counts and RPKM for both genes and transcripts. I did my ANOVA's on the RefSeq gene RPKM data.

Since each sample used 3 barcodes, I can analyze them individually or in pairs as well. Average mapped reads for single barcode read set is about 4.6 million reads. Average mapped reads for any paired barcode read set is about 8.8 million reads, and average mapped reads for all 3 barcodes per library is about 13.5 million reads. And at least by ANOVA using gene-based RPKM data, the results seem comparable for 8.8 and 13.5 million reads, but there is a large decrease in significant genes with only one barcode read set per library.

For ANOVA, (Signficant by Dose at FDR < 0.05) of the 1620 unique annotated genes significant from the Affy array analysis (I discarded all promiscuous probes in the list as I cannot unambiguously assign them to a gene, and removed redundant probes), 548 of those are shared with the 1524 genes signficant by dose in the RNA seq data (using all 3 barcoded read sets per sample). I have not matched up the gene lists from the DEseq results yet.

For now I am just focusing on the Partek ANOVA results using gene-based RPKM values. As I mentioned, these exact same animals (in fact, these exact same original RNA extractions) were used for the Affy array experiments. The RNAseq ANOVA results at least then makes sense to me, relative to what I have for the array results. However, using count data with any tool I have tried thus far, the significantly differentially expressed genes detected are always far fewer than the array results or the RNAseq ANOVA results.

While I am by no means a statistician, that makes no sense to me. If the ANOVA results indicate my RNAseq data is at least as good as, or more sensitive, than differentially expressed genes by array data, it makes me very uncomfortable when count data and/or other analytical tools produces far fewer (4-6 times fewer) significant results. I fully expect that the results may differ by method, but this level of inconsistency seems extreme to me. It is far greater than the level of inconsistency one sees with array data, using different normalization algorithms and/or different significance tests (at least in my experience with Affy and Agilent array data).
mbblack is offline   Reply With Quote
Old 11-05-2012, 08:54 AM   #11
narges
Member
 
Location: Finland

Join Date: Aug 2012
Posts: 29
Default

Quote:
Originally Posted by arvid View Post
BitSeq is improving on the ideas in RSEM/IsoEM and is doing DE as well. It needs a SAM/BAM (reads aligned to the transcriptome) with all valid alignments reported (not only unique ones), as it computes probabilities of the alignments. They are doing a lot of MCMC, so the whole process can be quite slow for big datasets.
I wanted to ask what do you mean by "All valid alignments"? For example if I use "accepted_hits.bam" file from TopHat output, would it be acceptable? Because I have used this file as the BitSeq input and there is an error like below:
Error in parseAlignment(alignFile, probF, trSeqFile, trInfoFile = trF, :
Main: number of transcripts don't match: 25 vs 5927492
If this error is related to this topic what should I use instead?
Thank you in advance
narges is offline   Reply With Quote
Old 11-16-2012, 01:40 PM   #12
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by mbblack View Post
I will look at SAM, thanks. These are toxicology dose response experiments (the current data is just a trial run for a much larger series of chemical exposures), so we are far less interested in the genes significant across all samples as we are in the genes being dysregulated at each dose as concentration (and/or time of exposure) increases.

BTW, my count data is taken directly from LifeScope, which maps to the transcriptome (in my case, using Ensembl's default GTF file for rel. 66 of Rn4). LifeScope then summarized counts on exons, introns and intergenic based on whole library read sets (i.e. summarized across all barcoded read sets for each sample), although I could group those barcodes any way I want for summary stats.

Partek actually reads in its data from the mapped BAM files. Since LifeScope maps each barcode read set independently (and then summarizes data however it has been told to group those), one gets one BAM file for each barcode. I then had to merge those as desired to get a single BAM file per sample. Partek then reads those directly, and re-indexes the mapped reads to its downloaded annotation files for either Refseq or Ensembl and computes counts and RPKM for both genes and transcripts. I did my ANOVA's on the RefSeq gene RPKM data.

Since each sample used 3 barcodes, I can analyze them individually or in pairs as well. Average mapped reads for single barcode read set is about 4.6 million reads. Average mapped reads for any paired barcode read set is about 8.8 million reads, and average mapped reads for all 3 barcodes per library is about 13.5 million reads. And at least by ANOVA using gene-based RPKM data, the results seem comparable for 8.8 and 13.5 million reads, but there is a large decrease in significant genes with only one barcode read set per library.

For ANOVA, (Signficant by Dose at FDR < 0.05) of the 1620 unique annotated genes significant from the Affy array analysis (I discarded all promiscuous probes in the list as I cannot unambiguously assign them to a gene, and removed redundant probes), 548 of those are shared with the 1524 genes signficant by dose in the RNA seq data (using all 3 barcoded read sets per sample). I have not matched up the gene lists from the DEseq results yet.

For now I am just focusing on the Partek ANOVA results using gene-based RPKM values. As I mentioned, these exact same animals (in fact, these exact same original RNA extractions) were used for the Affy array experiments. The RNAseq ANOVA results at least then makes sense to me, relative to what I have for the array results. However, using count data with any tool I have tried thus far, the significantly differentially expressed genes detected are always far fewer than the array results or the RNAseq ANOVA results.

While I am by no means a statistician, that makes no sense to me. If the ANOVA results indicate my RNAseq data is at least as good as, or more sensitive, than differentially expressed genes by array data, it makes me very uncomfortable when count data and/or other analytical tools produces far fewer (4-6 times fewer) significant results. I fully expect that the results may differ by method, but this level of inconsistency seems extreme to me. It is far greater than the level of inconsistency one sees with array data, using different normalization algorithms and/or different significance tests (at least in my experience with Affy and Agilent array data).

You might want to look at our paper... it could be your sequencing depth is not sufficient enough to quantify your expression...

http://seqanswers.com/forums/showpos...78&postcount=1
NGSfan is offline   Reply With Quote
Old 11-17-2012, 05:27 AM   #13
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245
Default

Quote:
Originally Posted by NGSfan View Post
You might want to look at our paper... it could be your sequencing depth is not sufficient enough to quantify your expression...

http://seqanswers.com/forums/showpos...78&postcount=1
Thanks, I already had it but have not had a chance to read through it yet - I'll do that next week for sure.

I actually do have more sequence now, as since my initial posting, we've re-sequenced the residual beads to nearly double what we had (a gain of about 91% in average read depth over our 30 samples), and I am writing up a manuscript now.

On a practical note, for us, the biggest concern right now is cost and throughput of RNAseq for DGE versus arrays (as we cost out some long term, large scale studies). As I'm sure is not surprising to many here, arrays (well, specifically Affy Titan arrays), to my mind, still have a very large pragmatic edge over any current or near-future proposed RNAseq technologies, both in terms of cost and wet-bench throughput for large sample whole genome DGE studies.
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
mbblack is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:58 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO