![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Bacterial RNASeq Expression profiles and genome size | Arvind Bhagwat | Introductions | 1 | 04-26-2012 05:32 AM |
using NGS, what is the best miRNA expression normalization method? | Giorgio C | Bioinformatics | 4 | 12-07-2011 07:32 AM |
comparing samples with varying no. of reads for differential expression | harshinamdar | Bioinformatics | 1 | 08-24-2011 06:16 AM |
splitting 454 reads into kmers for diff expression | Jeremy | RNA Sequencing | 0 | 01-18-2011 07:17 PM |
A scaling normalization method for differential expression analysis of RNA-seq data | severin | Literature Watch | 1 | 09-10-2010 12:09 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245
|
![]()
I have a dose response experiment using Rats, with Controls and 5 treatment doses, each with 5 biological replicates.
The data is single read RNAseq, 50bp reads, run on an ABI 5500xl, and I mapped the reads to the Ensembl rel. 66 Rat genome using ABI's LifeScope 2.5.1. For mapped reads, I have an average of 12,216,885 reads on exons per sample (the numbers range from a low of 5,171,445 for a Dose group 4 animal, and a high of 17,323,22 for a Dose group 3 animal) I've been analyzing this data for differential gene expression with a couple of approaches and am getting radically different results. DESeq using raw counts summarized on genes, genes with FDR < 0.05: Dose 1 = n/a (have not run this one yet) Dose 2 = 45 Dose 3 = 146 Dose 4 = 520 Dose 5 = 540 Now with DESeq, I did analyze each dose as a single pair of Controls (N=5) and Treatment (N=5) Similarly, I only have the highest dose done using cuffdiff v2.0.0 (3384), but again, genes with q-value < 0.05: Dose 5 = 64 Then, I pulled the mapped BAM files into Partek Genome Suite 6.6, re-indexed them to Refseq genes, got gene RPKM values and did a 1way ANOVA on all 30 samples, with linear contrasts for individual Dose fold change and p-values. genes significant by dose at FDR < 0.05 = 1524 genes significant by linear contrast at FDR < 0.05 Dose 1 = 3 Dose 2 = 3 Dose 3 = 280 Dose 4 = 1722 Dose 5 = 3665 Even if I break these down and run each Dose as a discrete ANOVA (n=10) I get, genes significant by dose at FDR < 0.05: Dose 1 = 0 Dose 2 = 1 Dose 3 = 70 Dose 4 = 1043 Dose 5 = 1848 And just for comparison, I have these exact same animals run on Affy HT_Rat30_PM (Titan) arrays, again as a single ANOVA (N=30) of RMA normalized data, with linear contrasts for each Dose genes significant by dose at FDR < 0.05 = 2375 genes significant by linear contrast at FDR < 0.05 Dose 1 = 2 Dose 2 = 67 Dose 3 = 1053 Dose 4 = 2223 Dose 5 = 3073 I have not yet looked at these lists of genes for overlap and so forth yet, although just casually looking at the top ranked (by FDR) genes in Partek and DESeq, I see a lot of the same ones. But what I am wondering about is the large discrepancy in the numbers of significant genes. Any ideas why the different RNAseq methods would be so different? Is this purely a function of the differences between count data and RPKM data and the different variance models? I was not expecting the results to be identical, but at least more consistent than this? |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Vienna Join Date: Mar 2010
Posts: 107
|
![]()
i have compared different methods for differential gene expression with a 12 versus 12 marched pairs design and got between 2 (cuffdiff) and ~5600 (SAMseq) significant genes (FDR 5%). inbetween were NOISeq, DESseq, baySeq, limma/voom, and edgeR (ascendning order).
for your design probably SAMseq (R-package SAM) with quantitative outcome (use you doses as outcome) would be a good choice. it is a non-parametric approach with permutation, and with each five biological replicates you will have enough samples for stable results. count with HTseq-count. |
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994
|
![]()
12 replicates is a large number, and that gives a non-parametric method as SAMSeq an edge. I wonder if this is already true for 5 samples.
How did you do the comparison, by the way? It sounds a bit as if you simply ranked by number of detected genes, but with would make sense only if you somehow were able to verify that all methods control false discovery rate as advertised. Being sure of this is usually the hard part. |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Vienna Join Date: Mar 2010
Posts: 107
|
![]()
i am no statistician, but what i understand is that as SAMseq can use the quantitative outcome as dependent variable and will find genes which are deregulated over the complete dose range, it is not completely relevant that there are only 5 replicates for each dose. all 30 samples will be considered together.
method comparison: all of the programs state explicitly, that they provide FDRs, either by permutation or by benjamini-hochberg ... i have also tried to interpret (overlap, pathway, compare to microarray data) the results, and found all gene lists more or less plausible. by the way, microarray in the same design (12 v 12) found nearly exactly the same number of sig. genes (~ 5600) with significance analysis of microarrays (SAM) - notably: the RNAseq data were only median 2.6 mio paired reads (or median ~ 460 k mapped and counted reads)., there fore a very low sequencing depth! very interesting was, that for all methods except baySeq a log-linear correlation of call-percentages (which proportion of genes are called for specific sum read counts = sum of count over all 24 samples). this call-percentage is increasing (log-linear) constantly over the complete range. |
![]() |
![]() |
![]() |
#5 | |
Senior Member
Location: Berlin Join Date: Jul 2011
Posts: 156
|
![]() Quote:
Did you try to pull your data through BitSeq as well (available on BioC)? I'd be interested to hear how that package compares, it seems promising (to me). |
|
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: Vienna Join Date: Mar 2010
Posts: 107
|
![]()
i will try BitSeq - as I want compare all relevant methods...
is a matched pairs design possible with BitSeq? if I run into problems, perheps you could assist me? or do you have a typical script for analysis with BitSeq, starting from count data in a data.frame. it is easier to adapt a script... |
![]() |
![]() |
![]() |
#7 | |
Senior Member
Location: Berlin Join Date: Jul 2011
Posts: 156
|
![]() Quote:
Unfortunately, matched pair designs seem not to be supported (yet) - at least not documented; the DE part is still somewhat new and not completely mature yet. The data the authors presented (I was talking to them on a workshop a couple of months ago) looked interesting, but I haven't seen a critical review of the software yet, that's why I suggested it. I'll be trying it out on some data here, for which we do RT-qPCR (for more biological replicates than we can afford to do the RNA-Seq) as well. I'll let you know how it performs there... |
|
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: Vienna Join Date: Mar 2010
Posts: 107
|
![]()
the fasta-file you have to provide is the multifasta-file with chromosomes of the genome or with transcripts?
i have mapped with RUM, STAR and tophat. RUM does a transcript as well as a genome mapping... |
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: Berlin Join Date: Jul 2011
Posts: 156
|
![]()
You should do the alignment on transcripts, in the sense of multifasta with one entry for each transcript. I'm not familiar with RUM; does it report all valid alignments for each read/pair?
|
![]() |
![]() |
![]() |
#10 | |
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245
|
![]() Quote:
BTW, my count data is taken directly from LifeScope, which maps to the transcriptome (in my case, using Ensembl's default GTF file for rel. 66 of Rn4). LifeScope then summarized counts on exons, introns and intergenic based on whole library read sets (i.e. summarized across all barcoded read sets for each sample), although I could group those barcodes any way I want for summary stats. Partek actually reads in its data from the mapped BAM files. Since LifeScope maps each barcode read set independently (and then summarizes data however it has been told to group those), one gets one BAM file for each barcode. I then had to merge those as desired to get a single BAM file per sample. Partek then reads those directly, and re-indexes the mapped reads to its downloaded annotation files for either Refseq or Ensembl and computes counts and RPKM for both genes and transcripts. I did my ANOVA's on the RefSeq gene RPKM data. Since each sample used 3 barcodes, I can analyze them individually or in pairs as well. Average mapped reads for single barcode read set is about 4.6 million reads. Average mapped reads for any paired barcode read set is about 8.8 million reads, and average mapped reads for all 3 barcodes per library is about 13.5 million reads. And at least by ANOVA using gene-based RPKM data, the results seem comparable for 8.8 and 13.5 million reads, but there is a large decrease in significant genes with only one barcode read set per library. For ANOVA, (Signficant by Dose at FDR < 0.05) of the 1620 unique annotated genes significant from the Affy array analysis (I discarded all promiscuous probes in the list as I cannot unambiguously assign them to a gene, and removed redundant probes), 548 of those are shared with the 1524 genes signficant by dose in the RNA seq data (using all 3 barcoded read sets per sample). I have not matched up the gene lists from the DEseq results yet. For now I am just focusing on the Partek ANOVA results using gene-based RPKM values. As I mentioned, these exact same animals (in fact, these exact same original RNA extractions) were used for the Affy array experiments. The RNAseq ANOVA results at least then makes sense to me, relative to what I have for the array results. However, using count data with any tool I have tried thus far, the significantly differentially expressed genes detected are always far fewer than the array results or the RNAseq ANOVA results. While I am by no means a statistician, that makes no sense to me. If the ANOVA results indicate my RNAseq data is at least as good as, or more sensitive, than differentially expressed genes by array data, it makes me very uncomfortable when count data and/or other analytical tools produces far fewer (4-6 times fewer) significant results. I fully expect that the results may differ by method, but this level of inconsistency seems extreme to me. It is far greater than the level of inconsistency one sees with array data, using different normalization algorithms and/or different significance tests (at least in my experience with Affy and Agilent array data). |
|
![]() |
![]() |
![]() |
#11 | |
Member
Location: Finland Join Date: Aug 2012
Posts: 29
|
![]() Quote:
Error in parseAlignment(alignFile, probF, trSeqFile, trInfoFile = trF, : Main: number of transcripts don't match: 25 vs 5927492 If this error is related to this topic what should I use instead? Thank you in advance |
|
![]() |
![]() |
![]() |
#12 | |
Senior Member
Location: Austria Join Date: Apr 2009
Posts: 181
|
![]() Quote:
You might want to look at our paper... it could be your sequencing depth is not sufficient enough to quantify your expression... http://seqanswers.com/forums/showpos...78&postcount=1 |
|
![]() |
![]() |
![]() |
#13 | |
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245
|
![]() Quote:
I actually do have more sequence now, as since my initial posting, we've re-sequenced the residual beads to nearly double what we had (a gain of about 91% in average read depth over our 30 samples), and I am writing up a manuscript now. On a practical note, for us, the biggest concern right now is cost and throughput of RNAseq for DGE versus arrays (as we cost out some long term, large scale studies). As I'm sure is not surprising to many here, arrays (well, specifically Affy Titan arrays), to my mind, still have a very large pragmatic edge over any current or near-future proposed RNAseq technologies, both in terms of cost and wet-bench throughput for large sample whole genome DGE studies.
__________________
Michael Black, Ph.D. ScitoVation LLC. RTP, N.C. |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|