**dietmar13** · 05-23-2012, 05:22 PM

similar varying results

i have compared different methods for differential gene expression with a 12 versus 12 matched pairs design and got between 2 (cuffdiff) and ~5600 (SAMseq) significant genes (FDR 5%). in between were NOISeq, DESeq, baySeq, limma/voom, and edgeR (ascending order).

for your design, SAMseq (R package samr) with a quantitative outcome (use your doses as the outcome) would probably be a good choice. it is a non-parametric approach based on permutation, and with five biological replicates per dose you will have enough samples for stable results.

count with HTseq-count.
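The quantitative-outcome idea above can be sketched in plain Python. This is not SAMseq itself and does not call the samr package; it is a minimal permutation test, assuming a rank (Spearman) correlation between one gene's counts and the dose of each sample as the test statistic. The function names, the number of permutations, and the toy data are all assumptions for illustration.

```python
import random


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no trend
    return cov / (vx * vy) ** 0.5


def permutation_pvalue(counts, doses, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a monotone dose trend in one gene."""
    rng = random.Random(seed)
    observed = abs(spearman(counts, doses))
    shuffled = list(doses)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between dose and expression
        if abs(spearman(counts, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: six hypothetical dose levels with five replicates each (30 samples).
doses = [d for d in range(6) for _ in range(5)]
trend_gene = [10 * d + r for d in range(6) for r in range(5)]
p_trend = permutation_pvalue(trend_gene, doses, n_perm=500)
```

Because the dose labels are permuted across all samples at once, every sample contributes to the test, not only the replicates within a single dose group.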

**Simon Anders** · 05-23-2012, 10:29 PM

12 replicates is a large number, and that gives a non-parametric method such as SAMseq an edge. I wonder if this is already true for 5 samples.

How did you do the comparison, by the way? It sounds a bit as if you simply ranked by number of detected genes, but this would make sense only if you were somehow able to verify that all methods control the false discovery rate as advertised. Being sure of this is usually the hard part.

**dietmar13** · 05-23-2012, 11:28 PM

@simon

i am no statistician, but what i understand is that as SAMseq can use the quantitative outcome as dependent variable and will find genes which are deregulated over the complete dose range, it is not completely relevant that there are only 5 replicates for each dose. all 30 samples will be considered together.

method comparison:
all of the programs state explicitly, that they provide FDRs, either by permutation or by benjamini-hochberg ...
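For reference, the Benjamini-Hochberg adjustment mentioned here can be sketched in a few lines of Python. This is a generic illustration of the step-up procedure, not the code used inside any of the packages named in this thread; the function name and example p-values are assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone in the original ranking.
    for steps_from_end, idx in enumerate(reversed(order)):
        rank = m - steps_from_end
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Genes with an adjusted p-value below 0.05 would be called at FDR 5%.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The adjustment only controls the FDR if the raw p-values are valid in the first place, which is the part that differs between methods.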

i have also tried to interpret (overlap, pathway, compare to microarray data) the results, and found all gene lists more or less plausible. by the way, microarray in the same design (12 v 12) found nearly exactly the same number of sig. genes (~ 5600) with significance analysis of microarrays (SAM).

notably: the RNAseq data were only median 2.6 million paired reads (or median ~ 460 k mapped and counted reads), therefore a very low sequencing depth!

very interesting was that all methods except baySeq showed a log-linear correlation between the call percentage (the proportion of genes called at a given summed read count, i.e. the sum of counts over all 24 samples) and the summed read count. this call percentage increases (log-linearly) steadily over the complete range.
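A call percentage of this kind could be tabulated roughly as follows. This is a sketch under assumptions: genes are binned by the base-10 logarithm of their summed read count, and the bin width, function name, and toy data are hypothetical rather than taken from the analysis described above.

```python
import math
from collections import defaultdict


def call_percentage_by_depth(total_counts, called, bins_per_decade=1):
    """Percentage of genes called significant per log10 bin of summed counts.

    total_counts: per-gene sum of read counts over all samples
    called: per-gene booleans (True if the method called the gene)
    Returns {bin_index: percent}; the lower edge of a bin is
    10 ** (bin_index / bins_per_decade).
    """
    tallies = defaultdict(lambda: [0, 0])  # bin -> [called, total]
    for total, is_called in zip(total_counts, called):
        if total <= 0:
            continue  # genes without reads cannot be placed on a log scale
        bin_index = math.floor(math.log10(total) * bins_per_decade)
        tallies[bin_index][1] += 1
        if is_called:
            tallies[bin_index][0] += 1
    return {b: 100.0 * c / n for b, (c, n) in sorted(tallies.items())}


# Toy example: six genes with increasing summed counts.
totals = [5, 8, 50, 80, 500, 800]
calls = [False, False, True, False, True, True]
percentages = call_percentage_by_depth(totals, calls)
```

Plotting the percentage against the bin index would show whether the increase is roughly linear on the log scale.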

**arvid** · 05-24-2012, 12:09 AM

Originally posted by dietmar13

notably: the RNAseq data were only median 2.6 million paired reads (or median ~ 460 k mapped and counted reads), therefore a very low sequencing depth!

I'd really be interested in seeing comparisons between DE methods for differing depths. Your depths seem to be more shallow than the typical datasets, that's why I'm wondering whether the modeling of the overdispersion in DESeq and edgeR is really "kicking in" at these depths...
Did you try to pull your data through BitSeq as well (available on BioC)? I'd be interested to hear how that package compares, it seems promising (to me).

**dietmar13** · 05-24-2012, 12:45 AM

@arvid

i will try BitSeq - as I want to compare all relevant methods...

is a matched pairs design possible with BitSeq?

if I run into problems, perhaps you could assist me?

or do you have a typical script for analysis with BitSeq, starting from count data in a data.frame? it is easier to adapt a script...

**arvid** · 05-24-2012, 01:08 AM

Originally posted by dietmar13

i will try BitSeq - as I want to compare all relevant methods...

is a matched pairs design possible with BitSeq?

if I run into problems, perhaps you could assist me?

or do you have a typical script for analysis with BitSeq, starting from count data in a data.frame? it is easier to adapt a script...

BitSeq is improving on the ideas in RSEM/IsoEM and is doing DE as well. It needs a SAM/BAM (reads aligned to the transcriptome) with all valid alignments reported (not only unique ones), as it computes probabilities of the alignments. They are doing a lot of MCMC, so the whole process can be quite slow for big datasets.

Unfortunately, matched pair designs do not seem to be supported (yet), or at least they are not documented; the DE part is still somewhat new and not completely mature. The data the authors presented (I was talking to them at a workshop a couple of months ago) looked interesting, but I haven't seen a critical review of the software yet, which is why I suggested it.

I'll be trying it out on some data here, for which we do RT-qPCR (for more biological replicates than we can afford to do the RNA-Seq) as well. I'll let you know how it performs there...

**dietmar13** · 05-24-2012, 01:19 AM

@arvid

is the fasta file you have to provide the multi-fasta file with the chromosomes of the genome, or the one with the transcripts?

i have mapped with RUM, STAR and tophat.

RUM does a transcript as well as a genome mapping...

**arvid** · 05-24-2012, 01:21 AM

@dietmar13

Originally posted by dietmar13

is the fasta file you have to provide the multi-fasta file with the chromosomes of the genome, or the one with the transcripts?

i have mapped with RUM, STAR and tophat.

RUM does a transcript as well as a genome mapping...

You should do the alignment on transcripts, in the sense of multifasta with one entry for each transcript. I'm not familiar with RUM; does it report all valid alignments for each read/pair?
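One way to check that an alignment really was made against the transcript multi-fasta is to compare the reference names in the SAM header (the `SN:` field of each `@SQ` line) with the record names in the fasta file. The sketch below works on plain text lines, so a BAM file would first need its header exported as text (for example with `samtools view -H`). The function names and the tiny inline example are assumptions for illustration, and this is not part of BitSeq.

```python
def fasta_names(fasta_lines):
    """Names of all records in a (multi-)fasta, taken up to the first space."""
    return [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    ]


def sam_reference_names(sam_lines):
    """Reference sequence names from the @SQ lines of a SAM header."""
    names = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def references_match(fasta_lines, sam_lines):
    """True if the SAM header lists exactly the sequences in the fasta."""
    return set(fasta_names(fasta_lines)) == set(sam_reference_names(sam_lines))


# Hypothetical transcript fasta and two SAM headers.
transcripts = [">tx1 geneA\n", "ACGT\n", ">tx2 geneB\n", "GGCC\n"]
sam_on_transcripts = ["@HD\tVN:1.0\n", "@SQ\tSN:tx1\tLN:4\n", "@SQ\tSN:tx2\tLN:4\n"]
sam_on_genome = ["@HD\tVN:1.0\n", "@SQ\tSN:chr1\tLN:1000\n"]
```

A header that lists chromosomes instead of transcript names indicates a genome alignment, which is not what is being asked for here.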

**mbblack** · 05-24-2012, 05:26 AM

Originally posted by dietmar13

i have compared different methods for differential gene expression with a 12 versus 12 matched pairs design and got between 2 (cuffdiff) and ~5600 (SAMseq) significant genes (FDR 5%). in between were NOISeq, DESeq, baySeq, limma/voom, and edgeR (ascending order).

for your design, SAMseq (R package samr) with a quantitative outcome (use your doses as the outcome) would probably be a good choice. it is a non-parametric approach based on permutation, and with five biological replicates per dose you will have enough samples for stable results.

count with HTseq-count.

I will look at SAM, thanks. These are toxicology dose response experiments (the current data is just a trial run for a much larger series of chemical exposures), so we are far less interested in the genes significant across all samples than we are in the genes being dysregulated at each dose as concentration (and/or time of exposure) increases.

BTW, my count data is taken directly from LifeScope, which maps to the transcriptome (in my case, using Ensembl's default GTF file for rel. 66 of Rn4). LifeScope then summarized counts on exons, introns and intergenic based on whole library read sets (i.e. summarized across all barcoded read sets for each sample), although I could group those barcodes any way I want for summary stats.

Partek actually reads in its data from the mapped BAM files. Since LifeScope maps each barcode read set independently (and then summarizes data however it has been told to group those), one gets one BAM file for each barcode. I then had to merge those as desired to get a single BAM file per sample. Partek then reads those directly, re-indexes the mapped reads to its downloaded annotation files for either RefSeq or Ensembl, and computes counts and RPKM for both genes and transcripts. I did my ANOVAs on the RefSeq gene RPKM data.
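For readers unfamiliar with the unit, RPKM (reads per kilobase of feature per million mapped reads) follows the standard definition sketched below. How Partek computes it internally is not described in this thread, so this is only the textbook formula; the function name and the example gene are hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    # count / (length in kb) / (library size in millions)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 1350 reads on a 2000 bp gene in a library
# of 13.5 million mapped reads.
example_rpkm = rpkm(1350, 2000, 13_500_000)
```

Because the library size sits in the denominator, RPKM values are comparable across libraries of different depth, but the underlying counts at low depth are still noisier.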

Since each sample used 3 barcodes, I can analyze them individually or in pairs as well. Average mapped reads for a single barcode read set is about 4.6 million reads. Average mapped reads for any paired barcode read set is about 8.8 million reads, and average mapped reads for all 3 barcodes per library is about 13.5 million reads. At least by ANOVA using gene-based RPKM data, the results seem comparable for 8.8 and 13.5 million reads, but there is a large decrease in significant genes with only one barcode read set per library.

For ANOVA (significant by dose at FDR < 0.05): of the 1620 unique annotated genes significant in the Affy array analysis (I discarded all promiscuous probes in the list as I cannot unambiguously assign them to a gene, and removed redundant probes), 548 are shared with the 1524 genes significant by dose in the RNAseq data (using all 3 barcoded read sets per sample). I have not matched up the gene lists from the DESeq results yet.
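To put these overlap numbers in proportion, a small Python sketch can summarise two gene lists. The gene identifiers below are synthetic integers chosen only to reproduce the list sizes quoted above (1620, 1524, and 548 shared); the function name is an assumption.

```python
def overlap_summary(genes_a, genes_b):
    """Shared genes between two lists, as counts and fractions."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "fraction_of_a": len(shared) / len(a) if a else 0.0,
        "fraction_of_b": len(shared) / len(b) if b else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Synthetic IDs: 1620 "array" genes and 1524 "RNAseq" genes sharing 548.
array_genes = range(0, 1620)
rnaseq_genes = range(1072, 2596)
summary = overlap_summary(array_genes, rnaseq_genes)
```

With these sizes the shared genes make up about 34% of the array list and about 36% of the RNAseq list, with a Jaccard index of about 0.21.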

For now I am just focusing on the Partek ANOVA results using gene-based RPKM values. As I mentioned, these exact same animals (in fact, these exact same original RNA extractions) were used for the Affy array experiments. The RNAseq ANOVA results at least then make sense to me, relative to what I have for the array results. However, using count data with any tool I have tried thus far, the significantly differentially expressed genes detected are always far fewer than the array results or the RNAseq ANOVA results.

While I am by no means a statistician, that makes no sense to me. If the ANOVA results indicate my RNAseq data is at least as good as, or more sensitive than, array data for detecting differentially expressed genes, it makes me very uncomfortable when count data and/or other analytical tools produce far fewer (4-6 times fewer) significant results. I fully expect that the results may differ by method, but this level of inconsistency seems extreme to me. It is far greater than the level of inconsistency one sees with array data, using different normalization algorithms and/or different significance tests (at least in my experience with Affy and Agilent array data).

**narges** · 11-05-2012, 09:54 AM

Originally posted by arvid

BitSeq is improving on the ideas in RSEM/IsoEM and is doing DE as well. It needs a SAM/BAM (reads aligned to the transcriptome) with all valid alignments reported (not only unique ones), as it computes probabilities of the alignments. They are doing a lot of MCMC, so the whole process can be quite slow for big datasets.

I wanted to ask what you mean by "all valid alignments". For example, if I use the "accepted_hits.bam" file from the TopHat output, would that be acceptable? I have used this file as the BitSeq input and I get the error below:
Error in parseAlignment(alignFile, probF, trSeqFile, trInfoFile = trF, :
Main: number of transcripts don't match: 25 vs 5927492
If this error is related to this topic, what should I use instead?
Thank you in advance

**NGSfan** · 11-16-2012, 02:40 PM

Originally posted by mbblack

I will look at SAM, thanks. These are toxicology dose response experiments (the current data is just a trial run for a much larger series of chemical exposures), so we are far less interested in the genes significant across all samples than we are in the genes being dysregulated at each dose as concentration (and/or time of exposure) increases.

BTW, my count data is taken directly from LifeScope, which maps to the transcriptome (in my case, using Ensembl's default GTF file for rel. 66 of Rn4). LifeScope then summarized counts on exons, introns and intergenic based on whole library read sets (i.e. summarized across all barcoded read sets for each sample), although I could group those barcodes any way I want for summary stats.

Partek actually reads in its data from the mapped BAM files. Since LifeScope maps each barcode read set independently (and then summarizes data however it has been told to group those), one gets one BAM file for each barcode. I then had to merge those as desired to get a single BAM file per sample. Partek then reads those directly, re-indexes the mapped reads to its downloaded annotation files for either RefSeq or Ensembl, and computes counts and RPKM for both genes and transcripts. I did my ANOVAs on the RefSeq gene RPKM data.

Since each sample used 3 barcodes, I can analyze them individually or in pairs as well. Average mapped reads for a single barcode read set is about 4.6 million reads. Average mapped reads for any paired barcode read set is about 8.8 million reads, and average mapped reads for all 3 barcodes per library is about 13.5 million reads. At least by ANOVA using gene-based RPKM data, the results seem comparable for 8.8 and 13.5 million reads, but there is a large decrease in significant genes with only one barcode read set per library.

For ANOVA (significant by dose at FDR < 0.05): of the 1620 unique annotated genes significant in the Affy array analysis (I discarded all promiscuous probes in the list as I cannot unambiguously assign them to a gene, and removed redundant probes), 548 are shared with the 1524 genes significant by dose in the RNAseq data (using all 3 barcoded read sets per sample). I have not matched up the gene lists from the DESeq results yet.

For now I am just focusing on the Partek ANOVA results using gene-based RPKM values. As I mentioned, these exact same animals (in fact, these exact same original RNA extractions) were used for the Affy array experiments. The RNAseq ANOVA results at least then make sense to me, relative to what I have for the array results. However, using count data with any tool I have tried thus far, the significantly differentially expressed genes detected are always far fewer than the array results or the RNAseq ANOVA results.

While I am by no means a statistician, that makes no sense to me. If the ANOVA results indicate my RNAseq data is at least as good as, or more sensitive than, array data for detecting differentially expressed genes, it makes me very uncomfortable when count data and/or other analytical tools produce far fewer (4-6 times fewer) significant results. I fully expect that the results may differ by method, but this level of inconsistency seems extreme to me. It is far greater than the level of inconsistency one sees with array data, using different normalization algorithms and/or different significance tests (at least in my experience with Affy and Agilent array data).

You might want to look at our paper... it could be that your sequencing depth is not sufficient to quantify your expression...

Characterization and improvement of RNA-Seq precision - SEQanswers

http://seqanswers.com/forums/showpost.php?p=44378&postcount=1

**mbblack** · 11-17-2012, 06:27 AM

Originally posted by NGSfan

You might want to look at our paper... it could be that your sequencing depth is not sufficient to quantify your expression...

http://seqanswers.com/forums/showpos...78&postcount=1

Thanks, I already had it but have not had a chance to read through it yet - I'll do that next week for sure.

I actually do have more sequence now: since my initial posting, we've re-sequenced the residual beads to nearly double what we had (a gain of about 91% in average read depth over our 30 samples), and I am writing up a manuscript now.

On a practical note, for us, the biggest concern right now is cost and throughput of RNAseq for DGE versus arrays (as we cost out some long term, large scale studies). As I'm sure is not surprising to many here, arrays (well, specifically Affy Titan arrays), to my mind, still have a very large pragmatic edge over any current or near-future proposed RNAseq technologies, both in terms of cost and wet-bench throughput for large sample whole genome DGE studies.

Diff. expression with RNAseq - varying results by method
