Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Diff. expression with RNAseq - varying results by method

    I have a dose response experiment using Rats, with Controls and 5 treatment doses, each with 5 biological replicates.

    The data is single read RNAseq, 50bp reads, run on an ABI 5500xl, and I mapped the reads to the Ensembl rel. 66 Rat genome using ABI's LifeScope 2.5.1.

    For mapped reads, I have an average of 12,216,885 reads on exons per sample (the numbers range from a low of 5,171,445 for a Dose group 4 animal, and a high of 17,323,22 for a Dose group 3 animal)

    I've been analyzing this data for differential gene expression with a couple of approaches and am getting radically different results.

    DESeq using raw counts summarized on genes, genes with FDR < 0.05:

    Dose 1 = n/a (have not run this one yet)
    Dose 2 = 45
    Dose 3 = 146
    Dose 4 = 520
    Dose 5 = 540

    Now with DESeq, I did analyze each dose as a single pair of Controls (N=5) and Treatment (N=5)

    Similarly, I only have the highest dose done using cuffdiff v2.0.0 (3384), but again, genes with q-value < 0.05:
    Dose 5 = 64

    Then, I pulled the mapped BAM files into Partek Genome Suite 6.6, re-indexed them to Refseq genes, got gene RPKM values and did a 1way ANOVA on all 30 samples, with linear contrasts for individual Dose fold change and p-values.

    genes significant by dose at FDR < 0.05 = 1524

    genes significant by linear contrast at FDR < 0.05
    Dose 1 = 3
    Dose 2 = 3
    Dose 3 = 280
    Dose 4 = 1722
    Dose 5 = 3665

    Even if I break these down and run each Dose as a discrete ANOVA (n=10) I get, genes significant by dose at FDR < 0.05:

    Dose 1 = 0
    Dose 2 = 1
    Dose 3 = 70
    Dose 4 = 1043
    Dose 5 = 1848

    And just for comparison, I have these exact same animals run on Affy HT_Rat30_PM (Titan) arrays, again as a single ANOVA (N=30) of RMA normalized data, with linear contrasts for each Dose

    genes significant by dose at FDR < 0.05 = 2375

    genes significant by linear contrast at FDR < 0.05
    Dose 1 = 2
    Dose 2 = 67
    Dose 3 = 1053
    Dose 4 = 2223
    Dose 5 = 3073

    I have not yet looked at these lists of genes for overlap and so forth yet, although just casually looking at the top ranked (by FDR) genes in Partek and DESeq, I see a lot of the same ones. But what I am wondering about is the large discrepancy in the numbers of significant genes.

    Any ideas why the different RNAseq methods would be so different? Is this purely a function of the differences between count data and RPKM data and the different variance models? I was not expecting the results to be identical, but at least more consistent than this?
    Michael Black, Ph.D.
    ScitoVation LLC. RTP, N.C.

  • #2
    similar varying results

    i have compared different methods for differential gene expression with a 12 versus 12 marched pairs design and got between 2 (cuffdiff) and ~5600 (SAMseq) significant genes (FDR 5%). inbetween were NOISeq, DESseq, baySeq, limma/voom, and edgeR (ascendning order).

    for your design probably SAMseq (R-package SAM) with quantitative outcome (use you doses as outcome) would be a good choice. it is a non-parametric approach with permutation, and with each five biological replicates you will have enough samples for stable results.

    count with HTseq-count.

    Comment


    • #3
      12 replicates is a large number, and that gives a non-parametric method as SAMSeq an edge. I wonder if this is already true for 5 samples.

      How did you do the comparison, by the way? It sounds a bit as if you simply ranked by number of detected genes, but with would make sense only if you somehow were able to verify that all methods control false discovery rate as advertised. Being sure of this is usually the hard part.

      Comment


      • #4
        @simon

        i am no statistician, but what i understand is that as SAMseq can use the quantitative outcome as dependent variable and will find genes which are deregulated over the complete dose range, it is not completely relevant that there are only 5 replicates for each dose. all 30 samples will be considered together.

        method comparison:
        all of the programs state explicitly, that they provide FDRs, either by permutation or by benjamini-hochberg ...

        i have also tried to interpret (overlap, pathway, compare to microarray data) the results, and found all gene lists more or less plausible. by the way, microarray in the same design (12 v 12) found nearly exactly the same number of sig. genes (~ 5600) with significance analysis of microarrays (SAM) -

        notably: the RNAseq data were only median 2.6 mio paired reads (or median ~ 460 k mapped and counted reads)., there fore a very low sequencing depth!

        very interesting was, that for all methods except baySeq a log-linear correlation of call-percentages (which proportion of genes are called for specific sum read counts = sum of count over all 24 samples). this call-percentage is increasing (log-linear) constantly over the complete range.

        Comment


        • #5
          Originally posted by dietmar13 View Post
          notably: the RNAseq data were only median 2.6 mio paired reads (or median ~ 460 k mapped and counted reads)., there fore a very low sequencing depth!
          I'd really be interested in seeing comparisons between DE methods for differing depths. Your depths seem to be more shallow than the typical datasets, that's why I'm wondering whether the modeling of the overdispersion in DESeq and edgeR is really "kicking in" at these depths...
          Did you try to pull your data through BitSeq as well (available on BioC)? I'd be interested to hear how that package compares, it seems promising (to me).

          Comment


          • #6
            @arvid

            i will try BitSeq - as I want compare all relevant methods...

            is a matched pairs design possible with BitSeq?

            if I run into problems, perheps you could assist me?

            or do you have a typical script for analysis with BitSeq, starting from count data in a data.frame. it is easier to adapt a script...

            Comment


            • #7
              Originally posted by dietmar13 View Post
              i will try BitSeq - as I want compare all relevant methods...

              is a matched pairs design possible with BitSeq?

              if I run into problems, perheps you could assist me?

              or do you have a typical script for analysis with BitSeq, starting from count data in a data.frame. it is easier to adapt a script...
              BitSeq is improving on the ideas in RSEM/IsoEM and is doing DE as well. It needs a SAM/BAM (reads aligned to the transcriptome) with all valid alignments reported (not only unique ones), as it computes probabilities of the alignments. They are doing a lot of MCMC, so the whole process can be quite slow for big datasets.

              Unfortunately, matched pair designs seem not to be supported (yet) - at least not documented; the DE part is still somewhat new and not completely mature yet. The data the authors presented (I was talking to them on a workshop a couple of months ago) looked interesting, but I haven't seen a critical review of the software yet, that's why I suggested it.

              I'll be trying it out on some data here, for which we do RT-qPCR (for more biological replicates than we can afford to do the RNA-Seq) as well. I'll let you know how it performs there...

              Comment


              • #8
                @arvid

                the fasta-file you have to provide is the multifasta-file with chromosomes of the genome or with transcripts?

                i have mapped with RUM, STAR and tophat.

                RUM does a transcript as well as a genome mapping...

                Comment


                • #9
                  @dietmar13

                  Originally posted by dietmar13 View Post
                  the fasta-file you have to provide is the multifasta-file with chromosomes of the genome or with transcripts?

                  i have mapped with RUM, STAR and tophat.

                  RUM does a transcript as well as a genome mapping...
                  You should do the alignment on transcripts, in the sense of multifasta with one entry for each transcript. I'm not familiar with RUM; does it report all valid alignments for each read/pair?

                  Comment


                  • #10
                    Originally posted by dietmar13 View Post
                    i have compared different methods for differential gene expression with a 12 versus 12 marched pairs design and got between 2 (cuffdiff) and ~5600 (SAMseq) significant genes (FDR 5%). inbetween were NOISeq, DESseq, baySeq, limma/voom, and edgeR (ascendning order).

                    for your design probably SAMseq (R-package SAM) with quantitative outcome (use you doses as outcome) would be a good choice. it is a non-parametric approach with permutation, and with each five biological replicates you will have enough samples for stable results.

                    count with HTseq-count.
                    I will look at SAM, thanks. These are toxicology dose response experiments (the current data is just a trial run for a much larger series of chemical exposures), so we are far less interested in the genes significant across all samples as we are in the genes being dysregulated at each dose as concentration (and/or time of exposure) increases.

                    BTW, my count data is taken directly from LifeScope, which maps to the transcriptome (in my case, using Ensembl's default GTF file for rel. 66 of Rn4). LifeScope then summarized counts on exons, introns and intergenic based on whole library read sets (i.e. summarized across all barcoded read sets for each sample), although I could group those barcodes any way I want for summary stats.

                    Partek actually reads in its data from the mapped BAM files. Since LifeScope maps each barcode read set independently (and then summarizes data however it has been told to group those), one gets one BAM file for each barcode. I then had to merge those as desired to get a single BAM file per sample. Partek then reads those directly, and re-indexes the mapped reads to its downloaded annotation files for either Refseq or Ensembl and computes counts and RPKM for both genes and transcripts. I did my ANOVA's on the RefSeq gene RPKM data.

                    Since each sample used 3 barcodes, I can analyze them individually or in pairs as well. Average mapped reads for single barcode read set is about 4.6 million reads. Average mapped reads for any paired barcode read set is about 8.8 million reads, and average mapped reads for all 3 barcodes per library is about 13.5 million reads. And at least by ANOVA using gene-based RPKM data, the results seem comparable for 8.8 and 13.5 million reads, but there is a large decrease in significant genes with only one barcode read set per library.

                    For ANOVA, (Signficant by Dose at FDR < 0.05) of the 1620 unique annotated genes significant from the Affy array analysis (I discarded all promiscuous probes in the list as I cannot unambiguously assign them to a gene, and removed redundant probes), 548 of those are shared with the 1524 genes signficant by dose in the RNA seq data (using all 3 barcoded read sets per sample). I have not matched up the gene lists from the DEseq results yet.

                    For now I am just focusing on the Partek ANOVA results using gene-based RPKM values. As I mentioned, these exact same animals (in fact, these exact same original RNA extractions) were used for the Affy array experiments. The RNAseq ANOVA results at least then makes sense to me, relative to what I have for the array results. However, using count data with any tool I have tried thus far, the significantly differentially expressed genes detected are always far fewer than the array results or the RNAseq ANOVA results.

                    While I am by no means a statistician, that makes no sense to me. If the ANOVA results indicate my RNAseq data is at least as good as, or more sensitive, than differentially expressed genes by array data, it makes me very uncomfortable when count data and/or other analytical tools produces far fewer (4-6 times fewer) significant results. I fully expect that the results may differ by method, but this level of inconsistency seems extreme to me. It is far greater than the level of inconsistency one sees with array data, using different normalization algorithms and/or different significance tests (at least in my experience with Affy and Agilent array data).
                    Michael Black, Ph.D.
                    ScitoVation LLC. RTP, N.C.

                    Comment


                    • #11
                      Originally posted by arvid View Post
                      BitSeq is improving on the ideas in RSEM/IsoEM and is doing DE as well. It needs a SAM/BAM (reads aligned to the transcriptome) with all valid alignments reported (not only unique ones), as it computes probabilities of the alignments. They are doing a lot of MCMC, so the whole process can be quite slow for big datasets.
                      I wanted to ask what do you mean by "All valid alignments"? For example if I use "accepted_hits.bam" file from TopHat output, would it be acceptable? Because I have used this file as the BitSeq input and there is an error like below:
                      Error in parseAlignment(alignFile, probF, trSeqFile, trInfoFile = trF, :
                      Main: number of transcripts don't match: 25 vs 5927492
                      If this error is related to this topic what should I use instead?
                      Thank you in advance

                      Comment


                      • #12
                        Originally posted by mbblack View Post
                        I will look at SAM, thanks. These are toxicology dose response experiments (the current data is just a trial run for a much larger series of chemical exposures), so we are far less interested in the genes significant across all samples as we are in the genes being dysregulated at each dose as concentration (and/or time of exposure) increases.

                        BTW, my count data is taken directly from LifeScope, which maps to the transcriptome (in my case, using Ensembl's default GTF file for rel. 66 of Rn4). LifeScope then summarized counts on exons, introns and intergenic based on whole library read sets (i.e. summarized across all barcoded read sets for each sample), although I could group those barcodes any way I want for summary stats.

                        Partek actually reads in its data from the mapped BAM files. Since LifeScope maps each barcode read set independently (and then summarizes data however it has been told to group those), one gets one BAM file for each barcode. I then had to merge those as desired to get a single BAM file per sample. Partek then reads those directly, and re-indexes the mapped reads to its downloaded annotation files for either Refseq or Ensembl and computes counts and RPKM for both genes and transcripts. I did my ANOVA's on the RefSeq gene RPKM data.

                        Since each sample used 3 barcodes, I can analyze them individually or in pairs as well. Average mapped reads for single barcode read set is about 4.6 million reads. Average mapped reads for any paired barcode read set is about 8.8 million reads, and average mapped reads for all 3 barcodes per library is about 13.5 million reads. And at least by ANOVA using gene-based RPKM data, the results seem comparable for 8.8 and 13.5 million reads, but there is a large decrease in significant genes with only one barcode read set per library.

                        For ANOVA, (Signficant by Dose at FDR < 0.05) of the 1620 unique annotated genes significant from the Affy array analysis (I discarded all promiscuous probes in the list as I cannot unambiguously assign them to a gene, and removed redundant probes), 548 of those are shared with the 1524 genes signficant by dose in the RNA seq data (using all 3 barcoded read sets per sample). I have not matched up the gene lists from the DEseq results yet.

                        For now I am just focusing on the Partek ANOVA results using gene-based RPKM values. As I mentioned, these exact same animals (in fact, these exact same original RNA extractions) were used for the Affy array experiments. The RNAseq ANOVA results at least then makes sense to me, relative to what I have for the array results. However, using count data with any tool I have tried thus far, the significantly differentially expressed genes detected are always far fewer than the array results or the RNAseq ANOVA results.

                        While I am by no means a statistician, that makes no sense to me. If the ANOVA results indicate my RNAseq data is at least as good as, or more sensitive, than differentially expressed genes by array data, it makes me very uncomfortable when count data and/or other analytical tools produces far fewer (4-6 times fewer) significant results. I fully expect that the results may differ by method, but this level of inconsistency seems extreme to me. It is far greater than the level of inconsistency one sees with array data, using different normalization algorithms and/or different significance tests (at least in my experience with Affy and Agilent array data).

                        You might want to look at our paper... it could be your sequencing depth is not sufficient enough to quantify your expression...

                        Discussion of any scientific study related to high content or next generation genomics. Whole genome association, metagenomics, digital gene expression, etc.

                        Comment


                        • #13
                          Originally posted by NGSfan View Post
                          You might want to look at our paper... it could be your sequencing depth is not sufficient enough to quantify your expression...

                          http://seqanswers.com/forums/showpos...78&postcount=1
                          Thanks, I already had it but have not had a chance to read through it yet - I'll do that next week for sure.

                          I actually do have more sequence now, as since my initial posting, we've re-sequenced the residual beads to nearly double what we had (a gain of about 91% in average read depth over our 30 samples), and I am writing up a manuscript now.

                          On a practical note, for us, the biggest concern right now is cost and throughput of RNAseq for DGE versus arrays (as we cost out some long term, large scale studies). As I'm sure is not surprising to many here, arrays (well, specifically Affy Titan arrays), to my mind, still have a very large pragmatic edge over any current or near-future proposed RNAseq technologies, both in terms of cost and wet-bench throughput for large sample whole genome DGE studies.
                          Michael Black, Ph.D.
                          ScitoVation LLC. RTP, N.C.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          27 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          30 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          26 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          52 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X