  • beliefbio
    Junior Member
    • Oct 2009
    • 7

    how to study differential expression?

    My data come from two tissues sequenced on Illumina with 75 nt reads. Is there a standard way to study differential expression?
    Is it necessary to calculate RPKM for each gene? If so what is the best tool to calculate RPKM? ERANGE, TopHat or Cufflinks?
    Is simply counting and comparing the number of reads mapping to each gene between tissues also acceptable for studying differential gene expression?

    Thank you for your time!
  • Xi Wang
    Senior Member
    • Oct 2009
    • 317

    #2
    You may use DEGseq to do the analysis you want. The input for DEGseq can be mapped reads rather than RPKM values.
    Have a look:

    and the related paper:


    best,
    Xi
    Xi Wang

    Comment

    • svl
      Member
      • Sep 2009
      • 43

      #3
      Originally posted by beliefbio View Post
      Is it necessary to calculate RPKM for each gene?
      Because transcripts (or genes) vary in length (kilobases) and sequence-runs vary in the amount of reads produced, you would somehow like to account for these variations if you want to compare runs/samples. RPKM is a measure that (up to a certain degree of course) accounts for these.

      If so what is the best tool to calculate RPKM? ERANGE, TopHat or Cufflinks?
      I haven't used ERANGE yet. TopHat is for mapping, not counting (it does count, but its creator has said this will be removed from future versions now that Cufflinks exists), so Cufflinks is the tool meant for RPKM determination.

      So, you could map with TopHat and then feed the resulting "accepted_hits.sam" file to Cufflinks, which will count and return RPKM values. But be aware that TopHat does more than just mapping: it tries to find exon-exon splice junctions (and is therefore potentially slow if all you need is mapping).

      -svl

      update: and btw, when you have the RPKM values from Cufflinks you could also use the mentioned DEGseq for determining which transcripts are differentially expressed.
      Last edited by svl; 12-01-2009, 03:25 AM.
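
As a rough illustration of the RPKM measure described in the post above, here is a minimal Python sketch. The function name `rpkm` and the example numbers are hypothetical, and this is a direct count-based calculation, not what Cufflinks does internally.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_of_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a run with 10 million mapped reads.
example_value = rpkm(500, 2_000, 10_000_000)
print(example_value)  # 25.0
```

Dividing by kilobases accounts for transcript length, and dividing by millions of mapped reads accounts for run-to-run differences in sequencing depth.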

      Comment

      • beliefbio
        Junior Member
        • Oct 2009
        • 7

        #4
        Thanks a lot svl!!

        Comment

        • kmcarr
          Senior Member
          • May 2008
          • 1181

          #5
          Originally posted by svl View Post
          Because transcripts (or genes) vary in length (kilobases) and sequence-runs vary in the amount of reads produced, you would somehow like to account for these variations if you want to compare runs/samples. RPKM is a measure that (up to a certain degree of course) accounts for these.
          If you are examining differential expression of genes between samples you don't really need to normalize for transcript length. When comparing gene to gene between samples the length of the transcript is constant (let's ignore the possibility of differential isoform expression). In this case you only need to normalize for the total number of reads in each sample pool.
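
A small Python sketch of why the length term drops out when the same gene is compared between two samples. All numbers and function names here are hypothetical and chosen only for illustration.

```python
import math


def rpm(read_count, total_mapped_reads):
    """Reads per million mapped reads (no length normalization)."""
    return read_count / (total_mapped_reads / 1_000_000)


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return rpm(read_count, total_mapped_reads) / (transcript_length_bp / 1_000)


# Hypothetical 3,000 bp gene measured in two tissues.
gene_length_bp = 3_000
reads_a, total_a = 300, 10_000_000
reads_b, total_b = 1_200, 20_000_000

ratio_rpm = rpm(reads_b, total_b) / rpm(reads_a, total_a)
ratio_rpkm = (rpkm(reads_b, gene_length_bp, total_b)
              / rpkm(reads_a, gene_length_bp, total_a))

# The gene length appears in both numerator and denominator, so it cancels.
print(math.isclose(ratio_rpm, ratio_rpkm))  # True
```

This only holds while the effective transcript length is the same in both samples, which is the isoform caveat mentioned in the post.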

          Comment

          • Xi Wang
            Senior Member
            • Oct 2009
            • 317

            #6
            Originally posted by kmcarr View Post
            If you are examining differential expression of genes between samples you don't really need to normalize for transcript length. When comparing gene to gene between samples the length of the transcript is constant (let's ignore the possibility of differential isoform expression). In this case you only need to normalize for the total number of reads in each sample pool.
            I totally agree with your point. DEGseq follows this to identify differentially expressed genes.
            Xi Wang

            Comment

            • svl
              Member
              • Sep 2009
              • 43

              #7
              Agreed. Without length normalization, though, looking at other things, such as the top 100 expressed genes/transcripts within a sample, is impossible, so for the sake of future comparisons it's nice to use RPKM instead of RPM; it's not hard to calculate anyway. But you're absolutely right!
              Last edited by svl; 12-01-2009, 02:04 PM.

              Comment

              • tebuffer
                Member
                • Jun 2009
                • 13

                #8
                CuffCompare

                Originally posted by Xi Wang View Post
                I totally agree with your point. DEGseq follows this to identify differentially expressed genes.
                Cuffcompare (which is part of the Cufflinks package) could be used to identify differentially expressed genes.

                Comment

                • yvan.wenger
                  Member
                  • Aug 2009
                  • 30

                  #9
                  Hello everybody,

                  Some quick questions about the topic, I number them as they are quite different from each other. Any input appreciated!

                  1. Can tophat/cufflinks be used with a de-novo transcriptome assembly if no good genome is available (assuming that SOME contigs are actually long isoforms containing most exons)?

                  2. Is it correct that the model behind Cufflinks tries to allocate reads that map to multiple locations, thus giving a more precise result when two isoforms are almost identical (e.g. premature stops)?

                  3. I understand that the RPKM (Reads Per Kilobase exon Model per million mapped reads) is:
                  3a. number of reads normalized per kilobase of exon (to make it more comparable to qPCR results, although with caveats; good for relative comparison of transcript abundance within one sample)
                  3b. per million mapped reads (to normalize between different sequenced libraries)
                  (3c. limited to uniquely mapped reads, except in the case of Cufflinks?)

                  I think that point 3a cannot really be detrimental, although it can give a false sense of absolute quantitation, for example in the case of premature stops if only unambiguously mapped reads are taken into account. However, it can be useful, as mentioned above by svl.

                  On 3b, my main question: I am not sure that normalizing on the total number of mapped reads is fully satisfying in cases where gene expression is massively altered for highly expressed transcripts. Does anybody know whether a package for RNA-seq (or adapted from microarrays) allows quantile regression, ideally with outlier removal? Or whether this method would perform worse than normalization on the total mapped count in certain cases?

                  Cheers,

                  Yvan
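
The concern raised in 3b above can be made concrete with a toy example. The Python sketch below uses invented counts for five genes, and it compares total-count scaling with a simple median-of-ratios scaling. The latter is swapped in here as one robust alternative; it is not the quantile regression asked about in the post and not the implementation of any particular package. It also assumes no zero counts.

```python
from statistics import median

# Hypothetical counts. Genes g1..g4 are unchanged between libraries;
# g5 is highly expressed and strongly induced in sample B.
sample_a = {"g1": 100, "g2": 200, "g3": 300, "g4": 400, "g5": 1000}
sample_b = {"g1": 100, "g2": 200, "g3": 300, "g4": 400, "g5": 9000}


def total_count_ratios(a, b):
    """Per-gene B/A ratios after scaling each library by its total count."""
    total_a, total_b = sum(a.values()), sum(b.values())
    return {g: (b[g] / total_b) / (a[g] / total_a) for g in a}


def median_ratio_ratios(a, b):
    """Per-gene B/A ratios after scaling B by the median raw B/A ratio."""
    scale = median(b[g] / a[g] for g in a)
    return {g: (b[g] / scale) / a[g] for g in a}


by_total = total_count_ratios(sample_a, sample_b)
by_median = median_ratio_ratios(sample_a, sample_b)

# Total-count scaling makes the unchanged gene g1 look about 5-fold repressed,
# while the median-based scale factor leaves it at a ratio of 1.
print(by_total["g1"], by_median["g1"])
```

With total-count scaling the single induced gene inflates the library size of sample B, so every unchanged gene appears to go down.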

                  Comment

                  • jiwu2573
                    Member
                    • Jun 2009
                    • 34

                    #10
                    Cuffcompare output for DE genes

                    Originally posted by tebuffer View Post
                    Cuffcompare (which is part of the Cufflinks package) could be used to identify differentially expressed genes.
                    Can Cuffcompare directly give out the list of differentially expressed genes?

                    If not, how can its output be used for the identification of DE genes?

                    Comment

                    • mkatari
                      Junior Member
                      • Jan 2009
                      • 5

                      #11
                      Originally posted by svl View Post
                      Agreed. Without length normalization, though, looking at other things, such as the top 100 expressed genes/transcripts within a sample, is impossible, so for the sake of future comparisons it's nice to use RPKM instead of RPM; it's not hard to calculate anyway. But you're absolutely right!
                      If you are interested in differential expression then once you calculate the log ratio, you may be more interested in the top 100 induced/repressed transcripts rather than 100 most highly expressed transcripts.
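
A minimal Python sketch of ranking by log ratio, as suggested above. The gene names and expression values are invented, and the pseudocount of 1 is an assumption added here only to avoid division by zero.

```python
import math

# Hypothetical normalized expression values (e.g. reads per million).
tissue_a = {"geneA": 10.0, "geneB": 80.0, "geneC": 5.0, "geneD": 40.0}
tissue_b = {"geneA": 40.0, "geneB": 20.0, "geneC": 5.0, "geneD": 80.0}


def log2_ratios(a, b, pseudocount=1.0):
    """log2(B/A) per gene, with a pseudocount to guard against zeros."""
    return {g: math.log2((b[g] + pseudocount) / (a[g] + pseudocount))
            for g in a}


ratios = log2_ratios(tissue_a, tissue_b)

# Sort gene names by log ratio: descending for induced, ascending for repressed.
most_induced = sorted(ratios, key=ratios.get, reverse=True)
most_repressed = sorted(ratios, key=ratios.get)

print(most_induced[0], most_repressed[0])  # geneA geneB
```

Taking the first 100 entries of each list would give the top 100 induced and repressed transcripts, without any need for length normalization.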

                      Comment

                      • Cole Trapnell
                        Senior Member
                        • Nov 2008
                        • 213

                        #12
                        Originally posted by jiwu2573 View Post
                        Can Cuffcompare directly give out the list of differentially expressed genes?

                        If not, how can its output be used for the identification of DE genes?
                        I just wanted to point out that we just released a standalone tool, "cuffdiff", as part of the Cufflinks package to help you test for differential expression and regulation in your samples. Cuffdiff does differential expression on genes and transcripts, and a few other tests you may find helpful.

                        Comment

                        • Simon Anders
                          Senior Member
                          • Feb 2010
                          • 995

                          #13
                          Hi,

                          as already pointed out, it is not necessary to normalize for transcript length. It is even advantageous to not do so, as you can then use a statistical test that takes the specificities of count data into account, which gives you much better power at low count rates.

                          We have recently released a tool to do this, called DESeq: http://www-huber.embl.de/users/anders/DESeq/

                          DESeq is based on the so-called negative binomial distribution, which allows a powerful test for count data. Furthermore, it can estimate the variance between the samples from the data and uses this information in the test. The basic idea is older and has, e.g., already been used in the edgeR package (Robinson and Smyth), but we added an improved variance estimation that does a better job if the amount of noise depends on the expression strength as is often the case.

                          Note that this variance estimation is crucial. It is often claimed (e.g. by the DEGSeq package suggested above) that a Poisson-based test, such as the binomial or the chi-squared test, is suitable, but then the p value will only tell you whether your difference is stronger than what to expect between _technical_ replicates, which is not biologically meaningful.
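
To illustrate the difference between the two noise models, the Python sketch below compares the variance implied by a Poisson model with that of a negative binomial model in the common mean-dispersion form (variance = mean + dispersion × mean²). The dispersion value of 0.1 is an arbitrary assumption, and this is only the textbook formula, not DESeq's variance estimation.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial variance in the mean-dispersion parameterization."""
    return mean + dispersion * mean ** 2


assumed_dispersion = 0.1  # hypothetical value, for illustration only

for mean_count in (10, 100, 1000):
    pois = poisson_variance(mean_count)
    nb = negative_binomial_variance(mean_count, assumed_dispersion)
    print(mean_count, pois, nb)
```

The extra quadratic term grows quickly with the mean, so a Poisson-based test increasingly understates the noise for strongly expressed genes when replicates vary more than technical replicates would.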

                          Comment

                          • Fabien Campagne
                            Member
                            • Feb 2010
                            • 39

                            #14
                            You would need biological replicates to assess biological variability. One sample in each group limits your ability to see how much biological variability you should expect in future experiments, irrespective of the statistical test being used.

                            Regarding benchmarking of statistical methods for RNA-Seq data, I would recommend this paper from the Dudoit lab:



                            On the practical side of things, we have recently released a set of tools that includes a program to estimate various statistics of differential expression. It can compute RPKMs, Fisher exact tests to compare low counts across groups, and t-tests when you have several samples per group. All statistics are corrected for multiple testing with a Benjamini-Hochberg FDR correction. We've tried to make it easy and fast to go from reads to differential expression results.

                            See the Goby home page at http://icbtools.med.cornell.edu/goby/ and a tutorial at http://icb.med.cornell.edu/wiki/index.php/Goby/DE
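
For readers unfamiliar with the Benjamini-Hochberg correction mentioned above, here is a minimal pure-Python sketch of the standard step-up procedure. The function name and the example p-values are hypothetical, and this is not Goby's code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
print(adjusted_p)
```

Genes whose adjusted value falls below the chosen FDR threshold (0.05 is a common choice) are reported as differentially expressed.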

                            Comment

                            • lpachter
                              Member
                              • Feb 2010
                              • 40

                              #15
                              I'd just like to clarify some of the discussion on this thread regarding how to normalize reads, how to measure expression, and then how to find differential expression.

                              First of all, RPKM is a unit, not a method. It stands for "reads per kilobase of transcript per million of sequenced reads". As we point out in the Cufflinks paper (to appear shortly) this unit is flawed, as the objects being sequenced are fragments, not reads. We use the unit FPKM (expected fragments per kilobase of transcript per million fragments sequenced). This is not only a technicality: it is crucial to use units that are proportional to (i.e. a scalar multiple of) the estimated proportion of each transcript. FPKM has this property, RPKM cannot.

                              Secondly, regarding expression estimates, a currently favored method is to "count" the reads that map to a gene and normalize by length. If the gene has a single isoform, this is well-defined, but it's problematic with multiple isoforms that may have different lengths and share different exons. The currently favored method I allude to, counting all reads that map somewhere in the locus and dividing by the number of exonic bases, _provably underestimates gene expression_. It is essential not only to normalize by transcript length but also to probabilistically assign fragments to isoforms. This is what Cufflinks does.

                              Regarding differential expression tests, one has to keep in mind that in genes with multiple isoforms the relative abundances may change, making it crucial to have correctly estimated individual expression levels.
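
The underestimation described in the second point above can be seen in a toy case. The Python sketch below assumes a hypothetical gene with two isoforms, where only the short isoform is expressed. All lengths and counts are invented, and this is a single worked example, not the proof referred to in the post.

```python
def per_kb_per_million(reads, length_bp, total_reads):
    """Reads per kilobase per million mapped reads."""
    return reads / ((length_bp / 1_000) * (total_reads / 1_000_000))


# Hypothetical locus: the long isoform contains every exon of the short one.
short_isoform_length_bp = 1_000
union_exonic_bases = 3_000      # all exonic bases in the locus
reads_in_locus = 600            # assume all come from the short isoform
total_mapped_reads = 10_000_000

# Dividing by the length of the isoform that actually produced the reads.
isoform_aware = per_kb_per_million(
    reads_in_locus, short_isoform_length_bp, total_mapped_reads)

# Dividing by all exonic bases in the locus (the "count and divide" method).
union_model = per_kb_per_million(
    reads_in_locus, union_exonic_bases, total_mapped_reads)

print(isoform_aware, union_model)
```

Because the union of exons is at least as long as any single isoform, dividing by it can only lower the estimate. In real data the expressed isoforms are not known in advance, which is why fragments need to be assigned to isoforms probabilistically.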

                              Comment
