  • zero rna-seq values AFTER normalisation in edgeR

    I am using edgeR to analyze RNA-Seq data. This is my script:


    library("edgeR")
    #############################
    #read in metadata & DGE
    #############################
    composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",")
    counts <- readDGE(composite_samples$CountFiles)$counts
    #############################
    #Filter & Library Size Re-set
    #############################
    noint <- rownames(counts) %in% (c("no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique"))
    cpms <- cpm(counts)
    keep <- rowSums(cpms>1)>=3 & !noint
    counts <- counts[keep,]
    colnames(counts) <- composite_samples$SampleName
    d <- DGEList(counts=counts, group=composite_samples$Condition)
    d$samples$lib.size <- colSums(d$counts)
    #############################
    #Normalisation
    #############################
    d <- calcNormFactors(d)
    #############################
    #Recording the normalized counts
    #############################
    all_cpm <- cpm(d, normalized.lib.sizes=TRUE)
    all_counts <- cbind(rownames(all_cpm), all_cpm)
    colnames(all_counts)[1] <- "Ensembl.Gene.ID"
    rownames(all_counts) <- NULL
    #############################
    #Estimate Dispersion
    #############################
    d <- estimateCommonDisp(d)
    d <- estimateTagwiseDisp(d)
    #############################
    #Perform a test
    #############################
    de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N"))


    I believe that the variable "all_counts" should contain the normalized counts for each sample in each condition. My understanding is also that edgeR adds pseudocounts BEFORE performing the library normalisation, so it is possible for some values to revert to zero after normalisation, but I thought this would happen rarely. Yet in a recent dataset I find an improbably large number of zero values in "all_counts", which makes me think that my understanding of how pseudocounts and normalisation work in edgeR might be incorrect. Could somebody please comment on this?

  • #2
    Please don't cross-post on here and on the Bioconductor email list.



    • #3
      the counts reported by edgeR are not normalized

      This is the kind response by James MacDonald which I got in the Bioconductor list:



      In short, the scores reported in all_counts are not normalised.



      • #4
        dpryan, I see your point (and appreciate the help you have generously given on so many occasions) but the reason for cross-posting is that not everyone is following all the forums. In this case within a few hours I got help from the Bioconductor list and I was able to proceed with my work. But you never know how long this is going to take, or whether you will get a response at all. I have had questions that were never answered.

        What I try to do is to always crosspost the answers, too, so that people don't respond in vain and so that other people having the same issue can benefit, too.



        • #5
          Originally posted by feralBiologist
          dpryan, I see your point (and appreciate the help you have generously given on so many occasions) but the reason for cross-posting is that not everyone is following all the forums.
          We do ask users please not to post the same question to multiple forums simultaneously.

          In this case within a few hours I got help from the Bioconductor list and I was able to proceed with my work. But you never know how long this is going to take, or whether you will get a response at all. I have had questions that were never answered.
          All reasonable questions sent to the Bioconductor mailing list get an answer. A search suggests that you have posted three questions to the Bioconductor mailing list, and that I have answered all of them myself.

          The edgeR developers don't live in the same time zone as you and we can't answer everything within a few hours.

          What I try to do is to always crosspost the answers, too, so that people don't respond in vain and so that other people having the same issue can benefit, too.
          But your cross post of James MacDonald's answer isn't correct. The cpm values are of course normalized, they are just not "normalized counts".
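The distinction drawn here can be checked on toy data. The following is a minimal sketch, assuming edgeR is installed; `toy_counts` and its gene and sample names are made up for illustration and are not from the original script. To my understanding, calcNormFactors() leaves the count matrix untouched and only stores a scaling factor per sample in d$samples$norm.factors, which cpm() then uses.

```r
library(edgeR)

# Hypothetical toy data: 1000 genes x 4 samples of simulated counts
set.seed(1)
toy_counts <- matrix(rnbinom(4000, mu = 50, size = 5), ncol = 4,
                     dimnames = list(paste0("gene", 1:1000), paste0("s", 1:4)))

d_before <- DGEList(counts = toy_counts)
d_after  <- calcNormFactors(d_before)

# The raw counts are the same object before and after normalisation
counts_unchanged <- identical(d_before$counts, d_after$counts)
print(counts_unchanged)

# Only the per-sample normalisation factors change
print(d_before$samples$norm.factors)  # the defaults set by DGEList()
print(d_after$samples$norm.factors)   # the TMM factors
```

So the CPM values reflect the normalisation factors, while the counts stored in the DGEList stay as raw counts.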



          • #6
            A search suggests that you have posted three questions to the Bioconductor mailing list, and that I have answered all of them myself.
            You are right - and I once again thank you for this. I will not post edgeR questions to seqanswers anymore. In the past I have used seqanswers a lot more often than I have used bioconductor (and not just for edgeR) and not all of my questions have been answered. A quick search of seqanswers shows this. Maybe some of them were not precisely formulated - I don't know. But they made me think that help might not always come.
            But your cross post of James MacDonald's answer isn't correct.
            This is how I understood the answer of James. He says that counts are not affected by the normalization and I explained on the bioconductor thread that I understood "normalisation" to comprise all the transformations performed on the raw counts. Thanks to your kind reply in bioconductor I was reminded that in edgeR "normalisation" refers to multiple transformations and that not all of them are reflected in the cpm() output. I was about to post this clarification but you were faster than me.

            Once again, thank you for your assistance and for helping to create edgeR and the other analysis tools that I have used.



            • #7
              Originally posted by feralBiologist
              Thanks to your kind reply in bioconductor I was reminded that in edgeR "normalisation" refers to multiple transformations and that not all of them are reflected in the cpm() output.
              Well, the cpm values are fully normalized. The issue is rather that the cpm values produced by cpm() are just for descriptive purposes. They are not used by any of the core functions in edgeR which estimate parameters or evaluate differential expression.



              • #8
                Originally posted by Gordon Smyth
                Well, the cpm values are fully normalized. The issue is rather that the cpm values produced by cpm() are just for descriptive purposes. They are not used by any of the core functions in edgeR which estimate parameters or evaluate differential expression.
                Now I am confused again. And maybe I am not the only one, as the response by James MacDonald in the bioconductor thread indicates. I believe this confusion is due to the fact that "normalization" in edgeR seems to mean different things depending on the context. I might be a bit naive, but to me any transformation performed on the raw counts prior to computing differential expression can be described as "normalisation". This would include library size scaling, TMM and pseudocounts. You seemed to agree with James' response and he literally said "The counts are not affected by the normalization".

                Now you seem to say exactly the opposite. Can you, please, clarify?

                What I can say with certainty is that no pseudo-counts seem to have been added to the raw counts, otherwise I wouldn't have observed the zeros. What is not clear to me is whether both library scaling and TMM normalisation have been applied.
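The zero observation above can be reproduced on toy data. This is a minimal sketch, assuming edgeR is installed; `toy_counts` is hypothetical data, not the poster's. As far as I know, cpm() only adds a prior count when log=TRUE, so on the unlogged scale a raw count of zero stays exactly zero whatever the library size or TMM factor.

```r
library(edgeR)

# Hypothetical toy data with one count forced to zero
set.seed(1)
toy_counts <- matrix(rnbinom(4000, mu = 50, size = 5), ncol = 4,
                     dimnames = list(paste0("gene", 1:1000), paste0("s", 1:4)))
toy_counts["gene1", "s1"] <- 0L

d <- calcNormFactors(DGEList(counts = toy_counts))

# Unlogged CPM: no prior count is added, so zero stays zero
plain_cpm <- cpm(d)
print(plain_cpm["gene1", "s1"])

# Log CPM: prior.count is added before taking logs, so the value is finite
log_cpm <- cpm(d, log = TRUE, prior.count = 2)
print(log_cpm["gene1", "s1"])
```

Under that assumption, many zeros in the unlogged cpm() output just mean many zeros in the raw counts.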



                • #9
                  CPM isn't used to calculate differential expression, so it doesn't fit your definition of normalization (normalization is a generic term that doesn't really fit what you wrote). Nothing in Gordon's reply contradicts James' reply on the mailing list.



                  • #10
                    Originally posted by dpryan
                    CPM isn't used to calculate differential expression, so it doesn't fit your definition of normalization (normalization is a generic term that doesn't really fit what you wrote). Nothing in Gordon's reply contradicts James' reply on the mailing list.
                    Thanks for your response, but it still does not answer the question I asked. OK, let's drop "normalisation" as it is a confusing term. What I really wanted to know is: "How do you get from raw counts to cpm()'s output? What are the transformations/manipulations performed?"

                    One thing mentioned by Gordon Smyth is the library size scaling. Is this all? I had a look at the help info on cpm() - it does not explicitly mention anything else.
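One way to probe this question is to rebuild the cpm() output by hand and compare. This is a minimal sketch, assuming edgeR is installed and that cpm() on a DGEList divides each column by its effective library size (lib.size times the TMM norm.factors) and multiplies by one million; `toy_counts`, `eff_lib_size` and `manual_cpm` are names introduced here for illustration only.

```r
library(edgeR)

# Hypothetical toy data: 1000 genes x 4 samples of simulated counts
set.seed(1)
toy_counts <- matrix(rnbinom(4000, mu = 50, size = 5), ncol = 4,
                     dimnames = list(paste0("gene", 1:1000), paste0("s", 1:4)))

d <- calcNormFactors(DGEList(counts = toy_counts))

# Effective library size = raw library size x TMM normalisation factor
eff_lib_size <- d$samples$lib.size * d$samples$norm.factors

# Scale each column (sample) by its effective library size, per million reads
manual_cpm <- t(t(d$counts) / eff_lib_size) * 1e6

# Compare with edgeR's own output (default normalized.lib.sizes = TRUE)
same <- isTRUE(all.equal(manual_cpm, cpm(d), check.attributes = FALSE))
print(same)
```

If the comparison holds, then both library size scaling and the TMM factors are reflected in the unlogged cpm() values, and nothing else is applied.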
