  • edgeR mean normalized reads counts for each gene

    Hi,
    I am trying to analyse my RNA-seq data using edgeR (4 conditions: VA, PA, BP, NP, with 3 replicates each). The “counts” matrix looks like the following in R:

    row.names     VA1  VA2  VA3  PA1  PA2  PA3  BP1  BP2  BP3  NP1  NP2  NP3
    mp090_00001     1   16   10    2    1    1    1    2    2    3    0    4
    mp090_00002   289  433  512  130  253  313  193  267  317  480  595  888
    mp090_00003   721  566  702  220  360  474  299  512  647 2147 1769 2601
    mp090_00004   921  716  724   41   56   77   62   84   97  588  528 1175
    mp090_00005     1    2    2    1    1    1    0    1    0    1    0    0



    As a first step, I would like to calculate normalized read counts for each gene in each of the 12 (4×3) replicates, relative to the total number of mapped reads in that replicate. Then I am thinking of taking the mean over the three replicates, so that I would end up with normalized read counts per condition for each gene, instead of the raw read counts currently in the “counts” matrix above.
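    Just to illustrate what I mean in plain R (this is only a rough sketch of the idea; the per-million scaling and the grouping vector are my own assumption, and counts is the numeric count matrix above with gene IDs as row names):

    Code:
    lib.sizes <- colSums(counts)                       # total mapped reads in each replicate
    rpm <- sweep(counts, 2, lib.sizes, "/") * 1e6      # scale each replicate to reads per million
    group <- rep(c("VA", "PA", "BP", "NP"), each = 3)  # same column order as the matrix above
    cond.means <- sapply(unique(group), function(g)
        rowMeans(rpm[, group == g, drop = FALSE]))     # mean normalized count per gene per condition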
    To do that, I ran the following code in R:

    > library(edgeR)
    > group <- c(rep("VA",3), rep("PA",3), rep("BP",3), rep("NP",3))
    > d <- DGEList(counts, group=group, lib.size=colSums(counts))
    > dnorm <- calcNormFactors(d)
    > dnorm$samples
    group lib.size norm.factors
    VA1 VA 6974289 0.8485103
    VA2 VA 6976361 0.8982784
    VA3 VA 7817591 0.8927855
    PA1 PA 1687274 1.2081254
    PA2 PA 2639866 1.1679706
    PA3 PA 3453869 1.1371379
    BP1 BP 2578638 1.0063343
    BP2 BP 3384353 1.0768956
    BP3 BP 4056560 1.0409395
    NP1 NP 9047082 0.9355826
    NP2 NP 7851071 0.9358234
    NP3 NP 12021020 0.9272787

    > dnorm$samples$lib.size*dnorm$samples$norm.factors # calculate the effective library sizes
    [1] 5917756 6266714 6979432 2038439 3083286 3927525 2594972 3644595 4222634 8464293 7347216 11146836
    > lin.norm.factors <- (sum(d$samples$lib.size)/nrow(d$samples)) / d$samples$lib.size # mean library size divided by each library size
    > lin.norm.factors
    [1] 0.8183388 0.8180957 0.7300626 3.3825752 2.1619776 1.6524458 2.2133123 1.6863877 1.4069387 0.6308477 0.7269494 0.4747793

    I don’t really know whether the norm.factors I am looking for are the ones shown above, or what exactly the difference is between them and the lin.norm.factors calculated at the end. I am also wondering how exactly I should use these normalization factors in order to obtain the normalized read counts per gene per condition. I found that one option is to do the following, but I don’t understand why you have to go through this:

    f <- calcNormFactors(counts)          # per-library scaling factors (TMM)
    f <- f/exp(mean(log(f)))              # rescale the factors to have geometric mean 1
    d <- DGEList(counts, group=group, lib.size=colSums(counts) * f)  # effective library sizes

    or to calculate d$pseudo.alt and save the relevant table, but that does not work in my case.
    Any help on what exactly I have to do would be greatly appreciated, as I am new to edgeR and am struggling to find my way out!
    Thanks in advance
    Last edited by antoza; 02-10-2014, 08:31 AM.

  • #2
    It's likely easier to just use cpm():

    Code:
    library(edgeR)
    group <- c(rep("VA",3),rep("PA",3),rep("BP",3), rep("NP",3))
    d <- DGEList(counts, group=group)
    d <- calcNormFactors(d)
    norm.counts <- cpm(d)  # cpm() already returns counts per million, so no extra scaling is needed
    Or something close to that.
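    If you then want a single normalized value per gene per condition (the mean over the three replicates), something along these lines should work (a quick sketch, assuming the column order above):

    Code:
    cond.means <- sapply(unique(group), function(g)
        rowMeans(norm.counts[, group == g, drop = FALSE]))
    head(cond.means)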



    • #3
      Thanks dpryan for your reply,

      I have used your suggestion and ran the code below without facing any problems:

      > library(edgeR)
      > group <- c(rep("VA",3), rep("PA",3), rep("BP",3), rep("NP",3))
      > d <- DGEList(counts=countsTable, group=group, lib.size=colSums(countsTable))
      > f <- calcNormFactors(countsTable)
      > f <- f/exp(mean(log(f)))
      > d <- DGEList(counts=countsTable, group=group, lib.size=colSums(countsTable) * f)
      > pair <- c("VA", "PA")
      > d1 <- estimateCommonDisp(d)
      > f
      [1] 0.8485103 0.8982784 0.8927855 1.2081254 1.1679706 1.1371379 1.0063343 1.0768956 1.0409395 0.9355826 0.9358234 0.9272787 # I guess these are the normalization factors, one for each of my 12 samples (right?)
      > norm.counts <- cpm(d)
      #########################
      However, I still have the following inquiries:
      1. When I look at the components of the object returned by estimateCommonDisp (d1), among others there are the $pseudo.counts for all genes across the 12 samples (what exactly do these numbers mean, and how do they differ from the normalized counts obtained with your suggestion?). There is also a $pseudo.lib.size ([1] 4867506) that I don’t quite understand. I also found that the following code
      cpm <- 1e06*t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors))
      might give similar results to the syntax you gave me. I am not quite sure whether the output of one or the other would be the normalised counts per gene, or whether that is what the pseudo.counts mentioned above are.

      2. I have also found that you can calculate normalized counts by running the following code, which gave results different from both the pseudo.counts and the norm.counts (cpm) results. I am quite confused about the difference between these per-gene, per-sample count matrices. Could you please clarify this further?
      > scale <- d$samples$lib.size*d$samples$norm.factors
      > normCounts <- round(t(t(countsTable)/scale)*mean(scale))

      3. Finally, my common dispersion is 0.05073274. Does that mean that, in general, I have a low coefficient of variation among my 3 replicates for all 4 conditions? (I am not fully sure about the biological meaning of the common dispersion.)
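      If I understand the edgeR documentation correctly, the biological coefficient of variation is the square root of the dispersion, so in my case (assuming d1$common.dispersion is the right slot to look at):

      Code:
      sqrt(d1$common.dispersion)  # sqrt(0.05073274) is about 0.225, i.e. a BCV of roughly 22-23%

      but please correct me if I am reading this wrong.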

      Thank you so much……
