  • edgeR: Normalization does not seem to work

    Hi folks,

    I have a problem with edgeR, and I hope someone here can help me with it.
    I'm trying to compare 11 samples (4 conditions, 3 replicates each, except one condition where a sample failed).
    The sequencing depth differs hugely (in one condition only 2,000-3,000 reads map*, in another condition 120,000-550,000 reads).
    I performed the DE analysis with edgeR, and there are indeed differentially expressed genes. But on closer inspection, the DE genes are always upregulated in the samples that have the higher sequencing depth/more reads.
    So it seems the normalization doesn't work.
    One of my colleagues took a look at the smear plot (attached) and immediately said "this is wrong". If I look at how it should look, e.g. the figure from the TMM paper http://genomebiology.com/2010/11/3/r25 , then I'd also say "yes, this is wrong".
    If I look at the pseudocounts, I'd say that the normalization didn't work properly, because the counts are still extremely different between the conditions.

    I've also checked again using only the counts from the 2 conditions with the most reads (#1: 57,770 / 19,668 / 82,123; #2: 124,571 / 554,628), under the assumption that this would change the result of the normalization, but the end result doesn't change considerably.
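    For clarity, the kind of check I mean looks roughly like this (a sketch using the DGEList d from the script below; cpm() is just one way to look at the normalized values, not the pseudo-counts edgeR stores internally):
    Code:
    # Inspect what calcNormFactors actually did
    d$samples                                      # lib.size and norm.factors per sample
    d$samples$lib.size * d$samples$norm.factors    # effective library sizes
    head(cpm(d, normalized.lib.sizes=TRUE))        # normalized counts per million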



    Any clue what could have gone wrong there?
    I'm especially worried because I've already worked with edgeR on another dataset, which is less uneven but also shows some bias in the distribution of DE genes (not as extreme, but now I don't know whether that's biology or a technical problem).


    Background:

    Used R commands, as taken from the tutorial:
    Code:
    x <- read.delim("input.csv", row.names="Contig")       # raw count table, contigs as rows
    group <- c(1,1,1,2,2,2,3,3,4,4,4)                      # 4 conditions, one with a failed replicate
    library("edgeR")
    d <- DGEList(counts=x, group=group)
    d$samples$lib.size <- colSums(d$counts)
    d <- calcNormFactors(d)                                # TMM normalization factors
    d <- estimateCommonDisp(d, verbose=TRUE)
    d <- estimateTagwiseDisp(d, trend="none")
    result21 <- exactTest(d, pair=c(2,1))                  # condition 2 vs condition 1
    plotSmear(d, pair=c(2,1), de.tags=rownames(topTags(result21, n=min(dim(result21$table[result21$table[,3]<0.05,])[1], 500))$table), pch=19, cex=1, lowess=TRUE, smooth.scatter=TRUE)
    - Why is the coverage so uneven and so low?
    It's a meta-transcriptomics sample, and apparently 99.9% of the sequencing went to the host, because the samples were mainly scraped off the host tissue.

    - What did I do to get the counts?
    First we did a de novo cross-assembly, but I was told afterwards that these were gnotobiotic animals with 4 different known bacteria. The assembly confirmed this, but for easier handling we mapped the data back to the original genomes (concatenated together, then just ran bowtie2 with default parameters, and then obtained the read counts with subread). Unmapped reads are being assembled right now, to catch any biological differences between the bacteria in the samples and those in the database, but that shouldn't affect the current data massively.
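    For completeness, the counting step looks roughly like this (a sketch only, assuming the subread step was Rsubread's featureCounts; the BAM directory and the annotation file are placeholders):
    Code:
    # Count reads per gene from the bowtie2 BAMs against the concatenated genomes
    # (file names are placeholders, and the GTF is a hypothetical annotation).
    library(Rsubread)
    bams <- list.files("bowtie2_out", pattern="\\.bam$", full.names=TRUE)
    fc <- featureCounts(files=bams,
                        annot.ext="concatenated_genomes.gtf",
                        isGTFAnnotationFile=TRUE,
                        GTF.featureType="CDS",
                        GTF.attrType="gene_id")
    counts <- fc$counts    # gene-by-sample matrix, written out as input.csv for edgeR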

    *I'm aware that the statistics will either suck or will not be existent for this.


    I'm really worried at the moment, any idea would help.
    Thanks!
    Last edited by bastianwur; 03-20-2015, 05:58 AM.

  • #2
    Your data sounds pretty crazy and it is quite possible that TMM is not a strong enough normalization. You might try quantile, but it is hard to imagine any normalization that will robustly handle systematic imbalances in sequencing depth as strong as in your data.

    However, your conclusion that the DE genes shouldn't be mostly in one direction is not correct. It is perfectly possible for almost all the DE to be in one direction even when normalization has worked perfectly; in fact it is one of the aims of normalization to allow this.
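    If it helps, one way to try quantile normalization is via limma-voom rather than edgeR's exact test (a sketch only; the design and the contrast below are assumptions based on your four groups):
    Code:
    # Sketch: quantile normalization via limma-voom, as an alternative to TMM.
    library(edgeR)
    library(limma)
    group <- factor(c(1,1,1,2,2,2,3,3,4,4,4))
    design <- model.matrix(~0 + group)
    d <- DGEList(counts=x, group=group)        # x = the same count matrix as before
    v <- voom(d, design, normalize.method="quantile")
    fit <- lmFit(v, design)
    fit <- eBayes(contrasts.fit(fit, makeContrasts(group2 - group1, levels=design)))
    topTable(fit)                              # e.g. condition 2 vs condition 1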



    • #3
      Okay, thanks, I'll give it a try.

      Originally posted by Gordon Smyth
      However, your conclusion that the DE genes shouldn't be mostly in one direction is not correct. It is perfectly possible for almost all the DE to be in one direction even when normalization has worked perfectly; in fact it is one of the aims of normalization to allow this.
      Yeah, I'm aware that this could be the case: the effect in the other direction may just be distributed over more genes and therefore never reach statistical significance. But I don't believe that for our data; we don't expect it, it wouldn't make any sense, and the fact that the regulation follows the sequencing depth so exactly makes it very suspicious.

      But as said, I'm going to try it, thanks.



      • #4
        Okay... it doesn't seem to work. That was sort of expected :/

        Any idea what I could do with this data?
        The data itself is:
        - 3 sampling sites in a gut, from the tissue
        - 1 sampling site in the gut, from the content
        (all in triplicate, except one where a sample failed... plus one more sampling site from the content, in triplicate, which had to be discarded due to DNA contamination)

        2 of the sampling sites from the tissue have the mentioned extremely low coverage, so I'd rather not do anything with them.
        The 2 remaining sampling sites (1 tissue, 1 content) are not from the same location (e.g. one was upper gut, the other lower gut), so you cannot really compare them either.
        We also cannot take a look at which genes are present, because the animals have been inoculated with a known mix of bacteria, so there's nothing to gain from that.
        I'm still waiting for the assembly of the reads, which didn't map to the genomes, to see the genetic differences, but I'm not sure if that'll be meaningful.

        If anyone has an idea what to try here, it'd be really appreciated.



        • #5
          In this paper,
          "Anders, S., McCarthy, D. J., Chen, Y., Okoniewski, M., Smyth, G. K., Huber, W., & Robinson, M. D. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature protocols, 8(9), 1765-1786."
          they mention that "In edgeR, it is recommended to remove features without at least 1 read per million in n of the samples, where n is the size of the smallest group of replicates."

          However, since your sequencing depths are highly different from each other, maybe you can try removing features separately for each of the 6 pairwise comparisons.
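          For example, something along these lines (a sketch; here n is taken as 2, the size of the smallest surviving group in this dataset, which is my assumption):
          Code:
          # Keep only features with >= 1 count per million in at least n samples,
          # where n is the size of the smallest group of replicates (assumed 2 here).
          n <- 2
          keep <- rowSums(cpm(d) >= 1) >= n
          d <- d[keep, , keep.lib.sizes=FALSE]
          d <- calcNormFactors(d)    # recompute normalization factors after filtering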



          • #6
            Thanks, that helped a bit.

            I've talked to the lab person, and we actually now went with only comparing the 2 better groups (which don't fit together).
            The organism distribution is such that in one group you have 3/4 organism A and 1/4 organism B, whereas in the second group it's the other way around.
            No surprise: in both cases the DE result was that the organism with more reads has the upregulated genes.
            Then I thought again about what I had discussed earlier with a colleague: with normalization you try to make the RNA output of a cell (or at least of a population) comparable. That is not the same as making the RNA output of the ecosystem comparable, since the proportions of the different players can vary within the ecosystem. While we cannot do anything about this in our other meta-omics data (because we cannot fully define the organisms), here I know which bacteria I have -> so I separated the expression per organism and then did the normalization and DE on that.
            That gives some really crappy output (as expected; in one case the FDR needs to be 0.1, otherwise I don't get any results), but I now at least get up- and downregulation per organism, and they make a bit of sense (e.g. biofilm formation at a surface).
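            Roughly what that looks like (a sketch; the vector "organism", mapping each contig to one of the known bacteria, is hypothetical):
            Code:
            # Split the count matrix per organism, then normalize and run the exact
            # test within each organism separately. 'organism' is a hypothetical
            # vector assigning each contig to one of the known bacteria.
            results <- lapply(split(rownames(x), organism), function(ids) {
              di <- DGEList(counts=x[ids, ], group=group)
              di <- calcNormFactors(di)
              di <- estimateCommonDisp(di)
              di <- estimateTagwiseDisp(di)
              exactTest(di, pair=c(2,1))
            })
            topTags(results[[1]])    # DE table for the first organism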
            That'll probably be good enough to fill the poster, and that's about it; afterwards we'll throw the data away.

            Thanks for the help, everyone.
