Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Understanding edgeR protocol from Anders et al 2013

    Hello, I am using edgeR for the first time, and relying on the edgeR user's guide and Anders et al 2013 protocol in Nature Protocols to help me along. I would like to fully understand the steps involved but I need some additional explanation.

    I want to determine which features have a significantly higher number of counts in some samples compared to others. Since I will never be able to establish a significantly higher number of counts for features that have very low counts across all samples, it is not useful to analyze these features. Therefore, I will remove them from the DGEList object by filtering.

    In the edgeR user's guide this is done simply by:
    > keep <- rowSums(cpm(d)>100) >= 2
    > d <- d[keep,,keep.lib.sizes=FALSE]

    This is straight-forward and I believe I fully understand the code. But in Anders et al there is an additional step involving the %in% operator:
    > noint = rownames(counts) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
    > cpms = cpm(counts)
    > keep = rowSums(cpms > 1) >=2 & !noint
    > counts = counts[keep,]

    Does the vector noint contain all the rownames of counts with data other than "no_feature" and "ambigous" etc?
    It seems necessary to me to remove the "__no_feature" etc data at the end of the htseq-count output files, but I do not understand how this code does that. It's way too clever for me!

  • #2
    Here is a good overview of what %in% does.

    noint actually contains only those c("no_feature"...) that are found in rownames(counts), then uses a '!' (not) statement to remove them when defining 'keep'.

    There are other ways to do this, for example I use ENSEMBL annotations and so can just grep all lines with 'ENS' in the shell before I read counts into R.

    Hope that helps.

    Comment


    • #3
      Thank you, that is what I thought it was supposed to do!
      So if I understand properly, noint contains a FALSE value for every rowname that doesn't include any "no_feature" or "ambiguous" etc and contains a TRUE value for every rowname that includes "no_feature" or "ambiguous" etc.

      But it doesn't give the expected output. I ran the following lines:
      data = readDGE(listfiles)
      noint = rownames(data) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
      cpmd = cpm(data)
      keep = rowSums(cpmd > 1) >=2 & !noint
      data = data[keep,]
      data$samples$lib.size = colSums(data$counts)

      The number of rows of data$counts was reduced from 28031 to 18064, but the no_feature etc rows are still present:
      > tail(rownames(data$counts))
      [1] "CGI_10028935" "CGI_10028939" "__no_feature" "__ambiguous"
      [5] "__too_low_aQual" "__not_aligned"

      I cannot find my error...

      Comment


      • #4
        The problem is you are using the vector 'c("no_feature"...)', which does not contain "__no_feature" etc. If you add them to the previous vector then they will also be removed.

        Comment


        • #5
          Yes, I just noticed this and fixed the bug myself! Obviously the string has to match exactly...
          Like a fool I was using the exact syntax from the Nature Protocols paper, which surprisingly does not seem to be correct. The code that works for me is:

          Code:
          data = readDGE(listfiles)
          noint = rownames(data) %in% c("__no_feature","__ambiguous","__too_low_aQual","__not_aligned","__alignment_not_unique")
          cpmd = cpm(data)
          keep = rowSums(cpmd > 1) >=2 & !noint
          data = data[keep,]
          data$samples$lib.size = colSums(data$counts)
          > table(noint)
          noint
          FALSE TRUE
          28026 5

          > tail(rownames(data$counts))
          [1] "CGI_10028931" "CGI_10028932" "CGI_10028933" "CGI_10028934" "CGI_10028935"
          [6] "CGI_10028939"

          Also I noticed in the Nature Protocols paper there is no mention of recomputing library sizes, although this is always done in the examples from the edgeR user's guide. Can anyone think of a reason that the library sizes should NOT be recomputed after filtering? I just want to check... Thanks a lot!

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          8 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          8 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          49 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          67 views
          0 likes
          Last Post seqadmin  
          Working...
          X