SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BioT 2013 SWP Bioinformatics 0 09-05-2013 01:39 PM
Sfaf 2013 Genohub The Pipeline 0 05-30-2013 12:56 PM
'Spring School 2013 - Epigenetics of Civilization Diseases', May 27th - 31st 2013 ecSeq Bioinformatics Events / Conferences 1 01-31-2013 02:17 PM
Anyone has protocol for Ion PES protocol? marcowanger Ion Torrent 1 01-18-2012 09:28 PM
illumina alternative v1.5 protocol for small rna seq vs. the standard protocol ik76 Sample Prep / Library Generation 1 03-25-2010 03:24 PM

Reply
 
Thread Tools
Old 04-28-2015, 02:09 AM   #1
elizabeth000
Member
 
Location: France

Join Date: Apr 2015
Posts: 21
Default Understanding edgeR protocol from Anders et al 2013

Hello, I am using edgeR for the first time, and relying on the edgeR user's guide and Anders et al 2013 protocol in Nature Protocols to help me along. I would like to fully understand the steps involved but I need some additional explanation.

I want to determine which features have a significantly higher number of counts in some samples compared to others. Since I will never be able to establish a significantly higher number of counts for features that have very low counts across all samples, it is not useful to analyze these features. Therefore, I will remove them from the DGEList object by filtering.

In the edgeR user's guide this is done simply by:
> keep <- rowSums(cpm(d)>100) >= 2
> d <- d[keep,,keep.lib.sizes=FALSE]

This is straight-forward and I believe I fully understand the code. But in Anders et al there is an additional step involving the %in% operator:
> noint = rownames(counts) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
> cpms = cpm(counts)
> keep = rowSums(cpms > 1) >=2 & !noint
> counts = counts[keep,]

Does the vector noint contain all the rownames of counts with data other than "no_feature" and "ambigous" etc?
It seems necessary to me to remove the "__no_feature" etc data at the end of the htseq-count output files, but I do not understand how this code does that. It's way too clever for me!
elizabeth000 is offline   Reply With Quote
Old 04-28-2015, 03:23 AM   #2
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Here is a good overview of what %in% does.

noint actually contains only those c("no_feature"...) that are found in rownames(counts), then uses a '!' (not) statement to remove them when defining 'keep'.

There are other ways to do this, for example I use ENSEMBL annotations and so can just grep all lines with 'ENS' in the shell before I read counts into R.

Hope that helps.
bruce01 is offline   Reply With Quote
Old 04-28-2015, 04:29 AM   #3
elizabeth000
Member
 
Location: France

Join Date: Apr 2015
Posts: 21
Default

Thank you, that is what I thought it was supposed to do!
So if I understand properly, noint contains a FALSE value for every rowname that doesn't include any "no_feature" or "ambiguous" etc and contains a TRUE value for every rowname that includes "no_feature" or "ambiguous" etc.

But it doesn't give the expected output. I ran the following lines:
data = readDGE(listfiles)
noint = rownames(data) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
cpmd = cpm(data)
keep = rowSums(cpmd > 1) >=2 & !noint
data = data[keep,]
data$samples$lib.size = colSums(data$counts)

The number of rows of data$counts was reduced from 28031 to 18064, but the no_feature etc rows are still present:
> tail(rownames(data$counts))
[1] "CGI_10028935" "CGI_10028939" "__no_feature" "__ambiguous"
[5] "__too_low_aQual" "__not_aligned"

I cannot find my error...
elizabeth000 is offline   Reply With Quote
Old 04-28-2015, 04:43 AM   #4
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

The problem is you are using the vector 'c("no_feature"...)', which does not contain "__no_feature" etc. If you add them to the previous vector then they will also be removed.
bruce01 is offline   Reply With Quote
Old 04-28-2015, 04:56 AM   #5
elizabeth000
Member
 
Location: France

Join Date: Apr 2015
Posts: 21
Default

Yes, I just noticed this and fixed the bug myself! Obviously the string has to match exactly...
Like a fool I was using the exact syntax from the Nature Protocols paper, which surprisingly does not seem to be correct. The code that works for me is:

Code:
data = readDGE(listfiles)
noint = rownames(data) %in% c("__no_feature","__ambiguous","__too_low_aQual","__not_aligned","__alignment_not_unique")
cpmd = cpm(data)
keep = rowSums(cpmd > 1) >=2 & !noint
data = data[keep,]
data$samples$lib.size = colSums(data$counts)
> table(noint)
noint
FALSE TRUE
28026 5

> tail(rownames(data$counts))
[1] "CGI_10028931" "CGI_10028932" "CGI_10028933" "CGI_10028934" "CGI_10028935"
[6] "CGI_10028939"

Also I noticed in the Nature Protocols paper there is no mention of recomputing library sizes, although this is always done in the examples from the edgeR user's guide. Can anyone think of a reason that the library sizes should NOT be recomputed after filtering? I just want to check... Thanks a lot!
elizabeth000 is offline   Reply With Quote
Reply

Tags
bioconductor, dgelist, edger, htseq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:29 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO