**bruce01** · 04-28-2015, 02:23 AM

Here is a good overview of what %in% does.

noint is a logical vector with one value per row name: TRUE where rownames(counts) matches one of c("no_feature"...), FALSE otherwise. The '!' (not) then flips it, so those rows are dropped when defining 'keep'.
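As a small illustration of that point, the sketch below uses made-up row names (`rn`, the gene names and the `special` vector are hypothetical, not taken from any real count table). It shows that `%in%` returns one TRUE/FALSE per element of its left-hand side, and that `!` negates that vector before it is used for subsetting.

```r
# Hypothetical row names: two genes plus two special counter rows
rn <- c("geneA", "geneB", "no_feature", "ambiguous")
special <- c("no_feature", "ambiguous", "not_aligned")

# One logical value per element of rn: TRUE where the name appears in 'special'
noint <- rn %in% special
print(noint)

# '!' negates the vector, so indexing with it keeps only the genes
kept <- rn[!noint]
print(kept)
```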

There are other ways to do this, for example I use ENSEMBL annotations and so can just grep all lines with 'ENS' in the shell before I read counts into R.

Hope that helps.

**elizabeth000** · 04-28-2015, 03:29 AM

Thank you, that is what I thought it was supposed to do!
So if I understand properly, noint contains FALSE for every row name that doesn't include "no_feature", "ambiguous", etc., and TRUE for every row name that does.

But it doesn't give the expected output. I ran the following lines:
data = readDGE(listfiles)
noint = rownames(data) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
cpmd = cpm(data)
keep = rowSums(cpmd > 1) >=2 & !noint
data = data[keep,]
data$samples$lib.size = colSums(data$counts)

The number of rows of data$counts was reduced from 28031 to 18064, but the no_feature etc rows are still present:
> tail(rownames(data$counts))
[1] "CGI_10028935" "CGI_10028939" "__no_feature" "__ambiguous"
[5] "__too_low_aQual" "__not_aligned"

I cannot find my error...

**bruce01** · 04-28-2015, 03:43 AM

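The problem is that your vector 'c("no_feature"...)' does not contain "__no_feature" etc. %in% only matches whole strings exactly, so none of your row names matched it. If you put the names with the leading double underscore in that vector, those rows will be removed as well.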
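To make the mismatch concrete, here is a minimal sketch with a few row names like the ones in the output above (the vectors `rn`, `wrong` and `right` are illustrative only). The last lines show one possible alternative, which assumes that every special row, and no gene ID, starts with a double underscore.

```r
rn <- c("CGI_10028935", "CGI_10028939", "__no_feature", "__ambiguous")

wrong <- c("no_feature", "ambiguous")      # names without the prefix
right <- c("__no_feature", "__ambiguous")  # names as they appear in rn

# %in% compares whole strings, so the unprefixed names match nothing
n_wrong <- sum(rn %in% wrong)
n_right <- sum(rn %in% right)
print(c(wrong = n_wrong, right = n_right))

# Alternative sketch (assumption: only special rows start with "__"):
# flag rows by prefix instead of listing every name
by_prefix <- grepl("^__", rn)
genes <- rn[!by_prefix]
print(genes)
```

The prefix approach avoids having to list every counter name, but it would also drop any real feature whose ID happens to start with "__", so it is worth checking the row names first.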

**elizabeth000** · 04-28-2015, 03:56 AM

Yes, I just noticed this and fixed the bug myself! Obviously the string has to match exactly...
Like a fool I was using the exact syntax from the Nature Protocols paper, which surprisingly does not seem to be correct. The code that works for me is:

Code:

data = readDGE(listfiles)
noint = rownames(data) %in% c("__no_feature","__ambiguous","__too_low_aQual","__not_aligned","__alignment_not_unique")
cpmd = cpm(data)
keep = rowSums(cpmd > 1) >=2 & !noint
data = data[keep,]
data$samples$lib.size = colSums(data$counts)

> table(noint)
noint
FALSE  TRUE
28026     5

> tail(rownames(data$counts))
[1] "CGI_10028931" "CGI_10028932" "CGI_10028933" "CGI_10028934" "CGI_10028935"
[6] "CGI_10028939"

Also I noticed in the Nature Protocols paper there is no mention of recomputing library sizes, although this is always done in the examples from the edgeR user's guide. Can anyone think of a reason that the library sizes should NOT be recomputed after filtering? I just want to check... Thanks a lot!
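For reference, this is roughly what the recomputation does, sketched in base R on a made-up count matrix (the matrix, its row names and the sample names `s1`/`s2` are hypothetical, and a plain matrix stands in for the DGEList). Once the special row is dropped, the column sums change, which is what resetting `data$samples$lib.size` captures.

```r
# Hypothetical count matrix: 3 genes and one special counter row, 2 samples
counts <- matrix(
  c( 10,  20,
      0,   5,
     30,  40,
    100, 200),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3", "__no_feature"), c("s1", "s2"))
)

# Library sizes before filtering include the special row
lib_before <- colSums(counts)

# Drop the special row, keeping the matrix shape
keep <- !(rownames(counts) %in% "__no_feature")
filtered <- counts[keep, , drop = FALSE]

# Library sizes recomputed from the rows that remain
lib_after <- colSums(filtered)
print(rbind(lib_before, lib_after))
```

On a real DGEList, recent edgeR versions can apparently do the subsetting and the recomputation in one step with `data[keep, , keep.lib.sizes=FALSE]`. Whether that argument is available depends on the installed version, so treat it as something to check rather than a given.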

Understanding edgeR protocol from Anders et al 2013
