Hello, I am using edgeR for the first time, relying on the edgeR user's guide and the Anders et al. (2013) protocol in Nature Protocols to help me along. I would like to fully understand the steps involved, but I need some additional explanation.
I want to determine which features have a significantly higher number of counts in some samples compared to others. Since I will never be able to establish a significantly higher number of counts for features that have very low counts across all samples, it is not useful to analyze these features. Therefore, I will remove them from the DGEList object by filtering.
In the edgeR user's guide this is done simply by:
> keep <- rowSums(cpm(d)>100) >= 2
> d <- d[keep,,keep.lib.sizes=FALSE]
This is straightforward and I believe I fully understand the code. But in Anders et al. there is an additional step involving the %in% operator:
> noint = rownames(counts) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
> cpms = cpm(counts)
> keep = rowSums(cpms > 1) >=2 & !noint
> counts = counts[keep,]
Does the vector noint contain all the row names of counts other than "no_feature", "ambiguous", etc.?
It seems necessary to me to remove the "__no_feature" etc. rows at the end of the htseq-count output files, but I do not understand how this code does that. It's way too clever for me!
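To make the question concrete, here is a minimal sketch of what the Anders et al. lines appear to do, run on a toy matrix. The count values, the gene names, and the names `special` and `filtered` are made up for illustration. The CPM is computed by hand in base R as a stand-in for `cpm()`, on the assumption that `cpm()` on a plain matrix divides each column by its total count.

```r
# Toy count matrix: three genes plus two htseq-count summary rows (made-up numbers).
counts <- matrix(
  c(500, 480,  10,  12,
      0,   1,   0,   0,
    300,  20, 310,  25,
    900, 950, 870, 910,
    100, 120,  90, 110),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB", "geneC", "no_feature", "ambiguous"),
    c("s1", "s2", "s3", "s4")
  )
)

# Names of the summary rows to exclude (same strings as in the protocol).
special <- c("no_feature", "ambiguous", "too_low_aQual",
             "not_aligned", "alignment_not_unique")

# %in% returns one logical per row name: TRUE where the name appears in `special`.
noint <- rownames(counts) %in% special
print(noint)
# FALSE FALSE FALSE  TRUE  TRUE

# Hand-rolled counts per million (assumption: library size = column total).
cpms <- t(t(counts) / colSums(counts)) * 1e6

# Keep rows with CPM > 1 in at least two samples AND that are not summary rows.
keep <- rowSums(cpms > 1) >= 2 & !noint
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
# "geneA" "geneC"
```

So in this sketch noint is TRUE for the summary rows, not for the genes, and it is the `!noint` in the next line that flips it so those rows are dropped. Note that %in% compares strings exactly: if the row names in the count table carry the "__" prefix (for example "__no_feature"), the strings in the lookup vector would need the same prefix to match.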