Data filters at what stage of NGS data analysis?

anandksrao

Junior Member

Join Date: Jun 2011
Posts: 9

10-08-2012, 10:27 AM

Greetings friends!

I am seeking help with my data set: 3 time points x 3 genotypes x 3 replicates each = 27 libraries.

The goal is to find genes whose expression profiles over time differ between two or more genotypes.

After our first round of data analysis (including TMM normalization), the time-course graphs and box plots were so noisy, with a high standard error at each time point, that it was hard to say whether the expression profile of one genotype overlapped with or was distinct from those of the other genotypes. The R code is attached at the bottom of this post.

So, in short, we now need to apply data filters to check and reduce the noise in our data. Some ideas are:
removing genes that have low expression (count) levels
removing genes that have high variance across replicates
removing genes that have low variance across time (constitutively expressed genes are biologically less interesting)
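
To make the three ideas concrete, here is a minimal base-R sketch on simulated counts. The toy matrix, the `min.cpm` threshold, the "at least 3 libraries" rule and all variable names are assumptions chosen for illustration, not recommended settings.

```r
# Toy data: 6 genes x 27 libraries (3 genotypes x 3 time points x 3 replicates).
set.seed(1)
counts <- matrix(rnbinom(6 * 27, mu = 50, size = 5), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), NULL))
counts[1, ] <- 0L                     # a gene that is never expressed
cond <- factor(rep(1:9, each = 3))    # genotype x time condition of each library

# Counts per million, so that libraries of different depth are comparable.
cpm <- t(t(counts) / colSums(counts)) * 1e6
logcpm <- log2(cpm + 1)

# Idea 1: low expression. Keep genes above min.cpm in at least 3 libraries
# (3 = number of replicates per condition; both values are assumptions).
min.cpm <- 1
keep.expr <- rowSums(cpm > min.cpm) >= 3

# Idea 2: variance across replicates (mean within-condition variance).
within.var <- apply(logcpm, 1, function(x) mean(tapply(x, cond, var)))

# Idea 3: variance across conditions (variance of the condition means).
between.var <- apply(logcpm, 1, function(x) var(tapply(x, cond, mean)))

# Only rows are dropped; the remaining counts are still raw integers.
f.counts <- counts[keep.expr, , drop = FALSE]
```

In this sketch only the first idea is actually applied; `within.var` and `between.var` are computed so that their distributions can be inspected before any cutoff is chosen.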

So my question is: at what stage of my analysis should I apply these filters?
On the raw data itself, prior to normalization?
Or should I perform the TMM normalization, use the normalization factors to transform my data into non-integer normalized counts, and then filter (in which case I think I can no longer fit them with a negative binomial model, right)?
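
One point bearing on the second option: in edgeR, calcNormFactors does not change the counts themselves. The TMM factors rescale the library sizes, and the resulting effective library sizes enter the model as offsets, so the raw integer counts can still be fit with the negative binomial model. The base-R sketch below illustrates the arithmetic only; the toy counts, the `norm.factors` values (stand-ins for real TMM factors) and the filter thresholds are all assumptions.

```r
# Toy raw counts: 3 genes x 4 libraries (all values are made up).
counts <- matrix(c(10L, 200L, 35L,
                   12L, 180L, 40L,
                   25L, 410L, 70L,
                   22L, 390L, 81L), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("lib", 1:4)))

lib.size <- colSums(counts)
norm.factors <- c(0.95, 1.05, 1.02, 0.98)   # hypothetical stand-ins for TMM factors

# Effective library size = library size x normalization factor.
eff.lib.size <- lib.size * norm.factors

# Normalized CPM is non-integer; use it for filtering and plotting only.
norm.cpm <- t(t(counts) / eff.lib.size) * 1e6
keep <- rowSums(norm.cpm > 1) >= 2          # assumed thresholds

# What goes into the model: raw integer counts plus a per-library offset.
model.counts <- counts[keep, , drop = FALSE]
model.offset <- log(eff.lib.size)
```

The filter decision is made on the normalized scale, but only rows are removed, so `model.counts` still holds integers.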

Code:

count = read.table("Input.txt", sep="\t", header=TRUE)
# read in the raw mapped read counts

f.count = count[apply(count[,-c(1,ncol(count))],1,sum) > 27,]
# filter out genes with total read count <= 27 across all libraries

f.dat = f.count[,-c(1,ncol(count))]
# select only the read counts (drop the first and last columns of the data frame)

S = factor(rep(c("gen1","gen2","gen3"),rep(9,3)))
# define genotype

Time = factor(rep(rep(c("0","10","20"),rep(3,3)),3))
# define time

Time.rep = rep(1:3,9)
# define replicate

Group = paste(S,Time,Time.rep,sep="_")
# define genotype_time_replicate label (not used below)

library(edgeR)
# load edgeR package

count.d = DGEList(counts = as.matrix(f.dat))
# build the edgeR data object with the DGEList() constructor; library sizes default to the column sums

count.d$samples$S = S
count.d$samples$Time = Time
# record genotype and time for each library

design = model.matrix(~ Time + S)
# design matrix: additive effects of time and genotype (no genotype-by-time interaction)

count.d = calcNormFactors(count.d)
# TMM normalization: computes norm.factors, leaves the counts unchanged

glmfit.d = glmFit(count.d, design, dispersion = 0.1)
# fit the negative binomial generalized linear models with a fixed dispersion of 0.1

lrt.count = glmLRT(glmfit.d, coef = ncol(design))
# likelihood ratio test; current edgeR takes the fit object first, and this tests the last design coefficient (Sgen3)

result.count = data.frame(f.count, lrt.count$table)
# combine raw data and results from edgeR

result.count$FDR = p.adjust(result.count$PValue,method="BH")
# calculate the false discovery rate (the p-value column is named PValue in current edgeR)

write.table(result.count, "edgeR.Medicago_count_WT_Mu3.txt",sep="\t",row.names=FALSE)
# save the combined data set
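
A note on the design used above: `~ Time + S` is additive, so it assumes the same time profile in every genotype. For the stated goal of finding genes whose time profiles differ between genotypes, a design with a genotype-by-time interaction may be closer to what is needed. The following is only a sketch of how such a design matrix could be built; `design.add`, `design.int` and `int.coef` are illustrative names, not part of the original analysis.

```r
S <- factor(rep(c("gen1", "gen2", "gen3"), each = 9))
Time <- factor(rep(rep(c("0", "10", "20"), each = 3), 3))

# Additive model: one shared time profile, shifted per genotype.
design.add <- model.matrix(~ Time + S)

# Interaction model: adds genotype-specific time effects.
design.int <- model.matrix(~ S * Time)

# Columns holding the genotype-by-time interaction terms.
int.coef <- grep(":", colnames(design.int))
colnames(design.int)[int.coef]
```

With edgeR, a fit on `design.int` could then be tested for all interaction terms jointly by passing `int.coef` as the `coef` argument of `glmLRT`.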

Tags: data filter, model fitting, negative binomial, time course deseq, tmm normalization

markrobinsonca

Junior Member

Join Date: Mar 2010

Posts: 7

10-10-2012, 02:11 AM

See this:

[BioC] Data filtering

https://stat.ethz.ch/pipermail/bioconductor/2012-October/048508.html