Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data filters at what stage of NGS data analysis?

    Greetings friends!

    I seek help with data that I have : 3 time points, 3 genotypes, 3 replicates for each of these = 27 libraries

    The goal is to find genes that have different time expression profiles amongst 2 or more genotypes.

    After our 1st round of data analysis, (including TMM normalization), the time course graphs and box plots were so noisy in terms of high std error at each time point, that it was hard to say if expression profile of one genotype was overlapping or distinct from that for the other genotypes! R code attached at bottom of this post.

    So in short - we now need to employ data filters to check and reduce noise in our data. Some ideas are
    removing genes that have low expression (count) levels
    removing genes that have high variance across replicates
    removing genes that have low variance across time (constitutively expressed genes are biologically less interesting)

    So my question to you is what stage of my analysis do I employ these filters?
    On the raw data itself, prior to normalization?
    Or should I perform the TMM normalization, use the norm factors to transform my data to non-integer normalized counts and then filter (in which case I think I cannot fit them into negative binomial model, right?)

    Code:
    count = read.table("Input.txt", sep="\t", header=T)                     					
    #$#$ read in raw count mapped data
    
    f.count = count[apply(count[,-c(1,ncol(count))],1,sum) > 27,]                               
    #$#$ filter ou genes with total read count < 27 across all libraries
    
    f.dat = f.count[,-c(1,ncol(count))]                                                         
    #$#$ select only read count, not rest of data frame
    
    S = factor(rep(c("gen1","gen2","gen3"),rep(9,3)))                                           
    #$#$ define group
    
    Time = factor(rep(rep(c("0","10","20"),rep(3,3)),3))         								
    #$#$ define time
    
    Time.rep = rep(1:3,9)                                                                        
    #$#$ define replicate
    
    Group = paste(S,Time,Time.rep,sep="_")                                                         
    #$#$ define group_time_replicate
    
    library(edgeR)                                                                              
    #$#$ load edgeR package
    
    f.factor = data.frame(files = names(f.dat), S = S , Time = Time, lib.size = c(apply(f.dat,2,sum)),norm.factors = calcNormFactors(as.matrix(f.dat)))  
    #$#$  make data for edgeR method
    
    count.d = new("DGEList", list(samples = f.factor, counts = as.matrix(f.dat)))               
    #$#$  make data for edgeR method
    
    design = model.matrix(~ Time + S)                                                           
    #$#$  make design data for edgeR method
    
    count.d = calcNormFactors(count.d)                                                          
    #$#$  Normalize TMM
    
    glmfit.d = glmFit(count.d, design, dispersion = 0.1)                                        
    #$#$  Fit the Negative Binomial Gen Lin Models
    
    lrt.count = glmLRT(count.d, glmfit.d)                                                       
    #$#$  Likelihood ratio tests
    
    result.count = data.frame(f.count, lrt.count$table)                                         
    #$#$  combining raw data and results from edgeR
    
    result.count$FDR = p.adjust(result.count$p.value,method="BH")                               
    #$#$  calculating the False Discovery Rate
    
    write.table(result.count, "edgeR.Medicago_count_WT_Mu3.txt",sep="\t",row.names=F)           
    #$#$  saving the combined data set

  • #2
    See this:

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin




      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
      04-22-2024, 07:01 AM
    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, Yesterday, 08:47 AM
    0 responses
    12 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    60 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    59 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 09:21 AM
    0 responses
    54 views
    0 likes
    Last Post seqadmin  
    Working...
    X