SEQanswers

10-09-2012, 12:29 PM   #1
bioliyezhang
Member
 
Location: Boston, MA

Join Date: Mar 2011
Posts: 19
Question: problem with data containing non-finite values in cummeRbund processing

Hi all,

I am trying to identify differentially expressed genes between two conditions, each with three paired-end replicates. I ran into a problem while analyzing the data with TopHat v2.0.4, Cufflinks v2.0.2 and cummeRbund v2.0.0.

When I try to generate the density plot with cummeRbund, it shows warning messages like these:

http://www.freeimagehosting.net/t4yzi
Warning messages:

1: Removed 24226 rows containing non-finite values (stat_density).

2: Removed 22713 rows containing non-finite values (stat_density).

The volcano plot (http://www.freeimagehosting.net/ot994) also does not look right. I do not understand what causes the non-finite values in my dataset, or why there are so many of them.

I suspect that this is because I used the "--compatible-hits-norm" normalization option in Cufflinks. Does anyone have experience with this? Any suggestions would be greatly appreciated.

Best, Liye

Last edited by bioliyezhang; 10-09-2012 at 01:03 PM.
10-10-2012, 01:55 AM   #2
blanco
Member
 
Location: Iceland

Join Date: Apr 2012
Posts: 28

Hi Liye,
I believe this is caused by FPKM values of zero. When the program takes the log of zero, the result is minus infinity (-Inf), and these values are excluded from the analysis.

I think there is an option called "pseudocount" (e.g. pseudocount=0.0001), which adds a small value to the FPKM values before the log is taken, so that zeros no longer become -Inf and all values stay in the analysis. If you do that, the density should pile up close to the y-axis. I am not sure that is what you want, but it is an option.

I tend to think it is OK to discard the zero FPKM values.
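
The log-of-zero effect described in this post can be illustrated with a small Python sketch. This is not cummeRbund code: the FPKM values, the helper names and the way the pseudocount is applied are assumptions made for illustration only. Note that R's log10(0) returns -Inf, whereas Python's math.log10(0) raises an error, so the sketch emulates the R behaviour explicitly.

```python
import math

def log10_or_neg_inf(value):
    """Emulate R's log10() for non-negative input: zero maps to -inf."""
    # FPKM values are assumed to be non-negative.
    return math.log10(value) if value > 0 else float("-inf")

def count_non_finite(fpkm_values, pseudocount=0.0):
    """Count values whose log10(fpkm + pseudocount) is not finite."""
    logged = [log10_or_neg_inf(v + pseudocount) for v in fpkm_values]
    return sum(1 for x in logged if not math.isfinite(x))

# Hypothetical FPKM values, three of which are zero.
fpkm = [0.0, 12.5, 0.0, 3.2, 0.0, 150.0]

dropped_without = count_non_finite(fpkm)                  # zeros become -inf
dropped_with = count_non_finite(fpkm, pseudocount=1e-4)   # every value is finite
print(dropped_without, dropped_with)  # prints: 3 0
```

Without a pseudocount, each zero produces one non-finite value, which would match a "Removed N rows containing non-finite values" warning; with a small positive pseudocount, none are dropped.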
10-10-2012, 07:56 AM   #3
bioliyezhang
Member
 
Location: Boston, MA

Join Date: Mar 2011
Posts: 19

Hi Blanco,

Thanks a lot. Your explanation of the non-finite values is very clear. I tried setting the pseudocount, and it worked as you described.
However, it seems to me that the problem in the volcano plot is a different one, because even when I set pseudocount to a non-zero value, a warning still appears:

Removed 4573 rows containing missing values (geom_point).

I am pretty new to Cufflinks and cummeRbund. Previously, 24226 and 22713 transcripts had an FPKM value of zero. My guess is that 4573 genes have zero FPKM in both the sample and the control, which leads to the missing values (geom_point) in the volcano plot. I am still checking this by looking into the data itself.

Do you have any idea what causes these missing values?
Thanks a lot.
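
One way to test the guess in this post directly is to count the genes whose FPKM is zero in both conditions and compare that count with the number in the warning. The Python sketch below uses made-up records and hypothetical helper names; it does not read real Cuffdiff output.

```python
# Hypothetical (gene, FPKM in condition 1, FPKM in condition 2) records.
records = [
    ("geneA", 0.0, 0.0),
    ("geneB", 0.0, 4.1),
    ("geneC", 7.3, 0.0),
    ("geneD", 2.0, 8.0),
    ("geneE", 0.0, 0.0),
]

def zero_in_both(rows):
    """Return the genes whose FPKM is zero in both conditions."""
    return [gene for gene, fpkm1, fpkm2 in rows if fpkm1 == 0 and fpkm2 == 0]

def zero_count(rows, condition):
    """Count genes with zero FPKM in one condition (1 or 2)."""
    return sum(1 for row in rows if row[condition] == 0)

both = zero_in_both(records)
print(len(both), zero_count(records, 1), zero_count(records, 2))  # prints: 2 3 3
```

If the count of genes that are zero in both conditions does not match the number of removed rows, the warning must have another cause.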
10-11-2012, 06:50 AM   #4
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82

The situation in the volcano plot is similar, except that this warning does not arise from zero values. Rather, we restrict the axes to make the plot more visually interpretable, because the data can contain very high and very low log-fold-change values. You should be able to adjust the x and y axes to include the missing values if you like, but be forewarned that the log-fold-change values can be very large in these data, and this will compress the rest of the image.

-Loyal
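
The effect of restricted axes can be sketched in Python as follows. The points and the axis limits here are hypothetical and are not cummeRbund's actual defaults; the sketch only shows how points outside the plotting range end up counted as removed.

```python
def split_by_limits(points, x_limits, y_limits):
    """Separate (x, y) points into those inside and outside the axis limits."""
    (x_min, x_max), (y_min, y_max) = x_limits, y_limits
    inside, outside = [], []
    for x, y in points:
        if x_min <= x <= x_max and y_min <= y <= y_max:
            inside.append((x, y))
        else:
            outside.append((x, y))
    return inside, outside

# Hypothetical (log fold change, y-axis statistic) pairs.
points = [(-1.2, 3.0), (0.4, 0.5), (15.0, 2.0), (2.1, 55.0), (-0.3, 1.1)]

# Hypothetical axis limits, chosen only for this example.
inside, outside = split_by_limits(points, x_limits=(-10, 10), y_limits=(0, 40))
print(len(inside), len(outside))  # prints: 3 2
```

Widening the limits would bring the two outlying points back into the plot, at the cost of squeezing the remaining points into a smaller area.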
10-11-2012, 06:15 PM   #5
bioliyezhang
Member
 
Location: Boston, MA

Join Date: Mar 2011
Posts: 19

After looking into the data, I see what you mean.
The 4573 rows were removed because their test_stat value (one column of the .diff output) is higher than the default cutoff, which seems to be 40.
So the data points with too high a test_stat value are removed for visualization purposes, and there is no problem with the data itself.

Thanks.
01-30-2014, 01:40 PM   #6
rosielee
Junior Member
 
Location: NI

Join Date: Dec 2013
Posts: 1

The cummeRbund manual (the Jan 2014 Bioconductor package manual) describes how one can set a pseudocount, e.g. pseudocount=0.0001 or 1.0 (for the reasons discussed above).

Is there a specific reason why a value of 0.0001 is used in the csBoxplot function while it is set to 1.0 in other functions (e.g. csDistHeat)?

Or are these values shown just as examples of the kind of value one could use?

When carrying out an analysis on one data set, would you keep the pseudocount value the same across all functions?

(Sorry if these questions are daft; I am very new to RNA-seq and cummeRbund.)

Thanks in advance
Rosielee
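
To make the question above concrete, the following Python sketch compares what the two pseudocount values mentioned in the post do to a few hypothetical FPKM values. It is only an arithmetic illustration, assuming the pseudocount is added before a log10 transform; it does not reproduce what csBoxplot or csDistHeat do internally.

```python
import math

def log10_with_pseudocount(fpkm, pseudocount):
    """Return log10(fpkm + pseudocount); the pseudocount must be positive."""
    return math.log10(fpkm + pseudocount)

# Hypothetical FPKM values spanning zero, low and high expression.
fpkm_values = [0.0, 0.01, 1.0, 100.0]

for pseudocount in (0.0001, 1.0):
    logged = [round(log10_with_pseudocount(v, pseudocount), 3) for v in fpkm_values]
    print(pseudocount, logged)
```

With 0.0001, a zero FPKM maps to about -4 and sits far from the low expressed values; with 1.0, it maps to 0 and low values are compressed toward it. High values are nearly unaffected by either choice, so the difference matters mainly at the low end of the scale.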

Tags
cufflinks, cummerbund, rnaseq
