SEQanswers

Old 11-28-2016, 07:24 AM   #1
Schisto
Junior Member
 
Location: Europe

Join Date: May 2016
Posts: 3
DESeq2 - some values in assay are negative

Dear all,

I am a first-time DESeq2 user, and I am already stuck importing my dataset.

My RNA-seq data went through the HISAT2 - StringTie pipeline, and I created a gene counts file using the Python script provided with StringTie.

As far as I can tell, my gene count data set looks just fine, except that there is something weird going on with negative values, and I have no idea what.

I am trying to import the data into DESeq2 with the DESeqDataSetFromMatrix function.

Here's a step-by-step version of what I have done so far:

# Import data file that contains gene counts
countdata <- as.matrix(read_excel("DEseqcounts.xlsx"),header=TRUE)
# take row names from the first column
rownames(countdata) <- countdata[ , 1]
# first column is now duplicated, so remove
countdata <- countdata[,-1]

# Import data file that contains phenotype data in columns
coldata=as.matrix(read_excel("coldata.xlsx"),header=TRUE)
# take row names from the first column
rownames(coldata) <- coldata[ , 1]
# first column is now duplicated, so remove
coldata <- coldata[,-1]

(I have visually checked that the files are imported correctly, and I can't seem to find anything that looks wrong)

I would like to run the DESeqDataSetFromMatrix as follows:

DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ treatment, tidy = FALSE, ignoreRank = FALSE)

which returns this error message:
Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are negative

Indeed, there seem to be values in my "countdata" object that are somehow classified as negative:

countdata["" < 0] prints a long vector (the printout omitted 1280373 entries) whose values look like this:

[1] " 0" " 0" " 0" " 0" " 5" " 0" " 26" " 104" " 10" " 24"
[11] " 22" " 3" " 22" " 0" " 226" " 0" " 152" " 2" " 153" " 178"
[21] " 0" " 2" " 427" " 153" " 0" " 475" " 0" " 0" " 16" " 101"
[31] " 78" " 26" " 71" " 372" " 35" " 17" " 108" " 100" " 43" " 0"

I have no idea where that comes from. I couldn't find any negative, empty or NA cells in my count data file, nor are there any spaces in the cells.
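A short sketch of what is probably going on, using made-up toy data (the `toy` data frame and its column names are assumptions, not the poster's file). `"" < 0` is a single comparison that never looks at `countdata`: R coerces the `0` to the string "0" and compares text, so the result is one logical value, and indexing with a single TRUE returns every element. The quoted, space-padded values are what `as.matrix()` produces when a data frame holds a text column next to numeric ones. A check such as `countdata < 0` on a character matrix is likewise a text comparison, which may be how the "negative" error arises.

```r
# Toy stand-in for the imported sheet: one ID column plus numeric counts.
toy <- data.frame(gene_id = c("g1", "g2", "g3"),
                  sampleA = c(0, 5, 104),
                  stringsAsFactors = FALSE)

# Because gene_id is text, as.matrix() returns a character matrix and
# formats the numeric column to a common width (hence the leading spaces).
m <- as.matrix(toy)
mode(m)            # "character"
m[, "sampleA"]     # "  0" "  5" "104"

# "" < 0 is one logical value; it never inspects the matrix.
idx <- "" < 0
length(idx)        # 1
```

So the spreadsheet cells can be free of spaces and the matrix can still contain padded strings.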

Does anyone have a solution, or an idea on what went wrong?

Any help is highly appreciated,

Thanks so much!
Old 11-28-2016, 12:07 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

It looks like you have an extra space in front of all of your numbers and that's screwing everything up. Fix how the values are imported and ensure they're actually numbers and not strings.
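One way to "ensure they're actually numbers" could look like the sketch below. The small `countdata` matrix here is invented for illustration; only base R functions are used.

```r
# Character matrix like the one in the question (values padded with spaces).
countdata <- matrix(c("  0", "  5", "104", " 26", "  0", "226"),
                    nrow = 3,
                    dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

is.numeric(countdata)                # FALSE

# Convert in place; dimensions and dimnames are kept.
storage.mode(countdata) <- "integer"

is.numeric(countdata)                # TRUE
any(is.na(countdata))                # FALSE; an NA would point at a non-number cell
any(countdata < 0)                   # FALSE
```

Converting after the fact works, but it is usually cleaner to keep the ID column out of the matrix in the first place.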
Old 11-28-2016, 10:51 PM   #3
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123

I'm not so familiar with the StringTie pipeline, but I recommend avoiding Excel for most NGS-related analyses (see Zeeberg et al. 2004: Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics).

Can you use the Python script to get simple csv/tsv output?
[Update]
The prepDE.py script produces csv files. Import these directly into R; any selection and computation you've done with Excel can be done there as well.
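A minimal sketch of a direct CSV import, assuming the file has a gene ID in the first column and one column per sample. The file contents and column names below are invented so the example runs on its own; with real data, `read.csv()` would point at the prepDE.py output instead.

```r
# Write a tiny CSV to a temporary file (hypothetical contents).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("gene_id,sample1,sample2",
             "geneA,0,26",
             "geneB,5,104"), csv_path)

# row.names = 1 moves the ID column into the row names *before* the
# conversion to a matrix, so only numeric columns remain.
countdata <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))

is.numeric(countdata)    # TRUE
dim(countdata)           # 2 2
```

The key design choice is the order of operations: set the row names while the data is still a data frame, then call `as.matrix()`.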

Last edited by Michael.Ante; 11-28-2016 at 11:15 PM.
Old 11-29-2016, 01:18 AM   #4
Schisto
Junior Member
 
Location: Europe

Join Date: May 2016
Posts: 3

I have double-checked, and there is no extra space in any of my cells;
that is actually the reason I later saved this file as Excel.

The Python script gives me the gene counts in CSV format; I have of course tried that too, and it gives the same error.

Using the same file in edgeR for example works without issues.
Old 11-29-2016, 02:21 AM   #5
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123

Try as a first solution:
countdata <- as.matrix(read_excel("DEseqcounts.xlsx"),header=TRUE, row.names=1)

And then check
summary(is.numeric(countdata[,1]))

Maybe there are some empty lines at the end, which would cause R to read the column as factors rather than numbers. You can check this with tail(countdata).
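One caveat on the call above: `header` and `row.names` sit outside the `read_excel(...)` parentheses, so they are handed to `as.matrix()` rather than to the reader, and as far as I know `read_excel()` has no `row.names` argument anyway. To find stray cells that stop a column from being numeric, something like the following sketch could help (the matrix `x` is invented toy data):

```r
# Toy character matrix with one non-numeric cell (hypothetical data).
x <- matrix(c("12", " 7", "n/a", "30"), nrow = 2,
            dimnames = list(c("g1", "g2"), c("s1", "s2")))

# Cells that do not parse as numbers become NA on conversion.
parsed <- suppressWarnings(as.numeric(x))
bad <- which(is.na(parsed) & !is.na(x))

# Row/column positions and contents of the offending cells.
arrayInd(bad, dim(x))    # row 1, column 2
x[bad]                   # "n/a"
```

If `bad` comes back empty, every cell parses as a number and the problem lies in how the matrix was built, not in the file.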
Old 11-29-2016, 02:28 AM   #6
Schisto
Junior Member
 
Location: Europe

Join Date: May 2016
Posts: 3

The class of countdata[,1] is "character"

summary(is.numeric(countdata[,1]))
Mode FALSE NA's
logical 1 0

class(countdata[,1])
[1] "character"

That should be the issue I guess?
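Putting the pieces together, a possible end-to-end sketch is shown below. All data and names (`counts_df`, `pheno_df`, the sample IDs) are invented stand-ins for the two spreadsheets. It assumes the DESeq2 documentation's description of `colData` as a DataFrame or data.frame, and it only runs the DESeq2 step if the package is installed.

```r
# Toy stand-ins for the two spreadsheets (all names are hypothetical).
counts_df <- data.frame(gene_id = c("g1", "g2", "g3"),
                        s1 = c(0, 5, 104), s2 = c(26, 0, 226),
                        s3 = c(3, 22, 152), s4 = c(10, 24, 178),
                        stringsAsFactors = FALSE)
pheno_df <- data.frame(sample = c("s1", "s2", "s3", "s4"),
                       treatment = c("control", "control", "treated", "treated"),
                       stringsAsFactors = FALSE)

# Drop the ID column before converting, then restore it as row names.
countdata <- as.matrix(counts_df[, -1])
rownames(countdata) <- counts_df$gene_id
storage.mode(countdata) <- "integer"

# Keep colData as a data.frame and make the design variable a factor.
coldata <- data.frame(treatment = factor(pheno_df$treatment),
                      row.names = pheno_df$sample)

# Sample order must match between the two objects.
stopifnot(identical(rownames(coldata), colnames(countdata)))

# Only run the DESeq2 step if the package is installed.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = countdata,
                                        colData = coldata,
                                        design = ~ treatment)
}
```

Note that `coldata` is not passed through `as.matrix()` here, since that would turn every phenotype column into text as well.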

Thanks for your help!
All times are GMT -8.