SEQanswers

Old 10-15-2015, 12:50 PM   #1
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Question Scatterplot of 2 columns

Hi everybody. I have a simple bioinformatics question. I'd like to make a scatterplot of two columns from a .csv or .tabular table. How can I do that? Thanks!
Old 10-15-2015, 12:56 PM   #2
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Default

Oh, I'm sorry. I'm working in R. Thanks!
Old 10-15-2015, 01:08 PM   #3
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Code:
library(data.table)

# Read data with fread.
# You could also use read.table or read.csv, but fread has many advantages.
data <- data.table::fread("test.csv")

# Basic scatter plot.
plot(data$V1, data$V2)

library(ggplot2)

# A slightly different scatter plot with ggplot2
# Very easy to generate prettier plots with ggplot2. 
# Takes time to understand underlying concepts though.
ggplot2::qplot(data$V1, data$V2)
Old 10-15-2015, 01:10 PM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Code:
d <- read.csv("some_file.csv", header=TRUE) # Presuming there's a header
plot(d[,1], d[,2]) # or plot(d$FirstColumnLabel, d$SecondColumnLabel)
Edit: I guess I should have refreshed, Blancha beat me to it.
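For reference, a minimal self-contained sketch of the read.csv() and plot() approach. It assumes a small made-up table with columns named control and treated (hypothetical names and values), written to a temporary file so the snippet runs on its own:

```r
# Write a tiny example table to a temporary CSV file (made-up values).
path <- tempfile(fileext = ".csv")
example <- data.frame(
  control = c(1.0, 2.5, 3.1, 4.8),
  treated = c(1.2, 2.1, 3.9, 5.0)
)
write.csv(example, path, row.names = FALSE)

# Read it back; the first line of the file is used as the column names.
d <- read.csv(path, header = TRUE)

# Inspect the structure before plotting.
str(d)

# Scatter plot of one column against the other, with axis labels.
plot(d$control, d$treated,
     xlab = "control", ylab = "treated",
     main = "Scatter plot of two columns")
```

With a real file, only the path and the two column names should need to change.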
Old 10-15-2015, 01:31 PM   #5
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

@dpryan
That's fine.
Your answer does raise a good point for a novice R user.

@Marcos Lancia If your csv file does have column headers, you should use those instead of V1 and V2, which are the default column headers generated by fread.
There are several other subtleties, even for this question, which is just about the most basic question one can ask in R.
That's why I've stopped trying to teach R to wet-lab biologists.
I like fread, because it is extremely fast, and can detect several other features of the files, such as the column separator, or the presence or absence of column headers, which often have to be explicitly specified with read.csv.
However, fread returns a data.table by default, not a data.frame, which is a slightly different data structure.
If you'd like to get a more traditional data.frame, you just have to specify data.table = FALSE.
Code:
 fread("test.csv", data.table=FALSE)
With a data.table, you can't specify the column index in exactly the format given by @dpryan. You can specify the column by the column label in the same manner, though.
To get the first column by index of a data.table, you need to use the following format.

Code:
data[, 1, with=FALSE]
Hopefully, you're not now completely lost.
You'll understand why I've stopped giving workshops in R. :D

@dpryan's answer is more direct. read.csv() and plot()
Old 10-16-2015, 05:32 AM   #6
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Default

Hi people, thanks for writing. I'll check out your ideas and see how well I understand your commands.
Old 10-16-2015, 06:11 AM   #7
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Default

I'm working with dpryan's commands. I was able to create the variable d, but when I tried plotting with

plot(d$log2(FC transcriptome), d$log2(FC translatome))

I got an error message saying
Error: unexpected symbol in "plot(d$log2(FC transcriptome"

What can I do?
Thanks!
Old 10-16-2015, 06:16 AM   #8
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Arrow little advances

"log2(FC transcriptome)" and "log2(FC translatome)" are the headers of the columns that I want to plot. They're the 10th and 24th columns in my table.
Old 10-16-2015, 06:25 AM   #9
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Yes, that's not going to work.

You can't have a space in an R variable name.
As always, there is a work-around.
You can put the variable name in backticks: `FC transcriptome` is acceptable.
You need to call log2 on the column too.
log2(d$`FC transcriptome`)

The function read.csv must have replaced the blank spaces by periods anyway.

So, the correct code would be.

Code:
 plot(log2(d$FC.transcriptome), log2(d$FC.translatome))
You should check the head of the data frame, to be sure that you have the correct column names, and that the file was read correctly.

Code:
head(d)
colnames(d)
EDIT for update
If the column name in your file is log2(FC transcriptome), read.csv will convert the column name to log2.FC.transcriptome. with a trailing period, because the parentheses and the space are each replaced by a period.
Just check the colnames and the header of the data frame with the commands posted above.
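To see the renaming directly, here is a small sketch using inline text instead of a file (the two data rows are made up). It shows what make.names() does to these headers, and how check.names = FALSE keeps the original headers so that they can be used with backticks:

```r
# The two headers from the question.
headers <- c("log2(FC transcriptome)", "log2(FC translatome)")

# read.csv() passes headers through make.names() by default:
# every character that is not valid in a name becomes a period.
safe <- make.names(headers)
print(safe)
# "log2.FC.transcriptome." "log2.FC.translatome."

# A tiny inline table with those headers (made-up numbers).
csv_text <- "log2(FC transcriptome),log2(FC translatome)\n1.5,0.8\n-0.2,0.1\n"

# Default behaviour: headers are converted to syntactic names.
d_default <- read.csv(text = csv_text)
print(colnames(d_default))

# check.names = FALSE keeps the headers exactly as written in the file.
d_raw <- read.csv(text = csv_text, check.names = FALSE)
print(colnames(d_raw))

# Non-syntactic names then need backticks (or [[ ]] with a string).
x <- d_raw$`log2(FC transcriptome)`
y <- d_raw[["log2(FC translatome)"]]
```

Either naming scheme works for plotting, as long as the name in the plot call matches what colnames() reports.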

Last edited by blancha; 10-16-2015 at 06:27 AM.
Old 10-16-2015, 06:28 AM   #10
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Default

I tried the same with ggplot, and the answer was the same.
Old 10-16-2015, 06:30 AM   #11
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Just want to point out that with RStudio you can import a file by just clicking on "Import Dataset", and also just view the data frame by clicking on the name of the data frame.

Highly recommend RStudio for both novices and power users.

Edit for update
Just post the output of the following command, so that we know the format of your data frame.
Code:
head(d)
Old 10-16-2015, 06:49 AM   #12
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Or, since you said that they are the 10th and 24th columns, you could just specify the columns by index number.

Code:
plot(d[,10], d[,24])
One should always check the format of the data frame first after reading a CSV file, though.
Old 10-16-2015, 06:53 AM   #13
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Default

Ok! You're right. I changed the names of the headers, but I ran into a new problem. I have some non-numerical data in the columns: some values are "inf" or "-inf". How can I avoid that?
Old 10-16-2015, 07:00 AM   #14
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Default

I tried before with plot(d[,10], d[,24]), but it says "error in evaluating the argument 'x' in selecting a method for function 'plot': Error in `[.data.frame`(d, , 10) : undefined columns selected"

> head(d)
test_id.gene_id.gene.locus.sample_1.sample_2.status.value_1.value_2.log2.FC.transcriptome..test_stat.p_value.q_value.significant.test_id.gene_id.gene.locus.sample_1.sample_2.status.value_1.value_2.log2.FC.translatome..test_stat.p_value.q_value.significant
1 TCONS_00000001\tXLOC_000001\tMedtr1g004940\tchr1:688-7366\tSN16 mock\tSN16 Sm\tNOTEST\t0\t0\t0\t0\t1\t1\tno\tTCONS_00000001\tXLOC_000001\tMedtr1g004940\tchr1:688-7366\tTRAP mock\tTRAP Sm\tNOTEST\t0\t0\t0\t0\t1\t1\tno
2 TCONS_00000002\tXLOC_000002\tMedtr1g004950\tchr1:14513-15729\tSN16 mock\tSN16 Sm\tNOTEST\t0\t0\t0\t0\t1\t1\tno\tTCONS_00000002\tXLOC_000002\tMedtr1g004950\tchr1:14513-15729\tTRAP mock\tTRAP Sm\tNOTEST\t0\t0\t0\t0\t1\t1\tno
3 TCONS_00000003\tXLOC_000003\tMedtr1g004960\tchr1:16282-18382\tSN16 mock\tSN16 Sm\tNOTEST\t0\t0\t0\t0\t1\t1\tno\tTCONS_00000003\tXLOC_000003\tMedtr1g004960\tchr1:16282-18382\tTRAP mock\tTRAP Sm\tNOTEST\t0\t0\t0\t0\t1\t1\tno
4 TCONS_00000004\tXLOC_000004\tMedtr1g004980\tchr1:31972-32344\tSN16 mock\tSN16 Sm\tNOTEST\t0\t0\t0\t0\t1\t1\tno\tTCONS_00000004\tXLOC_000004\tMedtr1g004980\tchr1:31972-32344\tTRAP mock\tTRAP Sm\tNOTEST\t0\t0\t0\t0\t1\t1\tno
5 TCONS_00000005\tXLOC_000005\tMedtr1g004990\tchr1:35909-40554\tSN16 mock\tSN16 Sm\tNOTEST\t0
6 203657\t0
Old 10-16-2015, 07:16 AM   #15
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

The default column separator for read.csv is the comma.
If your file is tab-delimited, read.csv will not split the columns on tabs, and each row ends up in a single column.

Choice #1: Use read.csv, and specify the separator. Not my favorite solution, but most commonly used.
Code:
d <- read.csv("test.csv", sep="\t")
Choice #2: Much better in my opinion. Install data.table, and use the fread function.
fread automatically detects the column separator.
Code:
install.packages("data.table")
library(data.table) 
d <- fread("test.csv", data.table=FALSE)
Choice #3
Use the Import Dataset button in RStudio.
Specify the column separator in pop-up menu.
Quick solution for R novices.
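The separator problem can be reproduced with a small sketch in base R. The file below is a made-up three-column, tab-delimited table written to a temporary path; read.delim() is the base R shortcut for tab-separated input:

```r
# Write a small tab-delimited file (made-up gene IDs and values).
path <- tempfile(fileext = ".tabular")
writeLines(c(
  "gene_id\tvalue_1\tvalue_2",
  "XLOC_000001\t0.5\t1.25",
  "XLOC_000002\t2\t0.75"
), path)

# Read with the default comma separator: everything lands in one column.
d_comma <- read.csv(path)
print(ncol(d_comma))

# Read with the tab as separator: three proper columns.
d_tab <- read.csv(path, sep = "\t")
print(ncol(d_tab))

# read.delim() uses the tab separator by default.
d_delim <- read.delim(path)
print(colnames(d_delim))
```

If the numbers in a file use a decimal comma, which the split values 0 and 203657 in the head(d) output may hint at, the dec = "," argument of these functions is the relevant setting. That is only a guess about this particular file.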
Old 10-16-2015, 07:33 AM   #16
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Default

Ok, my table is ready. Now, my problem is that I have non-numerical data in my columns. How can I avoid that? For example,
plot(data$log2.FC.transcriptome., data$log2.FC.translatome.)

In min(x) : no non-missing arguments to min; returning Inf
Old 10-16-2015, 08:08 AM   #17
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

As always, there are many ways, and many subtleties.
If you've been using read.csv, I would recommend that you add the argument stringsAsFactors=FALSE.
You don't need to bother with that if you use fread, which doesn't convert strings to factors.

I'll just post two ways to do the filtering, supposing that you know the strings that you want to filter out.
I'll have to get back to work on my own script then.
There is an older, more conventional method, but dplyr has many advantages.

Code:
library(dplyr)
data.filtered <- data %>% filter(log2.FC.transcriptome. != "Inf" & log2.FC.transcriptome. != "-Inf" & log2.FC.translatome. != "Inf" & log2.FC.translatome. != "-Inf")
Older method
Code:
data.filtered <- data[data$log2.FC.transcriptome. != "Inf" & data$log2.FC.transcriptome. != "-Inf" & data$log2.FC.translatome. != "Inf" & data$log2.FC.translatome. != "-Inf", ]
Sorry, I really have to go and finish my own script now.
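A third possibility, sketched here in base R as an alternative to the string comparisons above, is to convert both columns to numbers and keep only the finite pairs. The column names and values are made up, and the helper to_num is a hypothetical name introduced only for this example:

```r
# Made-up example: two columns read as text, with infinite entries mixed in.
d <- data.frame(
  transcriptome = c("1.2", "inf", "-0.7", "-inf", "2.5"),
  translatome   = c("0.9", "1.1", "-Inf", "0.3", "2.0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert text (or factors) to numbers.
# Strings that cannot be parsed become NA; the warning is suppressed.
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))

x <- to_num(d$transcriptome)
y <- to_num(d$translatome)

# is.finite() is FALSE for Inf, -Inf, NaN and NA, so one test covers
# infinite values and unparseable strings alike.
keep <- is.finite(x) & is.finite(y)
d_finite <- data.frame(x = x[keep], y = y[keep])
print(d_finite)

# Plot only the finite pairs.
plot(d_finite$x, d_finite$y,
     xlab = "log2(FC transcriptome)", ylab = "log2(FC translatome)")
```

This drops a row whenever either value is infinite or missing, which may or may not be what you want for genes with zero counts in one condition.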
Old 10-16-2015, 09:16 AM   #18
Marcos Lancia
Member
 
Location: Argentina

Join Date: Apr 2015
Posts: 31
Default

Ok, thanks so much for your time. You helped me so much. If you can read this, there's a problem with the %>%.
Error: could not find function "%>%"

The 2nd option you gave me deleted all data. The table is empty after that command.
Old 10-16-2015, 10:09 AM   #19
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

The "%>%" forward-pipe operator comes with dplyr.
Either you haven't loaded the dplyr package, or you made a mistake writing the operator.
It's just a greater-than sign between two percent signs, so the second mistake is less likely.
Just make sure you've installed dplyr and loaded the package before using "%>%".

Code:
install.packages("dplyr")
library(dplyr)

For the problems you're experiencing with the more conventional filtering step, I'm not sure what the problem could be.
I don't see any typos in my command, but there could always be a mistake on my part.
The best troubleshooting step is always to simplify the command.
Try just one filtering step at a time.

Code:
data.filtered <- data[data$log2.FC.transcriptome. != "Inf", ]
I'm going to stop procrastinating now by spending time on seqanswers, and actually finish my own R script.
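One possible cause of an empty table, offered as a guess rather than a diagnosis: if the column name used in the filter does not exist in the data frame, `$` returns NULL, the comparison has length zero, and no rows are selected. A minimal sketch with a made-up one-column data frame whose header kept its original spelling:

```r
# Made-up data frame whose header was NOT converted to a syntactic name.
d <- data.frame(`log2(FC transcriptome)` = c(1.5, Inf, -0.3),
                check.names = FALSE)

# Asking for a column that does not exist returns NULL, without an error.
missing_col <- d$log2.FC.transcriptome.
print(is.null(missing_col))

# Comparing NULL with a string gives a zero-length logical vector,
# so the subset below keeps no rows at all.
empty <- d[missing_col != "Inf", , drop = FALSE]
print(nrow(empty))

# Checking the name first makes the problem visible.
print("log2.FC.transcriptome." %in% colnames(d))
print(colnames(d))

# Using the name exactly as reported by colnames() works.
ok <- d[is.finite(d[["log2(FC transcriptome)"]]), , drop = FALSE]
print(nrow(ok))
```

Comparing colnames(data) with the names typed in the filter is a quick way to rule this out.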
Tags
scatter plot, table
