SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Finding the Longest ORF for all sequences in EMBOSS dena.dinesh Bioinformatics 12 02-25-2015 12:51 PM
Finding repetitive elements from BAC sequences and Illumina sequences. int11ap1 De novo discovery 0 11-03-2014 07:23 AM
Help finding repetitive sequences SamBR90 Bioinformatics 0 10-16-2013 10:06 AM
Help with finding complementary sequences in ngs data Ami Bioinformatics 1 04-02-2013 11:48 PM
Finding Perfect matches and mismatches bhargavpbk88 Bioinformatics 2 01-19-2012 07:00 AM

Reply
 
Thread Tools
Old 07-30-2015, 05:25 AM   #1
chudar
Junior Member
 
Location: Europe

Join Date: Jul 2015
Posts: 9
Default Finding mismatches between two sequences using R

Hi,

I am new to R programming. I have two fasta files namely WT and MT that contains 3 protein sequences which is given below

WT:
>seq1
MGFTIIIJKLPADFDEMMJDHGASFWEIHDI
>seq2
IIJHGTIPIOIZTRWERQQMSCVYMMDMMLI
>seq3
KPOIHUTWMMKLIZGTIIWHSAKDFJGVAHD

MT:
>seq1
TGFTHIHJKLPADFDEMTJDHGASFWEIHDH
>seq2
HHJHGTIPIOIZTRWERQQMSCVYTMDTMLH
>seq3
KPOIHUTWTTKLIZGTHHWHSAKDFJGVAHD


WT contains set of fasta sequences without any mutated aminoacids while the MT contains same protein sequences of WT but some amino acid are mutated.

I want compare the two sequences and find out counts for particular mutations wherever M mutated to T, I mutated to H for every protein and list them like Seq1 MT_counts:xx;IH_counts:YY Seq2:MT_counts:xx; IH_counts:yy Seq3:MT_counts:xx IH_counts:yy

I have written a small function in R where I read the fasta sequencces using read.fasta from seqinr package and later using two for loops for iterating through all sequences and another for iterating through every letter of seq in WT and comparing with its counterpart in MT.

Code:
library(seqinr)
wt=read.fasta("C:/Users/tsekaran/Documents/sample_ref_protein.fasta")
mt=read.fasta("C:/Users/tsekaran/Documents/sample_mut_protein.fasta")
mismatch_finder=function(wt,mt)
{
for (i in 1:length(names(wt)))
{ 
  MT_count=0
  IH_count=0
  #for(j in 1:length(wt$seq1))
  for(j in 1:length(wt[[i]]))
  {  
    if(wt[j]=="m" && mt[j]=="t" )
    {
      MT_count=MT_count+1
    }
  else if(wt[j]=="i" && mt[j]=="h" )
    {
      IH_count=IH_count+1
    }
  }
  print(names(wt[i]))
  print(MT_count)
  print(IH_count)
}
}
But it gives me counts for MT and IH as 0 for all seq1 ,seq2 and seq3. I know there are some mutated aminoacid but I am surprised as the function reports the count as 0. Could some guide me where if there is any mistake in comparison. Kindly help me. Thanks in advance.

Last edited by chudar; 07-31-2015 at 02:04 AM. Reason: Updating R script
chudar is offline   Reply With Quote
Old 07-30-2015, 09:15 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

If you ask people to help you with an error message then you need to post the error message. I see a number of problems with your code, from using non-existent variables ("wtf") to incorrect syntax ("wt$[j]").
dpryan is offline   Reply With Quote
Old 07-31-2015, 02:15 AM   #3
chudar
Junior Member
 
Location: Europe

Join Date: Jul 2015
Posts: 9
Default

Quote:
Originally Posted by dpryan View Post
If you ask people to help you with an error message then you need to post the error message. I see a number of problems with your code, from using non-existent variables ("wtf") to incorrect syntax ("wt$[j]").
Dear DpRyan,

Thank you very much for your concern. I have updated my R script and have edited my question also. When I ran my updated script it expected it give number of mismatches but it gave me only 0 for all my sequences. I dont whether my comparison code is corrrect. Kindly take a look and guide me please. Thanks in advance
chudar is offline   Reply With Quote
Old 07-31-2015, 03:38 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Grr, the site ate my initial reply!

I assume you meant to write this, since otherwise you're trying to compare a character and a list (wt[i] == "m"), which will always be false and thus yield counts of 0:

Code:
for (i in 1:length(names(wt)))
{ 
  MT_count=0
  IH_count=0
  #for(j in 1:length(wt$seq1))
  for(j in 1:length(wt[[i]]))
  {  
    if(wt[[i]][j]=="m" && mt[[i]][j]=="t" )
    {
      MT_count=MT_count+1
    }
  else if(wt[[i]][j]=="i" && mt[[i]][j]=="h" )
    {
      IH_count=IH_count+1
    }
  }
  print(names(wt[i]))
  print(MT_count)
  print(IH_count)
}
This could be done more succinctly (and probably with better performance) with:
Code:
for (i in 1:length(names(wt))) {
    print(names(wt)[i])
    print(length(intersect(which(wt[[i]] == "m"), which(mt[[i]] == "t"))))
    print(length(intersect(which(wt[[i]] == "i"), which(mt[[i]] == "h"))))
}
dpryan is offline   Reply With Quote
Old 08-03-2015, 02:15 AM   #5
chudar
Junior Member
 
Location: Europe

Join Date: Jul 2015
Posts: 9
Unhappy

Hi Dpryan,

Thank you very much for code. it works fine. I would like to output the data into a data frame so that it looks like

Code:
ID     MT  IH
seq1   2    5
seq2   4    7
seq3   6    9
So I edited the code like below where I created an data frame with NA and used it inside the for loop

Code:
mismatch=function(wt,mt)
{
df=data.frame(ID=NA,MT=NA,IH=NA)
for (i in 1:length(names(wt))) {
  df$ID[i]=names(wt)[i]
  df$MT[i]= length(intersect(which(wt[[i]] == "m"), which(mt[[i]] == "t")))
  df$IH[i]=length(intersect(which(wt[[i]] == "i"), which(mt[[i]] == "h")))
}
 return (df)
}

but it gives me an error as follows
Code:
Error in `$<-.data.frame`(`*tmp*`, "ID", value = c("seq1", "seq2")) : 
  replacement has 2 rows, data has 1
I have no clue how am I ending up like this.
chudar is offline   Reply With Quote
Old 08-03-2015, 02:25 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Dataframes need to have identical length rows, so you can't add something to ID without also adding values to MT and IH. The actual R way to do this would be something like:

Code:
library(seqinr)
wt=read.fasta("C:/Users/tsekaran/Documents/sample_ref_protein.fasta")
mt=read.fasta("C:/Users/tsekaran/Documents/sample_mut_protein.fasta")

getDiffs <- function(wt, mt) {
    c(names(wt),
    length(intersect(which(wt == "m"), which(mt == "t"))),
    length(intersect(which(wt == "i"), which(mt == "h"))))
}

res <- as.data.frame(t(mapply(getDiffs, wt, mt)))
names(res) <- c("MT","IT")
dpryan is offline   Reply With Quote
Old 08-03-2015, 03:01 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

BTW, you could also just use rbind() if you want to keep the for loop.
dpryan is offline   Reply With Quote
Reply

Tags
alignemnt, mismatches, r programming, seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:40 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO