Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding mismatches between two sequences using R

    Hi,

    I am new to R programming. I have two fasta files namely WT and MT that contains 3 protein sequences which is given below

    WT:
    >seq1
    MGFTIIIJKLPADFDEMMJDHGASFWEIHDI
    >seq2
    IIJHGTIPIOIZTRWERQQMSCVYMMDMMLI
    >seq3
    KPOIHUTWMMKLIZGTIIWHSAKDFJGVAHD

    MT:
    >seq1
    TGFTHIHJKLPADFDEMTJDHGASFWEIHDH
    >seq2
    HHJHGTIPIOIZTRWERQQMSCVYTMDTMLH
    >seq3
    KPOIHUTWTTKLIZGTHHWHSAKDFJGVAHD


    WT contains set of fasta sequences without any mutated aminoacids while the MT contains same protein sequences of WT but some amino acid are mutated.

    I want compare the two sequences and find out counts for particular mutations wherever M mutated to T, I mutated to H for every protein and list them like Seq1 MT_counts:xx;IH_counts:YY Seq2:MT_counts:xx; IH_counts:yy Seq3:MT_counts:xx IH_counts:yy

    I have written a small function in R where I read the fasta sequencces using read.fasta from seqinr package and later using two for loops for iterating through all sequences and another for iterating through every letter of seq in WT and comparing with its counterpart in MT.

    Code:
    library(seqinr)
    wt=read.fasta("C:/Users/tsekaran/Documents/sample_ref_protein.fasta")
    mt=read.fasta("C:/Users/tsekaran/Documents/sample_mut_protein.fasta")
    mismatch_finder=function(wt,mt)
    {
    for (i in 1:length(names(wt)))
    { 
      MT_count=0
      IH_count=0
      #for(j in 1:length(wt$seq1))
      for(j in 1:length(wt[[i]]))
      {  
        if(wt[j]=="m" && mt[j]=="t" )
        {
          MT_count=MT_count+1
        }
      else if(wt[j]=="i" && mt[j]=="h" )
        {
          IH_count=IH_count+1
        }
      }
      print(names(wt[i]))
      print(MT_count)
      print(IH_count)
    }
    }
    But it gives me counts for MT and IH as 0 for all seq1 ,seq2 and seq3. I know there are some mutated aminoacid but I am surprised as the function reports the count as 0. Could some guide me where if there is any mistake in comparison. Kindly help me. Thanks in advance.
    Last edited by chudar; 07-31-2015, 02:04 AM. Reason: Updating R script

  • #2
    If you ask people to help you with an error message then you need to post the error message. I see a number of problems with your code, from using non-existent variables ("wtf") to incorrect syntax ("wt$[j]").

    Comment


    • #3
      Originally posted by dpryan View Post
      If you ask people to help you with an error message then you need to post the error message. I see a number of problems with your code, from using non-existent variables ("wtf") to incorrect syntax ("wt$[j]").
      Dear DpRyan,

      Thank you very much for your concern. I have updated my R script and have edited my question also. When I ran my updated script it expected it give number of mismatches but it gave me only 0 for all my sequences. I dont whether my comparison code is corrrect. Kindly take a look and guide me please. Thanks in advance

      Comment


      • #4
        Grr, the site ate my initial reply!

        I assume you meant to write this, since otherwise you're trying to compare a character and a list (wt[i] == "m"), which will always be false and thus yield counts of 0:

        Code:
        for (i in 1:length(names(wt)))
        { 
          MT_count=0
          IH_count=0
          #for(j in 1:length(wt$seq1))
          for(j in 1:length(wt[[i]]))
          {  
            if(wt[[i]][j]=="m" && mt[[i]][j]=="t" )
            {
              MT_count=MT_count+1
            }
          else if(wt[[i]][j]=="i" && mt[[i]][j]=="h" )
            {
              IH_count=IH_count+1
            }
          }
          print(names(wt[i]))
          print(MT_count)
          print(IH_count)
        }
        This could be done more succinctly (and probably with better performance) with:
        Code:
        for (i in 1:length(names(wt))) {
            print(names(wt)[i])
            print(length(intersect(which(wt[[i]] == "m"), which(mt[[i]] == "t"))))
            print(length(intersect(which(wt[[i]] == "i"), which(mt[[i]] == "h"))))
        }

        Comment


        • #5
          Hi Dpryan,

          Thank you very much for code. it works fine. I would like to output the data into a data frame so that it looks like

          Code:
          ID     MT  IH
          seq1   2    5
          seq2   4    7
          seq3   6    9
          So I edited the code like below where I created an data frame with NA and used it inside the for loop

          Code:
          mismatch=function(wt,mt)
          {
          df=data.frame(ID=NA,MT=NA,IH=NA)
          for (i in 1:length(names(wt))) {
            df$ID[i]=names(wt)[i]
            df$MT[i]= length(intersect(which(wt[[i]] == "m"), which(mt[[i]] == "t")))
            df$IH[i]=length(intersect(which(wt[[i]] == "i"), which(mt[[i]] == "h")))
          }
           return (df)
          }

          but it gives me an error as follows
          Code:
          Error in `$<-.data.frame`(`*tmp*`, "ID", value = c("seq1", "seq2")) : 
            replacement has 2 rows, data has 1
          I have no clue how am I ending up like this.

          Comment


          • #6
            Dataframes need to have identical length rows, so you can't add something to ID without also adding values to MT and IH. The actual R way to do this would be something like:

            Code:
            library(seqinr)
            wt=read.fasta("C:/Users/tsekaran/Documents/sample_ref_protein.fasta")
            mt=read.fasta("C:/Users/tsekaran/Documents/sample_mut_protein.fasta")
            
            getDiffs <- function(wt, mt) {
                c(names(wt),
                length(intersect(which(wt == "m"), which(mt == "t"))),
                length(intersect(which(wt == "i"), which(mt == "h"))))
            }
            
            res <- as.data.frame(t(mapply(getDiffs, wt, mt)))
            names(res) <- c("MT","IT")

            Comment


            • #7
              BTW, you could also just use rbind() if you want to keep the for loop.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin


                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                Yesterday, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              37 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              41 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              35 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              54 views
              0 likes
              Last Post seqadmin  
              Working...
              X