Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • chudar
    Junior Member
    • Jul 2015
    • 9

    Finding mismatches between two sequences using R

    Hi,

    I am new to R programming. I have two fasta files namely WT and MT that contains 3 protein sequences which is given below

    WT:
    >seq1
    MGFTIIIJKLPADFDEMMJDHGASFWEIHDI
    >seq2
    IIJHGTIPIOIZTRWERQQMSCVYMMDMMLI
    >seq3
    KPOIHUTWMMKLIZGTIIWHSAKDFJGVAHD

    MT:
    >seq1
    TGFTHIHJKLPADFDEMTJDHGASFWEIHDH
    >seq2
    HHJHGTIPIOIZTRWERQQMSCVYTMDTMLH
    >seq3
    KPOIHUTWTTKLIZGTHHWHSAKDFJGVAHD


    WT contains set of fasta sequences without any mutated aminoacids while the MT contains same protein sequences of WT but some amino acid are mutated.

    I want compare the two sequences and find out counts for particular mutations wherever M mutated to T, I mutated to H for every protein and list them like Seq1 MT_counts:xx;IH_counts:YY Seq2:MT_counts:xx; IH_counts:yy Seq3:MT_counts:xx IH_counts:yy

    I have written a small function in R where I read the fasta sequencces using read.fasta from seqinr package and later using two for loops for iterating through all sequences and another for iterating through every letter of seq in WT and comparing with its counterpart in MT.

    Code:
    library(seqinr)
    wt=read.fasta("C:/Users/tsekaran/Documents/sample_ref_protein.fasta")
    mt=read.fasta("C:/Users/tsekaran/Documents/sample_mut_protein.fasta")
    mismatch_finder=function(wt,mt)
    {
    for (i in 1:length(names(wt)))
    { 
      MT_count=0
      IH_count=0
      #for(j in 1:length(wt$seq1))
      for(j in 1:length(wt[[i]]))
      {  
        if(wt[j]=="m" && mt[j]=="t" )
        {
          MT_count=MT_count+1
        }
      else if(wt[j]=="i" && mt[j]=="h" )
        {
          IH_count=IH_count+1
        }
      }
      print(names(wt[i]))
      print(MT_count)
      print(IH_count)
    }
    }
    But it gives me counts for MT and IH as 0 for all seq1 ,seq2 and seq3. I know there are some mutated aminoacid but I am surprised as the function reports the count as 0. Could some guide me where if there is any mistake in comparison. Kindly help me. Thanks in advance.
    Last edited by chudar; 07-31-2015, 02:04 AM. Reason: Updating R script
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    If you ask people to help you with an error message then you need to post the error message. I see a number of problems with your code, from using non-existent variables ("wtf") to incorrect syntax ("wt$[j]").

    Comment

    • chudar
      Junior Member
      • Jul 2015
      • 9

      #3
      Originally posted by dpryan View Post
      If you ask people to help you with an error message then you need to post the error message. I see a number of problems with your code, from using non-existent variables ("wtf") to incorrect syntax ("wt$[j]").
      Dear DpRyan,

      Thank you very much for your concern. I have updated my R script and have edited my question also. When I ran my updated script it expected it give number of mismatches but it gave me only 0 for all my sequences. I dont whether my comparison code is corrrect. Kindly take a look and guide me please. Thanks in advance

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Grr, the site ate my initial reply!

        I assume you meant to write this, since otherwise you're trying to compare a character and a list (wt[i] == "m"), which will always be false and thus yield counts of 0:

        Code:
        for (i in 1:length(names(wt)))
        { 
          MT_count=0
          IH_count=0
          #for(j in 1:length(wt$seq1))
          for(j in 1:length(wt[[i]]))
          {  
            if(wt[[i]][j]=="m" && mt[[i]][j]=="t" )
            {
              MT_count=MT_count+1
            }
          else if(wt[[i]][j]=="i" && mt[[i]][j]=="h" )
            {
              IH_count=IH_count+1
            }
          }
          print(names(wt[i]))
          print(MT_count)
          print(IH_count)
        }
        This could be done more succinctly (and probably with better performance) with:
        Code:
        for (i in 1:length(names(wt))) {
            print(names(wt)[i])
            print(length(intersect(which(wt[[i]] == "m"), which(mt[[i]] == "t"))))
            print(length(intersect(which(wt[[i]] == "i"), which(mt[[i]] == "h"))))
        }

        Comment

        • chudar
          Junior Member
          • Jul 2015
          • 9

          #5
          Hi Dpryan,

          Thank you very much for code. it works fine. I would like to output the data into a data frame so that it looks like

          Code:
          ID     MT  IH
          seq1   2    5
          seq2   4    7
          seq3   6    9
          So I edited the code like below where I created an data frame with NA and used it inside the for loop

          Code:
          mismatch=function(wt,mt)
          {
          df=data.frame(ID=NA,MT=NA,IH=NA)
          for (i in 1:length(names(wt))) {
            df$ID[i]=names(wt)[i]
            df$MT[i]= length(intersect(which(wt[[i]] == "m"), which(mt[[i]] == "t")))
            df$IH[i]=length(intersect(which(wt[[i]] == "i"), which(mt[[i]] == "h")))
          }
           return (df)
          }

          but it gives me an error as follows
          Code:
          Error in `$<-.data.frame`(`*tmp*`, "ID", value = c("seq1", "seq2")) : 
            replacement has 2 rows, data has 1
          I have no clue how am I ending up like this.

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            Dataframes need to have identical length rows, so you can't add something to ID without also adding values to MT and IH. The actual R way to do this would be something like:

            Code:
            library(seqinr)
            wt=read.fasta("C:/Users/tsekaran/Documents/sample_ref_protein.fasta")
            mt=read.fasta("C:/Users/tsekaran/Documents/sample_mut_protein.fasta")
            
            getDiffs <- function(wt, mt) {
                c(names(wt),
                length(intersect(which(wt == "m"), which(mt == "t"))),
                length(intersect(which(wt == "i"), which(mt == "h"))))
            }
            
            res <- as.data.frame(t(mapply(getDiffs, wt, mt)))
            names(res) <- c("MT","IT")

            Comment

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              BTW, you could also just use rbind() if you want to keep the for loop.

              Comment

              Latest Articles

              Collapse

              • SEQadmin2
                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                by SEQadmin2


                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                Here are nine questions we think about, in roughly the order they matter, before...
                06-18-2026, 07:11 AM
              • SEQadmin2
                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                by SEQadmin2


                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                ...
                06-02-2026, 10:05 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, Yesterday, 05:37 AM
              0 responses
              6 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-26-2026, 11:10 AM
              0 responses
              16 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-17-2026, 06:09 AM
              0 responses
              51 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-09-2026, 11:58 AM
              0 responses
              110 views
              0 reactions
              Last Post SEQadmin2  
              Working...