  • batch editing of ABI files

    Hello all. I am unsure if this is the correct forum, but I was referred here with a somewhat unusual question.

    I have a run of over 1000 sequences, in ABI format. Each read is relatively short, less than 1 kb. Our goal is to align/analyze each read together - but to do so we must first eliminate sequence from both the 5' and 3' ends.

    We typically do this by "hand" in Geneious, using unique restriction sites at both the 5' and 3' ends. Simply put, we delete sequence upstream of one restriction site and then everything downstream of another restriction site (which is also downstream of the first restriction site.)

    With fewer than 100 sequences, this is just a few hours of work. However, with over 1000 reads, it gets to be very time-consuming!

    Does software exist which would allow me to do this sequence editing in a batch format? I believe we're looking for something which would identify two restriction sites in a read, and delete the sequence either upstream or downstream of those sites. Below is the general idea, in case the description above is not clear.

    Thanks for any help,

    Ray


  • #2
    Sure, any scripting language (such as Python) will do this, especially since the HindIII and EcoRI recognition sites (AAGCTT and GAATTC) contain no ambiguous bases.
    The only problem might be that your files are in ABI format, so you either need to convert them to something the script can read (FASTA) or find an adaptor. There is one for Python: http://www.bioinformatics.org/groups/?group_id=497
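For the ABI-to-FASTA conversion step, one option (an assumption here, not something named in this thread) is Biopython, whose `Bio.SeqIO` module reads ABI trace files under the format name "abi". A minimal sketch, assuming a reasonably recent Biopython is installed and the traces end in `.ab1`; the helper names `fasta_name_for` and `convert_folder` and the directory arguments are hypothetical:

```python
import os


def fasta_name_for(abi_filename):
    """Map a trace file name such as 'read1.ab1' to 'read1.fasta'."""
    base, _ext = os.path.splitext(abi_filename)
    return base + ".fasta"


def convert_folder(abi_dir, fasta_dir):
    """Convert every .ab1 file in abi_dir to a FASTA file in fasta_dir."""
    # Biopython is a third-party package (assumed installed); it is imported
    # here so the naming helper above still works where it is missing.
    from Bio import SeqIO

    os.makedirs(fasta_dir, exist_ok=True)
    for filename in sorted(os.listdir(abi_dir)):
        if not filename.lower().endswith(".ab1"):
            continue  # skip anything that is not a trace file
        source = os.path.join(abi_dir, filename)
        target = os.path.join(fasta_dir, fasta_name_for(filename))
        # Reads the base calls stored in the trace and writes them as FASTA.
        SeqIO.convert(source, "abi", target, "fasta")
```

Calling `convert_folder('path/to/abi', 'path/to/data')` would then give a folder of FASTA files, one per trace, for a trimming script to work on.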

    Here's a simple script off the top of my head that will do it for a folder of FASTA files (one sequence per file):
    Code:
    #!/usr/bin/env python3
    import os

    directory = 'path/to/data'
    target_directory = 'path/to/output'
    for filename in os.listdir(directory):
        # read the whole FASTA file in text mode (one sequence per file)
        with open(os.path.join(directory, filename)) as op:
            fasta = op.read().splitlines()
        name = fasta[0][1:]  # cut off >
        sequence = "".join(fasta[1:]).upper()  # transform into one line of uppercase bases...
        hindIIIpos = sequence.find("AAGCTT")  # search for first hindIII site
        if hindIIIpos == -1:
            raise ValueError("%s did not contain a hindIII site" % filename)
        ecoRIpos = sequence.rfind("GAATTC")  # search for last ecoRI site
        if ecoRIpos == -1:
            raise ValueError("%s did not contain an ecoRI site" % filename)
        # compensate for actual cutting position (A^AGCTT and G^AATTC)
        cut = sequence[hindIIIpos + 1: ecoRIpos + 1]
        with open(os.path.join(target_directory, filename), 'w') as op:
            op.write('>%s\n%s\n' % (name, cut))
    Last edited by ffinkernagel; 09-28-2010, 07:23 AM.
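With more than 1000 reads, a single read lacking one of the sites would stop the script above at its first `raise`. A variant, sketched here under the assumption that such reads should be skipped and reported rather than abort the run, puts the trimming in a function that returns `None` on failure. The name `trim_between_sites`, its parameters, and the two toy reads are illustrative and not from the thread:

```python
def trim_between_sites(sequence, left_site="AAGCTT", right_site="GAATTC", offset=1):
    """Return the part of sequence between the two sites, or None.

    left_site is searched from the start and right_site from the end.
    offset shifts both slice bounds to the cut position inside the site
    (1 for HindIII A^AGCTT and for EcoRI G^AATTC).
    """
    sequence = sequence.upper()
    left = sequence.find(left_site)
    right = sequence.rfind(right_site)
    # Give up if either site is missing or the sites are in the wrong order.
    if left == -1 or right == -1 or right <= left:
        return None
    return sequence[left + offset:right + offset]


# Toy input: read1 carries both sites, read2 carries neither.
reads = {"read1": "ccAAGCTTacgtGAATTCgg", "read2": "acgtacgt"}
trimmed = {}
skipped = []
for name, seq in reads.items():
    result = trim_between_sites(seq)
    if result is None:
        skipped.append(name)  # report these afterwards instead of aborting
    else:
        trimmed[name] = result
```

Collecting the skipped names in a list makes it easy to go back and inspect just those reads by hand.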


    • #3
      Thanks for the help. I should have also disclosed that I am not versed in any computer language - but am fortunate to have some friends who are. I will take this to them. Thank you again.


      • #4
        you might want to download 'lucy' at


        or, if you prefer a GUI,


        cheers,
        Sven


        • #5
          Thanks very much Sven, I will give those a try.


          • #6
            "I am not versed in any computer language"
            Doing nextgen without knowing Perl/bash/c/java/sed/awk and/or python is like backpacking through South America without knowing any Spanish. You're liable to wind up lost and sick. You need to make the time to learn some basic text stream manipulation.


            • #7
              Well, the OP's question was a bit off-topic, as he still uses PGS data (previous generation sequencing ;-) ), some 1000 ABI traces ... lucy2 should do the job. If not, I totally agree: without basic knowledge of e.g. Perl it gets pretty hard ...

              Sven
