  • batch editing of ABI files

    Hello all. I am unsure if this is the correct forum, but I was referred here with a somewhat unusual question.

    I have a run of over 1000 sequences, in ABI format. Each read is relatively short, less than 1 kb. Our goal is to align/analyze each read together - but to do so we must first eliminate sequence from both the 5' and 3' ends.

    We typically do this by "hand" in Geneious, using unique restriction sites at both the 5' and 3' ends. Simply put, we delete sequence upstream of one restriction site and then everything downstream of another restriction site (which is also downstream of the first restriction site.)

    With fewer than 100 sequences, this is just a few hours of work. However, with over 1000 reads, it gets to be very time consuming!

    Does software exist which would allow me to do this sequence editing in a batch format? I believe we're looking for something which would identify two restriction sites in a read, and delete the sequence either upstream or downstream of those sites. Below is the general idea, in case the description above is not clear.

    Thanks for any help,

    Ray


  • #2
    Sure, any scripting language (such as Python) will do this, especially since both the HindIII and the EcoRI recognition sites have no ambiguity whatsoever.
    The only problem might be that your files are in ABI format, so you either need to convert them to something the script can read (FASTA), or find an adaptor. There is one for Python: http://www.bioinformatics.org/groups/?group_id=497

    Here's a simple script off the top of my head that will do it for a folder of FASTA files (one sequence per file)...
    Code:
    #!/usr/bin/python
    import os

    directory = 'path/to/data'
    target_directory = 'path/to/output'
    for filename in os.listdir(directory):
        # read as text so the string searches below also work on Python 3
        with open(os.path.join(directory, filename), 'r') as op:
            fasta = op.read().splitlines()
        name = fasta[0][1:] #cut off >
        sequence = "".join(fasta[1:]).upper() #transform into one line of uppercase bases...
        hindIIIpos = sequence.find("AAGCTT") #search for first HindIII site
        if hindIIIpos == -1:
            raise ValueError("%s did not contain a HindIII site" % filename)
        ecoRIpos = sequence.rfind("GAATTC") #search for last EcoRI site
        if ecoRIpos == -1:
            raise ValueError("%s did not contain an EcoRI site" % filename)
        #both enzymes cut after the first base of their site (A^AGCTT, G^AATTC)
        cut = sequence[hindIIIpos + 1: ecoRIpos + 1] #compensate for actual cutting position
        with open(os.path.join(target_directory, filename), 'w') as op:
            op.write('>%s\n%s\n' % (name, cut))
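
The trimming step on its own can also be written as a small function, which may be easier to test before running it over 1000 reads. This is only a sketch: the name trim_between_sites and its parameters are made up for illustration, and it assumes the HindIII site is the upstream one and the EcoRI site the downstream one, as in the script above. Instead of raising an error it returns None when a site is missing or the sites are in the wrong order, so a batch run can skip and report such reads.

```python
HINDIII = "AAGCTT"  # HindIII cuts A^AGCTT
ECORI = "GAATTC"    # EcoRI cuts G^AATTC


def trim_between_sites(sequence, left_site=HINDIII, right_site=ECORI, cut_offset=1):
    """Return the fragment between the first left_site and the last right_site.

    cut_offset is the number of bases of each site that lie upstream of the
    cut (1 for both HindIII and EcoRI). Returns None if either site is
    missing or the right site is not downstream of the left site.
    """
    seq = sequence.upper()
    left = seq.find(left_site)     # first occurrence of the upstream site
    right = seq.rfind(right_site)  # last occurrence of the downstream site
    if left == -1 or right == -1 or right <= left:
        return None
    return seq[left + cut_offset:right + cut_offset]


if __name__ == "__main__":
    print(trim_between_sites("ccAAGCTTacgtGAATTCgg"))  # AGCTTACGTG
```

In a batch loop, collecting the filenames for which this returns None gives a list of reads that still need a look by hand.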
    Last edited by ffinkernagel; 09-28-2010, 07:23 AM.



    • #3
      Thanks for the help. I should have also disclosed that I am not versed in any computer language - but am fortunate to have some friends who are. I will take this to them. Thank you again.



      • #4
        you might want to download 'lucy' at


        or, if you prefer a GUI,


        cheers,
        Sven



        • #5
          Thanks very much Sven, I will give those a try.



          • #6
            I am not versed in any computer language
            Doing nextgen without knowing Perl/bash/c/java/sed/awk and/or python is like backpacking through South America without knowing any Spanish. You're liable to wind up lost and sick. You need to make the time to learn some basic text stream manipulation.



            • #7
              Well, the OP's question was a bit off-topic, as he still uses PGS data (previous generation sequencing ;-) ), some 1000 ABI traces ... lucy2 should do the job. If not, I totally agree: without basic knowledge of e.g. Perl it gets pretty hard ...

              Sven

