  • Extracting 22-mers ending in GG from FASTA using sed/awk?

    I feel like there must be a simple one-liner with sed/awk that can do this, but I can't think of it, and I hope y'all can help me out.

    I have a set of FASTA files for gene sequences, from GenBank. I need to extract all the instances of "GG" plus the twenty bases upstream. The output would be a list of 22-mers all ending in GG (with a line break after each), covering every instance of this in each gene.

    Any help is greatly appreciated!

  • #2
    Let me see if I understand this correctly. You can use Biopieces (www.biopieces.org) like this (or modify a bit):

    Code:
    read_fasta -i genes.fna | # read in your sequences
    split_seq -w 22 | # split the sequences in 1-base overlapping 22-mers
    grab -r 'GG$' -k SEQ | # grab all k-mers ending in GG
    write_fasta -xo 22-mers_terminal_GG.fna


    • #3
      I think he means that you first find the 'GG' and then take the 20 bases before the 'GG' plus the 'GG' itself, rather than splitting the sequences into 22-mers beforehand.
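The two readings (split into overlapping 22-mers and keep those ending in GG, or find each GG and look back 20 bases) can be checked against each other. Below is a minimal Python sketch, assuming the sequence is uppercase and already joined into a single string. The function names `kmers_by_window` and `kmers_by_motif` are made up for illustration, and the sequence is seq1 from the test file posted in #6.

```python
def kmers_by_window(seq, upstream=20, motif="GG"):
    """Slide a window of upstream + len(motif) bases; keep windows ending in motif."""
    k = upstream + len(motif)
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return [w for w in windows if w.endswith(motif)]


def kmers_by_motif(seq, upstream=20, motif="GG"):
    """Find every motif occurrence (overlaps included) and look back `upstream` bases."""
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        # Skip occurrences that do not have enough sequence in front of them
        if pos >= upstream:
            hits.append(seq[pos - upstream:pos + len(motif)])
        # Advance by one base only, so overlapping motifs such as GGG are not missed
        pos = seq.find(motif, pos + 1)
    return hits


seq1 = ("GGGGAATTAGCTCAAGCGGTAGAGCGCTCCCTTAGCATGCGAGAGGTAGCGGGATCGACG"
        "CCCCCATTCTCTA")

by_window = kmers_by_window(seq1)
by_motif = kmers_by_motif(seq1)
print("\n".join(by_motif))
print("same result:", by_window == by_motif)
```

Both routes should give the same list, so the choice is mostly about which one is easier to express in the tool at hand.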


      • #4
        Do you expect there to be more than one instance per line?


        • #5
          Wait, never mind, remembered the flag needed:

          Code:
          grep -ioE '.{20}GG' input.fasta > output
          (-i = case insensitive, -o = output only the match, -E = extended regex so that {20} works)

          Edit - no wait, this won't output overlapping matches, never mind!

          You might have to go a little higher level than bash tools.
          Last edited by JamieHeather; 09-16-2013, 03:59 AM.
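To illustrate the overlap problem mentioned above: a pattern that consumes its match resumes searching after the end of that match, so a 22-mer starting inside a previous hit is skipped. A zero-width lookahead avoids this. The sketch below uses Python's `re` module on seq2 from the test file in #6, and assumes the sequence has already been joined into one string. The variable names are illustrative only.

```python
import re

seq2 = ("GGGGGATTAGCTCAAGCGGTAGGGTGCCTGCTTAGCATGCAAGAGGTAGCAGGATCGACG"
        "CCTGCATTCTCCA")

# Consuming pattern: each match uses up 22 bases, so overlapping hits are lost
consuming = re.findall(r".{20}GG", seq2)

# Lookahead pattern: the match itself is empty, the 22-mer is captured in group 1,
# so the search moves on by a single base and overlapping hits are kept
overlapping = re.findall(r"(?=(.{20}GG))", seq2)

print(len(consuming), "non-overlapping hits:", consuming)
print(len(overlapping), "overlapping hits:", overlapping)
```

On this sequence the consuming pattern finds two 22-mers and the lookahead finds four, which matches the seq2 lines in the output posted in #6.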


          • #6
            Originally posted by Genomics101
            I have a set of FASTA files for gene sequences, from GenBank. I need to extract all the instances of "GG" plus the twenty bases upstream. The output would be a list of 22-mers all ending in GG (with a line break after each), covering every instance of this in each gene.
            Not exactly a one liner but it might do the trick...

            Code:
            ## Test fasta file
            cat test.fa 
            >seq1
            GGGGAATTAGCTCAAGCGGTAGAGCGCTCCCTTAGCATGCGAGAGGTAGCGGGATCGACG
            CCCCCATTCTCTA
            >seq2
            GGGGGATTAGCTCAAGCGGTAGGGTGCCTGCTTAGCATGCAAGAGGTAGCAGGATCGACG
            CCTGCATTCTCCA
            
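            python -c "
            import re
            fin= open('test.fa')
            # The first header line names the first record
            seqname= fin.readline().strip().lstrip('>')
            faseq= []  # sequence lines of the current record
            while True:
                line= fin.readline().strip()
                # A new header or the end of the file closes the current record
                # (note: a blank line is also treated as end of input)
                if line.startswith('>') or line == '':
                    # The lookahead keeps overlapping 22-mers (20 bases + GG)
                    mpat= re.finditer(r'(?=(.{20}GG))', ''.join(faseq))
                    for m in mpat:
                        print(seqname + '\t' + m.group(1))
                    seqname= line.lstrip('>')
                    faseq= []
                else:
                    faseq.append(line)
                if line == '':
                    break
            fin.close()
            "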
            ## Output:
            seq1	CGCTCCCTTAGCATGCGAGAGG
            seq1	CTTAGCATGCGAGAGGTAGCGG
            seq1	TTAGCATGCGAGAGGTAGCGGG
            seq2	GGGGATTAGCTCAAGCGGTAGG
            seq2	GGGATTAGCTCAAGCGGTAGGG
            seq2	TGCCTGCTTAGCATGCAAGAGG
            seq2	TTAGCATGCAAGAGGTAGCAGG
            Dario


            • #7
              OK, another try with Biopieces - this time from GenBank files ...

              Code:
              read_genbank -i test.gb -f gene | # read in genbank entries with gene feature key
              add_ident -k SEQ_NAME | # add an identifier
              patscan_seq -cp '20 ... 20 GG' | # scan for pattern
              write_tab -xck S_ID,MATCH,S_BEG,S_END,STRAND # output table
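For readers without Biopieces, a rough Python approximation of a similar table is sketched below. It is not a reimplementation of `patscan_seq`: it scans the forward strand only, and the coordinate convention (0-based begin, inclusive end) is an assumption rather than something taken from the Biopieces documentation. The function name `scan_forward` is hypothetical, and the input is seq1 from the test file in #6.

```python
import re


def scan_forward(seq_id, seq, upstream=20, motif="GG"):
    """Yield (id, match, begin, end, strand) rows for forward-strand hits.

    Assumed convention: begin is 0-based, end is inclusive.
    """
    pattern = re.compile(
        r"(?=(.{%d}%s))" % (upstream, re.escape(motif)), re.IGNORECASE
    )
    for m in pattern.finditer(seq):
        beg = m.start()  # the lookahead is zero-width, so start() is the hit start
        end = beg + upstream + len(motif) - 1
        yield (seq_id, m.group(1), beg, end, "+")


seq1 = ("GGGGAATTAGCTCAAGCGGTAGAGCGCTCCCTTAGCATGCGAGAGGTAGCGGGATCGACG"
        "CCCCCATTCTCTA")

rows = list(scan_forward("seq1", seq1))
for row in rows:
    print("\t".join(str(field) for field in row))
```

Covering the other strand would mean running the same scan on the reverse complement and converting the coordinates back, which is left out here to keep the sketch short.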
