  • Extracting 22-mers ending in GG from FASTA using sed/awk?

    I feel like there must be a simple one-liner with sed/awk that can do this, but I can't think of it and I hope y'all can help me out.

    I have a set of FASTA files for gene sequences, from GenBank. I need to extract every instance of "GG" plus the twenty bases upstream. The output would be a list of 22-mers all ending in GG (one per line), covering every instance in each gene.

    Any help is greatly appreciated!

  • #2
    Let me see if I understand this correctly. You can use Biopieces (www.biopieces.org) like this (or modify a bit):

    Code:
    read_fasta -i genes.fna | # read in your sequences
    split_seq -w 22 | # split the sequences in 1-base overlapping 22-mers
    grab -r 'GG$' -k SEQ | # grab all k-mers ending in GG
    write_fasta -xo 22-mers_terminal_GG.fna
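
For anyone without Biopieces installed, the same split-then-filter idea can be sketched in plain Python. This is only an illustration of the approach, not how split_seq or grab work internally; the function name `kmers_ending_in` and the toy sequence are made up for the example.

```python
def kmers_ending_in(seq, k=22, suffix="GG"):
    """Slide a k-base window along seq one base at a time and keep
    the windows that end with suffix (hypothetical helper)."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return [w for w in windows if w.endswith(suffix)]


# Toy sequence (assumption, not real data): 20 A's followed by GGG,
# so two overlapping 22-mers end in GG.
toy = "A" * 20 + "GGG"
for kmer in kmers_ending_in(toy):
    print(kmer)
```

Because every window is generated before filtering, overlapping hits are kept automatically, at the cost of building many windows that are then thrown away.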

    • #3
      I think he means that you first find the 'GG' and then take the 20 bases before it plus the 'GG' itself, rather than splitting the sequences into 22-mers beforehand.

      • #4
        Do you expect there to be more than one instance per line?

        • #5
          Wait, never mind, remembered the flag needed:

          Code:
          grep -ioE '.{20}GG' input.fasta > output
          (-i = case insensitive, -o = output only the match, -E = extended regex, so .{20} means exactly 20 characters)

          Edit - no wait, this won't output overlapping matches, never mind!

          You might have to go a little higher level than bash tools.
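
To make the overlap problem concrete, here is a small Python sketch comparing an ordinary regex scan, which consumes each match like grep -o does, with a zero-width lookahead, which does not. The toy sequence is an assumption chosen so that two 22-mers overlap.

```python
import re

# Toy sequence (assumption): 20 A's followed by GGG.
seq = "A" * 20 + "GGG"

# Ordinary matching consumes the matched bases, so the second,
# overlapping 22-mer is never reported.
non_overlapping = re.findall(r".{20}GG", seq)

# A lookahead matches without consuming anything, so the scan
# advances one base at a time and every 22-mer is captured.
overlapping = [m.group(1) for m in re.finditer(r"(?=(.{20}GG))", seq)]

print(len(non_overlapping), "match without lookahead")
print(len(overlapping), "matches with lookahead")
```

On this toy input the plain pattern reports one 22-mer and the lookahead reports two.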
          Last edited by JamieHeather; 09-16-2013, 03:59 AM.

          • #6
            Originally posted by Genomics101
            I have a set of FASTA files for gene sequences, from GenBank. I need to extract every instance of "GG" plus the twenty bases upstream. The output would be a list of 22-mers all ending in GG (one per line), covering every instance in each gene.
            Not exactly a one liner but it might do the trick...

            Code:
            ## Test fasta file
            cat test.fa 
            >seq1
            GGGGAATTAGCTCAAGCGGTAGAGCGCTCCCTTAGCATGCGAGAGGTAGCGGGATCGACG
            CCCCCATTCTCTA
            >seq2
            GGGGGATTAGCTCAAGCGGTAGGGTGCCTGCTTAGCATGCAAGAGGTAGCAGGATCGACG
            CCTGCATTCTCCA
            
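            python -c "
            import re

            def report(name, chunks):
                # Zero-width lookahead so that overlapping 22-mers are all reported.
                for m in re.finditer(r'(?=(.{20}GG))', ''.join(chunks)):
                    print(name + '\t' + m.group(1))

            seqname = None
            faseq = []
            with open('test.fa') as fin:
                for line in fin:
                    line = line.strip()
                    if line.startswith('>'):
                        # New record: scan the previous one before starting over.
                        if seqname is not None:
                            report(seqname, faseq)
                        seqname, faseq = line.lstrip('>'), []
                    elif line:
                        faseq.append(line)
            # Scan the last record in the file.
            if seqname is not None:
                report(seqname, faseq)
            "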
            ## Output:
            seq1	CGCTCCCTTAGCATGCGAGAGG
            seq1	CTTAGCATGCGAGAGGTAGCGG
            seq1	TTAGCATGCGAGAGGTAGCGGG
            seq2	GGGGATTAGCTCAAGCGGTAGG
            seq2	GGGATTAGCTCAAGCGGTAGGG
            seq2	TGCCTGCTTAGCATGCAAGAGG
            seq2	TTAGCATGCAAGAGGTAGCAGG
            Dario

            • #7
              OK, another try with Biopieces - this time from GenBank files ...

              Code:
              read_genbank -i test.gb -f gene | # read in genbank entries with gene feature key
              add_ident -k SEQ_NAME | # add an identifier
              patscan_seq -cp '20 ... 20 GG' | # scan for pattern
              write_tab -xck S_ID,MATCH,S_BEG,S_END,STRAND # output table
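
If a table of hits with coordinates is what you are after, something similar can be sketched in plain Python. This is not Biopieces' implementation: the function names, the 1-based inclusive coordinates, and reporting minus-strand hits in forward-strand coordinates are all assumptions made for the example.

```python
import re

# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_22mers(name, seq):
    """Yield (name, match, begin, end, strand) for every 22-mer ending
    in GG on either strand. Coordinates are 1-based, inclusive, and
    always given on the forward strand (assumed convention)."""
    n = len(seq)
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for m in re.finditer(r"(?=(.{20}GG))", s, flags=re.IGNORECASE):
            start, end = m.start(), m.start() + 22
            if strand == "+":
                yield (name, m.group(1), start + 1, end, strand)
            else:
                # Map reverse-complement offsets back to the forward strand.
                yield (name, m.group(1), n - end + 1, n - start, strand)


# Toy record (assumption): one hit on the plus strand.
for hit in scan_22mers("toy", "T" + "A" * 20 + "GG"):
    print("\t".join(str(field) for field in hit))
```

Drop the minus-strand pass if you only care about the strand as given in the file.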
