**maasha** · 09-16-2013, 03:37 AM

Let me see if I understand this correctly. You can use Biopieces (www.biopieces.org) like this (or modify a bit):

Code:

read_fasta -i genes.fna | # read in your sequences
split_seq -w 22 | # split the sequences into overlapping 22-mers, stepping 1 base at a time
grab -r 'GG$' -k SEQ | # grab all k-mers ending in GG
write_fasta -xo 22-mers_terminal_GG.fna
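
If Biopieces isn't installed, the same idea (slide a 22-base window one base at a time and keep the windows ending in GG) can be sketched in plain Python. This only illustrates the approach, it is not how Biopieces does it internally, and the function name `kmers_ending_in` is made up for the example:

```python
def kmers_ending_in(seq, k=22, suffix="GG"):
    """Yield every k-mer of seq (window moved 1 base at a time) ending in suffix."""
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer.endswith(suffix):
            yield kmer


if __name__ == "__main__":
    # Toy sequence (hypothetical): 20 A's followed by GGG
    for kmer in kmers_ending_in("A" * 20 + "GGG"):
        print(kmer)
```

Every window gets built and tested, so this is slower than searching for GG directly, but it is easy to adapt to other window sizes or suffixes.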

**RickBioinf** · 09-16-2013, 03:45 AM

I think he means that you first find the 'GG' and then take the 20 bases before the 'GG' plus the 'GG' itself, rather than splitting the sequences into 22-mers beforehand.
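
A rough Python sketch of that reading, in case it helps: locate each 'GG' first, then slice out the 20 bases in front of it together with the 'GG'. The function name `upstream_of_gg` is hypothetical, and it assumes the sequence has already been joined into a single string without line breaks:

```python
def upstream_of_gg(seq, upstream=20, motif="GG"):
    """Find each occurrence of motif and return it with the `upstream` bases before it."""
    seq = seq.upper()
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        # Skip hits that have fewer than `upstream` bases in front of them
        if pos >= upstream:
            hits.append(seq[pos - upstream:pos + len(motif)])
        # Advance by one base so overlapping hits (e.g. in GGG) are also found
        pos = seq.find(motif, pos + 1)
    return hits


if __name__ == "__main__":
    for hit in upstream_of_gg("A" * 20 + "GGG"):
        print(hit)
```

A 'GG' closer than 20 bases to the start of the sequence is dropped here, since it cannot give a full 22-mer.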

**JamieHeather** · 09-16-2013, 03:47 AM

Do you expect there to be more than one instance per line?

**JamieHeather** · 09-16-2013, 03:54 AM

Wait, never mind, remembered the flag needed:

Code:

grep -ioE '.{20}GG' input.fasta > output

(-i = case insensitive, -o = just outputs match, -E = extended regex, so .{20} means any 20 characters)

Edit - no wait, this won't output overlapping matches, never mind!

You might have to go a little higher level than bash tools.
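
To make the overlap problem concrete, here is a small Python sketch on a toy sequence (the sequence is made up for illustration). A normal regex match consumes the bases it matched, much like grep -o, whereas a zero-width lookahead with a capture group tests every start position:

```python
import re

seq = "A" * 20 + "GGG"  # toy sequence: two overlapping 22-mers end in GG

# Consuming match: after the first hit the search resumes past it,
# so the second, overlapping 22-mer is never reported
consuming = re.findall(r".{20}GG", seq)

# Zero-width lookahead with a capture group: nothing is consumed,
# so each start position is tried in turn
overlapping = [m.group(1) for m in re.finditer(r"(?=(.{20}GG))", seq)]

print(len(consuming), len(overlapping))  # prints: 1 2
```

Note that this still only works per string, so a FASTA record wrapped over several lines has to be joined before matching.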

**dariober** · 09-16-2013, 05:29 AM

Originally posted by Genomics101

I have a set of FASTA files for gene sequences, from GenBank. I need to extract all the instances of "GG," plus the twenty bases upstream. The output would be a list of 22-mers all ending in GG (with line breaks after each), all the instances of this in each gene.

Not exactly a one liner but it might do the trick...

Code:

## Test fasta file
cat test.fa 
>seq1
GGGGAATTAGCTCAAGCGGTAGAGCGCTCCCTTAGCATGCGAGAGGTAGCGGGATCGACG
CCCCCATTCTCTA
>seq2
GGGGGATTAGCTCAAGCGGTAGGGTGCCTGCTTAGCATGCAAGAGGTAGCAGGATCGACG
CCTGCATTCTCCA

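python -c "
import re
# Lookahead with a capture group, so overlapping 22-mers are all reported
pat = re.compile(r'(?=(.{20}GG))')
with open('test.fa') as fin:
    seqname = fin.readline().strip().lstrip('>')
    faseq = []
    while True:
        raw = fin.readline()
        line = raw.strip()
        if line.startswith('>') or raw == '':
            # End of a record (next header or end of file): scan the joined sequence
            for m in pat.finditer(''.join(faseq)):
                print(seqname + '\t' + m.group(1))
            seqname = line.lstrip('>')
            faseq = []
        else:
            faseq.append(line)
        if raw == '':  # readline() returns an empty string only at end of file
            break
"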
## Output:
seq1	CGCTCCCTTAGCATGCGAGAGG
seq1	CTTAGCATGCGAGAGGTAGCGG
seq1	TTAGCATGCGAGAGGTAGCGGG
seq2	GGGGATTAGCTCAAGCGGTAGG
seq2	GGGATTAGCTCAAGCGGTAGGG
seq2	TGCCTGCTTAGCATGCAAGAGG
seq2	TTAGCATGCAAGAGGTAGCAGG

Dario

**maasha** · 09-16-2013, 06:39 AM

OK, another try with Biopieces - this time from GenBank files ...

Code:

read_genbank -i test.gb -f gene | # read in genbank entries with gene feature key
add_ident -k SEQ_NAME | # add an identifier
patscan_seq -cp '20 ... 20 GG' | # scan for pattern
write_tab -xck S_ID,MATCH,S_BEG,S_END,STRAND # output table
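
For anyone who wants a similar table (ID, match, begin, end, strand) without Biopieces, a rough Python analogue could look like the sketch below. It is not Biopieces' implementation. The 0-based inclusive coordinates, the restriction to A/C/G/T/N, and the names `revcomp` and `scan_both_strands` are all my own assumptions for the example:

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
# Lookahead with a capture group so overlapping 22-mers are reported
PATTERN = re.compile(r"(?=([ACGTN]{20}GG))", re.IGNORECASE)


def revcomp(seq):
    """Reverse complement; characters other than A/C/G/T are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(seq_id, seq):
    """Yield (id, match, begin, end, strand) in 0-based inclusive forward coordinates."""
    n = len(seq)
    for m in PATTERN.finditer(seq):
        yield seq_id, m.group(1), m.start(1), m.end(1) - 1, "+"
    for m in PATTERN.finditer(revcomp(seq)):
        # Convert offsets on the reverse complement back to forward-strand positions
        yield seq_id, m.group(1), n - m.end(1), n - m.start(1) - 1, "-"


if __name__ == "__main__":
    # Toy sequence (hypothetical), printed as a tab-separated table
    for row in scan_both_strands("toy", "CC" + "T" * 20 + "A" * 20 + "GG"):
        print("\t".join(str(field) for field in row))
```

If the coordinates need to line up with another tool's output, check that tool's convention (0-based or 1-based) before comparing.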

Extracting 22-mers ending in GG from FASTA using sed/awk?