SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extracting specific lines/rows with awk vishal.rossi Bioinformatics 2 05-06-2013 03:41 AM
How to remove the reads whose k-mers are more or less than an abundance threshold Lisa0508 Bioinformatics 3 07-26-2012 04:52 AM
sed,head and tail dont work? very interesting. hanifk Bioinformatics 1 10-10-2011 08:58 AM
K-mers charltt Bioinformatics 2 06-08-2011 12:03 PM

Reply
 
Thread Tools
Old 09-15-2013, 12:17 PM   #1
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default Extracting 22-mers ending in GG from FASTA using sed/awk?

I feel like there must be a simple one-liner with see/awk that can do this, but can't think of it and I hope y'all can help be out.

I have a set of FASTA files for gene sequences, from GenBank. I need to extract all the instances of "GG," plus the twenty bases up stream. The output would be a list of 22-mers all ending in GG (with line breaks after each), all the instances of this in each gene.

Any help is greatly appreciated!
Genomics101 is offline   Reply With Quote
Old 09-16-2013, 04:37 AM   #2
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Let me see if I understand this correctly. You can use Biopieces (www.biopieces.org) like this (or modify a bit):

Code:
read_fasta -i genes.fna | # read in your sequences
split_seq -w 22 | # split the sequences in 1-base overlapping 22-mers
grab -r 'GG$' -k SEQ | # grab all k-mers ending in GG
write_fasta -xo 22-mers_terminal_GG.fna
maasha is offline   Reply With Quote
Old 09-16-2013, 04:45 AM   #3
RickBioinf
Member
 
Location: Leiden, The Netherlands

Join Date: Sep 2012
Posts: 28
Default

I think he means that you first find the 'GG' and then take the 20 bases before the 'GG' and the 'GG' itself, so not splitting the sequences in 22-mers before hand.
RickBioinf is offline   Reply With Quote
Old 09-16-2013, 04:47 AM   #4
JamieHeather
@jamimmunology
 
Location: London

Join Date: Nov 2012
Posts: 96
Default

Do you expect there to be more than one instance per line?
JamieHeather is offline   Reply With Quote
Old 09-16-2013, 04:54 AM   #5
JamieHeather
@jamimmunology
 
Location: London

Join Date: Nov 2012
Posts: 96
Default

Wait, never mind, remembered the flag needed:

Code:
grep -io ......................GG input.fasta > output
(-i = case insensitive, -o = just outputs match)

Edit - no wait, this won't output overlapping matches, never mind!

You might have to go a little higher level than bash tools.

Last edited by JamieHeather; 09-16-2013 at 04:59 AM.
JamieHeather is offline   Reply With Quote
Old 09-16-2013, 06:29 AM   #6
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by Genomics101 View Post
I have a set of FASTA files for gene sequences, from GenBank. I need to extract all the instances of "GG," plus the twenty bases up stream. The output would be a list of 22-mers all ending in GG (with line breaks after each), all the instances of this in each gene.
Not exactly a one liner but it might do the trick...

Code:
## Test fasta file
cat test.fa 
>seq1
GGGGAATTAGCTCAAGCGGTAGAGCGCTCCCTTAGCATGCGAGAGGTAGCGGGATCGACG
CCCCCATTCTCTA
>seq2
GGGGGATTAGCTCAAGCGGTAGGGTGCCTGCTTAGCATGCAAGAGGTAGCAGGATCGACG
CCTGCATTCTCCA

python -c "
import re
fin= open('test.fa')
seqname= fin.readline().strip().lstrip('>')
faseq= []
while True:
    line= fin.readline().strip()
    if line.startswith('>') or line == '':
        mpat= re.finditer(r'(?=(.{20}GG))', ''.join(faseq))
        for m in mpat:
            print(seqname + '\t' + m.group(1))
        seqname= line.lstrip('>')
        faseq= []
    else:
        faseq.append(line)
    if line == '':
        break
"
## Output:
seq1	CGCTCCCTTAGCATGCGAGAGG
seq1	CTTAGCATGCGAGAGGTAGCGG
seq1	TTAGCATGCGAGAGGTAGCGGG
seq2	GGGGATTAGCTCAAGCGGTAGG
seq2	GGGATTAGCTCAAGCGGTAGGG
seq2	TGCCTGCTTAGCATGCAAGAGG
seq2	TTAGCATGCAAGAGGTAGCAGG
Dario
dariober is offline   Reply With Quote
Old 09-16-2013, 07:39 AM   #7
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

OK, another try with Biopieces - this time from GenBank files ...

Code:
read_genbank -i test.gb -f gene | # read in genbank entries with gene feature key
add_ident -k SEQ_NAME | # add an identifier
patscan_seq -cp '20 ... 20 GG' | # scan for pattern
write_tab -xck S_ID,MATCH,S_BEG,S_END,STRAND # output table
maasha is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:36 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO