![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Extracting specific lines/rows with awk | vishal.rossi | Bioinformatics | 2 | 05-06-2013 03:41 AM |
How to remove the reads whose k-mers are more or less than an abundance threshold | Lisa0508 | Bioinformatics | 3 | 07-26-2012 04:52 AM |
sed,head and tail dont work? very interesting. | hanifk | Bioinformatics | 1 | 10-10-2011 08:58 AM |
K-mers | charltt | Bioinformatics | 2 | 06-08-2011 12:03 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Maryland, USA Join Date: May 2012
Posts: 60
|
![]()
I feel like there must be a simple one-liner with see/awk that can do this, but can't think of it and I hope y'all can help be out.
I have a set of FASTA files for gene sequences, from GenBank. I need to extract all the instances of "GG," plus the twenty bases up stream. The output would be a list of 22-mers all ending in GG (with line breaks after each), all the instances of this in each gene. Any help is greatly appreciated! |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Denmark Join Date: Apr 2009
Posts: 153
|
![]()
Let me see if I understand this correctly. You can use Biopieces (www.biopieces.org) like this (or modify a bit):
Code:
read_fasta -i genes.fna | # read in your sequences split_seq -w 22 | # split the sequences in 1-base overlapping 22-mers grab -r 'GG$' -k SEQ | # grab all k-mers ending in GG write_fasta -xo 22-mers_terminal_GG.fna |
![]() |
![]() |
![]() |
#3 |
Member
Location: Leiden, The Netherlands Join Date: Sep 2012
Posts: 28
|
![]()
I think he means that you first find the 'GG' and then take the 20 bases before the 'GG' and the 'GG' itself, so not splitting the sequences in 22-mers before hand.
|
![]() |
![]() |
![]() |
#4 |
@jamimmunology
Location: London Join Date: Nov 2012
Posts: 96
|
![]()
Do you expect there to be more than one instance per line?
|
![]() |
![]() |
![]() |
#5 |
@jamimmunology
Location: London Join Date: Nov 2012
Posts: 96
|
![]()
Wait, never mind, remembered the flag needed:
Code:
grep -io ......................GG input.fasta > output Edit - no wait, this won't output overlapping matches, never mind! You might have to go a little higher level than bash tools. Last edited by JamieHeather; 09-16-2013 at 04:59 AM. |
![]() |
![]() |
![]() |
#6 | |
Senior Member
Location: Cambridge, UK Join Date: May 2010
Posts: 311
|
![]() Quote:
Code:
## Test fasta file cat test.fa >seq1 GGGGAATTAGCTCAAGCGGTAGAGCGCTCCCTTAGCATGCGAGAGGTAGCGGGATCGACG CCCCCATTCTCTA >seq2 GGGGGATTAGCTCAAGCGGTAGGGTGCCTGCTTAGCATGCAAGAGGTAGCAGGATCGACG CCTGCATTCTCCA python -c " import re fin= open('test.fa') seqname= fin.readline().strip().lstrip('>') faseq= [] while True: line= fin.readline().strip() if line.startswith('>') or line == '': mpat= re.finditer(r'(?=(.{20}GG))', ''.join(faseq)) for m in mpat: print(seqname + '\t' + m.group(1)) seqname= line.lstrip('>') faseq= [] else: faseq.append(line) if line == '': break " ## Output: seq1 CGCTCCCTTAGCATGCGAGAGG seq1 CTTAGCATGCGAGAGGTAGCGG seq1 TTAGCATGCGAGAGGTAGCGGG seq2 GGGGATTAGCTCAAGCGGTAGG seq2 GGGATTAGCTCAAGCGGTAGGG seq2 TGCCTGCTTAGCATGCAAGAGG seq2 TTAGCATGCAAGAGGTAGCAGG |
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Denmark Join Date: Apr 2009
Posts: 153
|
![]()
OK, another try with Biopieces - this time from GenBank files ...
Code:
read_genbank -i test.gb -f gene | # read in genbank entries with gene feature key add_ident -k SEQ_NAME | # add an identifier patscan_seq -cp '20 ... 20 GG' | # scan for pattern write_tab -xck S_ID,MATCH,S_BEG,S_END,STRAND # output table |
![]() |
![]() |
![]() |
Thread Tools | |
|
|