SEQanswers

Old 02-23-2017, 06:41 AM   #1
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 203
How can I do this kind of filtering?

Hi, I have a FASTA file with sequences like the one below. It is an annotated file: each sequence name includes the function name and the organism.

I want to do this kind of filtering.

1> Extract the sequence names (e.g. "mgm4510423.3|contig02227|RefSeq|73954f841ecd7c512c5428ed1b1a747e accession=[NP_559375.1],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]") to a text file. It would be better to separate the fields by commas, that is, to make three columns: ID, function, and organism.

2> After I create the text file above, I can choose the organisms that I want to keep and then filter the FASTA file, so that I get all the sequences I need for particular organisms.

Can any software or Unix command like grep/awk do this?

Code:
>mgm4510423.3|contig02227|RefSeq|73954f841ecd7c512c5428ed1b1a747e accession=[NP_559375.1],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]
AAGAAACGTCGACGTAGCCGCCAGAGTCGTGGCAGGGgTAATGCAGGGAGGCCACCAGGTGGTGGTGACGCACGGCAACGGGCCCCAGGTGGGCTACCTGGCGgAGTTGCaGAgaGACAACGGCACATTTCGGCTGGACGCCCTAAACGCCATGACGCaGGGgATGCTCGGCTACTTCCTTGTCTCTGCGCTTGATAAATACTTAGGCAGGGGGAGGGCCGCGGCTTTGGTGACCAGAGTCGAGGTGGACTGCGACGACCCGGCTTTTaaagaCCCGACcAAGTTCATAGGTCCCCTATACGGCAAGGAaCaGgCTGAGGCCCTCGCACAGAGGTACGGGTGGCAGTTTAGGCAAGACCCAAGAGGAGGCTGGCgtCGCGTCGTCGCGTCGCCTACGCCGCTCAGAAtcGTGGAGATAGAGGCCGTAAAGaGGTTGCTGgACGCGgGTTTCGTCGTTGTGGCGgCGGGCGGCGGCGGTaTACCGCTCTGCGGAGACAGAgaCGTAGAGGGGGTTATAGACAAGGACTTGGCCTCTTCTCTCCTCGCTGTGGAGCTCGGCGCGGACTTCTTCATGATGCTGACCGACATAGACGCCGTCTACCTAAACTACGGGAaGCCGAACCAGAGGAGGCTAGACAGCGTAGGGGCTGACGAGCTGGAGAGGTATTTCGCCGAtGGCcACTTCCCGCCGGGCTCCATGGGGCCGAAGGTGCAGGCCGCGATAAACTTCGTGAAacAAAcGGggaGAaGGGCGGCCATCGGGGCGCTGGAGGAGGGCTAtGACGtGTTCAGGGGAATAAAGGGGACCCAGGTgACGCCTTAGAGCTCGTTTATTGGCTTTTCGTATTCCTCCCTcTtCtGGAGGTCTCGgATCTTgACTACGCCGCGCTCCAGCTCTTTCTTGCCGATTATGATTAGGtACCGCGTGCCTATCTTCAAGATGTATTCAAAGGCCTcTTTtAGGCTTTTCTCGCCCAGCTCCACAGCCACGCTGAAGCCTGCGCTCCTCAGCTTCTtcGCAACTGCCACGGCCTGCGGGTACGCCTCgTCGTCGAAGATGTAGATGTAGTAGTCCAGCGGCTTCTCCACGTTGTGGAGCCCcACGgCCTCcATAAACcTCTCAACGCCGATGGCGAaCCCCAGCGCCGgCGtCtttACGCCGCTGTAGAGCT

Last edited by Brian Bushnell; 02-23-2017 at 08:45 AM.
Old 02-23-2017, 06:53 AM   #2
vivek_
Bioinformatician
 
Location: Denmark

Join Date: Jul 2012
Posts: 158

Quote:
grep "^>" your_fasta > headers.txt
This gives you the headers.
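To get the three columns asked for in the first question (ID, function, organism), the headers can be split with awk. This is only a sketch under a few assumptions: the file names `sample.fasta` and `headers.tsv` and the second record are made up for illustration, every header follows the `function=[...],organism=[...]` layout shown in the first post, and the bracketed values contain no `]`. It writes tab-separated columns rather than commas, because function names may themselves contain commas.

```shell
# Hypothetical sample input; file names and the second record are placeholders.
cat > sample.fasta <<'EOF'
>mgm4510423.3|contig02227|RefSeq|73954f841ecd7c512c5428ed1b1a747e accession=[NP_559375.1],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]
AAGAAACGTCGACG
>id2|contigX accession=[XX_000001.1],function=[hypothetical protein],organism=[Escherichia coli K-12]
ACGTACGT
EOF

# For each header line: the ID is the text before the first space (minus ">"),
# and the function and organism are the values inside the square brackets.
awk '/^>/ {
    id = substr($1, 2)
    fname = ""; org = ""
    # "function=[" and "organism=[" are 10 characters long; the match also
    # includes the closing "]", so the value is RLENGTH - 11 characters.
    if (match($0, /function=\[[^]]*\]/)) fname = substr($0, RSTART + 10, RLENGTH - 11)
    if (match($0, /organism=\[[^]]*\]/)) org = substr($0, RSTART + 10, RLENGTH - 11)
    printf "%s\t%s\t%s\n", id, fname, org
}' sample.fasta > headers.tsv
```

The resulting file opens directly in a spreadsheet, and `cut -f3 headers.tsv | sort -u` lists the distinct organisms to choose from.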

If each sequence is on a single line, you can use

Quote:
grep -A 1 "Pyrobaculum aerophilum str. IM2" your_fasta > paerophilum.fasta
to select all sequences belonging to that organism.
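If the sequences are wrapped over several lines, `grep -A 1` only returns the first sequence line of each record. A possible alternative is an awk filter that sets a flag on each header and prints lines while the flag is set. The sketch below uses made-up records and the placeholder file name `sample.fasta`, and assumes the organism appears in the header exactly as `organism=[...]`.

```shell
# Hypothetical sample input with one multi-line record.
cat > sample.fasta <<'EOF'
>seq1 accession=[A],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]
AAGAAACG
TCGACGTA
>seq2 accession=[B],function=[hypothetical protein],organism=[Escherichia coli K-12]
ACGTACGT
>seq3 accession=[C],function=[ATP synthase],organism=[Pyrobaculum aerophilum str. IM2]
GGGGCCCC
EOF

# On each header line, decide whether the record is wanted (plain substring
# comparison, no regex); every line is printed while the flag is set.
awk -v org="Pyrobaculum aerophilum str. IM2" '
    /^>/ { keep = (index($0, "organism=[" org "]") > 0) }
    keep
' sample.fasta > paerophilum.fasta
```

Because the organism name is compared as a literal string, the dots in "str. IM2" are not treated as regex wildcards.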
Old 02-23-2017, 06:56 AM   #3
SDPA_Pet
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 203

For the second question: is it possible to filter by a column? Let's say I have a column listing the names of all the organisms that I want to extract.
Old 02-23-2017, 06:58 AM   #4
vivek_
Bioinformatician
 
Location: Denmark

Join Date: Jul 2012
Posts: 158

You can try this

Quote:
while IFS= read -r organism; do grep -A 1 -w -F -- "$organism" your_fasta.fa > "$organism.fa"; done < your_organisms.txt
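As a variation on the loop above, the organism list can also be loaded into awk once, so the FASTA file is read a single time and multi-line sequences are kept whole. This is a sketch, not a tested pipeline: the sample records and the output name `selected.fasta` are invented, the list file is assumed to be non-empty with one organism name per line, and, unlike the loop, everything goes into one combined file instead of one file per organism.

```shell
# Hypothetical sample input and organism list.
cat > sample.fasta <<'EOF'
>seq1 accession=[A],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]
AAGAAACG
TCGACGTA
>seq2 accession=[B],function=[hypothetical protein],organism=[Escherichia coli K-12]
ACGTACGT
>seq3 accession=[C],function=[ATP synthase],organism=[Pyrobaculum aerophilum str. IM2]
GGGGCCCC
>seq4 accession=[D],function=[carbamate kinase],organism=[Sulfolobus solfataricus P2]
TTTTAAAA
EOF

cat > your_organisms.txt <<'EOF'
Pyrobaculum aerophilum str. IM2
Sulfolobus solfataricus P2
EOF

awk '
    # First file: remember each non-empty line as a wanted organism.
    NR == FNR { if ($0 != "") wanted[$0] = 1; next }
    # Second file: on each header, look up the organism=[...] value.
    /^>/ {
        keep = 0
        if (match($0, /organism=\[[^]]*\]/)) {
            name = substr($0, RSTART + 10, RLENGTH - 11)
            keep = (name in wanted)
        }
    }
    keep
' your_organisms.txt sample.fasta > selected.fasta
```

The lookup is an exact match on the whole organism name, so a list entry such as "Pyrobaculum" alone would not select anything.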