SEQanswers


SDPA_Pet 02-23-2017 06:41 AM

How can I do this kind of filtering
 
Hi, I have a FASTA file with sequences like the one below. It is an annotated file: each sequence name includes the function name and the organism.

I want to do the following filtering:

1> Extract the sequence names (e.g. "mgm4510423.3|contig02227|RefSeq|73954f841ecd7c512c5428ed1b1a747e accession=[NP_559375.1],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]") to a text file. It would be better to separate the fields with commas, that is, to make three columns: ID, function and organism.

2> After I create the text file above, I can choose the organisms that I want to keep, then filter the FASTA file so that I get all the sequences I need for those particular organisms.

Can any software or Unix command like grep/awk do this?

Code:

>mgm4510423.3|contig02227|RefSeq|73954f841ecd7c512c5428ed1b1a747e accession=[NP_559375.1],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]
AAGAAACGTCGACGTAGCCGCCAGAGTCGTGGCAGGGgTAATGCAGGGAGGCCACCAGGTGGTGGTGACGCACGGCAACGGGCCCCAGGTGGGCTACCTGGCGgAGTTGCaGAgaGACAACGGCACATTTCGGCTGGACGCCCTAAACGCCATGACGCaGGGgATGCTCGGCTACTTCCTTGTCTCTGCGCTTGATAAATACTTAGGCAGGGGGAGGGCCGCGGCTTTGGTGACCAGAGTCGAGGTGGACTGCGACGACCCGGCTTTTaaagaCCCGACcAAGTTCATAGGTCCCCTATACGGCAAGGAaCaGgCTGAGGCCCTCGCACAGAGGTACGGGTGGCAGTTTAGGCAAGACCCAAGAGGAGGCTGGCgtCGCGTCGTCGCGTCGCCTACGCCGCTCAGAAtcGTGGAGATAGAGGCCGTAAAGaGGTTGCTGgACGCGgGTTTCGTCGTTGTGGCGgCGGGCGGCGGCGGTaTACCGCTCTGCGGAGACAGAgaCGTAGAGGGGGTTATAGACAAGGACTTGGCCTCTTCTCTCCTCGCTGTGGAGCTCGGCGCGGACTTCTTCATGATGCTGACCGACATAGACGCCGTCTACCTAAACTACGGGAaGCCGAACCAGAGGAGGCTAGACAGCGTAGGGGCTGACGAGCTGGAGAGGTATTTCGCCGAtGGCcACTTCCCGCCGGGCTCCATGGGGCCGAAGGTGCAGGCCGCGATAAACTTCGTGAAacAAAcGGggaGAaGGGCGGCCATCGGGGCGCTGGAGGAGGGCTAtGACGtGTTCAGGGGAATAAAGGGGACCCAGGTgACGCCTTAGAGCTCGTTTATTGGCTTTTCGTATTCCTCCCTcTtCtGGAGGTCTCGgATCTTgACTACGCCGCGCTCCAGCTCTTTCTTGCCGATTATGATTAGGtACCGCGTGCCTATCTTCAAGATGTATTCAAAGGCCTcTTTtAGGCTTTTCTCGCCCAGCTCCACAGCCACGCTGAAGCCTGCGCTCCTCAGCTTCTtcGCAACTGCCACGGCCTGCGGGTACGCCTCgTCGTCGAAGATGTAGATGTAGTAGTCCAGCGGCTTCTCCACGTTGTGGAGCCCcACGgCCTCcATAAACcTCTCAACGCCGATGGCGAaCCCCAGCGCCGgCGtCtttACGCCGCTGTAGAGCT


vivek_ 02-23-2017 06:53 AM

Quote:

grep "^>" your_fasta > headers.txt
Gives you the headers
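To get the three comma-separated columns (ID, function, organism) asked for in the first question, the headers can be split further with awk. This is only a sketch: the function name headers_to_csv is made up for illustration, and it assumes every header follows the "function=[...],organism=[...]" layout shown in the example and that the values contain no double quotes. The function and organism fields are quoted because function names may themselves contain commas.

```shell
# Hypothetical helper (not from the thread): print ID,"function","organism"
# for every FASTA header line in the file given as the first argument.
headers_to_csv() {
    awk '
    /^>/ {
        id = substr($1, 2)          # first whitespace-delimited token, without ">"
        fn = ""; org = ""
        # "function=[" and "organism=[" are 10 characters long; the match also
        # includes the closing "]", hence RLENGTH - 11 for the value itself.
        if (match($0, /function=\[[^]]*\]/))
            fn = substr($0, RSTART + 10, RLENGTH - 11)
        if (match($0, /organism=\[[^]]*\]/))
            org = substr($0, RSTART + 10, RLENGTH - 11)
        printf "%s,\"%s\",\"%s\"\n", id, fn, org
    }' "$1"
}
```

Something like "headers_to_csv your_fasta > headers.csv" would then write one line per sequence, and the organism column can be copied into a list of names to keep.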

If your entire sequence is in one line, you can use

Quote:

grep -A 1 "Pyrobaculum aerophilum str. IM2" your_fasta > paerophilum.fasta
to select all sequences belonging to that organism.
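If the sequences wrap over several lines, grep -A 1 only returns the first sequence line of each record, and it also prints "--" separator lines between non-adjacent matches. A small awk filter avoids both. This is a sketch under assumptions: the function name filter_by_organism is hypothetical, and it assumes the organism appears in the header exactly as "organism=[NAME]".

```shell
# Hypothetical helper (not from the thread): print every FASTA record whose
# header contains organism=[NAME], including all of its sequence lines.
# Arguments: 1 = organism name, 2 = FASTA file.
filter_by_organism() {
    awk -v org="$1" '
    # Decide once per record, on the header line; index() is a literal
    # substring search, so dots in the name are not treated as wildcards.
    /^>/ { keep = (index($0, "organism=[" org "]") > 0) }
    # Print the header and every following line while keep is true.
    keep
    ' "$2"
}
```

For example, filter_by_organism "Pyrobaculum aerophilum str. IM2" your_fasta > paerophilum.fasta.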

SDPA_Pet 02-23-2017 06:56 AM

For the 2nd question, is it possible to filter by a column? Let's say I have a column listing the names of all the organisms that I want to extract.

vivek_ 02-23-2017 06:58 AM

You can try this

Quote:

while IFS= read -r organism; do grep -A 1 -w -F "$organism" your_fasta.fa > "$organism.fa"; done < your_organisms.txt
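If a single combined output is enough, the whole list can also be applied in one pass over the FASTA file, which also copes with multi-line sequences. Again only a sketch: the function name filter_by_organism_list is hypothetical, the list file is assumed to hold one exact organism name per line (no trailing spaces or Windows line endings), and headers are assumed to carry the name as "organism=[NAME]".

```shell
# Hypothetical helper (not from the thread): print every FASTA record whose
# organism is listed in a names file.
# Arguments: 1 = file with one organism name per line, 2 = FASTA file.
filter_by_organism_list() {
    awk '
    # While reading the names file, remember each non-empty line.
    FILENAME == ARGV[1] { if ($0 != "") wanted[$0] = 1; next }
    # On each header, extract the text between "organism=[" and "]"
    # and keep the record only if that exact name was listed.
    /^>/ {
        keep = 0
        if (match($0, /organism=\[[^]]*\]/))
            keep = (substr($0, RSTART + 10, RLENGTH - 11) in wanted)
    }
    # Print the header and its sequence lines while keep is true.
    keep
    ' "$1" "$2"
}
```

For example, filter_by_organism_list your_organisms.txt your_fasta.fa > selected.fa. Unlike the loop above, this writes all selected organisms to one file rather than one file per organism.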

