Seqanswers Leaderboard Ad

**Wallysb01** · 07-22-2014, 07:45 AM

Hum, looks like you’ve type “spilt” when you mean “split”.

You could also convert you fasta file into tab format then just use a join command. Ie it could be:

File1:

Code:

someheader1	GATC….
someheader2	ATCG….

File2:

Code:

someheader1	othername1
someheader2	othername2

So then it would be:

Code:

sort -k1,1 File1 > File1.sort
sort -k1,1 File2 > File2.sort
join -j1 -o1.1,2.2,1.2 -e “NA” -a1 File1.sort File2.sort | sed ’s/ /_/‘ | see ’s/ /\t/‘ > Newfile

That will give you something that would look like this that could be transformed back in fasta format:

Code:

someheader1_othername1	GATC...
someheader2_othername2	ATCG...

And careful with “\t” in that code above. In case you don’t know, you’ll probably need to replace it with Ctrl-V-tab, to get it Unix to recognize it.

Also you can you the fastx toolkit to convert to and from tab format.

**westerman** · 07-22-2014, 07:48 AM

This is a syntax error. You need to use the proper function name. You have two similarly named functions -- 'spilt' and 'split'. Only one of them is giving you an error.

[edit: I see that Wallysb01 beat me to the answer ... and in a much nicer fashion :-) ]

**crazyhottommy** · 07-22-2014, 10:15 AM

you may use awk:

replace text based on a dictionary

http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary

I need to do something similar to this post (but with a twist). That is why I am asking. unix shell: replace by dictionary I have a dictionary(dict.txt). It is space separated and it reads like...

Usage: awk -f foo.awk dict.dat user.dat

String Functions (The GNU Awk User’s Guide)

http://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

String Functions (The GNU Awk User’s Guide)

Arrays (The GNU Awk User’s Guide)

http://www.gnu.org/software/gawk/manual/html_node/Arrays.html

Arrays (The GNU Awk User’s Guide)

NR == FNR {
rep[$1] = $2
next
}

{
for (key in rep) {
gsub(key, rep[key])
}
print
}

**Simonli** · 07-22-2014, 10:42 AM

Thank you Wallysb01,
But the thing is, it might need some levels of "header matching",
the headers in annotation file are not completed in the fasta file..

someheader1 othername1
someheader3 othername2

[QUOTE]
someheader1
ATTTTCTG...
someheader2
ATCGGG...
someheader3
ATCTTG...
someheader4
ATCGAG...[QUOTE]

by the way, my code still not work after fix the syntax error...
then I wrote a R code trying to do the same thing,,no error message comes out, but the output all containing the " "..

ano <- read.table("annotation.txt",sep="\t")
fasta <- readLines("TrinityS_condensed.fasta")
i<-1
j<-1
while (i<5000){
ano1 <- paste(">",ano[i,1],sep="")
while (j<40604){
str <- strsplit(fasta[j]," ")
item <- str[[1]]
if (ano1 == item[1]) {
fasta[j] <- paste(fasta[j],ano[i,2])
}
j <- j+2
}
i <- i+1
}
dput(fasta,file="mergeoutput")

**crazyhottommy** · 07-22-2014, 11:23 AM

I am sorry, the awk does not work in this case, awk use $1 for first word and $2 for the second word.

I just wrote a python script below, and I tested it, it worked fine.
myfasta file

>comp1558_c0_seq1 len=204 path=[4976:0-203]
ATTGTACTTAATCTA.....
>comp8142_c0_seq1 len=222 path=[8835:0-221]
GTACAATCACGGCACATT......
>comp8570_c0_seq1 len=232 path=[3344:0-232]
GCATCGCTTATCGGTTTA......

annotation file:

comp1558_c0_seq1 repressor protein
comp8142_c0_seq1 aspartate aminotransferase
comp8357_c0_seq1 l-xylulose reductase
comp8387_c0_seq1 succinyl- synthetase alpha
comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase

script:

Code:

with open ("C:/Users/Tang Ming/Desktop/anotation.txt", "r") as annotation:
    anotation_dict = {}
    for line in annotation:
        line = line.split()
        if line: #test whether it is an empty line
            anotation_dict[line[0]]=line[1:]
        else:
            continue

# really should not parse the fasta file by myself. there are
# many parsers already there. you can use module from Biopython 
ofile = open ("C:/Users/Tang Ming/Desktop/output.txt", "w")

with open ("C:/Users/Tang Ming/Desktop/my_fasta.txt", "r") as myfasta:
    for line in myfasta:
        if line.startswith (">"):
            line = line[1:] # skip the ">" character
            line = line.split()
            if line[0] in anotation_dict:
                new_line = ">" + str(line[0]) + " " + " ".join(anotation_dict[line[0]])
                ofile.write ( new_line + "\n")
        else:
            ofile.write(line)
                
                
ofile.close() # always remember to close the file.

output:
>comp1558_c0_seq1 repressor protein
ATTGTACTTAATCTA.....
>comp8142_c0_seq1 aspartate aminotransferase
GTACAATCACGGCACATT......
>comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
GCATCGCTTATCGGTTTA......

**Simonli** · 07-22-2014, 01:16 PM

thank you crazyhottommy for you codes,
it works but somehow the lines were not right..
in fasta file it should looks like 1,3,5,7,9..... lines are the headings and 2,4,6,8,10.... lines are ATCGs...

and in the output there are ATCGs take one line and some take multiple lines.

I`m new to python so not sure what was wrong...

**crazyhottommy** · 07-22-2014, 01:42 PM

I see, just changed the second last line code to:

ofile.write(line + "\n")

let me know if it works or not.

change_fasta_header.py

https://gist.github.com/crazyhottommy/a59df92b48e6ad3f4630

GitHub Gist: instantly share code, notes, and snippets.

**Simonli** · 07-28-2014, 10:40 AM

it seems it still doesn`t work, the multi-lines issue still exists...
It is probably caused by the maximum line length limit while read in the data---now sure how to fix it.

**crazyhottommy** · 07-31-2014, 07:50 AM

Originally posted by Simonli View Post

it seems it still doesn`t work, the multi-lines issue still exists...
It is probably caused by the maximum line length limit while read in the data---now sure how to fix it.

see my updated code. you need to add "\n" at the last line of the code.

change_fasta_header.py

https://gist.github.com/crazyhottommy/a59df92b48e6ad3f4630

GitHub Gist: instantly share code, notes, and snippets.

**syfo** · 08-01-2014, 01:56 AM

That simple awk command works fine:

Code:

awk 'NR==FNR{des[">"$1]=$0;next}/^>/ && des[$1]{$0=">"des[$1]}1' annotation_file fasta_file

cat annotation_file

Code:

comp1558_c0_seq1 repressor protein
comp8142_c0_seq1 aspartate aminotransferase
comp8357_c0_seq1 l-xylulose reductase
comp8387_c0_seq1 succinyl- synthetase alpha
comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase 
comp0_c1_seq1 bli
comp0_c0_seq1 blahblah blah blah

cat fasta_file

Code:

>comp0_c0_seq1 len=204 path=[4976:0-203]
ATTGTACTTAATCTA.....
>comp0_c1_seq1 len=222 path=[8835:0-221]
GTACAATCACGGCACATATCTATGCA
CACAGTCAGCTAC

resulting output:

Code:

>comp0_c0_seq1 blahblah blah blah
ATTGTACTTAATCTA.....
>comp0_c1_seq1 bli
GTACAATCACGGCACATATCTATGCA
CACAGTCAGCTAC

**Simonli** · 08-11-2014, 06:09 AM

Thank you syfo,
but I`m not sure how to save the output into a txt file... awk command seems just print the result.

**syfo** · 08-11-2014, 06:28 AM

Originally posted by Simonli View Post

Thank you syfo,
but I`m not sure how to save the output into a txt file... awk command seems just print the result.

Just redirect the default output into a file by adding at the end of the command something like

> name-of-your-output-file.txt

**syfo** · 08-11-2014, 06:30 AM

Originally posted by syfo View Post

Just redirect the default output into a file by adding at the end of the command something like

> name-of-your-output-file.txt

which gives you more explicitly:

Code:

awk 'NR==FNR{des[">"$1]=$0;next}/^>/ && des[$1]{$0=">"des[$1]}1' annotation_file fasta_file > output-file.txt

**Simonli** · 08-11-2014, 06:32 AM

You are awesome! Thank you so much!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 49 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Merging annotations and fasta headings

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News