Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging annotations and fasta headings

    I have a fasta file like this:
    >comp0_c0_seq1 len=204 path=[4976:0-203]
    ATTGTACTTAATCTA.....
    >comp0_c1_seq1 len=222 path=[8835:0-221]
    GTACAATCACGGCACATT......
    and an annotation file like this:
    comp1558_c0_seq1 repressor protein
    comp8142_c0_seq1 aspartate aminotransferase
    comp8357_c0_seq1 l-xylulose reductase
    comp8387_c0_seq1 succinyl- synthetase alpha
    comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
    And I want to replace the fasta files` heading by the annotation to produce:
    >comp0_c0_seq1 xxx protein
    ATTGTACTTAATCTA.....
    >comp0_c1_seq1 xxx reductase
    GTACAATCACGGCACATT......
    Here is what I wrote with python trying to do this:
    f = open("test")
    ano = open("annotation.txt")
    output = open("merged.fasta",'w')
    for line in ano:
    x = line.spilt('\t')
    for line in f:
    str = line.split(' ')
    if x[0] == str[0]:
    output.write(x)
    else:
    output.write(line)
    But it does not work, says "AttributeError: 'str' object has no attribute 'spilt'
    "
    ... any better ideas to do this or found any way to fix the codes?

    Thanks

  • #2
    Hum, looks like you’ve type “spilt” when you mean “split”.

    You could also convert you fasta file into tab format then just use a join command. Ie it could be:

    File1:
    Code:
    someheader1	GATC….
    someheader2	ATCG….
    File2:
    Code:
    someheader1	othername1
    someheader2	othername2
    So then it would be:
    Code:
    sort -k1,1 File1 > File1.sort
    sort -k1,1 File2 > File2.sort
    join -j1 -o1.1,2.2,1.2 -e “NA” -a1 File1.sort File2.sort | sed ’s/ /_/‘ | see ’s/ /\t/‘ > Newfile
    That will give you something that would look like this that could be transformed back in fasta format:

    Code:
    someheader1_othername1	GATC...
    someheader2_othername2	ATCG...
    And careful with “\t” in that code above. In case you don’t know, you’ll probably need to replace it with Ctrl-V-tab, to get it Unix to recognize it.

    Also you can you the fastx toolkit to convert to and from tab format.

    Comment


    • #3
      This is a syntax error. You need to use the proper function name. You have two similarly named functions -- 'spilt' and 'split'. Only one of them is giving you an error.

      [edit: I see that Wallysb01 beat me to the answer ... and in a much nicer fashion :-) ]

      Comment


      • #4
        you may use awk:
        I need to do something similar to this post (but with a twist). That is why I am asking. unix shell: replace by dictionary I have a dictionary(dict.txt). It is space separated and it reads like...


        Usage: awk -f foo.awk dict.dat user.dat



        NR == FNR {
        rep[$1] = $2
        next
        }

        {
        for (key in rep) {
        gsub(key, rep[key])
        }
        print
        }

        Comment


        • #5
          Thank you Wallysb01,
          But the thing is, it might need some levels of "header matching",
          the headers in annotation file are not completed in the fasta file..

          someheader1 othername1
          someheader3 othername2


          [QUOTE]
          someheader1
          ATTTTCTG...
          someheader2
          ATCGGG...
          someheader3
          ATCTTG...
          someheader4
          ATCGAG...[QUOTE]

          by the way, my code still not work after fix the syntax error...
          then I wrote a R code trying to do the same thing,,no error message comes out, but the output all containing the " "..

          ano <- read.table("annotation.txt",sep="\t")
          fasta <- readLines("TrinityS_condensed.fasta")
          i<-1
          j<-1
          while (i<5000){
          ano1 <- paste(">",ano[i,1],sep="")
          while (j<40604){
          str <- strsplit(fasta[j]," ")
          item <- str[[1]]
          if (ano1 == item[1]) {
          fasta[j] <- paste(fasta[j],ano[i,2])
          }
          j <- j+2
          }
          i <- i+1
          }
          dput(fasta,file="mergeoutput")

          Comment


          • #6
            I am sorry, the awk does not work in this case, awk use $1 for first word and $2 for the second word.

            I just wrote a python script below, and I tested it, it worked fine.
            myfasta file

            >comp1558_c0_seq1 len=204 path=[4976:0-203]
            ATTGTACTTAATCTA.....
            >comp8142_c0_seq1 len=222 path=[8835:0-221]
            GTACAATCACGGCACATT......
            >comp8570_c0_seq1 len=232 path=[3344:0-232]
            GCATCGCTTATCGGTTTA......

            annotation file:

            comp1558_c0_seq1 repressor protein
            comp8142_c0_seq1 aspartate aminotransferase
            comp8357_c0_seq1 l-xylulose reductase
            comp8387_c0_seq1 succinyl- synthetase alpha
            comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase

            script:
            Code:
            with open ("C:/Users/Tang Ming/Desktop/anotation.txt", "r") as annotation:
                anotation_dict = {}
                for line in annotation:
                    line = line.split()
                    if line: #test whether it is an empty line
                        anotation_dict[line[0]]=line[1:]
                    else:
                        continue
            
            # really should not parse the fasta file by myself. there are
            # many parsers already there. you can use module from Biopython 
            ofile = open ("C:/Users/Tang Ming/Desktop/output.txt", "w")
            
            with open ("C:/Users/Tang Ming/Desktop/my_fasta.txt", "r") as myfasta:
                for line in myfasta:
                    if line.startswith (">"):
                        line = line[1:] # skip the ">" character
                        line = line.split()
                        if line[0] in anotation_dict:
                            new_line = ">" + str(line[0]) + " " + " ".join(anotation_dict[line[0]])
                            ofile.write ( new_line + "\n")
                    else:
                        ofile.write(line)
                            
                            
            ofile.close() # always remember to close the file.
            output:
            >comp1558_c0_seq1 repressor protein
            ATTGTACTTAATCTA.....
            >comp8142_c0_seq1 aspartate aminotransferase
            GTACAATCACGGCACATT......
            >comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
            GCATCGCTTATCGGTTTA......

            Comment


            • #7
              thank you crazyhottommy for you codes,
              it works but somehow the lines were not right..
              in fasta file it should looks like 1,3,5,7,9..... lines are the headings and 2,4,6,8,10.... lines are ATCGs...

              and in the output there are ATCGs take one line and some take multiple lines.

              I`m new to python so not sure what was wrong...

              Comment


              • #8
                I see, just changed the second last line code to:

                ofile.write(line + "\n")

                let me know if it works or not.

                Comment


                • #9
                  it seems it still doesn`t work, the multi-lines issue still exists...
                  It is probably caused by the maximum line length limit while read in the data---now sure how to fix it.

                  Comment


                  • #10
                    Originally posted by Simonli View Post
                    it seems it still doesn`t work, the multi-lines issue still exists...
                    It is probably caused by the maximum line length limit while read in the data---now sure how to fix it.
                    see my updated code. you need to add "\n" at the last line of the code.

                    Comment


                    • #11
                      That simple awk command works fine:
                      Code:
                      awk 'NR==FNR{des[">"$1]=$0;next}/^>/ && des[$1]{$0=">"des[$1]}1' annotation_file fasta_file
                      cat annotation_file
                      Code:
                      comp1558_c0_seq1 repressor protein
                      comp8142_c0_seq1 aspartate aminotransferase
                      comp8357_c0_seq1 l-xylulose reductase
                      comp8387_c0_seq1 succinyl- synthetase alpha
                      comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase 
                      comp0_c1_seq1 bli
                      comp0_c0_seq1 blahblah blah blah
                      cat fasta_file
                      Code:
                      >comp0_c0_seq1 len=204 path=[4976:0-203]
                      ATTGTACTTAATCTA.....
                      >comp0_c1_seq1 len=222 path=[8835:0-221]
                      GTACAATCACGGCACATATCTATGCA
                      CACAGTCAGCTAC
                      resulting output:
                      Code:
                      >comp0_c0_seq1 blahblah blah blah
                      ATTGTACTTAATCTA.....
                      >comp0_c1_seq1 bli
                      GTACAATCACGGCACATATCTATGCA
                      CACAGTCAGCTAC

                      Comment


                      • #12
                        Thank you syfo,
                        but I`m not sure how to save the output into a txt file... awk command seems just print the result.

                        Comment


                        • #13
                          Originally posted by Simonli View Post
                          Thank you syfo,
                          but I`m not sure how to save the output into a txt file... awk command seems just print the result.
                          Just redirect the default output into a file by adding at the end of the command something like

                          > name-of-your-output-file.txt

                          Comment


                          • #14
                            Originally posted by syfo View Post
                            Just redirect the default output into a file by adding at the end of the command something like

                            > name-of-your-output-file.txt
                            which gives you more explicitly:

                            Code:
                            awk 'NR==FNR{des[">"$1]=$0;next}/^>/ && des[$1]{$0=">"des[$1]}1' annotation_file fasta_file > output-file.txt
                            Last edited by syfo; 08-11-2014, 06:31 AM. Reason: typo

                            Comment


                            • #15
                              You are awesome! Thank you so much!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X