  • emanlee
    Member
    • Apr 2013
    • 15

    format uniref90.xml to database for BLAST

    We want to build a formatted BLAST database from UniRef90.
    We are following an article that used an early release of UniRef90 (say, version 10.0).
    ftp.uniprot.org provides only the XML format for this version.

    How can we convert uniref90.xml to uniref90.fasta? Or can formatdb take an XML file as input?
    Thanks!

    XML file:
    ftp://ftp.uniprot.org/pub/databases/...ref10.0.tar.gz
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    You should be able to convert the UniProt XML to FASTA using Biopython:

    Code:
    from Bio import SeqIO
    count = SeqIO.convert("uniref90.xml", "uniprot-xml", "converted.fasta", "fasta")
    print("Converted %i records" % count)
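On the second half of the question: as far as I know, formatdb expects FASTA (or ASN.1) input rather than XML, so the conversion has to happen first. Once a FASTA file exists, the database build step can be sketched as below. This assumes the BLAST+ `makeblastdb` program is on the PATH; the helper names and the database name `uniref90` are placeholders for illustration, not anything from the thread.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name):
    """Build the BLAST+ command line for a protein database."""
    return [
        "makeblastdb",
        "-in", fasta_path,    # FASTA input produced by the conversion
        "-dbtype", "prot",    # UniRef90 holds protein sequences
        "-out", db_name,      # base name of the database files
    ]


def build_blast_db(fasta_path, db_name):
    """Run makeblastdb if it is installed; return True if it ran."""
    cmd = makeblastdb_command(fasta_path, db_name)
    if shutil.which(cmd[0]) is None:
        print("makeblastdb not found on PATH; would have run:", " ".join(cmd))
        return False
    subprocess.run(cmd, check=True)
    return True
```

With the legacy BLAST tools, the equivalent command would be `formatdb -i converted.fasta -p T -n uniref90`.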


    • emanlee
      Member
      • Apr 2013
      • 15

      #3
      Thank you for your quick reply. I'll try it out.


      • emanlee
        Member
        • Apr 2013
        • 15

        #4
        Code:
        >>> from Bio import SeqIO
        >>> count = SeqIO.convert("uniref90.xml", "uniprot-xml", "uniref90converted.fasta", "fasta")
        >>> print("Converted %i records" % count)
        Converted 0 records

        We checked the start of uniref90.xml (more uniref90.xml):
        Code:
        <?xml version="1.0" encoding="ISO-8859-1" ?>
        <UniRef90 xmlns="http://uniprot.org/uniref" 
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xsi:schemaLocation="http://uniprot.org/uniref http://www.uniprot.org/support/docs/uniref.xsd" 
         releaseDate="2007-03-06" version="10.0"> 
        <entry id="UniRef90_Q3ASY8" updated="2007-03-06">
        <name>Cluster: Parallel beta-helix repeat</name>
        <property type="member count" value="1"/>
        <property type="common taxon" value="Chlorobium chlorochromatii CaD3"/>
        <property type="common taxon ID" value="340177"/>
        <representativeMember>
        <dbReference type="UniProtKB ID" id="Q3ASY8_CHLCH">
        <property type="UniProtKB accession" value="Q3ASY8"/>
        <property type="UniParc ID" value="UPI00005D5563"/>
        <property type="UniRef100 ID" value="UniRef100_Q3ASY8"/>
        <property type="UniRef50 ID" value="UniRef50_Q3ASY8"/>
        <property type="protein name" value="Parallel beta-helix repeat"/>
        <property type="source organism" value="Chlorobium chlorochromatii (strain CaD3)"/>
        <property type="NCBI taxonomy" value="340177"/>
        <property type="length" value="36805"/>
        <property type="isSeed" value="true"/>
        </dbReference>
        <sequence length="36805" checksum="A7A8EA21B9345FF9">
        MKPRFYIEQLEPRILLSGDILSELVPLLSSREASQMQSDYLLEHPEARRVAPLSAVEAAR
        ....
        Could you help us, thanks.
        Last edited by emanlee; 09-22-2013, 04:55 PM.


        • kmcarr
          Senior Member
          • May 2008
          • 1181

          #5
          Originally posted by emanlee:
          Could you help us, thanks.
          Wouldn't it just be much easier to download the UniRef90 FASTA file directly?


          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Originally posted by kmcarr:
            Wouldn't it just be much easier to download the UniRef90 FASTA file directly?
            Indeed, I should have double-checked that it really didn't exist.

            As to the Biopython conversion failing, that is probably a bug - I'd have replied earlier but missed the thread reply alert - sorry.


            • GenoMax
              Senior Member
              • Feb 2008
              • 7142

              #7
              The file linked by kmcarr does not refer to a "version 10.0" that emanlee was asking for. Perhaps that is not important.


              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                OK then... First, this is how I extracted the uniref90.xml file from the FTP site (multiple levels of bundling!):
                Code:
                $ wget ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release10.0/uniref/uniref10.0.tar.gz
                ...
                $ tar -zxvf uniref10.0.tar.gz 
                uniref100.tar
                uniref50.tar
                $ tar -xvf uniref90.tar 
                README
                uniref90.dtd
                uniref90.xml.gz
                $ gunzip uniref90.xml.gz
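The same three unpacking steps can be sketched in Python with the standard library, which may help where tar and gunzip are not available. This assumes the archive layout shown in the listing above (a `uniref90.tar` inside the outer `.tar.gz`, holding `uniref90.xml.gz` with no directory prefix); the function name and its parameters are made up for illustration.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_uniref_xml(outer_tgz, level="uniref90", workdir="."):
    """Unpack outer .tar.gz -> <level>.tar -> <level>.xml.gz -> <level>.xml."""
    workdir = Path(workdir)
    inner_tar = workdir / f"{level}.tar"
    xml_gz = workdir / f"{level}.xml.gz"
    xml_out = workdir / f"{level}.xml"

    # Step 1: copy the per-level tarball out of the gzipped outer archive.
    with tarfile.open(outer_tgz, "r:gz") as outer:
        with outer.extractfile(f"{level}.tar") as src, open(inner_tar, "wb") as dst:
            shutil.copyfileobj(src, dst)

    # Step 2: copy the gzipped XML out of the (uncompressed) inner tarball.
    with tarfile.open(inner_tar, "r:") as inner:
        with inner.extractfile(f"{level}.xml.gz") as src, open(xml_gz, "wb") as dst:
            shutil.copyfileobj(src, dst)

    # Step 3: decompress the XML in chunks rather than reading it all at once.
    with gzip.open(xml_gz, "rb") as src, open(xml_out, "wb") as dst:
        shutil.copyfileobj(src, dst)

    return xml_out
```

Pulling out named members with `extractfile` rather than calling `extractall` keeps the sketch from writing anything other than the three expected files.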
                And here is what the start of the file looks like for me too (same as emanlee reported):
                Code:
                $ head -n 25 uniref90.xml 
                <?xml version="1.0" encoding="ISO-8859-1" ?>
                <UniRef90 xmlns="http://uniprot.org/uniref" 
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                xsi:schemaLocation="http://uniprot.org/uniref http://www.uniprot.org/support/docs/uniref.xsd" 
                 releaseDate="2007-03-06" version="10.0"> 
                <entry id="UniRef90_Q3ASY8" updated="2007-03-06">
                <name>Cluster: Parallel beta-helix repeat</name>
                <property type="member count" value="1"/>
                <property type="common taxon" value="Chlorobium chlorochromatii CaD3"/>
                <property type="common taxon ID" value="340177"/>
                <representativeMember>
                <dbReference type="UniProtKB ID" id="Q3ASY8_CHLCH">
                <property type="UniProtKB accession" value="Q3ASY8"/>
                <property type="UniParc ID" value="UPI00005D5563"/>
                <property type="UniRef100 ID" value="UniRef100_Q3ASY8"/>
                <property type="UniRef50 ID" value="UniRef50_Q3ASY8"/>
                <property type="protein name" value="Parallel beta-helix repeat"/>
                <property type="source organism" value="Chlorobium chlorochromatii (strain CaD3)"/>
                <property type="NCBI taxonomy" value="340177"/>
                <property type="length" value="36805"/>
                <property type="isSeed" value="true"/>
                </dbReference>
                <sequence length="36805" checksum="A7A8EA21B9345FF9">
                MKPRFYIEQLEPRILLSGDILSELVPLLSSREASQMQSDYLLEHPEARRVAPLSAVEAAR
                ACMVVVQSEAPSLLTEDGLMYPFEVGVGEERSSEANAEPTLAADFSADYTFSKSEWDALE
                And here's how many records there seem to be according to grep:
                Code:
                $ grep -c "^<entry id" uniref90.xml 
                2781437
                Biopython 1.61 and 1.62 do appear to have a problem parsing this file. I suspect the XML differs in some way from what the parser expects.
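One possible explanation, judging only from the header shown above: the root element is `UniRef90` in the `http://uniprot.org/uniref` namespace, whereas UniProtKB XML uses `http://uniprot.org/uniprot`, which as far as I can tell is the namespace the "uniprot-xml" parser looks for. A parser searching for UniProtKB entries would then find none and report 0 records without raising an error. A small sketch for checking the root element without loading the whole file; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def root_tag(xml_path):
    """Return (namespace, local_name) of the root element of an XML file."""
    with open(xml_path, "rb") as handle:
        # The first "start" event is always the root element, so we can
        # return immediately and never parse the rest of the file.
        for _event, elem in ET.iterparse(handle, events=("start",)):
            # ElementTree writes namespaced tags as "{namespace}local".
            namespace, _, local = elem.tag.rpartition("}")
            return namespace.lstrip("{"), local
    return None
```

For the file in this thread, the header suggests this would report the `http://uniprot.org/uniref` namespace with root `UniRef90`.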

                Update: Raised here: http://lists.open-bio.org/pipermail/...er/010909.html
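In the meantime, a possible workaround is to stream the file with the standard library and write FASTA directly. This is only a sketch based on the excerpt above: it assumes every `entry` has a `representativeMember` with a `sequence` child, uses the entry `id` plus the cluster `name` as the FASTA header, and ignores the other cluster members. The function name and the `width` parameter are made up for illustration.

```python
import xml.etree.ElementTree as ET

UNIREF_NS = "{http://uniprot.org/uniref}"  # namespace from the file header


def uniref_xml_to_fasta(xml_path, fasta_path, width=60):
    """Write each cluster's representative sequence as FASTA; return the count."""
    count = 0
    with open(xml_path, "rb") as xml_in, \
            open(fasta_path, "w", encoding="utf-8") as fasta_out:
        context = ET.iterparse(xml_in, events=("start", "end"))
        _event, root = next(context)  # first event is the start of the root
        for event, elem in context:
            if event != "end" or elem.tag != UNIREF_NS + "entry":
                continue
            rep = elem.find(UNIREF_NS + "representativeMember")
            seq_elem = rep.find(UNIREF_NS + "sequence") if rep is not None else None
            if seq_elem is not None and seq_elem.text:
                # Sequence text is line-wrapped in the XML; strip all whitespace.
                seq = "".join(seq_elem.text.split())
                name = elem.findtext(UNIREF_NS + "name", default="")
                header = f">{elem.get('id')} {name}".rstrip()
                fasta_out.write(header + "\n")
                for i in range(0, len(seq), width):
                    fasta_out.write(seq[i:i + width] + "\n")
                count += 1
            # Drop finished entries so memory stays flat on a multi-GB file.
            root.clear()
    return count
```

Comparing the returned count against the grep count above would be a quick sanity check that no entries were skipped.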
                Last edited by maubp; 09-27-2013, 07:52 AM.
