  • retrieving <Hit_def> information from XML output

    Greetings, I am trying to retrieve the putative taxonomic identifications of 16S/18S rRNA genes from a HiSeq Illumina run using BLASTN (from the BLAST+ package). Thus far I have been relying on Biopython to parse the data. I'm able to retrieve the e-values for each query, alignment lengths and so on for all of the hits using commands like the ones below.

    ###############################################
    >>> from Bio.Blast import NCBIXML
    >>> blast = NCBIXML.parse(open('16SxmlResults', 'rU'))
    >>> for record in blast:
    ...     print record.alignments[0].hsps[0].score
    ###############################################

    The above prints the score of the top high-scoring pair (HSP) of the first hit for each query to standard output.
    However, the piece of information I can't seem to access is located in the <Hit_def> element. It looks like this:

    <Hit_def>JR951091.270.2233 Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;mitochondria;Pisum sativum (pea)

    I have looked into the Biopython Bio.Blast.Record documentation, as well as the tutorial, and can't find any way to retrieve this information. I have also tried using ElementTree to parse the data. This works, but I'm having a hard time looping through the whole file (there are ~4000 entries).

    If anyone has any suggestions or can provide some guidance, I would sincerely appreciate it. Thanks,

    -Tony

  • #2
    You want the alignment's hit_def attribute, e.g.

    Code:
    from Bio.Blast import NCBIXML
    blast = NCBIXML.parse(open('16SxmlResults', 'rU'))
    for record in blast:
        for align in record.alignments:
            for hsp in align.hsps:
                print hsp.score, align.hit_def
    Tip: Explore dir(x) and help(x) at the Python prompt where x is an unfamiliar class.
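
For the ElementTree route mentioned in the original question, here is a minimal sketch (a standard-library alternative, not the Biopython approach above) that streams through a large BLAST XML file with iterparse and pulls out each <Hit_def>. The function names iter_hit_defs and split_hit_def are made up for illustration, and the splitting assumes SILVA-style descriptions of the form "accession rank1;rank2;..." as in the example hit.

```python
import io
import xml.etree.ElementTree as ET


def iter_hit_defs(xml_source):
    """Yield (query_def, hit_def) pairs from BLAST XML (-outfmt 5).

    xml_source may be a filename or a binary file-like object.
    """
    query_def = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Iteration_query-def":
            # Remember which query the following hits belong to.
            query_def = elem.text
        elif elem.tag == "Hit_def":
            yield query_def, elem.text
        elif elem.tag == "Iteration":
            # Free memory once a whole query's results have been processed.
            elem.clear()


def split_hit_def(hit_def):
    """Split 'accession rank1;rank2;...' into (accession, [ranks]).

    Assumes the accession is everything before the first space.
    """
    accession, _, lineage = hit_def.partition(" ")
    ranks = lineage.split(";") if lineage else []
    return accession, ranks
```

Looping over iter_hit_defs("16SxmlResults") and passing each hit_def to split_hit_def would then give the accession and the taxonomy ranks for every hit, without loading all ~4000 entries into memory at once.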

    • #3
      I am very new to Python. The code above only prints the results; could you please tell me how to save them to a file (.csv or .txt)?

      Thanks

      • #4
        Easy way: When you run BLAST+ rather than asking for XML output with
        Code:
        -outfmt 5
        ask for tabular output with
        Code:
        -outfmt 6
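        (or ask for CSV with -outfmt 10 if you prefer). Note that the default tabular columns do not include the hit description, so request it explicitly, e.g. -outfmt "6 std stitle".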

        Hard way: Convert the BLAST XML into tabular format using a script like https://github.com/peterjc/galaxy_bl..._to_tabular.py
        Last edited by maubp; 12-09-2014, 08:41 AM. Reason: formatting
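
If you would rather stay in Python, the loop from earlier in the thread can write to a file instead of printing. The following is only a sketch: write_hits_csv and the output file name are hypothetical, the column choice is arbitrary, and it assumes records shaped like those returned by Bio.Blast.NCBIXML.parse (with query, alignments, hit_def, hsps, expect, bits and align_length attributes).

```python
import csv


def write_hits_csv(records, out_path):
    """Write one CSV row per HSP and return the number of rows written.

    `records` is any iterable of objects shaped like Biopython's
    Bio.Blast.Record.Blast (query, alignments -> hit_def, hsps).
    """
    rows = 0
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["query", "hit_def", "evalue", "bit_score", "align_length"])
        for record in records:
            for align in record.alignments:
                for hsp in align.hsps:
                    writer.writerow(
                        [record.query, align.hit_def, hsp.expect, hsp.bits, hsp.align_length]
                    )
                    rows += 1
    return rows


# Typical use with Biopython (assumed installed; file names are placeholders):
# from Bio.Blast import NCBIXML
# with open("16SxmlResults") as xml_handle:
#     write_hits_csv(NCBIXML.parse(xml_handle), "16S_hits.csv")
```

Because the csv module quotes fields when needed, hit descriptions containing commas or semicolons end up in a single column.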

        • #5
          Originally posted by maubp View Post
          You want the alignment's hit_def attribute, e.g.

          Code:
          from Bio.Blast import NCBIXML
          blast = NCBIXML.parse(open('16SxmlResults', 'rU'))
          for record in blast:
              for align in record.alignments:
                  for hsp in align.hsps:
                      print hsp.score, align.hit_def
          Tip: Explore dir(x) and help(x) at the Python prompt where x is an unfamiliar class.
          What does 'rU' refer to? A second input file?

          • #6
            Originally posted by bernardo_bello View Post
            What does 'rU' refers to? a second input file?
            No, it is the file mode. 'r' opens the file for reading. As for 'U', the Python 2 documentation for open() says:

            Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program.
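
On Python 3 the 'U' flag is no longer needed: text mode translates newlines by default, 'U' was deprecated, and it was removed entirely in Python 3.11, so plain open(path) or open(path, 'r') is enough. A small sketch to illustrate (the temporary file and its contents are made up for the demonstration):

```python
import os
import tempfile

# Write a file mixing Unix (\n), old Macintosh (\r) and Windows (\r\n) line endings.
path = os.path.join(tempfile.mkdtemp(), "mixed_newlines.txt")
with open(path, "wb") as handle:
    handle.write(b"unix\nmac\rwindows\r\nlast")

# Plain text mode already applies universal newline translation in Python 3,
# so every line ending is seen as "\n" by the program.
with open(path, "r") as handle:
    lines = handle.read().split("\n")

print(lines)
```

All three conventions come back as '\n', which is the same behaviour the 'U' flag provided in Python 2.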
