bestkungfu weblog

Ogbuji: Python and XML

Filed in: XML2003, Fri, Dec 12 2003 15:30 PT

Uche Ogbuji presented on processing XML with Python. He says Python has been called “readable pseudocode,” and that he was “hooked” on it once he discovered its expressiveness and readability relative to, for example, Perl.

Python has well-designed Unicode support, plus built-in support for text processing, the Internet, and XML. Generators and iterators, both recent additions, help with working with lists and with program control. He went over a large number of the XML implementations, many with code examples.

xmllib

Not highly recommended: out of date.

xml.parsers.expat

Low-level interface to James Clark’s expat. Very fast, SAX-like interface.
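
To give a feel for how low-level this is, here is a minimal sketch in current Python 3 (not from the talk). The sample verse document and the collect_lines helper are made up for illustration; ParserCreate and the three handler attributes are the standard library's own names.

```python
from xml.parsers import expat

# Hypothetical sample document, not from the talk
DOC = "<verse><line>one</line><line>two</line><line>three</line></verse>"

def collect_lines(xml_text):
    """Return the text of every <line> element using raw expat callbacks."""
    lines = []
    state = {"in_line": False}

    def start(name, attrs):
        if name == "line":
            state["in_line"] = True
            lines.append("")

    def end(name):
        if name == "line":
            state["in_line"] = False

    def chars(data):
        # expat may deliver text in several chunks, so accumulate it
        if state["in_line"]:
            lines[-1] += data

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return lines

print(collect_lines(DOC))
```

You assign plain functions to the parser and track your own state, which is about as close to the metal as Python XML parsing gets.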

xml.sax

SAX implementation. The dominant push model for XML. The parser streams events to a custom handler, whose methods are invoked as callbacks. To get SAX working, you set up a custom class built on sax.ContentHandler. You track your own depth with your own implementations of the startDocument, startElement, and endElement methods. You instantiate the parser with sax.make_parser(), and set its handler to an instance of the custom class you created. SAX is low-memory, portable, and reusable, but it can require sophisticated state management code for big tasks, and it has some syntactical hooks that are noticeably non-Pythonesque. (I didn’t write down the code sample. Sorry. It’s easy to find.)
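
Since the talk's SAX sample didn't make it into these notes, here is a minimal sketch of the steps described above, written in current Python 3 rather than the Python of 2003. The sample verse document and the ThirdLineHandler class are invented for illustration; make_parser, setContentHandler, and the ContentHandler callbacks are standard-library names.

```python
import io
import xml.sax

# Hypothetical sample document, not from the talk
DOC = b"<verse><line>one</line><line>two</line><line>three</line></verse>"

class ThirdLineHandler(xml.sax.ContentHandler):
    """Collect the text of the third <line> element."""

    def startDocument(self):
        # All state lives on the handler; SAX keeps none for you
        self.line_count = 0
        self.capturing = False
        self.text = ""

    def startElement(self, name, attrs):
        if name == "line":
            self.line_count += 1
            self.capturing = (self.line_count == 3)

    def endElement(self, name):
        if name == "line":
            self.capturing = False

    def characters(self, content):
        # Text can arrive in several chunks, so accumulate it
        if self.capturing:
            self.text += content

handler = ThirdLineHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(DOC))
print(handler.text)
```

Even for a task this small, the handler needs a counter and a flag, which is the state management burden mentioned above.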

xml.dom

DOM implementation. There are actually many implementations of DOM in Python (Ogbuji suggests trying pxdom, which he says is a rigorous implementation of DOM level 3).

    from xml.dom import minidom

    # doc is assumed to hold the XML source as a string
    document = minidom.parseString(doc)
    document = document.documentElement  # a property, not a method call
    child = document.childNodes[1]
    print child.attributes
    # note the 'u' before 'line': it denotes Unicode
    lines = [node for node in document.childNodes if node.nodeName == u'line']
    third_child = lines[2]
    third_child.normalize()
    print third_child.firstChild.data
    print document.toxml(encoding="utf-8")

Unlike SAX, DOM needs no complex state management. It also has decent interoperability, and Python generators can make things fun. On the other hand, it is memory-heavy, and its API is also somewhat non-Pythonesque.
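
As a rough illustration of the generator point, here is a small sketch in current Python 3 (my own, not from the talk). The sample document and the child_elements helper are hypothetical; the minidom calls are standard library.

```python
from xml.dom import minidom

# Hypothetical sample document, not from the talk
DOC = "<verse><line>one</line><line>two</line><line>three</line></verse>"

def child_elements(node, name):
    """Lazily yield the child elements of `node` whose tag is `name`."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.nodeName == name:
            yield child

document = minidom.parseString(DOC)
root = document.documentElement
line_texts = [line.firstChild.data for line in child_elements(root, "line")]
print(line_texts)
```

The generator hides the node-type bookkeeping, so the calling code reads like ordinary iteration over a list.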

xml.dom.pulldom

Says Ogbuji: “It’s good for what it’s good for”: pulling bits out of large files. He says it’s easier than SAX, but it’s more of a theoretical ease than a practical one.

    from xml.dom import pulldom

    events = pulldom.parseString(doc)
    line_counter = 0
    for (event, node) in events:
        if event == pulldom.START_ELEMENT:
            if node.tagName == "line":
                line_counter += 1
                if line_counter == 3:
                    # start processing
                    events.expandNode(node)
                    # do the other stuff you want at that level
                    print node.firstChild.data

xmlTextReader

Pull API modeled on .NET’s XmlTextReader interface, built into libxml2 from the GNOME project. The core is implemented in C.

    import cStringIO
    import libxml2

    XMLREADER_START_ELEMENT_NODE_TYPE = 1  # hacked in from C library

    stream = cStringIO.StringIO(doc)
    input_source = libxml2.inputBuffer(stream)
    reader = input_source.newTextReader("urn:bogus")
    line_counter = 0
    while reader.Read():
        if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:
            if reader.Name() == "line":
                line_counter += 1
                if line_counter == 1:
                    node = reader.Expand()
                    print node.children.content
                    # skip what you just expanded so you don't see it twice
                    if reader.Next() != -1:
                        break

ElementTree

“Designed largely out of frustration with DOM’s lack of Python idiom.”

    import cStringIO
    from elementtree.ElementTree import ElementTree

    stream = cStringIO.StringIO(doc)
    tree = ElementTree(file=stream)  # parse from the file-like object
    root = tree.getroot()
    third_child = root.findall('line')[2]
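
ElementTree has since been folded into the Python standard library as xml.etree.ElementTree, so the same idea can be tried without installing anything. A minimal sketch in current Python 3, with a made-up sample document standing in for the talk's:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, not from the talk
DOC = "<verse><line>one</line><line>two</line><line>three</line></verse>"

tree = ET.parse(io.StringIO(DOC))  # parse from a file-like object
root = tree.getroot()
third_child = root.findall("line")[2]  # findall returns a plain Python list
print(third_child.text)
```

Elements behave like lists of their children and text is a plain attribute, which is the Python idiom the DOM lacks.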

gnosis.xml.objectify

Maps XML to Python objects. Part of the Gnosis tool set.

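    import cStringIO
    from gnosis.xml.objectify import XML_Objectify

    stream = cStringIO.StringIO(doc)
    xml_obj = XML_Objectify(stream)
    verse = xml_obj.make_instance()  # root object, named after the root element
    print verse.line[2].PCDATA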

Anobind

Requires Ogbuji’s 4Suite. Uses declarative rules to fine-tune how XML is bound to Python objects. Gives extra tools like XPath, RELAX NG, XML Catalogs and XInclude.

    import anobind
    from Ft.Xml import InputSource

    isrc_factory = InputSource.DefaultFactory
    isrc = isrc_factory.fromString(doc, "urn:bogus")
    binder = anobind.binder()
    binding = binder.read_xml(isrc)
    print binding.verse.line[2].text_content()

Ogbuji has an article on xml.com listing all of the 15 million XML tools for Python.
