Last time, we looked at one of Python’s built-in XML parsers. In this article, we will look at a fun third-party package: lxml from codespeak. It implements the ElementTree API, among other things. The lxml package also supports XPath and XSLT, and includes a SAX API and a C-level API for compatibility with C/Pyrex modules. We’ll just do a few simple things with it, though.
For this article, we will take the examples from the minidom parsing article and see how to parse them with lxml. Here’s an XML example from a program that was written for keeping track of appointments:
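<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
    <appointment>
        <begin>1234360800</begin>
        <duration>1800</duration>
        <subject>Check MS Office website for updates</subject>
        <location></location>
        <uid>604f4792-eb89-478b-a14f-dd34d3cc6c21-1234360800</uid>
        <state>dismissed</state>
    </appointment>
</zAppointments>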
The XML above shows two appointments. The beginning time is in seconds since the epoch; the uid is generated based on a hash of the beginning time and a key (I think); the alarm time is also in seconds since the epoch and should be earlier than the beginning time; and the state records whether the appointment has been snoozed, dismissed or neither. The rest are pretty self-explanatory. Now let’s see how to parse it.
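If the epoch values look opaque, here is a small sketch of how one of them could be turned into a readable date. It is written for Python 3, and the helper name epoch_to_utc is made up for this illustration; it is not part of the appointment program.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    # int() lets us pass the value straight from the XML text (hypothetical helper)
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


# Beginning time of the first appointment in the sample above
begin = epoch_to_utc("1181251680")
print(begin.isoformat())
```

Using UTC explicitly keeps the result the same on every machine; whether the original program stored local time or UTC is not something the XML tells us.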
from lxml import etree
from StringIO import StringIO

#----------------------------------------------------------------------
def parseXML(xmlFile):
    """
    Parse the xml
    """
    f = open(xmlFile)
    xml = f.read()
    f.close()

    tree = etree.parse(StringIO(xml))
    context = etree.iterparse(StringIO(xml))
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print(elem.tag + " => " + text)

if __name__ == "__main__":
    parseXML("example.xml")
First off, we import the needed modules, namely the etree module from the lxml package and the StringIO class from the built-in StringIO module (this is Python 2; in Python 3 the equivalent classes live in the io module). Our parseXML function accepts one argument: the path to the XML file in question. We open the file, read it and close it. Now comes the fun part! We wrap the XML text in a StringIO object and hand that to etree’s parse function. The wrapping is needed because parse expects a file name or a file-like object rather than a string of XML; a string can be parsed directly with etree.fromstring instead.
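To make the difference between file-like input and string input concrete, here is a minimal Python 3 sketch. The one-element XML snippet is made up for the illustration, and the fallback import is an assumption added so the sketch also runs where lxml is not installed (the standard library’s ElementTree offers the same two calls).

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib ElementTree API
    import xml.etree.ElementTree as etree

# Hypothetical sample data, not the full appointment file
XML = b"<appointment><subject>Bring pizza home</subject></appointment>"

# parse() takes a file name or a file-like object and returns an ElementTree
tree = etree.parse(BytesIO(XML))
root_from_parse = tree.getroot()

# fromstring() takes the XML itself and returns the root Element directly
root_from_string = etree.fromstring(XML)

print(root_from_parse.tag, root_from_string.findtext("subject"))
```

Both routes end up with the same root element; fromstring just skips the wrapping step.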
Next we iterate over the context (i.e. the lxml.etree.iterparse object), which yields an (action, element) pair for each element, by default once the element’s closing tag has been read. We print each element’s tag and text, and the conditional if statement replaces the text of empty elements with the word “None” to make the output a little clearer. And that’s it.
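Here is a rough Python 3 sketch of what iterparse hands back, using a cut-down, made-up appointment snippet. The helper name tag_events is hypothetical, and the fallback import is an assumption so the sketch runs without lxml too.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib ElementTree API
    import xml.etree.ElementTree as etree

# Hypothetical sample data with one empty element
XML = b"<appointment><subject>Bring pizza home</subject><state></state></appointment>"


def tag_events(xml_bytes, events=("end",)):
    """Return (action, tag) pairs in the order iterparse reports them."""
    context = etree.iterparse(BytesIO(xml_bytes), events=events)
    return [(action, elem.tag) for action, elem in context]


# Default behaviour: one "end" event per element, children before parents
print(tag_events(XML))

# Asking for "start" events as well shows the nesting
print(tag_events(XML, events=("start", "end")))
```

Because only "end" events are reported by default, a parent element always shows up after its children, which is what the next example relies on.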
Parsing the Book Example
Well, the result of that example was kind of lame. Most of the time, you want to save the data you extract and do something with it, not just print it out to stdout. So for our next example, we’ll create a data structure to contain the results. Our data structure for this example will be a list of dicts. We’ll use the MSDN book example here:
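<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches, centipedes, scorpions and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description>
   </book>
</catalog>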
Now let’s parse this and put it in our data structure!
from lxml import etree
from StringIO import StringIO

#----------------------------------------------------------------------
def parseBookXML(xmlFile):

    f = open(xmlFile)
    xml = f.read()
    f.close()

    tree = etree.parse(StringIO(xml))
    print(tree.docinfo.doctype)
    context = etree.iterparse(StringIO(xml))
    book_dict = {}
    books = []
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print(elem.tag + " => " + text)
        book_dict[elem.tag] = text
        if elem.tag == "book":
            books.append(book_dict)
            book_dict = {}
    return books

if __name__ == "__main__":
    parseBookXML("example2.xml")
This example is pretty similar to our last one, so we’ll just focus on the differences present here. Right before we start iterating over the context, we create an empty dictionary object and an empty list. Then inside the loop, we create our dictionary like this:
book_dict[elem.tag] = text
The text is either elem.text or “None”. Finally, if the tag happens to be “book”, then we’re at the end of a book section and need to add the dict to our list as well as reset the dict for the next book. As you can see, that is exactly what we have done. A more realistic example would be to put the extracted data into a Book class. I have done the latter with json feeds before.
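As a rough idea of what the Book class version might look like, here is a Python 3 sketch. The Book dataclass and the parse_books helper are hypothetical names invented for this illustration, the field names assume the tag names of the MSDN sample, and the fallback import is only there so the sketch runs without lxml installed.

```python
from dataclasses import dataclass
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib ElementTree API
    import xml.etree.ElementTree as etree

# One book from the sample, inlined so the sketch is self-contained
XML = b"""<catalog>
  <book>
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications with XML.</description>
  </book>
</catalog>"""


@dataclass
class Book:
    """Hypothetical container; one field per child tag of <book>."""
    author: str
    title: str
    genre: str
    price: float
    publish_date: str
    description: str


def parse_books(source):
    """Build a list of Book objects from a file name or file-like object."""
    books = []
    for _action, elem in etree.iterparse(source):  # default: "end" events
        if elem.tag == "book":
            # By the time </book> is reported, all of its children are complete
            fields = {child.tag: (child.text or "") for child in elem}
            fields["price"] = float(fields["price"])
            books.append(Book(**fields))
    return books


books = parse_books(BytesIO(XML))
print(books[0].title, books[0].price)
```

Reading the children off the finished book element means there is no shared dict to reset between books, and a typo in a tag name fails loudly instead of silently creating a new key.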
Refactoring the Code
As pointed out by my vigilant readers, I wrote some pretty crappy code. So I have cleaned the code up a bit and hope this is a little better:
from lxml import etree

#----------------------------------------------------------------------
def parseBookXML(xmlFile):
    """"""

    context = etree.iterparse(xmlFile)
    book_dict = {}
    books = []
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print(elem.tag + " => " + text)
        book_dict[elem.tag] = text
        if elem.tag == "book":
            books.append(book_dict)
            book_dict = {}
    return books

if __name__ == "__main__":
    parseBookXML("example.xml")
As you can see, we dropped the StringIO module entirely and put all the file I/O stuff right in the lxml method calls. The rest is the same. Cool huh? As usual, Python rocks!
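Dropping the separate parse call does not cost us the tree, by the way. Here is a small Python 3 sketch, under the assumption that you still want the whole document afterwards: once the iterparse loop has run to the end, the finished tree is available through the context’s root attribute. The two-book XML is made up for the illustration, and the fallback import is only there so the sketch runs without lxml installed.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib ElementTree API
    import xml.etree.ElementTree as etree

# Hypothetical cut-down catalog
XML = (
    b"<catalog>"
    b"<book><title>Midnight Rain</title></book>"
    b"<book><title>Lover Birds</title></book>"
    b"</catalog>"
)

context = etree.iterparse(BytesIO(XML))
for _action, _elem in context:
    pass  # per-element work would go here

# After the loop has consumed every event, the full tree hangs off context.root
root = context.root
titles = [book.findtext("title") for book in root.findall("book")]
print(root.tag, titles)
```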
Wrapping Up
Did you learn anything in this article? I certainly hope so. Python has lots of cool parsing libraries both in its standard library and outside of it. Be sure to check them out and see which one fits your way of programming the best.
Further Reading
- The lxml official website
- An IBM article on lxml
- StringIO documentation
The next time you read the contents of a file into a variable only to turn around and put those contents back into a file like object, I’m going to strangle you! 🙂
Either go with etree.parse(open('file.xml')) or, if you’re really insistent on reading a file out to a variable, just use etree.fromstring(myvar).
You don’t need to call etree.parse if you are using iterparse.
Thanks to both you and brutimus, I was inspired to try to fix the code. I’ve added another section with some refactored code that hopefully won’t “offend” anyone else. Thanks a lot for the constructive feedback!
Ah…Thanks for the info. I updated my last example to reflect this. Thanks again!
– Mike
I do show an example of the parse command…but not quite in the way you’re talking about. Thanks for the suggestion though. I’m still a little green when it comes to XML parsing, I guess.