If you’re a longtime reader, you may remember that I started programming in Python in 2006. Within a year or so, my employer decided to move away from Microsoft Exchange to the open source Zimbra client. Zimbra is an alright client, but it was missing a good way to alert the user to the fact that they had an appointment coming up, so I had to create a way to query Zimbra for that information and show a dialog. What does all this mumbo jumbo have to do with XML though? Well, I thought that using XML would be a great way to keep track of which appointments had been added, deleted, snoozed or whatever. It turned out that I was wrong, but that’s not the point of this story.
In this article, we’re going to look at my first foray into parsing XML with Python. If you do a little research on this topic, you’ll soon discover that Python ships with XML parsers in its standard library, in the xml package. I ended up using the minidom sub-component of that package…at least at first. Eventually I switched to lxml, which implements the ElementTree API, but that’s outside the scope of this article. Let’s take a quick look at some ugly XML that I came up with:
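<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>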
Now we know what I needed to parse. Let’s take a look at the typical way of parsing something like this using minidom in Python.
import xml.dom.minidom
import urllib2

class ApptParser(object):

    def __init__(self, url, flag='url'):
        self.list = []
        self.appt_list = []
        self.flag = flag
        self.rem_value = 0
        xml = self.getXml(url)
        print "xml"
        print xml
        self.handleXml(xml)

    def getXml(self, url):
        try:
            print url
            f = urllib2.urlopen(url)
        except:
            f = url
        #print f
        doc = xml.dom.minidom.parse(f)
        node = doc.documentElement
        if node.nodeType == xml.dom.Node.ELEMENT_NODE:
            print 'Element name: %s' % node.nodeName
            for (name, value) in node.attributes.items():
                #print ' Attr -- Name: %s Value: %s' % (name, value)
                if name == 'reminder':
                    self.rem_value = value

        return node

    def handleXml(self, xml):
        rem = xml.getElementsByTagName('zAppointments')
        appointments = xml.getElementsByTagName("appointment")
        self.handleAppts(appointments)

    def getElement(self, element):
        return self.getText(element.childNodes)

    def handleAppts(self, appts):
        for appt in appts:
            self.handleAppt(appt)
            self.list = []

    def handleAppt(self, appt):
        begin = self.getElement(appt.getElementsByTagName("begin")[0])
        duration = self.getElement(appt.getElementsByTagName("duration")[0])
        subject = self.getElement(appt.getElementsByTagName("subject")[0])
        location = self.getElement(appt.getElementsByTagName("location")[0])
        uid = self.getElement(appt.getElementsByTagName("uid")[0])

        self.list.append(begin)
        self.list.append(duration)
        self.list.append(subject)
        self.list.append(location)
        self.list.append(uid)
        if self.flag == 'file':
            try:
                state = self.getElement(appt.getElementsByTagName("state")[0])
                self.list.append(state)
                alarm = self.getElement(appt.getElementsByTagName("alarmTime")[0])
                self.list.append(alarm)
            except Exception, e:
                print e

        self.appt_list.append(self.list)

    def getText(self, nodelist):
        rc = ""
        for node in nodelist:
            if node.nodeType == node.TEXT_NODE:
                rc = rc + node.data
        return rc
If I recall correctly, this code was based on an example from the Python documentation (or maybe a chapter in Dive Into Python). I still don’t like this code. The url parameter you see in the ApptParser class can be either a url or a file. I had an XML feed from Zimbra that I would check periodically for changes and compare it to the last copy of that XML that I had downloaded. If there was something new, I would add the changes to the downloaded copy. Anyway, let’s unpack this code a little.
In the getXml method, we use an exception handler to try to open the url. If that raises an error, then we assume that the url is actually a file path. Next we use minidom’s parse function to parse the XML. Then we pull the root element out of the parsed document via documentElement. We’ll ignore the conditional as it isn’t important to this discussion (it has to do with my program). Finally, we return the node object.
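The parse-and-grab-the-root step described above can be tried on its own. What follows is only a minimal sketch: it uses Python 3 syntax (the listing above is Python 2) and parses an inline string with minidom.parseString instead of a URL or file, so it runs anywhere. The sample XML and the read_reminder function are made up for illustration; only the zAppointments and reminder names come from the original code.

```python
import xml.dom.minidom as minidom

# Hypothetical sample data; only the element and attribute names
# mirror what the ApptParser code looks for.
SAMPLE = '<zAppointments reminder="15"><appointment/></zAppointments>'


def read_reminder(xml_text, default=0):
    """Return the root element's 'reminder' attribute, or default."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement  # the top-level element of the document
    if root.hasAttribute("reminder"):
        # Attribute values always come back as strings
        return root.getAttribute("reminder")
    return default


reminder = read_reminder(SAMPLE)
print(reminder)
```

Note that the attribute comes back as a string, so a real program would probably convert it with int() before doing any arithmetic on it.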
Technically, the node is XML and we pass it on to the handleXml method. To grab all the appointment instances in the XML, we do this: xml.getElementsByTagName("appointment"). Then we pass that information to the handleAppts method. Yes, there is a lot of passing around various values here and there. It drove me crazy trying to follow this and debug it later on. Anyway, all the handleAppts method does is loop over each appointment and call the handleAppt method to pull some additional information out of it, add the data to a list and add that list to another list. The idea was to end up with a list of lists that held all the pertinent data regarding my appointments.
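To make the "list of lists" idea concrete, here is a rough sketch in Python 3 syntax of what the loop boils down to. The sample feed, the FIELDS tuple and the collect_appointments name are my own inventions for this illustration, and the code assumes every field is present and non-empty, which the real feed did not guarantee.

```python
import xml.dom.minidom as minidom

# Hypothetical feed with two appointments
SAMPLE = """<zAppointments>
  <appointment><begin>100</begin><duration>1800</duration><subject>Lunch</subject></appointment>
  <appointment><begin>200</begin><duration>900</duration><subject>Standup</subject></appointment>
</zAppointments>"""

# The child elements we want from each appointment, in output order
FIELDS = ("begin", "duration", "subject")


def collect_appointments(xml_text):
    """Return one list of field values per <appointment> element."""
    doc = minidom.parseString(xml_text)
    rows = []
    for appt in doc.getElementsByTagName("appointment"):
        # Assumes each field exists and holds a single text node
        row = [appt.getElementsByTagName(name)[0].firstChild.data
               for name in FIELDS]
        rows.append(row)
    return rows


appointments = collect_appointments(SAMPLE)
print(appointments)
# → [['100', '1800', 'Lunch'], ['200', '900', 'Standup']]
```

Building a fresh row inside the loop avoids the shared self.list that the class has to reset after every appointment.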
You will notice that the handleAppt method calls the getElement method which calls the getText method. I don’t know why the original author did it that way. I would have just called the getText method and skipped the getElement one. I guess that can be an exercise for you, dear reader.
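If you want a head start on that exercise, one possible shape for it is a single helper that finds the element and joins its text nodes in one go. This is just a sketch in Python 3 syntax, not the original author’s approach; the get_text name and its default parameter are hypothetical.

```python
import xml.dom.minidom as minidom


def get_text(parent, tag_name, default=""):
    """Return the text inside the first <tag_name> under parent.

    Falls back to default when no such element exists.
    """
    matches = parent.getElementsByTagName(tag_name)
    if not matches:
        return default
    return "".join(node.data for node in matches[0].childNodes
                   if node.nodeType == node.TEXT_NODE)


doc = minidom.parseString(
    "<appointment><subject>Bring pizza home</subject>"
    "<location></location></appointment>")
appt = doc.documentElement

subject = get_text(appt, "subject")              # element with text
location = get_text(appt, "location")            # present but empty
state = get_text(appt, "state", default=None)    # element is missing
print(subject, repr(location), state)
```

Returning a default for a missing element would also remove the need for the try/except around the state and alarmTime lookups.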
Now you know the basics of parsing with minidom. Personally I never liked this method, so I decided to try to come up with a cleaner way of parsing XML with minidom.
Making minidom Easier to Follow
I’m not going to claim that my code is any good, but I will say that I think I came up with something much easier to follow. I’m sure some will argue that the code is not as flexible, but oh well. Here’s a new XML example that we will parse (found on MSDN):
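<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches, centipedes, scorpions and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description>
   </book>
</catalog>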
For this example, we’ll just parse the XML, extract the book titles and print them to stdout. Are you ready? Here we go!
import xml.dom.minidom as minidom

#----------------------------------------------------------------------
def getTitles(xml):
    """
    Print out all titles found in xml
    """
    doc = minidom.parse(xml)
    node = doc.documentElement
    books = doc.getElementsByTagName("book")

    titles = []
    for book in books:
        titleObj = book.getElementsByTagName("title")[0]
        titles.append(titleObj)

    for title in titles:
        nodes = title.childNodes
        for node in nodes:
            if node.nodeType == node.TEXT_NODE:
                print node.data

if __name__ == "__main__":
    document = 'example.xml'
    getTitles(document)
This code is just one short function that accepts one argument, the path to the XML file. We import xml.dom.minidom under the shorter name minidom to make it easier to reference. Then we parse the XML. The first two lines in the function are pretty much the same as in the previous example. We use getElementsByTagName to grab the book elements, then iterate over the result and pull the first title element out of each one. That gives us title element objects rather than strings, so we need to iterate over those as well and pull the plain text out of their child nodes, which is what the second loop and the for loop nested inside it are for.
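For comparison, here is a hedged sketch of the same idea in Python 3 syntax that returns the titles as a list instead of printing them. It parses an inline string so it can be run without example.xml on disk; the two-book SAMPLE is a trimmed-down stand-in for the catalog, and get_titles is a hypothetical name.

```python
import xml.dom.minidom as minidom

# Trimmed-down stand-in for the book catalog
SAMPLE = """<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
</catalog>"""


def get_titles(xml_text):
    """Return the text of the first <title> in every <book>."""
    doc = minidom.parseString(xml_text)
    titles = []
    for book in doc.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Join the text nodes so we get a plain string, not a node
        text = "".join(node.data for node in title.childNodes
                       if node.nodeType == node.TEXT_NODE)
        titles.append(text)
    return titles


titles = get_titles(SAMPLE)
for title in titles:
    print(title)
```

Returning a list rather than printing keeps the parsing separate from the output, which makes the function easier to reuse and to test.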
That’s it. There is no more.
Wrapping Up
Well, I hope this rambling article taught you a thing or two about parsing XML with Python’s built-in minidom parser. We will be looking at XML parsing some more in future articles. If you have a method or module that you like, feel free to point me to it and I’ll take a look.
Additional Reading
- Python minidom official documentation
- Python and XML wiki page
- Python’s other built-in XML parsers: ElementTree, sax, expat and dom