I recently had to process some pretty big XML docs from the USPTO. Each file is about 60 MB and (oddly enough) contains several thousand individual documents all concatenated together. So the file isn’t valid XML... but that’s a different story.
The reason for writing this is to show a quick demo of how to use SAX to process a large XML file. You can read about SAX here, but basically, SAX (Simple API for XML) is an event-driven model. Instead of reading an entire tree structure into memory, which can be realllly sloooow, it reads the data as a stream and raises events along the way.
The code below uses the Nokogiri library (which as a side note has this odd, albeit entertaining tagline: “XML is like violence – if it doesn’t solve your problems, you are not using enough of it.”). Most other XML parsing libraries also have SAX implementations.
The code below looks for the root node of each doc and builds a string for each individual document. After a doc has been assembled, it can be processed via the more pleasant:
doc = Nokogiri::HTML(xml)
serial = doc.css("application-reference document-id doc-number").inner_text
So this ends up being a sort of hybrid, and it is much, much faster than loading the entire file at once. It would be faster still not to parse each doc a second time, but the docs have too much nested complexity, and I need XPath (or CSS selectors, as above) to get at the data I want.