I bet if you work out how many nodes and attributes are in that XML file, multip...

Roboprog · on Nov 4, 2014

Exactly. If you have to have a "DOM" of an XML, it's going to take A LOT of memory.

I recently replaced some generalized Java code (my own, alas) that built a DOM of some XML with substring operations to find the element text in the only element in the XML that mattered. (fortunately, I could guarantee that the text had no markup related special chars in it) This caused about 6 GC cycles (in a 32 bit JVM) to disappear from this process.

However, if you have to display or navigate the document tree, you are stuck with the memory hogging DOM.

throwawayaway · on Nov 4, 2014

nah, the guy is right there's better tech since 2010, just nobody has bothered to swap out the old style DOM stuff for pugixml/rapidxml/vtdxml.

http://pugixml.org/benchmark/

specifically: http://pugixml.files.wordpress.com/2010/10/dom-memory-compar...

Roboprog · on Nov 4, 2014

I meant "DOM" as in the general "tree in memory" data structure. You did find an implementation that uses about 1/3 the memory of some of the piggier ones, though.

throwawayaway · on Nov 7, 2014

closer to 1/5th.

also, in actual use -> those are peak values, so while libxml will hog the memory until the DOM is freed, the streaming style parsers hold on to the smaller amount of memory for a shorter time.

Roboprog · on Nov 13, 2014

In the problem above, I did try using the SAX parser that the WebLogic-JVM "factory factory factory" returned, but the element text was about 2 MB, and it wanted to return it in pieces by repeatedly firing the event handler.

Manually finding the index of the open/close elements and doing a substring to get the element text was SO incredibly much faster and smaller, albeit something that only worked for a VERY specific situation.

zimpenfish · on Nov 6, 2014

I wonder why those haven't got traction with things like Mozilla though. Presumably -someone- would at least have looked at them if they're that big a win?