It is, but many of the XML values and attributes are just dumps of bitfields and identifiers that are structured exactly as they are in memory. In other words, the XML format was mostly a PR move to stave off accusations of not being "open" enough, and the people who implement support for the new format still have to reverse-engineer a bunch of Office-specific binary nonsense.
Open XML formats (.docx, etc) came with Office 2003.
From what I remember, the EU wanted to standardize an open document format, and had their eye on OpenOffice's ODF, as the most mature of those at the time. Microsoft put out their OOXML format, gave out free patent grants, lobbied hard, and managed to get it into ISO.
If you want some nightmares, though, open an Office XML file in a text editor some day.
No, there was a separate XML format for 2003 before OOXML, called WordprocessingML. Basically nothing else ever used it, but later MS Office and LibreOffice still read it.
I wrote a tool to convert pptx files to markdown. Using Python, it was maybe a 3 hour project. I'm sure some of the more nuanced formatting is a challenge, but the format really isn't bad and keeps the content quite manageable.
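For a sense of scale, here is a minimal sketch of that kind of converter. It is not the tool mentioned above, just an illustration using only the standard library. The function names `pptx_to_markdown` and `slide_number` are made up for this example, and it assumes slides can be ordered by the number in their file name, which is a simplification (the real order is defined in the presentation's relationship parts).

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; slide text lives in <a:p> paragraphs made of <a:t> runs.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def slide_number(name):
    """Pull the integer N out of 'ppt/slides/slideN.xml' (hypothetical helper)."""
    return int(re.search(r"slide(\d+)\.xml$", name).group(1))


def pptx_to_markdown(source):
    """Return a rough markdown outline of a .pptx (path or file-like object)."""
    lines = []
    with zipfile.ZipFile(source) as zf:
        slides = [n for n in zf.namelist()
                  if re.fullmatch(r"ppt/slides/slide\d+\.xml", n)]
        # Assumption: file-name order approximates presentation order.
        for name in sorted(slides, key=slide_number):
            root = ET.fromstring(zf.read(name))
            lines.append(f"## Slide {slide_number(name)}")
            for para in root.iter(A_NS + "p"):
                # Join the runs of one paragraph into a single bullet.
                text = "".join(t.text or "" for t in para.iter(A_NS + "t"))
                if text.strip():
                    lines.append(f"- {text.strip()}")
            lines.append("")
    return "\n".join(lines)
```

Everything beyond plain text (titles versus body, bullet nesting, tables, images, notes) is ignored here, which is where the nuanced formatting work would start.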
That is in fact what I was referring to, although I don't think it should still be up.
I wrote that and on a whim thought, "I'll make this a pip package," severely underestimating that task. The tool works but I never finished the packaging. The working source should be on GitHub if you're interested.
Re "If you want some nightmares though": it is not really all that bad. The hardest part of pulling it all out and reassembling it into what you want is the shared strings aspect.
It is really quite distinct from the memory dump of the pre-XML Word/Excel files.
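To illustrate the shared strings point: in an .xlsx, a text cell usually stores only an index (`t="s"`), and the text itself sits in a separate `xl/sharedStrings.xml` part. A rough sketch of resolving that with the standard library follows. The function names are hypothetical, and the default sheet path `xl/worksheets/sheet1.xml` is an assumption (the real mapping from sheets to parts is in the workbook relationships).

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, used by both the sheet and shared-strings parts.
S_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def load_shared_strings(zf):
    """Return the shared-strings table as a list, indexed by position."""
    if "xl/sharedStrings.xml" not in zf.namelist():
        return []
    root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
    # An <si> entry holds either one <t> or several rich-text runs with <t>;
    # joining every <t> below it covers both (phonetic runs would be included too).
    return ["".join(t.text or "" for t in si.iter(S_NS + "t"))
            for si in root.iter(S_NS + "si")]


def read_cells(source, sheet="xl/worksheets/sheet1.xml"):
    """Map cell references like 'A1' to their string or raw stored value."""
    cells = {}
    with zipfile.ZipFile(source) as zf:
        strings = load_shared_strings(zf)
        root = ET.fromstring(zf.read(sheet))
    for c in root.iter(S_NS + "c"):
        v = c.find(S_NS + "v")
        if v is None or v.text is None:
            continue  # empty cell, or inline string (not handled in this sketch)
        if c.get("t") == "s":
            # Shared string: <v> is an index into the table, not the text.
            cells[c.get("r")] = strings[int(v.text)]
        else:
            cells[c.get("r")] = v.text  # numbers etc. are kept as raw text
    return cells
```

Dates, formulas, inline strings, and number formats all need more handling than this; the sketch covers only the indirection itself.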
For real. On Palladium/NGSCB there was a strong effort to maintain correspondence between documentation and code, to the point that there was an effort to have header files be generated directly from the specs, which were Word docs. The biggest practical challenge was that the extractor had to instantiate Word just to read the text content of paragraphs with the specified style. This is not something that you want in your build pipeline if you can avoid it.
Hahaha, me too, but the old Word format really is as bad as everyone says, and we were trying to build a system with formal correspondence from spec to code (and sometimes from spec to proof to code), so having something "cobbled together" really didn't fit the model.
The real problem was using Word as our documentation format, but at Microsoft in the early oughts there really weren't many alternatives.
The cobbled-together part, to be successful, would only have to pull the text out reliably. And the format is documented, and OpenOffice/LibreOffice process it as well. The code to do the extraction might be ugly, but so long as it reliably produced the text in a CI/CD environment, that would be OK.
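For comparison, with the later .docx format (not the old binary format discussed above) pulling out paragraphs by style needs no running Word at all. A sketch under a few assumptions: `paragraphs_with_style` is a made-up name, the style is matched by its style ID as stored in `w:pStyle` (which can differ from the display name shown in Word), and only the main document part is read.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_with_style(source, style_id):
    """Return the text of every paragraph whose w:pStyle equals style_id."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    matches = []
    for para in root.iter(W_NS + "p"):
        # Paragraph style lives at <w:pPr><w:pStyle w:val="..."/></w:pPr>.
        style = para.find(f"{W_NS}pPr/{W_NS}pStyle")
        if style is not None and style.get(W_NS + "val") == style_id:
            # Concatenate all text runs in the paragraph.
            matches.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return matches
```

Something like `paragraphs_with_style("spec.docx", "CodeDecl")` (both names hypothetical) is the kind of call that could sit in a build step.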
It'd be hilarious if they said "Here's my 2006 tax report, but you need Office 2013 to open it."