Isn't the file format a zipped XML nowadays? It'd be hilarious if they said "Her...

skywhopper · on July 12, 2017

It is, but many of the XML values and attributes are just dumps of bitfields and identifiers that are structured exactly as they are in memory. In other words, the XML format was mostly a PR move to stave off accusations of not being "open" enough, and the people who implement support for the new format still have to reverse-engineer a bunch of Office-specific binary nonsense.

jankotek · on July 12, 2017

Office97 was a binary format, XML comed much later as a response to OpenOffice

auxym · on July 12, 2017

Open XML formats (.docx, etc) came with Office 2003.

From what I remember, the EU wanted to standardize an open document format, and had its eye on OpenOffice's ODF, as the most mature of those at the time. Microsoft put out their OOXML format, gave out free patent grants, lobbied hard, and managed to get it into ISO.

If you want some nightmares though, open an office XML file in a text editor some day.
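For anyone who wants to try this without unzipping by hand, here is a minimal Python sketch that lists the parts of an Office XML package and reads one of them. It relies only on the package being an ordinary zip archive. The helper names (`list_parts`, `read_part`, `make_sample_package`) are made up for illustration, and the sample package is a tiny in-memory stand-in, not a valid Word document.

```python
import io
import zipfile


def list_parts(data: bytes) -> list[str]:
    """Return the names of the parts stored in a zip-based Office package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.namelist()


def read_part(data: bytes, name: str) -> str:
    """Return one part of the package decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.read(name).decode("utf-8")


def make_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document>hello</document>")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_sample_package()
    for part in list_parts(sample):
        print(part)
    print(read_part(sample, "word/document.xml"))
```

With a real file, reading the bytes with `open(path, "rb").read()` and passing them to `list_parts` should work the same way.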

vetinari · on July 12, 2017

> Open XML formats (.docx, etc) came with Office 2003.

They came with Office 2007. For Office 2003, there was a plugin that could open the XML files, in a limited way.

davidgerard · on July 12, 2017

No, there was a separate XML format for 2003 before OOXML, called WordprocessingML. Basically nothing else ever used it, but later MS Office and LibreOffice still read it.

nxc18 · on July 12, 2017

I wrote a tool to convert pptx files to markdown. Using python it was maybe a 3 hour project. I'm sure some of the more nuanced formatting is a challenge, but the format really isn't bad and keeps the content quite manageable.
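A rough sketch of what such a converter can look like, using only the Python standard library. This is not the tool mentioned above, just an illustration under a few assumptions: slides live at `ppt/slides/slideN.xml` inside the zip, and visible text sits in DrawingML `a:t` elements grouped into `a:p` paragraphs. The function names and the sample deck builder are hypothetical, and formatting, notes, tables and images are ignored.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs inside slides.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_to_markdown(data: bytes) -> str:
    """Emit one heading per slide and one bullet per non-empty paragraph."""
    lines = []
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        slides = []
        for name in package.namelist():
            match = SLIDE_RE.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        # Sort numerically so slide10 does not land before slide2.
        for number, name in sorted(slides):
            root = ET.fromstring(package.read(name))
            lines.append(f"## Slide {number}")
            for para in root.iter(f"{{{A_NS}}}p"):
                text = "".join(t.text or "" for t in para.iter(f"{{{A_NS}}}t"))
                if text.strip():
                    lines.append(f"- {text}")
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"


def make_sample_pptx(slides: list[list[str]]) -> bytes:
    """Build a minimal in-memory stand-in deck (not a complete, valid pptx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for index, paragraphs in enumerate(slides, start=1):
            body = "".join(
                f"<a:p><a:r><a:t>{p}</a:t></a:r></a:p>" for p in paragraphs
            )
            xml = f'<p:sld xmlns:p="{P_NS}" xmlns:a="{A_NS}">{body}</p:sld>'
            package.writestr(f"ppt/slides/slide{index}.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    deck = make_sample_pptx([["Title", "Point one"], ["Second"]])
    print(pptx_to_markdown(deck))
```

Treating every paragraph as a bullet is the crudest possible mapping; telling titles from body text would mean looking at the placeholder type of each shape, which this sketch skips.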

xyzxyz998 · on July 12, 2017

You're not going to leave us hanging, are you?

l1n · on July 12, 2017

Not OP but it could be https://pypi.python.org/pypi/ppmd/0.1.1

nxc18 · on July 12, 2017

That is in fact what I was referring to, although I don't think it should still be up.

I wrote that and on a whim thought, "I'll make this a pip package," severely underestimating that task. The tool works but I never finished that packaging. The working source should be on github if you're interested.

wglb · on July 12, 2017

> If you want some nightmares though

It is not really all that bad. The hardest part of pulling it all out and reassembling it into what you want is the shared strings aspect.

It is really quite distinct from the memory dump of pre-XML Word/Excel files.
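To make the shared strings point concrete: in an .xlsx package, a cell marked `t="s"` does not hold its text directly. Its `<v>` value is an index into the list of `<si>` entries in `xl/sharedStrings.xml`. The sketch below resolves those indexes with the standard library. The function names and the two sample XML fragments are invented for illustration, and only shared-string and plain-value cells are handled.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace.
M = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

# Hypothetical, heavily trimmed stand-ins for xl/sharedStrings.xml
# and xl/worksheets/sheet1.xml.
SHARED_XML = f"""<sst xmlns="{M}">
  <si><t>Name</t></si>
  <si><t>Qty</t></si>
  <si><r><t>Wid</t></r><r><t>get</t></r></si>
</sst>"""

SHEET_XML = f"""<worksheet xmlns="{M}"><sheetData>
  <row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c></row>
  <row r="2"><c r="A2" t="s"><v>2</v></c><c r="B2"><v>3</v></c></row>
</sheetData></worksheet>"""


def load_shared_strings(xml_text: str) -> list[str]:
    """Return the shared string table, joining rich-text runs per entry."""
    root = ET.fromstring(xml_text)
    return [
        "".join(t.text or "" for t in si.iter(f"{{{M}}}t"))
        for si in root.findall(f"{{{M}}}si")
    ]


def cell_values(sheet_xml: str, shared: list[str]) -> dict[str, str]:
    """Map cell references to text, resolving shared-string indexes."""
    root = ET.fromstring(sheet_xml)
    values = {}
    for cell in root.iter(f"{{{M}}}c"):
        value = cell.find(f"{{{M}}}v")
        if value is None or value.text is None:
            continue
        if cell.get("t") == "s":
            # The stored value is an index into the shared string table.
            values[cell.get("r")] = shared[int(value.text)]
        else:
            values[cell.get("r")] = value.text
    return values


if __name__ == "__main__":
    print(cell_values(SHEET_XML, load_shared_strings(SHARED_XML)))
```

Writing goes the other way: each distinct string has to be added to the table once and every cell rewritten to point at its index, which is probably where the reassembly pain comes from.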

pacaro · on July 12, 2017

For real. On Palladium/NGSCB there was a strong effort to maintain correspondence between documentation and code, to the point that there was an effort to have header files be generated directly from the specs, which were Word docs. The biggest practical challenge was that the extractor had to instantiate Word just to read the text content of paragraphs with the specified style. This is not something that you want in your build pipeline if you can avoid it.
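For comparison, with the later zipped-XML .docx format the same kind of extraction no longer needs a running copy of Word. The sketch below is not what that team had available at the time. It assumes the text of `word/document.xml` has already been read out of the package, that paragraphs carry their style as `w:pStyle` inside `w:pPr`, and that the style ID `HeaderDecl` is a made-up example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, trimmed stand-in for word/document.xml.
SAMPLE_DOCUMENT = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Overview</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="HeaderDecl"/></w:pPr><w:r><w:t>int </w:t></w:r><w:r><w:t>init(void);</w:t></w:r></w:p>
<w:p><w:r><w:t>Plain text.</w:t></w:r></w:p>
</w:body></w:document>"""


def paragraphs_with_style(document_xml: str, style_id: str) -> list[str]:
    """Return the text of every paragraph whose style ID equals style_id."""
    root = ET.fromstring(document_xml)
    found = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        if style is not None and style.get(f"{{{W}}}val") == style_id:
            # A paragraph's text may be split across several runs.
            found.append("".join(t.text or "" for t in para.iter(f"{{{W}}}t")))
    return found


if __name__ == "__main__":
    for line in paragraphs_with_style(SAMPLE_DOCUMENT, "HeaderDecl"):
        print(line)
```

Note that the style ID stored in the XML is not always the display name shown in Word's style picker, so a real extractor would likely need to consult `word/styles.xml` as well.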

wglb · on July 13, 2017

Ewwww.

I woulda cobbled something together to suck the text out of them.

pacaro · on July 13, 2017

Hahaha, me too, but the old Word format really is as bad as everyone says, and we were trying to build a system with formal correspondence from spec to code (and sometimes from spec to proof to code), so having something "cobbled together" really didn't fit the model.

The real problem was using Word as our documentation format, but at Microsoft in the early aughts there really weren't many alternatives.

wglb · on July 14, 2017

Well, I do believe LaTeX was available.

The cobbled-together part, to be successful, would have to pull the text out reliably. And the format is readily documented, and OpenOffice/LibreOffice processes it as well. The code to do the extraction might be ugly, but so long as it reliably produced the text in a CI/CD environment, that would be OK.

pacaro · on July 14, 2017

The constraints were entirely organizational; many better technical solutions were suggested.

oniony · on July 12, 2017

Comed?