It's kind of a downer that the article didn't mention Safari which seems to take a different approach to PDFs. Instead of treating them as "active content", PDF documents are merely rendered with Quartz/Core Graphics and so are free of scripts of any kind. This also has the upside that PDFs look exactly the same everywhere on macOS/iOS, even Quick Look previews.
I like Safari's approach much more than having to hunt down some obscure browser setting or trust that it does the right thing.
For anyone (like me) wondering why PDFs would need to support JavaScript in the first place, the main motivation/use-case appears to be validation and interactivity of embedded forms.
I've seen JavaScript in PDFs used for exploits more often than for every legitimate use combined. It's kind of like if JPEGs could run arbitrary code by design.
I guess this was specified in a time when nobody thought it would one day be possible to embed an SVG document in an HTML DOM and add animations and interactivity in a performant way there.
ninja edit:
It's also from a time when W3C started to lose focus and authority.
It's amazing that SVG was so successful despite this mess and also the confusion potential of CSS in SVG.
Browsers ignore scripts in external SVG images. Don't know if that is for security reasons (JS sandbox unreliable) or because a full isolated JS context per image would be too expensive...
Wasn't there also a time where you could open a raw socket with SVG? SVG is very much from a time when we didn't know what the web was going to be or how it was going to work.
The core issue iirc was that one of the major use cases for SVG was map/navigation systems, where a number of environments required fully standardized systems. But they didn’t want to say “implement a full browser stack”, so they just came up with their own “networking API” that was just “sockets!”.
A lot of this work predated HTML5 and the subsequent rationalization of web specs. At the time (for example) the XHR API was not fully specified, and it was not a separate specification from the rest of the browser stack, so SVG couldn’t just do what it could (in principle) do now.
The SVG WG was not the most functional. I recall that at one point a subset of the committee waited until the end of one person’s work day, rescheduled a meeting to later “that day” (while that person was asleep), and took a vote without them present.
A number of other choices were made to the detriment of the spec for specific use cases (the various performance profiles have fundamentally incompatible rendering behavior rather than gradual decay, etc)
Compression: for some images, you can't use SVG's <use>, but a small script can generate the repetitive bits quite nicely. Also, aperiodic animation (e.g. a double pendulum): SMIL animations can represent a few minutes, but don't try putting a few hours' worth in.
PostScript, the printer file format, is Turing-complete, for different reasons.
I knew a guy who wrote a PostScript document that was a map of the sky at that moment. If you rendered it an hour later it was different again. It used the `file` capabilities of host-based interpreters.
There are "legitimate use cases" for just about everything imaginable on this planet because there will always be a user that goes "I spend all my day in X software wouldn't it be great if it could read my email/monitor my plants/talk to sales/..".
That's how cursed enterprise software develops email clients and chat services. Just say no.
I understand the motivation, but IMHO a PDF should be a static document, hence, something you can trust without worrying.
Since they can contain code, they can carry malicious code. PDFs have, in fact, been used for exploits. Meaning that you shouldn't really trust them. Which is a shame.
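As a rough illustration of how one might check whether a PDF even declares active content, here is a minimal sketch that counts a few PDF name tokens in the raw bytes. The list of names and the function names are my own assumptions for illustration. It will miss anything stored inside compressed object streams and can produce false positives from binary data, so treat a clean result as "nothing obvious", not "safe".

```python
import re

# PDF name tokens commonly associated with active content. This list is an
# illustrative assumption, not an exhaustive or authoritative set.
SUSPICIOUS_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")


def _decode_name_escapes(data: bytes) -> bytes:
    """Undo #xx hex escapes, which PDF permits inside names (e.g. /J#53 is /JS)."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def find_active_content_markers(data: bytes) -> dict:
    """Return {name: count} for each suspicious name found in the raw bytes."""
    decoded = _decode_name_escapes(data)
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # Require that the name is not followed by a letter or digit,
        # so that /JS does not match a longer name such as /JSFoo.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, decoded))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        print(find_active_content_markers(fh.read()))
```

An empty dict only means none of these names appear in plain sight; it says nothing about parser bugs.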
iPhones don't support JS in PDFs, and yet an integer overflow in image decompression code led to a zero-click iMessage exploit.[1] So lack of explicit code support doesn't mean you can trust without worrying. Bugs can be anywhere. iPhones have been known to have crash-causing bugs in Unicode-handling code.[2] So even just text could be a problem. Disclosure: I work at Google but not on Project Zero.
JS in PDF might be a mis-feature, but any security lapse is indeed a bug in the implementation (made doubly worse by Firefox running the JS in a web context).
Yes, removing JS support would get rid of potential security exploits. It doesn't change the fact that said exploits rely on bugs in the implementation.
That's true, but it misses the point that scripting adds orders of magnitude greater complexity to the attack surface.
Fixing other kinds of bugs is fairly straightforward. Update your toolchain, update your dependencies, use the right dependencies, avoid undefined behavior, etc. Fixing scripting issues means participating in an active arms race.
There does seem to be a mismatch between what PDFs are mostly used for, and their full capabilities.
IMO it’d be nice to define a file format for PDF's main use (I think?): papers and documentation. PDF, no scripting, but maybe the ability to zoom and pan figures?
In the engineering world outside software, our CAD tools generate rich interactive functionality into PDFs, including but not limited to 3D models for those doing mechanical work.
I've known about those capabilities for a long time and I've always wondered: How commonly is that used? For what use case(s)? What makes PDF the format of choice for that purpose and not, for example, a CAD file? What PDF apps are popular for creating and using those files?
> Instead, Firefox offers an individual pdfjs.enableScripting preference that can be configured from the about:flags page.
As a long-time FF user I have never heard of about:flags, and it does not work either. about:config contains the setting, along with a million other ones that no ordinary mortal can ever manage.
But it got both wrong (the config address and the setting name). It feels like an LLM error, from writing based on short high-level descriptions of sections.
The article is a bit one-sided: it reviews the topic only from the aspect of rendering PDFs using a copy of pdf.js embedded in a web browser. However, this is not the only copy of pdf.js. It would be interesting to check software like NextCloud or its proprietary workalikes (e.g., PCloud) for their handling of untrusted JavaScript in PDF files shared through these platforms.
They are mostly talking about Chrome, which does not use pdf.js (unless they changed it).
In any case, it's pretty similar in both cases. Even in the client-side rendering case, if there is a sandbox you still have to escape it before your script execution is a real vuln.
Just because PDF files can't hijack your domain, doesn't mean they can't spy on you. Unfortunately, there isn't really an open source tool for sanitizing PDF files.
Dangerzone converts incoming documents into a PDF that is a sequence of filtered/optimised images. It will also handle office docs, epub, and a lot of different image formats (e.g. SVG and others that can possibly contain active content).
You can always open the PDF in a stand-alone PDF viewer that doesn't support javascript. Like xpdf, mupdf, or atril, or one of possibly many other PDF viewers.
(those are the only ones I have installed right now, I think)
Or there's poppler-utils which contains a bunch of tools for dealing with PDF files. You might be able to write a script which uses them to extract the contents into a safe format (maybe even a different PDF file?).
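Something like the following could be a starting point for such a script: a minimal sketch that shells out to `pdftoppm` (part of poppler-utils) to turn each page into a PNG. The function names, the `page` output prefix, and the 150 DPI default are my own choices, not anything prescribed. Rasterizing still runs the PDF through a parser, so it is probably wise to do this in a sandbox or throwaway VM.

```python
import shutil
import subprocess
from pathlib import Path


def build_rasterize_command(pdf_path, out_prefix, dpi=150):
    """Build the pdftoppm argv: one PNG per page at the given resolution."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def rasterize_pdf(pdf_path, out_dir, dpi=150):
    """Render every page of pdf_path to PNG files inside out_dir."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm (poppler-utils) not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"  # hypothetical prefix; pdftoppm appends -N.png
    subprocess.run(
        build_rasterize_command(pdf_path, prefix, dpi),
        check=True,    # raise if pdftoppm exits with an error
        timeout=120,   # do not hang forever on a pathological file
    )
    return sorted(out_dir.glob("page-*.png"))


if __name__ == "__main__":
    import sys

    for png in rasterize_pdf(sys.argv[1], sys.argv[2]):
        print(png)
```

The resulting images carry no scripts or links, at the cost of losing selectable text; `pdftotext` from the same package could recover the text separately.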
JS in SVGs can be dangerous, but you can mitigate it using a CSP or by sending "Content-Disposition: attachment" so the file will be downloaded instead of being executed in your current browser context.
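For what it's worth, a minimal sketch of what those response headers might look like when serving a user-uploaded SVG. The helper name and the exact policy string are assumptions on my part; the right CSP depends on what the SVGs legitimately need.

```python
def untrusted_svg_headers(filename, force_download=True):
    """Return response headers for serving an untrusted SVG file."""
    headers = {
        "Content-Type": "image/svg+xml",
        # Ask the browser not to second-guess the declared content type.
        "X-Content-Type-Options": "nosniff",
        # Block scripts and external loads if the file is opened directly;
        # inline styles stay allowed so ordinary SVG styling still works.
        "Content-Security-Policy": "default-src 'none'; style-src 'unsafe-inline'; sandbox",
    }
    if force_download:
        # Replace characters that could break out of the quoted filename
        # or inject extra header lines.
        safe = filename
        for bad in ('"', "\\", "\r", "\n"):
            safe = safe.replace(bad, "_")
        headers["Content-Disposition"] = f'attachment; filename="{safe}"'
    return headers


if __name__ == "__main__":
    for key, value in untrusted_svg_headers("diagram.svg").items():
        print(f"{key}: {value}")
```

With `force_download=False` the file can still be shown inline, relying on the CSP alone.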
The issue is that PDF has turned into a form-submission format for some US government agencies and other organizations. Until PDFs are no longer used for form submission and are transitioned to static documents, they remain a good exploit entry point.