> What exactly are we doing with these servers?
> Our CA software, Boulder, uses MySQL-style schemas and queries to manage subscriber accounts and the entire certificate issuance process.
The post doesn't specify performance requirements or application-level targets. It shows a couple of nice latency improvements but doesn't describe the business or technical impact. The closest we get is this:
> If this database isn’t performing well enough, it can cause API errors and timeouts for our subscribers.
What are the SLOs? Were they being met (or not) before vs. after the hardware upgrade? There's a lot of additional context that could have been added to this post. It's not a bad post, but as written it reduces to "our new hardware is faster than our old hardware."
What exactly needs to be stored once the certificate is created and published in the hash tree? It seems like the kind of data that possibly needn't be stored at all, or could be offloaded to something like Glacier for archival.
AFAIK, nobody has suggested removing OCSP from end-entity certificates. The article you linked (and the comment you wrote) is purely about removing it from intermediate CA certificates.
The majority of OCSP traffic will probably be for end-entity certificates; most OCSP validation (in browsers and cryptographic libraries) checks only the end-entity certificate, not the whole chain.
Removing OCSP for intermediate CAs is probably not very relevant to their overall OCSP performance numbers (and to the extent it is, those responses were likely cached already).
There's an argument for not doing OCSP on end-entity certificates if certificate lifetimes can approach the validity window you'd realistically need for OCSP responses anyway.
Suppose you promise to issue OCSP revocations within 48 hours when it's urgent, and your OCSP responses are valid for 48 hours. Then after a problem happens, OCSP revocation takes up to 96 hours to be effective: a client may have fetched a still-valid "good" response just before you revoked.
If you only issue certificates with lifetimes of 96 hours or less, then OCSP adds nothing valuable - the certificates expire before they can effectively be revoked anyway.
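To make the arithmetic concrete, here's a back-of-the-envelope sketch in Python (the 48-hour figures are the hypothetical numbers from above, not any CA's actual policy):

```python
# Worst-case effective revocation latency under the hypothetical policy above.
revocation_deadline_h = 48  # CA promises to publish an urgent revocation within 48h
response_validity_h = 48    # each signed OCSP response is accepted for 48h

# A client may have fetched a still-valid "good" response moments before the
# revocation was published, so the worst case is the sum of the two windows.
worst_case_h = revocation_deadline_h + response_validity_h
print(worst_case_h)  # 96 -- certs living <= 96h gain nothing from OCSP revocation
```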
Let's Encrypt is much closer to this idea (90 days) than many issuers were when it started (typically offering 1-3 years), but not quite close enough to argue revocation isn't valuable. However, the automation Let's Encrypt strongly encourages makes shortening lifetimes practical. Many of us have Let's Encrypt certs automated enough that if they renewed every 48 hours instead of every 60 days we'd barely care.
The intended solution to the excessive OCSP traffic and the privacy risk is OCSP stapling, but TLS servers that can't get stapling right are still ridiculously popular, so that hasn't gone so well.
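Which is a little frustrating, because getting stapling right isn't much configuration. A minimal nginx sketch (paths are placeholders, not anyone's real deployment):

```nginx
server {
    listen 443 ssl;
    ssl_certificate     /etc/ssl/example/fullchain.pem;  # placeholder paths
    ssl_certificate_key /etc/ssl/example/privkey.pem;

    ssl_stapling        on;   # fetch and cache OCSP responses, staple them
    ssl_stapling_verify on;   # verify each response before stapling it
    ssl_trusted_certificate /etc/ssl/example/chain.pem;  # chain for verification
    resolver 127.0.0.53;      # nginx needs DNS to reach the CA's OCSP responder
}
```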
I'm not sure; e.g. Chrome doesn't do OCSP by default, and lots of embedded clients like curl won't either. Unless the protocol is terribly broken, this also seems like the kind of workload where 99% of queries come straight out of cache and should never hit a database.
Let's Encrypt still has to publish OCSP responses for every non-expired leaf certificate, at least often enough that you can always get a fresh OCSP response before the previous one expires. In practice they run a tighter schedule, so that there's a period between "we are not meeting our self-imposed deadline" and "the Internet broke, oops" in which staff can figure out the problem and fix it.
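As an illustration of why the tighter schedule buys headroom (the numbers here are made up, not Let's Encrypt's actual cadence):

```python
# Illustrative, assumed numbers -- not Let's Encrypt's real policy.
validity_days = 7.0  # how long each signed OCSP response remains acceptable
refresh_days = 3.0   # how often the pipeline re-signs every response

# If the signing pipeline stalls, previously published responses keep working
# until the oldest ones expire, so staff have roughly this long to react:
buffer_days = validity_days - refresh_days
print(buffer_days)  # 4.0 days between "missed our deadline" and "Internet broke"
```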
To do this they automatically generate and sign OCSP responses (the vast majority of which just say the certificate is still good) on a periodic cycle, and then deliver them in bulk to a CDN. The CDN is what your client (or your server, if you do OCSP stapling, which you ideally should) talks to when checking OCSP.
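The per-certificate signing step looks roughly like this. A minimal sketch using the Python `cryptography` package (Boulder itself is Go, so this only illustrates the shape of the work, not Let's Encrypt's actual code):

```python
import datetime

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp


def sign_good_response(leaf, issuer, responder_cert, responder_key,
                       validity=datetime.timedelta(days=7)):
    """Build and sign a DER-encoded "this certificate is good" OCSP response."""
    now = datetime.datetime.now(datetime.timezone.utc)
    builder = (
        ocsp.OCSPResponseBuilder()
        .add_response(
            cert=leaf,                    # the end-entity certificate
            issuer=issuer,                # the intermediate that issued it
            algorithm=hashes.SHA1(),      # hash used for the CertID
            cert_status=ocsp.OCSPCertStatus.GOOD,
            this_update=now,
            next_update=now + validity,   # clients must refetch after this
            revocation_time=None,
            revocation_reason=None,
        )
        .responder_id(ocsp.OCSPResponderEncoding.HASH, responder_cert)
    )
    return builder.sign(responder_key, hashes.SHA256()).public_bytes(
        serialization.Encoding.DER
    )
```

Run something like that in a loop over every unexpired certificate, push the resulting blobs to the CDN, and you have the bulk pipeline described above.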
To generate those responses they need a way (hey, a database) to get the set of all certificates which have not yet expired and whether those certificates are revoked or not.
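In MySQL-ish terms that's a query shaped something like this (table and column names are guesses for illustration, not Boulder's actual schema):

```sql
-- Hypothetical schema: one status row per issued certificate.
SELECT serial, status, revokedDate, revokedReason
FROM certificateStatus
WHERE notAfter > NOW()                           -- certificate not yet expired
  AND ocspLastUpdated < NOW() - INTERVAL 3 DAY;  -- response due for re-signing
```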