My personal experience, over 20 years, is that if something like credential rotation isn't automated, it simply doesn't happen. If it does happen, it's a major hassle and probably doesn't get done correctly (it causes downtime, something gets missed, it probably isn't documented, etc.). There's also huge organizational inertia against doing it. So, for example, when an employee leaves and you don't have this automated, it likely doesn't happen because "why would we do all this work, it's not like they were a bad person and they aren't stupid".
If you automate this and run it on a schedule shorter than 30 days, then it's pretty likely that it won't cause unexpected downtime, that you'll have monitoring in place to make sure it actually gets done, and that, even if you forget to trigger it for a specific reason (e.g., the aforesaid person leaves the organization), it will happen within a reasonable period of time.
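As a rough illustration of the monitoring side, here is a minimal Python sketch that flags any credential whose last rotation is older than the 30-day window. The function name, the `last_rotated` mapping, and the credential names are all hypothetical; in a real setup the timestamps would come from whatever system records your rotations.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy: every credential must have been rotated in the last 30 days.
MAX_AGE = timedelta(days=30)


def stale_credentials(last_rotated, now=None, max_age=MAX_AGE):
    """Return the names of credentials whose last rotation is older than max_age.

    last_rotated maps a (hypothetical) credential name to the timezone-aware
    datetime of its most recent rotation.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, ts in last_rotated.items() if now - ts > max_age)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    history = {
        "payments-api-key": now - timedelta(days=45),  # rotation was missed
        "db-password": now - timedelta(days=3),        # rotated recently
    }
    for name in stale_credentials(history, now=now):
        print(f"ALERT: {name} has not been rotated within {MAX_AGE.days} days")
```

Something like this would run on its own schedule and page someone if the list is non-empty, so a silently failing rotation job gets noticed.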
In terms of securing such a system... you need to make sure that you separate the system into appropriate pieces with limited access. So, for example, you want a job that is run with an account that only has access to rotate the credentials. It can't use them for anything, just rotate them. Services that consume those credentials should not be able to update them, just use them. You can then ensure that the process that rotates credentials executes in a highly locked-down part of your infrastructure.
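To make that separation concrete, here is a small Python sketch with an in-memory store. The role names, the `SecretStore` class, and the `rotate` function are all made up for illustration; in practice the same split would be enforced by the access policies of whatever secrets manager you use, not by application code.

```python
import secrets


class AccessDenied(Exception):
    """Raised when a role attempts an action it has not been granted."""


# Hypothetical policy: the rotation job can only write secrets,
# consuming services can only read them.
PERMISSIONS = {
    "rotator": {"write"},
    "consumer": {"read"},
}


class SecretStore:
    """Toy in-memory stand-in for a real secrets manager."""

    def __init__(self):
        self._values = {}

    def _check(self, role, action):
        if action not in PERMISSIONS.get(role, set()):
            raise AccessDenied(f"role {role!r} may not {action} secrets")

    def write(self, role, name, value):
        self._check(role, "write")
        self._values[name] = value

    def read(self, role, name):
        self._check(role, "read")
        return self._values[name]


def rotate(store, name):
    """Rotation job: generates a fresh value and stores it, never reads one back."""
    store.write("rotator", name, secrets.token_urlsafe(32))


if __name__ == "__main__":
    store = SecretStore()
    rotate(store, "third-party-api-key")
    print(len(store.read("consumer", "third-party-api-key")) > 0)
    try:
        store.read("rotator", "third-party-api-key")
    except AccessDenied as exc:
        print(exc)
```

The point of the split is that compromising the rotation job does not reveal the current secret, and compromising a consumer does not let an attacker replace it.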
Indeed, automating this process also encourages you to create processes with limited access, rather than relying on administrators who have so many responsibilities that you probably just throw them in the equivalent of a wide-open sudoers file and call it a day.
It sounds complicated, but if you have decent abstractions, this kind of stuff is actually pretty easy to accomplish.
I'd be interested in seeing any end-to-end examples of how people are doing this in practice.
For example, suppose you're maintaining a SaaS application and you have a private key to access some third party API that certain parts of your back end code need. How do you automate this process, so you change your private key on a regular schedule and update all affected hosts so your application code picks up the new one?
Ideally this needs to avoid introducing risks like a single point of failure, a new attack surface, or the possibility of losing access to the API altogether if something goes wrong. Assuming the old key is immediately invalidated when you request a new one via some API, you also need a real-time way of looking up the currently active key from any of your application hosts when they need it, again without creating single points of failure, etc.
No doubt this could be done with enough work, but it doesn't feel like a trivial problem.
The key rotation issue, where there are extra complications around synchronizing multiple keyholders to use the updated credential, is neatly solved by having two keys, both valid, and rotating one at a time, only after all nodes that need it have moved to the new one.
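A minimal Python sketch of that two-key scheme, simulated entirely in memory. The `Provider` class stands in for a third-party API that exposes two simultaneously valid key slots; its methods, the slot names, and the `Node` class are assumptions made for illustration, not any real service's API.

```python
import secrets


class Provider:
    """Hypothetical third-party API with two key slots, both valid at once."""

    def __init__(self):
        self.slots = {
            "primary": secrets.token_hex(16),
            "secondary": secrets.token_hex(16),
        }

    def regenerate(self, slot):
        # Replaces the key in one slot; the other slot is left untouched.
        self.slots[slot] = secrets.token_hex(16)
        return self.slots[slot]

    def is_valid(self, key):
        return key in self.slots.values()


class Node:
    """An application host that holds one key and uses it for API calls."""

    def __init__(self, key):
        self.key = key

    def call(self, provider):
        return provider.is_valid(self.key)


def rotate_keys(provider, nodes, active_slot):
    """Move every node to a fresh key, then retire the old one.

    Returns the name of the slot that is active after the rotation.
    """
    idle_slot = "secondary" if active_slot == "primary" else "primary"
    # 1. Regenerate the slot nobody is using, so no node is affected.
    new_key = provider.regenerate(idle_slot)
    # 2. Move nodes over one at a time; both keys are valid during this step.
    for node in nodes:
        node.key = new_key
        if not node.call(provider):
            raise RuntimeError("node failed with the new key, aborting rotation")
    # 3. Only after every node has moved, invalidate the old key.
    provider.regenerate(active_slot)
    return idle_slot


if __name__ == "__main__":
    provider = Provider()
    nodes = [Node(provider.slots["primary"]) for _ in range(3)]
    active = rotate_keys(provider, nodes, "primary")
    print(active, all(node.call(provider) for node in nodes))
```

At no point during the rotation does a node hold an invalid key, which is what removes the need for a real-time lookup of the "current" key.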
I would say that an API that requires authorization keys is incomplete if it doesn't provide the tools to manage them securely. How could a service even offer scalability and security if it doesn't support two keys with rotation? It's a "non-trivial problem" because consistent, available, distributed API key rotation is not just hard, it's impossible. (See: the CAP theorem.)
That doesn't solve your problem, but it means that you should take this complaint to whatever API service offering you are using.
If the skeleton of your system starts with these processes in place, then you can evolve the architecture while maintaining these invariants. If something is an invariant rule, then it needs to exist at the start of the system's life. If you patch it into the system later, it won't have proper coherence.