Some tips I would like to add, talking specifically about logging:
1, Make sure you have a git hash attached, either in the filename, at the start of the file, or when you throw an exception.
This helps massively: you can switch your local env to exactly the code that was running when it crashed.
2, Log format should be standardised and include the following:
Timestamp (in UTC), log level, GUID of the data being processed[1], log message, and the filename and line of the statement that logged it. (There's a rough Python sketch of points 1, 2 and 4 after the list.)
[1] When dealing with distributed systems, multiprocessing systems or complex dataflows, starting a log message with "... %guid%: some log message about this ..." can be a massive time saver. The GUID could be a literal GUID or some serial number attached to the data with a type identifier.
3, Try to make your error messages unique across the codebase. The filename-and-line log format helps track down where a message came from, but if all you're given is the text "%id%: Failed to locate TPM in existing KYC lookup", being able to ctrl-F that exact message across the codebase and get straight to debugging saves time.
4, Debug logging isn't helpful when you need to debug something if it isn't turned on. Roll your log files by the hour if you really have to keep the file size down. But you're going to need those debug messages when you're debugging. If you're generating gigs of log files per day, allow me to introduce you to the concept of "cost of doing business".
5, Don't do stupid stuff that makes your log files harder to read. Binary encoded log files that you have to run through a tool to get the data out, archiving after some arbitrary time period into zip files that collect logs over a different time period, anything that is going to put friction between ops and getting the info they need when they're stressed and rushing to fix stuff.
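To make 1, 2 and 4 concrete, here's a rough sketch using Python's stdlib logging module. The git_hash() helper, the guid-defaulting filter and the service.log filename are just placeholders I've picked for the example, not anything your setup has to look like:

    import logging
    import logging.handlers
    import subprocess
    import time

    # Hypothetical helper: in a real deployment you'd more likely bake the
    # commit hash in at build time than shell out to git at runtime.
    def git_hash():
        try:
            return subprocess.check_output(
                ["git", "rev-parse", "--short", "HEAD"], text=True).strip()
        except Exception:
            return "unknown"

    # Point 2: UTC timestamp, level, guid, message, filename:line.
    fmt = logging.Formatter(
        "%(asctime)sZ %(levelname)s %(guid)s %(message)s (%(filename)s:%(lineno)d)",
        datefmt="%Y-%m-%dT%H:%M:%S")
    fmt.converter = time.gmtime  # timestamps in UTC, not local time

    # Records that never set a guid still need the attribute to exist.
    class DefaultGuid(logging.Filter):
        def filter(self, record):
            if not hasattr(record, "guid"):
                record.guid = "-"
            return True

    # Point 4: roll the file hourly so debug can stay on without one giant file.
    handler = logging.handlers.TimedRotatingFileHandler(
        "service.log", when="H", backupCount=72, utc=True)
    handler.setFormatter(fmt)
    handler.addFilter(DefaultGuid())

    log = logging.getLogger("service")
    log.setLevel(logging.DEBUG)
    log.addHandler(handler)

    # Point 1: git hash at the top of the log at startup (and on crashes if you like).
    log.info("service starting, commit %s", git_hash())

    # Point 2, footnote [1]: the guid of the data rides along via `extra`.
    log.debug("picked up payload from queue", extra={"guid": "ord-7f3a9c2e"})

Note the rollover means the commit line only sits at the top of the first file, so repeating it in the filename or when you throw an exception is still worth doing.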
Probably some more stuff but that will do for now.
> But you're going to need those debug messages when you're debugging
I second this so much. Clean up your debug diagnostics mess into something readable and useful instead of turning it off. Otherwise you're deliberately throwing valuable info away. Yes, you can re-enable it, but will that bad thing happen again today or in two months? You'll need those messages when something goes wrong (and it sure will). There's no point in having logs like "it started", "it's working", "it failed". With real detail in there, when they ask you how it failed and how to fix it, you're far more likely to be able to answer quickly.
For 4, there are logging libraries that let you log at both info and debug but only emit the info logs by default; then, if an error occurs in scope, they also emit the buffered debug logs.
This saves data and I/O on the happy path while still retaining access to the debug information when it matters.
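If you want the flavour of it without reaching for a specific library, here's a minimal sketch as a custom handler on top of Python's stdlib logging (the class name, buffer size and logger name are made up). The stdlib's logging.handlers.MemoryHandler(capacity, flushLevel=logging.ERROR, target=...) gets you most of the way there too, though it buffers the info records as well rather than emitting them immediately:

    import logging
    from collections import deque

    class DebugOnErrorHandler(logging.Handler):
        """Pass INFO+ records straight through; hold DEBUG records in a ring
        buffer and only replay them when an ERROR (or worse) shows up."""

        def __init__(self, target, capacity=500):
            super().__init__(level=logging.DEBUG)
            self.target = target                # handler that actually writes
            self.ring = deque(maxlen=capacity)  # recent debug records

        def emit(self, record):
            if record.levelno < logging.INFO:
                self.ring.append(record)        # buffer the debug noise
            elif record.levelno >= logging.ERROR:
                for buffered in self.ring:      # replay the lead-up first
                    self.target.handle(buffered)
                self.ring.clear()
                self.target.handle(record)
            else:
                self.target.handle(record)      # INFO/WARNING go out as normal

    log = logging.getLogger("worker")
    log.setLevel(logging.DEBUG)
    log.addHandler(DebugOnErrorHandler(logging.StreamHandler()))

    log.debug("step 1 ok")        # buffered, not printed yet
    log.info("processing batch")  # printed immediately
    log.error("batch failed")     # prints "step 1 ok" and then the error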
That's kinda cool, I didn't know that was a thing, but it doesn't seem too difficult to achieve with a buffer.
One problem though: most of the CPU time to log a message goes into preparing the log statement. Once you hand it off to a logging library you're off the critical path, probably onto a background thread or process, and then it's off to the OS and sequential writes to disk (which is faaast).
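In Python that handoff is already in the stdlib via QueueHandler/QueueListener. A rough sketch (the filename and format are arbitrary): the caller only pays for putting the record on a queue, and a background thread owned by the listener does the slow part, the actual write to disk:

    import logging
    import logging.handlers
    import queue

    q = queue.Queue(-1)  # unbounded queue between callers and the writer thread

    file_handler = logging.FileHandler("service.log")
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    # The listener runs in its own thread and feeds records to file_handler,
    # so the disk write is off the caller's critical path.
    listener = logging.handlers.QueueListener(q, file_handler)
    listener.start()

    log = logging.getLogger("service")
    log.setLevel(logging.DEBUG)
    log.addHandler(logging.handlers.QueueHandler(q))

    log.debug("this call returns as soon as the record is on the queue")

    listener.stop()  # at shutdown: drain the queue and join the thread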
There's also some benefit to having the debug statements from previous, correct runs to compare against. If there's any state carried over between runs (you kinda deserve problems, but still), it may help with debugging to have the earlier messages.
And of course, the most difficult problems to solve are the ones that don't crash but run through feeling fine with values reversed / inverted / off by 1 etc.
> Don't do stupid stuff that makes your log files harder to read.
Like using CloudWatch. I'm still amazed at how bad that interface is for looking through log files. I'd be about a million times happier with just a plain-text log file.