Yes! You should definitely be thoughtful about what you log and how you expect to use it. My biggest gripe with logs is that the people writing them often never think about "how would I use this when things are on fire?", and so they tend to log useless information or fail to tag it in ways that are actually searchable.
Tagging the right IDs is a huge thing - customer X is saying their instance is really slow, but if none of your logs let you link service performance to customer X, the telemetry you're paying for is absolutely useless!
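To make that concrete, here's a minimal sketch of what tagging logs with the IDs you'll actually search by can look like - the pino logger and the `customerId`/`requestId` field names are just illustrative choices, not anything from this thread:

```ts
import pino from "pino";

const logger = pino();

// Hypothetical handler: every log line carries the IDs you'd filter by later.
function handleRequest(customerId: string, requestId: string) {
  const start = Date.now();
  // ... do the actual work ...
  logger.info(
    { customerId, requestId, durationMs: Date.now() - start },
    "request completed"
  );
}
```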
You have an ally in me on this one :) I'm hoping given a bit more time we get to write things like this - practical observability from the perspective of a dev, as opposed to the SRE angle that I think is well covered. Feel free to join us on discord btw if you want to chat more - I (for better/worse) love musing about these things :)
Good stuff. Much industry progress since I was last in the arena.
Their site has words about manual and automatic instrumentation. I'd have to dig a bit to see what they mean.
--
So. Remembering a bit more... Will try to keep this brief; you're a busy person.
> tend to log useless information or fail to tag them in ways that are actually searchable
#1 - I don't know how to manage the lifecycle of metadata. Who needs what? When is it safe to remove stuff?
We logged a lot of URLs. So many URL params. And when that wasn't crazy enough, overflow into HTTP headers. Plus info partially (and incorrectly) duplicated in the payloads, a la SOAP. ("A person with two watches has no idea what time it is.")
When individual teams were uncertain, they'd just forward everything they received (copypasta), and add their own stuff.
Just replace all that context with correlation IDs, right?
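(For concreteness, the usual shape of that idea is something like this hypothetical middleware sketch - the `x-correlation-id` header name is just an assumption:)

```ts
import express from "express";
import { randomUUID } from "node:crypto";

const app = express();

app.use((req, res, next) => {
  // Reuse the caller's correlation ID if present, otherwise mint one.
  const correlationId = req.header("x-correlation-id") ?? randomUUID();
  res.setHeader("x-correlation-id", correlationId);
  // Each log line carries the ID instead of the whole forwarded upstream context.
  console.log(JSON.stringify({ correlationId, path: req.path, msg: "request received" }));
  next();
});
```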
Ah, but there's "legacy". And unsupported protocols, like Redis and JDBC. And brain dead 3rd party services, with their own brain dead CSRs and engrs.
This is really bad, and just propagates badness, but a few times, in a pinch, I've created a Q&D "logging proxy". Just to get some visibility.
So dumb. And yet... Why stop there? Just have "the fabric" record stuff. Repurpose Wireguard into an Omniscient Logger. (Like the NSA does. Probably.) That'd eliminate most I/O trace style logging, right?
Imagine all these "webservices" and serverless apps without any need for instrumentation. Just have old school app level logging.
#2 - So much text processing.
An egregious example is logging HTTP headers. Serialize them as JSON and send that payload to a logging service. Which then rehydrates and stores it somewhere.
My radical idea, which exactly no one has bought into, is to just pipe HTTP (Requests and Responses) as-is to log files. Then rotate, groom, archive, forward, ingest, compress, whatever as desired.
That's what I did on the system I mentioned. All I/O was just streamed to files. And in the case of the HL7 (medical records stuff), it was super easy to extract the good bits, use that for Lucene's metadata, and store the whole message as the Lucene document.
I know such a radical idea is out of scope for your work. Just something fun to think about.
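For anyone curious, a toy sketch of that "pipe HTTP as-is to log files" idea might look roughly like this in Node - file name, rotation, and the interleaving of concurrent requests are all hand-waved:

```ts
import * as http from "node:http";
import * as fs from "node:fs";

// Append raw requests to a single file; rotation/grooming would happen elsewhere.
const rawLog = fs.createWriteStream("http-raw.log", { flags: "a" });

const server = http.createServer((req, res) => {
  // Request line and headers, written exactly as received.
  rawLog.write(`${req.method} ${req.url} HTTP/${req.httpVersion}\r\n`);
  for (let i = 0; i < req.rawHeaders.length; i += 2) {
    rawLog.write(`${req.rawHeaders[i]}: ${req.rawHeaders[i + 1]}\r\n`);
  }
  rawLog.write("\r\n");
  // Body bytes stream straight through, untouched.
  req.pipe(rawLog, { end: false });
  req.on("end", () => res.end("ok"));
});

server.listen(8080);
```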
#3
> if none of your logs let you link service performance to customer X
Yup. Just keep adding servers. Kick the can down the road.
One team I helped had stuff randomly peg P95. And then sometimes a seemingly unrelated server would tip over. Between timeouts, retries, and load balancers, it really seemed like the ankle bone was connected to the shoulder bone. It just made no sense.
Fortunately, I had some prior experience. Being new to nodejs, maybe 5 years ago, I was shocked to learn there was no notion of backpressure. It was a challenging concept to explain to those teammates. But the omission of backpressure, and a hunch, was a good place for me to start. (I'm no Dan Luu or Bryan Cantrill.)
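(For anyone who hasn't run into the term, here's a minimal sketch of what backpressure means at the stream level - the producer checks whether the destination can keep up and pauses until it drains:)

```ts
import * as fs from "node:fs";

const dest = fs.createWriteStream("out.log");

// Illustrative producer: stop writing when the destination's buffer is full,
// resume once it has drained, instead of piling data up in memory.
async function writeMany(lines: string[]) {
  for (const line of lines) {
    if (!dest.write(line + "\n")) {
      await new Promise<void>((resolve) => dest.once("drain", () => resolve()));
    }
  }
}
```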
I'd like to think that, with proper end-to-end logging and the ability to find signal in the noise, diagnosis would have been more mundane.
Yes, OpenTelemetry is awesome in what it's done for the industry - it was really early when I was still at Mezmo/LogDNA, but it's matured a lot, though I think it still has a ways to go.
For automatic logging - I think you'd enjoy OpenTelemetry's automatic tracing implementation. It pulls standard telemetry out of things like your Redis requests and correlates them with trace IDs, so you can tie everything together from the moment your server starts accepting the HTTP request down to the Redis and DB calls, and see what was sent in each one - without needing to do any of it manually.
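In case it's useful, wiring that up for Node.js usually looks roughly like this - a sketch only, since the exporter and config details depend on your backend:

```ts
// tracing.ts - load this before the rest of the app,
// e.g. `node --require ./tracing.js server.js`.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  // Patches http, express, redis, pg, etc. so their spans share one trace ID per request.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```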
For capturing HTTP req/res - we actually have a few options depending on the language (ex. we do this for Python and Node.js) to enable more advanced network capture, so you can get the full req/res information, or whatever subset you're actually interested in storing! A number of teams have asked for it to make it easier to debug tricky API issues they're running into.
Proper end-to-end logging definitely makes it easier to find the right clue among a sea of logs, hopefully we make it easier to get there!