This isn't an article, this is just whinging that things aren't perfect. "Log Le...

crabbone · on June 27, 2023

I can absolutely get behind the "nobody knows what log levels are for".

First of all, logs come from application components, and the bigger your application is, the more levels are in this hierarchy (components of components etc.) Authors of individual components may not have any idea about how performance of their component is going to affect the whole application.

So... log level needs to be re-interpreted? Logs need to be collected hierarchically? Logs need to become part of the application public interface? -- This is too high price to pay, and, in practice, nobody is going to do this. So, in reality, outside of very simple cases, it's just easier to ignore the level of the log message.

Another aspect is that log levels aren't really a sequence. Messages can belong to several categories at the same time, but these categories need not occupy a continuous sub-sequence. And the more detailed you make it, the more obvious this problem becomes.

Yet another aspect: people writing logs may rely on another (dumb) program processing them. So, they will leave semantics of logs aside, while concentrating on the desired side effect caused by the application processing them.

---

My personal experience with, eg. Kubernetes is that its authors consistently underestimate severity of some conditions and overestimate severity of others. I often find that what was labeled as "warning" was the reason the whole cluster doesn't function anymore, while something that was reported as "error" was a completely expected condition that the program absolutely knew how to recover from.

0xbadcafebee · on June 27, 2023

> logs come from application components, and the bigger your application is, the more levels are in this hierarchy (components of components etc.) Authors of individual components may not have any idea about how performance of their component is going to affect the whole application.

There is no way to solve this problem. If you have 50 unrelated sub-components in an application, there is no way to know if a given event is critical or just informational, because it requires specific context to understand what the event is actually impacting in that moment.

That doesn't mean the log level is useless. It just means it is one aspect of the signal you are getting from a sub-component. You can then filter on that specific signal from that sub-component in cases where the sub-component might provide a different log with a different log level signal. Is it perfectly accurate? No. Does that make it useless? No.

> Another aspect is that log levels aren't really a sequence. Messages can belong to several categories at the same time, but these categories need not occupy a continuous sub-sequence. And the more detailed you make it, the more obvious this problem becomes.

This just means you have a low-quality signal. Is every low-quality signal useless? No.

> Yet another aspect: people writing logs may rely on another (dumb) program processing them. So, they will leave semantics of logs aside, while concentrating on the desired side effect caused by the application processing them.

Again, low quality signals aren't useless, and you don't have to do this if you don't want to. Throwing away log levels is literally throwing away more signal, which is going to make life harder, not easier.

crabbone · on June 28, 2023

> You can then filter on that specific signal from that sub-component

I already answered why this is a bad idea, but I will repeat: this makes logs part of the public interface. Which, in turn, imposes a lot more restrictions on component providers than they currently have / than is plausible to expect from them. And if you (the mothership application developer) decide on your own to rely on the feature that is not a part of public interface (i.e. you decide to filter logs from components and translate them somehow) anyways -- well, you've done a crappy job as an engineer by essentially planting a time bomb in your application.

So, your plan is not good.

This also means you cannot automate response to log levels. As a human reading the logs, in most if not even all cases, you could probably get down to the bottom of why a particular log message had the level it had, but it's not humanly possible to write a program to do that. This is what makes log levels worthless (in the context of monitoring).

> low quality signals aren't useless

You seem to confuse the goal OP set for logs (use them for monitoring, i.e. in an automated way) with other possible goals (eg. anthropological, where you are studying how human understanding of log messages evolved over time).

shadowgovt · on June 27, 2023

> Throwing away log levels is literally throwing away more signal, which is going to make life harder, not easier.

Signal that can't be used is just noise, and noise complicates solving a problem by interfering with the signal.

In projects I work on, I generally only use two logging levels: info and error. Error indicates unrecoverable conditions that mean an execution context is terminating. Info is everything else. "Warn" is useless because something is only a warning in a particular context and I don't have that context when I'm building the logs. "Debug" is a lie; logging isn't debugging, and if I need to debug I need to slap an actual debugger on the binary with source code available.

I couple that with the ability to turn on and off logging at a fine-grained module level and (if I'm living my best life) being able to instrument the production code for breakpointing and logging on the fly (via systems such as Google Cloud Debugger).

morelisp · on June 27, 2023

Not only logs - that year I spent where CPUThrottlingHigh was an "error" alert (on a cluster with mandatory CPU limits) was awful.

KaiserPro · on June 27, 2023

> They're for an application to tell you what is going on. Come on dude.

I mean yeah, but the format is unique to every app. trying to rationalise them to reduce them to a tertiary value (ok, degraded, fucked) is as unique to each app.

Plus, with every update, it'll change, means that its loads of work

wspeirs · on June 27, 2023

You just need a tool that's adaptable and will let you parse-as-you-search. Without a schema, your logs can change whenever, and you can still easily derive value from them.

KaiserPro · on June 27, 2023

but thats debugging, not monitoring.

If you don't know that your service is down, then you don't know where and what to look for. for monitoring, you need to be looking for a specific parameter or string. if that changes, its very difficult to generate an automated alert.

wspeirs · on June 27, 2023

Fair enough... if you're monitoring "response_time", and a developer changes that field to "time_taken_to_send_the_bits" you'll probably have a tough time monitoring the service. However, if the dev communicates that the value has changed, with the right tool it isn't hard to have something that covers both fields.

KaiserPro · on June 27, 2023

100% but in the real world its a bit hard to coordinate.

Ideally you'd have a schema that you agree company wide. Before an app is deployed it has to pass the schema test. At least that would cover most of the basic checks.

but, For most small places, logs are perfectly fine for monitoring, as you imply

shadowgovt · on June 27, 2023

To translate: I believe the author is saying logging isn't a scalable, transferrable skill (and therefore a bit of an ill-fit for -as-a-service software stacks to provide benefit). Due to lack of standardization and the fact that every application is different (and logs are an application-level concern), the rules differ completely from app to app on what should be logged, what logs matter, and what error levels even mean.

... the problem with logs has always been that logs aren't needed until you need to deep-inspect the behavior of a running system, and then they're necessary. But since you really never know how your system will break down until it's in production, all logging is in some sense crystal-balling future failure modes.

http://thecodelesscode.com/case/73

http://thecodelesscode.com/case/74

http://thecodelesscode.com/case/103

(... while it can be very expensive to set up the infrastructure to support it, I'm a huge fan of the ability to enable live-tracing: in essence, I've seen systems that let you do something equivalent to breakpoint debugging on a live service via injection of novel code to be run when a specific line of code is hit in the service. That live code can then log whatever is relevant, dump a stack trace, dump variable values into a log somewhere, and so on. Of course, the downside is that your runtime has to be very specially constructed to support that kind of hot-patching on a production service).

0xbadcafebee · on June 27, 2023

> Due to lack of standardization and the fact that every application is different (and logs are an application-level concern), the rules differ completely from app to app on what should be logged, what logs matter, and what error levels even mean.

This has nothing to do with logging or metrics. Every application actually is different. There is no such thing as a universal application error. Even when somebody tries to standardize errors, they still need to be interpreted on a case-by-case basis.

Let's say you change your logging method to just output HTTP status codes. Someone sent a request, you send back a 408. What timed out? What caused it to time out? To figure that out, we need more context; at least the body of the request that timed out, but also network connection details, and the details of whatever server app was supposed to process the request. We need context outside of the application to make sense of the error.

The application literally doesn't have that context, so it literally cannot communicate to you "enough" information for that log to be "useful". The log is just a signal, which needs to be processed with a lot of other signals, correlated, and interpreted, on a case-by-case basis.

Yes, the rules are different from app to app. Because the context is different, the operations are different, virtually everything is potentially different. Could there be a little bit more standardization of errors, a la HTTP status codes? Sure. But would that solve the problem of actually understanding the context and implications of what's actually going wrong? No way.