A skeptic's first contact with Kubernetes (davidv.dev)
167 points by todsacerdoti 6 months ago | 101 comments



His take on text interpolation is very right. I'm a SWE turned SRE because as a developer I really enjoyed using K8s. But being a full-time SRE where I work just means YAML juggling. It's mind-numbing that everybody is okay with this; this really is our domain's assembly era, albeit with whitespace, colons, dashes and brackets.

I've found solace in CUE which I just run locally to catch all the small errors everybody makes on a daily basis. Putting the CUE validation in our pipeline is too confronting for others, yet they're constantly making up best practices adhoc during reviews which could've easily been codified with CUE (or some other serious config language).


> Putting the CUE validation in our pipeline is too confronting for others

Sad. Can you get away with boiling the frog by getting it initially added but configured to check very, very little? Maybe some specific rare category of mistake that doesn't happen enough to annoy your colleagues when it catches it but is implicated in a recent outage management still remembers?

Then slowly adding rules over time as feasible (exploiting each outage as an opportunity to enlist stakeholder support for adding more rules that would have caught that particular misconfiguration event)

Sometimes I think figuring out how to stage and phase gradually making the changes you need to improve the system in light of social inertia against it is the most complex part of corporate enterprise software work. I definitely remember a time I wrote a whole perl-based wheel-reinventing crappy puppet knockoff just for a set of nagios plugins, entirely not for technical reasons, but for the political reason that this way we would control it instead of the other department which refused to get with the program and let us do staged canary rollouts. It was wrong technically if you assume frictionless corporate politics, but it was the right and only practical way to achieve the goal of ending the drumbeat of customer outages.


I don't think that the k8s yamls/jsons are bad. It's just bad that we write and edit them as text files

My take on the k8s yaml mess is that we are lacking a modern replacement for Helm. The yaml templating at the text-file level is just pure craziness. I think we would need an approach that is more like React/JSX, or even better TSX (TypeScript JSX): some way to get away from simple templating and toward structured descriptions, with typing and control flow.

I think there are some approaches in this direction (like the k8s Terraform module, or the defunct ksonnet), but none of them has gotten it perfectly right yet.


This is a naive thought, as someone who has only been doing devops stuff for a little over a year...but what about HCL? Terraform was my favorite aspect of the devops stack, because I really enjoyed the simplicity of the configuration language. It just made sense to me.


There is a kubernetes provider for terraform: https://registry.terraform.io/providers/hashicorp/kubernetes...

I just think it doesn't map well to the declarative way of kubernetes. And HCL/Terraform often feel a bit clumsy when control flow/complex mapping is needed. Doesn't feel like a real programming language.

Terraform is great for managing cloud resources, diffing and applying changes. But this is not needed for kubernetes, in some way k8s can even be considered an alternative terraform (declare desired resources, k8s will create/update/destroy them as specified).


RCL has control flow and types: https://rcl-lang.org/. No record types yet though, so they are not yet very useful to validate e.g. a Kubernetes manifest against a schema.


Looks a lot like Jsonnet. This is going in the direction I'm talking about.

Is there also some type-checking/code-completion available for Kubernetes resources? I think this would be an essential part of improving developer experience and automatic checks.

Some linter to validate the k8s manifests must be a part of a solution. It's possible to check the output, but a perfect solution would also lint the RCL source code and highlight the sections that generate invalid manifests.


At the moment nothing like that exists. Eventually it should be possible to generate RCL types like https://github.com/dhall-lang/dhall-kubernetes does for Dhall.


I was just comparing different approaches and stumbled upon KCL. Might be worth a look:

https://www.kcl-lang.io/docs/user_docs/getting-started/intro


Ksonnet being deprecated was one of the worst things to happen with the ecosystem IMO. Some kind of Jsonnet tool should have been integrated with kubectl like kustomize.

Shameless plug for Etcha, a configuration management tool (that works with kube!) built around Jsonnet manifests: https://etcha.dev


The better way is to use manifest generators with a Turing-complete language. I wrote such a tool in Ruby.

The problem is that it is written in Ruby. It's great for shops that already have Ruby expertise. Wherever I have implemented this, it's worked out great. I have my doubts about it in other shops.

The community went a different way -- starting with templated JSON (not even JSON transforms). It was the least common denominator.

This is an issue with the ecosystem rather than design flaws of Kubernetes itself.


Sort of. Kubernetes is lowest common denominator by design. Specific ecosystems had much more efficient solutions to these problems (e.g. rather than having to deploy a whole docker container to do a version upgrade you deploy just your specific web application) but the problem was those solutions only worked for those specific ecosystems.


I disagree. Kubernetes is composed of primitives that can be applied in versatile ways. Those primitives can be swapped out for even greater extensibility. Kubernetes is not the lowest common denominator; it's more of a toolkit that can be broadly and deeply applied to many setups.

Templated yaml or json is a step backwards. Those are not composable. If we wanted something like that, then we would need to at least be able to do merges and transforms. Something like jQuery or CSS for yaml that lets you transform fields based upon selectors.

Manifest generators using a real language, on the other hand, start with data structures that can be merged and transformed before the final output as yaml. They can take advantage of language features. The one I wrote in Ruby can take advantage of class inheritance and mixins, allowing me to define common configs across resources. Someone who can distill the algebra of merging Kubernetes manifests can do that for a functional programming language.


> The better way is to use manifest generators with a Turing-complete language. I wrote such a tool in Ruby.

I've seen something similar at a previous company - some Ruby DSL written to generate CloudFormation. It was used in another department though so not sure how well it worked in practice.


There's a ton of discussion of alternative ideas as comments. Many look amazing.

But I keep wondering how many would be useful in giving us the modularity to make something like the shareable charts Helm has. That bitnami and others have these massive troves of charts is such a superpower for Kubernetes.

We all seem to agree it's not the string templating that's excellent, but figuring out how to package & make modular the templates/generators, and how to manage the outputs of the generators: that's "the rest of the owl" that tools such as Kustomize and languages such as Dhall and Starlark don't really buy us.

There is MetaController, which seems like some way to operationalize running generators and managing outputs. But I'm not sure what good examples there are of its use, and specifically what examples are intended for distribution like Helm Charts sometimes are.


> ...just means YAML juggling.

Perhaps I'm missing something, but even the "YAML juggling" still seems to be in an immature stage. As an example, a problem I've been wrestling with recently where I could not find a solution with yq or other YAML wrangling tooling is the following.

Say I have a YAML fragment like this:

    rules:
    - apiGroups:
      - rbac.authorization.k8s.io
      resources:
      - clusterroles
      verbs:
      - get
    - apiGroups:
      - rbac.authorization.k8s.io
      resources:
      - clusterroles
      verbs:
      - list

I want to condense it to this:

    rules:
    - apiGroups:
      - rbac.authorization.k8s.io
      resources:
      - clusterroles
      verbs:
      - get
      - list

More generally, I want to find the smallest most compact representation possible between the apiGroups, resources and verbs terms, combining everywhere possible. In the example, I combined the verbs, but the same combining can take place for the apiGroups and resources terms. This comes in really handy when cleaning up a litter trail of these constructs built from figuring out what resources to add to enable more complex Helm Chart installs (the iterative nature of that exercise is another rant in itself) in reasonably restrictive environments.

I resorted to writing a script to solve just this problem, but I keep thinking I must have missed some solution to this everyone else seems to have solved but me. Apparently the solution eludes my local K8S experts as well, because they condense it by hand, which I refuse to accept as the proper solution.


Something curious I noticed: while this comment was posted "6 hours ago", when I click reply, it instead says that the comment was posted "3 days ago".

Is this happening to others as well?

From digging around it looks like the comment really was posted 3 days ago, looking at the `title="2024-07-29T08:41:07"` attribute in the timestamp tag.


Submissions get renewed sometimes and the time stamps are set to a newer time (this is to get it ranked high enough to hit the front page). All comments have a false (newer) time stamp for a while if they predate the renewed time stamp. After a few hours the original time is restored for the submission and comments.

These usually come from the second chance pool (click lists at the bottom to get to the pool and some other lists).


> I've found solace in CUE

We rebuilt our Kustomization layers to be built via cue, so that we could theoretically work with fewer mistakes. Oh my god, never again.

Incredibly confusing and horrible errors. Writing new modules was painful and unclear, trying to compose them together was even worse. I still get flashbacks to that “Invalid bytes” error.


Man I'd love some real war story sharing here! This might save me some time, b/c I have to admit I'm still in my honeymoon period with CUE.


I think it's worth re-examining the widespread opposition to writing configs in an interpreted "real" programming language, rather than a string templating language (if you're looking to split hairs, in this taxonomy PHP and JSX are real languages, but they're close to the line. Python is prototypically a real language. Awk, Jinja, CUE, and Bash/ZSH string expansion are not. If the word "real" bothers you, pick another one).

Like, people are correct that it is not great to allow (restricted at runtime or not) arbitrary code to define configs. I think "our config language is Python"-type statements are concerning in a whole lot of contexts, and for good reasons.

But holy shit at what cost. The landscape formed by the extremity of that no-real-scripting-languages-allowed sentiment is bleak. We've all dealt with it: the proliferation of handfuls of arcane tools that have to be present at specific versions in order to render a string. The triple-templated hellscapes written in Helm/Jinja/CUE/Jsonnet/DTL/ytt/JSON-patch/Dhall. Quoting hell. Indentation hell. Brace hell. The endless, endless bikesheds over which data language to render to (which are never really resolved and always just end up with JSON-interpolated-in-YAML-interpolated-in-TOML and transformed to the "right" format at render time). The flavor-of-the-month XKCD#927 creation of each iteration of a data+templating language that will solve all of the problems, but for real this time.

This really sucks. Not because it's tedious; tedium is not necessarily a bad thing, with apologies to the automate-all-the-things crowd. But because it's expensive: in time, training, bugs, incident MTTR, and so on.

I dunno. Maybe we should accept that past some complexity threshold, you should just write your configs in a general purpose scripting language. Not because it's a good practice, but because our best attempts at an alternative are so much worse that it's probably the best practice to just do it in fucking Python. Starting from there and adding restrictions (i.e. Starlark) can't possibly be worse than the currently blessed way.


I don't object to the desired state being represented in YAML, I object to generating that YAML using something that's not a "real" programming language.

I don't particularly care if you generate your data structures from schemas, or generate your schemas from data structures, but I do care if you generate your output YAML with something that isn't type-safe.

If you're generating YAML to deploy, you can do that during build. If you're needing to combine configuration to generate new YAML at runtime, that's what controllers are for.


I generally agree.

I use Ruby for this. Duck typing is sufficient.

I’ve also found that coupling manifest generation with application deployment closes a lot of doors. It’s better to modify manifests to change image references (or something similar) and have a separate process for applying manifest changes.


I’ve had great success with Flux and a small bit of CI for this.

We currently have configs in raw K8s yaml in our monorepo in kustomize base+overlay style setup. It’s super duper obvious what’s in each env, and straightforward to change them. Images are swapped in using Kustomize’s “newImage” functionality. Flux watches git and deploys to K8s.
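For anyone unfamiliar with that pattern, a minimal sketch of an overlay doing the image swap via kustomize's images transformer might look like this (paths, image name and tag are hypothetical):

    # overlays/prod/kustomization.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - ../../base
    images:
      - name: registry.example.com/my-app   # image reference as it appears in the base manifests
        newTag: "1.2.3"                     # tag written by CI for this environment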

At any point it’s super clear what the configs are and what image we were using at the time. Flux just pulls in what it finds in git, so there's complete separation between “make them” and “deploy them”. Nicest CI/CD experience ever. Zero Helm templating, zero pain.


Thanks for sharing that. That validates my observation. I had implemented a system with separate manifest and deployment and it had worked great. I came to a team with an existing process that coupled manifest and deployment process … and it contributed to their inability to respond to changing circumstances; you get the Kubernetes complexity and none of the Kubernetes advantages.


How are you drawing the line to put Dhall in your bad list and Starlark in your good list? As far as I can see they're extremely close cousins.


Agreed - I actually think TypeScript is easier to craft JSON output in tho: https://news.ycombinator.com/item?id=39242314

The Starlark approach is good too - I just like types if it's going to get gnarly, and find Python typing to be suboptimal compared to TypeScript in terms of LSP and tooling (and only Facebook's Rust starlark interpreter supports types)


I think that templated languages are the lowest common denominator.

An example of using a real language for configuration management is Chef. One of the bigger hurdles is that someone with an ops background has to learn a language.

On the other hand, if you are recruiting someone for an SRE or platform team, you are looking for ops people that are willing to learn how to dev, or dev people willing to get their hands dirty with ops.


I find Python to be one of the most useful text generation languages I’ve used.

F-Strings are super useful. Partial application and list comprehension can turn pages of wallpaper code into a few short statements.


I’d encourage you to try F#. The list comprehensions are even more powerful and partial application / currying is pervasive.


> which I just run locally to catch all the small errors everybody makes on a daily basis

What small errors?


KCL is amazing and growing quickly.

Can use Crossplane function KCL too.


Did you look at CDK8s?


Great writeup on the core fundamentals, saved this to share with engineers who are new to k8s and need a quick primer.

Re: This piece -

> Given the Controller pattern, why isn't there support for "Cloud Native" architectures?

> I would like to have a ReplicaSet which scales the replicas based on some simple calculation for queue depth (eg: queue depth / 16 = # replicas)

> Defining interfaces for these types of events (queue depth, open connections, response latency) would be great

> Basically, Horizontal Pod Autoscaler but with sensors which are not just "CPU"

HPAs are actually still what you want here - you can configure HPAs to scale automatically based on custom metrics. If you run Prometheus (or a similar collector), you can define the metric you want (e.g. queue-depth) and the autoscaler will make scaling decisions with these in mind.
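As a rough sketch (assuming a metrics adapter exposes a hypothetical queue_depth external metric), an autoscaling/v2 HPA keyed to it could look like:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: worker
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: worker                     # hypothetical Deployment consuming the queue
      minReplicas: 1
      maxReplicas: 32
      metrics:
        - type: External
          external:
            metric:
              name: queue_depth          # served by the external metrics API via the adapter
            target:
              type: AverageValue
              averageValue: "16"         # roughly queue depth / 16 replicas, per the article's example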

Resources:

https://kubernetes.io/docs/tasks/run-application/horizontal-...

https://learnk8s.io/autoscaling-apps-kubernetes


Using custom/external metrics is exactly what I was looking for, thanks!

I think my misunderstanding comes from the distinction between "first party" sensors (CPU scaler) and "third party" sensors (via "external metrics").

Is there a reason for this distinction? Will the CPU scaler eventually be pushed out of k8s as well?


Kubelet is already managing the CPU resource from scheduling, I guess?

I don't think there's any real movement on moving CPU scaler out of HPA.


CPU tracking is provided by the metrics API, which either reads kubelet metrics directly (the original, old, but simplest way), or a metrics adapter that reads the metrics from a third party collector and implements the API.

The behavior is supported by the v1 api rules so no, it’s extremely unlikely to be moved out.

That said, with GPU workloads gaining steam I wouldn’t be surprised if we added new “supported everywhere” metrics at some point.


See KEDA and Karpenter for advanced k8s scaling


Karpenter is great for managing spot fleet nodes as well. Most of our clusters run a small aws managed node group for karpenter and the rest of the nodes will be spot fleet and managed by karpenter.


^ this right here. We used KEDA to query DynamoDB to look at a queue depth we wrote to a table. If number was X, then we would scale on it. Was pretty slick.


This was a solid write up; I've been using K8s (intermittently) for like, 5 years now, and I still spend an inordinate amount of time looking things up and trying to convert the nonsense naming conventions used to something understandable. I can think of 20 or so projects that would have run great on K8s, and I can think of 0 projects running on K8s that worked well.

Eventually, seeing the wrong tool used for the wrong job time and time again I came around to seeing K8s as the latest iteration of time sharing on a mainframe, but this time with YAML, and lots of extra steps.


My problem with K8s: the network abstraction layer just feels _wrong_.

It's an attempt to replicate the old model of "hard exterior, gooey interior" model of corporate networks.

I would very much prefer if K8s used public routable IPv6 for traffic delivery, and then simply provided an authenticated overlay on top of it.


> My problem with K8s: the network abstraction layer just feels _wrong_.

> I would very much prefer if K8s used public routable IPv6 for traffic delivery

shudder... nothing could feel more wrong to me than public routable IPv6, yuck.


Publicly routable is wonderful. My first job was at a company that happened to have somehow acquired a class B, so all our computers just had normal real addresses; they always had the same address whether you were on a VPN or a home network or whatever, and remoting into the company network just worked.


Same! It was incredibly easy to obtain address space in the 80's and 90's. I have a /24 ("class C") routed to my home!


Why? It neatly separates concerns. Routing and reachability should be handled by the network. The upper layers should handle authorization and discovery.

Public IPs also definitely don't need to be accessible from the wide Internet. Border firewalls are still a thing.



It's still an overlay network.


How do you suggest talking to ipv4-only internet hosts and supporting ipv4-only containers?


Via the border load balancers, just like we do it now.


So basically I need to run another piece of infra that does NAT64 and DNS64 and limits my deployment options quite a bit (can't do DSR)? Totally unnecessary in cloud... Not sure how that's better for users but probably better for vendors ;)

Btw, overlay is not the only option to do CNI - Calico, Cilium and a few others can do it via L3 by integrating with your equipment. Even possible in cloud but it has serious scale limitations...


No, you misunderstand me. My dream infrastructure would run IPv6 with publicly routable IP addresses for the internal network, for everything.

IPv4 is needed only for the external IPv4 clients, and for the server code to reach any external resources that are IPv4-only. The clients are simply going to connect via the border load balancers, just as usual.

For the external IPv4-only resources, you'll need to use DNS64. But this is not any different from the status quo. Regular K8s nodes can only reach external resources through NAT anyway.

I'm actually trialing this infrastructure for my current company. We got an IPv6 assignment from ARIN, so we can use consistent blocks in US West and US East locations. We don't use K8s, though. AWS ECS works pretty great for us right now.

> Btw, overlay is not the only option to do CNI - Calico, Cilium and few others can do it via l3 by integrating with your equipment. Even possible in cloud but has serious scale limitations...

It's still an overlay network, just in hardware.


> It's still an overlay network, just in hardware.

It really isn't, at least not in the commonly understood sense. See [0] for example - you can use this with dual-stack and route everything natively even with ipv4 using rfc1918 cidrs. No ipip/gre/vxlan tunneling required. Does require setting up BGP peering on your routers.

[0] - https://cloudnativelabs.github.io/post/2017-05-22-kube-pod-n...


There are ways to unbolt the native networking stack and roll your own. Tons of options available: https://github.com/containernetworking/cni

I don’t agree with your approach (curious as to why you would want this) but I believe it’s possible.


I've felt similarly. Possibly because I was online pretty early, pre-NAT... there was public IPv4 everywhere.


How would that work with load balancing and horizontal scaling?


Just like it works currently. Either via dedicated load balancers or by using individual service endpoints.


> Why are the storage and networking implementations "out of tree" (CNI / CSI)? Given the above question, why is there explicit support for Cloud providers? eg: LoadBalancer supports AWS/GCP/Azure/..

Kubernetes has been pruning out vendor-specific code for a while now, moving it out of tree. The upcoming 1.31 release will drop a lot of existing, already deprecated support for AWS & others from Kubernetes proper. https://github.com/kubernetes/enhancements/blob/master/keps/...

There's some plan to make this non-disruptive to users but I haven't followed it closely (I don't use these providers anyhow).

> Why are we generating a structured language (YAML), with a computer, by manually adding spaces to make the syntax valid? There should be no intermediate text-template representation like this one.

Helm is indeed a wild world. It's also worth noting that Kubernetes is pushing towards neutrality here; Helm has never been an official tool, but Kustomize is built into kubectl & is being removed. https://github.com/orgs/kubernetes/projects/183/views/1?filt...

There's a variety of smart awesome options out there. First place I worked at that went to kube used jsonnet (which alas went unmaintained). Folks love CUE and Dhall and others. But to my knowledge there are no massive bases of packaged software like there are for Helm. Two examples: https://github.com/bitnami/charts/tree/main/bitnami https://github.com/onedr0p/home-ops . It'd be lovely to see more work outside Helm.

Thanks sysdig for your 1.31 write up, https://sysdig.com/blog/whats-new-kubernetes-1-31/


> Kustomize is built into kubectl & is being removed

Hadn't heard this until now, I'm a rather happy user after being jaded about Helm 2/3. Do you happen to know if this removal is sentimentally closer to "we don't want to keep maintaining kustomize" or to "we don't want to keep embedding kustomize, please use its binary directly"?


The motivation is more the latter, but it's not at all clear the proposed removal of the embedded kustomize will proceed, given the compatibility implications. See discussion at https://github.com/kubernetes/enhancements/issues/4706#issue... and following.


Thank you for this and your many many other contributions!


> Kubernetes has been pruning out vendor-specific code for a while now, moving it out of tree.

I wasn't aware, but it makes sense, and explains the current "confusing" state.


Thanks for the write-up, and especially for storytelling your ability & willingness to wade in & find out! Proper hacker spirit!

You did such a great job offering an informative high level view (honing in on the control loop feels like seizing upon a critical insight), & iterating through bits nicely. You were tactful & clear in raising some ongoing qualms, which indeed seem largely correct & frustrating, even to those who started as believers. Thanks thanks, a recommendable article indeed.


Off topic - thank you for not having a super narrow text layout for your site. It seems like every other website these days has an incredibly narrow text width (I've seen as small as 600px which is so annoying to read). It was like a breath of fresh air to go to your site and have the text be a reasonable width without me having to fiddle with page styles.


> Why are we generating a structured language (YAML), with a computer, by manually adding spaces to make the syntax valid?

Yep, it sucks. It's not like nobody has tried to do better, but nothing else has the adoption of Helm. Ultimately text is, as always, universal.

If you want a fun fact: the communication between kubectl and the kube-api-server is actually in JSON, not YAML.


> If you want a fun fact: the communication between kubectl and the kube-api-server is actually in JSON, not YAML.

YAML is a superset of JSON. All JSON is valid YAML. No one uses YAML over the wire, as you cannot guarantee the conversion from YAML to JSON: YAML is a superset and may contain things like anchors, which a JSON parser cannot handle.

That's why :)
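For example, an anchor/alias like this is legal YAML but has no direct JSON equivalent (a toy snippet, not from the k8s API):

    defaults: &limits
      cpu: 500m
    containerA: *limits   # alias back-reference; JSON has no syntax for this
    containerB: *limits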


Nobody tell the JSON::XS CPAN maintainer you said that: https://metacpan.org/pod/JSON::XS#JSON-and-YAML

Or John Millikin: https://john-millikin.com/json-is-not-a-yaml-subset


One of many reasons why most of the perl devs I know use the Cpanel::JSON::XS fork now.


If we replace “YAML” with “JSON” and then talk about naive text-based templating, it seems wild.

That’s because it is. Then we go back to YAML and add whitespace sensitivity and suddenly it’s the state-of-the-art for declaring infrastructure.


The helm yaml thing really is annoying. Unfortunately it feels like helm is too firmly embedded to unseat at this point.


My annoyance with Helm is that it's possible for a helm install to fail, with no top-level install status, yet leave behind failed chart resources that you need to go digging around and manually delete before another install can be applied. I mean, Helm has one key job...


I've switched to kustomize for my homelab. It's built into kubectl (no extra tool needed) and it's not really templating.

It doesn't fully replace helm for 3rd party tools or packaging but it's a solid alternative for less complicated setups.


Funny then that kustomize apparently is being removed, according to another comment:

https://news.ycombinator.com/item?id=41093797


You don't need helm at all. I've used k8s at a few places, big and small, and we just used kustomize or simple scripts to generate the actualized configs.


This was a useful read and somewhat gels with my experiences.

Looking at the statement at the beginning:

> and it only requires you to package your workload as a Docker image, which seems like a reasonable price to pay.

it's no longer true as you continue down the path, since it actually requires you to do a lot more than you'd think.


I've not yet gone down that path - in which other ways does your workload need to adapt? I understand that you may want to get more value out of logs and include sidecars, or you may want inter-pod encryption and include service meshes, etc.; but that'd be something that requires extra complexity in any setup.


OIDC integration, RBAC, data persistence/replication across nodes, observability, mTLS.

And yes, you're right, all these things are complex in any situation, except when you simply use a load balancer and two servers. There's a company estimated to be worth close to US$2B running a service you may have heard of on fewer than 30 servers: Stack Overflow (https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha... && https://nickcraver.com/blog/2016/05/03/stack-overflow-how-we...).

(As an aside: K8s does not free you from your Cloud vendor - after floating around my city working on various "platforms", I've found them all locked in due to IAM, for example.)


> And yes, you're right, all these things are complex in any situation, except when you simply use a load balancer and two servers.

How does that simplify or solve any of the problems you mention? As far as I can see they're just as present and just as complex when you have two servers as when you have hundreds.


None of those things are specific to Kubernetes, though. If anything, it's a great forcing function to do the things you ought to for non-k8s deployments. It's far easier to write, say, a systemd unit for a service that is properly configurable, has health checks, etc.


On the last point ("Stringy Types") - k8s API types are actually defined as Protobufs[0] so they have strictly defined schemas. There are some types that are sum types (IntOrString) but generally no type depends on some other field's value afaik. Ofc that doesn't stop CRD developers from making everything a String type and then interpolating server-side (within their custom controller) based on phases of the moon and what not...

[0] - e.g https://github.com/kubernetes/api/blob/master/core/v1/genera...
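To make the IntOrString point concrete, here's a minimal (hypothetical) Service where targetPort accepts either a named port or a number:

    apiVersion: v1
    kind: Service
    metadata:
      name: web
    spec:
      selector:
        app: web            # hypothetical label
      ports:
        - port: 80
          targetPort: http  # IntOrString: a named container port, but the integer 8080 would also be valid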


I agree with:

> My opinion is that a large part of Kubernetes' value is derived from just two concepts

I agree with the first one, "control loops," but the second one is "API-based," not "Services."


Solid write up, but small nitpick with the diagram at the start:

It displays a pod containing multiple containers (this is fine and normal) but then highlights some of those containers to be different services.

Unless you guys are running wildly different setups, or we’re talking about sidecar and init containers which is a whole rabbit hole unto itself, I put different services in different pods.


Also the same diagram shows nodes being lower in the hierarchy than namespaces. Namespaces and nodes are orthogonal.


Here's what I would like from a Kubernetes-like system.

I have a collection of machines.

I have a very simple file that defines A) what I want to run, B) how the pieces communicate to each other, and C) how I want it to scale.

Make that work without me having to think about any of the underlying infrastructure. (If this sounds suspiciously similar to Heroku, there you go)


There are tools that can do that, and are not as complex as Kubernetes

I once interviewed at a place where their notion of "scaling" was to turn up the knob on Heroku. They also pay for it. They did not have anyone on their team who knew what questions to ask. They had grown to the point where they had outgrown their approach, and yet had never developed the in-house capability.

I mentioned the Cynefin framework elsewhere, and I'll mention it again. Infrastructure, and how it supports applications, users, and stakeholders, is a complex system.

A lot of people treat this as if it were in the complicated, or even clear domain. For certain class of use-cases, you can get away with this. But a company itself is a complex adaptive system, whose needs changes as it matures and scales ... or it doesn't, and it collapses.

The Cynefin framework describes four different domains, and the way you make decisions is different for each of them. You start getting in trouble when you apply the decision-making process for one domain to a different domain. This often happens when the situation has evolved without the decision-makers noticing. It is easier for experts to see when clear domains change into complicated domains, but more difficult to see when complicated domains change into complex domains.

Kubernetes is a sophisticated, adaptive, extensible tool for the complex domain. If it seems overly complex, it is because the domain is complex, and it is not necessary if you are working in the complicated or clear domain.


No idea why you're being downvoted. Makes sense to me.


I think I'd be even happier if it had reasonable defaults for C. At least as a novice, I probably don't know how I want it to scale, and it should be able to figure that out better than me.


Kubernetes provides a relatively simple abstraction to represent a cluster of machines as a single entity. It's just about as simple as it can be IMO.

Its simplicity leverages the idea that each part of your software should be fully responsible for its own lifecycle and can handle and recover from all scenarios that can impact it.

For example, if your app service happens to launch before your database service, then it should be able to seamlessly handle this situation and keep trying to reconnect until the database service is started. This characteristic is generally desirable in any operating environment but it is critical for Kubernetes. Its orchestration model is declarative, so it doesn't make much fuss over launch order... Yet it works well and simply if each of your services have this failure tolerance and self-healing characteristic.


Reminds me of myself, and probably many others on here. I've always been a skeptic of novelty. Just like with cryptocoins I was there on the ground floor, I remember when Docker was launching. I remember trying to understand what it was, and I remember someone comparing it to MSI for Linux.

Unlike cryptocoins I didn't miss out on being a millionaire with containers. :(

I just avoided containers until 2019! So to me it was first containers, and then kubernetes.

That way I was already sold on the whole container image concept and namespaces in your OS; I had used FreeBSD jails in the early 2000s.

So when I understood k8s I realized it's literally just a container orchestrator. It might seem complicated but that's all to do with being able to run containers on a range of nodes instead of just one. And of course having an API to tie it all together.

Whether your project needs that or not is something you should definitely explore in depth before you set out and try to use it. Personally I prefer container hosts over k8s for most startup projects. I look forward to Talos' new podman package and being able to deploy a minimal podman container host with Talos, no k8s necessary.


"Whether your project needs that or not is something you should definitely explore in depth before you set out and try to use it." Unfortunately, as someone who's written these kinds of orchestrators a few times already (going back to jails on FreeBSD 4.8 and Xen domUs on NetBSD 2.0) and has a nose for where they're a good idea, I find this is overwhelmingly not the prevailing wisdom. I desperately wish it were.

Somewhere, the industry went all-in on Kubernetes, and I have zero desire to touch most of the resulting software stacks. We sold folks hard on complexity being necessary, and Kubernetes makes it very tempting to go ham with it.


What's the break-even number of machines you must manage before Kubernetes starts to make sense?


I'd say 3 or more distinct technology stacks. You can run 200-machine clusters comfortably with lighter tooling if they're all Ruby or all JVM or what have you (and I have), but once your deployment process requires remembering the foibles of 4 different Python build tools it becomes easier to shove it all in Kubernetes.


Kubernetes starts to make sense when you don't know the number of machines you need (and therefore desire auto-scaling based on system load).


Why can't we use Meson? YAML is basically used as a bizarre build system anyways.


This is a good start. It misses an important thing about Kubernetes that is often overlooked: extensibility. Each and every thing within Kubernetes can be swapped out for something else, including the scheduler. There are Custom Resource Definitions that support these extensions and operators.

For example, there is no built-in autoscaler for nodes, but someone wrote one and you can add one in there. It uses a constraint solver for determining whether to expand or shrink node groups. If you want to use something else, you can find something or write it and install it.

Another example, you don't have to use kube-proxy. There are other ways to manage inter-node networking.

To address some of the questions:

> Why does it not matter if the state is unstable? If I'm operating a cluster that can't settle, I'd like to know immediately!

I'd point to the Cynefin framework to help make sense of this. Kubernetes is a tool that helps manage things in the complex domain, rather than the complicated domain. Unlike complicated systems, complex systems and complex adaptive systems may never reach a defined settled state.

> Basically, Horizontal Pod Autoscaler but with sensors which are not just "CPU"

That's already available. In addition to scaling on built-in metrics such as cpu and mem, there are ways to create custom metrics, including queue depth. You can do this because Kubernetes is extensible.

> Why are the storage and networking implementations "out of tree" (CNI / CSI)?

It used to be in-tree, until the number of third party cloud and in-house storage and networking providers became unwieldy. This goes along with that fundamental characteristic of the Kubernetes design -- extensibility. AWS owns the Elastic Block Store CSI, and GCP owns the CSI for its storage devices. CNI allowed for the various service meshes, including exciting new ones such as the one based on eBPF.

The Cynefin framework again helps shed some light on this: the best way to respond to things in the complex domain is to try a lot of different approaches. Even if one approach doesn't look like it works now, some future state may make that previously impractical approach work well.

> Given the above question, why is there explicit support for Cloud providers?

The current versions of Kubernetes push those implementations to CNI and operators. So for example, in order to make use of AWS ELB/ALB for the Ingress object, you have to additionally install the AWS-specific driver. If you are using AWS's EKS service, this is managed as an EKS addon. Under the hood, these drivers are, guess what, pods managed by replicasets managed by deployments that listen to the Kubernetes API server for changes to the Ingress resource.

Not everyone uses it. On one of the Kubernetes sites I worked on, we used Traefik inside Kubernetes and its custom resource definition, IngressRoute. Every time you create an Ingress, the AWS driver will create a completely new Load Balancer, which, of course, drives up cost for very little gain.


Thanks for the detailed answer

> extensibility

This is something that irks me right now, but at my current knowledge level (~0) it didn't feel right to even mention:

If Kubernetes pushes everything out (CSI, CNI, Cloud provider integration, LoadBalancer, ...), doesn't it become "just a control loop runner"?

I'm sure there's value in that, but I can imagine that most people running clusters would make different choices for various options, landing you with _every cluster being unique_, which feels wrong somehow.


That's right, and it's also why you see Kubernetes distributions popping up.

That way, someone has already done all the configuration of the various plugins and components for you. For instance, the major clouds all let you easily start with their Kubernetes services, which integrate well with their logging and monitoring systems, IAM, server/node provisioning, etc.

Or ones that are not tied to any particular cloud provider, and perhaps focused on security (full disclosure: I work for Elastisys, who makes exactly that) or ease of use.

I'm sure we will see more and more such efforts in this space, exactly because cobbling together something yourself from scratch (basically just a control loop runner, as you said) is neither very appealing when you think about ongoing maintenance, nor very cost-effective for businesses to spend engineering time on.


That is correct. Every cluster is unique. What we have instead are a set of standard design patterns that are broadly (but not universally) applicable.

Even the core components can be swapped out so that Kubernetes can be adapted. For example, there are several teams that wrote kubelet replacements. The built-in scheduler for kubernetes is only responsible for pod placement by contacting the kubelet on a node and changing its desired state. The kubelet is responsible for actual placement.

That meant people could create kubelets that run wasm workloads as pods. This was how native Windows workloads were implemented. If I remember correctly, there are also kubelets that can run unikernels, microVMs, etc. The design pattern of label selectors remains the same, so the same concept can be used to place those specialized workloads.

I also learned something from an interview I once had ... their platform team came to the conclusion that every site is unique because every company will have its own specific needs and unique combination of technology. They (deliberately) don't use Kubernetes. They even crafted their interview process for that. It's a timed session to attempt to do something deceptively simple that turned out to be an ops nightmare. There are no internet resources describing how to do it. No one is expected to finish it within the time. Instead, they are looking at how someone works through the problem, as well as their emotional regulation while under time pressure.

Finally, I'd like to draw attention to a fallacy that has plagued modernity, and it is very applicable to computing design. This is from the field of political science, from a book written by James C. Scott called "Seeing Like a State". There is a blog post about the key concept -- legibility -- https://www.ribbonfarm.com/2010/07/26/a-big-little-idea-call...

The pursuit of legibility has led us to impose an overly simplistic view onto something complex instead of really understanding the complexity. In the case of governments, this imposition uses authoritarian power and results in fragile solutions. In the case of computing infrastructure, we create abstractions and are surprised when they leak.


Why is he mad at Kubernetes about Helm? Yes, Helm is a mess. Yes, lots of people use Helm to deploy things to Kubernetes, but critiques of Helm are not valid critiques of Kubernetes.


As long as there is no popular replacement for Helm, it will stay tightly coupled with Kubernetes. Not in a technical way, but in the way that you're always going to use it together with k8s. For now Helm is the only sane way to package a complex k8s deployment in a reusable way. Kustomize is great too, but quite limited in many ways.


Nice article ;)

> The number of rules may be large, which may be problematic for certain traffic forwarding implementations (iptables linearly evaluates every rule)

Kube proxy also supports ipvs out of the box, and some CNI (like Cilium) can also replace kube proxy and rely on eBPF.
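For reference, switching kube-proxy to IPVS is just a field in its configuration; a minimal sketch:

    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"   # default is "iptables"; ipvs avoids linear rule evaluation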

> When a Pod is moved to another node, the traffic will be forwarded twice until the old DNS entry expires

Not sure I understand this one. On a standard setup what happens is:

- Pod A is running on a node, receiving traffic from a service

- The pod is stopped by kubelet (which sends a SIGTERM to it).

- The pod should gracefully shut down. During the shutdown phase, only _existing_ connections are forwarded to the stopping pod; new ones are already forwarded elsewhere.

- If the pod stops before the terminationGracePeriodSeconds duration (default 30s), everything is fine. Otherwise, the pod is killed by kubelet. So it's up to developers to make sure pods handle signals correctly.
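To make those knobs concrete, a minimal (hypothetical) pod spec sketch showing where the grace period and a shutdown hook live:

    spec:
      terminationGracePeriodSeconds: 30         # default; kubelet sends SIGKILL after this elapses
      containers:
        - name: app
          image: registry.example.com/app:1.0   # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]   # small delay so endpoint updates propagate before shutdown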

"Misbehaving clients (eg: ones that do not re-resolve DNS before reconnecting) will continue to work" => the services IP is stable so clients don't need to re-resolve.

> Why does it not matter if the state is unstable? If I'm operating a cluster that can't settle, I'd like to know immediately!

Kubernetes exposes a lot of metrics, on the control plane components or kubelet, usually using the Prometheus format. Look for example at the metrics exposed by kube state metrics: https://github.com/kubernetes/kube-state-metrics/tree/main/d...

With controllers metrics + kube state metrics about most Kubernetes resources, you can easily build alerts when a resource fails to reconcile.

> Basically, Horizontal Pod Autoscaler but with sensors which are not just "CPU"

Take a look at KEDA, it's exactly this: https://keda.sh/ It "extends" the autoscaler capabilities. If you're running Prometheus you can for example scale on any metric that is stored in Prometheus (and so exposed by your application/infrastructure components: queue depth, latency, request rate...).
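A rough sketch of that with KEDA's Prometheus scaler (deployment name, Prometheus address and query are made up):

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: worker
    spec:
      scaleTargetRef:
        name: worker                    # Deployment to scale
      minReplicaCount: 1
      maxReplicaCount: 32
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring.svc:9090
            query: sum(queue_depth{queue="jobs"})   # any PromQL expression
            threshold: "16"                          # target value per replica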

Kubernetes was built to be extended like this. Same for your question "Why are the storage and networking implementations "out of tree" (CNI / CSI)?": in my experience, support is very good today across various cloud providers and on-premise infra components. Look at Karpenter for example, it's IMO a revolution in the Kubernetes node management world.



