Much as it burns me to admit this, for this use case, Jenkins is king. At <60 nodes it's perfect.
At a previous job, we had migrated from a nasty cron orchestration system to Jenkins. It did a number of things, including building software, batch-generating thumbnails and moving data about on around 30 nodes, of which about 25 were fungible.
Jenkins Job Builder meant that everything was defined in YAML, stored in Git and was repeatable. A sane user environment meant that we could execute as the user and inherit their environment. It has sensible retry logic, and lots of hooks for all your hooking needs. Pipelines are useful for chaining jobs together.
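For the unfamiliar, a minimal Job Builder definition looks roughly like this (job name, node label and script are made up; check the jenkins-job-builder docs for the exact keys your version supports):

    - job:
        name: nightly-thumbnail-batch
        node: batch-workers            # only run on slaves with this label
        triggers:
          - timed: "H 2 * * *"         # cron-style schedule
        builders:
          - shell: |
              ./generate_thumbnails.sh --since yesterday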
We _could_ have written them as normal jobs to be run somewhere in the 36k node farm, but that was more hassle than it's worth. Sure it's fun, but we'd have had to contend with sharing a box that's doing a fluid sim or similar, so we'd have had to carve off a section anyway.
However, Kubernetes to _just_ run cron is a massive waste. It smacks of shiny new tool syndrome. Seriously, Jenkins is a single-day deployment. Transplanting the cron jobs is, again, less than a day (assuming your slaves have got a decent environment).
So, with the greatest of respect, talking about building a business case is pretty moot when you are effectively wasting what appears to be more than two man-months on what should be a week-long migration. Think gaffer tape, not carbon fibre bonded to aluminium.
If, however, the rest of the platform lives on Kubernetes, then I could see the logic: having all your stuff running on one platform is very appealing, especially if you have invested time in translating comprehensive monitoring into business-relevant alerts.
Hi! Post author here! I agree that it's really important to be careful of "shiny new tool" syndrome -- one of my primary goals in writing this post was to show that operating Kubernetes in production is complicated and to encourage people to think carefully before introducing a Kubernetes cluster into their infrastructure.
As you say -- I think "we want to run some cron jobs" isn't a good enough reason by itself to use Kubernetes (though it might be a good enough reason if you’re using a managed Kubernetes cluster where someone else handles the cluster operations). A goal for this project was to prove to ourselves that we actually could run production code in Kubernetes, to learn about how much work operating Kubernetes actually is, and to lay the groundwork for moving more things to Kubernetes in the future.
In my mind, a huge advantage of Kubernetes is that Kubernetes' code is very readable and they're great at accepting contributions. In the past when we've run into performance problems with Jenkins (we also use jenkins-job-builder to manage our 1k node Jenkins cluster), they've been extremely difficult to debug and it's hard to get visibility into what's going on inside Jenkins. I find Kubernetes’ code a lot easier to read, it's fairly easy to monitor the internals, and the core components have pprof included by default if you want to get profiling information out. Being able to easily fix bugs in Kubernetes and get the patches merged upstream has been a big deal for us.
> A goal for this project was to prove to ourselves that we actually could run production code in Kubernetes, to learn about how much work operating Kubernetes actually is, and to lay the groundwork for moving more things to Kubernetes in the future.
Why wasn't the final sentence "and to re-evaluate if moving forward was even a good idea?"
Because I get nervous every time someone is relying on their patches to be included upstream, or they need to dive into the internals of something repeatedly. That screams "not production ready" to me.
After reading the post, Kubernetes did not sound at all like a slam dunk in terms of a solution, let alone a foundation for more mission critical infrastructure. The Jenkins solution offered by the parent sounds more reasonable, even with the objections you list.
Edit: Take my comments with a grain of salt, but from an internet armchair vantage point it does sound like Kubernetes was chosen first, and rationalized second. (Though I very much appreciated the thoroughness with which you went about learning the technology.)
Hello! I work at Stripe and helped with some aspects of the Kubernetes cron stuff -- maybe these answers can be helpful.
> Why wasn't the final sentence "and to re-evaluate if
> moving forward was even a good idea?"
I think that's sort of implied -- complex technical projects have a risk of unexpected roadblocks, and it's important that "stop and roll back" always be on the list of options. Never burn your ships.
We invested a (proportionally) large amount of engineering effort to ensure we had the ability to move the whole shebang back to Chronos ~immediately. As noted in the article, we exercised this rollback feature several times when particular cronjobs deviated from expected behavior when run in Kubernetes.
> Because I get nervous every time someone is relying on
> their patches to be included upstream. Or they need to
> dive in to the internals of something repeatedly. That
> screams "not production ready" to me.
This is the same basic model as distro-specific patches to the Linux kernel.
Every engineering organization reaches the point where they want more features than are available in an existing platform. The most practical solutions for this are to launch a new platform ("Not Invented Here"), or contribute code upstream. The first option can provide better short-term outcomes, but is usually inferior on multi-year timescales.
Consider that with a mature build infrastructure, internal builds are actually the latest stable release plus cherry-picked patches. This provides the best of all worlds -- an upstream foundation, with bug fixes on our schedule, and an eventually-consistent contribution to the community.
Julia is a visibility pro. When things scale, you need to be able to look inside the thing. If that's tough, that's :grimacing: for probably hundreds of developers. What a waste! /irony
I disagree that Jenkins is king for this. Jenkins is a single point of failure; it isn't a highly available distributed scheduler. It is a single master with slaves. While it is easy to configure Jenkins jobs with code (Job Builder, Job DSL, Jenkinsfiles), it is a pain to manage Jenkins itself with code. Plugins, authentication, all the non-job configuration, that is usually done via the GUI.
Saying Jenkins can be configured in a day, to the degree that Stripe configured Kubernetes (with Puppet), is disingenuous. It would take more than a day to do the configuration management of the slaves and get the right dependencies for all the jobs.
How do you isolate job executions in Jenkins? In Kubernetes each job is inherently isolated in a container. In Jenkins you have a bunch of choices. Do you only run one executor per slave? OK, but then you have a bunch of wasted capacity some of the time, and not enough capacity at other times. You could dynamically provision EC2 instances to scale capacity, but then you need a setup to bake your slave AMIs, and you have potentially added ~3 minutes to jobs for EC2 provisioning. You can run the jobs in Docker containers on the slaves, which will probably get you better bin packing, but that doesn't have resource management in the way Kubernetes does, so you could easily overload a slave (leading to failure) while other slaves are underutilized.
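For contrast, a rough sketch of what per-job resource management looks like as a Kubernetes CronJob (name, image and schedule are made up, and the exact apiVersion depends on your cluster version):

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: report-generator
    spec:
      schedule: "0 3 * * *"
      concurrencyPolicy: Forbid        # don't start a new run while the last one is still going
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: report
                  image: example/report:latest   # hypothetical image
                  resources:
                    requests:                    # the scheduler bin-packs nodes on requests...
                      cpu: "500m"
                      memory: "1Gi"
                    limits:                      # ...and the kubelet enforces the limits
                      cpu: "1"
                      memory: "2Gi"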
Doing Jenkins right is not easy; there are solutions to all the problems, but it isn't just "fire it up and it works".
Stripe was running Chronos before, which is a Mesos scheduler. So they have experience with distributed cluster schedulers. They were probably comfortable with the idea of Kubernetes.
They mention this as a first step to using Kubernetes for other things. So they probably wanted to use Kubernetes for other things, and this seemed like a low-risk way to get experience with it. Just like GitHub started using Kubernetes for their internal review-lab to get comfortable with it before moving to riskier things (https://githubengineering.com/kubernetes-at-github/).
> it is a pain to manage Jenkins itself with code. Plugins, authentication, all the non-job configuration, that is usually done via the GUI.
This is not true; all the configuration is scriptable via Groovy scripts. We run a bunch of Groovy startup scripts that configure everything post launch. There is an effort to support this better[1] by the Jenkins team.
> How do you isolate job executions in Jenkins? In Kubernetes each job is inherently isolated in a container.
We run one Docker container per build on Docker Swarm. Each build gets its own isolated/clean environment. There is no EC2 provisioning etc. We already own and maintain a Docker Swarm setup; we just run Jenkins and the Jenkins agents on it. I assume if you are using Kubernetes it would be a similar setup.
> Jenkins is a single point of failure; it isn't a highly available distributed scheduler.
I agree with this to an extent. If you are running Jenkins on a scheduler it can be rescheduled, but your in-flight jobs are dead.
> > it is a pain to manage Jenkins itself with code
> This is not true; all the configuration is scriptable via Groovy scripts. [...] There is an effort to support this better[1] by the Jenkins team
The link you gave confirms it, saying that managing Jenkins with code "require[s] you know Jenkins internals, and are confident in writing groovy scripts". Neither GUIs (like the one shown in your link) nor procedural languages (like Apache Groovy, still procedural even though its collection API is crippled for Jenkins pipelines) are very good for configuring software. Nor is an unreadable declarative language (like XML).
A readable declarative language (like YAML, as shown in your link) is the solution. Languages like Groovy were an over-reaction against unreadable XML in the Java ecosystem. The correct solution is to switch from an unreadable to a readable declarative language for configuring software.
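To make that concrete, something in the spirit of that configuration-as-code effort might look roughly like this (keys and values are illustrative, not an exact schema):

    jenkins:
      systemMessage: "Managed from code, not the GUI"
      numExecutors: 0                      # keep the master as a pure scheduler
      securityRealm:
        local:
          allowsSignup: false
          users:
            - id: admin
              password: ${ADMIN_PASSWORD}  # injected from the environment, not checked in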
> Languages like Groovy were an over-reaction against unreadable XML in the Java ecosystem. The correct solution is to switch from an unreadable to a readable declarative language for configuring software.
But to tackle your first point: K8s might be distributed, but it's not inherently reliable. Yeah, sure, people run it in production, but there are a myriad of bugs you bump into. I've lost clusters to tiny issues that ran rampant, something I've not had happen in other cluster or grid-engine systems.
If we are talking AWS, then having the Jenkins master in an auto-scaling group with decent monitoring sorts out most of your uptime issues.
The reason I say it'd take a day to configure Jenkins is because the jobs have already been set up in Chronos. It should literally be a copy-pasta job. All the hard work of figuring out which jobs are box killers, which can share, which are a bit sticky has been done already; all that's changing is the execution system.
What level of isolation are you after, and for what purpose? If jobs can't live on the same box, then that's almost certainly bad job design (yes, there are exceptions, but unbounded memory or CPU usage is just nasty). There may be a need for regulatory isolation, but containers are not currently recognised as isolated for that purpose.
The author made clear multiple times that they were using cron jobs as a test bed for Kubernetes, and they chose to “overengineer” because they’re looking to use Kubernetes for more and more of their needs over time. You’re kind of arguing against a straw man.
I think it’s actually a great example of how Stripe thinks about technology choices.
They’re interested in choosing fewer tools that are better built and can grow to solve more needs. And they’re evaluating tools not just by “time to complete X random project”, but by other longer-term heuristics like maintenance levels. And the best way to do that is to start using the tool for a single need, investing more time in learning/research than is required for the need itself—ensuring that it really is a solid, foundational solution—with the understanding that you’re choosing technology for the long run. Then continue to expand your use of the tool over time, reaping benefits on your initial time investment.
I read the article, I understand completely, and I've heard that argument before. That's why at my company we have three incompatible, half-arsed K8s clusters.
The point where you have to fix upstream bugs is the point where one says: fuck it, it's not stable enough, it's more trouble than it's worth. Let's use gaffer tape and move on. As for maintenance, without company buy-in for transplanting the _entire_ stack, it's questionable. And if there are only two people and you have to maintain an entire distributed stack, that smacks of pain.
Not if the benefits of running k8s outweigh the effort of kicking a few patches upstream. Further, if nobody kicks patches upstream, where exactly are our open source solutions coming from?
I would counter-argue with the times Jenkins has bitten me in the ass, but actually, most solutions will bite when you go deep enough.
Jenkins is an utter, utter arse, don't get me wrong. I would gladly pay for CircleCI for 90% of use cases. We have >90 Jenkins masters here (don't ask), all in various states of rot. All of them are unceasingly tedious.
However, for getting a script to run on a certain bunch of nodes, at a certain time, for given conditions, it's pretty simple (unless you have a fetish for the myriad of unstable Jenkins plugins).
K8s, however, isn't simple for that use case. If I had to read the code and then push changes _before_ it worked for my use case, I'd have dropped it like a bag of sick.
However, I do take your point that if no one pushes upstream then it's not very fun at all.
I've also previously used Jenkins for cron to pretty good effect (I like to call it "jcron"). The ability to define jobs in YAML and have them driven from your SCM is really awesome.
However, k8s does more than just schedule where pods run. It also ensures that they run with the correct security and availability constraints. Then you add in things like affinity (don't run this job on the same machine as that job, or only run jobs for this tenant on nodes assigned to that tenant), storage management (connect this job to this volume), networking (only let this pod talk to this service and the monitoring layer; don't let anyone connect to the pods running the job), and much, much more.
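For instance, the affinity and tenant-pinning bits I mean look roughly like this inside a job's pod spec (labels and values are made up):

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                workload: heavy-batch          # don't co-locate with other heavy batch pods
            topologyKey: kubernetes.io/hostname
    nodeSelector:
      tenant: team-a                           # only schedule onto this tenant's nodes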
Yeah, you can do that with Jenkins, or, like, just cron. I know, because I did it for 18 years before I had ever heard of Kubernetes.
But, just like I can reach for Django or Rails or whatever it is that Java programmers use these days to build my web application, I can lean on Kubernetes to build my infrastructure.
I estimate that leveraging GKE has saved me in the range of $400k in direct employee costs, not to mention time-to-market advantages. As we grow, I expect that number to go higher.
> I can lean on Kubernetes to build my infrastructure. ... I estimate that leveraging GKE has saved me [$BigMoney]
I'm very sympathetic to the view that jenkins, or something comparable, is viable and cost effective for a lot of shops if you're looking exclusively at direct project costs.
As you've pointed out, though, as a building block of Enterprise software the ability to scale out in, and across, multiple clouds consistently is an economic and development boon so powerful I don't think one should really be looking at k8s as just a microservice/deployment platform: it's a common environment-ignorant application standard. Picking and choosing per service whether you should be hosting in GKE, AWS, or on-premise, applying federated clusters, recreating whole production environments for dev... It's a gamechanger.
It's totally possible to fire up a new Jenkins solution in EC2, but as of a few weeks ago Kubernetes is click-and-go in all three major cloud providers. It totally reshapes how we're looking at development projects with suppliers, testing, etc, as we can create fictionalized shared versions of our production environment for development, integration, and testing. As an emerging industry wide standard we can demand and expect Kubernetes knowledge from third parties in a way a home-brewed Jenkins setup could never match.
Though its resource awareness is lacking, which is where k8s shines. Honestly I find combining Jenkins and K8s a relatively pleasant experience. The Jenkins kubernetes-plugin has gaps and issues, but with time it will mature. There's no reason you can't combine them to get the best of both worlds.
My current company keeps trying to cook up elaborate systems to keep certain deployments from happening while others are going on, and I couldn't recall ever having to solve this previously, which is odd, because of course this has been a problem before.
Yeah I was using my CI system to handle the CD constraints and it was so straightforward it hardly registered as work. I was setting up one build agent with a custom property and all the builds that couldn’t run simultaneously would all require an agent with that property. So they just queued in chronological order of arrival. Done. Next problem.
Depends on the nature of the cronjobs you're scheduling. If your cronjobs cannot run in parallel on the same node (or, more likely, you cannot trust that they can safely run in parallel on the same node, because somebody else wrote the job and didn't need your review or approval before deploying to the scheduler), then you need to restrict each Jenkins node to a single executor, and you cannot run more cronjobs in parallel than you have Jenkins nodes, or else those cronjobs will be delayed. Because Kubernetes enforces the use of containers, multiple jobs can be run on each Kubernetes node with no issues (by design).
Remember - if there's a one in a million chance of a collision, it'll happen by next Tuesday.
You provide a scalable infrastructure underneath your Jenkins install while not dealing with the issue of node/agent allocation. Plus, you get Kubernetes for your not-so-simple crons.
Been using Jenkins a bunch here and cronjobs are the only thing it does really nicely. We're thinking of switching to CircleCI for builds though (which has been a pain because no self-hosting), and I'm not sure Jenkins makes sense to keep as only a cronjobber.
Has anyone used Airflow for cronjobs? Is it a good idea or a terrible one?
I would argue that, while Stripe is going with a scratch build, this could be motivated by AWS's lack of a good managed Kube offering, which is changing in the next few months.
With a managed Kube offering, setting up Kube is much much easier than this jenkins setup you are suggesting. And, there's no overhead charge. Why would anyone go through the hassle of manually provisioning machines like you suggest when AWS/GCP will do it for you?
It's overkill in the same way using DynamoDB for something that only experiences a handful of writes every day is overkill; who cares? The scale is there if you need it, but it doesn't cost anything to not use it.
Setting up a K8S cluster isn't that hard actually.
From my experience, the hard part kicks in when dealing with stateful services which need to be associated with volumes.
Even with a managed cluster, you still have to solve that problem. Either you pre-provision disks or use dynamic volume provisioning.
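A dynamically provisioned claim is roughly this (name and storage class are made up, and depend on what your cluster offers):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: kafka-data-0
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: standard       # dynamic provisioning via a StorageClass
      resources:
        requests:
          storage: 100Gi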
Next is upgrading the K8S version. With a stateless service, it's a walk in the park to upgrade. With data volumes it's trickier, because you want to control the process of replacing nodes and ensure the data volumes get mounted and migrated to the new nodes properly.
Things get harder especially with stuff like Kafka/ZooKeeper, when pods get removed and re-balancing happens.
In other words, managed Kubernetes actually doesn't offer that much. You still have to plan carefully, and it doesn't magically solve every problem for you.
That's true, but I'm not sure if using Jenkins would avoid these problems you outline. And that's really the crux of what the OP is suggesting; that Jenkins or something smaller than Kube would have been a better choice.
That's a fair point. I agree that Jenkins will not solve these problems, and in fact it comes with its own problems anyway. I was arguing the sole point of setting up K8S.
Do you think GKE will support multi-cloud setups or hybrid scenarios at some point? For cost reasons we have to put some big servers off the cloud ...
I've often used Jenkins for this use case, and really appreciate how it scales to teams too. While it works well, there are lots of pitfalls too: logs filling up disks, lots of configs to tweak. I think you've just gotten past those issues so it's stable for your use case.
> If we could successfully operate Kubernetes, we could build on top of Kubernetes in the future (for example, we’re currently working on a Kubernetes-based system to train machine learning models.)
I wonder if it's feasible or worthwhile for someone to try to extract the task and batch processing code from Jenkins into a separate project. Perhaps the analytics too.
With a little work you could expand that out to make a Travis equivalent using the same code base.
Also remember this is Stripe, and they like to advertise through Engineering blogs (and they do that quite well to be honest).
I'm getting cynical here, but I'm sometimes wondering if they didn't specifically choose a cool shiny tool so that they can speak about it (and advertise through blogging).