What I am missing (often, in these types of articles as well as in actual production environments) is the fact that if you develop (infrastructure) code, you also need to test your (infrastructure) code. Which means you need actual infrastructure to test on.
In my case, this means network equipment, storage equipment and actual physical servers.
If you're in a cloud, this means you need a separate account on a separate credit card, and you start from there to build up the infra that Dev and Ops can deploy their infra-as-code on.
And this test-infrastructure is not the test environments other teams run their tests on.
If that is not available, automating your infrastructure is dangerous at best, since you cannot properly test your code. And your code will rot.
I found that Kubernetes + minikube (or a variant of that) is a fairly straightforward way to handle this. Teams / developers can easily set up a local testing environment, product owners can QA that way, etcetera.
This of course depends on your level of lock-in with various cloud environments.
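To give a sense of what "testing against a local cluster" can look like: a minimal sketch, assuming the official kubernetes Python client and a kube context named "minikube"; the namespace and the readiness check are just illustrative, not any particular team's setup.

    # Minimal smoke test against a local minikube cluster, assuming the
    # official kubernetes Python client (pip install kubernetes) and a
    # kube context named "minikube"; the namespace is a placeholder.
    from kubernetes import client, config

    def deployments_ready(namespace="default"):
        # Load credentials from ~/.kube/config, pinned to the minikube context
        config.load_kube_config(context="minikube")
        apps = client.AppsV1Api()
        for dep in apps.list_namespaced_deployment(namespace).items:
            ready = dep.status.ready_replicas or 0
            wanted = dep.spec.replicas or 0
            if ready < wanted:
                print(f"{dep.metadata.name}: {ready}/{wanted} replicas ready")
                return False
        return True

    if __name__ == "__main__":
        print("ok" if deployments_ready() else "deployments not ready")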
IaC tools often handle more than kubernetes, but agreed that k8s is a fantastic way to get reproducible behavior which is absolutely imperative for testing.
This is kinda why I love Google Cloud and don't see myself moving to another cloud provider until they match GKE. I want all developers to throw everything into GKE, and Operations manages only the VPCs, firewalls, etc. Developers get complete ownership over compute (and networking within the cluster) while broader network management can still be managed by an operations team.
Yup - that works pretty well. And gives developers some insight into what is required to get stuff working.
It does assume no hardware or complex networking needs to be handled.
And there is the point of observability. When there is a proper testing ground for developers that mirrors production, it enables developers to dig into and mess with logging, tracing, and debugging of all sorts.
This adds value by providing developers insight into what a reliability engineer (or whatever they call sysadmins these days) needs to provide whatever service it is that the developers' code is part of.
Why would you need a separate credit card? It’s easy enough to set up separate accounts under an Organization with shared billing and with rules that work across accounts.
Because I want to limit the impact of maxing out a credit card to one environment.
And I want engineers to be able to futz about with all cloud services available, without having to worry about any negative impact on production.
And finally: What happens when $cloud_provider makes changes to the accounts interface and you want to mess around with those new features, without hitting production?
Give your future-self a break, and make sure you can futz around on any and every layer.
Another common practice is using separate domain names. Don't use 'dev.news.ycombinator.com'. Instead, use 'news.ycombinator.dev'. This frees you up for messing around with the API of the DNS provider. And when switching DNS providers, you can test whatever automation you have in place for this.
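To make that concrete, purely as a sketch, assuming Route 53 via boto3; the hosted zone ID, record name and IP are placeholders for whatever throwaway test domain you own:

    # Upsert a test record in the throwaway zone; safe to run repeatedly.
    import boto3

    route53 = boto3.client("route53")

    def upsert_record(zone_id, name, target_ip, ttl=60):
        # UPSERT creates the record if it is missing, updates it otherwise
        return route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={
                "Comment": "test-environment record managed by automation",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": name,
                        "Type": "A",
                        "TTL": ttl,
                        "ResourceRecords": [{"Value": target_ip}],
                    },
                }],
            },
        )

    # upsert_record("Z0EXAMPLE", "api.news.ycombinator.dev.", "203.0.113.10")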
> Because I want to limit the impact of maxing out a credit card to one environment.
Just because you maxed out the credit card doesn’t mean that you don’t still owe the money if you go over. That’s what billing alerts are for.
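For what it's worth, on AWS that can be as simple as a CloudWatch alarm on the estimated-charges metric. A rough sketch, assuming boto3, billing metrics enabled, and a pre-existing SNS topic; the threshold and ARN are placeholders:

    # Billing alarm for the test account; billing metrics only live in us-east-1.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="test-account-estimated-charges",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,          # billing metrics only update a few times a day
        EvaluationPeriods=1,
        Threshold=500.0,       # alert once the test account passes $500
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:111111111111:billing-alerts"],
    )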
> And I want engineers to be able to futz about with all cloud services available, without having to worry about any negative impact on production.
> And finally: What happens when $cloud_provider makes changes to the accounts interface and you want to mess around with those new features, without hitting production?
> Give your future-self a break, and make sure you can futz around on any and every layer.
That’s what the separate accounts are for but you don’t need a separate card and you still should be using an organization.
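Roughly what that looks like with the Organizations API, as a sketch, assuming boto3 and that this runs in the Organization's management account; the e-mail address and account name are placeholders:

    # Create an isolated test account that still rolls up to shared billing.
    import boto3

    org = boto3.client("organizations")

    resp = org.create_account(
        Email="infra-test@example.dev",
        AccountName="infra-test",
        IamUserAccessToBilling="DENY",
    )

    # Account creation is asynchronous, so poll the returned request ID.
    request_id = resp["CreateAccountStatus"]["Id"]
    status = org.describe_create_account_status(CreateAccountRequestId=request_id)
    print(status["CreateAccountStatus"]["State"])  # IN_PROGRESS / SUCCEEDED / FAILED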
> This frees you up for messing around with the API of the DNS provider. And when switching DNS providers, you can test whatever automation you have in place for this.
Why isn’t your DNS provider AWS with Route 53 where the separate domains would be in separate accounts with separate permissions and separate access keys/roles per account?
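The per-account roles part of that setup looks roughly like this (a sketch, assuming boto3/STS; the account ID and role name are placeholders):

    # Assume a role in the account that owns the test domain, then talk to
    # its Route 53 with the temporary credentials.
    import boto3

    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::222222222222:role/dns-admin",
        RoleSessionName="dns-automation-test",
    )["Credentials"]

    route53 = boto3.client(
        "route53",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print([z["Name"] for z in route53.list_hosted_zones()["HostedZones"]])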
Not large by absolute standards, sure, but large enough to cause issues.
I’m sure there’s some kind of solution that involves re-architecting the ES cluster and indices and re-architecting the data flows and stuff. But if our options are go through all that, or seriously slim down our architecture and costs by just running Sonic + our data warehouse, I’m definitely going to give it a go. After all, worst comes to worst we can go down the re-architecting ES route if Sonic doesn’t work out.
I’d be curious what your expectations and constraints are, but from my experience of running clusters in the double-digit TB range, my ballpark figure for that amount of data would be 2 medium-size data nodes and a small tiebreaker. Alternatively, if you can live with the reduced resilience and availability, even a single node might just do. Depends on the expected churn though, ES really does not like document updates.
That does not sound like a good idea. You can't even maintain a quorum of 2 replicas with n=3 on a cluster like that. Losing one data node would be disastrous.
That’s really not how ES replicas work. The quorum is formed among the master-eligible nodes (hence a tie-breaker) and is only required to elect a master. The elected master designates a primary shard and as many replicas as you configure. However, replica shards are replicas and may lag. There’s no read quorum or reconciliation or anything happening. If a primary fails, an (in-sync, depending on the version of ES) replica is auto-promoted. The master keeps track of in-sync replicas, and you can request that writes land on a number of replicas before a write returns, but still, no true quorum.
You can absolutely run 2 data/master-eligible nodes plus a single master-eligible tie-breaker node as a safe configuration. The only constraint is that you should have an odd number of master-eligible nodes to avoid a split brain. You also need to understand what the resilience guarantees are for any given number of replicas (roughly: each replica allows for the loss of a single random node) and how many replicas you can allocate on a given cluster (at most one per data node). That allows you to run a 2-data-node cluster in a configuration that survives the loss of one node.
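As a concrete sketch of the index side of that, assuming the 8.x elasticsearch-py client; host and index name are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # One replica per shard: the primary lives on one data node and the
    # replica on the other, so losing a single data node loses no data.
    es.indices.create(
        index="search-v1",
        settings={"number_of_shards": 1, "number_of_replicas": 1},
    )

    # You can ask that the primary plus its replica acknowledge a write
    # before it returns - but as noted above, that is not a true quorum.
    es.index(
        index="search-v1",
        document={"title": "hello"},
        wait_for_active_shards="all",
    )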
I was not saying that's how they work. Most prod clusters are configured for high availability and multiple replicas. They have at least 3 nodes and 2 replicas configured for the index. Sure you can run this configuration, but do you really want all your traffic to hit this one instance when the other one goes down?
> Most prod clusters are configured for high availability and multiple replicas
I've been doing ES consulting and running clusters since 0.14 and I see very few clusters that run more than a single replica. Most 3-node clusters I see run three full nodes simply because you then have three identical configurations, at the cost of throwing more hardware at the problem.
> but do you really want all your traffic to hit this one instance when the other one goes down
Whether that's a problem really, really depends on whether your cluster is write-heavy or read-heavy. Basically all ELK clusters are write-heavy, and in that case, losing one of two nodes would also cut writes in half (due to the write amplification that replicas cause). Other clusters just have replicas for resilience and can survive the read load with half of the nodes available. Whether that is the case for your cluster(s) depends - that's why I explicitly asked what constraints the OP had.
I’ve run quite a few clusters on such a configuration, or alternatively 3 data/master-eligible nodes. It’s a safe configuration unless you manage to overload the elected master. But if you’re fighting that issue, you’ll have to go beyond 4 nodes and have a triplet of dedicated master-eligible nodes plus whatever data nodes you need.
I pretty much specifically avoid 4-node clusters. You’d have to either designate 3 of the four nodes as master-eligible with a quorum of 2, or have all of them master-eligible with a quorum of 3. Both options allow for the failure of a single node before the cluster becomes unavailable. Any other configuration would either fail immediately on a node loss (quorum of 4) or be unsafe (quorum of 2, which allows for a split brain).
I’d much rather opt for 4 data/master eligible nodes plus a dedicated master eligible node with a quorum of 3.
You also need to pick the number of replicas suitably: each replica allows for the loss of a single random(!) data node while retaining all your data. Note that if losses are not random but you want to safeguard against the loss of a rack or an availability zone or such, configurations are possible that distribute primaries and replicas suitably (“keep a full copy on either side”).
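For the rack/zone case, the relevant knobs are the allocation awareness settings. A sketch, assuming the 8.x elasticsearch-py client and that the nodes were started with node.attr.zone set to zone_a or zone_b; the attribute values are examples:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.cluster.put_settings(
        persistent={
            # Spread primaries and replicas across the "zone" attribute ...
            "cluster.routing.allocation.awareness.attributes": "zone",
            # ... and force one full copy per zone ("keep a full copy on either side")
            "cluster.routing.allocation.awareness.force.zone.values": "zone_a,zone_b",
        }
    )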
Those can last for decades, however. A profitable family business is less likely to get bought out and fire everyone. No one's job is safe in an unprofitable business, family or not.