CircleCI Down

oxff · on April 6, 2022

> Update - We are investigating multiple possible causes, including database changes and code changes.

Sounds like they haven't got the first clue about what is causing it.

brightball · on April 6, 2022

I'm not all that surprised. A friend saw a phishing email that was imitating them because they lacked a DMARC record. Sent them explicit instructions on how to fix it by adding a DMARC policy and all they did was create a p=none record that doesn't prevent direct imitation. That's definitely the first step, but eventually you need to turn it up to p=quarantine for it to do you any good and it's been a while (several weeks). Shouldn't have needed a random user to point it out in the first place.

I just don't have a tremendous amount of confidence that they take their infrastructure seriously at this point.

tedmiston · on April 6, 2022

To be fair, DMARC quarantining is actually a pain in the ass and will likely break things for people outside of engineering or IT. In a growing or big company, there are always more and more legitimate emails from third-party senders added all the time.

I agree that reviewing is the first step, but not everyone needs to take further steps. And I highly doubt CircleCI is unique here. I think it's a massive leap to conclude "lack of confidence in taking their infrastructure seriously" from not knowing the reason why they haven't flipped the switch from none to reject or quarantine.

Technically sophisticated users know that email spoofing is already rampant and to watch for signs of it in their email client. I'm not saying it's not a good idea, but that flipping the switch is not that simple and comes with significant downsides in a company with many services and users.

IMO I think going to the next level with DMARC is usually more of a prioritization or cost-benefit analysis type decision than a competence once.

brightball · on April 6, 2022

Everyone absolutely needs to take the next step. Without it, you're inviting direct phishing against your user base.

For an core devops tool, that's not okay.

tedmiston · on April 8, 2022

I don't disagree with you about the value of its security benefits from a technical perspective. But if you tested this against the top 100 websites to see how many have actually implemented it... well, I'd be curious to see the results.

vageli · on April 8, 2022

This may satisfy your curiosity: https://dmarc.org/stats/alexa-top-sites/dmarc/

ShakataGaNai · on April 6, 2022

So they did the thing they were recommended but didn't take some further steps, on this one issue. Clearly that means they are totally incompetent? Even though the people dealing with DMARC issues are probably IT & Marketing, not the DevOps & Engineering people who are running the product.

brightball · on April 6, 2022

A p=none record is barely different from not having a record at all...and yes at this point a tech company without an enforced record is a major red flag. It's been a decade since the standard went public, it's required at the federal level already and in many EU countries it's being mandated for businesses in general.

Most 3rd party senders today already insist that you setup DKIM as part of your setup process and if that happens, you're going to pass a DMARC check. It's hard to setup for older companies with thousands of servers in their own data centers that are each individually sending email. Cloud native companies sending their email through a few 3rd parties like Sendgrid/Postmark or a newsletter tool are EASY to setup.

I'm mentioning this on a post about their infrastructure being down for 6 hours because yes, it's related. Email delivery for the primary domain is absolutely an IT, Engineering, Operations and Security problem, not a marketing problem. It goes directly to the application especially when one of the main facets of the application is to send emails about your repos and login credentials.

Blame shifting it to the marketing department does not hold up.

When multiple people are commenting on this post about just how frequently their outages are happening it shows a problem in the overall infrastructure mindset for it to continue. Maybe they know exactly what the problem is and somebody higher up is keeping them from fixing it in order to prioritize other things.

Either way, for company that's supposed to be providing a core devops function to have outages that frequently as well as making it dead simple to spoof email that looks like it's coming straight from them...it's not a good look.

XCSme · on April 6, 2022

Just linking the short guide I wrote (mostly for myself) to help with email auth stuff: https://www.uxwizz.com/blog/stop-others-use-your-domain-emai...

ismayilzadan · on April 6, 2022

It has been going on for almost 6 hours now. It does feel like they haven't got a clue.

onion2k · on April 6, 2022

I don't think you can say that. They might know what happened, but it could still be hard to recover from.

Catastrophes happen. If you deploy something that has a destructive migration you can't easily roll back without reverting to a backup and there's a problem that you've not seen in QA then you're in for a bad time. This is compounded if you also discover your backup process hasn't worked properly for a while. If that happens you're facing some serious downtime, and the dilemma of either trying to fix the problem, or trying to rollback to the last working backup.

There's a good reason why grumpy old devs like me insist on writing docs, having playbooks, testing everything including non-code stuff, and we still fear major deploys. I have scars from exactly those sorts of disasters.

Hopefully the devs at Circle get past this with as little stress as possible, and they learn from what went wrong.

ismayilzadan · on April 6, 2022

You are totally right, catastrophes happen and I also wish that they get past this with as little stress as possible. The whole reason of my assumption was the lack of description in their updates for the incident that is going on for 6 hours. Maybe little more detail would give me a hint that everything is under control, but I didn't feel that when I read their updates.

plumefar · on April 6, 2022

When facing such large scale issues, communicating properly is very hard: Several teams might be investigating several possible root causes in parallel, and you might change your mind over time as to what is the most probable root cause.

So you might end up communicating something ("we think it comes from X, we're fixing it that way"), just to find yourself changing you mind a few minutes later.

Changing your message is usually not well perceived, even though that's actually normal during an investigation.

I would not like to be in charge of the communication. Finding the balance between saying too much or too little is tricky.

a1445c8b · on April 6, 2022

It’s probably just because they had the choice of either focusing all their energy on fixing the problem asap or setting aside some of it to write a more detailed description that’s also fit for public consumption. Given the severity, they probably chose the former since whatever descriptive, reassuring description they put out there isn’t going to be actionable anyway.

z00b · on April 6, 2022

Hi folks. As the CircleCI CTO, I appreciate your patience here and all the feedback. It's true that we are focused on getting customers moving again over sharing more detailed information, but will aim to do better in providing a bit more in our updates. status.circleci.com provides real time updates for both how we're tackling outages and more detailed incident reports. We will post more information there about this incident once we are on the other side and have comprehensive detail.

nerdjon · on April 6, 2022

I still fail to see the heavily opinionated appeal of CircleCI over running a dockerized Jenkins instance (and agents) in AWS. (Or GitHub Actions or any other managed CI environment)

We get all the customization I want and it scales just fine.

But I am also still annoyed that when CircleCI announced their templates, they did not offer you the ability to have private templates (or something along those lines, it left a bad taste in my mouth and we moved to Jenkins a couple months later)

tauntz · on April 6, 2022

There are _a lot_ of companies/teams that don't need/want to manage the complexity that comes with setting up and configuring generic CI environments. They want to get going as fast as possible and never have to think about what's going on behind the scenes - that's especially true in some specific areas. For example mobile apps. 90% of people aren't too happy about managing all the complexity that comes with having to juggle Xcode versions or Android emulators etc in a scalable way.

..at least that's my experience and why I founded an extremely opinionated CI service for mobile apps years ago :)

rhodysurf · on April 6, 2022

This is exactly why we use it. We legit don’t have time to manage more in house services and need CI. Eventually we probably move to GitHub actions but that isn’t free for orgs.

nerdjon · on April 6, 2022

I will admit that needing to build mobile apps in CI is one particular challenge I have never needed to deal with.

But I can see that specific use case being a good one for an opinionated CI environment.

I largely work with Docker containers and Serverless. 2 areas that I don't see any advantages going outside of Jenkins (and see many downsides)

tedmiston · on April 6, 2022

I used CircleCI a long time ago, just before and after 2.0, and thought it was fine.

These days GitHub Actions is awesome and full featured and does everything I want though without getting in the way. I doubt I'd use anything besides GHA unless I wanted to decouple from GitHub as a dependency.

Even then, I wouldn't be surprised if someone else hasn't already come up with an open source or local runner for the GHA yaml files as a stopgap.

watermelon0 · on April 6, 2022

Two downsides of GHA that I could find:

- you are limited to 2 core / 7 GB instance, whereas CircleCI offers up to 16 cores / 64 GB (for example, building software usually scales proportionately to machine size, and this can in theory be up to 8x faster on CircleCI)

- no support for ARM instances

tedmiston · on April 8, 2022

The details you listed sound accurate for the built-in GitHub-hosted runners.

GHA also supports bringing your own self-hosted runners [1][2] where you install their agent, so you could, e.g., use an ARM instance on AWS with tons of cores. It looks like CircleCI offers this as well.

[1]: https://docs.github.com/en/actions/hosting-your-own-runners/...

[2]: https://github.com/actions/runner

Nextgrid · on April 6, 2022

I wonder how much it's going to take before people realize that maybe a single server somewhere in the office running Jenkins isn't that bad of an idea after all. Unless you're Google, "scale" will inherently not be a problem, and risks of operator error can be reduced by scheduling maintenance at times where an accidental outage won't impact your business.

justin_oaks · on April 6, 2022

I agree with the sentiment that people should evaluate whether or not they need an external service to run their builds.

That said, there are a number of reasons to not use the Jenkins server in the office:

1) Someone on staff needs to maintain it.

2) A single hardware failure can cause significant downtime.

3) Your office internet service may have limited bandwidth and be a bottleneck for your build or artifact deployment

4) Having your server on-site may be considered a security risk.

I'm not saying that a server in the office is a bad idea, I'm just saying that each business needs to consider the advantages and disadvantages. I'm sure there are those who could get by just fine with a server in the office.

CSSer · on April 6, 2022

Not to change the subject but point one is such a sticky point that a lot of the non-technicals or semi-technicals get overly hung up on in my experience. It's a very valid concern, and personally in this case I would say even more so because the potential pitfalls of maintaining physical hardware are no joke. However, I keep running into situations where that point is touted out for even compromising, intermediary alternatives like setting up your own build pipeline on a cloud provider or even using a cloud provider instead of a niche SaaS solution at all.

Forgive me if this is just "one of those problems" everyone has to deal with. I'm admittedly still pretty young and naive, and I've found myself in a mentorship/leadership adjacent role at a fast growing company where my relatively small team has a vastly deeper well of technical knowledge and aptitude than the rest of the organization. So I do wonder sometimes if I'm wrong about it, but my intuition, and it seems that of GP's too it seems, drives me to spring to this kind of solution (that we should be doing whatever it is ourselves because it's mission critical for what we do) often and it really gets to me that it's not so readily heard.

It's somewhat ironic because a big part of the point of HN in general is building and promoting these kind of progressive services, but the longer I'm in the industry and the more advanced my knowledge and responsibilities get, the more I keep finding that it's really hard to communicate the gaps in between the marketing claims these SaaS tools promote to higher-ups who don't see the incurred technical debt and wasted time working around their shortcomings. I've found this to be especially true when the issues are ones that one only bumps up against after rigorous use or problems that decision makers try to address by encouraging the addition of yet another third party tool to the mix.

justin_oaks · on April 6, 2022

You make valid points.

I cited problems with hosting something on-site, but didn't list the problems of integrating with a service too.

When you integrate with a service you need a member of staff to maintain that integration too. Of course we hope that the integration is easier than maintaining a server but that is not always the case.

You also take on the risk of the SaaS being flaky, like we see with CircleCI.

The SaaS can always raise their prices or go out of business, requiring your company to switch off of that service.

Maybe the SaaS is a security risk. What happens to your company data and builds if/when they get breached?

There's no silver bullet, there's just a set of trade-offs. You get to decide which is best for your use-case.

CSSer · on April 6, 2022

> When you integrate with a service you need a member of staff to maintain that integration too.

I really like the way you worded this. I think in hindsight it's obvious but you conceptualized it in a way that I've found hard to communicate to others so far. I see now that this is probably because I've been so zeroed in on my own idiosyncratic technical issues with a given service when applicable rather than expressing it through the criteria the people I report to must consider when evaluating solutions brought to them, so thank you for that!

egwor · on April 6, 2022

And also consider the cost of running it there too (heat/energy/noise/space costs)

justin_oaks · on April 6, 2022

Good point. Those are very real problems too, especially in small office environments.

You don't want that server in some spot where someone can accidentally kick it and suddenly the server goes offline. You also don't want it in an unventilated closet where it'll overheat.

As for noise... in this day and age of open office spaces and chatty coworkers, you'll be lucky if the server noise is your biggest problem. Unless everyone is working from home, and now the problem is physically accessing the server if it goes down or needs other physical maintenance.

softfalcon · on April 6, 2022

Folks are likely going to say they can't do this "at scale". However, I've been in both the "small shop with Jenkins running on a Dell blade" and also in the "mega-corp with lots of devs everywhere using all kinds of CI pipelines". I think there is merit in what you're saying, but I suggest an addendum.

In both cases, have seen the "CI is down" issues crop up.

There seems to be a middle ground that I have yet to fully do, but have seen folks do and heard it works well:

Make it so folks can deploy their CI pipeline anywhere. In my case, I would create Docker containers that can run the CI pipeline, pipe the output/result to anywhere you want and have it be able to run anywhere.

The goal is that you could deploy it on a VPS somewhere, or on a Dell blade in your closet, or even run it on a dev's machine. The idea is that you want to reliably be able to run the CI steps anywhere without jank while following the full process the same as if it was running on your dedicated job runner solution. Redundancy, if you will.

I think the idea isn't necessarily "Jenkins in your company closet is better" and more of "how easily can you setup your CI pipeline to get your work done?"

ocimbote · on April 6, 2022

To operate properly at the scale of hundreds of engineers pushing changes, you absolutely need a cluster of machines and a team to operate it.

It is significantly cheaper and more efficient to pay a third party.

XCSme · on April 6, 2022

I personally really dislike remote building. It's so much faster to do everything locally, even more if you can do it on your own machine (machine which is likely to be very powerful, close to the state-of-the art performance, as the tech geeks we are).

xyst · on April 6, 2022

in some cases that single server is bob's 2016 MBP because management is too cheap to buy a dedicated server

justin_oaks · on April 6, 2022

Extra points if management is too cheap to buy a dedicated server, but is willing to pay for 3rd party services.

aftbit · on April 6, 2022

CircleCI used to be an absolutely awesome way to set up CI. The default images just worked for the vast majority of cases. Sure, it always took a bit of fiddling with YAML, but IMO was way ahead of Jenkins et al when it came out.

Then they reworked their YAML format to enable pipelines (and make everything way more complicated, and break all our existing flows). Then they reworked their images to make them more efficient or something, but now there is almost never an image that has everything I need, so I have to build my own image in docker (with yet another CircleCI job). Now that Github Actions exists, we have been slowly migrating everything off CircleCI. Too bad they lost the simplicity that brought us there in the first place.

proxysna · on April 6, 2022

My monthly Circle CI downtime, yay.

hnlmorg · on April 6, 2022

I wish it was only monthly. We have a Slack alert whenever there is reported downtime and it goes off at least once a week. Though granted it's seldom as business impacting as this current outage.

The problem CircleCI faces is that most hosted VCS now support CI/CD tools, as do most enterprise clouds. These will all have better integration with most peoples systems because they'll already be using the VCS or the public cloud (and if you're not using the public cloud you'd likely favor Jenkins / Concourse / etc over a cloud CI solution). So CircleCI's relevance is constantly being eaten at. The last thing they need it to damage their own reputation with these constant outages. I really hope CircleCI can turn around as it's great to see some competition but at this point in time I'm not feeling to optimistic about their long term future.

kldx · on April 6, 2022

In my experience, GitHub actions is equally flaky

frays · on April 6, 2022

This was the final straw. I'm switching over to Buildkite now.

carimura · on April 6, 2022

I'm switching to X and will be happy until either,

a) X gets large enough to matter and also has scale issues.

b) X gets acquired by Y and sends a "what a great journey we're so excited that X will have Y's resources" email and then inevitably becomes just another forgotten tool under Y

/s sorta

x86_64Ubuntu · on April 6, 2022

>...inevitably becomes just another forgotten tool under Y

Not before Y cuts all development to X. It doesn't always happen, but if you see Embarcadero buying a piece of your tech stack, find an alternative, immediately.

kawsper · on April 6, 2022

We've been with Buildkite since 2016, it's been a great experience, I really like the tool, and their team is amazing!

bpicolo · on April 6, 2022

Jenkins is the most flexible system I've worked with, but Buildkite has been great and just gets the job done reliably. Big fan.

mr337 · on April 6, 2022

Welcome to greener pastures :)

Been on BK for about 6mo now and still loving it.

keithpitt · on April 8, 2022

Oh hai!

thinkindie · on April 6, 2022

to all those saying "why you don't spin a GitLab CI instance" - we are a small team, we want to focus on shipping code that adds value to our customers, not maintaining something that has been largely commoditised.

PrimeDirective · on April 6, 2022

Lately, it's been down almost weekly. Not a fan of these types of services myself, but we do use it at work.

folkrav · on April 6, 2022

> Not a fan of these types of services myself

What do you mean by "these types of services"?

capableweb · on April 6, 2022

CircleCI is a CI/CD solution (ala SaaS) that you don't host yourself. Many (myself included) prefer to host mission-critical services ourselves to avoid untimely downtime.

rorymalcolm · on April 6, 2022

Hosting any significantly scaled CI/CD system is very difficult work and since every team in your organisation usually touches it you often have a large blast radius. I've seen downtimes take a long time to recover from even when self hosted, compliance is also hard. It's a good candidate for SaaS IMO.

tapoxi · on April 6, 2022

I've been the owner of our ~90 user GitLab deploy and its been mostly painless over the past two years, we have it installed on an autoscaling GKE cluster subscribed to the 'stable' channel. I helm upgrade it monthly.

hotpotamus · on April 6, 2022

You have a strange way of saying "self-hosted" ;) But it sounds like you've found a happy medium. This is a subject I may need to look into the in relatively near future.

capableweb · on April 6, 2022

As mentioned in a sibling comment: Yes, self-hosting comes with it's own problems, there are no silver-bullets in this industry.

However, I'd still argue that choosing when downtime can happen is important when you're pushing out larger changes to larger organizations. You don't want to be in the middle of a borked migration when your CI/CD service craps out and you can't rollback/push more updates or even restore backups as it was all automated via your CI/CD service.

folkrav · on April 6, 2022

I know what CircleCI is. Just wanted to understand what part of it you were referring to, which seems to be SaaS in general. I'm honestly not really convinced about self-hosting really avoiding untimely downtime, but whatever works for you and your team.

E.g. I've worked in a business who self-host their Gitlab instance, there was a non-negligible amount of work for backups/upgrades on top of troubleshooting performance issues once in a while, amongst other things in the same vein.

capableweb · on April 6, 2022

The part about "untimely downtime" is about that CircleCI decides themselves when to push updates (which is the most common reason services has downtime), instead of you deciding when to upgrade/push updates. If you have a big migration/change coming up, you'd put pause on upgrading the CI/CD service as you don't want to muck with it while pushing out other organization-wide changes.

Granted, self-hosting comes with it's own share of problems too, no solution is a silver-bullet without any issues, but being able to "freeze" things to a stable mode helps to stabilize other processes.

macinjosh · on April 6, 2022

Every service will have down time. It is about the locus of control and agency. When I self host I can decide when to apply updates and do maintenance. When something does go wrong I can dive in, fix the problem, and move on with my day.

Most people don't want any responsibility though so they do everything they can to push work off to a SaaS or cloud provider. That way when the SHTF they can browse reddit and point their boss to the status page.

folkrav · on April 6, 2022

> Most people don't want any responsibility though so they do everything they can to push work off to a SaaS or cloud provider. That way when the SHTF they can browse reddit and point their boss to the status page.

Phew, those are a lot of assumptions. Many businesses choose cloud services cause they take the tradeoff between that cost or having someone at hand to maintain the infrastructure, depending on its complexity. I've worked with employers using many cloud providers for years now, the amount of times I would _legitimately_ be prevented from working at all by outages is extremely low.

The agency part, I do get. But come on.

stepanhruda · on April 6, 2022

They offer to self-host by the way.

shdh · on April 6, 2022

Hands down the worst CI/CD platform out there.

GitLab CI #1 in my opinion.