Large-scale refactorings are actually pleasant. My team owns a framework and set...

nradov · on Dec 10, 2017

I attended a talk by one of the Google Guava (Java collections library) authors and he told us how they didn't have to worry about maintaining backward compatibility at all. When they made a breaking change they could check out all of the impacted Java code across Google, refactor it, verify that the tests still passed, and then commit everything in one shot. It's easy to understand the productivity advantages.

matttproud · on Dec 10, 2017

One challenge is latency in the generation of codebase and identifier and callgraph search index (Cf., Code Search and Kythe). We can perform global tests across the entire monorepo, but that takes time. What happens if someone introduced new usage of old API immediately before our atomic refactoring, and what about pathological tests or flakes? This still necessitates doing some cleanups multi-stage: (1.) mark old API as deprecated (optionally announce), (2.) replace and delete legacy usages, and (3.) deletion of final trailing usage sometime soon thereafter once the codebase has been reindexed.

Some languages and ecosystems are more tolerant of this problem than others. That said, incremental cleanup still has advantage with bisecting regressions.

As I said, it is not perfect but broadbased change quickly is relatively easy.

In my time maintaining open source, I never had these luxuries, which is why I said the monorepo is infinitely easier. Another consequence: if global cleanups are easy, perhaps that reduces the barrier to experimentation. Perfect is no longer the enemy of the good and the good enough. For me, I felt in open source where I had zero control over dependent code and its callgraph, the reverse was true: hesitance to publish something for fear of cost.

user5994461 · on Dec 10, 2017

Facebook gave a conference not long ago about then doing the same thing.

Interestingly, they only do that for java code. Java has good analysis and refactoring tools.

danabramov · on Dec 10, 2017

>Interestingly, they only do that for java code.

Not sure what you mean. I work at Facebook, and can confirm we keep all code in a monorepo (or, rather, one of two big monorepos) rather than just Java code.

This lets us easily do React API changes: we can deprecate an API internally, and update all JS code that references the old APIs in a single commit.

user5994461 · on Dec 11, 2017

I meant the refactoring. I've only ever seen it work on Java.

Other languages are much harder to process.

huherto · on Dec 11, 2017

> Interestingly, they only do that for java code. Java has good analysis and refactoring tools

I can definitely see this.

Aissen · on Dec 10, 2017

You can do that with a many-repo, provided you have the right tooling. In fact, I'd argue Google's advantage is in the tooling they built around the repo, not the monorepo itself. E.g how fast you can find all your dependents in the whole repo.

boulos · on Dec 10, 2017

Not really, as there isn't the ability to atomically commit your changes. With 70000+ full-time employees code is getting checked in all the time. Atomicity is extremely valuable.

kadenshep · on Dec 10, 2017

It's like the people commenting about this forget what distributed means. You can have multiple repos, but you can still have a gate way/"source of truth" repo. You can run tests and whatever else on it just like you do in a "monorepo." The power behind Google's/Facebook's choice isn't and will never be the mono repo. It's specifically the tooling they built around their choice.

yoz-y · on Dec 10, 2017

You can. You have many repos and one top level repository with submodules.

Edit: And although this is a multi step process, it still allows you to de-couple modules and work on them separately.

sah2ed · on Dec 11, 2017

The 70,000+ number you cited includes engineers and non-engineers but surely, the actual number of engineers that need commit access will be much lower.

nickm12 · on Dec 10, 2017

Infinitely worse? That's a bit hyperbolic.

InclinedPlane · on Dec 10, 2017

I agree, though I wouldn't say this is a fundamental aspect to distributed systems, mostly just a consequence of git being built with terrible merge and merge conflict tooling.

Groxx · on Dec 10, 2017

What changes can you do that you couldn't have done before, in a multirepo world? And I do mean could not have done - clearly the monorepo enjoys a few hundreds of thousands (millions?) of hours of effort that the multirepo did not.

i.e. what stops your current tools from `for each repo, run...`, or how is monorepo fundamentally more capable than building automated library management / releases / etc with the same level of tooling?

londons_explore · on Dec 10, 2017

With a single commit you can change an API, and all it's users, and run all the tests for all the dependant projects, etc, and you're done. All in a days work, and no emails/communication necessary.

In a multi-repo world, people are probably linking against old revisions of your library, and against certain tags/branches etc. There is probably no overarching code search to find all users of the API. You're gonna have to grep the code and hope to find all uses. You might miss some repos/branches. Everyone has their own continuous integration/testing procedures, so you can't easily migrate their code for them. You're gonna have to support both API's for probably months until you have persuaded every other user to upgrade to the latest 'release' of your code which supports the new API before finally turning off the old API. The work involved in the migration is spread amongst all the project owners, which is probably much less efficient.

As others have said, it's the fully integrated version consistent codesearch with clickable xrefs across gigabytes of source code, cross repo code review, cross-repo testing, etc. which really makes a monorepo work well.

Groxx · on Dec 11, 2017

(edit: shortened dramatically. apologies, earlier wasn't all that useful.)

With the exception of cross-repo code review (I hadn't thought of that one - would be useful for multi-repo too, but I've honestly never seen a multi-repo tool for this, thanks!), this is all just the benefits of standardization, plus a massive injection of tooling enabled by the standards.

Standardization of projects brings huge benefits when it's done right, absolutely agreed. But that's entirely orthogonal to mono vs multi.

joshuamorton · on Dec 11, 2017

Not really. The point is that you can't have the same level of standardization in a multi repo set up.

Groxx · on Dec 11, 2017

    for repo in ls repos
    do
      thing
    done

?

joshuamorton · on Dec 11, 2017

Diamond dependency like issues crop up.

Imagine I have Repos A,B,C. A is a base repo. B and C depend on A, and C also depends on B. If I modify some API in A, and also update all the callsites in B and C, I also have to bump the version of A depended on by B and C, and also bump the version of B that C depends on, otherwise I'll get version mismatch/api compatibility breakages.

To make this work that means that nothing can depend on latest, everything has to have frozen dependencies, and you either need to manually, or via some system, globally track all of the dependencies across repos, and atomically update all of them on every breaking change.

In other words, you reinvent blaze/bazel at the repo level instead of the target level, and you have to add an additional tool that makes sure you're dependencies can never get mismatched.

The monorepo sidesteps this issue by saying "everything must always build against latest".

Groxx · on Dec 11, 2017

"everything must always build against latest" is perfectly enforceable on multirepo too, it's just that nobody does it.

joshuamorton · on Dec 12, 2017

>"everything must always build against latest" is perfectly enforceable on multirepo too, it's just that nobody does it

No, you cannot. That's my entire point. Here's a minimal example:

Repo one contains one file, provider.py:

    def five():
        return 5

Repo two contains one file, consumer.py.

    import provider  # assume path magic makes this work
    def test_five_is_produced():
        assert provider.five() == 5

    if __name__ == '__main__':
        test_five_is_produced()

I also have an external build script that copies provider and consumer, from origin/master/HEAD into the same directory, and runs `python consumer.py`.

Now I want to change `five` to actually be `number`, such that `number(n) == n`, ie. I really want a more generic impl. What sequence of changes can I commit such that tests will always pass, at any point in time?

There is no way to atomically update both provider and consumer. There will be some period of time, perhaps only milliseconds, but some period of time, at which point I can run my build script and it will pick up incompatible versions of the two files.

This is a reductive example, but the function `five` in this case takes the role of a more complex API of some kind.

Groxx · on Dec 12, 2017

or you give your CI the ability to read transaction markers in your git repo. e.g. add a tag that says "must have [repo] at [sha]+". dependency management basically. you can even do this after the commits are created, so you can allow cycles and not just diamonds.

but yes, cross-project commits are dramatically easier in a monorepo, I entirely agree with that - they essentially come "for free".

joshuamorton · on Dec 12, 2017

Didn't you just reinvent versioning and frozen dependencies? What you described is not always building at latest, it's building at latest except when there are issues at which point you don't build at latest and instead build at a known good version.

Consequences of this are, for example, that you cannot run all affected tests at every commit.

Groxx · on Dec 12, 2017

sure. I honestly don't see why that's a problem though, especially since "at every commit" can have clear markers for if it's expected to be buildable or not.

My point here is that you're describing a known problem with known solutions, and saying it's impossible. I'm saying it requires work, as does all this in a monorepo.

edit: to be technical: yes, you're correct, it can't always build at latest at every instant. Agreed. I don't see why that's necessary though. Simplifying, sure; necessary? No.

joshuamorton · on Dec 12, 2017

>sure. I honestly don't see why that's a problem though, especially since "at every commit" can have clear markers for if it's expected to be buildable or not.

The value from this is the ability to always know exactly which thing caused which problem. If you know things are broken now, you can bisect from the last known good state, and find the change that introduced a breakage. With multi-repo, you can't do that, since it's not always a single change that introduces a breakage, but a combination.

Ensuring that everything always builds at latest allows you to do a bunch of really cool magical bisection tricks. If you don't have that, you can't bisect to find breakages or regressions, because your "bisection" is

    1. now 2 dimensional instead of 1
    2. may/will have many false positives

That puts you in a really rough spot when there's a breakage and you don't have the institutional knowledge to know what broke it.

Groxx · on Dec 12, 2017

No, you're back to "we can't build HEAD in a multirepo", which is fixable with CI rules. If you can, you can bisect exactly the same (well, with a fairly simple addition to bisect by time. `git bisect` is pretty simple, shouldn't be hard to recreate).

In any case, unless you have atomic deploys across all services, this is generally untrue. Bisecting commit history won't give you that any more in a monorepo than in a multirepo.

joshuamorton · on Dec 12, 2017

To your first point, I'mma need you to explain how you bisect across a poset, because that's what you just claimed you could do.

To your second point, nothing I've said has anything to do with deployment. We're still entirely in the realm of continuous integration.

Roritharr · on Dec 10, 2017

How many people work on the repository tooling at Google?

I'm asking because i wouldn't know how to setup a mono repository at my 50 people Startup even if we deemed this to be necessary.

rifung · on Dec 10, 2017

> I'm asking because i wouldn't know how to setup a mono repository at my 50 people Startup even if we deemed this to be necessary.

Sorry if this is a really dumb question. If you only have 50 people I'm assuming your codebase isn't that big, so why can't you just make a repo, make a folder for each of your existing repos, and put the code for those existing repos into the new repo?

I imagine there's a way to do it so that your history remains intact as well.

aurelianito · on Dec 10, 2017

Yes, there is. Move the entire content of each repo to a directory and then force-merge them all in a single repo. I did this a few years ago with 4 small mercurials repositories that belonged together.

fla · on Dec 10, 2017

You can do that easly with a subtree

izacus · on Dec 10, 2017

For a 50 people startup, a Git repository will be usually enough. At my previous company we managed to do the monorepo approach easily with similar amount of people and GitHub.