I've worked with both a distributed repo model and a monorepo model and vastly prefer the distributed approach (given the right tooling). The trade-offs are complementary and no doubt with proper discipline you can try to maximize the benefits, while minimizing the downside. But here's what I don't like about working in a large monorepo:
1) Difficult to track changes to the code I'm interested in. Every day there are hundreds of changes in the repo and almost all of them have nothing to do with what I'm working on.
2) all sorts of operations take longer (pulling, grepping source, etc.) to support code I couldn't care less about.
3) Frequently have to update the world at once. Unless the repo can store multiple versions of the same module, then all the consumers have to be updated at once, even if it's inconvenient. Sometimes migrations are better done gradually.
4) Encourages sloppy dependency management. There are frequently unclear boundaries between software layers.
I'm sure people will say "if you're having those problems, you're doing it wrong" but the same thing could be said to people who find the distributed model problematic.
The trick is that Google have their own VCS, build tooling, automated refactoring tools, etc etc, specifically designed to deal with their monorepo. Nobody else has that - we're stuck with git and a complex landscape of tools for managing code in ad-hoc ways. As a result, with the tools we have, many repos is better than a monorepo - but perhaps if we had those tools, for some cases, a monorepo might be better than many repos.
Note that even where Google are forced to use git (e.g. Android, Chrome) they use a many-repo approach.
Maybe we could look at the problem from the other side: create tools to manage multiple repos as if they were a single monorepo. A docker-compose for git.
Perforce Helix might be even better - it even has a DVCS model based on creating a "local server" that can fetch/push from a shared server asynchronously from use of that local server, and a hybrid model that allows for only parts of the repository to be hosted on your personal server, and other parts to follow the more traditional Subversion-like model. Things like exclusive locks on files that can't really be "merged" are also supported (for example, all your assets).
The only downside is that it's not open-source, and as a result has a much smaller community. It's free for up to 5 users, then "email us" for any more. But if a very flexible VCS model is something you need, it's the same as anything else you need to pay for.
Google used to use Perforce until they hit a certain scale, so it's likely it'll work for you until you hit that scale and can build your own tools too.
Well, it seems to fit the requirements better than git. Obviously, subversion is not used much. I would like to hear some experience reports on what the problem with it is.
- it requires a certain discipline: we need branching in our workflow and this is handled mostly by convention in a subversion repository. We have "branches" that were created by less careful colleagues by copying subdirectories of trunk to the branches folder.
- all the tooling developers fled to work on making git bearable. It seems that there is good money in sugarcoating git and none in making good tools for Subversion (awareness of branches in Jenkins, decent code review...). We have a budget, but that does not compensate for the lead that git has in that regard.
Other than that, subversion fits our needs. It just works.
Subversion is not used much anymore - just in case you entered the industry after this.
Subversion was used in basically every open source project as a replacement for the previously dominant CVS.
Subversion was better than CVS, but still bad in many respects; slow synchronization and poor branching and merging support come to mind.
Because of these shortcomings, and because the idea of decentralized versioning was taking hold, systems like git, mercurial, and others emerged, and git seems to be the most successful of them by now.
Must have: Tooling that can interact on a file or subdirectory level. Git cannot do that.
Should have: Access control to view and change files on a subdirectory basis. Everyone can see the repo, so you can no longer set permissions per repo. It's optional, but these companies have that.
Recommended: Global search tools, global refactoring tools, global linting that can identify file types automatically and apply sane rules, unit test checks and on commit checks available out of the box for everything and that run remotely quickly, etc...
It's regular tooling that every development company should have, but only big companies with mono repos have it.
It's not that the tooling is needed to deal with the mono repo, it's that the tools are great and you want them. But they can't be implemented in a multi repo setup.
Think about it. How could you have a global search tool in a multi-repo setup? Most likely, you can't even identify what repos exist inside the company.
Makes me realize. If I ever go back to another tech company, the shit tooling is gonna make me cry.
IIRC, Bitbucket Enterprise has pretty decent global search. GitHub Enterprise doesn't seem to have much of any cross-repo tooling, which is one of my least favorite things about it.
Global refactoring seems a lot less necessary if you have clean separation among your processes. Maybe this is me coming from a more microservices perspective, but I'm inclined to say that needing to do a refactor that cuts across several different functional areas is a sign that things are becoming hopelessly snarled together.
Google has dedicated language, platform, library, etc. teams (I'm no longer there) that can push really huge refactoring changelists - for example, if they noticed that code had plenty of "if (someString == null || someString.empty())", they would replace it with something simpler.
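A toy sketch of that kind of cleanup, in Python for brevity: a script that rewrites the null-or-empty idiom into a single helper call (think a helper like Guava's Strings.isNullOrEmpty) across a whole checkout. The real tools operate on syntax trees rather than regexes, and the isNullOrEmpty target here is assumed, so treat this purely as an illustration of the "one changelist touching many files" idea.

    import re
    from pathlib import Path

    # Toy codemod: rewrite the null-or-empty idiom into one helper call.
    # Real large-scale refactoring tools work on syntax trees, not regexes.
    PATTERN = re.compile(r"(\w+) == null \|\| \1\.empty\(\)")

    def rewrite(source):
        # e.g. "someString == null || someString.empty()" -> "isNullOrEmpty(someString)"
        return PATTERN.sub(r"isNullOrEmpty(\1)", source)

    def codemod(repo_root):
        for path in Path(repo_root).rglob("*.java"):
            original = path.read_text()
            changed = rewrite(original)
            if changed != original:
                path.write_text(changed)
                print("rewrote", path)

    if __name__ == "__main__":
        codemod(".")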
Or if they found some bad pattern, they would pull that too. I remember when a certain Java hash map was replaced, and they replaced it across the whole codebase. It broke some tests (which were relying on a specific iteration order, and that was wrong) - and people quickly jumped in and fixed them.
This level of coordination is great. And it's not just "let's do it today" - things are prepared in advance: days, weeks, months, even years if they have to be. With careful rollout plans, getting everyone aware, helping everyone get to their goal, etc.
It's also easy to establish code style guides and remove the bikeshedding over tabs/spaces, camelCase or not, switch/case statement styles, etc. Once a tool has been written to reformat (via the IDE or other means), and another to check style and some semantics, then people - like it or not - soon settle into that style and keep going. There are more important things to discuss.
The idea of global refactoring is mostly that you can decide to modify a private API, and in the process actually update all the consumers of that API, because they all live in the same repo as the component they're consuming. (This is also the argument of the BSD "base system" philosophy, vs. the Linux "distro" philosophy: with a base-system, you can do a kernel update that requires changes to system utilities, and update the relevant system utilities in the very same commit.)
Code search in bitbucket server is dismal. All punctuation characters are removed. This includes colons, full stops, braces and underscores. This makes it close to useless for searching source code.
Regarding global refactorings, think new language features or library versions.
Support for punctuation in search is something we knew wasn't ideal when we first added code search. As with all software, there were some technical constraints that made it hard to do.
We plan to have support for full stops and underscores in a future version and are exploring how best to handle more characters longer term. Our focus, based on feedback, is on "joining" punctuation characters to better allow searching for tokens. Support for a full range of characters threatens to blow out index sizes, but if we get more feedback on specific use cases we're always happy to consider them.
Being a self-hosted product we have to make tradeoffs for the thousands of people operating (scaling, upgrading, configuring, troubleshooting...) instances. In short, we try to keep the system architecture fairly simple using available technology and keeping the broad skillsets of admins in mind.
It was a somewhat difficult call to add ElasticSearch for its broad search capability, but its use for other purposes helped justify it. Adding Hound or similar services that were considered would have added more administrative complexity and wouldn't have provided for a broader range of search needs.
We continue to iterate on search, making it better over time.
A fair point, but I will just say that Hound is _astonishingly_ low maintenance. I set it up at my current employer like two years ago and have logged into that VM maybe twice in the entire time. It just hums along and answers thousands of requests a week with zero fuss.
> Must have: Tooling that can interact on a file or subdirectory level. Git cannot do that.
I mean, when you get big, sure. But until you're big, git is fine. Working at fb, I don't use some crazy invocation to replace `hg log -- ./subdir`, I just do `hg log -- ./subdir`. Sparse checkouts are useful, but their necessity is based on your scale - the bigger you are, the more you need them. Most companies aren't big enough to need them.
> Should have: Access control to view and change files on a subdirectory basis. Everyone can see the repo, so you can no longer set permissions per repo. It's optional, but these companies have that.
Depends on your culture (and regulatory requirements). I prefer companies where anyone can modify anyone's code.
> Recommended: Global search tools, global refactoring tools, global linting that can identify file types automatically and apply sane rules, unit test checks and on commit checks available out of the box for everything and that run remotely quickly, etc...
I'd bump this up to `should have`. The power of a monorepo is being able to modify a lib that is used by everyone in the company, and have all of the dependencies recursively tested. Global search is required, but until you're big, ripgrep will probably be fine (and after that you just dump it into elasticsearch).
> Depends on your culture (and regulatory requirements). I prefer companies where anyone can modify anyone's code.
This is still true at Google, except for some very sensitive things. However, every directory is covered by an OWNERS file (specific or parent) that governs who needs to sign off on changes. If I’m an owner, I just need any one other engineer to review the code. If I’m not, I specifically need someone that owns the code. IMHO, this is extremely permissive and the bare minimum any engineering organization should have. No hot-rodding code in alone without giving someone the chance to veto.
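For illustration, here's a rough Python sketch of that approval rule, with a made-up directory-to-owners mapping (the real OWNERS files are per-directory text files with a much richer syntax, so this is only the shape of the check, not Google's implementation):

    from pathlib import PurePosixPath

    # Hypothetical mapping from directories to owners; assumed for illustration.
    OWNERS = {
        "search/index": {"alice", "bob"},
        "search": {"carol"},
        "": {"root-team"},
    }

    def owners_for(path):
        # Owners of the file's directory plus every parent directory.
        owners = set()
        directory = PurePosixPath(path).parent
        while True:
            key = "" if str(directory) == "." else str(directory)
            owners |= OWNERS.get(key, set())
            if key == "":
                return owners
            directory = directory.parent

    def change_is_approved(changed_file, author, reviewers):
        owners = owners_for(changed_file)
        if author in owners:
            # An owner still needs any one other engineer to review.
            return len(reviewers - {author}) >= 1
        # A non-owner specifically needs sign-off from someone who owns the code.
        return bool(owners & reviewers)

    print(change_is_approved("search/index/shard.cc", "alice", {"dave"}))  # True
    print(change_is_approved("search/index/shard.cc", "dave", {"eve"}))    # False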
>ripgrep, ElasticSearch
Having something that understands syntax when indexing makes these tools feel blunt by comparison. SourceGraph is making a good run at this problem.
Elasticsearch is too dumb. You need to use a parser and build a syntax tree to get a good representation of the code base. That's what Facebook and Google do for their Java code.
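As a tiny illustration of what "parse it and build a syntax tree" buys you over token matching, here's a sketch using Python's own ast module (real code-intelligence indexers resolve types and references across files, which this obviously doesn't):

    import ast
    import textwrap

    # Index definitions and call sites by walking the syntax tree instead of
    # grepping text. Real code-intelligence indexes do far more (types,
    # cross-file references); this only shows why a parser beats raw tokens.
    SOURCE = textwrap.dedent("""
        def five():
            return 5

        def six():
            return five() + 1
    """)

    tree = ast.parse(SOURCE)
    definitions, calls = {}, []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            definitions[node.name] = node.lineno
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.append((node.func.id, node.lineno))

    print(definitions)  # {'five': 2, 'six': 5}
    print(calls)        # [('five', 6)]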
Agree that any small to medium company could have a mono repo without special tooling. Yet they don't.
There are companies that care about development and there is the rest of the world.
Might I suggest using a tool designed for searching source code rather than dumping into elastic. Bitbucket, sourcegraph, github search or my own searchcodeserver.com
Unless designed to search source code most search tools will be lacking.
I had a bad time at Google and was glad to leave, but wow did I ever miss that culture of commitment to dev process improvement and investment in tooling. The next startup I joined was kind of a shocking letdown. It became clear pretty early on that nobody else there had ever seen anything like the systems at Google, couldn't imagine why they might be worth investing in, and therefore the level of engineering chaos we wasted so much time struggling with was going to be permanent.
The startup I'm working for now is roughly half ex-googlers, so it is a different story. Of course we can't afford Google level infrastructure, but there is at least a strong cultural value around internal tooling, and a belief that issues with repetitive or error-prone tasks are problems with systems, not the people trying to use them.
Worked at Google for 2-3 years, mainly Java, under google3. My thoughts: having things under a single repo, with a system like blaze (bazel), I can quickly link against other systems, or be prevented/warned that it's not a good idea (the system may be getting deprecated, or is brand new, and you need visibility permission (which can be ignored locally)).
Build systems, release systems, integration tests, etc. - everything works more easily, as you refer to things by global, path-like names.
Blaze helps a lot - one language for linking protobufs, Java, C++, Python, etc.
Lately docs are going in there too, with renderers.
Best features I've seen: code search lets you jump by clicking on any reference, lets you "debug" directly against things running on servers, lets you link specific versions, check history, changes, diffs.
GitHub is very far from this, if for no other reason than that it isn't even possible for it to know how things are linked. Even if github.com/someone/somelibrary is used by github.com/someone-else/sometool, GitHub would not know how they are connected - is it CMake, Makefiles, .sln, .vcxproj? It may be able to guess, but that would be lies in the end... Not the case at Google - you can browse things better than in your IDE - and you couldn't even produce this information for your IDE yourself (a process that runs every so often updates it, using a huge MapReduce to do so).
Then local client spaces - I can just create a dir, open a workspace there, and virtually everything is visible from it (the whole monolithic depot) plus my changes. There are also a couple of other ways to do it (git-like ones included), but I haven't explored those.
What's missing? I dunno... I guess the whole overwhelming feeling that such a beast exists, and that it's already tamed by thousands of SREs, SWEs, managers, and just the most awesome folks.
I certainly miss the feeling of it all, back to good ole p4, but the awesome company that I'm at also realized that a single depot is the way to go (with perforce, that is). We also have git, but our main business is game development, so huge .tiff files, model files, etc. require it.
Also, ReviewBoard and now Swarm (the p4 web interface and review system) are nice so far. Not as advanced as what Google had internally for review (no, it's not Gerrit - I still can't get my head around that thing), but getting there.
One last point - a monotonically incrementing changelist number will always be easier to work with than random, unordered SHAs - you can build whole systems of feature toggles, experiments, and build verifications around it, like:
This feature is present if built with CL > 12345, or with cherrypicks of CL 12340 and CL 12300. You could come up with ways to do this with SHAs too - but imagine what your configuration would look like. It's also easier to explain to non-engineering people - it's just a version number.
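For instance, a minimal sketch of that rule; BUILD_CL and CHERRYPICKED_CLS are hypothetical values stamped in at build time, not any real system's API:

    # Hypothetical build metadata, stamped into the binary at build time.
    BUILD_CL = 12346
    CHERRYPICKED_CLS = {12300, 12340}

    def feature_enabled(min_cl, required_cherrypicks=frozenset()):
        # The rule above: present if built with CL > min_cl, or if an older
        # build carries all of the listed cherrypicked CLs.
        if BUILD_CL > min_cl:
            return True
        return bool(required_cherrypicks) and required_cherrypicks <= CHERRYPICKED_CLS

    # "This feature is present if built with CL > 12345,
    #  or with cherrypicks of CL 12340 and CL 12300."
    if feature_enabled(12345, {12340, 12300}):
        print("new code path")
    else:
        print("old code path")

The same check against unordered SHAs would need an ancestry query against the repo itself, which is exactly the extra machinery being pointed at here.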
What special tooling is required to deal with a monorepo that is not required for multi repo?
From my time at Google the first thing that came to mind was citc. But I couldn't remember if citc was publicly known, so I did an Internet search for "google citc". The first search result was this article.
"CitC supports code browsing and normal Unix tools with no need to clone or sync state locally."
Unless something drastic changed in the last year, I really doubt it. There is the fb frontend, the backend, the offline batch processing repo, and the instagram frontend repo. I think the phone apps have their own repos too? It was a giant mess, especially when you had to make changes that spanned repos, like introducing a new backend API and then depending on it, or changing logging formats.
> Note that even where Google are forced to use git (e.g. Android, Chrome) they use a many-repo approach.
Google uses many-repo approach for Android and Chrome because you cannot fit everything in a single git repo (well you can, but it will be a pain in the ass to work on that repo). Git is just not designed for huge repos. Google is also working on tools to make the many-repo of Android or Chrome work like a monorepo.
Software A version 1 consumes format F1 and produces format G1 data, and software B version 1 consumes format G1 and produces H1.
To upgrade format G2 we must change both software A and B.
First, software B version 2 must accept both G1 and G2. To do this we may need to build software A version 2 and try them in a sandbox environment to gain confidence that ∀F1 we produce the correct G2. If F1 is complete, we may be able to do this exhaustively, but if F1 is sufficiently diverse, monte carlo simulation might be used.
Then, if there's a 1:1 relationship between A/B we can upgrade pairs.
If there's an N:M relationship, we need to upgrade all of the instances of software version B1 to B2 (at least within a shard). If you're running in a non-stop environment, this might have its own challenges. Only then can we begin the upgrade from A1 to A2.
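A minimal sketch of the "B version 2 must accept both G1 and G2" step above, with made-up record shapes (dicts with a version field) standing in for the real formats:

    # B version 2: accept both the old G1 and the new G2 record shapes while
    # the A fleet is upgraded. Field names are invented for illustration.
    def parse_g(record):
        if record.get("version") == 2:  # already G2
            return record
        # G1: upgrade in place so the rest of B only ever sees G2.
        first, _, last = record["name"].partition(" ")
        return {"version": 2, "id": record["id"],
                "first_name": first, "last_name": last}

    def handle(record):
        g2 = parse_g(record)
        return "H1 output for %s %s" % (g2["first_name"], g2["last_name"])

    # Works against both producer versions during the transition:
    print(handle({"id": 1, "name": "Ada Lovelace"}))              # from A v1 (G1)
    print(handle({"version": 2, "id": 2,
                  "first_name": "Alan", "last_name": "Turing"}))  # from A v2 (G2)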
Now:
Something, somewhere needs to record what and where we are in this journey. It is relatively straightforward how to do this with a monorepo, but it is very unclear how to do it with a distributed repository:
Almost everyone I know punts and uses some other golden record (like a continuous integration server, or a ticketing system, or an admin/staging system), and like it or not: that's your monorepo.
You can also design software A to produce both G1 and G2 side by side, deploy it, and then develop new software B against G2, submitting bug reports to project A when there’s a problem detected in G2.
If you’re doing the multirepo strategy it’s best imho to make the projects truly independent, as if they were developed by different companies. That way every project only needs to think about its own dependencies and consumers, and how to do migrations, without needing to have the big picture mapped out.
> You can also design software A to produce both G1 and G2 side by side
This can be impractical if G is a database table that is very large.
> it’s best imho to make the projects truly independent, as if they were developed by different companies.
One of our systems might cost £300k, so completely desynchronising them so that code paths can build both G1 and G2 simultaneously (allowing B to develop separately) means "simply" doubling the costs. That might put our team at a disadvantage against someone who figures out another way.
> This can be impractical if G is a database table that is very large.
If this is true, you have no choice, and must run things side by side while you convert to G2. Or shut everything down to make the migration atomic, which is increasingly not an option.
Maybe using a ticketing system [or just call it a project management system] is the right abstraction level.
If A and B have nothing to do with each other - other than that, for some circumstantial reason, they consume data from each other - then why would we care if A or B starts to support a new output format?
If we want to do a format change for some reason, maybe it'll allow better security/traceability, then sure, make a project and track the tasks (like make A able to produce/consume new format, make B able to produce/consume new format, deploy A2 and B2 to test environment, promote to prod), but I don't see why would you track that on the source code versioning level.
A and B have separate tests to ascertain that they can deal with the new format, and then you do the integration testing, which might catch problems that should then be covered by unit tests in A or B. (Or in a fuzzer for said format.)
> Maybe using a ticketing system [or just call it a project management system] is the right abstraction level.
> If A and B have nothing to do with each other - other than that, for some circumstantial reason, they consume data from each other - then why would we care if A or B starts to support a new output format?
First, even if the coordination between A and B is recorded in the ticketing system, the coordination between F and G is probably not.
> I don't see why would you track that on the source code versioning level.
Pretend F and G are tables in a database (or other data storage system) if that makes it easier.
Where is the schema stored? Who records the migration path?
Many people like to record migrations in a version control system, but it is tricky to link those migrations to the (otherwise) independent A and B.
If these are file formats, where does the code to consume and produce them live? Or network formats? The problem remains the same -- do we break this up into additional libraries?
And there's a very real ordering between the releases of A and B that isn't properly encoded; we're relying on process diligence (as opposed to tooling) to get it right.
If tables, then if they are in the same DB, they should be in the same project.
If they are independent tables, then I don't care, show me the API between the projects.
If these are file/network/serialization/wire/in-memory/binary/codec formats, then there are conformance checkers (passive and active, like fuzzers). Those are separate projects, but they can be used like tools during testing and development.
Rely on tooling to make sure that the stated goal of the project is reached. (It now supports F or G or X,Y,Z formats. It supports output-format G by processing input-format F. If that's a project requirement, test it in that project.)
You can use a top-level repo for the integration tests. But there's no need to make it one flat repo.
> If tables, then if they are in the same DB, they should be in the same project.
Lock-stepping two otherwise unrelated applications because they both share support for a data structure is silly at best, and often impractical, especially if development for only one of the projects is "in-house". Consider the possibility that "A" is a commercial product produced by another company.
Anyway, it's my experience most software upgrades don't involve a schema change, so it's worth optimising for the common case, and supporting the difficult case.
> but it is very unclear how to do it with a distributed repository
Versioning through branching and tagging, while having some drawbacks - at least the fact that you have to DO an operation and that this is not automatic - seem to solve this problem, and are not, in my eyes, a form of monorepo. You globally get more flexibility at the cost of a bit more repo management work.
If the problem is retrieving the right version automatically, externals or submodules should be able to solve this problem. If A and B have no clear dependency direction, a top level repo might help.
This is the way I generally do it: A repository that represents my system/environment that has submodules for A1, A2, B1, and B2, and scripts for updating the environment.
I will get a flurry of downvotes for saying it, but the cause of your first three problems is git, not the monorepo approach.
Git only lets you check out/commit/view the entire repo at once. Also, some git operations are superlinear in the number of files or revisions; they are slow on large repos to the point of being unusable.
It's mandatory to have operations at a per-file or per-subdirectory level in a monorepo approach. Companies that have monorepos all built tooling to support that. CVS/SVN used to do it out of the box, but everyone hates them now.
> CSV/SVN used to do that out of the box but everyone hate them now.
It's "CVS" and it lacks a concept for a repository-wide version (except, maybe, a timestamp). A repository-wide version is –I guess– the single best reason to have monorepo in the first place.
Also hard to deal with. Lots of operations leave submodules in unclean / out-of-date states.
In general / light use, yeah, they're great. Unfortunately, they have a very large number of edge cases where they essentially require either a) everyone to be experts in the edge cases, or b) tons of new tooling (because existing tools won't take these steps for you).
That's fair. I think the important thing is to be able to have multiple versions of portions of your repo and then be able to version that. A branch of branches, so to speak.
> Every day there are hundreds of changes in the repo and almost all of them have nothing to do with what I'm working on.
Wouldn't the right tooling be able to show you changes to the slice of code you're interested in? I remember SVN would allow you to checkout just a single subdirectory, for example.
> all sorts of operations take longer (pulling, grepping source, etc.) to support code I couldn't care less about.
Makes sense about the pulling, but again, wouldn't grepping be configurable to only search where you need to?
> Frequently have to update the world at once. Unless the repo can store multiple versions of the same module, then all the consumers have to be updated at once, even if it's inconvenient. Sometimes migrations are better done gradually.
I'm not sure I understood you here, can you expand on that? Do you mean all the devs have to update their module? Why not use tagged/branched version of libs instead of working off trunk?
What kind of tooling do you use for the distributed approach?
Yeah, it's all about the tooling. The distributed system I used was when I worked at Amazon. Each "package" there has its own git repo, which can have multiple branches and the dependencies of each branch are versioned along with the branch. I wonder if at the end of the day it really matters whether something is a "monorepo" or not if the tooling provides the necessary abstractions to version things the way you need.
Using different branches for different modules becomes hard in a monorepo when branching is a global operation. You can only have a single branch checked out in your working copy then.
The only solution I can think of is to create a copy within the monorepo to create de facto branches without regular VCS support. This would be kind of terrible.
I am a firm believer in using processes and tools that make it hard to do stuff you should not be doing in the first place. Dependencies should be solved the way FAKE and Paket do it, not in a monorepo. It's the same story for many projects in a solution vs. few: with few projects you avoid wrestling with cyclic dependencies between projects; on the other hand, that is just a tell-tale sign that the overall structure is starting to deteriorate.
I still do not see the appeal of monorepos, since you're heavily dependent on discipline not to introduce spaghetti dependencies, where you fix one bug but introduce 4 new ones in unrelated parts of the code. Then you solve that with an if statement, and thus introduce a great deal of technical debt.
I’ve never worked in a monorepo, so may be wrong, but this point presumes a dependency on “latest” at all times. I’d assume the components in the mono repo still release versions to the various package management systems (maven, pypi, npm, etc), allowing dependencies to be more stable.
Has that not been the common experience of those who have worked in them? I see a lot of merit in having to update everything at once (less code rot, hopefully) but it does seem to have drawbacks (many have commented on these as well).
If the library/component is widespread, it can either be developed in a branch, with specific versions tagged, or always developed in "trunk" mode with feature toggles (not always possible, but one can adjust), e.g. certain features are disabled, and need to be enabled after other changes land, or after some specific time, etc.
While at Google, we used that kind of development for the project I was on. Someone would push source code changes for new features, but preferably behind a flag (normally a command-line flag, driven by a configuration, like the one ksonnet has). The configuration file would say: enable this flag only if the binary was compiled with this CL version, and/or these cherrypicks, or some other rule.
This also allows a feature to be quickly disabled by SRE, SWE, or other personnel if it's found to be not working well.
Both approaches can be made to work. For me, the overriding concern is simplicity and ease of configuration management, so I prefer something that on the surface looks like a monorepo. Somewhat paradoxically, in my attempt to solve the issues that you mention, I ended up scripting my commits and checkouts so I could place the repository in a set of git repositories -- so I have a distributed set of repositories under the hood that look like a monorepo to the people using it! Neat, huh?
> 1) Difficult to track changes to the code I'm interested in.
What's wrong with 'git log $PATH'?
> 2) all sorts of operations take longer (pulling, grepping source, etc.) to support code I couldn't care less about.
A different format could help here, as can different tools (e.g. ripgrep or ag instead of grep). The time spent on those operations has to be balanced with the time spent updating your code to deal with someone else's incompatible library changes, again, when the other person is on vacation and you have no idea what the new philosophy of his library is. And you don't have any choice about updating, because another one of your dependencies that you really must update has already been updated to rely on his changes.
> 3) Frequently have to update the world at once.
IMHO that's a feature, not a bug. The person or team responsible for breaking the world is responsible for fixing it, rather than getting to break the world, then pop off down to Barton-on-Sea for an extended holiday while everyone else in the company gets to update his code to use an entirely different idiom.
> 4) Encourages sloppy dependency management.
My experience has been that multiple repos tend to encourage sloppy dependency management, while a monorepo tends to encourage deliberative, collaborative, professional dependency management. That's just my own experience, and of course different organisations will differ.
> I'm sure people will say "if you're having those problems, you're doing it wrong" but the same thing could be said to people who find the distributed model problematic.
My own experience has been that multirepos tend to be like dynamic typing and monorepos tend to be like static typing: multirepos can in theory be done right, but in practice they never are, while monorepos work, but at the cost of people having to colour within the lines. Which makes sense for any particular organisation may actually be a function of its maturity: if a place is trying to move fast and break things, maybe multiple repos make sense; if it's trying to deliver quality software, maybe a single repo makes sense.
3 and 4 are pretty fundamental though, especially 3 - if you don't want to force everybody to keep up with head, you probably don't want to use a monorepo.
My team owns a framework and set of libraries that are widely used within the Google monorepo. We confidently forward-update user code and prune deprecated APIs with relative ease — with benefits of doing it staged or all-at-once atomically.
I attended a talk by one of the Google Guava (Java collections library) authors and he told us how they didn't have to worry about maintaining backward compatibility at all. When they made a breaking change they could check out all of the impacted Java code across Google, refactor it, verify that the tests still passed, and then commit everything in one shot. It's easy to understand the productivity advantages.
One challenge is latency in generating the codebase's identifier and callgraph search index (cf. Code Search and Kythe). We can perform global tests across the entire monorepo, but that takes time. What happens if someone introduced new usage of the old API immediately before our atomic refactoring, and what about pathological tests or flakes? This still necessitates doing some cleanups in multiple stages: (1) mark the old API as deprecated (optionally announce it), (2) replace and delete legacy usages, and (3) delete the final trailing usages sometime soon thereafter, once the codebase has been reindexed.
Some languages and ecosystems are more tolerant of this problem than others. That said, incremental cleanup still has an advantage when bisecting regressions.
As I said, it is not perfect, but making broad-based changes quickly is relatively easy.
In my time maintaining open source, I never had these luxuries, which is why I said the monorepo is infinitely easier. Another consequence: if global cleanups are easy, perhaps that reduces the barrier to experimentation. Perfect is no longer the enemy of the good and the good enough. In open source, where I had zero control over dependent code and its callgraph, I felt the reverse was true: hesitance to publish something for fear of the cost.
Not sure what you mean. I work at Facebook, and can confirm we keep all code in a monorepo (or, rather, one of two big monorepos) rather than just Java code.
This lets us easily do React API changes: we can deprecate an API internally, and update all JS code that references the old APIs in a single commit.
You can do that with a many-repo setup, provided you have the right tooling. In fact, I'd argue Google's advantage is in the tooling they built around the repo, not the monorepo itself. E.g. how fast you can find all your dependents in the whole repo.
Not really, as there isn't the ability to atomically commit your changes. With 70000+ full-time employees code is getting checked in all the time. Atomicity is extremely valuable.
It's like the people commenting about this forget what distributed means. You can have multiple repos, but you can still have a gateway/"source of truth" repo. You can run tests and whatever else on it just like you do in a "monorepo." The power behind Google's/Facebook's choice isn't and will never be the monorepo. It's specifically the tooling they built around their choice.
The 70,000+ number you cited includes engineers and non-engineers but surely, the actual number of engineers that need commit access will be much lower.
I agree, though I wouldn't say this is a fundamental aspect to distributed systems, mostly just a consequence of git being built with terrible merge and merge conflict tooling.
What changes can you do that you couldn't have done before, in a multirepo world? And I do mean could not have done - clearly the monorepo enjoys a few hundreds of thousands (millions?) of hours of effort that the multirepo did not.
i.e. what stops your current tools from `for each repo, run...`, or how is monorepo fundamentally more capable than building automated library management / releases / etc with the same level of tooling?
With a single commit you can change an API, and all its users, and run all the tests for all the dependent projects, etc., and you're done. All in a day's work, and no emails/communication necessary.
In a multi-repo world, people are probably linking against old revisions of your library, and against certain tags/branches etc. There is probably no overarching code search to find all users of the API. You're gonna have to grep the code and hope to find all uses. You might miss some repos/branches. Everyone has their own continuous integration/testing procedures, so you can't easily migrate their code for them. You're gonna have to support both APIs for probably months, until you have persuaded every other user to upgrade to the latest 'release' of your code which supports the new API, before finally turning off the old API. The work involved in the migration is spread amongst all the project owners, which is probably much less efficient.
As others have said, it's the fully integrated version consistent codesearch with clickable xrefs across gigabytes of source code, cross repo code review, cross-repo testing, etc. which really makes a monorepo work well.
With the exception of cross-repo code review (I hadn't thought of that one - would be useful for multi-repo too, but I've honestly never seen a multi-repo tool for this, thanks!), this is all just the benefits of standardization, plus a massive injection of tooling enabled by the standards.
Standardization of projects brings huge benefits when it's done right, absolutely agreed. But that's entirely orthogonal to mono vs multi.
Imagine I have Repos A,B,C. A is a base repo. B and C depend on A, and C also depends on B. If I modify some API in A, and also update all the callsites in B and C, I also have to bump the version of A depended on by B and C, and also bump the version of B that C depends on, otherwise I'll get version mismatch/api compatibility breakages.
To make this work that means that nothing can depend on latest, everything has to have frozen dependencies, and you either need to manually, or via some system, globally track all of the dependencies across repos, and atomically update all of them on every breaking change.
In other words, you reinvent blaze/bazel at the repo level instead of the target level, and you have to add an additional tool that makes sure your dependencies can never get mismatched.
The monorepo sidesteps this issue by saying "everything must always build against latest".
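A sketch of that "additional tool", under the assumption that each repo publishes a small pin manifest (the layout here is invented): CI rejects any combination of pins that has drifted apart, which is the coordination a monorepo gives you implicitly.

    # Invented pin manifests: what each repo says it builds against.
    PINS = {
        "B": {"A": "1.4.0"},
        "C": {"A": "1.4.0", "B": "2.1.0"},
    }
    RELEASED = {"A": "1.4.0", "B": "2.1.0"}  # latest published versions

    def check_pins(pins, released):
        problems = []
        for repo, deps in pins.items():
            for dep, wanted in deps.items():
                if released[dep] != wanted:
                    problems.append("%s pins %s==%s but %s is at %s; every pin "
                                    "must be bumped atomically on a breaking change"
                                    % (repo, dep, wanted, dep, released[dep]))
        return problems

    print(check_pins(PINS, RELEASED) or "all pins consistent")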
>"everything must always build against latest" is perfectly enforceable on multirepo too, it's just that nobody does it
No, you cannot. That's my entire point. Here's a minimal example:
Repo one contains one file, provider.py:

    def five():
        return 5

Repo two contains one file, consumer.py:

    import provider  # assume path magic makes this work

    def test_five_is_produced():
        assert provider.five() == 5

    if __name__ == '__main__':
        test_five_is_produced()
I also have an external build script that copies provider and consumer, from origin/master/HEAD into the same directory, and runs `python consumer.py`.
Now I want to change `five` to actually be `number`, such that `number(n) == n`, ie. I really want a more generic impl. What sequence of changes can I commit such that tests will always pass, at any point in time?
There is no way to atomically update both provider and consumer. There will be some period of time, perhaps only milliseconds, but some period of time, at which point I can run my build script and it will pick up incompatible versions of the two files.
This is a reductive example, but the function `five` in this case takes the role of a more complex API of some kind.
or you give your CI the ability to read transaction markers in your git repo. e.g. add a tag that says "must have [repo] at [sha]+". dependency management basically. you can even do this after the commits are created, so you can allow cycles and not just diamonds.
but yes, cross-project commits are dramatically easier in a monorepo, I entirely agree with that - they essentially come "for free".
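A rough sketch of that marker check (the pin file name and layout are made up; the git commands are standard): the consumer repo records "I need provider at this sha or newer", and CI verifies that the provider checkout it is about to build against satisfies the marker before running anything.

    import json
    import subprocess

    def satisfies_pin(provider_checkout, pin_file="provider.pin.json"):
        # e.g. {"repo": "provider", "min_sha": "abc123..."} -- invented layout
        with open(pin_file) as f:
            pin = json.load(f)
        head = subprocess.run(
            ["git", "-C", provider_checkout, "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True).stdout.strip()
        # "min_sha or newer" means min_sha is an ancestor of (or equal to) HEAD.
        result = subprocess.run(
            ["git", "-C", provider_checkout, "merge-base", "--is-ancestor",
             pin["min_sha"], head])
        return result.returncode == 0

    if __name__ == "__main__":
        if not satisfies_pin("../provider"):
            raise SystemExit("provider checkout is older than the pinned marker")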
Didn't you just reinvent versioning and frozen dependencies? What you described is not always building at latest, it's building at latest except when there are issues at which point you don't build at latest and instead build at a known good version.
Consequences of this are, for example, that you cannot run all affected tests at every commit.
sure. I honestly don't see why that's a problem though, especially since "at every commit" can have clear markers for if it's expected to be buildable or not.
My point here is that you're describing a known problem with known solutions, and saying it's impossible. I'm saying it requires work, as does all this in a monorepo.
edit: to be technical: yes, you're correct, it can't always build at latest at every instant. Agreed. I don't see why that's necessary though. Simplifying, sure; necessary? No.
>sure. I honestly don't see why that's a problem though, especially since "at every commit" can have clear markers for if it's expected to be buildable or not.
The value from this is the ability to always know exactly which thing caused which problem. If you know things are broken now, you can bisect from the last known good state, and find the change that introduced a breakage. With multi-repo, you can't do that, since it's not always a single change that introduces a breakage, but a combination.
Ensuring that everything always builds at latest allows you to do a bunch of really cool magical bisection tricks. If you don't have that, you can't bisect to find breakages or regressions, because your "bisection" is
1. now 2 dimensional instead of 1
2. may/will have many false positives
That puts you in a really rough spot when there's a breakage and you don't have the institutional knowledge to know what broke it.
No, you're back to "we can't build HEAD in a multirepo", which is fixable with CI rules. If you can, you can bisect exactly the same (well, with a fairly simple addition to bisect by time. `git bisect` is pretty simple, shouldn't be hard to recreate).
In any case, unless you have atomic deploys across all services, this is generally untrue. Bisecting commit history won't give you that any more in a monorepo than in a multirepo.
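To make the "bisect by time" idea above concrete, here's a rough sketch that merges the commit timelines of two repos and binary-searches for the first point in time where the combined build fails (test_cmd is whatever builds the two checkouts together; error handling and empty-history cases are omitted, so this is a sketch, not a drop-in tool):

    import datetime
    import subprocess

    def commits(repo):
        # (timestamp, repo, sha) for every commit, oldest first.
        out = subprocess.run(
            ["git", "-C", repo, "log", "--reverse", "--format=%ct %H"],
            capture_output=True, text=True, check=True).stdout.split()
        return [(int(ts), repo, sha) for ts, sha in zip(out[0::2], out[1::2])]

    def checkout_as_of(repo, tip, timestamp):
        when = datetime.datetime.fromtimestamp(timestamp, datetime.timezone.utc).isoformat()
        sha = subprocess.run(
            ["git", "-C", repo, "rev-list", "-n", "1", "--before", when, tip],
            capture_output=True, text=True, check=True).stdout.strip()
        subprocess.run(["git", "-C", repo, "checkout", "--quiet", sha], check=True)

    def bisect(repos, test_cmd):
        tips = {repo: subprocess.run(
                    ["git", "-C", repo, "rev-parse", "HEAD"],
                    capture_output=True, text=True, check=True).stdout.strip()
                for repo in repos}
        timeline = sorted(t for repo in repos for t in commits(repo))
        lo, hi = 0, len(timeline) - 1  # assumes lo passes and hi fails
        while lo + 1 < hi:
            mid = (lo + hi) // 2
            for repo in repos:
                checkout_as_of(repo, tips[repo], timeline[mid][0])
            if subprocess.run(test_cmd).returncode == 0:
                lo = mid
            else:
                hi = mid
        return timeline[hi]  # first (timestamp, repo, sha) that broke the pair

    # e.g. bisect(["./provider", "./consumer"], ["python", "consumer/consumer.py"])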
> I'm asking because i wouldn't know how to setup a mono repository at my 50 people Startup even if we deemed this to be necessary.
Sorry if this is a really dumb question. If you only have 50 people I'm assuming your codebase isn't that big, so why can't you just make a repo, make a folder for each of your existing repos, and put the code for those existing repos into the new repo?
I imagine there's a way to do it so that your history remains intact as well.
Yes, there is. Move the entire content of each repo to a directory and then force-merge them all into a single repo. I did this a few years ago with 4 small mercurial repositories that belonged together.
For a 50-person startup, a Git repository will usually be enough. At my previous company we managed the monorepo approach easily with a similar number of people and GitHub.
Google's mono-repo is interesting though, in that you can check out individual directories without having to check out the entire repo. It's very different from checking out a bajillion-line git repo.
It's important to stress that Google uses Perforce and not git (at least for that monorepo, they use git/gerrit for Android).
A monorepo this size would simply not scale on git, at least not without huge amounts of hacks (and to be fair, Google built an entire infrastructure on top of Perforce to make their monorepo work).
Google doesn't use perforce anymore. It's been replaced with Piper, you can read about it in articles from about 2015 or so. Perforce didn't scale enough. I guess it's not clear to what extent Piper is a layer of infrastructure on top of perforce or actually a complete rewrite? I was never super sure. The articles appear to imply way more than a layer on top...
You are exactly right that git doesn't scale though, go see the posts on git that Facebook's engineers made while trying, only to be met with replies to the extent of "you're holding it wrong, go away, no massive monorepo here", at which point they made it work with mercurial instead. Good read though, lot of good technical details. Can't find the link at the moment though :(, but it was from somewhere around 2012-13 ish.
There's nothing wrong with saying "you're holding it wrong" if they're holding it in a way clearly contrary to the solution design. I don't fit in a toddler's car seat and if I tried, it's clearly my fault and not the seat engineer's. I doubt they'd want to accept my changes that would make it work worse for toddlers either.
Sure, if you don't care about people actually using your stuff you can ignore their requests. But Facebook and Google are now working on Mercurial rather than git, and Mercurial actually cares about ease of use (whereas git seems to revel in its obtuseness) and the Mercurial folks are looking at rewriting it, or parts of it in Rust to improve performance, which has always been the major issue.
If all those things continue I think the only reason to use git over hg would be github. How long until they decide to support Mercurial too and people abandon git?
> Sure, if you don't care about people actually using your stuff you can ignore their requests.
Yes. End of story. People will abandon things that don't support them for things that do and those that want to continue using something that fits their application will do so. Nothing to see here; we get it, you don't like git -- don't use it if it doesn't fit your needs. However, don't expect those who do like it to go out of their way in a way they don't want to please you. Just because there is a community developed around something and that something is open source does not mean they are required to accept whatever patches come their way -- often the best projects know what to keep out as much as what to let in. In this case, the git community has decided it doesn't want to do those things; more power to them.
>> Sure, if you don't care about people actually using your stuff you can ignore their requests.
I think you nailed the problem with Git here: it was created by one guy to support his pet project and as long as it works well for him all the other feature requests are low priority.
That's still not even close to Google's repository:
"The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence. The repository contains 86TBa of data, including approximately two billion lines of code in nine million unique source files."
As another user mentioned, many git actions scale linearly in the number of changes, not in the size of the repository. Try recreating the scaled repo, but say, in commits of 1000 lines each (ie. 200K commits), and see how long things take.
Did your experiment also do 40,000 changes per day (35 million commits, of varying sizes throughout the repo), and then see how that affects git performance? My (admittedly crappy) understanding of git is that it also scales on the commits, not just the raw file number/size count.
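A hedged sketch of that experiment: synthesize a repo out of many ~1000-line commits and time a few everyday git operations against it. The commit count below is tiny so it finishes quickly; crank it toward 200K (or replay 40,000 changes per "day") to approximate the thread's numbers.

    import subprocess
    import time
    from pathlib import Path

    GIT_ID = ["-c", "user.name=bench", "-c", "user.email=bench@example.com"]

    def build_repo(root, n_commits=1000, lines_per_commit=1000):
        repo = Path(root)
        repo.mkdir(exist_ok=True)
        subprocess.run(["git", "init", "--quiet"], cwd=repo, check=True)
        for i in range(n_commits):
            f = repo / ("file_%d.txt" % (i % 100))  # spread changes over 100 files
            with f.open("a") as fh:
                fh.writelines("commit %d line %d\n" % (i, j) for j in range(lines_per_commit))
            subprocess.run(["git", "add", "."], cwd=repo, check=True)
            subprocess.run(["git", *GIT_ID, "commit", "--quiet", "-m", "change %d" % i],
                           cwd=repo, check=True)

    def timed(root, *cmd):
        start = time.perf_counter()
        subprocess.run(["git", *cmd], cwd=root, check=True, capture_output=True)
        return time.perf_counter() - start

    if __name__ == "__main__":
        build_repo("scaling-test")
        for cmd in (["status"], ["log", "--oneline"], ["blame", "file_0.txt"]):
            print(cmd, "%.2fs" % timed("scaling-test", *cmd))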
Google no longer uses perforce either. I believe it also stopped scaling. They now use Piper, which has a perforce like interface, but is not the same thing.
And there are other, non-Perforce-like interfaces to Piper.
Perforce is really common in a few domains because it handles 1TB+ repo sizes cleanly, has simple replication, locking of binary files and a good UI client for non-programmers.
Was pretty much used exclusively back when I was in gamedev, not sure if that's still the case.
git is designed from the ground up to be 100% distributed. This is useful for small and/or open source projects. It's 100% portable. You can fork and merge between different repos maintained by complete strangers.
Now, imagine you're a huge corporation. Your code consists of millions of files that have been edited millions of times. It's never going to be released to the public. It's never going to be forked, much less by a stranger. You're going to have only one main branch and main build ever, except for maintenance branches. The complete history of everything that has ever happened in that repo would take up many gigabytes, and developers are probably only ever going to need to look at and/or build locally 0.01% of that code themselves.
If you were going to design a version control system from scratch for the latter scenario and you had never heard of git or any other existing VCS, how would you design it? Would you come up with something like git? Probably not. People would just have local copies of the minimum of what they needed to get their work done; anything else would call some server on the VPN they were always on. And you would probably come up with some specialized server architecture, with databases and such, that wasn't all that similar to the client architecture it would also need.
Having worked with all three, nothing but Stockholm syndrome would keep anyone from switching away from CVS. Likewise, the switch to git for open source happened (in my opinion) in large part because GitHub offered a far better experience than SourceForge, which was dominant at the time.
There was actually a brief period where Google Code was ascendant, but then GitHub was demonstrably investing more in collaboration.
I think one aspect of Git that is really important is forking, and having your own local commits. Merging commits and patches in svn were awful. You wouldn't ever allow someone random to join your svn repo, but if they can reasonably provide a patch, you could take it. Git makes that massively easier.
For me the main feature was distributed nature. SVN is OK on a gigabit corporate LAN with dedicated people to manage & maintain the servers + network. Anything less than that, and it becomes slow and unreliable.
The dependencies aren't really "downloaded" at all. When you build something, the artifacts are cached locally, but the files you are editing generally speaking not actually stored on your machine. They're accessed on demand via FUSE.
This used to be done manually via "gcheckout" but that's long since been replaced. Users now don't do anything but create quick throwaway clients that have the entire repo in view.
Until very recently there was a versioning system for core libraries so those wouldn't typically be at HEAD (minimizing global breakage). Even that has been eliminated now and it's truly just the presubmit checks and code review process that keeps things sane.
That's a pretty common feature in most non-DVCSes. It was nice having Perforce on the last game I worked on. The art directory was ~500GB and not fun to pull down even with a P4 proxy.
I work in a large company and I have used a central repository for six years and a distributed for six years. I think a central repository is better. The benefits are:
1) Transparency. I can see what everybody else is doing and if somebody has an interesting project I can find it quickly. You can also learn a lot from looking at other peoples changes.
2) Faster. To check out the source code for the project I now work on takes an hour in the distributed system, while it only took 5 minutes in the centralized system.
3) Always backed up. All code that is checked into the central repository is backed up. It has happened twice that employees have left and code was lost because they only checked it in locally.
Many have only used CVS or SVN, which are horrible. I'd rather use Git or Mercurial than those, but Perforce is really good.
> 1) Transparency. I can see what everybody else is doing and if somebody has an interesting project I can find it quickly. You can also learn a lot from looking at other peoples changes.
This doesn't require a single central repository, just that all repositories live in a common location.
> 2) Faster. To check out the source code for the project I now work on takes an hour in the distributed system, while it only took 5 minutes in the centralized system.
What distributed repository management system do you use, and what centralized system did you use?
> 3) Always backed up. All code that is checked into the central repository is backed up. It has happened twice that employees have left and code was lost because they only checked it in locally.
As with point 1, this doesn't require a single central repository, just that all repositories live in a common location.
This doesn't require a single central repository, just that all repositories live in a common location.
Even better, if every project includes a DOAP file (or something similar) and/or you publish commit messages using ActivityStrea.ms or something, you could easily have an interface that shows project activity around the organization, regardless of how many repositories and/or servers you use. Of course it's probably easier if all the repositories live in a common location...
I use git-svn to use a central repository. Let me list the advantages
1) Faster
There is no comparison. But let me count the ways
a) checking out stuff
It is faster than just downloading a directory using SVN.
b) just trying something out (ie. branch)
Creating a branch, making a few changes takes me seconds, and does not require me to change paths like it does for the svn victims I work with. Throwing it back out again takes seconds, and all operations are reversible for when I fuck up (which is often).
c) merging
Git's merging. Oh my God. In half the cases I just have to check stuff over, if that.
d) submitting
We use code review. Unlike most of the subversion folks, I can easily have 5 co-dependent changes in flight (5 changes, each depending on the previous one) without going insane, and I have gone up to 13, not counting experimental branches. I observe around me that it takes a good developer to manage 2 with subversion. 5 is considered insane; I bet if I showed them that 13 were in flight at the same time they'd have me taken away as a danger to humanity.
2) always backed up
Subversion doesn't back up until you commit, and people don't commit anywhere near often enough ... The way people lose code around here 99.9% of the time is by accidentally overwriting their in-flight code contributions (the remaining 0.1% involves laptop upgrades and overenthusiastic developers; even then, cp -rp will just copy my environment and it will just work, and yet the same is absolutely not true for the subversion guys).
Now with Git, I commit every spelling fix I make, every semicolon I have forgotten, sometimes separately, other times with "--amend". Only then do I make my share of stupid mistakes, after committing - something that's technically not impossible on subversion but not practical, mostly because of code review ("just commit it" on subversion takes ~5 minutes in the very fast case (which requires a colleague dropping everything that very second, AND can't involve any actual code changes, as that trips a CI run that takes 3 minutes assuming zero contention), and 20-30 minutes is a more typical time (measured from "hey, I'd like to commit this" to actually in the repository)). Committing on git takes me the time to type "<esc>! git commit % -m 'spellingfix'". The subversion commit time means that developers often go for weeks without committing. Weeks, as in plural.
I get that a git commit isn't the same thing as a subversion commit. But it does allow me to use the functionality of source control, and that's exactly what I'm looking for in a source control system. Subversion commit doesn't allow me to use source control without paying a large cost for it, that's what I'm getting at.
So I have backups guarding against the 99.9% problem (and an auto-backup script that does hourly incremental backups for the 0.1% case). The subversion guys are probably better covered for the 0.1% problem. Good for them !
3) actual version control
Git's branches, rebase, merge, etc mean I can actually work on different things within short time periods in the same codebase.
The fact that other developers are using subversion means I can have my own git hooks that I use for various automated stuff. Some fixing code layout, some warning me about style mistakes, bugs, ... (you'd be surprised how much your reputation benefits from these). Some updating parts of the codebase when I modify other parts, ... you have to be careful as these are part of the reason subversion is so slow (esp. the insistence on CI, I hear a CI run at big G, which is required before even code review can happen, takes upwards of an hour on many projects with some taking 8-9 hours)
Not really. I work at google. I work on a leaf, so my CI takes < a minute. I also can send out multiple chained changes, in a tree, to multiple reviewers, and have them reviewed independently.
Certainly, CI takes a long time for certain changes, but those are changes that affect everything. You'd have the same problem in a multi-repo approach if you updated a repo that everything else depended on. At some point, you have to run all of the tests on that change.
Cool. I've wondered about Google's CI a lot, but there are a lot of horror stories online. Most people are complaining about it taking an hour for simple changes (something called "tap", I wonder what that stands for).
Chained code review changes, I refuse to believe that in Google version control (which is perforce according to Linus' git talk at Google) chained changes are easy. Branching in perforce is literally worse than SVN, it's a bit more like the old CVS model, and they've sort-of tried to get the SVN copy-directory model forced into the design afterwards. Also the tool support (merges ...) is bad compared to subversion and stone-age compared to Git's tools.
The one reason I keep hearing for using perforce is that perforce allows the administrator to "lock off" parts of the repository to certain users.
I've done branches and merges in Git, Subversion and CVS (and I've had someone talk me through one in Perforce, but I don't really know it). Google's branch/merge experience is very likely to be somewhere between SVN and CVS, and those can accurately be referred to as "disaster" and "crime against human dignity". It's certainly not impossible, but it's very hard, and you can't expect me to believe normal developers can reasonably do that in Perforce.
Also: what would happen if you send out 20 chained commits, 10 of which are spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot semicolon, "]" that should have been ")", etc ...), 2 of which are small changes to single expressions and 3 of which introduce a new function and some tests. Perforce, like subversion and cvs doesn't have any way of tracking stuff unless you commit it and you can almost never commit without CI and code review, so would you track changes like that, or would you just leave them in your client untracked until you're ready for a code review ?
>Cool. I've wondered about Google's CI a lot, but there are a lot of horror stories online. Most people are complaining about it taking an hour for simple changes (something called "tap", I wonder what that stands for).
Well, like I said, it's possible to modify things that have a lot of dependencies, at which point you run a lot of tests, but that would be roughly true anyway. Consider the hypothetical situation where you're modifying the `malloc` implementation in your `/company/core/malloc.c`. Everything depends on this, because everything uses malloc. If you have a monorepo, you make this change and run (basically) every unit and integration test, and it takes a while.
Alternatively, if `core` is its own repo, you run the core unit tests, and then later, when you bump the version of `core` that everything else depends on, you run those tests too. Now suppose there's a rarely encountered issue that only certain tests exercise. In the monorepo you notice it immediately when you run all the tests, and you can be sure the malloc change is the breakage. In the multi-repo setup you only notice breakages when you update `core`, or maybe you don't notice at all, because it's only one test failing per package and it could just be flakiness. So noticing it is harder, identifying the issue once you've decided there is one is harder, and now you need to roll back instead of just not releasing.
>Chained code review changes, I refuse to believe that in Google version control (which is perforce according to Linus' git talk at Google) chained changes are easy. Branching in perforce is literally worse than SVN, it's a bit more like the old CVS model, and they've sort-of tried to get the SVN copy-directory model forced into the design afterwards. Also the tool support (merges ...) is bad compared to subversion and stone-age compared to Git's tools.
Google no longer uses perforce, we use Piper (note that this is a Google-developed tool called Piper, not the Perforce frontend called Piper; yes this is confusing; afaik, Google's Piper came first). Piper is inspired by perforce, but is not at all the same thing. (See CitC in the article.) The exact workflow I use isn't public (yet), but suffice to say that while Piper is perforce-inspired, Perforce is not the only interface to Piper. This article even mentions a git-style frontend for Piper.
>Google's branch/merge experience is very likely to be somewhere between SVN and CVS, and those can accurately be referred to as "disaster" and "crime against human dignity". It's certainly not impossible, but it's very hard, and you can't expect me to believe normal developers can reasonably do that in Perforce.
Suffice to say you're totally mistaken here.
>Also: what would happen if you send out 20 chained commits, 10 of which are spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot semicolon, "]" that should have been ")", etc ...), 2 of which are small changes to single expressions and 3 of which introduce a new function and some tests. Perforce, like subversion and cvs doesn't have any way of tracking stuff unless you commit it and you can almost never commit without CI and code review, so would you track changes like that, or would you just leave them in your client untracked until you're ready for a code review ?
So, Piper doesn't have a concept of "untracked". Well it does, in the sense that you have to stage files to a given change, but CitC snapshots every change in a workspace. Essentially, since CitC provides a FUSE filesystem, every write is tracked independently as a delta, and it's possible to return to any previous snapshot at any time. One way to think of this concept is that every "CL" is vaguely analogous to a squashed pull request, and every save is vaguely analogous to an anonymous commit.
This means that in extreme cases, you can do something like "oh man, I was working on a feature 2 months ago, but stopped working on it and didn't really need it, but now I do", and instead of starting from scratch, you can, with a few incantations, jump to your now-deleted client and recover files at a specific timestamp (for example: you could jump to the time that you ran a successful build or test).
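There is no public CitC command line to show, but a loose git analogy (loose, because git only snapshots what you actually commit, not every save) is digging an old line of work out of the reflog:

$ git reflog                              # list where HEAD has been
$ git checkout -b recovered HEAD@{120}    # hypothetical entry from the old work session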
>Also: what would happen if you send out 20 chained commits, 10 of which are spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot semicolon, "]" that should have been ")", etc ...), 2 of which are small changes to single expressions and 3 of which introduce a new function and some tests.
I'd logically group them so that each resulting commit-set was a successfully building, isolated feature. Then, each of those would become its own CL and be sent for independent review.
I think you are confusing central/distributed with monorepo/multiple repos. Also distributed VCS doesn't imply that you don't have a central master somewhere.
Perforce has such a janky UI though. Whenever I try to do anything significant with my company's codebase, the whole application locks up for hours. I guess I need to learn how to use the CLI.
This might not just be GUI vs CLI; it can come down to the granularity of your client mapping: if the p4 server thinks it needs to lock across large regions of depots, it can go into the weeds.
I always try to have the absolute minimum in my client specs, but sometimes you do need to operate over the world.
The perforce docs are generally well written, worth looking at them.
The third problem can essentially be solved by doing all your production builds by checking out code from some central repository. If you follow that rule, then you guarantee you'll have the source code for every binary in production.
That way, you can still have a distributed repository (Git, Mercurial, etc.) if you want. Even if some code exists only in some developer's local repository, it's presumably not that big of a deal since that code can never have made it to production.
So as someone who previously worked for Google and now works for Facebook, it's interesting to see the differences.
When people talk about Google's monolithic repo they're talking about Google3. This excludes ChromeOS, Chrome and Android, which are all Git repos that have their own toolchains. Google3 here consists of several parts:
- The source code itself, which is essentially Perforce. This includes code in C++, Java, Python, Javascript, Objective-C, Go and a handful of other minor languages.
- SrcFS. This allows you to check out only part of the repo and depend on the rest via read-only links to what you need from the rest.
- Blaze. Much like Bazel, its open-source counterpart. This is the system that defines how to build various artifacts. All dependencies are explicit, meaning you can create a true dependency graph for any piece of code (see the query sketch after this list). This is super-important because of...
- Forge. Caching of built artifacts. The hit-rate on this is very good and it consumes a huge amount of resources given the number of artifacts produced. Forge turns build times for some binaries from hours (even days) into minutes or even seconds.
- ObjFS. SrcFS is for source files. ObjFS is for built artifacts.
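As a rough flavor of that explicit-dependency point, using the open-source Bazel and hypothetical target names, the whole graph for a target can be printed directly:

$ bazel query 'deps(//server/frontend:main)'                 # every target it transitively depends on
$ bazel query 'deps(//server/frontend:main)' --output graph  # the same graph in Graphviz form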
This all leads to what is usually a pretty good workflow: you check out directories if you want to modify them, and just use the read-only version if you don't. You can still step through the read-only code with a debugger, however.
Now Facebook I have less experience with (<6 months) but broadly there are four repos: www, fbobjc, fbandroid and fbcode (C++, Java, Thrift services, etc). At one point these were Git but for various reasons ended up being migrated to Mercurial some years ago.
The FB case (IMHO) highlights just how useful it can be to have one repo. Google uses protobufs for platform independence. FB uses GraphQL at a client level and Thrift at the service level.
So one pain point is that, for example, you can modify a GraphQL endpoint in one repo but it's used by clients in others (i.e. mobile clients). There are lots of warnings about making backward-incompatible changes, some of them excessively pessimistic, because deterministically showing that something will break some mobile build in another repo is hard.
Google3 has fewer of these problems because the code is in the same repo. On top of that, Google has spent a vast amount of effort making it so the same build and caching systems can handle C++ server code as well as Objective-C iOS app code. Basically, if you're working on Google3 you compile very little to nothing locally.
Engineers on Android, Chrome and ChromeOS however compile a lot of things locally and thus get far beefier workstations.
At FB the mobile build system doesn't seem to be as advanced in that there is a far higher proportion of local building.
IIRC the Git people seemed to reject the idea of large code bases. Or, rather, their solution was to use Git submodules. There were (and maybe still are?) parts of the Git codebase that didn't scale because they were O(n). Apologies if I'm misspeaking here, but I peripherally followed these discussions on HN and elsewhere years ago as someone from the outside looking in, so I'm no authority on this.
The problem of course is that Git submodules don't give you the benefits of a single repo and I've honestly not heard anyone say anything good about Git submodules.
Just to stress, the above is just my personal experience and I hope it's taken as intended: general observations rather than complaints and definitely not arguing that one is objectively better than the other. There are simply tradeoffs.
Also, there are definite issues with Google3, like the dependency graph getting so large that even reading it in and figuring out what to build is a significant performance cost and optimization issue.
I have two main concerns when I see monorepos being used.
First, like in other areas, I see companies that want to "google scale" and blindly copy the idea of monorepos but without the requisite tooling teams or cloud computing background / infrastructure that makes this possible.
Second, I worry about the coupling between unrelated products. I admit part of this probably comes from my more libertarian world view, but I have seen something as basic as a server upgrade schedule tailored for one product severely hurt the development of another product, to the point of almost halting development for months. I can't imagine needing a new feature or a bug fix from a dependency but being stuck because the whole company isn't ready to upgrade.
I've read of at least one less serious case of this from Google, with JUnit:
> In 2007, Google tried to upgrade their JUnit from 3.8.x to 4.x and struggled as there was a subtle backward incompatibility in a small percentage of their usages of it. The change-set became very large, and struggled to keep up with the rate developers were adding tests.
> I worry about the coupling between unrelated products.
I even worry about coupling among related products.
I could see monorepos working out well for a company that just does SaaS, and is able to get away with nice things like maintaining a single running version of the app, and continuous delivery.
Having mostly worked in companies that do shrinkwrap software or that allow different teams or clients to manage their own upgrade schedule, though, monorepo seems to me like a recipe for a codebase that is horribly resistant to change. Not just in the "big bang upgrades like JUnit4 are awful" ways described above, but also in a, "We never clean up old stuff, because most of the time when we try it breaks a bunch of other teams' code and we just nope out of that whole hassle, so barely-supported code sort of collects continuously, like dead underbrush in a forest that's never allowed to burn, until eventually it all explodes in a horrible conflagration," sort of way.
Seeing the list of things that Google keeps in a monorepo, vs things that Google keeps in Git repos, it seems like they might be thinking similarly. They've really only got a precious few products that typically run on non-Google-owned hardware, and apparently the major ones live outside the monorepo.
The dynamics actually played out very differently at Google. Because it was a monorepo with automated testing, if you didn't want other teams to break you when they change the dependencies, then you had better have a robust test suite.
Breaking changes would then lead to a discussion with your team, rather than your fruitlessly trying to binary search to find the commit that broke you.
Over time, the culture at Google became that all teams need to write tests at the unit, functional, and (usually) integration level.
>They've really only got a precious few products that typically run on non-Google-owned hardware, and apparently the major ones live outside the monorepo.
This depends on what you mean. Most/all consumer android applications don't run on google-owned hardware, but are in the monorepo.
That said, you're right that the whole "keep things up to date" thing is important. That's where tools like rosie and even bots come in.
> Seeing the list of things that Google keeps in a monorepo, vs things that Google keeps in Git repos, it seems like they might be thinking similarly. They've really only got a precious few products that typically run on non-Google-owned hardware, and apparently the major ones live outside the monorepo.
I always thought it was more that the things which take open source contributions are hosted in Git while the internal things would be hosted in Google3.
First, I agree with you about companies worrying prematurely or unnecessarily about “google scale”. You saw this a lot in the NoSQL hype days. You can go pretty darn far with even a single MySQL instance.
Second, source-level dependencies vs binary-level dependencies is a choice and a commitment.
Even at large companies release schedules can really hinder you.
I didn’t hear about the JUnit issue but I can believe it. With code bases this large you have to get really good at static analysis (dynamic languages are your enemy here), tooling for refactoring, and just general hygiene of the code base.
For the first concern you have things totally flipped. A monorepo actually seems best suited for a small company, with a small codebase and a small set of services.
If anything, the stage of company where it makes most sense to have many small repos is when you have a large company with multiple unrelated products, services, teams, etc.
You're right that monorepos are just fine for a small company. The problem is that at some point they don't scale without code review discipline and sophisticated tooling (such as Google's); thus there is a bottleneck where it becomes harder and harder to scale until you get your tools right.
Yes, but that point is probably when you have hundreds of engineers, at which point you can afford to have (and probably will already have) a few engineers working on internal tooling.
>>> First, like in other areas, I see companies that want to "google scale" and blindly copy the idea of monorepos but without the requisite tooling teams or cloud computing background / infrastructure that makes this possible.
A monorepo will work fine for most small and medium companies without issue, even on top of git.
The need for special tooling, and the performance issues, only pop up when you have millions upon millions of lines of code.
> broadly there are four repos: www, fbobjc, fbandroid and fbcode
This used to be true, but today these are all in fact the same hg repo (www as a possible exception, I'm unsure). The "sparse checkout" machinery disguises it, but for engineers working cross platform (e.g. React Native) it's routine to make commits that span platforms.
I'd add that most technical companies today, unless at the scale of Google/Facebook, do not necessarily (and most likely don't) have very sophisticated tooling in place. You can imagine Jenkins is always in place and code is split into "self-contained" repos; I don't know if Google/Facebook uses Jenkins, but I know Netflix certainly does.
I haven't heard much about Microsoft or Amazon, though I do know from a friend working at Apple that their tooling is not always consistent from team to team. I would appreciate it if someone from these other big tech companies could discuss their development workflow.
As an SRE/DevOps, I love working on internal tooling because it feels like creating my own programming language - I can be creative but focus on solving problems in my domains.
Big, non-tech companies use completely crap tools in a lot of cases, compared to even 50 person startups. Google's a clear outlier in terms of tool quality, even for a tech giant, but I've seen some great in-house or newer tools used by a lot of startups, too. There are often huge differences in tool quality vs. task for companies of the same size/stage, too.
I respect that opinion, but it's different from my experience. My experience is that non-tech companies just don't care (i.e., we can do what we want), and that startups spend way too much time worrying about the tools they will need when they get as big as google. Instead of, you know, getting as big as google.
These are only secrets in the most tendentious sense. Half the people at Facebook worked at Google before anyway, and there are maybe a half dozen companies that could plausibly get the benefits of going Google-scale for source control. And they already have all their internal systems: if they lack all the same capabilities as Google, it's not because Google's systems are secret but because of other challenges.
Yeah, the mono repo / distributed division is almost a red herring. A key component of the division, though, is fault and responsibility in making something as relatively unimportant as the source control and build systems work well.
Google has put a lot of money and effort to make their system nice. Working in a mono repo without that much effort is very frustrating, doubly so because there's nothing individual teams can do about it. It's even worse if you can't even make team-specific branches on the mono repo to try and isolate yourself from the steady stream of breaking changes elsewhere.
However if you're a team lucky enough to get out and do most things on your own git repo, then you're now the only ones responsible for making that better or worse. Fortunately there's a ton of open source to learn from and use, so taking control of your own team's destiny to get to a point better than before doesn't have to mean much work.
"
Google has put a lot of money and effort to make their system nice. Working in a mono repo without that much effort is very frustrating, doubly so because there's nothing individual teams can do about it."
Sure there is. They can architect code in a way that it doesn't break heavily when other people do things.
I.e. abstract things reasonably. They can test things well.
And they can complain when other teams aren't doing either and it's making them less effective.
A refreshing comment. I feel like people often overlook the importance of execution. Though, basing one's execution off shaky ideas is not really the best either.
Because it's not really secret? Numerous articles have been written about this stuff for years and years and also it's not hard to get [G|X]ooglers to talk about this stuff candidly in casual convo.
I think that what is missing when using multiple git repos is the ability to make a code change that spans multiple projects. We're open to adding that to GitLab.
This will help if you don't want to run a monorepo, but many in the industry consider a monorepo suitable for their organisation. The biggest blocker with Gitlab in my opinion is not being able to only run a job if some specific folder was modified (https://gitlab.com/gitlab-org/gitlab-ce/issues/19232).
In places I've worked in before that needed to join concurrent changes across repositories, we would have the build system always build from a super-repo that had the other repositories tracked as submodules. Co-dependent changes were pulled in via a single commit in the super-repo.
This was kind of cumbersome to maintain TBH, and the fact that changes to different repos can be dependent on one another seems to strongly suggest that the code should be together in the same repo. Personally, I opt for mono-repos until I'm forced to change for whatever reason.
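A sketch of that super-repo pattern, with hypothetical repository names; the co-dependent change lands by bumping both submodule pointers in a single super-repo commit:

$ git submodule add https://example.com/serviceA.git serviceA
$ git submodule add https://example.com/serviceB.git serviceB
# ...after the co-dependent changes are pushed in each sub-repo:
$ git submodule update --remote serviceA serviceB
$ git add serviceA serviceB
$ git commit -m "Bump serviceA and serviceB together for the cross-repo change"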
I think the monorepo is better but it's really hard to pull off without Google-like tooling and engineering practices. Git + meticulous dependency tracking and strict versioning conventions is probably the better move for most companies.
+1, I really dislike the "Let's do it because Google does it" mentality in software engineering. Google operates at a scale and encounters problems that the average company would probably never encounter.
How would they do CI/CD if it's all in one big repo?
I suppose you could do it if you had a very strict rule where absolutely everything that could affect a "unit" was inside its own directory (and nothing higher up than that "project root").
So you could check which sub-directory is affected within a commit and so on.
This is one part of a _very_ big answer, but Bazel (internally Blaze) lets you do reverse dependency querying [0] with the query language, i.e. "Bazel, give me the list of targets that depend on this target that I've just modified"
$ bazel query 'rdeps(//foo/my:target, //...)'
Of course, this query in the monorepo will take a long time or not work, because the target universe of "//..." is far too large. This is where other systems come in.
I'm unsure about the confidentiality of those systems, so I'm erring on the side of not expanding.
However by deriving from first principles, yes, there is no reason to re-query the transitive closure of unchanged targets' reverse deps, so caching can happen here.
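One way to keep such a query tractable without any of those internal systems is to shrink the universe argument from "//..." to the subtree you actually care about (target names hypothetical):

$ bazel query 'rdeps(//myproject/..., //myproject/lib:parser)'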
Blaze/bazel solves this. All build targets (buildable units) explicitly define their direct dependencies. If you modify something, you can then run all tests for all units that transitively depend on your change.
For surface level changes this is often quite small. For changes to core libraries, well, you run a lot of tests.
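As a sketch (hypothetical target name), the two steps can be glued together directly, filtering the reverse dependencies down to test rules:

$ bazel test $(bazel query 'kind(".*_test", rdeps(//..., //company/core:malloc))')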
Oh, it works really well. One benefit is that all tests affected by your target are run when you presubmit. Another benefit is that everything is at head, so things like library bugs and security bugs are handled naturally as part of the new releases. This usually happens twice a week for most server binaries.
Depends. Mobile tests that run on emulators are the worst. Unit tests finish relatively fast; integration tests that bring up servers tend to be slow (30-40 minutes best case for the projects I'm working on). The cost of this gets amortized: you can run the immediate unit tests manually on the command line as the fastest signal; then when you send the change for code review, presubmit runs; during code review you may choose to run them as you go; eventually when you submit, they run again.
If there have not been any changes to your commit/CL and there is an already-passing run, it will just skip the tests and submit.
Oh wow. We have a test tenancy that's carried throughout production, so you make requests against real backends (read/write data in the test namespace, sometimes read-only production data). There's a proxy in front doing rate limiting, endpoint whitelisting, audit logging, emergency lockdown, etc. I never thought of deploying a whole separate environment just for integration testing.
Still, seems you could keep a handful of integration test environments always running? Time spent waiting your turn for one of them could well be less than time spent spinning up a whole bunch of servers.
There is an effort to make everything hermetic. Namespacing is hard and not always possible, and touching production servers (and potentially crashing them) could cause significant revenue damage.
I don't think all tests should be hermetic - the benefit usually doesn't outweigh the effort it takes to make them so - but hey, that's what we are doing.
In a single integration test? That'd be pretty absurd.
At least in our project, each integration test has a certain amount of overhead. Some backends are fakes (when I request X, you provide Y), some are actually booted up with the test, e.g. persistence.
Multiply this across N integration tests, have lots of demand for the same CPUs, and you're up to 30-40 minutes of integration test time.
Though, that said, some integration tests can be crazy long if they have a lot of "waitFor" style conditions. "Do this, then wait for something to happen in backend Z. Once that's done, do this, and this, and this..."
(Not at Google, but Twitter has its own monorepo) Generally submit queue takes 10-20 minutes, longer if you're changing a core library.
I find the larger factor is what kind of test has to run. Feature tests can take a while on CI if you're spinning up lots of embedded services (dependent services, MySQL, storage layers, etc).
To me, Piper is a monolithic version control system which is geared towards good engineering practices.
As far as I know there are only two such systems in use today and the other one is very dated and older than a lot of things out there.
When people say they have worked in a monolithic repo, they typically mean one repo under one of the open source version control systems, but none of these actually do or support what is needed when working with a monolithic repo AND modern/good engineering practices.
For that a specialised VCS is required, and there are very few examples of that, none of which are open source.
Git could probably be made to do this kind of stuff, but it would require some extensions to the DAG as well as extending its already verbose command line set. But I think it is doable.
The question is who can do it? Most are probably under some strict NDAs.
I think that's an effect, not a cause. It's annoying to open source google3 code because it means untangling dependencies. Meanwhile Android has lots of SOC-specific and other development happening behind the wall. So the open source question seems orthogonal.
Google today has separate repos (android, chrome, chromeos, google3), each with its own build system: Gradle, gyp/ninja, Portage, Blaze. There's hysterical raisins, but I wonder if Google considers it to be a good thing these projects are so different, or a wart they would prefer to fix?
What nested brings to the table is the semantics of a mono repo with the advantages of a multi repo. The whole thing walks in lockstep, if you have a 3 week old version of the kernel and you add in the testing component (subrepo) then you get the 3 week old version of the testing component, all lined up with the same heads in the same tip commit.
I get it that git won but at least steal the ideas.
Edit: BTW, bk has a bk fast-import that usually works (it doesn't like octopus merges but other than that....)
The lazy me thinks maybe 1 kitchen sink repo is better.
Here's my problem:
I can be working on multiple projects at the same time. Each project has multiple modules (core, api, www, admin, android, etc -- I use microservices on Google App Engine). Sometimes, some modules have "feature" branches. Oh, did I tell you I work on both Desktop and Laptop?
The problem is syncing. Before traveling, I need to make sure the projects/modules I'll be working on the go all have latest commit from Desktop.
My question:
Is there some "dashboard/overview" for all Git projects? So I can quickly tell, "Ok, all projects are at latest commit, and oh I'm working on feature branch for project X and Y."
There is a dashboard called GitHub? Or what are you looking for?
I think if you want to make sure that all your changes are in, you need to do what most programmers do and learn to finish a programming session with a commit and push, just like you finish a sentence with a dot. Once you are used to it, the chance of forgetting is really low.
This is something I’ve been idly thinking about too.
I agree with regards to committing regularly, but sometimes life happens and one can forget.
I’ve been considering writing a python script that checks my local repos for uncommitted/unpushed changes and - now I think about it - perhaps also runs when I start a new terminal session just for good measure.
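A minimal sketch of that idea; the commenter suggests Python, but a few lines of shell do the same job, scanning a hypothetical ~/src directory for dirty or unpushed repositories:

#!/bin/sh
# Report repos under ~/src that have uncommitted or unpushed work.
for gitdir in "$HOME"/src/*/.git; do
  repo=$(dirname "$gitdir")
  [ -n "$(git -C "$repo" status --porcelain)" ] && echo "$repo: uncommitted changes"
  [ -n "$(git -C "$repo" log --branches --not --remotes --oneline)" ] && echo "$repo: unpushed commits"
done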
Once, I noticed that a minor piece of how Google's Python system tests could be very slightly cleaner and more consistent. It would be a tiny change, it seemed very safe, and it was easily accomplishable with a short sed script, but it'd also be a backward-incompatible change across the projects of hundreds of teams and thousands of build targets. I was able to make that change with only a few commits and without needing to bother most of those teams.
These sorts of small, general, large scale cleanup commits are quite common at Google, and they're encouraged. They help keep the codebase healthy. There are special groups that review them so that all of the individual teams affected don't have to bother, and there are tools to manage the additional testing and approval requirements for such a change.
At my previous company, making such a change would have been a major undertaking. I never would have considered a refactor of that scale without a critical need. They had thousands of packages, each of which had its own repository and an incredibly complex web of build and runtime dependencies. It was a nightmare, and fiddling to find a working set of versions of internal dependencies took up way, way too much of my time each day.
For me, refactors were the largest "aha" moment. On large-scale projects you can move a lot faster if you don't have to maintain backwards-compatible APIs. We use Facebook's version of a mono-repo (BUCK [1]) for iOS dev. It's really easy to change an API, see all of the upstream breakages, write tests, fix upstream and submit a diff (pull request).
With a fragmented large code base you're in a world of hurt because you're dealing with versioning. There is no guarantee of when every other dependency will migrate to the latest code path.
But again, if you're in a 1-10 person team working on some trivial codebase, a monorepo might not be helpful. If you have 500 engineers working on a single codebase, tradeoffs change.
1. Huge changes that affect the API's of multiple sub-packages.
2. Having no friction to change anything makes you far more productive and ambitious.
3. Scripting at an org-level means you can automate things more easily and in more depth.
We run an entirely Node stack, so Lerna enables this in the first place. Given that, I'd never move to more than one repo if possible. It's almost all downside: more overhead/fragmentation, less control, more wasted time/mental overhead moving between things, API friction that reduces ambitious change.
The only downside of a monorepo is GitHub not supporting them well. If you want to release some sub-packages as OSS, or want to use GH to track issues, you're stuck using one big repo to handle everything. I'd bet GitHub fixes this within the next year or so though.
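For what it's worth, a minimal sketch of the org-level scripting Lerna enables (package name hypothetical):

$ npx lerna run build                     # run the build script in every package
$ npx lerna run test --scope @myorg/api   # or limit it to a single package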
The clear advantage of a monorepo over a mono-purpose repo is handling technical debt. The goal is that you never build up technical debt and instead immediately patch all references in an atomic change.
Example:
Let's say you introduce a breaking change in lib A that is used in libs B and C. The first problem is visibility: A's authors do not necessarily see that it is used in B and C. Second, the build should break immediately, not only when someone next builds B/C.
Tech debt isn't just breaking changes though, and mono repo does nothing to curb all the other types of tech debt:
* Accumulation of FIXMEs
* Partial refactors cut short after change in business reqs
* Quick hacks near release time
* "this could be done better if I had time"
* Overdue re-architecture after accumulated changes and additions
* Orphaned code
* Commented out tests
* etc etc
Can someone further explain how the pre-commit phase works? I don't get how/why "pre-commits" work without feature branches.
How are my changes shared with the reviewer if there is no feature branch? Is my local code uploaded to that review tool mentioned in the article? And then what happens if the reviewer requests changes?
I probably did get this completely wrong, so thanks in advance for pushing me in the right direction.
You got it right. There is a separate set of tooling layered on top of the VCS, which maintains a sub-history of each commit. The tool (Gerrit, Phabricator, etc.) tracks this relationship between commits, and whatever is eventually merged into the repo.
This architecture assigns each line of code a nested history: the public commit log, and also the sub-history of each commit, which evolved during code review.
IMO it would be better if the code review changes were manifest in the public commit log (e.g. via feature branches), instead of being tracked separately. The code review layers add duplicative complexity.
Perforce (the origin of Piper) has a concept of changelists. Some changelists are submitted (committed) while others are not. So the review works by uploading your changelist to Perforce, then pointing people to that changelist. It's like an unnamed feature branch that can't easily be rebased off of. Changelists do have a base, and you do a "g4 sync" to essentially rebase off of master. Does that make sense?
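For the vanilla Perforce flavor of this (outside Google's g4/Piper wrappers), a pending changelist can be shelved so the reviewer can pull down the diffs; the changelist number and file name are hypothetical:

$ p4 change                    # create a pending changelist, say 12345
$ p4 reopen -c 12345 foo.c     # move your edited files into it
$ p4 shelve -c 12345           # upload the diffs so a reviewer can unshelve them
$ p4 sync                      # later, bring your client up to date with head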
When people talk about making a single global change to update all clients when they change an interface, are they updating unsubmitted changelists too?
Unsubmitted changes at Google usually come in one of two flavors, short-lived (abandoned or submitted within a few days) or perpetual. The latter flavor is often for "I think we might want this". It's not uncommon for those to be completely rewritten if they're actually needed. There's usually a preference for submitting useful things (with tests!) and flag gating them to cut down on bitrot.
I have seen exceptions -- I reported a bug in a fiddly bit of epoll-related code and an engineer on my team had a multi-year-old fix -- he hadn't submitted it because he wasn't confident he'd found an actual bug. The final changelist number was more than double the original CL number (unsubmitted changes get re-numbered to fit in sequence when they're submitted -- the original number redirects to the final submitted version in our tooling).
Well, the act of submitting a changelist essentially runs a test suite which requires that the changelist has no merge conflicts with the head and that the relevant tests pass. From that it follows that if someone changes an interface, you'll get either merge conflicts or test failures on your own changelist. Meaning - it's the changelists authors task to sync it up to the current head state so the refactorers won't touch unsubmitted changelists.
It's pretty much the same as GitHub pull requests - the changelists are supposed to be decently short lived and if the master code changes it's up to you to resolve conflicts and get it into a mergeable state again.
Every CL (changelist) gets 2 CL numbers, an original (OCL) and a committed (CL) number. So CL #s are monotonically increasing, but less than half are ever actually committed.
When the CL is first uploaded or sent for review the OCL is assigned; the CL number is assigned on commit.
- Monorepo for all the services code in the enterprise (Java).
- UI code is kept in their own repositories.
- Enterprise services are exposed using well-defined, stable REST-like APIs (JSON, HTTP, Swagger, etc.). We only expose what is needed.
- Within the services monorepo, services can call other services directly using regular java calls.
- Services in the monorepo are refactored all the time. This is the advantage of using a strongly typed language like Java.
- Several instances of the services run at a time; they scale horizontally. Releases are done one instance at a time.
- We had good unit and integration test suites on the services.
Now I am working for a different company. We have hundreds of services deployed. No one knows what is running where, or what the dependencies are. Once something is released, everybody is afraid to make a change, as no one knows how it affects other systems.
How do they handle Android? It's one thing when you can just go "use the latest" for a web service and another when you have N branches. I'm not sure a monorepo is as big of a benefit then.
Android Open Source Project is still in gerrit, where the code is stored as a set of many git repositories. There's a "repo" tool that adds a helpful layer of abstraction to make those git repositories (mostly) look like a single repository.
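For reference, the repo workflow looks roughly like this (the manifest URL is the public AOSP one):

$ repo init -u https://android.googlesource.com/platform/manifest
$ repo sync                 # clones or updates every underlying git repository
$ repo start my-feature .   # start a topic branch in the current project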
Doesn’t this create a single point of failure for the development of the entire software ecosystem at google? What happens if something goes disastrously wrong with their repo? Does everyone twiddle their thumbs until it gets fixed?
Also, isn’t this the opposite of “separation of concerns”? Code should be divided into small units of functionality that don’t overlap. Minimise interdependencies. Eliminate unnecessary work. Is this the most efficient and future-proof approach?
Yes, when the system is down, it's really bad. Though it's an extremely reliable system as so many folks depend on it (note there's also lots of layers of tooling, so maybe you just need commits to work versus code review versus your local edits, etc.). In my experience, the worst things are actually when the build/test system is running slowly (thus blocking commits until tests are complete) or the bog standard "ugh, someone decided to force submit even though it said the tests were failing. Roll that back, please".
As for separation of concerns, there's actually a lot of scoping / visibility stuff precisely to avoid letting people depend on things they shouldn't. I think you now have to explicitly open up your visibility to let random other projects depend on you. A monorepo doesn't require that you share, it just permits it.
Interesting. Every system is going to be down times, so I don't worry about that too much. I'd be more worried about a design flaw or bug that doesn't get exposed until you're painted into a corner and it's hard to escape.
Good point about the scoping. I suppose it really depends on how developers use it - do they default to sharing or not?
Personally, I’d have to agree that at a minimum a mono repo per distinct product role (server side in one mono repo, desktop software in one mono repo) is immensely valuable.
I heard updating pager-duty [on-call] commits used to be a mess; perhaps it was apocryphal, but if not, why not use a different system for the pager-duty updates?
"The two are not mutually exclusive. I utilize what I call ensemble repositories that, for us, submodule the individual repositories" - http://disq.us/p/1hj9nmu
I'm a bit disappointed that they don't back their publication with any numbers comparing the two approaches. Without them, I don't see the scientific value in what would otherwise make a good click-bait blog post.
I'm working on a project where we have two projects: a desktop app and a web backend. They each live in their own repo. However, this approach is proving tedious for us now, as every new feature often means two branches (one in each repo) which also means two separate pull requests when the feature is ready to be merged. Has anyone encountered this kind of issue, and is the solution to simply merge the two repos into one monorepo?
I develop an application that has a SPA frontend and an API backend. They could live inside separate repos, but I prefer to keep them together, because if I change the API signature I'll also make the same changes in the frontend, and they will deploy together at once.
> Google has shown the monolithic model of source code management can scale to a repository of one billion files, 35 million commits, and tens of thousands of developers.
> Benefits include unified versioning, extensive code sharing, simplified dependency management, atomic changes, large-scale refactoring, collaboration across teams, flexible code ownership, and code visibility.
> Drawbacks include having to create and scale tools for development and execution and maintain code health, as well as potential for codebase complexity (such as unnecessary dependencies)
At this point I normally assume that if Google does something a certain way, or uses a particular proprietary technology, then it likely should NOT be used.
I work for a place that tried to model itself culturally after Google and Facebook and has had a lot of engineers moving back and forth. If Google is anything like us, then it creates the wrong incentives to invent in-house stuff. See, in their expense scheme the salaries of engineers are not a huge deal; it's cheap to have people do things. There is also an incentive for an engineer to try and "leave a mark". There is also a bias toward hiring new and inexperienced people, who fail to learn the existing tech and replace it with something different, simpler and less functional (by the time it matures, it becomes just as complex, however).
I live in the Java world, so I am seeing Google reinvent (poorly) every bit of Java tech: DI (Dagger), build tools (Gradle), commons libraries (Guava), and the list goes on and on. Well, apparently, they also internally reinvented Git, Jenkins and the rest of the tools. The rest of us should probably NOT do it this way.
Gradle was not a Google invention. Dagger was invented by Square. Google invented Guice (Bob Lee now at Square) which was way better than Spring and J2EE. Guava is far better designed than Apache Commons.
Dagger2 was invented by Google for good reason because it works purely as an annotation processor without runtime bytecode classloader magic. That means it is easier to debug, faster to startup on mobile, and can be crosscompiled with j2objc and GWT/j2cl.
Much of the time, in-house stuff is created to deal with scalability or maintainability problems.
Amazon has a repo per package model, but has a meta-versioning system (called version sets) that tracks revisions of packages built together and a way for declaring package dependencies. So in some sense you're building your own monorepo out of a bunch of packages (usually by branching off a parent version set).
It fulfills a lot of the same goals of having a monorepo in terms of scaling a large organization, while having somewhat different pros and cons.
We don't, and it's hell. I may make a change to X, but some other developer doesn't know it, and then one day when he makes a change to X and tries to pull those changes into Y, now he has to change the bits of Y which are affected by my changes as well as his.
I would love to have a monorepo, although I'd advocate for branch-based rather than trunk-based development. Yes, merges can be their own kind of hell, but they impose that cost on the one making breaking changes, rather than everyone else.
Because of course they do. Just like we all do. Everyone has a file they backup just because they might need to reference it. Index.txt is my cross.
In terms of sheer code. I hope they have some Jedi ai to cross reference it. If they don't they will. In fact we help everyday.
No point in burning the printing presses. We have to learn to get along.
Enjoy living, learning will come naturally as you pursue your interests. Those interests will change, growing, cultivating other fields of knowledge, grappling for your attention.
Culture you enjoy as only a human can, and cultivate it. Distribute and share it. As humans we are denying our own renaissance with a ministry of silly walks. Sometimes I think high school UN clubs are better at running the world. Why not. It doesn't seem to matter anymore.