Maybe we could look at the problem from the other side: create tools to manage multiple repos as if they were a single monorepo. A docker-compose for git.
Perforce Helix might be even better - it even has a DVCS model based on creating a "local server" that can fetch/push from a shared server asynchronously from use of that local server, and a hybrid model that allows for only parts of the repository to be hosted on your personal server, and other parts to follow the more traditional Subversion-like model. Things like exclusive locks on files that can't really be "merged" are also supported (for example, all your assets).
The only downside is that it's not open-source, and as a result has a much smaller community. It's free for up to 5 users, then "email us" for any more. But if a very flexible VCS model is something you need, it's the same as anything else you need to pay for.
Google used to use Perforce until they hit a certain scale, so it's likely it'll work for you until you hit that scale and can build your own tools too.
Well, it seems to fit the requirements better than git. Obviously, Subversion is not used much anymore. I would like to hear some experience reports on what the problems with it are.
- it requires a certain discipline: we need branching in our workflow, and this is handled mostly by convention in a Subversion repository. We have "branches" that were created by less careful colleagues by copying subdirectories of trunk into the branches folder.
- all the tooling developers fled to work on making git bearable. It seems there is good money in sugarcoating git and none in making good tools for Subversion (awareness of branches in Jenkins, decent code review...). We have a budget, but that doesn't compensate for the lead git has in that regard.
Other than that, subversion fits our needs. It just works.
Subversion is not used much anymore - just in case you entered the industry after its heyday.
Subversion was used in basically every open source project as a replacement for the previously dominant CVS.
Subversion was better than CVS, but still bad in many respects; slow synchronization and poor branching and merging support come to mind.
Because of these shortcomings, and because the idea of decentralized version control was emerging, many systems like git, mercurial, and others appeared, and git seems to be the most successful of these by now.
Must have: Tooling that can interact at a file or subdirectory level. Git cannot do that.
Should have: Access control to view and change files on a subdirectory basis. Everyone can see the repo, so you can't grant permissions per repo anymore. It's optional, but these companies have that.
Recommended: Global search tools, global refactoring tools, global linting that can identify file types automatically and apply sane rules, unit test checks and on commit checks available out of the box for everything and that run remotely quickly, etc...
It's regular tooling that every development company should have, but only big companies with mono repos have it.
It's not that the tooling is needed to deal with the mono repo, it's that the tools are great and you want them. But they can't be implemented in a multi repo setup.
Think about it: how could you have a global search tool in a multi-repo setup? Most likely, you can't even enumerate the repos that exist inside the company.
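For contrast, a monorepo-wide search is almost trivial because everything lives under one root. A minimal sketch (the file glob and layout are assumptions, and real tools like ripgrep do this far faster):

```python
import pathlib
import re

def grep_tree(root, pattern, glob="*.py"):
    """Yield (path, line number, line) for every match under one source root."""
    rx = re.compile(pattern)
    for path in sorted(pathlib.Path(root).rglob(glob)):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if rx.search(line):
                yield (str(path), lineno, line)
```

In a multi-repo setup there is no single `root` to hand this function, which is the whole problem.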
Makes me realize. If I ever go back to another tech company, the shit tooling is gonna make me cry.
IIRC, Bitbucket Enterprise has pretty decent global search. GitHub Enterprise doesn't seem to have much of any cross-repo tooling, which is one of my least favorite things about it.
Global refactoring seems a lot less necessary if you have clean separation among your processes. Maybe this is me coming from a more microservices perspective, but I'm inclined to say that needing to do a refactor that cuts across several different functional areas is a sign that things are becoming hopelessly snarled together.
Google has dedicated language, platform, library, etc. teams (I'm no longer there) that can push really huge refactoring changelists - for example, if they noticed that code had plenty of "if (someString == null || someString.empty())", they would replace it with something simpler.
Or if they found some bad pattern, they would pull it out too. I remember when a certain Java hash map was replaced, and they replaced it across the whole codebase. It broke some tests (that were relying on a specific iteration order, which was wrong) - and people quickly jumped in and fixed them.
This level of coordination is great. And it's not just "let's do it today" - things are prepared in advance: days, weeks, months, and years if it has to be. With careful rollout plans, getting everyone aware, helping everyone get to their goal, etc.
It's also easy to establish code style guides and remove the bikeshedding over tabs/spaces, brace styles, switch/case statement styles, etc. Once a tool has been written to reformat (either the IDE, or other means), and another to check style and some semantics - then people, like it or not, soon get on that style and keep going. There are more important things to discuss.
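A repo-wide cleanup like the null-check example is typically done with a codemod. A toy sketch of the idea (the regex and the `isNullOrEmpty` helper are made-up stand-ins; real tools work on syntax trees, not regexes):

```python
import re

# Hypothetical codemod: collapse "x == null || x.empty()" into "isNullOrEmpty(x)".
# The backreference \1 ensures both sides of the || test the same variable.
PATTERN = re.compile(r"(\w+)\s*==\s*null\s*\|\|\s*\1\.empty\(\)")

def rewrite(source):
    """Apply the replacement everywhere it matches in one file's source."""
    return PATTERN.sub(r"isNullOrEmpty(\1)", source)
```

Run this over every file in the depot and you have a single changelist touching thousands of callers at once.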
The idea of global refactoring is mostly that you can decide to modify a private API, and in the process actually update all the consumers of that API, because they all live in the same repo as the component they're consuming. (This is also the argument of the BSD "base system" philosophy, vs. the Linux "distro" philosophy: with a base-system, you can do a kernel update that requires changes to system utilities, and update the relevant system utilities in the very same commit.)
Code search in bitbucket server is dismal. All punctuation characters are removed. This includes colons, full stops, braces and underscores. This makes it close to useless for searching source code.
Regarding global refactorings think new language features or library versions.
Support for punctuation in search is something we knew wasn't ideal when we first added code search. As with all software, there were some technical constraints that made it hard to do.
We plan to add support for full stops and underscores in a future version, and are exploring how best to handle more in the longer term. Our focus, based on feedback, is on "joining" punctuation characters to better allow searching for tokens. Support for the full range of characters threatens to blow out index sizes, but if we get more feedback on specific use cases we're always happy to consider them.
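Not Bitbucket's actual implementation, but the "joining punctuation" idea can be illustrated like this: treat `.` and `_` as characters that join tokens, index both the whole joined token and its parts, and split on everything else:

```python
import re

def tokens(text):
    """Index a line of code: '.' and '_' join tokens; all other punctuation splits."""
    out = set()
    # \w already includes underscore; adding '.' keeps dotted names together.
    for tok in re.findall(r"[\w.]+", text):
        out.add(tok)                      # whole token, e.g. "foo_bar.baz"
        out.update(re.split(r"[._]", tok))  # its parts: "foo", "bar", "baz"
    out.discard("")
    return out
```

This lets a query for either `foo_bar.baz` or just `baz` hit the same line, at the cost of a larger index - which is exactly the tradeoff described above.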
Being a self-hosted product, we have to make tradeoffs for the thousands of people operating (scaling, upgrading, configuring, troubleshooting...) instances. In short, we try to keep the system architecture fairly simple, using available technology and keeping the broad skillsets of admins in mind.
It was a somewhat difficult call to add Elasticsearch for its broad search capability, but the fact that it's used for other purposes helped justify it. Adding Hound or similar services that were considered would have added more administrative complexity and wouldn't have provided for a broader range of search needs.
We continue to iterate on search, making it better over time.
A fair point, but I will just say that Hound is _astonishingly_ low maintenance. I set it up at my current employer like two years ago and have logged into that VM maybe twice in the entire time. It just hums along and answers thousands of requests a week with zero fuss.
> Must have: Tooling that can interact at a file or subdirectory level. Git cannot do that.
I mean, when you get big, sure. But until you're big, git is fine. Working at fb, I don't use some crazy invocation to replace `hg log -- ./subdir`, I just do `hg log -- ./subdir`. Sparse checkouts are useful, but their necessity is based on your scale - the bigger you are, the more you need them. Most companies aren't big enough to need them.
> Should have: Access control to view and change files on a subdirectory basis. Everyone can see the repo, so you can't grant permissions per repo anymore. It's optional, but these companies have that.
Depends on your culture (and regulatory requirements). I prefer companies where anyone can modify anyone's code.
> Recommended: Global search tools, global refactoring tools, global linting that can identify file types automatically and apply sane rules, unit test checks and on commit checks available out of the box for everything and that run remotely quickly, etc...
I'd bump this up to `should have`. The power of a monorepo is being able to modify a lib that is used by everyone in the company, and have all of the dependencies recursively tested. Global search is required, but until you're big, ripgrep will probably be fine (and after that you just dump it into elasticsearch).
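The "all dependencies recursively tested" part comes down to a reverse dependency walk: find everything that transitively depends on the changed library, and run its tests. A sketch with a hypothetical dependency graph (target names are made up):

```python
# Hypothetical build graph: each target maps to the targets it depends on.
DEPS = {
    "app": {"lib_net", "lib_core"},
    "lib_net": {"lib_core"},
    "tool": {"lib_net"},
    "lib_core": set(),
}

def affected_by(changed):
    """Everything that (transitively) depends on `changed` and must be retested."""
    hit = set()
    frontier = {changed}
    while frontier:
        # Targets that depend on anything in the frontier and aren't seen yet.
        nxt = {t for t, deps in DEPS.items() if deps & frontier and t not in hit}
        hit |= nxt
        frontier = nxt
    return hit
```

Build systems like bazel expose this as a query (`rdeps`); in a monorepo the graph is complete, so the answer is exact rather than a guess.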
> Depends on your culture (and regulatory requirements). I prefer companies where anyone can modify anyone's code.
This is still true at Google, except for some very sensitive things. However, every directory is covered by an OWNERS file (specific or parent) that governs who needs to sign off on changes. If I’m an owner, I just need any one other engineer to review the code. If I’m not, I specifically need someone that owns the code. IMHO, this is extremely permissive and the bare minimum any engineering organization should have. No hot-rodding code in alone without giving someone the chance to veto.
>ripgrep, ElasticSearch
Having something understand syntax when indexing makes these tools feel blunt. SourceGraph is making a good run at this problem.
Elasticsearch is too dumb. You need to use a parser and build a syntax tree to get a good representation of the code base. That's what Facebook and Google do with their Java code.
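Not Facebook's or Google's actual indexer, but the difference between text search and syntax-aware search can be shown with Python's own `ast` module: index by definition, not by substring:

```python
import ast

def definitions(source):
    """Map each function/class name in a Python file to the line it's defined on."""
    index = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            index[node.name] = node.lineno
    return index
```

A plain-text index can't tell a definition from a call site or a comment mention; a tree-based index can, which is what makes jump-to-definition and find-all-references precise.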
Agree that any small to medium company could have a mono repo without special tooling. Yet they don't.
There are companies that care about development and there is the rest of the world.
Might I suggest using a tool designed for searching source code rather than dumping into elastic. Bitbucket, sourcegraph, github search or my own searchcodeserver.com
Unless designed to search source code most search tools will be lacking.
I had a bad time at Google and was glad to leave, but wow did I ever miss that culture of commitment to dev process improvement and investment in tooling. The next startup I joined was kind of a shocking letdown. It became clear pretty early on that nobody else there had ever seen anything like the systems at Google, couldn't imagine why they might be worth investing in, and therefore the level of engineering chaos we wasted so much time struggling with was going to be permanent.
The startup I'm working for now is roughly half ex-googlers, so it is a different story. Of course we can't afford Google level infrastructure, but there is at least a strong cultural value around internal tooling, and a belief that issues with repetitive or error-prone tasks are problems with systems, not the people trying to use them.
Worked at Google for 2-3 years, mainly Java, under google3. My thoughts: having things under a single repo, with a system like blaze (bazel), I can quickly link to other systems, or be prevented/warned that it's not a good idea (the system may be going deprecated, or be fresh and new, and you need visibility permission (which can be overridden locally)).
Build systems, release systems, integration tests, etc. - everything works more easily, as you refer to things just by global path-like names.
Blaze helps a lot - one language for linking protobufs, java, c++, python, etc., etc., etc.
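To show what "one language for linking" means, here is a hypothetical BUILD file in bazel's syntax (all target names and paths are made up): proto, Java, and their wiring are declared the same way, by global path-like labels.

```
# Hypothetical BUILD file: every language uses the same declaration style.
proto_library(
    name = "user_proto",
    srcs = ["user.proto"],
)

java_proto_library(
    name = "user_java_proto",
    deps = [":user_proto"],
)

java_library(
    name = "user_store",
    srcs = ["UserStore.java"],
    deps = [":user_java_proto"],
)
```

A C++ or Python consumer would depend on the same `:user_proto` label through its own rule, which is what makes cross-language linking uniform.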
Lately docs are going in it too, with renderers.
Best features I've seen: code search lets you jump by clicking through all references. It lets you "debug" things running directly on servers. It lets you link specific versions, check history, changes, diffs.
GitHub is very far from this, if for no other reason than that it can't even know how things are linked. Even if github.com/someone/somelibrary is used by github.com/someone-else/sometool, GitHub would not know how things are connected - is it CMake, Makefiles, .sln, .vcxproj? It may be able to guess, but that would be lies in the end... Not the case at Google - you can browse things better than in your IDE, since you couldn't even produce this information for your IDE yourself (a process runs periodically to update it, using a huge MapReduce).
Then local client spaces - I can just create a dir, open a workspace there, and virtually everything is visible from it (the whole monolithic depot) plus my changes. There are also a couple of other ways to do it (git-like includes), but I haven't explored those.
What's missing? I dunno... I guess just the overwhelming fact that such a beast exists, and that it's already been tamed by thousands of SREs, SWEs, managers, and just the most awesome folks.
I certainly miss the feeling of it all. I'm back to good ole p4, but the awesome company I'm at also realized that a single depot is the way to go (with Perforce, that is). We also have git, but our main business is game development, so huge .tiff files, model files, etc. require Perforce.
Also, ReviewBoard and now Swarm (the p4 web interface and review system) are nice so far. Not as advanced as what Google had internally for review (no, it's not Gerrit - I still can't get my head around that thing), but it's getting there.
One last point - a monotonically increasing changelist number will always be easier to work with than random SHAs with no ordering - you can build whole systems of feature toggles, experiments, and build verifications around it, like:
"This feature is present if built with CL > 12345, or with cherrypicks of CL 12340 and CL 12300." You may come up with ways to do this with SHAs too - but imagine what your configuration would look like. It's also easier to explain to non-eng people - it's just a version number.
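A toy version of such a gate, loosely following the example above (the function name and exact semantics are assumptions, not any real system's API):

```python
def feature_enabled(build_cl, min_cl, cherrypicks=()):
    """A feature is on if the build is at or past the CL that introduced it,
    or if that CL was explicitly cherry-picked into an older build."""
    return build_cl >= min_cl or min_cl in cherrypicks
```

The point is that `>=` on changelist numbers is a total order for free; with SHAs you would need to consult the commit graph to answer "does this build contain that change".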
What special tooling is required to deal with a monorepo that is not required for multi repo?
From my time at Google the first thing that came to mind was citc. But I couldn't remember if citc was publicly known, so I did an Internet search for "google citc". The first search result was this article.
"CitC supports code browsing and normal Unix tools with no need to clone or sync state locally."
Unless something drastic changed in the last year, I really doubt it. There is the fb frontend, the backend, the offline batch processing repo, and the instagram frontend repo. I think the phone apps have their own repos too? It was a giant mess, especially when you had to make changes that spanned repos, like introducing a new backend API and then depending on it, or changing logging formats.
Everywhere I've seen mono repo, mono repo was better than multi repo.
They all built special tooling and have dedicated teams to support it.