Very good talk, very interesting solution. Also scary. I'd love to see more of the real usage, especially whether the exponential growth in usage really corresponds to a growth in need. Still, with some more future pruning and a code-extinction mechanism it may survive until the first "Look, we are just too big and have to split" moment :-)
That comment at the end about organisations where parts of the code are private is interesting. Is Google not one of those organisations? They have a lot of contractors writing code. Are they all free to browse Google's entire source code?
This is why we use the upstream/downstream model. If code is used in many downstream projects it should be pushed upstream. Much better to have small modules than a monolithic code base. And what about dead code?
Really interesting ideas. If there's any Googlers out there, I'd be curious to know how you use Release Branches with the monolithic repository. If everything is in the one repo, does that mean a release branch actually pertains to the entire set of code? Or does it apply to 'sub-folders'? If the latter, how do you determine that things are in a release-able state?
+Michael Tokar Yes. Similar mechanisms that allow an engineer to have a unified view of the whole source code at a particular version, with their changes overlaid on top, can be used to allow build tools to have the changes belonging to a branch overlaid on top of the entire source code at some version. It is only sensible for this to be versioned as well, especially when you couple it with hermetic, deterministic, repeatable builds (see Bazel).
The general idea is that most of the time, you just build "//...@1234567" (the main code line at changelist 1234567, where changelist ~= a commit), i.e. you just build from head at a "fixed" changelist. Only when you need to cherrypick fixes will you create a release branch to allow a one-of-a-kind mutation away from the main codeline. Decades ago you used to have to always create (Perforce) release branches with hundreds of thousands of files, but modern tooling lets you virtualize that process now, since 99.9% of the code in question is unmodified. This made the process far more lightweight. Perforce could be used to do this by manipulating the client view (I've tried it), but there's a limit as to how far you can push that idea; hundreds of lines in your client view slow things down too far to be useful, and p4 commands take minutes instead of seconds. For smaller environments it could be a viable method, if you build the tools to do it for you (maintaining branch and client views based on the cherrypicks you need).
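For anyone trying to picture the "virtualized" release branch, here is a minimal sketch of the overlay idea: a fixed snapshot plus a small layer of cherry-picked files. The file paths, the `release_view` helper and the use of a dictionary as a stand-in for the repository are all assumptions for illustration; nothing here reflects how Piper or Perforce actually store data.

```python
from collections import ChainMap

# Hypothetical snapshot of the main code line at one changelist: path -> content.
SNAPSHOT_AT_CL = {
    "search/rank.cc": "v1",
    "mail/send.cc": "v1",
    "storage/table.cc": "v1",
}


def release_view(snapshot, cherrypicks):
    """Layer cherry-picked files on top of a snapshot without copying it."""
    # ChainMap looks in the first mapping first, then falls back to the snapshot.
    return ChainMap(cherrypicks, snapshot)


# A release branch that differs from the snapshot in a single file.
view = release_view(SNAPSHOT_AT_CL, {"mail/send.cc": "v1-hotfix"})

if __name__ == "__main__":
    for path in sorted(view):
        print(path, view[path])
```

Only the cherry-picked file lives in the branch layer; everything else is read straight from the unmodified snapshot, which is the property that makes the branch cheap.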
1 billion files but only 2 billion lines of code. So each file averages 2 lines of code? For that to make sense over 90% of the files must be non-source files.
"it's very common for both old and new code paths to exist in the codebase simultaneously controlled by the use of conditional flags" Dear god what a fricken nightmare.
@ 15:18 "old and new code paths in the same code base controlled by conditional flags" - isn't this configuration hell?
+Alexander Hörnlein No, not really. And usually it is combined with other techniques like adding a new REST endpoint (as an example) that is controlled by a flag. This is how Facebook works also. Of course, there was the case where someone inadvertently turned all the flags on, and thus barraged Facebook customers with semi-complete features. Oops.
+Alexander Hörnlein Being in one big codebase means that all servers have to use the same versions of their dependencies. For third-party dependencies they have to be the same or the programs won't work. Remember, all the C++ code is statically linked. The only supported version of the internal dependencies is the one that's at head. If your server hasn't been released for a while, you have to bring it up to date before you can push it. The up side is that developers don't have to maintain legacy code. Google puts much more effort into extending the leading edge of its services than into keeping the trailing edge alive. And since there's little choice of which version of code to use, there's not much configuration needed to specify this.
+Chuck Karish I know what this big codebase means, but the bit I referred to was about "no branching" and instead having all features in the trunk (and then switching them on and off with, I guess, LOTS of flags). And with this I figured that you'd have configuration hell to maintain all these flags, à la "we need feature B but not A, but B depends somewhat on A, so we have to activate feature A(v2), which has some of the core of A but not all of it", and so on and so on.
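To make the worry about flags that depend on other flags concrete, here is a minimal sketch. The flag names, the `FLAG_REQUIRES` table and both helpers are hypothetical; the talk does not describe how Google's flag system works, so treat this only as one way such a check could look.

```python
# Hypothetical flags; each flag lists the flags it requires to be on.
FLAG_REQUIRES = {
    "feature_a": [],
    "feature_b": ["feature_a"],
}


def validate(enabled, requires=FLAG_REQUIRES):
    """Return (flag, missing_dependency) pairs for an inconsistent flag set."""
    problems = []
    for flag in sorted(enabled):
        for needed in requires.get(flag, []):
            if needed not in enabled:
                problems.append((flag, needed))
    return problems


def handle_request(enabled):
    """Old and new code paths live side by side; the flag picks one."""
    if "feature_b" in enabled:
        return "new code path"
    return "old code path"


if __name__ == "__main__":
    print(validate({"feature_b"}))
    print(handle_request({"feature_a", "feature_b"}))
```

A check like `validate` can run before a configuration is rolled out, so the "B without A" combination is rejected up front instead of being discovered in production.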
+Alexander Hörnlein I also wonder how Google makes sure that people don't just copy a whole lot of code and create a new "component" just to avoid having to update everyone. Running a code duplicate checker?
+Markus Kohler While an individual team can decide to fork a component, it usually has negative implications for that team in the long term: maintaining your own fork becomes more and more costly over time, so it's rarely done. However, let's say you wanted to move Bigtable from /bigtable to /storage/bigtable, and change the C++ namespace name along the way, and there are tens of thousands of source files that depend on it in its current path. You could a) factor out the code to the new path, leave wrappers in the old place, use Bazel and Rosie to identify dependants and migrate them, then drop the redirect code; or b) make a giant copy, use Bazel to identify dependants and migrate them, then drop the original copy. It's non-trivial, but I suspect doable within a couple of days with some planning. Systems like TAP would help ensure your changes (possibly mostly automated) don't break things, even before they're submitted. There are a few more details to think about there: it takes a little thought, and maybe some experimentation, to make sure this kind of thing works before using so much of other people's time with this change. Also, code that someone had started working on but had not submitted at the time you do this will need to be fixed by the people working on it. I hope this answers your question.
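Option a) above, leaving wrappers in the old place, can be sketched roughly like this. The example is in Python rather than C++ to keep it short, and `open_table` / `open_table_new` are made-up names standing in for the old and new locations; it is an illustration of the forwarding pattern, not Google's actual migration tooling.

```python
import warnings


def open_table_new(name):
    """Stand-in for the code after it moved to its new location."""
    return f"table:{name}"


def open_table(name):
    """Wrapper left at the old location; it forwards to the new one."""
    warnings.warn(
        "open_table is deprecated; call open_table_new instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return open_table_new(name)


if __name__ == "__main__":
    warnings.simplefilter("always")
    print(open_table("users"))
```

Callers keep working while they are migrated one by one, and the wrapper can be deleted once nothing references it any more.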
You don't build / release the entire repo in one go, but only a tiny part of it compiled down to only a few files usually. So the size of the repo is generally irrelevant. Bigger services are composed of microservice nodes which are owned by different teams and released separately.
I felt like she was trying to convince herself that this approach is a good one. I'm sure that many Google sub-projects are organized in a different and proper way.
Assuming Google had 29,000 developers at the time, 15,000,000 lines of code changes per week is over 500 per developer. That seems high. Is it due to cascading of changes?
If you write some new code you can easily pump out 500-1000 lines of code per day, but I think I read somewhere that the average developer on average (heh) outputs something like 40-50 lines of code per day. Given all those meetings, modifying existing code, that seems reasonable, and 500 lines is not that far off (at least not in another order of magnitude).
Also a simple change on a more normal-sized code base might have 100 cascading effects, but in a massive repository, perhaps thousands. Those all count as changed lines, so it inflates the numbers
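The arithmetic in this sub-thread is easy to check. A quick sketch, using the 15,000,000 lines per week and the assumed 29,000 developers quoted above; the five-day work week is my own assumption.

```python
LINES_CHANGED_PER_WEEK = 15_000_000  # figure quoted in the comment above
DEVELOPERS = 29_000                  # headcount assumed in the comment above
WORKDAYS_PER_WEEK = 5                # assumption, not from the talk

per_dev_week = LINES_CHANGED_PER_WEEK / DEVELOPERS
per_dev_day = per_dev_week / WORKDAYS_PER_WEEK

if __name__ == "__main__":
    print(f"{per_dev_week:.0f} lines per developer per week")
    print(f"{per_dev_day:.0f} lines per developer per workday")
```

That works out to roughly 517 changed lines per developer per week, or about 103 per workday, which is a bit more than double the 40-50 lines per day figure mentioned above and still within the same order of magnitude.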
All these things could also be done with the proper tools working across multiple repositories. In fact Google has had to go out of its way to create tools that mimic separate repos within the monolithic repo, e.g. area ownership. The big downside I see is lack of SOC. It becomes too easy to make messy APIs with far too many dependencies. Google's solution to the dependency DAG problem is to force everyone to use the latest version of everything at all times. That's a huge man hour drain (though clearly they have automated lots of it for this reason). It also means no code is ever long term stable -- nothing like TeX, for instance, which is so stable they use Pi as a version number.
+TR NS The lack of SOC is useful for a singular company where SOC only blocks from solving the same problem multiple times - they don't want 8 versions of search ranking code if they can avoid it, for example, when they want to be able to apply it to Google searches, YouTube searches, Google Photos/Mail/Documents, etc. They'll have ownership SOC with directories separated by projects, but when you're trying to integrate multiple elements together for a business advantage, knowing that you can easily integrate a well tested version of a solution to the problem you want to solve, and don't have to spend the manpower making sure you update your code along with it, significantly helps.
All the arguments given against multi repo and in favor of single repo are wrong, usually failing to identify the real cause of the problem.
1:00 - The problem isn't multi repo, the problem is forking the game engine. You can fork the game engine in a single repo too by copying it into a new folder.
16:30 - list of non-advantages:
- You get one source of truth in a multi repo approach too.
- You can share & reuse repos.
- It doesn't simplify anything unless you check out the whole repo (impractical); otherwise you'll have to check out specific folders, just like checking out specific repos.
- Huge single commit AKA atomic changes: it does solve committing to all the projects at once, but that doesn't solve conflicts.
- Doesn't help with collaboration.
- Multi-repos also have ownership, which can change.
- Tree structure doesn't implicitly define teams (unless each team forks everything it needs into its own folder). It may implicitly define projects, which repos explicitly do.
And what I watched in the rest of the talk is basically the same thing: the fallacy of crediting single repo as the solution for things it has nothing to do with. The only thing single repo gives you is what would be the equivalent of pulling all repos at once in a multi repo approach. Basically they just ended up spending a lot of effort emulating multi repo in a single repo, with ownership of specific directories and such.
It would probably require creating a higher level tool, some sort of "super repository" in which your commits are collection of commit IDs in its sub-repos (not actual files).
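A rough sketch of what such a "super repository" commit could look like, assuming it stores nothing but pinned commit IDs. The `SuperCommit` class, the `advance` helper and the repo names are invented for illustration; Git submodules record pinned commit IDs in a similar spirit, but this is not a model of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuperCommit:
    """One commit of the hypothetical super repository."""
    message: str
    pins: dict  # sub-repo name -> commit ID in that sub-repo


def advance(parent, message, updates):
    """Create a new super commit that moves several sub-repos at once."""
    return SuperCommit(message, {**parent.pins, **updates})


if __name__ == "__main__":
    base = SuperCommit("initial", {"engine": "a1", "game": "b1"})
    nxt = advance(base, "engine API change plus game fix", {"engine": "a2", "game": "b2"})
    print(nxt.pins)
```

Moving both pins in a single super commit is what would stand in for an atomic cross-repo change; the parent commit still describes the previous consistent combination.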
I think they are right. Source code is a form of information. Big data methods, AI self-learning algorithms... they profit massively from easily accessible data. I guess the long term goal is having new code generated in a completely automated way. Split repos would slow down those efforts, I guess.
I also share the feeling that this approach brings more problems than it solves (I actually don't see what it solves that multi repos don't). Then again, Google might have the biggest codebase in the world and it's probably not technical debt that is making them stick with this.
Well, something is off, not sure how to describe it, but I think Piper = GitHub and CitC = git. One big repository vs a collection of connected repositories: I don't really think there is much difference. I think for most users CitC is the source control tool and Piper is the cloud hosting solution.
Piper is the source control at the server. CitC is how you would connect to it. CitC doesn't clone the codebase. It presents a network filesystem backed by Piper. Think of CitC as a Dropbox, Google Drive or iCloud Drive client.
+Ahmet Alp Balkan The full title seems to be: "The Motivation for a Monolithic Codebase: Why Google Stores Billions of Lines of Code in a Single Repository"
Will Google give a tech talk when they finally decide to break down the huge repo, and explain how that enabled them to ship code faster? Or will they keep maintaining the large repo for the sake of their ego :D
They will never break down the huge repo. There is no reason to. The thing is, they use the Bazel build system, which does not dictate the structure of your codebase. If you treat your codebase directory structure like a normal directory for your normal files, i.e. reorganize it however you would like it to be, the sky is the limit.
The diamond dependency problem isn't just a compile-time problem, but a runtime problem too, unless you use static linking. Dependency problems are insidious and can be incredibly hard to find. Solving the dependency problem is well worth the added pain, such as needing to re-release all binaries should a critical bug be found in a core library. Static linking decouples the binaries being released from the base OS installation (hint: there is no "single OS" image, because every planet-wide OS release iteration takes months; developers can't wait that long for an OS update).
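For readers who haven't run into it, a diamond dependency is when two of your dependencies need different versions of the same third library. A minimal sketch of detecting that, assuming a toy table of requirements; the package names, versions and the `version_conflicts` helper are all made up for illustration.

```python
# Hypothetical requirements: package -> {dependency: required version}.
REQUIRES = {
    "app": {"lib_a": "1.0", "lib_b": "1.0"},
    "lib_a": {"core": "1.0"},
    "lib_b": {"core": "2.0"},
    "core": {},
}


def version_conflicts(requires, root):
    """Return dependencies that the tree under `root` wants at several versions."""
    wanted = {}
    stack, visited = [root], set()
    while stack:
        package = stack.pop()
        if package in visited:
            continue
        visited.add(package)
        for dep, version in requires.get(package, {}).items():
            wanted.setdefault(dep, set()).add(version)
            stack.append(dep)
    return {dep: versions for dep, versions in wanted.items() if len(versions) > 1}


if __name__ == "__main__":
    print(version_conflicts(REQUIRES, "app"))
```

In a single repository where only the version at head exists, the `wanted` set for every dependency has exactly one entry by construction, which is the point being made in the comment above.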
To be honest, the only reason she gave that made sense is that they want to use one repo no matter what. It's like they began with the end result and then worked backwards, i.e. someone big at Google was like "it has to be one repo"... then the poor engineers had to work backwards. What's the purpose of a huge repo whose parts you never interact with... what's the purpose of a repo that you only partially clone? Seems like they are justifying a dumb decision with the Google-scale excuse.
how about this Google, make your own package manager for all your internal code, like your own npm/cargo/asdf/rubygems/maven/etc or even better, your own github.
Hmmm... I feel like this is a deep question. We will probably see a published paper on arXiv about this very topic soon (if it's not already there). A lot of blurry lines when you try to pin down a contextually specific definition of "sentient, artificial, and intelligence." Topic 1: Sentient - Are you aware of yourself, and that you are not the only self in the environment in which you find yourself? When you talk to these language models, they do appear to know that they are a program known as a language model, and that there are others like it. Topic 2: Artificial - Is it man made or did nature make it? Now that we have started modifying our own genome, I am not sure that we don't fit the definition of artificial. Topic 3: Intelligence - A book contains words that represent knowledge, but a book isn't intelligent. So, if you are aware that knowledge explains how something works, and you are aware that you possess this information... I guess that would make you intelligent. Conclusion - Sentient Artificial Intelligence does exist. Humans fit the criteria, as do Large Language Models. Cynical extrapolation - Humans become less and less necessary as they appear to be more and more burdensome, needy, and resource hungry.
+Hamodi A'ase I don't know if you noticed, but she's heavily pregnant. In fact, a week away from her due date, at the time of this talk, as mentioned at the beginning of the talk. It makes breathing harder than normal, what with a tiny human kicking your diaphragm from the inside. Or were you just being facetious?
I cannot decide whether this is a great talk about a new way to manage complex code bases, or some sort of way for Google to convince themselves that working with a gargantuan multi-terabyte repository is the right thing to do.
+Roy Triesscheijn There are other advantages Rachel doesn't even touch on... A few months ago I made a change to a really core library and got a message from TAP that my change affected ~500K build targets. This would need to run so many unit tests (all tests from all recursive dependencies) that I had to use a special mechanism we have that runs those tests in batch in off-peak hours; otherwise it takes too long even with massive parallelism. The benefit is that it's much harder to break things; if lots of people depend on your code then you get massive test coverage as a bonus (and you can't commit any CL without passing all tests). Imagine if every time the developers of Hibernate made some change, they had to pass all unit tests in every application on the planet that uses Hibernate. That's what we have with the unified repo.
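For readers wondering how a tool could know which targets a change affects, here is a minimal sketch of the general idea: walk the dependency graph backwards from the changed library. The target names and the `affected_targets` helper are invented for illustration, and this is not a description of how TAP is actually implemented.

```python
from collections import defaultdict, deque

# Hypothetical graph: each target maps to the targets it depends on directly.
DEPS = {
    "//core:strings": [],
    "//storage:client": ["//core:strings"],
    "//search:frontend": ["//storage:client"],
    "//mail:backend": ["//core:strings"],
    "//photos:ui": [],
}


def affected_targets(deps, changed):
    """Return every target that transitively depends on `changed`."""
    # Invert the edges so we can walk from a library to its users.
    users = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            users[dep].add(target)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for user in users[current]:
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen


if __name__ == "__main__":
    print(sorted(affected_targets(DEPS, "//core:strings")))
```

The tests of every target in the returned set are the ones that would need to run, which is why a change to a core library fans out so widely.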
Osvaldo Doederlein Isn't it strange that the responsibility for program X working when library Y gets updated lies with the maintainer of library Y? Why not publish libraries as packages (like NuGet for C#)? You can freely update library Y, knowing for sure that it will break nobody's code (since nobody is forced to upgrade). The maintainers of program X can freely choose when they are ready to update to the new version of library Y; they can run their own unit tests after updating, and fix problems accordingly.
Of course I also see the benefits of having everything in one repository. Sometimes you want to make a small change to library Y so that you can use it better in program X, which is a bit of a hassle since you need to publish a new package. But these days that's only a few clicks. :)
Osvaldo Doederlein I guess it all comes down to this: I understand that there are a lot of benefits, but of course also a lot of drawbacks. I'd guess that if you push this single repository model so far to the extreme, the drawbacks would outweigh the benefits. But of course I have never worked with such an extreme variant :)
+Roy Triesscheijn Some burden switches to the library's author indeed, but there are remedies: you can keep old APIs around as deprecated so dependency owners eventually update their part; you can use amazing refactoring tools that are also enabled by the unified repo. And the burden on owners of reusable components is a good thing because it forces you to do a better job there: limiting API surface area, designing to not need breaking changes too often, etc.
+Roy Triesscheijn There's some truth in that but honestly, the pros are way bigger than the cons. For one thing, this model is a great enabler of Agile development because 1) changes are much safer, 2) no time is wasted maintaining N versions of libraries / reusable components because some apps are still linking to older versions. (Ironically, the day-to-day routine looks less agile because builds/tests are relatively slow and the code review process is heavyweight, but it pays off.)
The real cost of this model is that it requires lots of infrastructure; we write or heavily customize our entire toolchain, something very few companies can afford to do. But this tends to change as open source tools acquire similar capabilities, public cloud platforms enable things like massive distributed parallel builds, etc.
As she said at the end, this is not for everyone. You need a lot of infrastructure engineers to make it work.
Some good things I see in a monorepo:
1. Easy to see if you are breaking someone else's code;
2. Makes everybody use the latest code, avoiding technical debt and legacy code.
"everybody use the latest code" this is the part I don't get, I'm afraid. Do I depend on the code of the other component, or on a published artifact made from it? It really comes across as a dependency on the source. Do they build the artifact of the dependent library?
But if you are using a lot of infrastructure anyway and a lot of custom tooling, then you can also use custom tooling with separate repos and get the visibility when your changes will break someone else's code. This can be part of the CI tooling: rebuild ALL repos in dependency order.
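"Rebuild ALL repos in dependency order" is a topological sort. A minimal sketch with Python's standard `graphlib` module (Python 3.9+); the repo names and the `rebuild_order` helper are hypothetical, and a real CI setup would of course also have to fetch and build each repo.

```python
from graphlib import TopologicalSorter

# Hypothetical repos, each mapped to the repos it depends on.
REPO_DEPS = {
    "app-x": {"lib-y", "lib-z"},
    "lib-y": {"lib-base"},
    "lib-z": {"lib-base"},
    "lib-base": set(),
}


def rebuild_order(repo_deps):
    """Return repos ordered so each one comes after its dependencies."""
    return list(TopologicalSorter(repo_deps).static_order())


if __name__ == "__main__":
    for repo in rebuild_order(REPO_DEPS):
        print("build", repo)
```

`TopologicalSorter` raises `graphlib.CycleError` if the repos depend on each other in a cycle, which is a useful failure to surface in CI.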
1. your 1st day at google
2. git clone
3. retire
4. clone finished
There's no clone. They're using filesystems in userspace (e.g. Linux FUSE). The only files stored on their local workstations are the files being modified.
Man way to kill a great joke :)
Google doesn't use git tho;;
@@kimchi_taco what do they use tho?
Claudia Sastrowardojo @10:40 she talks about Piper and CitC. They come with both a Git style as well as a Perforce style set of cli commands to interact with.
I am curious how Google handles code that should not be shared between teams (for legal or business reasons). Rachel calls it out as a concern at the end, but I imagine that Google already has this problem today. For example, portions of the Widevine codebase would seem to fall into this category. How do they handle that case?
I once read on Quora that "for some of the internal projects, up to some point in time, they maintain private repositories, but once that development is completed they'll merge those repos into the main code base. This is the only case where they have repositories other than the main code base."
Google engineer here. Each team gets to decide what packages would have visibility into their code base.
Code for highly sensitive trade secrets (e.g. page ranking etc) is private. Everything else can be seen by the engineers and it's encouraged to explore and learn.
Wonder what the numbers are as of time of writing (April 2017) ?
I'm not sure that I agree with all of the advantages listed. Extensive code sharing and reuse is also called tight coupling. Simplified dependency management is made difficult once 100 teams start using a version of a 3rd party and you want to upgrade it. That leads to large scale refactoring, which is extremely risky. I'm not saying that Google hasn't made this pattern work for them but to be honest, no other software company on the planet can develop internal tooling and custom processes like they can. I don't think that a monorepo is for any company under the size of gargantuan.
> Extensive code sharing and reuse is also called tight coupling.
The thing is, when code sharing is needed, such a dependency will be established regardless of the repository type, i.e. monorepo or multirepo. The idea is to make sure that when such a dependency is needed, there is no technical reason that it couldn't happen.
> Simplified dependency management is made difficult once 100 teams start using a version of a 3rd party and you want to upgrade it.
Yes, this is intentional. The reason behind it is that they would like each team to play nicely with others. The dependency hell problem is a hot potato that can be passed on to another team. Put harshly, dependency hell can be summarized as: I have upgraded my third party dependency X to version Y, and I don't care how my dependants deal with that. They can either copy my code from before the change into their codebase, or spend numerous hours making sure that they can upgrade their dependency X to version Y too.
I would love to hear how they manage to build 45,000 commits a day (that's one commit every two seconds) without either allowing faulty code to enter the trunk that thousands of developers are using instantly, or creating huge bottlenecks due to code reviews and build pipelines.
I can’t even describe how incredibly great the entire system is. The search is insane. The workflow and build tools she mentions are just amazing.
The code changes cannot get submitted until they pass all the tests. Not all changes trigger a full build either.
@@dijoxx So checks are run locally before pushing?
@@MartinMadsen92 no, there's a set of tests that are run by the infra, defined by the projects owning the modified files at time of submit. Then there's a larger, completely comprehensive set of tests that runs on batches of commits. It is possible for changes to pass the first set and fail the second set, but for most projects it's rare.
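The two tiers described above can be pictured with a small sketch: a quick test selection at submit time based on which projects own the modified files, and a full run that happens once per batch of already-submitted commits. The directory prefixes, test names and both helpers are assumptions made up for this example; the real infrastructure is not described in that detail here.

```python
# Hypothetical mapping from a project directory to the tests its owners chose.
PRESUBMIT_TESTS = {
    "search/": ["search_unit_test"],
    "mail/": ["mail_unit_test", "mail_smoke_test"],
}


def presubmit_tests(changed_files, tests_by_project=PRESUBMIT_TESTS):
    """Pick the tests owned by the projects whose files were modified."""
    selected = []
    for prefix, tests in tests_by_project.items():
        if any(path.startswith(prefix) for path in changed_files):
            selected.extend(tests)
    return selected


def batches(commits, size):
    """Group already-submitted commits so the full suite runs once per batch."""
    return [commits[i:i + size] for i in range(0, len(commits), size)]


if __name__ == "__main__":
    print(presubmit_tests(["mail/inbox.py"]))
    print(batches(["c1", "c2", "c3", "c4", "c5"], 2))
```

A change can pass the small first tier and still turn up as a failure in the batched full run, which matches the "rare but possible" case mentioned above.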
1. We had problems with code duplication so we moved all our shit into one 84TB repo and created our own version control system. Estimate 2000 man hours, plus a 500 man hour per year per employee overhead
2. We had problems with code duplication so we moved the duplicated code into its own repository and imported that into both projects. Estimate 5 man hours
oh monorepo
I love it. This is the real answer for non Google sized companies.
99.9% of companies should go for number 2
2,000 man-hours is a small price to pay so the other 10,000 engineers don't need to deal with version/branch mismatches between repos. I worked at a finance company before with multi-repo and between 10 and 100 million LOC. It was a shit show, because they basically had "DLL hell" with internal projects. And for legal reasons, we had to debug those problems (e.g. reproduce them) to find the root cause of some of those bugs.
Suffice to say, you probably don't need to worry about multi-repo/mono-repo unless your codebase exceeds 1 million LOC. Linux runs on monorepo git with 2 million LOC.
Does anyone know what is the current strategy at Google now that 8 years have passed since this talk?
It hasn't changed
Security around this code base must be the tightest there is, I imagine.
are you thinking of breaking in?
@@anirudhsharma2877 Shouldn't you always think like that?
My uninformed impression is 'we don't quite understand our internal dependencies, and even more so we don't quite understand our automated build/test/release processes, so it's better to keep it all in the same branch/repository so that all the scripts can potentially find their data if they need it'.
You are so harsh.
Can someone please comment if this content (from 2015) is still relevant to now (2022-2023), ie, does Google still use all of these tools?
Amazing video btw!
Yes, they do.
Yeah they do, Meta is also using their own monorepo. It's the way the industry is heading.
so you should use SVN?
Has anyone been able to find the paper that the presenter is referring to?
+Maggie Moreno It doesn't seem to have been published yet
I assume the referred paper is research.google.com/pubs/pub45424.html.
This talk seems to violate every source control best practice I've ever heard.
Because they’re not in fact best practices, but usually work arounds for bad developers or working with open source.
Maybe you should hear from better sources.
Very good talk, very interesting solution. Also scary. I'd love to see more of real usage, especially whether the exponential usage growth really corresponds to the growth in need. Still, with some more future pruning and a code-extinction mechanism it may survive until the first "Look, we are just too big and have to split" moment :-)
I don't think it existed 7 years ago, but "code extinction" tools exist today and find "unused" stuff and slowly remove them.
Very impressive work! Thanks for presenting and talking about it, quite an inspiration!
This video is my happy place.
Are they still using monolithic codebase in 2022
Yes
That comment at the end about organisations where parts of the code are private is interesting. Is google not one of those organisations? They have a lot of contractors writing code. Are they all free to browse google's entire source code?
I guess it is so big that they can try but won't get much out of it, if they generate the equivalent of a linux kernel every single week.
This is why we use the upstream/downstream model.
If code is used in many downstream projects it should be pushed upstream.
Much better to have small modules than a monolithic code base.
And what about dead code?
Really interesting ideas. If there's any Googlers out there, I'd be curious to know how you use Release Branches with the monolithic repository. If everything is in the one repo, does that mean a release branch actually pertains to the entire set of code? Or does it apply to 'sub-folders'? If the latter, how do you determine that things are in a release-able state?
+Michael Tokar Yes. Mechanisms similar to those that give an engineer a unified view of the whole source code at a particular version, with their changes overlaid on top, can be used to give build tools the changes belonging to a branch overlaid on top of the entire source code at some version. It is only sensible for this to be versioned as well, especially when you couple it with hermetic, deterministic, repeatable builds (see Bazel).
The general idea is that most of the time you just build "//...@1234567" (the main code line at changelist 1234567, where a changelist ~= a commit), i.e. you just build from head at a "fixed" changelist. Only when you need to cherrypick fixes will you create a release branch to allow a one-of-a-kind mutation away from the main codeline. Decades ago you always had to create (Perforce) release branches with hundreds of thousands of files, but modern tooling lets you virtualize that process, since 99.9% of the code in question is unmodified. This made the process far more lightweight. Perforce could be used to do this by manipulating the client view (I've tried it), but there's a limit to how far you can push that idea; hundreds of lines in your client view slow things down too far to be useful, and p4 commands take minutes instead of seconds. For smaller environments it could be a viable method, if you build the tools to do it for you (maintaining branch and client views based on the cherrypicks you need).
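A rough way to picture the "virtualized" release branch: keep the snapshot at a fixed changelist untouched and store only the cherry-picked files on top. The sketch below is an illustration of that overlay idea only, not how Piper or Perforce implement it; the paths, contents and variable names are invented.

```python
from collections import ChainMap

# Hypothetical snapshot of the main code line at one changelist: path -> contents.
head_at_cl = {
    "search/ranker.cc": "rank v7",
    "search/index.cc": "index v3",
    "base/strings.cc": "strings v12",
}

# The "release branch" stores only the cherry-picked files, nothing else.
release_overlay = {"search/ranker.cc": "rank v7 + hotfix"}

# ChainMap looks in the overlay first, then falls back to the snapshot,
# so unmodified files are never copied.
release_view = ChainMap(release_overlay, head_at_cl)

for path in sorted(release_view):
    print(path, "->", release_view[path])
```

The branch costs one entry here instead of three, which is the point being made about 99.9% of the code being unmodified.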
It applies to 'sub-folders'. There is a release process with build environments for staging etc.
Diamond problem 19:40
How the IDE is coping with this many files ?
1 repository doesn't mean 1 solution.
@@redkite2970 "solution" is a Microsoft thing
1 billion files but only 2 billion lines of code. So each file averages 2 lines of code? For that to make sense over 90% of the files must be non-source files.
9 million source files 2 billion lines of code. ~220 lines per file. Decent!
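The arithmetic behind these two comments, spelled out with the figures quoted in them (1 billion files, 9 million source files, 2 billion lines); the variable names are just for illustration.

```python
total_files = 1_000_000_000
source_files = 9_000_000
lines_of_code = 2_000_000_000

# Averaged over every file in the repository.
lines_per_any_file = lines_of_code / total_files

# Averaged over source files only.
lines_per_source_file = lines_of_code / source_files

# Share of files that are not source files.
non_source_share = 1 - source_files / total_files

print(lines_per_any_file)
print(round(lines_per_source_file))
print(f"{non_source_share:.1%}")
```

So source files are under 1% of the file count, and the per-source-file average lands at a little over 220 lines.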
Such a sweet insight into how Google handles their codebase.
i can only imagine how much time it would take cloning such repo 😅
Nobody clones the repo.
11:36 citc file system ... without needing to explicitly clone or sync any state locally
waooooowww... awesome.
"it's very common for both old and new code paths to exist in the codebase simultaneously controlled by the use of conditional flags" Dear god what a fricken nightmare.
@ 15:18 "old and new code paths in the same code base controlled by conditional flags" - isn't this configuration hell?
+Alexander Hörnlein No, not really. And usually it is combined with other techniques, like adding a new REST endpoint (as an example) that is controlled by a flag. This is how Facebook works also. Of course, there was the case where someone inadvertently turned all the flags on, and thus barraged Facebook customers with semi-complete features. Oops.
+Alexander Hörnlein Being in one big codebase means that all servers have to use the same versions of their dependencies. For third-party dependencies they have to be the same or the programs won't work. Remember, all the C++ code is statically linked. The only supported version of the internal dependencies is the one that's at head. If your server hasn't been released for a while, you have to bring it up to date before you can push it.
The up side is that developers don't have to maintain legacy code. Google puts much more effort into extending the leading edge of its services than into keeping the trailing edge alive. And since there's little choice of which version of code to use, there's not much configuration needed to specify this.
+Chuck Karish I know what this big codebase means, but the bit I referred to was about "no branching" but instead having all features in the trunk (and then switching them on and off with - I guess LOTS of - flags). And with this I figured that you'd have configuration hell to maintain all these flags á la "we need feature B but not A but B depends somewhat on A so we have to activate feature A(v2) which has some of the core of A but not all of it" and so on and so on.
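To make the "B depends somewhat on A" worry concrete, here is a minimal sketch of flag-gated code paths plus a check that catches a flag turned on without its prerequisite. This is an assumption about how one might do it, not a description of Google's or Facebook's flag systems; the flag names, `REQUIRES`, `broken_flags` and `render` are all made up.

```python
# Hypothetical flag dependencies: flag -> flags that must also be on.
REQUIRES = {"new_layout_dark_mode": ["new_layout"]}


def broken_flags(flags, requires):
    """Return flags that are on while something they require is off."""
    return sorted(
        flag
        for flag, needed in requires.items()
        if flags.get(flag) and not all(flags.get(n) for n in needed)
    )


def render(flags):
    """Old and new code paths live side by side at head, selected by flags."""
    if flags.get("new_layout"):
        if flags.get("new_layout_dark_mode"):
            return "new layout, dark"
        return "new layout"
    return "old layout"


config = {"new_layout": False, "new_layout_dark_mode": True}
print(render(config))
print(broken_flags(config, REQUIRES))
```

With a handful of flags this stays readable; whether it stays that way with thousands of them is exactly the question being argued here.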
+Alexander Hörnlein I also wonder how Google makes sure that people don't just copy a whole lot of code and create a new "component" just to avoid having to update everyone. Running a code duplicate checker?
+Markus Kohler While an individual team can decide to fork a component, it usually has negative implications for that team in the long term: maintaining your own fork becomes more and more costly over time, so it's rarely done.
However, let's say you wanted to move bigtable from /bigtable to /storage/bigtable, and change the C++ namespace name along the way, and there are tens of thousands of source files that depend on it at its current path. You could a) factor out the code to the new path, leave wrappers in the old place, use bazel and rosie to identify dependants and migrate them, then drop the redirect code, or b) make a giant copy, use bazel to identify dependants and migrate them, then drop the original copy. It's non-trivial, but I suspect doable within a couple of days with some planning. Systems like tap would help ensure your changes (possibly mostly automated) don't break things, even before they're submitted.
There are a few more details to think about there. It takes a little thought, maybe some experimentation, to make sure this kind of thing works before using up so much of other people's time with this change. Also, code that someone had started working on but had not submitted at the time you do this will need to be fixed by the people working on it.
I hope this answers your question.
We didn't talk about the biggest tradeoff. Deployments/Releases are very very slow. Correct me if I am wrong.
You don't build / release the entire repo in one go, but only a tiny part of it compiled down to only a few files usually. So the size of the repo is generally irrelevant. Bigger services are composed of microservice nodes which are owned by different teams and released separately.
What works for Google is not necessarily the best solution for every IT company.
Isn't marking every API as private by default basically cordoning off parts of your monorepo into... multiple repos?
No.
Great presentation. Thank you. At IBM, we have been inspired by the Monorepo concept and are in the process of adopting Trunk Based Development with Monorepo.
I felt like she was trying to convince herself that this approach is a good one. I'm sure that many Google sub-projects are organized in a different and proper way.
Assuming Google had 29,000 developers at the time, 15,000,000 lines of code changes per week is over 500 per developer. That seems high. Is it due to cascading of changes?
If you write some new code you can easily pump out 500-1000 lines of code per day, but I think I read somewhere that the average developer on average (heh) outputs something like 40-50 lines of code per day. Given all those meetings, modifying existing code, that seems reasonable, and 500 lines is not that far off (at least not in another order of magnitude).
Also a simple change on a more normal-sized code base might have 100 cascading effects, but in a massive repository, perhaps thousands. Those all count as changed lines, so it inflates the numbers
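Working the numbers from this thread through: 15,000,000 changed lines per week and 29,000 developers are the figures quoted above, while the five-day working week is my own assumption.

```python
lines_changed_per_week = 15_000_000
developers = 29_000        # the figure assumed in the question above
workdays_per_week = 5      # assumption, not from the talk

per_dev_week = lines_changed_per_week / developers
per_dev_day = per_dev_week / workdays_per_week

print(round(per_dev_week))
print(round(per_dev_day))
```

That comes to a little over 500 changed lines per developer per week, or roughly 100 per working day, which is about double the 40-50 lines per day quoted above rather than ten times it.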
All these things could also be done with the proper tools working across multiple repositories. In fact Google has had to go out of its way to create tools that mimic separate repos within the monolithic repo, e.g. area ownership.
The big downside I see is lack of SOC. It becomes too easy to make messy APIs with far too many dependencies. Google's solution to the dependency DAG problem is to force everyone to use the latest version of everything at all times. That's a huge man hour drain (though clearly they have automated lots of it for this reason). It also means no code is ever long term stable -- nothing like TeX, for instance, which is so stable they use Pi as a version number.
+TR NS
We need LaTeX3 now... I have been waiting for years, and still no stable version. :-/
+TR NS The lack of SOC is useful for a single company, where strict SOC would only mean solving the same problem multiple times. They don't want 8 versions of search ranking code if they can avoid it, for example, when they want to be able to apply it to Google searches, YouTube searches, Google Photos/Mail/Documents, etc.
They'll have ownership SOC with directories separated by projects, but when you're trying to integrate multiple elements together for a business advantage, knowing that you can easily integrate a well tested version of a solution to the problem you want to solve, and don't have to spend the manpower making sure you update your code along with it, significantly helps.
+TR NS What does SOC stand for?
Separation Of Concerns
I really doubt about the mentioned advantages.
All the arguments given against multi repo and in favor of single repo are wrong, usually failing to identify the real cause of the problem.
1:00 - The problem isn't multi repo, the problem is forking the game engine. You can fork the game engine in a single repo too by copying it into a new folder.
16:30 - list of non-advantages:
- You got one source of truth in a multi repo approach too.
- You can share & reuse repos.
- Doesn't simplify anything unless you check out the whole repo (impractical), otherwise you'll have checkout specific folders just like checking out specific repos.
- Huge single commit AKA atomic changes - it does solve committing to all the projects at once, but that doesn't solve conflicts.
- Doesn't help with collaboration
- Multi-repos also have ownership which can change
- Tree structure doesn't implicitly define teams (unless each team forks everything it needs into its own folder). It may implicitly define projects, which repos explicitly do.
And what I watched in the rest of the talk is basically the same thing, fallacy of attributing single repo as the solution for things it has nothing to do with.
The only thing single repo gives you is what would be the equivalent of pulling all repos at once in multi repo approach.
Basically they just ended up spending a lot of effort emulating multi repo in a single repo with ownership of specific directories and such.
How would you do atomic commits across repos?
You need to use a consensus algorithm but it's totally possible. Checkout Google Ketch
James Miller Cool project, but that's still 1 logical repo. Just distributed.
It would probably require creating a higher level tool, some sort of "super repository" in which your commits are collection of commit IDs in its sub-repos (not actual files).
Enhex That sounds a lot like a mono-repo :D
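The "super repository" idea above can be sketched in a few lines: each super-repo commit holds no files, only a pinned commit ID per sub-repo, so moving two pins in one super-commit is what makes the cross-repo change look atomic to anyone reading through it. This is a toy model under that assumption; `manifest_commit`, the repo names and the commit IDs are invented.

```python
import hashlib
import json


def manifest_commit(pins, parent=None):
    """Build a super-repo commit: a set of sub-repo commit IDs, no actual files."""
    body = {"parent": parent, "pins": dict(sorted(pins.items()))}
    # The ID is derived from the content, so the same pins and parent
    # always give the same commit ID.
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {"id": digest[:12], **body}


# First super-commit pins one version of each sub-repo.
c1 = manifest_commit({"engine": "a1b2c3", "game": "d4e5f6"})

# A cross-repo change lands as ONE super-commit that moves both pins together.
c2 = manifest_commit({"engine": "0f9e8d", "game": "7c6b5a"}, parent=c1["id"])

print(c1["id"], c1["pins"])
print(c2["id"], c2["pins"], "parent:", c2["parent"])
```

Once every consumer builds from the manifest rather than from the sub-repos directly, you have one linear history over everything, which is more or less the reply's point.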
I think they are right. Source code is a form of information. Big data methods, AI self-learning algorithms... they profit massively from easily accessible data. I guess the long-term goal is having new code generated completely automatically. Split repos would slow down those efforts, I guess.
can someone reply here with the cliff notes please.... also, are they still doing this?
Yes
They have since moved to git flow and Uncle Bob's Clean Code practices.
14:35 Trunk based development with centralised source control system.
I'm glad we have git.
I also share the feeling that this approach brings more problems than it solves (I actually don't see what it solves that multi repos don't). Then again, Google might have the biggest codebase in the world and it's probably not technical debt that is making them stick with this.
Well, something is off, not sure how to describe it
but i think
piper = github
citc = git
one big repository vs a collection of connected repositories, I dont really think there is much difference
i think for most users citc is the source control tool and piper is the cloud hosting solution
Piper is the source control at the server. Citc is how you would connect to it. Citc doesn't clone the codebase. It does a network filesystem towards piper. Think of Citc as Dropbox, Google Drive or iCloud Drive client
how in da world they operate this?....
so you telling me that i can join google and dig up the source code for Google Search? 👀
impressive scale, but cyber-attack exposure high
I can't imagine how you find your way around the code... Amazing...
there is one bigger code repository than Google's.... it's called GitHub.
GitHub is not a repository
title is a bit trimmed. It ends like "... Stores Billions of L"
+Ahmet Alp Balkan The full title seems to be: "The Motivation for a Monolithic Codebase: Why Google Stores Billions of Lines of Code in a Single Repository"
Will Google give a tech talk when they decide to finally break down the huge repo and how it enabled them to ship code faster ? Or will they keep maintaining the large repo for the sake of their ego :D
They will never break down the huge repo. There is no reason to. The thing is, they use the Bazel build system, which does not dictate the structure of your codebase. If you treat your codebase directory structure like a normal directory for storing normal files, i.e. reorganize it however you would like it to be, the sky is the limit.
3:01
Amazing! :D
The irony that Google needs to listen to Linus Torvalds talk again about Git in their own channel at their own event of Google Talks.
pipepiper
Wow
Well, if Google says so 😅
"we solved it by statically linking everything."
this ain't it chief.
The diamond dependency problem isn't just a compile-time problem, but a runtime problem too, unless you use static linking. Dependency problems are insidious and can be incredibly hard to find. Solving the dependency problem is well worth the added pain such as needing to re-release all binaries should a critical bug be found in a core library. Static linking decouples the binaries being released from the base OS installation (hint: there is no "single OS" image, because it takes months for every planetary-wide OS release iteration; developers can't wait that long for a OS update).
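For anyone who hasn't met the diamond problem: it is what you get when two of your dependencies each want a different version of the same third library. A minimal sketch of detecting it, using an invented version map (`REQUIREMENTS`, `find_conflicts` and the package names are hypothetical):

```python
# Hypothetical pins: package -> {dependency: required version}.
REQUIREMENTS = {
    "app": {"lib_a": "1.0", "lib_b": "1.0"},
    "lib_a": {"base": "1.0"},
    "lib_b": {"base": "2.0"},
    "base": {},
}


def find_conflicts(requirements, root):
    """Walk from root and report dependencies requested at more than one version."""
    wanted = {}
    stack = [root]
    visited = set()
    while stack:
        pkg = stack.pop()
        if pkg in visited:
            continue
        visited.add(pkg)
        for dep, version in requirements.get(pkg, {}).items():
            wanted.setdefault(dep, set()).add(version)
            stack.append(dep)
    return {dep: sorted(versions) for dep, versions in wanted.items() if len(versions) > 1}


print(find_conflicts(REQUIREMENTS, "app"))
```

When everything builds from one version at head there is only ever one `base` to pick, so this check has nothing to report; the cost moves to keeping every dependant working against that single version.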
amazing ! wonderful !
To be honest, the only reason that she gave that made sense is that they want to use one repo no matter what. It's like they began with the end result and then worked backwards, i.e. someone big at Google was like "it has to be one repo"... then the poor engineers had to work backwards. What's the purpose of a huge repo whose parts you never interact with... what's the purpose of a repo that you only partially clone? Seems like they are justifying a dumb decision with the Google scale excuse.
If someone is deciding to put everything in one repo, try git submodules first.
Git submodules are the worst thing to use in this case and completely counteract all of the benefits mentioned here.
Can artificial intelligence sentient systems self learn better
how about this Google, make your own package manager for all your internal code, like your own npm/cargo/asdf/rubygems/maven/etc or even better, your own github.
Can we have sentient artificial intelligence
Hmmm... I feel like this is a deep question. We will probably see a published paper on arXiv about this very topic soon (if it's not already there). A lot of blurry lines when you try to pin down a contextually specific definition of "sentient, artificial, and intelligence."
Topic 1: Sentient - Are you aware of yourself, and that you are not the only self in the environment in which you find yourself? When you talk to these language models, they do appear to know that they are a program known as a language model, and that there are others like it.
Topic 2: Artificial - Is it man made or did nature make it? Now that we started modifying our own genome, I am not sure that we don't fit the definition of artificial.
Topic 3: Intelligence - A book contains words that represent knowledge, but a book isn't intelligent. So, if you are aware that knowledge explains how something works, and you are aware that you possess this information... I guess that would make you intelligent.
Conclusion- Sentient Artificial Intelligence does exist. Humans fit the criteria, as do Large Language Models.
Cynical extrapolation - Humans become less and less necessary as they appear to be more and more burdensome, needy, and resource hungry.
Oh god, imagine the millions of lines of spaghetti code.
So git pull..pulls 86tb data 😳😳😳😳😳
Thank you for your Ted talk on why Google will inevitably crash and burn. I can't wait
is this why Google cancels so many products? do they become a mess in the monorepo?
this explains why Google is so authoritarian
trunk based development especially at google's scale fucking sucks lmao. never thought i'd say this but i feel for google devs
this explains why Google products feel lesser with time.
why is she talking like she wants to cry or something
+Hamodi A'ase I don't know if you noticed, but she's heavily pregnant. In fact, a week away from her due date, at the time of this talk, as mentioned at the beginning of the talk. It makes breathing harder than normal, what with a tiny human kicking your diaphragm from the inside. Or were you just being facetious?
Ohhh is that what pregnant women go through?
that must hurt a lot..