<aside> ✏️ 21.06.2020 by Darjan Salaj [ web | GoogleScholar | LinkedIn | GitHub ]
</aside>
The situation: A project has several binaries that need to be tracked and versioned along with the code. The binaries are small enough and rarely change, so versioning them directly in git is appropriate.
That was some months ago, and in the meantime, the project evolved, binaries grew and changed. Now you're left with a bloated repository of multiple gigabytes that takes a long time to clone. This is significantly slowing down the CI pipeline and it's time to fix it.
There are many options to solve this problem, each with unique pros and cons. The simplest is to deal only with the symptoms, such as the slow CI pipeline. That can be done with one of the following:

- `git clone --depth 1 URL`: limiting the depth of the clone means that the history is not pulled.
- `git clone --filter=blob:none URL`: the filter option is more powerful and can be used to fetch only the files that meet specific criteria. See docs.
- `git sparse-checkout`: see docs.

But sometimes these workarounds are not enough and you need to rewrite the repo history. Below I discuss two solutions that should cover most use cases. The first option is migrating to `git lfs`. The second is pruning the binaries from the repo history using `git filter-repo` and versioning them in a separate versioning system.
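To see the first workaround in action, here is a self-contained sketch: it builds a throwaway repository with two commits (all paths and names below are made up for illustration) and then clones only the newest commit.

```shell
# Throwaway demo repo with two commits
tmp=$(mktemp -d) && cd "$tmp"
git init -q origin-repo && cd origin-repo
git config user.email ci@example.com
git config user.name ci
echo v1 > file.txt && git add file.txt && git commit -qm "first"
echo v2 > file.txt && git commit -qam "second"
cd ..

# --depth 1 fetches only the newest commit; the file:// prefix forces
# the transport that supports shallow clones even for a local path
git clone -q --depth 1 "file://$tmp/origin-repo" shallow-clone
git -C shallow-clone rev-list --count HEAD   # prints 1, not 2
```

The clone contains the latest snapshot but none of the older history, which is usually all a CI job needs.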
## git lfs
The main advantage of migrating to `git lfs` is that the repo history is left intact and no information is lost. This method is recommended when the binaries are tightly coupled with the code. The migration takes only a few steps:
```shell
conda install -c conda-forge git-lfs
git lfs install
git lfs migrate import --include-ref=master --include="*.bz2"
```
For more details on how to match your use case, see this guide.
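After the migration, `git lfs` records which patterns it tracks in a `.gitattributes` file at the repo root. Assuming the `*.bz2` pattern from the example above, it would contain a line like:

```
*.bz2 filter=lfs diff=lfs merge=lfs -text
```

This file is committed along with the code, so every clone knows which files go through the LFS filter.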
The downside of using `git lfs` is that it adds complexity to the developers' workflow. Decisions need to be made: which files are versioned with plain git and which with `git lfs`? Do you decide based on file size or on file format/extension? The `.gitattributes` file needs to be maintained, and so on.
You get the gist: there is overhead, and nobody likes overhead. But if the binary files really are tightly coupled with the code and need to be versioned, this is the way to go.
When the binaries are not tightly coupled with the code, there is another solution: simply prune all the binary files from the repo history and version them in a separate, more appropriate system.
The pruning part is simple thanks to the excellent [git filter-repo](https://github.com/newren/git-filter-repo) tool. For example, pruning all binary files larger than 10 MB can be done by following the guide from GitLab. For more elaborate cases, refer to the official documentation and its examples.