Recently I was attending a company-wide weekly operations review. One of the teams presenting talked about how they had reduced their CI pipeline times by doing a shallow clone of their Git repository instead of a full clone. It is a simple change, and it can have a positive (negative time) impact on your pipeline wait times.
Let’s go through an example. Assume we need to clone the Go repository. Generally we’d clone it by running:
git clone https://github.com/golang/go.git
Let’s time this!
You can see that the entire operation took around 28 seconds.
Now let’s try the same with a shallow clone.
git clone --depth=1 https://github.com/golang/go.git
And with timing
That’s around 5 seconds. That’s a whopping 5.8x improvement in speed. This is huge indeed.
So how exactly does this work?
Shallow Clone (AKA --depth=N)
In simple terms, each commit in your Git repository is a node in a graph that Git maintains, and each commit points back to its parent commit(s). Each commit also points to a tree object that describes the file structure as of that commit. Every new commit gets its own tree: new or modified files are stored as new objects, while unmodified files are not copied and are instead shared by every commit that references them.
So you can imagine, as the commits build up, the tree too builds up. This is the price we pay for having version control. The ability to go back in history comes with added storage cost.
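To make the sharing of unmodified files concrete, here is a minimal sketch that builds a throwaway repository with two commits and compares the object IDs Git records for each file. The file names and the demo identity are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository in a temporary directory (demo data, not a real project).
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Commit with a fixed identity so the script does not depend on global git config.
commit() {
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}

echo "unchanged" > stable.txt
echo "v1" > changing.txt
git add .
commit "first"

# Only changing.txt is modified in the second commit.
echo "v2" > changing.txt
git add .
commit "second"

# Object IDs of each file as recorded by the two commits.
stable_old="$(git rev-parse HEAD~1:stable.txt)"
stable_new="$(git rev-parse HEAD:stable.txt)"
changing_old="$(git rev-parse HEAD~1:changing.txt)"
changing_new="$(git rev-parse HEAD:changing.txt)"

echo "stable.txt:   $stable_old -> $stable_new"
echo "changing.txt: $changing_old -> $changing_new"
```

The unmodified file keeps the same object ID in both commits, so it is stored once, while the modified file gets a new object. History grows with every change, which is exactly what a full clone has to download.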
This cost is negligible in local setups, as the repository is cloned once while history is accessed often. But in CI pipelines, we don’t need the history. We only care about the current HEAD. As such, it doesn’t make sense to clone the entire repository. This is the exact functionality provided by the --depth feature of git clone.
From the docs, we have:
Create a shallow clone with a history truncated to the specified number of commits. Implies --single-branch unless --no-single-branch is given to fetch the histories near the tips of all branches.
So when we give --depth=1, this means we only get the top 1 commit for that branch. Brilliant.
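If you want to see the truncation without touching the network, the following sketch builds a small local repository with three commits and clones it both ways. The paths and commit contents are made up for the demo. Note the file:// URL: Git ignores --depth when cloning from a plain local path.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
git init -q "$work/src"

# Three commits of demo history, made with a fixed identity.
for i in 1 2 3; do
  echo "revision $i" > "$work/src/file.txt"
  git -C "$work/src" add file.txt
  git -C "$work/src" -c user.name=demo -c user.email=demo@example.com commit -q -m "commit $i"
done

# Full clone versus shallow clone of the same source repository.
git clone -q "file://$work/src" "$work/full"
git clone -q --depth=1 "file://$work/src" "$work/shallow"

# Number of commits reachable from HEAD in each clone.
full_count="$(git -C "$work/full" rev-list --count HEAD)"
shallow_count="$(git -C "$work/shallow" rev-list --count HEAD)"

echo "full clone commits:    $full_count"
echo "shallow clone commits: $shallow_count"
```

The full clone carries all three commits, while the shallow clone only knows about the latest one.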
If you’re interested in the details, I would highly recommend going through this elaborate document about how Git works internally.
Getting back to the company call I was present in, while the presentation was going on, someone suggested in the chat that this might not be the fastest way to do this. They theorized that a combination of git init and git fetch might be a faster option.
So what they suggested is:
- Create an empty repository via git init
- Add the upstream URL via git remote
- Fetch only the required branch via git fetch --depth=1
- Checkout the commit via git checkout
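The steps above can be sketched as a small shell function. The name shallow_fetch and its argument order are my own for illustration, and checking out FETCH_HEAD (which leaves you on a detached HEAD) is one reasonable reading of the last step, not necessarily what was proposed on the call.

```shell
#!/usr/bin/env bash
set -euo pipefail

# shallow_fetch <url> <branch> <dir>
# Hypothetical helper: fetches only the tip commit of <branch> from <url> into <dir>.
shallow_fetch() {
  local url="$1" branch="$2" dir="$3"
  git init -q "$dir"                                 # 1. create an empty repository
  git -C "$dir" remote add origin "$url"             # 2. add the upstream URL
  git -C "$dir" fetch -q --depth=1 origin "$branch"  # 3. fetch only the tip of one branch
  git -C "$dir" checkout -q FETCH_HEAD               # 4. check out the fetched commit
}
```

For the Go repository this would be called as shallow_fetch https://github.com/golang/go.git master go, assuming master is the branch you want.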
Theoretically I don’t see how this can be faster (since we still have to fetch the objects in the tree for that particular commit). I did my fair bit to push back, saying that both are the same and there shouldn’t be any difference, but the argument carried on to the end of the call with neither party agreeing.
Giving the other person the benefit of the doubt, let’s see if it’s practically any faster.
Now let’s time it!
Alas! This too took around 5 seconds.
What’s interesting here is that the number of objects we’re compressing and sending over the wire is the same for both examples, that is 9194. We can also see that the total size transferred over the wire is the same, i.e. 22.54 MiB. The only difference is the time taken for the wire transfer, and this can obviously fluctuate with the load on your connection.
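You can reproduce this comparison offline on a toy repository. The sketch below is not the Go repository measurement from above. It builds a local three-commit repository (made-up contents), fetches it with both methods, and counts the objects reachable from HEAD in each copy.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
git init -q "$work/src"
for i in 1 2 3; do
  echo "revision $i" > "$work/src/file.txt"
  git -C "$work/src" add file.txt
  git -C "$work/src" -c user.name=demo -c user.email=demo@example.com commit -q -m "commit $i"
done
url="file://$work/src"
branch="$(git -C "$work/src" symbolic-ref --short HEAD)"

# Method 1: shallow clone.
git clone -q --depth=1 "$url" "$work/clone"

# Method 2: init + remote + shallow fetch + checkout.
git init -q "$work/fetch"
git -C "$work/fetch" remote add origin "$url"
git -C "$work/fetch" fetch -q --depth=1 origin "$branch"
git -C "$work/fetch" checkout -q FETCH_HEAD

# Objects (commits, trees, blobs) reachable from HEAD in each copy.
clone_objects="$(git -C "$work/clone" rev-list --objects HEAD | wc -l | tr -d ' ')"
fetch_objects="$(git -C "$work/fetch" rev-list --objects HEAD | wc -l | tr -d ' ')"
clone_head="$(git -C "$work/clone" rev-parse HEAD)"
fetch_head="$(git -C "$work/fetch" rev-parse HEAD)"

echo "clone: $clone_objects objects at $clone_head"
echo "fetch: $fetch_objects objects at $fetch_head"
```

Both copies end up at the same commit with the same set of objects, which matches what the transfer statistics showed.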
So, both methods are the same. Then why use multiple commands when git clone --depth=1 suffices?
1 for Karthik and 0 for the person on the chat. Hehe.