Speeding up git clone in a CI pipeline

Recently I was attending a company-wide weekly operations review, where one of the teams presenting talked about how they had reduced their CI pipeline times by doing a shallow clone of their git repository instead of a full clone. It’s a simple change, and it can have a positive (negative time) impact on your pipeline wait times.

Let’s go through an example. Assume we need to clone the Go repository. Generally we’d clone it by running:

git clone https://github.com/golang/go.git

Let’s time this!

❯ time git clone https://github.com/golang/go.git
Cloning into 'go'...
remote: Enumerating objects: 482349, done.
remote: Counting objects: 100% (1073/1073), done.
remote: Compressing objects: 100% (484/484), done.
remote: Total 482349 (delta 613), reused 964 (delta 583), pack-reused 481276
Receiving objects: 100% (482349/482349), 257.45 MiB | 11.07 MiB/s, done.
Resolving deltas: 100% (383787/383787), done.

________________________________________________________
Executed in   28.28 secs    fish           external
   usr time   28.96 secs  547.00 micros   28.96 secs
   sys time    3.88 secs  221.00 micros    3.88 secs

You can see that the entire operation took around 28 seconds.
Now let’s try the same with a shallow clone.

git clone --depth=1 https://github.com/golang/go.git

And with timing:

❯ time git clone --depth=1 https://github.com/golang/go.git
Cloning into 'go'...
remote: Enumerating objects: 10982, done.
remote: Counting objects: 100% (10982/10982), done.
remote: Compressing objects: 100% (9194/9194), done.
remote: Total 10982 (delta 1766), reused 5970 (delta 1338), pack-reused 0
Receiving objects: 100% (10982/10982), 22.54 MiB | 10.99 MiB/s, done.
Resolving deltas: 100% (1766/1766), done.

________________________________________________________
Executed in    4.56 secs    fish           external
   usr time    1.34 secs  565.00 micros    1.34 secs
   sys time    0.35 secs  215.00 micros    0.35 secs

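That’s around 5 seconds. Going by the measured wall-clock times (28.28 secs vs 4.56 secs), that’s a whopping 6x improvement in speed, with roughly a tenth of the data transferred (22.54 MiB instead of 257.45 MiB). This is huge indeed.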

So how exactly does this work?

Shallow Clone (AKA --depth=N)

In simple terms, each commit in your git repository is a node in a tree-like data structure that git maintains (strictly speaking, a directed acyclic graph). Each commit points to a tree object that describes the file structure as of that commit. Every new commit gets a similar tree: new or modified files are added as new leaf nodes, while existing unmodified files are shared, so the same file object ends up referenced by multiple commits.
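To make the sharing of unmodified files concrete, here is a small sketch that builds a throwaway repository with two commits and compares the object IDs behind each file. The directory and file names (demo, unchanged.txt, changing.txt) and the inline commit identity are made up for illustration, and the script is written for a POSIX sh rather than fish.

```sh
#!/bin/sh
set -eu

# Throwaway repository in a temp directory (all names here are illustrative).
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

# Identity is passed inline so the sketch does not depend on global git config.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo "stable" > unchanged.txt
echo "v1" > changing.txt
git add .
commit "first"

echo "v2" > changing.txt
git add .
commit "second"

# The first line of a commit object names the tree it points to.
git cat-file -p HEAD | head -n 1

# Resolve each file to its blob ID in the old and the new commit.
old_blob=$(git rev-parse HEAD~1:unchanged.txt)
new_blob=$(git rev-parse HEAD:unchanged.txt)
old_changed=$(git rev-parse HEAD~1:changing.txt)
new_changed=$(git rev-parse HEAD:changing.txt)

[ "$old_blob" = "$new_blob" ] && echo "unchanged.txt: same blob reused by both commits"
[ "$old_changed" != "$new_changed" ] && echo "changing.txt: a new blob was stored"
```

Only the modified file gets a new object in the second commit. History still grows with every commit, which is the cost discussed next.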

So you can imagine, as the commits build up, the tree too builds up. This is the price we pay for having version control. The ability to go back in history comes with added storage cost.

This cost is negligible in local setups, as the repository is cloned once while history is accessed often. But in CI pipelines, we usually don’t need the history. We only care about the current HEAD. As such, it doesn’t make sense to clone the entire repository. This is the exact functionality provided by the --depth option of git clone.

From the docs, we have:

Create a shallow clone with a history truncated to the specified number of commits. Implies --single-branch unless --no-single-branch is given to fetch the histories near the tips of all branches.

So when we pass --depth=1, we only get the most recent commit on that branch. Brilliant.
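You can check this behaviour without touching the network. The sketch below is an illustration under a few assumptions: the repository names (origin-repo, full, shallow), the feature branch and the commit identity are invented, and a file:// URL stands in for a real remote because git ignores --depth when cloning from a plain local path.

```sh
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Build a small source repository with three commits and one extra branch.
git init -q origin-repo
for n in 1 2 3; do
    echo "change $n" > origin-repo/file.txt
    git -C origin-repo add file.txt
    git -C origin-repo -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $n"
done
git -C origin-repo branch feature

# Full clone versus shallow clone of the same repository.
git clone -q "file://$work/origin-repo" full
git clone -q --depth=1 "file://$work/origin-repo" shallow

full_count=$(git -C full rev-list --count HEAD)
shallow_count=$(git -C shallow rev-list --count HEAD)
is_shallow=$(git -C shallow rev-parse --is-shallow-repository)

# --depth implies --single-branch, so the extra branch should not be fetched.
if git -C shallow show-ref --verify --quiet refs/remotes/origin/feature; then
    shallow_has_feature=yes
else
    shallow_has_feature=no
fi

echo "full clone commits:     $full_count"
echo "shallow clone commits:  $shallow_count"
echo "is shallow repository:  $is_shallow"
echo "feature branch fetched: $shallow_has_feature"
```

The full clone sees all three commits, while the shallow clone sees one and does not carry the feature branch.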

If you’re interested in the details, I would highly recommend going through this elaborate document about how Git works internally.


Getting back to the company call I was in: while the presentation was going on, someone suggested in the chat that this might not be the fastest way to do it. They theorized that a combination of git init and git fetch might be a faster option.

So what they suggested is:

  1. Create an empty repository via git init
  2. Add the upstream URL via git remote
  3. Fetch only the required branch via git fetch --depth=1
  4. Checkout the commit via git checkout

Theoretically I don’t see how this can be faster (since we still have to fetch the objects in the tree for that particular commit). I did my fair bit to push back, saying that both are the same and there shouldn’t be any difference, but the argument carried on to the end of the call with neither party agreeing.

Giving the other person the benefit of the doubt, let’s see if it’s practically any faster.

❯ bat x.fish
───────┬─────────────────────────────────────────────────────────────────
       │ File: x.fish
───────┼─────────────────────────────────────────────────────────────────
   1   │ #!/bin/fish
   2   │
   3   │ mkdir go
   4   │ cd go
   5   │ git init
   6   │ git remote add origin https://github.com/golang/go.git
   7   │ git fetch --depth=1 origin master
   8   │ git checkout origin/master
───────┴─────────────────────────────────────────────────────────────────

Now let’s time it!

❯ time ./x.fish
Initialized empty Git repository in /tmp/go/.git/
remote: Enumerating objects: 10982, done.
remote: Counting objects: 100% (10982/10982), done.
remote: Compressing objects: 100% (9194/9194), done.
remote: Total 10982 (delta 1766), reused 5970 (delta 1338), pack-reused 0
Receiving objects: 100% (10982/10982), 22.54 MiB | 9.27 MiB/s, done.
Resolving deltas: 100% (1766/1766), done.
From https://github.com/golang/go
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
Note: switching to 'origin/master'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 460900a os/signal: test with a significantly longer fatal timeout

________________________________________________________
Executed in    4.94 secs    fish           external
   usr time    1.44 secs  403.00 micros    1.44 secs
   sys time    0.33 secs  140.00 micros    0.33 secs

Alas! This too took around 5 seconds.

What’s interesting here is that the number of objects compressed and sent over the wire is the same in both examples: 9194 objects compressed and 10982 received. We can also see that the total size transferred over the wire is the same, i.e. 22.54 MiB. The only difference is the transfer rate (10.99 MiB/s vs 9.27 MiB/s), and this can obviously fluctuate with the load on your connection.

So, both methods are effectively the same. Why use multiple commands when git clone --depth=1 suffices?
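If you want to convince yourself without depending on network conditions, the comparison can be reproduced against a local repository. This is only a sketch: the names (upstream, via-clone, via-fetch), the main branch and the commit identity are assumptions for the demo, and a file:// URL plays the role of the remote. It checks that both methods end up on the same commit with the same truncated history, not how long they take.

```sh
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Local stand-in for the remote; the branch name is pinned to "main".
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
for n in 1 2 3; do
    echo "change $n" > upstream/file.txt
    git -C upstream add file.txt
    git -C upstream -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $n"
done
url="file://$work/upstream"

# Method 1: a single shallow clone.
git clone -q --depth=1 "$url" via-clone

# Method 2: init, add the remote, shallow fetch, then checkout.
git init -q via-fetch
git -C via-fetch remote add origin "$url"
git -C via-fetch fetch -q --depth=1 origin main
git -C via-fetch -c advice.detachedHead=false checkout -q origin/main

clone_head=$(git -C via-clone rev-parse HEAD)
fetch_head=$(git -C via-fetch rev-parse HEAD)
clone_count=$(git -C via-clone rev-list --count HEAD)
fetch_count=$(git -C via-fetch rev-list --count HEAD)

echo "clone:      $clone_head ($clone_count commit)"
echo "init+fetch: $fetch_head ($fetch_count commit)"
```

Both directories land on the same commit with a history of depth one. The practical difference is that the second method leaves you on a detached HEAD instead of a local branch.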

1 for Karthik and 0 for the person on the chat. Hehe.