In this article, we are going to explore how Git’s object storage works, and we will recreate the famous git commit command with a simple Python script. You can find the full code here. Let’s jump right into it!
At the core of Git, there’s a key-value storage, known as the Git Database, located in the .git/objects folder of your repo. Great. What does that mean? Well, you can pass any blob of data to the Git Database, and it will give you a unique identifier: the hash of the content. You can then get this blob back using the same identifier. Let’s try with an example:
$ git init
$ find .git/objects -type f  # No output
$ echo "example blob" | git hash-object -w --stdin
bfd132e6a8ca084d0aa7d6f18cb4852b6ca7c1d3
$ find .git/objects -type f
.git/objects/bf/d132e6a8ca084d0aa7d6f18cb4852b6ca7c1d3
$ git cat-file -p bfd132e6a8ca084d0aa7d6f18cb4852b6ca7c1d3
example blob
Phew, let’s unpack this.
git init will simply initialize an empty repository in the current folder, with an empty database. The second command will list all the files in the Git database, located at .git/objects, and as you can see it is empty for now.
The next command feeds the string example blob\n (the newline character is added by echo) to git hash-object, the low-level command (also called a plumbing command) that hashes objects and writes them to the database. The -w flag makes it actually write the object instead of just calculating the hash, and --stdin tells Git to read the content from standard input, which we fill using the echo | pipe. It returns the hash of the content: the unique identifier used to retrieve it from the database.
On the next line, you can see that there is now a file in the database! If you look carefully, you can see that the file is actually stored in a subfolder: this is an optimisation trick that avoids piling every object into a single huge folder. Each file is named after the last 38 hexadecimal digits of the hash, inside a folder named after the first two digits.
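The folder layout is simple enough to express in a few lines of Python. This is only a sketch: object_path is a hypothetical helper name, and it assumes 40-digit SHA-1 identifiers and loose (unpacked) objects.

```python
from pathlib import Path


def object_path(object_id: str, git_dir: str = ".git") -> Path:
    """Map a 40-digit object ID to the location of its loose object file."""
    if len(object_id) != 40:
        raise ValueError("expected a 40-character hexadecimal SHA-1 ID")
    # First two digits: folder name. Remaining 38 digits: file name.
    return Path(git_dir) / "objects" / object_id[:2] / object_id[2:]


if __name__ == "__main__":
    print(object_path("bfd132e6a8ca084d0aa7d6f18cb4852b6ca7c1d3").as_posix())
```

For the blob we just stored, this gives the same path that find listed.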
Finally, we can use git cat-file to display the blob content. The -p option makes Git detect the type of the object and display it accordingly; you will learn more about that in a bit.
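Git can detect the type because it is stored inside the object file itself. As a rough sketch, assuming SHA-1 and loose objects only (packed objects are not handled), a loose object is the zlib-compressed concatenation of a "type size\0" header and the content. The helper names write_object and read_object are ours for illustration.

```python
import hashlib
import zlib
from pathlib import Path
from typing import Tuple


def write_object(git_dir: Path, obj_type: str, data: bytes) -> str:
    """Store `data` as a loose object and return its ID (like hash-object -w)."""
    store = f"{obj_type} {len(data)}".encode("ascii") + b"\x00" + data
    object_id = hashlib.sha1(store).hexdigest()
    path = git_dir / "objects" / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return object_id


def read_object(git_dir: Path, object_id: str) -> Tuple[str, bytes]:
    """Return (type, content) of a loose object (roughly what cat-file reads)."""
    path = git_dir / "objects" / object_id[:2] / object_id[2:]
    raw = zlib.decompress(path.read_bytes())
    # The header ends at the first NUL byte; everything after it is content.
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("corrupt object: size mismatch")
    return obj_type, content
```

Calling read_object(Path(".git"), "bfd132e6a8ca084d0aa7d6f18cb4852b6ca7c1d3") in the repository above should return the type blob together with the bytes we piped in.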
Cool, now we know more about the Git database itself, but what should we write to it? Let’s make a test repository to find out:
<aside> 💡 If you want to follow along with this section and get the same hashes as shown here, please run the following commands in your shell. You will need to run them after the first command (git init), because git config --local requires an existing repository.
$ git config --local user.name "Example User"
$ git config --local user.email "example@localhost"
$ git config --local commit.gpgsign false
$ export GIT_AUTHOR_DATE="Mon, 09 Jan 2017 00:00:00 +0000"
$ export GIT_COMMITTER_DATE="Mon, 09 Jan 2017 00:00:00 +0000"
$ git init
$ mkdir src
$ echo "This is a README" > README.txt
$ echo "This is a License" > LICENSE
$ echo "import __hello__" > src/script.py
$ git add README.txt LICENSE src
$ find .git/objects -type f
.git/objects/00/43777bc09925c1e09d9df2bc3919745f526a4d
.git/objects/b0/7f0ed953b8a24983dd5048cd2019b595692d74
.git/objects/c9/1772d420edb9d16ce07bc546f108eee6efdde8
As you can see, Git created three objects when we added those files. Using git hash-object, we can confirm what those objects are:
$ echo "This is a README" | git hash-object -w --stdin
b07f0ed953b8a24983dd5048cd2019b595692d74
$ echo "This is a LICENSE" | git hash-object -w --stdin
0043777bc09925c1e09d9df2bc3919745f526a4d
$ echo "import __hello__" | git hash-object -w --stdin
c91772d420edb9d16ce07bc546f108eee6efdde8
Sweet, each object in our database holds the content of one of our files. Notice that the file name isn’t mentioned anywhere. Git objects are anonymous! The name will be stored separately. Let’s now make a commit:
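One way to convince yourself that blobs are anonymous is to hash two files that have different names but identical content. The sketch below assumes the SHA-1 blob format ("blob", the size in bytes, a NUL byte, then the content); blob_id and same_id_for_both_names are hypothetical helper names, and the files live in a throwaway temporary folder.

```python
import hashlib
import tempfile
from pathlib import Path


def blob_id(path: Path) -> str:
    """Compute a file's blob ID from its bytes alone; the name is never used."""
    data = path.read_bytes()
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def same_id_for_both_names() -> bool:
    """Write identical content under two different names and compare the IDs."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "README.txt"
        second = Path(tmp) / "COPY_OF_README.md"
        first.write_bytes(b"This is a README\n")
        second.write_bytes(b"This is a README\n")
        return blob_id(first) == blob_id(second)


if __name__ == "__main__":
    print(same_id_for_both_names())
```

Because the identifier depends only on the bytes, two files with the same content share a single object in the database, whatever they are called.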
$ git commit -m "Example Commit"
[master (root-commit) 7681db6] Example Commit
 3 files changed, 3 insertions(+)
 create mode 100644 LICENSE
 create mode 100644 README.txt
 create mode 100644 src/script.py
$ find .git/objects -type f
.git/objects/00/43777bc09925c1e09d9df2bc3919745f526a4d
.git/objects/27/3af442f00f3a89e6efcedd0861e58a4d9e7b78  # New
.git/objects/76/81db64512437fa48e5b58954c91cc9736ff200  # New
.git/objects/84/cd21c3f997f58238de2e1b25da337d0839de6b  # New
.git/objects/b0/7f0ed953b8a24983dd5048cd2019b595692d74
.git/objects/c9/1772d420edb9d16ce07bc546f108eee6efdde8
We made one commit, but Git created three new objects! Let’s start by inspecting the commit itself:
$ git cat-file -p 7681db64512437fa48e5b58954c91cc9736ff200
tree 84cd21c3f997f58238de2e1b25da337d0839de6b
author Example User <example@localhost> 1483920000 +0000
committer Example User <example@localhost> 1483920000 +0000

Example Commit
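The commit content is plain text: a few "key value" header lines, a blank line, then the message. Here is a small Python sketch that splits it apart and decodes the timestamp. The names parse_commit and parse_timestamp are hypothetical, and the parser is deliberately simplified: it assumes each header appears once and fits on one line, which holds for this root commit but not for merge commits or signed commits.

```python
from datetime import datetime, timezone

# The output of git cat-file -p for the commit above.
COMMIT_TEXT = (
    "tree 84cd21c3f997f58238de2e1b25da337d0839de6b\n"
    "author Example User <example@localhost> 1483920000 +0000\n"
    "committer Example User <example@localhost> 1483920000 +0000\n"
    "\n"
    "Example Commit\n"
)


def parse_commit(text: str):
    """Split a commit body into (headers, message).

    Simplification: repeated or multi-line headers are not supported.
    """
    head, _, message = text.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        headers[key] = value
    return headers, message


def parse_timestamp(person: str) -> datetime:
    """Extract the time from 'Name <email> <unix seconds> <utc offset>'."""
    seconds = int(person.rsplit(" ", 2)[1])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    headers, message = parse_commit(COMMIT_TEXT)
    print(headers["tree"])
    print(parse_timestamp(headers["author"]).isoformat())
    print(message, end="")
```

The decoded timestamp corresponds to the GIT_AUTHOR_DATE we exported earlier, and the tree header points to one of the other two new objects.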