Chapter 2
Git Guts
Although it is commonly called a version control tool (and this is certainly its most common use), it is useful to think of Git as an object file system. That might not help much right now, but as we learn more about Git you will hopefully appreciate why I say this.
2.1 Git Repository
When we use git init, Git creates a repository in a .git directory within the current working directory, which becomes the Git workspace. You edit your files in the workspace just as you normally would and use git commands to manipulate objects and data stored in the .git database (commonly known as the repository).
    mkdir class
    cd class
    ls -a
    git init
    tree -a
Within the .git directory are a number of files and sub-directories that constitute the .git database.
    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    │   ├── applypatch-msg.sample
    │   ├── commit-msg.sample
    │   ├── fsmonitor-watchman.sample
    │   ├── post-update.sample
    │   ├── pre-applypatch.sample
    │   ├── pre-commit.sample
    │   ├── prepare-commit-msg.sample
    │   ├── pre-push.sample
    │   ├── pre-rebase.sample
    │   ├── pre-receive.sample
    │   └── update.sample
    ├── info
    │   └── exclude
    ├── objects
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

    9 directories, 15 files
A few are of less interest to us at the moment:
- config: holds configuration to be used on this repository (more on Git configuration later)
- description: used to provide a description of this repository to the web interface (not something we will look at for a while)
- info: contains the exclude file, which lists patterns for files to be ignored in this project’s workspace (it works like .gitignore, to be investigated later).
- hooks: these are small scripts that can be triggered on certain actions. We will use these later but for now they can be ignored. They take up a lot of room in our output and we don’t need them, so let’s delete them (I’ll leave the directory, even though it is not required, to remind us that it’s typically there).
    rm .git/hooks/*
The ones that interest us most in this chapter are:
- HEAD: holds a special reference to the last object stored from the workspace
- objects: this holds the data
- refs: this holds references into the data in objects
We will see several other files and directories created as we use Git and we will discuss these as they occur.
There are several types of object held in .git repositories; the three we will encounter in this chapter are:
- blobs: containing the data we want to store (typically files)
- trees: containing data about sets of blobs (and other trees)
- commits: containing metadata about trees
There is no need to worry about the details; all will become clear as we progress through this chapter.
2.2 blobs
We can use some low-level Git commands to create blobs directly1. The git hash-object sub-command creates and stores objects. Let’s create an object:
    echo 'version 1' > file1.txt
    git hash-object file1.txt
    tree .git
We created a simple text file and had git hash-object show us its hash (a 40 character string, actually the SHA-1 hash of the file’s content preceded by a short header), but this object is not stored in the repository yet.
    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks/
    ├── info
    │   └── exclude
    ├── objects
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

    9 directories, 15 files
To have git hash-object store the file we use the -w option.
    git hash-object -w file1.txt
    tree .git

    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    ├── info
    │   └── exclude
    ├── objects
    │   ├── 83
    │   │   └── baae61804e65cc73a7201a7252750c76066a30
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

    10 directories, 16 files
The object is stored in the objects directory and the first two characters of the hash are used to create a directory (this is called ‘sharding’ and it is used to reduce the number of files stored in any one directory).
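To make the link between a blob's hash and its location on disk concrete, here is a minimal sketch that reproduces the hash without calling Git at all. It relies on the fact that Git hashes a small header (`blob`, the size in bytes, and a NUL byte) followed by the content. The use of `sha1sum` is my assumption about your system (macOS users may need `shasum` instead), and the variable names are purely illustrative.

```shell
set -eu

# The content, exactly as `echo 'version 1'` writes it (10 bytes including the newline).
content='version 1'
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')

# Hash the header "blob <size>" + NUL byte, followed by the content itself.
blob_hash=$({ printf 'blob %s\000' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d' ' -f1)

# The first two characters name the shard directory, the remaining 38 name the file.
shard=$(printf '%s' "$blob_hash" | cut -c1-2)
rest=$(printf '%s' "$blob_hash" | cut -c3-)
object_path=".git/objects/$shard/$rest"

echo "$blob_hash"
echo "$object_path"
```

The two lines printed should match the hash and the path we saw in the tree listing for file1.txt.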
It is important to note that Git has no idea what this blob is; it is just some data. No record is held of the original file name. For that matter, Git doesn’t even care that this blob came from a file.
    echo 'not a file' | git hash-object -w --stdin
    tree .git

    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    ├── info
    │   └── exclude
    ├── objects
    │   ├── 7a
    │   │   └── b4ff63b2ea4c2c3ff89ee972bc42988a4b8472
    │   ├── 83
    │   │   └── baae61804e65cc73a7201a7252750c76066a30
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

    11 directories, 17 files
Here the data for the blob is fed into Git straight from stdin. No file is involved; this is ‘raw data’.
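A quick way to convince yourself that the name plays no part is to hash the same content from several sources. The sketch below works in a throwaway directory created with `mktemp` (my choice, so that it does not disturb our class repository), and the second file name is made up for illustration.

```shell
set -eu

# Work in a scratch repository so the class repository is left alone.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Same content, two different file names, plus the same content on stdin.
echo 'version 1' > file1.txt
echo 'version 1' > a_completely_different_name.txt

hash_file=$(git hash-object file1.txt)
hash_other=$(git hash-object a_completely_different_name.txt)
hash_stdin=$(echo 'version 1' | git hash-object --stdin)

printf '%s\n' "$hash_file" "$hash_other" "$hash_stdin"
```

All three lines should be identical, because only the content contributes to the hash.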
We can recall the blob from our repository using git cat-file (this is a bit misleading and would be better called cat-object because, as we shall see, we can use it to look inside various git objects).
    git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30

    version 1
The -p option ‘pretty prints’ the content of the object to stdout so if we want to create a file from this object we need to redirect it …
    git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > new_file.txt
    cat new_file.txt

    version 1
Typing out those long hashes quickly becomes tiresome. Fortunately, Git allows us to specify shorter forms in many instances: we can provide just enough of the start of an object’s hash to identify it unambiguously.

    git cat-file -p 83ba

In most circumstances 6 to 8 characters are sufficient. Here we can use just 4 because our repository has so few entries that this is all that is required to reference each object unambiguously. (We cannot reduce this to just 2 because Git considers that too short: it requires at least four characters, and two would only identify the shard directory, not the object file.)
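If you want to see how Git resolves an abbreviation, the rev-parse sub-command will expand a short hash to the full one, and its --short option will suggest an abbreviation for you. This is only a sketch in a scratch repository; the variable names are my own.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Store one blob and keep its full 40 character hash.
full_hash=$(echo 'version 1' | git hash-object -w --stdin)

# Take the first four characters and ask Git to expand them again.
short_hash=$(printf '%s' "$full_hash" | cut -c1-4)
expanded=$(git rev-parse --verify "$short_hash")

# Ask Git for an abbreviation it considers unambiguous in this repository.
suggested=$(git rev-parse --short "$full_hash")

echo "$short_hash expands to $expanded"
echo "Git suggests $suggested"
```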
We can add another version of our file1.txt without any confusion (because Git does not care about the filename at this point).
    echo 'version 2' > file1.txt
    git hash-object -w file1.txt
Git adds the new object as a simple blob.
    tree .git
    git cat-file -p 1f7a

    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    ├── info
    │   └── exclude
    ├── objects
    │   ├── 1f
    │   │   └── 7a7a472abf3dd9643fd615f6da379c4acb3e3a
    │   ├── 7a
    │   │   └── b4ff63b2ea4c2c3ff89ee972bc42988a4b8472
    │   ├── 83
    │   │   └── baae61804e65cc73a7201a7252750c76066a30
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

    12 directories, 18 files

    version 2
So we can store blobs in our repository but this is of limited use as we normally deal with directories containing files and these tend to have human readable names (like file1.txt).
2.3 trees
To get Git to track filenames and directories we have it create a different type of object called a ‘tree’ and to create tree objects we use the ‘index’. The index is a sort of holding area within our repository2 (you will also see the ‘index’ called the ‘cache’ or ‘staging’ area). In the index we collect information about all of the objects we want to store in our repository, then we use a single command to create a tree entry using the entries in the index.
    git update-index --add --cacheinfo 100644 83baae61804e65cc73a7201a7252750c76066a30 file1.txt
    tree .git

     1  .git
     2  ├── branches
     3  ├── config
     4  ├── description
     5  ├── HEAD
     6  ├── hooks
     7  ├── index
     8  ├── info
     9  │   └── exclude
    10  ├── objects
    11  │   ├── 1f
    12  │   │   └── 7a7a472abf3dd9643fd615f6da379c4acb3e3a
    13  │   ├── 7a
    14  │   │   └── b4ff63b2ea4c2c3ff89ee972bc42988a4b8472
    15  │   ├── 83
    16  │   │   └── baae61804e65cc73a7201a7252750c76066a30
    17  │   ├── info
    18  │   └── pack
    19  └── refs
    20      ├── heads
    21      └── tags
    22
    23  12 directories, 19 files
update-index is used to manipulate our repository index. Initially a new repository has no index, but after adding an object’s information to the index we see a new file, index (line 7 above). The --cacheinfo option specifies the object data to be added. The file’s mode (100644) is given first, then the object hash (83baae61804e65cc73a7201a7252750c76066a30), and finally the filename we want to associate with the object (file1.txt). Note that these are entirely under our control in the update-index command and do not have to correspond with any real file. Even the object identity is not checked by the update-index command (you should always provide a real hash though, otherwise you will get an “invalid object” error when you attempt to write the tree, which is our next step).
Having created our index we can examine its content using git ls-files --stage; the --stage option causes ls-files to display the mode and object hash.
    git ls-files --stage
    git write-tree
    git ls-files --stage

    100644 83baae61804e65cc73a7201a7252750c76066a30 0 file1.txt

    b7e8fac7e3e35d93d39d2fa2260868f025a9efb4

    100644 83baae61804e65cc73a7201a7252750c76066a30 0 file1.txt
The git write-tree operation does not change the index file. The ls-files shows us that the index is the same before and after the write-tree.
    tree .git

    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    ├── index
    ├── info
    │   └── exclude
    ├── objects
    │   ├── 1f
    │   │   └── 7a7a472abf3dd9643fd615f6da379c4acb3e3a
    │   ├── 7a
    │   │   └── b4ff63b2ea4c2c3ff89ee972bc42988a4b8472
    │   ├── 83
    │   │   └── baae61804e65cc73a7201a7252750c76066a30
    │   ├── b7
    │   │   └── e8fac7e3e35d93d39d2fa2260868f025a9efb4
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

    13 directories, 20 files
After the write-tree a new object has appeared in our repository. The hash for this object (b7e8fac7e3e35d93d39d2fa2260868f025a9efb4) is what was returned from the write-tree command. You can check the type of this object, confirming it is a tree, and then look at its content to see that the --cacheinfo we used above has been captured.
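Like blobs, tree objects are named by the hash of what they contain, so the same index entries should always give the same tree hash. The following sketch checks this by building the same one-entry index in two separate scratch repositories; the helper function `make_tree` is my own invention and is not a Git command.

```shell
set -eu

# Hypothetical helper: build a fresh repository holding one blob mapped to
# file1.txt, write the tree, and print the tree's hash.
make_tree() {
    scratch=$(mktemp -d)
    (
        cd "$scratch"
        git init -q .
        blob=$(echo 'version 1' | git hash-object -w --stdin)
        git update-index --add --cacheinfo 100644 "$blob" file1.txt
        git write-tree
    )
}

tree_a=$(make_tree)
tree_b=$(make_tree)

echo "first repository:  $tree_a"
echo "second repository: $tree_b"
```

Both repositories should report the same tree hash, even though they share no files on disk.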
    git cat-file -t b7e8
    git cat-file -p b7e8

    tree

    100644 blob 83baae61804e65cc73a7201a7252750c76066a30 file1.txt
The second field of this tree record, blob, tells us that the record refers to an object of type ‘blob’. Why blob and not object? The objects directory contains both file content (blobs) and tree objects (which, as we will shortly see, are analogous to directories in the workspace). In other words, blobs and trees are both objects. It is therefore fine to use the term ‘object’ when the context makes clear the type of object we are talking about (or when we are talking collectively about any type of object). I will continue to use ‘object’ unless it is important to use a more specific type.
We can add multiple objects to our index and these can be a mix of existing repository objects and new files added from our working area.
    echo 'Another file' > another_file.txt
    git update-index --add another_file.txt
    git ls-files --stage

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 83baae61804e65cc73a7201a7252750c76066a30 0 file1.txt
Here we are using update-index directly on the file another_file.txt. This will create a new object in the repository holding the content of another_file.txt at the time this update-index is run and then create the entry in the index to relate the filename and the file mode to this object. We cannot use --cacheinfo here because the object does not exist within the repository until we run the update-index. We need the --add option so that update-index will accept new files (files that have no existing index entry) into the index.
Some time back we created a new object containing the text ‘version 2’. This object was assigned the hash 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a when we created it with hash-object -w. We want to add this object to our index.
    git update-index --cacheinfo 100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a file1.txt
    git ls-files --stage

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 file1.txt
Notice that the index is modified so that the file1.txt entry now refers to object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a.
Why was a new line not created in the index? Note the absence of the --add option. We are modifying the index entry associated with the name file1.txt, not adding a new entry. The index is a mapping between objects in the Git repository and files in the workspace, and workspace files must be uniquely identified by filename. Each filename in the index therefore maps to exactly one object (a filename can only refer to one object).
It is fine for the index to have a one to many mapping from object to filename (one object can be referred to by many filenames). This can be illustrated by adding a second index entry referring to the object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a but using a different filename.
    git update-index --add --cacheinfo 100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a fileX.txt
    git ls-files --stage

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 file1.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 fileX.txt
What does this represent?
Work through what we have learned so far. The object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a contains the data ‘version 2’. The index shows the mapping between the data and the files in the workspace. So both file1.txt and fileX.txt in the workspace are to have the same content (that from object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a).
We don’t really want this double mapping (interesting as it is), so we remove it from the index using the --remove option to the update-index command.
    git update-index --remove fileX.txt
    git ls-files --stage

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 file1.txt
We now create another tree object.
    git write-tree
So far we have created some basic blob and tree objects, but we have not yet dealt with directories. Or have we?
A directory is essentially a container holding files and other directories. Sounds familiar? The tree object we just created is a list of blobs related to file names. Can we similarly relate a directory name with a tree object and include it in another tree object?
Create a directory and a new file in that directory.
    mkdir dir1
    echo 'version 1' > dir1/file11.txt
We now add this new file to the index.
    git update-index --add dir1/file11.txt
If we now look at our index we find that this has simply added an entry to the index with the path dir1/file11.txt rather than a simple filename. We have discovered that the index maps files by pathname rather than simply their file name. These pathnames are relative to the root of our working area.
    git ls-files -s

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 83baae61804e65cc73a7201a7252750c76066a30 0 dir1/file11.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 file1.txt
2.3.1 Progress review: blobs and trees
Let’s review the situation we now have.
We have some blobs in the .git/objects store holding various data. We have two tree objects in the .git/objects store: b7e8fac7e3e35d93d39d2fa2260868f025a9efb4, which relates 83baae to the name file1.txt, and 349fa0b7f3252dbe6989c2e8156803b3265a78e0, which relates 1f7a7a to file1.txt and b0b9fc to another_file.txt. We have a .git/index file containing various mappings between blobs and filenames (which we just listed out above).
We can list all the objects in .git/objects using cat-file with the --batch-all-objects and --batch-check options.
    git cat-file --batch-all-objects --batch-check

    1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob 10
    349fa0b7f3252dbe6989c2e8156803b3265a78e0 tree 81
    7ab4ff63b2ea4c2c3ff89ee972bc42988a4b8472 blob 11
    83baae61804e65cc73a7201a7252750c76066a30 blob 10
    b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f blob 13
    b7e8fac7e3e35d93d39d2fa2260868f025a9efb4 tree 37
We can now see what happens when we add sub-directories to our object store. Remember that our index has a new dir1/file11.txt path mapping so we are expecting write-tree to account for this in our repository.
    git write-tree
    git cat-file --batch-all-objects --batch-check

    0139f016af84acd889e2f707ef9eca2140e0222e tree 112
    1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob 10
    337f3832b1bce2d8f364e99965c8519a3eb9dc6c tree 38
    349fa0b7f3252dbe6989c2e8156803b3265a78e0 tree 81
    7ab4ff63b2ea4c2c3ff89ee972bc42988a4b8472 blob 11
    83baae61804e65cc73a7201a7252750c76066a30 blob 10
    b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f blob 13
    b7e8fac7e3e35d93d39d2fa2260868f025a9efb4 tree 37
We have added two new tree objects, 337f38 and 0139f0. Inspecting these we can see what has happened.
    git cat-file -p 337f38
    git cat-file -p 0139f0

    100644 blob 83baae61804e65cc73a7201a7252750c76066a30 file11.txt

    1  100644 blob b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f another_file.txt
    2  040000 tree 337f3832b1bce2d8f364e99965c8519a3eb9dc6c dir1
    3  100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a file1.txt
The first (337f38) represents the content of the dir1 directory, in this instance just the mapping of 83baae to the file name file11.txt.
The second (0139f0) represents the content of our root directory. The interesting entry is the tree object referenced on line 2 and mapped to the name dir1.
From this short exercise we can make a few observations.
- The index maps blobs to file paths (not simply file names).
- The index does not map tree objects.
- Tree objects are created as required whenever a write-tree is executed.
- Tree objects are mapped to names by other tree objects.
- Tree objects form a directed graph representing a directory structure.
- The root tree object has no name (since names are mapped by tree objects and, by definition, the root tree object is not itself a part of a parent tree object).
We have now shown how Git stores data in blobs. Names are mapped to those blobs by tree objects. Tree objects can contain other tree objects and map them to names, allowing us to store directories3.
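The observations above can be checked mechanically. The sketch below rebuilds a small nested structure in a scratch repository and then follows the dir1 entry from the root tree down to the tree it names, using only cat-file. The `awk` field positions assume the four-column output we saw from cat-file -p on a tree; the variable names are illustrative.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# One file at the root and one inside a sub-directory.
mkdir dir1
echo 'version 1' > dir1/file11.txt
echo 'version 2' > file1.txt
git update-index --add dir1/file11.txt file1.txt

# write-tree creates the root tree and, as needed, a tree for dir1.
root_tree=$(git write-tree)
git cat-file -p "$root_tree"

# Pick out the type and hash recorded against the name dir1 in the root tree.
dir1_type=$(git cat-file -p "$root_tree" | awk '$4 == "dir1" { print $2 }')
dir1_hash=$(git cat-file -p "$root_tree" | awk '$4 == "dir1" { print $3 }')

# The nested tree maps its own blob to a plain file name, not a path.
nested_name=$(git cat-file -p "$dir1_hash" | awk '{ print $4 }')
echo "dir1 is a $dir1_type containing $nested_name"
```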
Now that we can store a basic file structure it is time to consider how Git stores the history of changes to files.
2.4 commits
Tree objects effectively capture and freeze a hierarchical set of files and directories. Put another way, a tree object is a snapshot in time of a set of blob to file path mappings. This is useful to us when we want to capture a history: all we need to do is capture tree objects representing the start and end of any operation and then somehow tell Git that the first snapshot precedes the second. We can then look at these two snapshots as a history of the files and directories captured by the tree objects.
We already have two snapshots we can use to start our history.
    git ls-tree -r b7e8fa
    git ls-tree -r 0139f0
I’ve used the -r option to list the tree object recursively. This has no effect on the first tree but the second tree shows blob object 83baae mapped to the file path dir1/file11.txt whereas without the -r option we would see only the tree object 337f38 mapped to directory dir1 (as above).
    100644 blob 83baae61804e65cc73a7201a7252750c76066a30 file1.txt

    100644 blob b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f another_file.txt
    100644 blob 83baae61804e65cc73a7201a7252750c76066a30 dir1/file11.txt
    100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a file1.txt
The first tree (b7e8fa) contains only file1.txt. In the second we have added the file another_file.txt, the directory dir1 and, within that, the file file11.txt. We also know that file1.txt has different content to that referred to in b7e8fa, because it is mapped to a different blob (1f7a7a rather than 83baae).
So Git provides a simple mechanism for showing that tree b7e8fa is historically before tree 0139f0? Yes … and no. Although we know that we added and modified the files between creating tree objects b7e8fa and 0139f0, there is nothing reflecting this history. We could just as easily claim that 0139f0 was created first and then we modified file1.txt and removed dir1 and its content; the results would be the same.
To create a history we must first create a new type of object, the commit object. It will not surprise you that these objects are also stored under .git/objects.
Commit objects contain special metadata (that is data about data, in this case data about a tree object). To create a commit object we use the commit-tree command.
    git commit-tree -m "First commit" b7e8fa
As with other commands that create new objects the commit-tree command returns the hash of the new commit object. This is the first time you will notice your commit will have a different hash to my commit object. Pause for a second and consider why this might be.
We can now inspect this object, first confirming its type and then pretty printing it.
    git cat-file -t f871b5
    git cat-file -p f871b5

    commit

    tree b7e8fac7e3e35d93d39d2fa2260868f025a9efb4
    author vagrant <vagrant@debian-10.7-amd64> 1615399633 +0000
    committer vagrant <vagrant@debian-10.7-amd64> 1615399633 +0000

    First commit
And here we see another difference between what you see and what I see. What causes this difference? After all we have, so far, started with the same setup and created the same objects in the repository. Compare closely the output of cat-file. At the end of the lines starting author and committer are two numbers: a timestamp followed by a timezone offset. Since you and I created our commit objects at different times (and most likely under different user names), our commit objects have different content and consequently different hashes.
We can demonstrate this clearly by repeating the commit-tree with no changes.
    git commit-tree -m "First commit" b7e8fa
Git returns a different hash. If we compare the two commit objects (remembering that your commit objects’ hashes will be different to mine!), we see they differ only in the timestamps recorded.
    diff <(git cat-file -p f871b5) <(git cat-file -p e3004b)
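We can turn this observation around: if the timestamps really are the only thing that differs, then pinning them should make commit-tree return the same hash every time. The sketch below does this in a scratch repository using the GIT_AUTHOR_* and GIT_COMMITTER_* environment variables that Git reads when creating a commit; the name and email address are placeholders of my own.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

blob=$(echo 'version 1' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" file1.txt
tree=$(git write-tree)

# Pin every piece of metadata that commit-tree records (placeholder identity).
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'
export GIT_AUTHOR_DATE='1615399633 +0000' GIT_COMMITTER_DATE='1615399633 +0000'

first=$(git commit-tree -m 'First commit' "$tree")
second=$(git commit-tree -m 'First commit' "$tree")

# Move only the committer timestamp on by one second.
GIT_COMMITTER_DATE='1615399634 +0000'
third=$(git commit-tree -m 'First commit' "$tree")

printf '%s\n' "$first" "$second" "$third"
```

The first two hashes should be identical and the third should differ, even though only one second has changed.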
We now have two commit objects, but they are not very interesting as they refer to the same tree object (and hence the same ’version 1’ of file1.txt). Let’s create some more interesting commit objects.
We previously created a tree object (0139f0) that captured the files file1.txt (’version 2’), another_file.txt, and dir1/file11.txt. We now want to create a history in which this configuration of files and directories follows from the ’version 1’ file1.txt.

    git commit-tree -m "Second commit" -p f871b5 0139f0

In this commit-tree we added the -p option to indicate that commit object f871b5 is the parent of the commit object we are creating for tree object 0139f0. As before we can examine the new commit object (in my case 0715e7) with the cat-file command.

    git cat-file -p 0715e7

    1  tree 0139f016af84acd889e2f707ef9eca2140e0222e
    2  parent f871b58596491e15ee1da91eaf0a4a6c1da3e573
    3  author vagrant <vagrant@debian-10.7-amd64> 1615399872 +0000
    4  committer vagrant <vagrant@debian-10.7-amd64> 1615399872 +0000
    5
    6  Second commit
On line 2 we see that this commit has a parent (f871b5).
Now let’s quickly create one more commit. First we create a new version of file1.txt, then create a new tree object, and finally a new commit.
    echo 'version 3' > file1.txt
    git update-index file1.txt
    git write-tree
    git commit-tree -m "Third commit" -p 0715e7 fd97ab
2.4.1 Progress review: blobs, trees, and commits
Let us review the content of our objects store4. We have created three tree objects using the write-tree command. These were:
- Version one of file1.txt on its own.
- Adding another_file.txt alongside version one of dir1/file11.txt and updating file1.txt to version two.
- Update file1.txt to version 3
To create a version chain of these three tree objects we used three commit-tree commands. The first commit object has no parent as it is the first entry. It contains four pieces of data:
- tree: the hash of the tree object to which this commit refers.
- author: a record of the author’s name and email (the person who wrote the changes in the tree object), along with the time the commit was authored
- committer: a record of the committer (the user who actually executed the commit-tree)
- A blank line, followed by the text of any comment we want to associate with the commit (in these examples, supplied by the -m option to the commit-tree command).
Author versus Committer
Why the two entries ‘author’ and ‘committer’?
The ‘author’ of a change is the individual who edited the files making up the change.
The ‘committer’ is the user who created the commit object.
In private use these two fields normally contain the same information: the same user makes the changes and creates the commit. However, suppose a user submits a change as a patch file by e-mail. That person is the author of the change but not the person who puts those changes into the Git repository. This is why there is a distinction between the ‘author’ and ‘committer’.
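To see the two roles diverge, we can supply different identities when creating a commit. This sketch uses the GIT_AUTHOR_* and GIT_COMMITTER_* environment variables in a scratch repository; Alice and Bob and their addresses are of course invented for the illustration.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

blob=$(echo 'version 1' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" file1.txt
tree=$(git write-tree)

# Alice wrote the change; Bob is the one who runs commit-tree.
commit=$(GIT_AUTHOR_NAME='Alice Author' GIT_AUTHOR_EMAIL='alice@example.com' \
         GIT_COMMITTER_NAME='Bob Committer' GIT_COMMITTER_EMAIL='bob@example.com' \
         git commit-tree -m 'Apply patch from Alice' "$tree")

author_line=$(git cat-file -p "$commit" | grep '^author ')
committer_line=$(git cat-file -p "$commit" | grep '^committer ')

echo "$author_line"
echo "$committer_line"
```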
The second commit specifies the first commit object (the hash returned by the first commit-tree) as its parent. Looking at this commit object you can see one additional piece of data over the initial commit:
- parent: the hash of the parent commit object.
Finally we created a third commit object referencing the second as its parent.
The entire chain we just created can be displayed using the log command, with the hash e27aaa being that of the last (third) commit object we created. The --stat option shows summary statistics for each commit and the --patch option shows the changes to the files in each commit.
    git log --stat --patch e27aaa

    commit e27aaa8c158e6f261f4c03aaaf173a149ad61d81
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:13:55 2021 +0000

        Third commit
    ---
     file1.txt | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)

    diff --git a/file1.txt b/file1.txt
    index 1f7a7a4..7170a52 100644
    --- a/file1.txt
    +++ b/file1.txt
    @@ -1 +1 @@
    -version 2
    +version 3

    commit 0715e707b906d30c9e395448ddc9e96acd89d5f7
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:11:12 2021 +0000

        Second commit
    ---
     another_file.txt | 1 +
     dir1/file11.txt  | 1 +
     file1.txt        | 2 +-
     3 files changed, 3 insertions(+), 1 deletion(-)

    diff --git a/another_file.txt b/another_file.txt
    new file mode 100644
    index 0000000..b0b9fc8
    --- /dev/null
    +++ b/another_file.txt
    @@ -0,0 +1 @@
    +Another file
    diff --git a/dir1/file11.txt b/dir1/file11.txt
    new file mode 100644
    index 0000000..83baae6
    --- /dev/null
    +++ b/dir1/file11.txt
    @@ -0,0 +1 @@
    +version 1
    diff --git a/file1.txt b/file1.txt
    index 83baae6..1f7a7a4 100644
    --- a/file1.txt
    +++ b/file1.txt
    @@ -1 +1 @@
    -version 1
    +version 2

    commit f871b58596491e15ee1da91eaf0a4a6c1da3e573
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:07:13 2021 +0000

        First commit
    ---
     file1.txt | 1 +
     1 file changed, 1 insertion(+)

    diff --git a/file1.txt b/file1.txt
    new file mode 100644
    index 0000000..83baae6
    --- /dev/null
    +++ b/file1.txt
    @@ -0,0 +1 @@
    +version 1
2.5 refs
So far we have:
- Created some hash objects.
- Created some tree objects that associate file pathname and mode with one or more hash objects
- Created some commit objects that associate metadata with tree objects and allow us to relate tree objects in a graph, which is typically interpreted as a version graph where each parent is an earlier version (it should be noted that Git itself is completely unaware of this interpretation though).
So far, so good but it is still a bit cumbersome to use. For one thing we have to remember which commit object we last created so that we can use it as the parent for our next commit. We have seen this problem above, not only when using commit-tree but also using the log command where we needed to know the hash of the most recent commit object in our history.
refs (or ‘references’) to the rescue. A ref is a more human readable way to refer to a commit object hash.
    tree -a .git/refs

    .git/refs
    ├── heads
    └── tags

    2 directories, 0 files
The refs directory contains two sub-directories:
- heads—contains references to the head, or latest, commit object we want to name.
- tags—contains references to any object we want to give a human readable name.
We can set the head of our master branch (the default branch5 on which Git works, more on branches later) to our latest commit object:
    echo "e27aaa8c158e6f261f4c03aaaf173a149ad61d81" > .git/refs/heads/master
    git log --stat
We have to use the full hash when writing to the .git/refs/heads/master file.
    commit e27aaa8c158e6f261f4c03aaaf173a149ad61d81 (HEAD -> master)
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:13:55 2021 +0000

        Third commit

     file1.txt | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)

    commit 0715e707b906d30c9e395448ddc9e96acd89d5f7
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:11:12 2021 +0000

        Second commit

     another_file.txt | 1 +
     dir1/file11.txt  | 1 +
     file1.txt        | 2 +-
     3 files changed, 3 insertions(+), 1 deletion(-)

    commit f871b58596491e15ee1da91eaf0a4a6c1da3e573
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:07:13 2021 +0000

        First commit

     file1.txt | 1 +
     1 file changed, 1 insertion(+)
Issuing the log command without specifying the exact commit we are interested in causes Git to look up the refs entry of our current branch. (Actually it looks in .git/HEAD to find the ‘latest’ commit; since we have not moved from the default branch, this points to master. We look at .git/HEAD again in §2.6.1.)
    cat .git/HEAD
    cat .git/refs/heads/master

    ref: refs/heads/master

    e27aaa8c158e6f261f4c03aaaf173a149ad61d81
The master file in the .git/refs/heads directory contains the hash of the commit object we want to call the ‘head’ of our master branch.
Editing refs files directly is not ideal (not least because we can’t abbreviate the hash) so Git provides the update-ref command.
    git update-ref refs/heads/master 0715e7
    git log --stat

    commit 0715e707b906d30c9e395448ddc9e96acd89d5f7
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:11:12 2021 +0000

        Second commit

     another_file.txt | 1 +
     dir1/file11.txt  | 1 +
     file1.txt        | 2 +-
     3 files changed, 3 insertions(+), 1 deletion(-)

    commit f871b58596491e15ee1da91eaf0a4a6c1da3e573
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:07:13 2021 +0000

        First commit

     file1.txt | 1 +
     1 file changed, 1 insertion(+)
After we update the reference to the penultimate commit in our repository (0715e7) we curtail the log at that point and Git will treat that commit as the latest on the master branch.
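As a small sketch of what update-ref does with the hash we hand it, the following scratch-repository example sets a ref from an abbreviated hash, reads back what was stored, and then tries a hash that does not exist. The value `0000dead` is simply a made-up prefix that matches nothing, and the identity variables are placeholders.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

blob=$(echo 'version 1' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" file1.txt
commit=$(git commit-tree -m 'Only commit' "$(git write-tree)")

# Set the ref using only the first seven characters of the commit hash.
short=$(printf '%s' "$commit" | cut -c1-7)
git update-ref refs/heads/master "$short"

# Reading the ref back gives the full 40 character hash.
stored=$(git rev-parse --verify refs/heads/master)

# A hash that names no object is refused and the ref is left untouched.
if git update-ref refs/heads/master 0000dead 2>/dev/null; then
    bogus_accepted=yes
else
    bogus_accepted=no
fi
after_bogus=$(git rev-parse --verify refs/heads/master)

echo "stored: $stored (bogus hash accepted: $bogus_accepted)"
```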
We can easily restore the head of master using the update-ref command.
    git update-ref refs/heads/master e27aaa
    git log --stat

    commit e27aaa8c158e6f261f4c03aaaf173a149ad61d81 (HEAD -> master)
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:13:55 2021 +0000

        Third commit

     file1.txt | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)

    commit 0715e707b906d30c9e395448ddc9e96acd89d5f7
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:11:12 2021 +0000

        Second commit

     another_file.txt | 1 +
     dir1/file11.txt  | 1 +
     file1.txt        | 2 +-
     3 files changed, 3 insertions(+), 1 deletion(-)

    commit f871b58596491e15ee1da91eaf0a4a6c1da3e573
    Author: vagrant <vagrant@debian-10.7-amd64>
    Date:   Wed Mar 10 18:07:13 2021 +0000

        First commit

     file1.txt | 1 +
     1 file changed, 1 insertion(+)
The update-ref command is doing several things for us. Firstly, it allows us to use short hashes (phew!) and it checks that the hash we provide is valid. Secondly, you may have noticed new directories and files appearing in the repository.
    tree -a

    └── .git
        ├── branches
        ├── config
        ├── description
        ├── HEAD
        ├── hooks
        ├── index
        ├── info
        │   └── exclude
        ├── logs
        │   ├── HEAD
        │   └── refs
        │       └── heads
        │           └── master
The new logs directory contains two new files: HEAD and refs/heads/master. These contain a record of each time we modify a reference using update-ref. Each log entry records the old hash, the new hash, the user who made the change, and a timestamp for when the change was made. These ‘logs of ref changes’ can be viewed (and manipulated) using the reflog command.
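The following sketch shows the ref log growing as a branch ref is moved. It runs in a scratch repository, assumes Git's default behaviour of recording ref updates in a repository that has a workspace, and asks for the log of refs/heads/master explicitly so that it does not depend on what HEAD points to. The identity variables are placeholders.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

# Two commits, the second having the first as its parent.
blob=$(echo 'version 1' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" file1.txt
c1=$(git commit-tree -m 'First commit' "$(git write-tree)")
blob=$(echo 'version 2' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" file1.txt
c2=$(git commit-tree -m 'Second commit' -p "$c1" "$(git write-tree)")

# Move the branch ref twice; each move should add one log entry.
git update-ref refs/heads/master "$c1"
git update-ref refs/heads/master "$c2"

git reflog show refs/heads/master
entries=$(git reflog show refs/heads/master | wc -l | tr -d ' ')

# The @{n} notation reads earlier values of the ref back out of the log.
newest=$(git rev-parse --verify 'refs/heads/master@{0}')
previous=$(git rev-parse --verify 'refs/heads/master@{1}')
echo "$entries entries; current $newest, previously $previous"
```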
    git reflog

    e27aaa8 (HEAD -> master) HEAD@{0}:
    0715e70 HEAD@{1}:
This shows the history of changes we made to the master branch refs/heads/master using update-ref. As with the log command, by default, more recent changes are shown first (reverse chronological order).
The output may look like gibberish but it’s actually simple enough. Let’s break down the first line.
- e27aaa8: This is the new hash we set.
- (HEAD -> master): This tells us that HEAD currently refers to the master branch, and that master points at this commit. (Actually, HEAD is an indirect reference that points us to the ‘latest commit on the active branch’; more on this shortly.)
- HEAD@{0}: This is a commit reference; specifically, it says that we are referring to the ‘zeroth’ (latest) change relative to the ‘HEAD’ reference.
Okay, we should now be able to read the simpler second line with ease.
- 0715e70: The hash we set when this change was made.
- HEAD@{1}: This line refers to the ‘first’ change to HEAD, counting back from the current value (HEAD@{0}).
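To make the log entries a little more concrete, here is a minimal sketch that repeats the experiment in a throwaway repository and reads the raw log file directly. The scratch directory, the ‘Demo User’ identity and the variable names are my own inventions for illustration, and the sketch assumes Git's default file-based ref storage (so that .git/logs/refs/heads/master exists as a plain file).

```shell
# Scratch repository so nothing here touches the class repository.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master    # force the branch name used in this chapter
git config user.name "Demo User"           # hypothetical identity, local to this repo
git config user.email "demo@example.com"

git commit -q --allow-empty -m "First commit"
git commit -q --allow-empty -m "Second commit"
first=$(git rev-parse HEAD~1)
second=$(git rev-parse HEAD)

# Rewind master by one commit; update-ref appends an entry to the ref's log.
git update-ref -m "rewind for demo" refs/heads/master "$first"

# Raw entry: <old hash> <new hash> <name> <email> <timestamp> <timezone> TAB <message>
last_entry=$(tail -n 1 .git/logs/refs/heads/master)
echo "$last_entry"

# HEAD@{1} is the value HEAD had one change ago; here, the second commit.
previous=$(git rev-parse 'HEAD@{1}')
echo "$previous"
```

The last log line starts with the old hash (the second commit) followed by the new hash (the first commit), and HEAD@{1} resolves to the hash we just moved away from.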
2.5.1 Remote refs
There is one more type of ref we need to discuss: the ‘remote refs’. These are read-only refs in the sense that one does not manipulate them directly; they are maintained through interaction with remote repositories. As these have such a specialised use I’m going to leave a complete discussion to §5.1, after we have discussed working with multiple repositories in Chapter 5.
2.6 References (branches and tags)
Using the head refs in .git/refs/heads we can create named branches. In fact we have done so already: .git/refs/heads/master holds the reference to the latest commit (head) of the master branch.
There is nothing special about the master branch other than convention and that Git treats this as the default branch name in a new repository6.
We can create a new branch very easily: we just create a new .git/refs/heads entry.
git update-ref refs/heads/test_branch 0715e7
git log --stat test_branch
git log --stat

commit 0715e707b906d30c9e395448ddc9e96acd89d5f7
Author: vagrant <vagrant@debian-10.7-amd64>
Date: Wed Mar 10 18:11:12 2021 +0000

    Second commit

 another_file.txt | 1 +
 dir1/file11.txt | 1 +
 file1.txt | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

commit f871b58596491e15ee1da91eaf0a4a6c1da3e573
Author: vagrant <vagrant@debian-10.7-amd64>
Date: Wed Mar 10 18:07:13 2021 +0000

    First commit

 file1.txt | 1 +
 1 file changed, 1 insertion(+)
We create the new branch named test_branch from the second commit by creating the new ref refs/heads/test_branch. Now when we log that branch we see only the first and second commits, while logging the current default (‘master’) we see all three commits.
commit e27aaa8c158e6f261f4c03aaaf173a149ad61d81 (HEAD -> master)
Author: vagrant <vagrant@debian-10.7-amd64>
Date: Wed Mar 10 18:13:55 2021 +0000

    Third commit

 file1.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 0715e707b906d30c9e395448ddc9e96acd89d5f7
Author: vagrant <vagrant@debian-10.7-amd64>
Date: Wed Mar 10 18:11:12 2021 +0000

    Second commit

 another_file.txt | 1 +
 dir1/file11.txt | 1 +
 file1.txt | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

commit f871b58596491e15ee1da91eaf0a4a6c1da3e573
Author: vagrant <vagrant@debian-10.7-amd64>
Date: Wed Mar 10 18:07:13 2021 +0000

    First commit

 file1.txt | 1 +
 1 file changed, 1 insertion(+)
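If you would rather check the ‘two commits versus three’ claim without reading log output, the following sketch counts the commits reachable from each branch. It runs in a throwaway repository with empty commits; the identity and variable names are assumptions of mine, not part of the class repository.

```shell
# Scratch repository so nothing here touches the class repository.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master    # force the branch name used in this chapter
git config user.name "Demo User"           # hypothetical identity, local to this repo
git config user.email "demo@example.com"

for msg in "First commit" "Second commit" "Third commit"; do
    git commit -q --allow-empty -m "$msg"
done

# Create the branch by writing a new ref that points at the second commit.
git update-ref refs/heads/test_branch "$(git rev-parse HEAD~1)"

# rev-list walks the history from a ref; --count just reports how many commits it found.
branch_count=$(git rev-list --count test_branch)
master_count=$(git rev-list --count master)
echo "test_branch: $branch_count commits, master: $master_count commits"
```

Nothing was copied to create test_branch; both branches share the first two commit objects and differ only in the hash stored in their ref.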
2.6.1 HEAD
How does Git know that our currently active branch is master? There is a special file in .git called HEAD (we saw this in §2.5) that tells Git where the current default head commit is located. The HEAD therefore indicates which commit object will be the parent to the next commit object created. In this way Git will add the next commit object to the end of the currently active branch.
cat .git/HEAD
cat .git/refs/heads/master

ref: refs/heads/master

e27aaa8c158e6f261f4c03aaaf173a149ad61d81
Normally, as in this case, HEAD is an indirect reference to one of the refs/heads files, which is in turn a reference to the actual commit hash that we are to use as the current head (the current situation is illustrated in Figure 2.9).
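The two-step lookup can be done by hand. The sketch below follows HEAD to the ref it names and then reads the hash from that ref, comparing the result with what Git itself reports. It assumes a fresh repository in which the branch ref is still a plain (‘loose’) file under .git/refs/heads; the scratch directory and variable names are illustrative only.

```shell
# Scratch repository so nothing here touches the class repository.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master    # force the branch name used in this chapter
git config user.name "Demo User"           # hypothetical identity, local to this repo
git config user.email "demo@example.com"
git commit -q --allow-empty -m "First commit"

# Step 1: HEAD holds "ref: <path>"; strip the prefix to get the ref it names.
target=$(sed 's/^ref: //' .git/HEAD)

# Step 2: the named ref file holds the commit hash (assumes a loose, unpacked ref).
resolved=$(cat ".git/$target")

echo "HEAD -> $target -> $resolved"
```

The hash we arrive at is the same one that git rev-parse HEAD prints.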
We can change the branch to which HEAD refers (and consequently the branch on which we are working).
git log --oneline
echo "ref: refs/heads/test_branch" > .git/HEAD
git log --oneline
I’ve switched to using the --oneline option on log to keep the output short (I don’t think outputting the entire --stat output each time is adding any value here).
e27aaa8 (HEAD -> master) Third commit
0715e70 (test_branch) Second commit
f871b59 First commit

0715e70 (HEAD -> test_branch) Second commit
f871b59 First commit
Before the change, log outputs the master branch (HEAD -> master); after changing the content of .git/HEAD, log outputs test_branch (HEAD -> test_branch). We have effectively changed the default branch by changing the ref to which HEAD refers.
As with changing refs/heads files, manually editing the HEAD file is not ideal and Git provides the symbolic-ref command to make this safer.
git symbolic-ref HEAD
git log --oneline
git symbolic-ref HEAD refs/heads/master
git symbolic-ref HEAD
git log --oneline

refs/heads/test_branch

0715e70 (HEAD -> test_branch) Second commit
f871b59 First commit

refs/heads/master

e27aaa8 (HEAD -> master) Third commit
0715e70 (test_branch) Second commit
f871b59 First commit
First we view the current value of the symbolic reference HEAD, then we change that reference; note that we specify the path of the actual ref file (refs/heads/master).
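One concrete sense in which symbolic-ref is ‘safer’ than echo: on the Git versions I am aware of, it refuses to point HEAD at anything outside refs/, whereas echo will happily write nonsense into .git/HEAD. The sketch below tries a deliberately bad target (the name not-a-ref is made up) in a throwaway repository and checks that HEAD is left alone.

```shell
# Scratch repository so nothing here touches the class repository.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master    # force the branch name used in this chapter
git config user.name "Demo User"           # hypothetical identity, local to this repo
git config user.email "demo@example.com"
git commit -q --allow-empty -m "First commit"

# Ask symbolic-ref to point HEAD somewhere that is not under refs/.
if git symbolic-ref HEAD not-a-ref 2>/dev/null; then
    outcome="accepted"
else
    outcome="refused"
fi

echo "$outcome: HEAD still contains '$(cat .git/HEAD)'"
```

Had we used echo "not-a-ref" > .git/HEAD instead, the write would have succeeded and Git would no longer recognise the directory as a valid repository until HEAD was repaired.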
2.6.1.1 Detached HEAD
You may occasionally encounter a ‘detached HEAD’ message. This seems to cause much confusion online but is actually a very simple situation.
In some circumstances the HEAD symbolic reference will contain a hash value directly (i.e. not a reference to one of the refs/heads). This can arise for a number of reasons, among which the most common are:
- checkout of a commit directly using its hash
- checkout of a remote (more on these later)
- checkout of a tag (which we look at next).
We examine checkout in detail in §??; in the following, it is simply a way to ask Git to ‘get’ a commit object’s content and, more importantly, update the HEAD file.
git checkout f871b5
git symbolic-ref HEAD
cat .git/HEAD
Note: checking out 'f871b5'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at f871b59 First commit
Line 3 announces that we are in the ‘detached HEAD’ state.
fatal: ref HEAD is not a symbolic ref
We cannot look at the ‘symbolic-ref’ because HEAD is no longer a symbolic reference.
f871b597ef5160ab19556e42e8a5264d092ad2bc
In fact the .git/HEAD file now contains only the hash of the commit we checked out.
git checkout 0715e70
git symbolic-ref HEAD

Previous HEAD position was f871b59 First commit
HEAD is now at 0715e70 Second commit

fatal: ref HEAD is not a symbolic ref
If we check out the second commit directly (using its hash) we are simply informed that HEAD has moved to the new hash, and attempting to examine the symbolic-ref still results in an error.
git checkout test_branch
git symbolic-ref HEAD

Switched to branch 'test_branch'

refs/heads/test_branch
Checking out test_branch (which is the same commit, but now referred to by refs/heads/test_branch) ‘switches’ us to that branch: the very same commit is checked out, but HEAD is once again a symbolic-ref.
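Since the only difference between the attached and detached states is whether HEAD is a symbolic-ref, a script can tell them apart by asking symbolic-ref quietly (-q) and looking at its exit status. The helper below, head_state, is a hypothetical name of my own; the sketch runs in a throwaway repository rather than the class repository.

```shell
# Scratch repository so nothing here touches the class repository.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master    # force the branch name used in this chapter
git config user.name "Demo User"           # hypothetical identity, local to this repo
git config user.email "demo@example.com"
git commit -q --allow-empty -m "First commit"

# Report whether HEAD names a branch or holds a bare hash.
head_state() {
    if branch=$(git symbolic-ref -q --short HEAD); then
        echo "on branch $branch"
    else
        echo "detached at $(git rev-parse --short HEAD)"
    fi
}

on_branch=$(head_state)
git checkout -q "$(git rev-parse HEAD)"    # checking out a hash detaches HEAD
detached=$(head_state)
git checkout -q master                     # checking out a branch re-attaches it
reattached=$(head_state)

printf '%s\n' "$on_branch" "$detached" "$reattached"
```

The commit checked out never changes in this sketch; only the way HEAD refers to it does.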
2.6.2 Tags
It would be very useful to have a method of recalling a commit by name; for example, when we release versions of our project it would be good to be able to say “this commit is version 1.0 of my project”. Fortunately Git has tag references for just this purpose.
Tag references come in two types:
- Lightweight: these tags are similar to the branch references used above; they are simple records in the .git/refs/tags directory that point to specific commit objects (much as we have just seen .git/refs/heads do). Lightweight tags are typically private temporary names assigned by the user.
- Annotated: these are a new type of object, a tag object, that contains some metadata associated with the tag. An entry is then made in the .git/refs/tags directory referencing this tag object. Annotated tags are used for more public permanent tags, such as release commits.
To create a new lightweight tag we use update-ref.
git update-ref refs/tags/v1.0 f871b5
cat .git/refs/tags/v1.0

f871b58596491e15ee1da91eaf0a4a6c1da3e573
We can now refer to the commit object with hash f871b5 using the tag name v1.0. These lightweight tags are useful to assign ‘human readable’ names to Git objects we may be interested in, but we can create an annotated tag that includes additional information.
git tag -a v2.0 0715e7 -m "The second version"
cat .git/refs/tags/v2.0
git cat-file -t v2.0
git cat-file -p v2.0

a7edafabd57c0a3dc7788d021359083ae31d3826

tag

object 0715e707b906d30c9e395448ddc9e96acd89d5f7
type commit
tag v2.0
tagger vagrant <vagrant@debian-10.7-amd64> 1616259961 +0000

The second version
Here we use the git tag command with the -a (for annotate) option to tag commit object 0715e7 and add a comment with the -m option7. This creates a new object of type tag. Unlike other commands that we have seen that create objects, the tag command does not return the new object’s hash. This is not a problem as the tag is now a proxy for that tag object’s hash. We can specify either the hash or the new tag to cat-file to examine the new tag object. Looking inside that tag object we see that it is referencing the object 0715e7 (the one we tagged), that this object is a commit object, and that the tag object is for tag v2.0. The last text block is the comment provided in the -m option.
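The difference between the two kinds of tag shows up clearly if we ask what type of object each ref file points at. The following is a small sketch in a throwaway repository (the identity and variable names are assumptions for illustration); it creates one tag of each kind on the same commit and inspects the hash stored in each ref file.

```shell
# Scratch repository so nothing here touches the class repository.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master    # force the branch name used in this chapter
git config user.name "Demo User"           # hypothetical identity, local to this repo
git config user.email "demo@example.com"
git commit -q --allow-empty -m "First commit"
commit=$(git rev-parse HEAD)

git update-ref refs/tags/v1.0 "$commit"                 # lightweight: just a ref
git tag -a -m "The second version" v2.0 "$commit"       # annotated: ref plus tag object

# Read the hash stored in each ref file and ask Git for that object's type.
light_type=$(git cat-file -t "$(cat .git/refs/tags/v1.0)")
annotated_type=$(git cat-file -t "$(cat .git/refs/tags/v2.0)")

echo "v1.0 points at a $light_type object"
echo "v2.0 points at a $annotated_type object"
```

The lightweight tag's ref holds the commit hash itself, while the annotated tag's ref holds the hash of the intermediate tag object.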
The fact that the tag tracks the type of the object being tagged should be a clue that we can tag any object we like.
git tag -a Meta v2.0 -m "Meta tagging dude"
git cat-file -p Meta

object a7edafabd57c0a3dc7788d021359083ae31d3826
type tag
tag Meta
tagger vagrant <vagrant@debian-10.7-amd64> 1616260030 +0000

Meta tagging dude
Here we have created a tag (Meta) of a tag (v2.0). What is more, Git will do what you might expect: it follows this meta-tag down until an object of the type expected by the command is found.
git checkout master
git log --oneline
git log --oneline Meta

Switched to branch 'master'

d17958c (HEAD -> master) Third commit
0715e70 (tag: v2.0, tag: Meta, test_branch) Second commit
f871b58 (tag: v1.0) First commit
Notice that the tags associated with each commit object are also shown.
0715e70 (tag: v2.0, tag: Meta, test_branch) Second commit
f871b58 (tag: v1.0) First commit
Switching back to the master branch we can see the log history from the master HEAD contains three commits. Specifying the Meta tag as the revision from which we want to log we see only the two commits from Meta—even though Meta is actually a tag object (50945f) that refers to a tag object (a7edaf) that finally refers to a commit object (0715e7).
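We can ask Git to do this ‘following down’ explicitly. In the sketch below (a throwaway repository again, with names of my own choosing) the revision syntax Meta^{commit} tells rev-parse to dereference the tag repeatedly until it reaches a commit, which is the same peeling that log performed for us.

```shell
# Scratch repository so nothing here touches the class repository.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master    # force the branch name used in this chapter
git config user.name "Demo User"           # hypothetical identity, local to this repo
git config user.email "demo@example.com"
git commit -q --allow-empty -m "First commit"
commit=$(git rev-parse HEAD)

git tag -a -m "The second version" v2.0 "$commit"   # tag object -> commit
git tag -a -m "Meta tagging dude" Meta v2.0         # tag object -> tag object -> commit

direct=$(git rev-parse Meta)                # the Meta tag object itself
direct_type=$(git cat-file -t "$direct")
peeled=$(git rev-parse 'Meta^{commit}')     # dereference until a commit is reached

echo "Meta is a $direct_type object ($direct)"
echo "Meta peels to commit $peeled"
```

However many tag objects sit in the chain, the peeled result is the hash of the commit at the bottom.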
1In day-to-day use we will use high-level commands to interact with our repository but in this chapter we’re interested in learning what Git does under the hood.
2This is a lie! In Chapter 3 we will take a closer look at the index and learn why this lie is so often repeated.
3Note that we cannot create an empty tree object. This is the reason Git cannot store empty directories.
4If you want to be one of the cool kids you can point out that this structure (a tree in which each node hashes its children) is a Merkle tree.
5I tend to freely use ‘default branch’ to mean the ‘current default branch’ and ‘the branch Git will use absent any other branches’. This is lazy but I think context makes clear which is implied.
6In mid-2020 Git 2.28.0 provided the ability to change the default branch name (using the configuration setting init.defaultBranch). In October 2020 GitHub started using main rather than master in new repositories; GitLab announced a similar change in March 2021. This was in response to sensitivity about the use of ‘master’ in all forms due to its tangential association with slavery. Etymology is not the strong suit of the over-sensitive. Thankfully the use of master is not forbidden so this change can be largely ignored.
7The -a option is implied when -m is specified without -a,-s or -u, see man git-tag(1).