Version Control System -

Anonymous included in Git

2023-08-13 2025-02-28

Data Model

The following pseudo code can explain the git data model clearly.

Firstly, we will talk about the three models in git.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
// a file is a set of data
type blob = array<byte>

// a directory which concludes files and directories
type tree = map<string, tree | blob>

// every commit concludes a parent, metadata and top tree
type commit = struct {
    parent: array<commit>
    author: string
    message: string
    snapshot: tree
}

How should we understand the bolb, tree and commit?

In my opinion, the best way to understand these three data models in git is to follow the tutorial or experiment of the offical website(中文版).

I will introduce this tutorial from a macro perspective.

The experiment process is diveded into three parts.

Introduces the concept of blob

Use the command git-hash-object to understand the object id of blob
Use the process of editing the test.txt to understand the version control preliminarily
1. Create test.txt, then write some content (such as “version 1”) to it and last create a object id for it
2. Edit the content of test.txt (such as the content becomes to “version 2”) and create a new object id for it
3. Delete the original file test.txt, you will find you can use the content id to restore the content of test.txt

Introduce the concept of tree

Add the test.txt (the content is “version 1”) to index (staging area)
Create a object id for current staging area
Create a new file named new.txt
Add the test.txt (the content is “version 2”) and new.txt to index
Create a object id for current staging area
Add the first object of staging area to the second object of staging area

Learn about the concept of commit
- Commit the first tree
- Commit the second tree and specify the parent commit is the first commit
- Commit the third tree and specify the parent commit is the second commit

The commands which are used in experiment are placed in the commends at the end of the text

After the experiment, we will have a deeper understanding of these three data models. A premise is that git need to complete the requirements of version control.

At an abstract level, blob、tree and commit are three data models in git.
At an physical level, Blob、tree and commit are a ordinary binary file that saves the content and other information of file or directory. The uniqueness of them lies in the name, they were named by the hash value of the content and other information, which is calld object id of the file.

In order to get a depper understanding of these statements, we need to pay attention to the actual physical file of these models.

When we create a blob for a file, we will find that a new file shows up in the .git/objects. Although we delete the original file, we can use blob to restore it. In other words, the blob save the content of the file.
Creating a tree concludes two parts, the first one is to put some files or directories into staging area, after this operation, we will find a new file named index shows up in the .git. The second one is to create a tree for staging area, we will find a new file shows up in the .git/objects, its format is same as a blob.
When we create a commit for a tree (a commit can only be created for a tree), we will find a new file shows up in the .git/objects

Let us comb the logical relationship of blob、tree and commit. A blob can save the content of file, but it can’t save the filename and catalog relationship. So git introdces the tree, it can save the filename and catalog relationship. But tree can’t save other information such as commitor and commit time and we can’t understand the meaning of hash value (such who did what), so git introduces the commit, bases on the tree and add other useful information.

Secondly, let us see the concept of object in git.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
// a object in git may be a blob(file), tree(directory) or commit(or snapshot, a state at a certain moment)
type object = blob | tree | commit

// git maintenances a big dictionary, which uses a string as a key to index an object
objects = map<string, object>

// how does git store an object
def store(object):
    id = sha1(object)
    objects[id] = object

// how does git load an object
def load(id):
    return objects[id]

Although commit have some useful information, which can help people understand these commits, but we can only find a commit by its long and meaningless hash value string now. So git introduces the reference, it contains two parts of file. The first one is the .git/HEAD and the second one is the .git/refs/xxx. Simply speaking, reference is just a short and meaningful string and it refer to a commit. People can use this string to find a commit.

The so-called branch is a synonym for a reference. The difficulty is that top level of operation for branch is different from bottom level of operation. If you don’t want to confuse the understand, please remember that a reference is just a pointer, it uses a short string to simplify the long hash value string, although at top level, we always say commit current changes at this branch, a branch (reference) is not a real way.

If you have feeled puzzle about the relationship between different branches, a good way to comb them is that you can ignore the existed branches, what you are facing is only the current working directory. Actually, when we say that create a new branch and git commit at one branch, it means we create a new reference and a commit, then make the reference point to this commit and keep the original reference unchanged.

Thirdly, let us see the concept of reference in git.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
// git uses a long and meaningless string to index an object, which is difficult for human to memorise. 
// So it introduces a new concept, reference. A reference is a simple and custom string, which can substitude the hash string.
references = map<string, string>

def update_reference(name, id):
    references[name] = id

def read_reference(name):
    return references[name]

def load_reference(name_or_id):
    if name_or_id in references:
        return load(references[name_or_id])
    else:
        return load(name_or_id)

To insure current working stage, git introduces a HEAD, which can point to a reference (branch) or a commit. When the HEAD doesn’t point to a reference, it is called ‘detached HEAD’ state. When you use top level command git checkout, you will edit the HEAD, the .git/HEAD file. When you use top level command git branch, you will edit the branch/reference, the .git/refs/heads file.

staging area

暂存区目前比较难理解，尤其是其内容的变化不知道是如何进行的，以及tree该如何理解

Staging area is a middle state between current state and snapshot/commit. If you have created many modifications, and you only want to use some of them to create a snapshot. Now, using current state to create a new snapshot/commit directly is inconvenient,

Every staging area is a tree, it is like a directory, which concludes some files and/or some directories. Tree is also a object in git, so there is a hash string of tree in .git/objects. If you use the command git cat-file -p to get the value of a tree object, you will find its content is some files and/or some directories.

Movement

HEAD movement

Using relative reference

You can use ^ operator to move to a previous state, such as main^ and HEAD^. And you can use multiple operators to move to multiple previous states. To simplify the operation of multiple states movement, you can use ~ operator. For example, HEAD~2 and HEAD^^ are equivalent.

^ + <number> isn’t similar with ~ + <number>, its usage is shown in the following figure.

Branch/Commit movement

Merge branches

The begining state is shown in the following figure:

git merge xxx Assuming current branch is main, you want merge bugFix branch into main, you can excute git merge bugFix, the result is shown in the following figure.

git rebase xxx Assuming current branch is bugFix, you want merge bugFix branch and main and you want keep just one record line of commit, you can excute git rebase main, the result is shown in the following figure.

git rebase xxx means moving current branch to the xxx branch, git rebase yyy xxx means moving xxx branch to the yyy branch

An important and easily overlooked point is that excute an rebase command to a commit doesn’t mean just move this commit to the target commit, it will find the lowest common ancestor between source branch and target branch, move the source commits below the LCA to the target branch (below the latest commit) following the original index.

If the work of using command git rebase to adjust the branch is too complex, you can use the interactive version of rebase. Just add a extra parameter -i (interactive).

In fact, the interactive mode of rebase has many other usages, such as editting history commit messages.

We can execute git rebase -i HEAD~<the number of history commmit we want to view>, and we will see the following content

1
2
pick 7cb5fcd message1
pick 43457b5 message2

Change the pick to reword, store it and exit
Input the new message and store it

Revoke changes

The begining state is shown in the following figure:

git reset xxx

After you excute the command git reset HEAD^, the main back to the previous commit. What need we notice is that the c2 changes don’t disappear, these changes just don’t add to staging area.

git reset 的本质是重置当前分支的 HEAD 位置，根据定位的方式以及文件的改动方式，分为以下几种:

是否保留目标位置之后的修改：

想要完全丢弃当前分支上的所有更改（包括未提交的更改和暂存区的更改），并重置到某个指定的提交时，使用 git reset --hard
希望移动 HEAD 到某个提交，但希望保留修改并将其重新暂存时，使用 git reset --soft

定位的方式：

git reset --hard/--soft <hash value>
git reset --hard/--soft HEAD@{<value>}

此方式需要结合 git reflog (reference log) 获取信息，适用于更加久远的撤回

git revert xxx If we are editing a remote branch which need to be shared with other people, we need to use this command.

After we excute the command git revert HEAD

The difference between the two commands is that the latter command create a new commit, which can be push to the remote, so that other people can see the change.

Move the specify commit

git cherry-pick commit...

In software development, the term cherry-pick is used to describe the action of choosing specific commits from one branch and applying them to current branch. It is derived from the analogy of a tree with branches that have many cherries (commits).

The begining state is shown in the following figure

Now we are in the main branch, after we excute the command git cherry-pick bugFix, the bugFix branch will be applied into current main branch. The result is shown in the following figure.

If you select multiple commits when you use this command, it will pick the commit one by one, its role is similar with git rebose -i

Realistic scenarios

select a specific historical commit to commit

The scenario is that you just want to select the bugFix branch to merge with main and to commit.

You can switch to the main branch, then use the git cherry-pick bugFix to select the bugFix branch. The result is shown in the following figure.

edit the previous commit

Note: I still don’t know the way of datas change, when we use the git rebase -i to edit many commit.

Your task is that edit the newImage commit.

Firstly, excute git rebase -i HEAD~2, adjust the index of rebasing, get a result that is shown in the following figure.

Secondly, edit in the newImage branch, then excute git commit --amend.

Thirdly, excute git rebase -i HEAD~2, adjust the index of rebasing.

At last, switch to main branch and excute git rebase caption to move the main to the caption.

You can also use the git cherry-pick to solve this problem

Remote warehouse

git fetch

download commit records which don’t exist in local from remote
update the remote reference

git fetch is equivalent to download data from remote, it doesn’t edit the local date such as local reference

Parameters for the git-fetch

参数 git fetch origin foo: download the data from remote foo branch
:参数 download the remote source branch to the local destination branch, for example

If the source is null, the command will create a new local destination branch

git pull

Because git fetch just download the remote data, we need edit the local data manually. So git pull will complete two works of download remote data and merge the remote and local reference.

The local and remote structure is shown in the following figure.

After we excute the command git fetch, the remote reference go ahead one step.

Then we excute the command git merge o/main.

If we excute git pull, we can get the same result. All in all, git pull = git fetch + git merge

git push

git pull --rebase = git fetch + git rebase

Before push the local commit, maybe we need to handle some special situations. c3 bases on c1, but remote branch has gone to the c2, so direct git push operation fails.

We need git fetch to gain the remote warehouse data, then merge the data, at last push the data to the remote.

And we can also use the git merge to substitute the git rebase

Parameters for the git-push

git push <remote> <place> After setting up the mapping, you can use origin as the remote directly. Place means the source of data for commit to the remote warehouse.
git push origin : When we set the mapping relationship between source and destination at origin, we can use the command git push origin source directly, but if we don’t set it, we need give this information for git.

If the source is null, the command will delete the remote destination branch

Remote tracking

The remote tracking means that you can specify a branch to track a remote branch. Think about it, why when we commit an main branch, it will commit to the main branch in remote warehouse. The reason is that the local main branch maps to the original main branch. You can use git branch -u original/main side to set the side branch to track the remote main branch, after it, you can git push directly at side branch. The result of changing the map relationship is shown in the following figure.

.gitignore

需要注意的一点是

.gitignore 只能忽略那些原来没有被 track 的文件，如果某些文件已经被纳入了版本管理中，则修改 .gitignore 是无效的 ¹

There are two situations we need to think about:

If the file is not be tracked, we can add it into the .gitignore directly and we needn’t to do other things.
If the file is commited to the repository, we need to delete it first from the git. We can use git rm --cached or git rm -f command and then push the changes to the repository.

The difference between git rm --cached and git rm -f
git rm --cached 仅从 Git 的暂存区中删除文件，而不会删除工作目录中的实际文件，执行这个命令后，文件将被标记为未追踪状态，并且不再包含在下一次提交中，通常用于停止 Git 跟踪已经添加到版本控制中的文件，但希望保留在工作目录中的情况
git rm -f 从 Git 的暂存区中强制删除文件(-f)，并且同时从工作目录中删除实际的文件，执行这个命令后，文件将被完全从版本控制中移除，并且不再包含在下一次提交中，通常用于彻底删除文件，包括从版本控制中移除并从工作目录中清除

Next, we will give an example to illustrate the usage of these two commands

As we all know, there is a .DS_Store file in macOS，which is called macOS Desktop Services Store. If we don’t use a .gitignore, we will push it to the repository. Now, assuming we have already submitted this file to the repository, how can we do to delete it from the repository?

We can find all files named .DS_Store and the use the command git rm --cached .DS_Store or git rm -f .DS_Sotre (according to your read requirement) to delete them. It is suitable for situations with a small number of files.

If we have too many files to delete, we can combine other command line tools and git. For example, we can use find . -name .DS_Store -print0 | xargs -0 git rm -f --ignore-unmatch to delete all the .DS_Store files in current path. ²

global config of .gitignore

We can use git config --global core.excludesfile <.gitignore global config file path> to specify the global .gitignore file. ³ ⁴ ⁵

之所以添加多个参考，是因为目前对于 .gitignore 中的写法规则不是很明确，几个参考文章中给出的均不相同

Comments

The commands which are used in experiment

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
—————————————————————————————————
control the ordinary file

// create hash string of object from standord input and file
git hash-object -w --stdin
git hash-object -w test.txt

// get the value of object by hash string:
git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4

// judge the type of object:
git cat-file -t 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a

—————————————————————————————————
control the staging area

// index is a equivalent concept of staging area
// select a file and put it into the staging area
git update-index --add test.txt
//what is the meaning of parameter cacheinfo？
// Now, in the working area, you just have one file named test.txt, but this file has twe versions because of different file content, such as version 1 and version 2. 
// The file is version 2 now, but you only want to use version 1 to update index, so you can use the parameter cacheinfo to specify to use version 1. 
git update-index --add --cacheinfo 100644 83baae61804e65cc73a7201a7252750c76066a30 test.txt

// create an object id for staging area which concludes selected files and directories
git write-tree

—————————————————————————————————
// control the commit object

echo 'first commit' | git commit-tree d8329f
echo 'second commit' | git commit-tree 0155eb -p fdf4fc3

—————————————————————————————————
// control the reference

// create a new reference, the object must be a commit
git update-ref refs/heads/main <object id>

// determine which branch now
git symbolic-ref HEAD
// edit the HEAD
git symbolic-ref HEAD refs/haeds/other

常用命令记录

git config 配置

查看所有配置: git config --list
查看系统级别配置: git config --system --list
查看全局级别配置: git config --global --list
查看仓库级别配置: git config --local --list
查看特定配置项: git config user.email

分支管理

分支设置方案参考 Git分支管理策略

删除分支：git branch -d <branch-name>
创建远程分支：git push --set-upstream origin <branch-name>
更为精准的分支变基：git rebase --onto <target-branch> <old-branch> <need-to-move-branch>

⁶

目标是只将 client branch rebase 到 master 分支，如果直接执行 git rebase master 则会将 server 的节点一并 rebase 到 master，添加 --onto 可以更为精准的控制

保留原分支节点的分支合并: git merge --no-ff add_<branch-name>

在不使用 --no-ff 选项，即采用 “fast-forward merge” 时，被合并节点将和 merge 节点合并为同一个节点，采用 --no-ff 则可以保留被合并节点

⁷

git rebase 时如果需要查看历史提交的结构，可以使用 git log --graph --decorate(使用 --oneline 会使得输出信息更加精简, --all 会显示所有分支)

Version Control System