在一段时间间隔内迭代Git提交历史记录时Git rev-list变慢

1rhkuytd 于 2022-12-02 发布在 Git

关注(0)|答案(3)|浏览(117)

我编写了一个工具，使用git rev-list -n1 --before=X以固定的时间间隔迭代Git提交历史，这样我就可以看到每年、每月等的最后一次修订。
问题是rev-list在每次调用时都会启动一个新的版本检查，这需要更长的时间。

$ time git rev-list -n1 --before="Jan 1 2016" HEAD
a889331d759453fa7f424330f75ae4e2b9e02db4

real    0m1.395s
user    0m1.367s
sys 0m0.024s

$ time git rev-list -n1 --before="Jan 1 2015" HEAD
5b5e76218fbdbb71a01d5480f289ead624232876

real    0m2.349s
user    0m2.306s
sys 0m0.036s

$ time git rev-list -n1 --before="Jan 1 2005" HEAD

real    0m5.556s
user    0m5.435s
sys 0m0.105s

如果我想在N个递减日期的循环中调用rev-list，该循环运行N次遍历，每次迭代花费更长的时间。文档讨论了位图和对象遍历策略来加快历史，但我很难理解它们。我尝试了git repack -ab，然后是git rev-list --use-bitmap-index，但结果没有改善。
我唯一的要求是，给定HEAD的任何位置，我都可以准确地确定在给定日期--before之前出现的第一个修订版本，如果需要，可以沿着祖先的路径。
对于此使用情形，提高rev-list速度的最佳方法是什么？

Git

来源：https://stackoverflow.com/questions/44403419/slow-git-rev-list-when-iterating-git-commit-history-over-time-intervals

3条答案

按热度按时间

uqcuzwp81#

如果你的时间复杂度是O（N^2）的话，那么你的时间复杂度是O（N^2）。
生成一个包含提交id和日期的列表，然后去掉不需要的内容，并从选定的sha中生成真实的的日志消息。

git log --first-parent --pretty=%H\ %cd --date=short \
| awk '$2$3 != last { last=$2$3; print $1}' 'FS=[- ]' \
| git log --no-walk --stdin

这在linux repo的冷缓存中花费了15秒，用spinny-things硬盘，列出了147个提交，重新运行不到一秒。
编辑：在--date-order中为--first-parent添加子缓存以考虑所有路径需要25.1秒的冷缓存时间，7.9秒的热缓存时间，以列出782个提交。

赞(0）回复(0）举报 2022-12-02

eyh26e7m2#

在一般情况下，除了这两个选项外，别无选择：

支付步行的全部费用，或
存储的信息要比只请求一个节点时git rev-list发出的信息多得多

因为图遍历有效地线性化（通过优先级队列，以日期作为优先级）一个任意的浓密树（这个“树”是通过切断重连接而形成的，通过简单地不重新访问DAG中的任何已经访问过的节点）。例如，假设我们有一个提交DAG看起来很像这样。假设所有的方向弧都指向左（直接向左，或向上和向左，或向下和向左）。

A-....--o
      /         \
...--o----...----o   <-- HEAD
      \         /
       B--...--o

其中有两个或三个点的任何地方都有任意数量的节点。您的选择之一可能会选择节点A作为满足--before条件的第一个访问的节点。节点B可能早于A，因此可能是从HEAD开始的较早的--before选择的节点（因为节点B可以从HEAD到达）。但是节点B不能从节点A到达，所以通过简单地从节点A开始，将永远不会找到节点B。
如果您以某种方式让git rev-list转储出它当前的优先级队列内容（加上关于它迄今为止访问过的所有节点的信息--尽管这最终并不需要得到相同的 * 结果 *;这将仅仅是用于潜在地加速某些节点修整的选项），您可以从这些点重新开始其图行走操作，以便搜索B，从而避免重新访问上述大网格-y区域中的许多节点。
如果您严格限制图遍历，例如使用--first-parent，您可以简单地从最近发现的提交重新开始遍历，因为您知道队列深度始终为1，因此下一个要访问的节点是您先前发现的节点的第一个父节点（您可以将其保留为git rev-list）。

赞(0）回复(0）举报 2022-12-02

0x6upsns3#

你可以再试一次同样的命令，添加新的--filter=tree:0选项，这样应该可以加快创建所需提交列表的速度。
这是因为在Git 2.20（Q4 2018）中，“rev-list --filter“特性学会了通过“tree:0“过滤器排除所有树。
请参见commit 8b10a20（2018年10月18日）、commit d9e6d09（2018年10月12日）、commit bc5975d、commit cc0b05a、commit 696aa73、commit 99c9aa9、commit 7c0fe33（2018年10月5日）、commit f1d02da（2018年8月15日）和commit 9202489、commit f447a49（2018年8月13日）。
（由Junio C Hamano -- gitster --合并至commit 77d5037，2018年10月30日）

`list-objects`：支持跳过树遍历

tree:0过滤器不需要遍历它过滤掉的树，所以优化list-objects和list-objects-filter以完全跳过遍历树。

列表对象过滤器：实现筛选器树：0

教列表对象“tree:0“过滤器，它允许过滤掉所有的树和blob对象（除非用户明确指定了其他对象）。这个补丁的目的是允许较小的部分克隆。
这个过滤器的名称tree:0并没有明确地指定它也过滤掉所有的blob，但是这应该不会引起太多的混乱，因为如果没有引用blob的树，blob根本就没有用。
我也考虑过将only:commits作为一个名称，但这是不准确的，因为它暗示省略了带注解的标记，但实际上包含了它们。
名称“tree:0“允许稍后基于深度进行过滤，即“tree:1“将过滤掉除根树和斑点之外的所有内容。
为了避免0和大写O之间的混淆，文档的措辞有些迂回，这也暗示了该功能的未来改进。
请注意，Git 2.22（Q2 2019）将修复一个回归，其中“is this object available to us?“检查已知对象，如空树（即使空树在磁盘上没有对象，也应该得到“是”），已被更正。
参见Jeff King ( peff )的commit f06ab02（2019年3月4日）。
（2019年3月20日由commit 83b13e2合并于commit 83b13e2）

`rev-list`：允许在存在性检查中使用缓存对象

这修复了7c0fe33（修订列表：正确处理丢失的树对象，2018-10-05），其中rev-list现在将报告空树，当它在磁盘上物理上不存在时。
在提交之前，我们依赖list-objects.c中的遍历代码遍历树，因为它使用parse_tree()，我们将执行一个普通的对象查找，包括在“缓存”对象集中查找（这是我们神奇的内部空树的作用）。
在提交之后，我们告诉list-objects.c不要死在任何缺失的树上，然后我们自己用has_object_file()检查它们，但是那个函数使用OBJECT_INFO_SKIP_CACHED，这意味着我们不会使用内部的空树。
对于大多数操作，Git会像对待其他对象一样，尝试写出空树对象。而push或fetch中的pack-objects会发送空树（即使它在发送端是虚拟的）。
然而，在某些情况下，这一点可能很重要。我在野外发现的一个例子是：
1.通过删除所有文件，而不使用索引，提交的根树会变成空的。在这个例子中，这是使用libgit 2的树构建器API完成的，但正如所包含的测试所示，使用普通git使用hash-object也可以很容易地完成。
最终的repo工作正常，因为我们避免了遍历我们自己的可达提交来进行连通性检查。
1.用--reference指向（1）中的仓库进行克隆可能会引发这个问题，因为我们告诉另一端我们已经有了那个提交（因此是空树），但是在连接性检查期间却忽略了它（我们抱怨它丢失了）。
可以说，如果我们引用了空树对象，那么步骤（1）中的工作流在写入空树对象时应该更加小心。但是这个工作流在7c0fe33之前确实工作过，所以让我们恢复它。
Git 2.22中修复的另一个回归：“错误的”对象出现在期望不同类型的对象的地方，而不是盲目地假设对象之间的连接是正确的。
参见Taylor Blau ( ttaylorr )的commit b49e74e、commit 23c2044、commit 0616617（2019年4月10日）和commit 5c07647（2019年4月5日）。
请参见commit 97dd512、commit ee4dfee、commit 8348766（2019年4月10日），作者为Jeff King ( peff )。
（由Junio C Hamano -- gitster --合并于commit ea2dab1，2019年5月8日）

修订列表：不使用--missing时，让遍历停止

Commit 7c0fe33（rev-list：2018-10-05，Git v2.20.0-rc 0）教导git-rev-list使用遍历机制来忽略缺失的树，这样rev-list就可以自己处理它们。
但是，它只能通过oid_object_info_extended()检查对象是否存在来完成此操作。
这可能会遗漏rev-list先前侦测到的数个错误类别：

类型不匹配（例如，我们期望的是树，但得到的是blob）
无法读取对象数据（例如，由于磁盘上的位损坏）

This is especially important because we use " rev-list --objects " as our connectivity check to admit new objects to the repository, and it will now miss these cases (though the bitrot one is less important here, because we'd typically have just hashed and stored the object).
Git 2.23 (Q3 2019) fixes another bug: The filter_data used in the list-objects-filter (which manages a lazily sparse clone repository) did not use the dynamic array API correctly---' nr ' is supposed to point at one past the last element of the array in use.
See commit 7140600 (31 May 2019) by Matthew DeVore ( matvore ) .
(Merged by Junio C Hamano -- gitster -- in commit 34032c4 , 21 Jun 2019)

`list-objects-filter` : correct usage of `ALLOC_GROW`

In the sparse filter data, array_frame array is used in a way such that nr is the index of the last element.

Fix this so that nr is actually the number of elements in the array.

The filter_sparse_free function also has an unaddressed TODO to free the memory associated with the sparse filter data.
Address that TODO and fix the memory leak.
With Git 2.24 (Q4 2019), The name of the blob object that stores the filter specification for sparse cloning/fetching was interpreted in a wrong place in the code, causing Git to abort: this has been fixed.
See commit a4cafc7 , commit 4c96a77 (15 Sep 2019) by Jeff King ( peff ) .
See commit cf34337 , commit 72de589 (15 Sep 2019) by Jon Simons ( simonsj ) .
(Merged by Junio C Hamano -- gitster -- in commit ad8f036 , 07 Oct 2019)
Git 2.24 (Q4 2019) will make the object traversal faster!
It has been optimized not to load tree objects when we are only interested in commit history.
See commit 72ed80c (12 Sep 2019) by Jeff King ( peff ) .
(Merged by Junio C Hamano -- gitster -- in commit bbfe5f2 , 07 Oct 2019)

`list-objects` : don't queue root trees unless `revs->tree_objects` is set

When traverse_commit_list() processes each commit, it queues the commit's root tree in the pending array.
Then, after all commits are processed, it calls traverse_trees_and_blobs() to walk over the pending list, calling process_tree() on each.
But if revs->tree_objects is not set, process_tree() just exists immediately!
We can save ourselves some work by not even bothering to queue these trees in the first place.
There are a few subtle points to make:

we also detect commits with a NULL tree pointer here.

But this isn't an interesting check for broken commits, since the lookup_tree() we'd have done during commit parsing doesn't actually check that we have the tree on disk.
So we're not losing any robustness.

besides queueing, we also set the NOT_USER_GIVEN flag on the tree object.

This is used by the traverse_commit_list_filtered() variant.
But if we're not exploring trees, then we won't actually care about this flag, which is used only inside process_tree() code-paths.

queueing trees eventually leads to us queueing blobs, too.

But we don't need to check revs->blob_objects here.
Even in the current code, we still wouldn't find those blobs, because we'd never open up the tree objects to list their contents.

the user-visible impact to the caller is minimal.

The pending trees are all cleared by the time the function returns anyway, by traverse_trees_and_blobs() .
We do call a show_commit() callback, which technically could be looking at revs->pending during the callback.
But it seems like a rather unlikely thing to do (if you want the tree of the current commit, then accessing the tree struct member is a lot simpler).
So this should be safe to do.
Let's look at the benefits:

[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      7.651 s ±  0.021 s    [User: 7.399 s, System: 0.252 s]
Range (min … max):    7.607 s …  7.683 s    10 runs

[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      7.593 s ±  0.023 s    [User: 7.329 s, System: 0.264 s]
Range (min … max):    7.565 s …  7.634 s    10 runs

Not too impressive, but then we're really just avoiding sticking a pointer into a growable array.
But still, I'll take a free 0.75% speedup.
Let's try it after running " git commit-graph write ":

[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      1.458 s ±  0.011 s    [User: 1.199 s, System: 0.259 s]
Range (min … max):    1.447 s …  1.481 s    10 runs

[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      1.126 s ±  0.023 s    [User: 896.5 ms, System: 229.0 ms]
Range (min … max):    1.106 s …  1.181 s    10 runs

Now that's more like it. We saved over 22% of the total time.
Part of that is because the runtime is shorter overall, but the absolute improvement is also much larger.
What's going on?
When we fill in a commit struct using the commit graph, we don't bother to set the tree pointer, and instead lazy-load it when somebody calls get_commit_tree() .
So we're not only skipping the pointer write to the pending queue, but we're skipping the lazy-load of the tree entirely.
Note that with Git 2.34 (Q4 2021), the method traverse_trees_and_blobs has been renamed as traverse_non_commits .
See commit b3e36df (12 Aug 2021) by Teng Long ( dyrone ) .
(Merged by Junio C Hamano -- gitster -- in commit 0d4f46b , 30 Aug 2021)

`list-objects.c` : rename `"traverse_trees_and_blobs"` to `"traverse_non_commits"`

Signed-off-by: Teng Long
Function traverse_trees_and_blobs not only works on trees and blobs, but also on tags, the function name is somewhat misleading.
This commit rename it to traverse_non_commits .
Git 2.39 (Q4 2022) include parse_object() hardening when checking for the existence of a suspected blob object.
See commit 8db2dad , commit 04fb962 (17 Nov 2022) by Jeff King ( peff ) .
See commit 40286ca (21 Nov 2022) by Ævar Arnfjörð Bjarmason ( avar ) .
(Merged by Junio C Hamano -- gitster -- in commit ba88f8c , 28 Nov 2022)

`parse_object()` : check on-disk type of suspected blob

Signed-off-by: Jeff King
Signed-off-by: Taylor Blau
In parse_object() , we try to handle blobs by streaming rather than loading them entirely into memory.
The most common case here will be that we haven't seen the object yet and check oid_object_info() , which tells us we have a blob.

但我们在另一个案例中触发了这段代码：当我们有一个OBJ_BLOB类型的内存中对象结构时（并且没有设置“parsed”标志，因为否则我们会提前从函数返回）。
这表明代码的其他部分怀疑我们有blob（例如，它被树或标记提及），但我们还没有查看磁盘上的副本。
我对条件语句进行了一些修改，将其改为：

if ((suspected_blob && oid_object_info() == OBJ_BLOB)
    (no_clue && oid_object_info() == OBJ_BLOB)

我们有更简单的：

if ((suspected_blob || no_clue) && oid_object_info() == OBJ_BLOB)

这段比较短，但也反映了我们真正想要说的，即“我们是否排除了这是一个斑点;如果没有，请在磁盘上检查”。
因此，这修复了0616617中的一个延迟expect_failure情况（“t：引入对意外对象类型的测试”，2019-04-09，Git v2.22.0-rc 0--merge列于batch #8中）。
在提交之前，我们会悄悄地检查sha1并将blob标记为“已解析”。
现在，我们正确地抱怨了不匹配（“hash mismatch“）。

赞(0）回复(0）举报 2022-12-02

我来回答

在一段时间间隔内迭代Git提交历史记录时Git rev-list变慢

3条答案

`list-objects`：支持跳过树遍历

列表对象过滤器：实现筛选器树：0

`rev-list`：允许在存在性检查中使用缓存对象

修订列表：不使用--missing时，让遍历停止

`list-objects-filter` : correct usage of `ALLOC_GROW`

`list-objects` : don't queue root trees unless `revs->tree_objects` is set

`list-objects.c` : rename `"traverse_trees_and_blobs"` to `"traverse_non_commits"`

`parse_object()` : check on-disk type of suspected blob

相关问题

热门标签

最新问答

在一段时间间隔内迭代Git提交历史记录时Git rev-list变慢

3条答案

list-objects：支持跳过树遍历

列表对象过滤器：实现筛选器树：0

rev-list：允许在存在性检查中使用缓存对象

修订列表：不使用--missing时，让遍历停止

list-objects-filter : correct usage of ALLOC_GROW

list-objects : don't queue root trees unless revs->tree_objects is set

list-objects.c : rename "traverse_trees_and_blobs" to "traverse_non_commits"

parse_object() : check on-disk type of suspected blob

相关问题

热门标签

最新问答

`list-objects`：支持跳过树遍历

`rev-list`：允许在存在性检查中使用缓存对象

`list-objects-filter` : correct usage of `ALLOC_GROW`

`list-objects` : don't queue root trees unless `revs->tree_objects` is set

`list-objects.c` : rename `"traverse_trees_and_blobs"` to `"traverse_non_commits"`

`parse_object()` : check on-disk type of suspected blob