Commit._diffs_copied() is used to determine if a removed blob was actually moved or renamed, possibly with some changes. However, it is called every time a commit is viewed and hits every file removed from a commit, and it is slow enough to be a problem.
Some ideas for optimizing it:
- Use SequenceMatcher.real_quick_ratio() to get an upper bound on the ratio and exclude obvious non-matches quickly, probably followed up with quick_ratio() and/or ratio() to confirm a match.
- Compare against DIFF_SIMILARITY_THRESHOLD and break after a single match instead of continuing to test all files (though this could give false matches, so maybe not do this one).

Finally, we should almost certainly move this computation to compute_diffs() instead of doing it every time the commit's diffs are used.
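
The tiered check from the first idea can be sketched with the standard-library difflib module. This is a minimal illustration, not the actual _diffs_copied() code: the function name find_best_match, the dict of added blobs, and the threshold value of 0.5 are all assumptions made for the example. It tests every candidate and keeps the best one, so it avoids the false matches that breaking after the first hit could produce.

```python
from difflib import SequenceMatcher

# Assumed value for illustration only; the real DIFF_SIMILARITY_THRESHOLD
# is defined in the codebase and may differ.
DIFF_SIMILARITY_THRESHOLD = 0.5


def find_best_match(removed_text, added_blobs, threshold=DIFF_SIMILARITY_THRESHOLD):
    """Return the path of the added blob most similar to removed_text, or None.

    added_blobs is a hypothetical mapping of path -> text for files added
    in the same commit.
    """
    matcher = SequenceMatcher(None)
    # SequenceMatcher caches details about the second sequence, so set the
    # removed blob once as seq2 and vary seq1 per candidate.
    matcher.set_seq2(removed_text)

    best_path, best_ratio = None, 0.0
    for path, text in added_blobs.items():
        matcher.set_seq1(text)
        # A candidate must reach the threshold and beat the current best.
        floor = max(threshold, best_ratio)
        # Cheap upper bounds first: each rules out obvious non-matches
        # without paying for the full ratio() computation.
        if matcher.real_quick_ratio() < floor:
            continue
        if matcher.quick_ratio() < floor:
            continue
        ratio = matcher.ratio()
        if ratio >= threshold and ratio > best_ratio:
            best_path, best_ratio = path, ratio
    return best_path
```

Because real_quick_ratio() and quick_ratio() are upper bounds on ratio(), skipping a candidate when either falls below the floor never discards a real match. It only saves the expensive comparison.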
Also, currently, children of removed trees (or of the removed side of moved/renamed trees) are not included in the diff, to avoid hitting this performance issue too often. As a result, the added portion of a moved/renamed tree looks like brand new files. Once the performance of _diffs_copied() is more reasonable and/or its result is pre-computed, the removed-trees short-circuit in compute_diffs() needs to be removed.
We should really consider getting diff info from the SCM directly instead of trying to do it ourselves. (see 'indexless' tickets)
This was fixed (by using SCM directly instead of doing it ourselves) in [#7925]
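
As an illustration of what asking the SCM directly can look like, here is a rough sketch that assumes Git. The ticket does not say which SCM or which command [#7925] actually uses, and the helper names parse_name_status and renames_in_commit are hypothetical. Git's own rename detection (the -M option) reports renames with a similarity score, so no blob-by-blob comparison is needed on our side.

```python
import subprocess


def parse_name_status(output):
    """Parse `git diff-tree --name-status` output into a list of dicts.

    Rename/copy lines look like "R087<TAB>old<TAB>new"; other lines look
    like "D<TAB>path". Paths with unusual characters may be quoted by git
    unless -z is used, which this sketch does not handle.
    """
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0]
        kind = status[0]
        if kind in ("R", "C"):
            score = int(status[1:]) if status[1:] else None
            changes.append(
                {"kind": kind, "score": score, "old": fields[1], "new": fields[2]}
            )
        else:
            changes.append(
                {"kind": kind, "score": None, "old": fields[1], "new": fields[1]}
            )
    return changes


def renames_in_commit(repo_path, commit_id):
    """Return the renames git detects in a single (non-merge) commit."""
    result = subprocess.run(
        ["git", "diff-tree", "-r", "-M", "--no-commit-id", "--name-status", commit_id],
        cwd=repo_path,
        check=True,
        capture_output=True,
        text=True,
    )
    return [c for c in parse_name_status(result.stdout) if c["kind"] == "R"]
```

Root and merge commits need extra options (such as --root or -m), which are left out here to keep the sketch short.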
Related Tickets: #7925