Commit._diffs_copied() is used to determine if a removed blob was actually moved or renamed, possibly with some changes. However, it is called every time a commit is viewed and hits every file removed from a commit, and it is slow enough to be a problem.
Some ideas for optimizing it:
Use SequenceMatcher.real_quick_ratio() to get the upper bound on the ratio and exclude obvious non-matches quickly, probably followed up with ratio() to confirm a match
DIFF_SIMILARITY_THRESHOLD and break after a single match instead of continuing to test all files (though this could give false matches, so maybe not do this one)
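The first idea can be sketched roughly as follows. This is a minimal illustration, not the existing implementation: the helper name similarity_if_match and the threshold value of 0.5 are assumptions made for the example, and the real DIFF_SIMILARITY_THRESHOLD may differ.

```python
from difflib import SequenceMatcher

# Assumed value for illustration only; the project's actual threshold may differ.
DIFF_SIMILARITY_THRESHOLD = 0.5


def similarity_if_match(removed_text, added_text,
                        threshold=DIFF_SIMILARITY_THRESHOLD):
    """Hypothetical helper: return the similarity ratio if it meets the
    threshold, otherwise None."""
    matcher = SequenceMatcher(None, removed_text, added_text)
    # real_quick_ratio() is a cheap upper bound on ratio() based only on the
    # lengths of the two sequences, so anything below the threshold here
    # cannot be a match.
    if matcher.real_quick_ratio() < threshold:
        return None
    # Only pay for the expensive comparison on plausible candidates.
    ratio = matcher.ratio()
    return ratio if ratio >= threshold else None
```

Because real_quick_ratio() only looks at lengths, it rejects pairs of very different size almost for free, but it cannot reject unrelated files of similar size; those still need ratio().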
Finally, we should almost certainly move this computation to compute_diffs() instead of doing it every time the commit's diffs are used.
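One possible shape for a one-time computation is sketched below. The function find_renames, its dict-of-path-to-text inputs, and the default threshold are all hypothetical; how compute_diffs() would actually obtain blob contents and store the result is not covered here.

```python
from difflib import SequenceMatcher


def find_renames(removed, added, threshold=0.5):
    """Hypothetical helper: map each removed path to its best-matching
    added path.

    removed, added: dicts of path -> text content (assumed shape).
    Returns {old_path: {"new": new_path, "ratio": ratio}} for matches only.
    """
    renames = {}
    for old_path, old_text in removed.items():
        best_path, best_ratio = None, threshold
        for new_path, new_text in added.items():
            matcher = SequenceMatcher(None, old_text, new_text)
            # Skip candidates whose upper bound cannot beat the best so far.
            if matcher.real_quick_ratio() < best_ratio:
                continue
            ratio = matcher.ratio()
            if ratio >= best_ratio:
                best_path, best_ratio = new_path, ratio
        if best_path is not None:
            renames[old_path] = {"new": best_path, "ratio": best_ratio}
    return renames
```

Called once when the diffs are computed, the returned mapping could be saved alongside them so that viewing a commit only reads the stored result. This version tests every added file and keeps the best match rather than breaking on the first one, which avoids the false-match risk noted above.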
Also, children of removed trees (or of the removed side of moved/renamed trees) are currently not included in the diff, to avoid hitting this performance issue too often. As a result, the added portion of a moved/renamed tree looks like brand-new files. Once the performance of _diffs_copied() is more reasonable and/or its result is pre-computed, the removed-trees short-circuit in compute_diffs() needs to be removed.