
#5733 Improve performance of Commit._diffs_copied

Milestone: v1.3.2

Status: closed

Owner: nobody

Labels: performance (126) scm (10)

Component: General

Reviewer: nobody

Updated: 2015-08-11

Created: 2013-02-01

Creator: Cory Johns

Private: No

Commit._diffs_copied() is used to determine if a removed blob was actually moved or renamed, possibly with some changes. However, it is called every time a commit is viewed and hits every file removed from a commit, and it is slow enough to be a problem.

Some ideas for optimizing it:

* Short-circuit identical blob comparisons by comparing the blob hashes first, as is already done with trees.
* Use SequenceMatcher.real_quick_ratio() to get an upper bound on the ratio and exclude obvious non-matches quickly, probably followed by quick_ratio() and/or ratio() to confirm a match.
* Raise the DIFF_SIMILARITY_THRESHOLD and break after a single match instead of continuing to test all files (though this could give false matches, so maybe not this one).
* Exclude binary or particularly large blobs.
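
A minimal sketch of how the hash check, the size/binary exclusion, and the cheap-to-expensive ratio checks might be layered. It is not Allura's actual code: the `Blob` class, `make_blob`, `looks_binary`, `similarity`, the `MAX_BLOB_SIZE` cap, and the threshold value of 0.5 are all assumptions made up for illustration. Only the `difflib.SequenceMatcher` methods are real library calls.

```python
import hashlib
from dataclasses import dataclass
from difflib import SequenceMatcher

DIFF_SIMILARITY_THRESHOLD = 0.5  # assumed value, for illustration only
MAX_BLOB_SIZE = 1_000_000        # hypothetical size cap, in bytes


@dataclass
class Blob:
    """Hypothetical stand-in for a stored blob: content hash plus raw bytes."""
    sha: str
    data: bytes


def make_blob(data: bytes) -> Blob:
    # A real SCM already stores the hash; it is computed here only for the demo.
    return Blob(hashlib.sha1(data).hexdigest(), data)


def looks_binary(data: bytes) -> bool:
    # Simple heuristic: treat any blob containing a NUL byte as binary.
    return b"\0" in data


def similarity(removed: Blob, added: Blob) -> float:
    """Return a similarity score in [0, 1], doing as little work as possible.

    Returns 0.0 (not the true ratio) whenever a cheap check rules the pair out.
    """
    # 1. Identical hashes mean identical content: no diffing needed.
    if removed.sha == added.sha:
        return 1.0
    # 2. Skip binary or particularly large blobs.
    if looks_binary(removed.data) or looks_binary(added.data):
        return 0.0
    if max(len(removed.data), len(added.data)) > MAX_BLOB_SIZE:
        return 0.0
    a = removed.data.decode("utf-8", errors="replace")
    b = added.data.decode("utf-8", errors="replace")
    sm = SequenceMatcher(None, a, b)
    # 3. Both quick ratios are upper bounds on ratio(), so a value below the
    #    threshold proves the exact ratio is below it as well.
    if sm.real_quick_ratio() < DIFF_SIMILARITY_THRESHOLD:
        return 0.0
    if sm.quick_ratio() < DIFF_SIMILARITY_THRESHOLD:
        return 0.0
    # 4. Only now pay for the exact (and slow) computation.
    return sm.ratio()
```

A caller would treat a removed/added pair as a move or rename when `similarity(removed, added)` meets `DIFF_SIMILARITY_THRESHOLD`. The "break after a single match" idea is deliberately left out, since the ticket itself flags it as a possible source of false matches.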

Finally, we should almost certainly move this computation to compute_diffs() instead of doing it every time the commit's diffs are used.

Also, children of removed trees (or of the removed side of moved/renamed trees) are currently left out of the diff to avoid hitting this performance issue too often, which makes the added portion of a moved/renamed tree look like brand-new files. Once _diffs_copied() is reasonably fast and/or pre-computed, the short-circuit for removed trees in compute_diffs() needs to be removed.

Cory Johns - 2013-02-01

labels: --> performance, scm

Dave Brondsema - 2014-04-01

We should really consider getting diff info from the SCM directly instead of trying to do it ourselves. (see 'indexless' tickets)
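
As a rough illustration of what "getting diff info from the SCM directly" could look like for Git, the sketch below parses name-status output of the kind produced by a command such as `git diff-tree -r -M --name-status <commit>`, where `-M` asks Git to detect renames itself. The `parse_name_status` helper and its tuple layout are hypothetical, and this is not necessarily how Allura ended up doing it.

```python
from typing import List, Tuple


def parse_name_status(output: str) -> List[Tuple[str, str, str]]:
    """Parse Git name-status output into (status, old_path, new_path) tuples.

    Renames and copies arrive as "R<score>\\told\\tnew" / "C<score>\\told\\tnew";
    every other status has a single path, which is reported as both old and new.
    """
    changes = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # e.g. the commit-id line that diff-tree may print first
        status = fields[0][0]  # drop the similarity score, if any
        if status in ("R", "C") and len(fields) == 3:
            changes.append((status, fields[1], fields[2]))
        else:
            changes.append((status, fields[1], fields[1]))
    return changes
```

With this approach the rename detection, including the similarity scoring, is done by the SCM, so no per-blob comparison is needed on the application side.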


Dave Brondsema - 2015-08-11

status: open --> closed

Dave Brondsema - 2015-08-11

This was fixed (by using SCM directly instead of doing it ourselves) in [#7925]

Related

Tickets: ~~#7925~~


Dave Brondsema - 2015-12-08

Milestone: unreleased --> v1.3.2

Apache Allura™

Forge software for hosting software projects
