In a git repo with a large number of binary files, our diff processing can be very inefficient. We should test whether a file is binary and exclude it from diff processing.
Tickets: #5733, #7537, #7814, #7918, #7949, #7963
QA: hs/7925
Binary files should no longer make XHR requests for diff processing.
I think we need to skip binaries sooner, not just on the display side. Server-side time plus the background task to "refresh" the repo still takes forever -- for example, inside the _diffs_copied method. We can also do better text detection using the existing has_html_view method, which checks several things to determine if a file is text. We might want to rename the method, or alias it, though.

Good notes. Based on your feedback I refactored paged_diffs to rely on the SCM system.
QA at: hs/7925 & hs/7925 on forgehg
Other notes:
Git has a few other interesting options for tweaking performance. For example -- we could use a diff processing threshold when searching for copies.
We also could further improve the visual indicators when displaying copies vs renames etc (but that may be better in another ticket).
We could specify the max number for the -C option. We could also make this configurable via the ini.
git diff-tree:
-l<num>
The -M and -C options require O(n^2) processing time where n is the number of potential rename/copy targets. This option prevents rename/copy detection from running if the number of rename/copy targets exceeds the specified number.
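Following these notes, the cap could be read from the ini and passed to git as -l&lt;num&gt;; a sketch under assumed names (the option key scm.git.rename_limit and the default of 500 are illustrative, not actual project settings):

```python
from configparser import ConfigParser

def rename_limit(ini_path, default=500):
    # Read the rename/copy detection cap from the ini; fall back to a
    # default if the file, section, or option is missing.
    cp = ConfigParser()
    cp.read(ini_path)
    return cp.getint('app:main', 'scm.git.rename_limit', fallback=default)

def diff_tree_cmd(a, b, limit):
    # -M/-C enable rename/copy detection; -l<num> disables that
    # detection once the number of candidates exceeds <num>,
    # avoiding the O(n^2) blowup described above.
    return ['git', 'diff-tree', '-r', '-M', '-C', '-l%d' % limit, a, b]
```

The command list could then be handed to subprocess, or to whatever git wrapper the repo code already uses.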
The results here are great, including the repo refresh backend logic. But this involves several changes, some quite big, so naturally a handful of tweaks are needed to polish it up:
general
- Use the has_html_view method's new functionality for fast binary detection?
- {'new': u'README.txt', 'old': u'README', 'diff': '', 'ratio': 1} shows up in the diff section, and it also says "Can't load diff"
- '' in many places?

hg & svn
- A [:] slice would be better on the for loop than the if line, right?

hg

git
- The Flan dir shows up as having changes. Nothing is shown for options.txt or bin/ or mods/, but they did have changes. You can see this with ?limit=1000. And if you use the default limit, the pages at the end are all blank.
- --find-copies-harder: an __init__.py file was copied to another, but really it's just a new file. And another file that is new but has a lot of test boilerplate, so git thinks it's a 56% similar copy. Thus I think we should drop --find-copies-harder.
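The [:] slice suggestion above might look like the following (a sketch with illustrative names; this is not the actual loop from the patch):

```python
def paged(items, start, limit):
    # Instead of testing the index on every iteration:
    #     for i, item in enumerate(items):
    #         if start <= i < start + limit:
    #             process(item)
    # slice first, then loop only over the requested page.
    page = []
    for item in items[start:start + limit]:
        page.append(item)  # per-item diff processing would go here
    return page
```

Besides reading more cleanly, the slice means the loop body runs at most `limit` times instead of once per item.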
And this ticket will also resolve [#7918] too. (Although I think the [:] loop issue is still causing a minor bug.)

Related Tickets: #7918

These are great notes.
I was on the fence about --find-copies-harder. I ended up using it because my testing showed slightly better results when detecting copies, but I did not consider (or test for) false positives.
Fixes on db/7925 on allura and forgehg repos. Followup ticket [#7949] for a few items.

Related Tickets: #7949

The changes you made looked really good and overall much cleaner.
Nice work!