In a git repo with a large number of binary files, our diff processing can be very inefficient. We should test whether a file is binary and exclude it from diff processing.
Tickets: #5733, #7537, #7814, #7918, #7949, #7963
QA: hs/7925
Binary files should no longer make XHR requests for diff processing.
I think we need to skip binaries sooner, not just on the display side. Server-side time plus the background task to "refresh" the repo still takes forever -- for example, inside the _diffs_copied method. We can also do better text detection using the existing has_html_view method, which checks several things to determine if a file is text. We might want to rename the method, or alias it, though.

Good notes. Based on your feedback I refactored paged_diffs to rely on the SCM system.
QA at: hs/7925 & hs/7925 on forgehg
Other notes:
Git has a few other interesting options for tweaking performance. For example -- we could use a diff processing threshold when searching for copies.
We also could further improve the visual indicators when displaying copies vs renames etc (but that may be better in another ticket).
We could specify the max number for the -C option. We could also make this configurable via the ini.
git diff-tree:
-l<num>
The -M and -C options require O(n^2) processing time where n is the number of potential rename/copy targets. This option prevents rename/copy detection from running if the number of rename/copy targets exceeds the specified number.
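Following these notes, the cap could be read from the ini and passed to git as -l&lt;num&gt;; a sketch under assumed names (the option key scm.git.rename_limit and the default of 500 are illustrative, not actual project settings):

```python
from configparser import ConfigParser

def rename_limit(ini_path, default=500):
    # Read the rename/copy detection cap from the ini; fall back to a
    # default if the file, section, or option is missing.
    cp = ConfigParser()
    cp.read(ini_path)
    return cp.getint('app:main', 'scm.git.rename_limit', fallback=default)

def diff_tree_cmd(a, b, limit):
    # -M/-C enable rename/copy detection; -l<num> disables that
    # detection once the number of candidates exceeds <num>,
    # avoiding the O(n^2) blowup described above.
    return ['git', 'diff-tree', '-r', '-M', '-C', '-l%d' % limit, a, b]
```

The command list could then be handed to subprocess, or to whatever git wrapper the repo code already uses.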
The results here are great, including the repo refresh backend logic. But this involves several changes, some quite big, so naturally a handful of tweaks are needed to polish it up:
general
- Use the has_html_view method's new functionality for fast binary detection?
- {'new': u'README.txt', 'old': u'README', 'diff': '', 'ratio': 1} shows up in the diff section, and it also says "Can't load diff"
- '' in many places?

hg & svn
- A [:] slice would be better on the for loop than the if line, right?

hg

git
- The Flan dir shows up as having changes. Nothing is shown for options.txt or bin/ or mods/, but they did have changes. You can see this with ?limit=1000. And if you use the default limit, the pages at the end are all blank.
- --find-copies-harder: an __init__.py file was copied to another, but really it's just a new file. And another file that is new but has a lot of test boilerplate, so git thinks it's a 56% similar copy. Thus I think we should drop --find-copies-harder.
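The [:] slice suggestion above might look like the following (a sketch with illustrative names; this is not the actual loop from the patch):

```python
def paged(items, start, limit):
    # Instead of testing the index on every iteration:
    #     for i, item in enumerate(items):
    #         if start <= i < start + limit:
    #             process(item)
    # slice first, then loop only over the requested page.
    page = []
    for item in items[start:start + limit]:
        page.append(item)  # per-item diff processing would go here
    return page
```

Besides reading more cleanly, the slice means the loop body runs at most `limit` times instead of once per item.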
And this ticket will also resolve [#7918] too. (Although I think the [:] loop issue is still causing a minor bug.)

Related Tickets: #7918

These are great notes.
I was on the fence about --find-copies-harder. I ended up using it because my testing showed slightly better results when detecting copies, but I did not consider (or test for) false positives.
Fixes on db/7925 on allura and forgehg repos. Followup ticket [#7949] for a few items.

Related Tickets: #7949

The changes you made looked really good and overall much cleaner.
Nice work!