#7925 Speed up diff processing with binary files

v1.3.2
closed
General
Heith Seewald
2015-08-10
2015-07-13
No

In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.

Related

Tickets: #5733
Tickets: #7537
Tickets: #7814
Tickets: #7918
Tickets: #7949
Tickets: #7963

Discussion

  • Dave Brondsema

    Dave Brondsema - 2015-07-13
    • labels: --> sf-2, sf-current
     
  • Heith Seewald

    Heith Seewald - 2015-07-16

    QA: hs/7925

    Binary files should no longer make XHR requests for diff processing.

     
  • Heith Seewald

    Heith Seewald - 2015-07-16
    • status: in-progress --> review
     
  • Dave Brondsema

    Dave Brondsema - 2015-07-21
    • labels: sf-2, sf-current --> sf-2, sf-current, performance
    • status: review --> in-progress
    • Reviewer: Dave Brondsema
     
  • Dave Brondsema

    Dave Brondsema - 2015-07-21

    I think we need to skip binaries sooner, not just on the display side. Server-side time plus the background task to "refresh" the repo takes forever still. Like inside the _diffs_copied method.

    Also we can do better text detection, using the existing has_html_view method. It checks several things to determine if its text. Might want to rename the method, or alias it, though.

     
  • Heith Seewald

    Heith Seewald - 2015-07-27
    • status: in-progress --> review
     
  • Heith Seewald

    Heith Seewald - 2015-07-27

    Good notes. Based on your feedback I refactored paged_diffs to now rely on the SCM system.

    QA at:

    hs/7925
    &
    hs/7925 on forgehg

    Other notes:
    Git has a few other interesting options for tweaking performance. For example -- we could use a diff processing threshold when searching for copies.

    We also could further improve the visual indicators when displaying copies vs renames etc (but that may be better in another ticket).

     
    • Heith Seewald

      Heith Seewald - 2015-07-28

      We could specify the max number for the -C option. We could also make this configurable via the ini.

      git diff-tree:

      -l<num>

      The -M and -C options require O(n^2) processing time where n is the number of potential rename/copy targets. This option prevents rename/copy detection from running if the number of rename/copy targets exceeds the specified number.

       
  • Dave Brondsema

    Dave Brondsema - 2015-07-30
    • status: review --> in-progress
     
  • Dave Brondsema

    Dave Brondsema - 2015-07-30

    The results here are great. Including the repo refresh backend logic. But it is several changes and some quite big changes, and so naturally there's a good handful of tweaks needed to polish it up:

    general

    • Now the commit view doesn't show binary diffs, good. But the table listing all the files has binary files linked up still and the links don't go anywhere.
    • Can you add a test for the has_html_view method's new functionality for fast binary detection?
    • "refresh" logic is fast now too, yay!
    • I guess this should be a separate ticket, but it'd be nice to sort by filename across all change types, instead of showing adds, then removes, etc. Maybe same ticket as displaying copies vs renames better.
    • Down in the diff list, it says "File was copied or renamed." We should be able to say exactly which now.
    • A rename shows up as {'new': u'README.txt', 'old': u'README', 'diff': '', 'ratio': 1} in the diff section and also says Can't load diff
      • Is it ok that we set diff to '' in many places?

    hg & svn

    • The [:] slice would be better on the for loop than the if line right?

    hg

    • cleanup: move imports to top of file

    git

    • Testing with walrustech repo, in the 2nd commit, only the Flan dir shows up as having changes. Nothing shown for options.txt or bin/ or mods/ but they did have changes. You can see this with ?limit=1000. And if you use the default limit, the pages at the end are all blank.
    • I think we don't want to use --find-copies-harder
      • Performance wise on a big repo my timing measurement is 0m0.035s without it and 0m0.135s with it. Noticable but not huge
      • A bigger impact is the semantics of it. It can make an incorrect association of files being "copied" if the contents are common contents. A very good example of common contents is no content, an empty file. I've found a diff that says one __init__.py file was copied to another, but really it's just a new file. And another file that is new but has a lot of test boilerplate so git thinks its a 56% similar copy. Thus I think we should drop --find-copies-harder
    • After doing a straight copy or rename in git and committing it, I get:
    File '/home/dbrondsema/dbrondsema-1019/forge/ForgeGit/forgegit/model/git_repo.py', line 682 in paged_diffs
      for i in xrange(0, result['total'] + 1, 2)]
    IndexError: list index out of range
    
     
  • Dave Brondsema

    Dave Brondsema - 2015-07-30

    And this ticket will also resolve [#7918] too. (Although I think the [:] loop issue needs causing a minor bug still)

     

    Related

    Tickets: #7918

  • Heith Seewald

    Heith Seewald - 2015-07-31

    These are great notes.

    I was on the fence about --find-copies-harder. I ended up using it because my testing showed slightly better results when detecting copies, but I did not consider (or test for) false positives.

     
  • Dave Brondsema

    Dave Brondsema - 2015-08-05

    Fixes on db/7925 on allura and forgehg repos. Followup ticket [#7949] for a few items.

     

    Related

    Tickets: #7949

  • Dave Brondsema

    Dave Brondsema - 2015-08-05
    • status: in-progress --> review
    • Reviewer: Dave Brondsema --> Heith Seewald
     
  • Heith Seewald

    Heith Seewald - 2015-08-10
    • labels: sf-2, sf-current, performance --> sf-current, performance, sf-4
    • status: review --> closed
     
  • Heith Seewald

    Heith Seewald - 2015-08-10

    The changes you made looked really good and over all much cleaner.

    Nice work!

     
  • Dave Brondsema

    Dave Brondsema - 2015-08-10
    • labels: sf-current, performance, sf-4 --> performance, sf-4
     
  • Dave Brondsema

    Dave Brondsema - 2015-12-08
    • Milestone: unreleased --> v1.3.2
     

Log in to post a comment.