#7873 Git branch & tag speedups -- NEEDS INI

v1.3.1
closed
General
2015-06-29
2015-04-17
No

I saw that git pages on forge-allura.apache.org were slow, so I looked at stats.log and saw that the sidebar was the slowest part. I did some additional digging and found 2 specific areas for improvement:

  • in git_main.py, change default_branch_name to a @LazyProperty, since it is called many times inside a loop in RepositoryApp.sidebar_menu
  • since sidebar_menu only requests a certain number of branches, pass that "limit" all the way through to git_repo.py's branches method so that is_valid() is called only the minimum number of times needed (see the sketch after this list)
    • make sure the default-branch logic still puts it at the top of the list (i.e. the default branch always comes first)
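
A minimal sketch of both ideas (hypothetical names and helpers; the real change belongs in ForgeGit's git_main.py / git_repo.py):

    class LazyProperty(object):
        """Compute once per instance, then cache the value on the instance
        so later attribute access skips the descriptor entirely."""
        def __init__(self, func):
            self.func = func

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            value = self.func(obj)
            setattr(obj, self.func.__name__, value)
            return value

    class GitTool(object):  # stand-in, not Allura's actual class
        @LazyProperty
        def default_branch_name(self):
            # The expensive head lookup now runs once, not once per sidebar entry.
            return self._lookup_default_branch()  # hypothetical helper

        def branches(self, limit=None):
            # Validate refs lazily and stop as soon as the caller has enough,
            # instead of validating every branch in the repo up front.
            out = []
            for ref in self._iter_branch_refs():  # hypothetical helper
                if self._is_valid(ref):           # hypothetical validity check
                    out.append(ref)
                    if limit is not None and len(out) >= limit:
                        break
            return out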

In addition to those changes, generalize and apply the same approach to the tags. Also check ForgeHg to see whether Mercurial can benefit in the same way.

Related

Tickets: #7961

Discussion

  • Dave Brondsema - 2015-04-30
    • labels: performance --> performance, sf-current
     
  • Dave Brondsema - 2015-04-30
    • labels: performance, sf-current --> performance, sf-current, sf-4
     
  • Heith Seewald - 2015-05-04
    • status: open --> in-progress
    • assigned_to: Heith Seewald
     
  • Heith Seewald - 2015-05-05
    • status: in-progress --> review
     

    Last edit: Heith Seewald 2015-05-05
  • Heith Seewald - 2015-05-05

    Review at: hs/7873

    TTFB was significantly reduced by removing the logic to locate the 'default_branch_name'.

    Time complexity for displaying the 'master' branch went from O(n) to O(1).

It turns out the issue was unique to the branches processing -- tags and ForgeHg are unaffected.

     

    Last edit: Heith Seewald 2015-05-05
  • Igor Bondarenko - 2015-05-07
    • Reviewer: Igor Bondarenko
     
  • Igor Bondarenko - 2015-05-08
    • status: review --> in-progress
     
  • Igor Bondarenko - 2015-05-08

Found one small issue: if the default branch is in the first 10 branches returned by self.repo.get_branches(), you'll see two buttons for it in the sidebar (this isn't true when you have only one branch).
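
    One way to avoid the duplicate button (an illustrative sketch, not the reviewed diff; names are hypothetical):

        def ordered_branches(branches, default_name):
            # Put the default branch first and drop its duplicate from the rest.
            default = [b for b in branches if b.name == default_name]
            rest = [b for b in branches if b.name != default_name]
            return default + rest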


    And a test failure:

    ======================================================================
    FAIL: forgegit.tests.functional.test_controllers.TestFork.test_merge_request_list_view
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/var/local/env-allura/lib/python2.7/site-packages/nose-1.3.4-py2.7.egg/nose/case.py", line 197, in runTest
        self.test(*self.arg)
      File "/home/ibondarenko/ibondarenko-1154/forge/ForgeGit/forgegit/tests/functional/test_controllers.py", line 585, in test_merge_request_list_view
        assert_equal(r.html.findAll('span')[8].getText(), 'less than 1 minute ago')
    AssertionError: u'foo' != 'less than 1 minute ago'
    -------------------- >> begin captured stdout << ---------------------
    Running setup_app() from allura.websetup
    http://localhost/p/test2/code/do_request_merge
     ==> 302 ==> http://localhost/p/test/src-git/merge-requests/1/
    

Also, I'm not sure there are any performance improvements. I've imported the allura repo itself for testing; I'm looking at stats.log and don't see any improvement in the sidebar timings.

    On master (timings fluctuate around 4200 in most cases):

                "sidebar": 5145, 
                "sidebar": 4145, 
                "sidebar": 4211, 
    

On a branch (sometimes even slightly worse; it seems to spike randomly: 4000, then almost ~6700):

                "sidebar": 6514, 
                "sidebar": 3916, 
                "sidebar": 6730, 
    

    (Hitting /p/allura/allura/ci/master/tree/ page)

Could you describe how you measured it? Maybe I'm missing something?

     
  • Heith Seewald - 2015-05-08

    These are great notes.

My performance numbers were based on the "time to first byte" from the Chrome dev tools. I attached two files showing the before and after from my tests (using ~200 branches).

TTFB, while not the best metric for gauging specific functions, does give an indication of perceived latency.

    I'm going to re-run the tests and check out the stats.log.

    How many branches were your performance tests run with?

     
    • Igor Bondarenko - 2015-05-08

      I tested with 1224 branches and 588 tags.

I've looked into TTFB: for me it's ~7.6 sec and stable across page reloads on master, while on the branch it spikes just like the stats: ~7 sec, then ~13 sec, then ~8 sec.

It's strange that I don't see even a tiny performance improvement when in your case it's almost 5 times faster. Could you try it with more branches and maybe some tags too?

       
  • Heith Seewald - 2015-06-05
    • status: in-progress --> review
     
  • Heith Seewald - 2015-06-05

You were right, Igor: when adding thousands of branches/tags, things became tricky.

The number of branches and tags was high enough that their initial loading was killing performance. Due to the nature of git and symbolic references, it was still necessary to loop through each tag/branch as we had been doing... but by removing the validation of each branch, the number of calls can be significantly reduced:

    [Object(name=b.name, object_id=b.commit.hexsha) for b in self._git.branches]
    

I attached two images of the cProfile breakdown of this loop.

Notice that the cumtime is significantly higher when validation is checked. Given how unlikely it is for a symbolic ref to exist that doesn't point to a commit, this seems like a more than fair trade-off. Even if it does occur, it would have no major impact.
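
    How a breakdown like this can be captured (a sketch; the repo path is hypothetical and build_refs is just a stand-in for the loop above, not an actual Allura function):

        import cProfile
        import pstats

        import git                    # GitPython, which our git support wraps
        from ming.base import Object  # dict subclass with attribute-style access

        def build_refs(branches):
            # The loop under discussion: one Object per branch head.
            return [Object(name=b.name, object_id=b.commit.hexsha) for b in branches]

        repo = git.Repo('/path/to/some/repo.git')  # hypothetical path
        profiler = cProfile.Profile()
        profiler.enable()
        build_refs(repo.branches)
        profiler.disable()
        pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)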

    What do you think?

     
    • Igor Bondarenko - 2015-06-08

      I can't see the attachments. Did you forget to add those? :)

       
      • Heith Seewald - 2015-06-08

Haha, looks like it. Here you go.

         
  • Igor Bondarenko - 2015-06-09
    • status: review --> in-progress
     
  • Igor Bondarenko - 2015-06-09

Ok, I see some timing improvements: TTFB mean went from 9.7s to 8.3s. That's not much, but a lot of the time is spent elsewhere. E.g. if I change RepositoryApp.sidebar_menu to always return an empty list, TTFB is still pretty high (6-7s on a branch and 7-9s on master).

That said, b.commit.hexsha can raise an error if the ref is not valid, so we should handle it gracefully for the (hopefully rare) cases like this:

    ValueError: Failed to parse reference information from u'refs/heads/a-fake'
    

    or

    ValueError: SHA 31189d43d7a3650482c5aa7acda2f105b937acf1 could not be resolved, git returned: '31189d43d7a3650482c5aa7acda2f105b937acf1 missing'
    

     
    You can reproduce it like this:

    echo 'bad hash' > /git/p/t/tee/test/code.git/refs/heads/a-fake
    

    and then refresh the page.

     

    Last edit: Igor Bondarenko 2015-06-09
  • Heith Seewald - 2015-06-09

Yeah, I see what you mean. My thought was that if a repo has a mismatched sha, there are probably bigger data-integrity problems to deal with. But you're right -- we should handle it to be safe.

    I tried catching the BadObject exception in a few places -- but quickly realized that the existing location is the best place for validation (inside the branch/tags list).

    So here's another angle:

What if we used git fsck to check for invalid refs on repo change (via hooks, maybe)? A single check when something changes would be much more performant than checking on every page view... the only question is how to handle the invalid refs.

    What are your thoughts on that?

     
    • Igor Bondarenko - 2015-06-10

I thought we could expand the list comprehension in branches/tags into a full loop and wrap the Object construction in a try..except block; that should be cheaper than calling is_valid on every branch/tag, since the exception-handling logic would only run for repos with integrity problems, which are fairly rare. I'm not sure, though.

We could also cache the results of branches/tags and update the cache on repo change with proper is_valid checks, but since we're moving from keeping repo metadata in mongo to direct access, I'm not sure it's worth it.

The git fsck approach looks like overkill to me: we'd need to parse its output to determine what's wrong and also save the results somewhere so we can skip the invalid branches. Correct me if I'm mistaken.

       
  • Heith Seewald - 2015-06-10

So the Object construction in the branches/tags context is actually just a ming Object (i.e. a dict with object-like attribute access). And is_valid is just doing a simple try/except itself -- so I don't think expanding the list comprehension for our own try/except would be any cheaper, unless...

We use git fsck as the try/except and only fall back to is_valid in the very rare case that it finds an issue. I attached a profile of running self._git.git.fsck(full=True); it's very cheap even with 1000s of branches.

        @LazyProperty
        def branches(self):
            try:
                # One repo-wide integrity check up front...
                self._git.git.fsck(full=True)
            except git.GitCommandError as gce:
                # ...and only pay for per-ref validation when it finds a problem.
                log.debug(u"Found inconsistency in {}: {}".format(self._repo.name, gce.stderr))
                return [Object(name=b.name, object_id=b.commit.hexsha) for b in self._git.branches if b.is_valid()]
            return [Object(name=b.name, object_id=b.commit.hexsha) for b in self._git.branches]
    

This looks like a pretty good trade-off between handling invalid refs and performance. It would have the added benefit of informing us of issues instead of just silently returning valid results.

    I pushed an optimized version of this idea if you want to take a look. hs/7873

     
  • Dave Brondsema - 2015-06-10

The fsck operation checks all objects in a repo, so I think it's going to scale with the size of the repo overall, not just the number of branches. It seems like overkill to me too, with the potential to be slow on big repos. (I also wonder how fsck could be faster than the is_valid checks, since they'd both be going through and checking each ref.)

I think a try/catch in the loop would be worth trying. I'd say change it from a list comprehension to a regular loop so you can put the try/catch around just the ref.commit.hexsha bit that triggers the exception. Something like the sketch below.
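
    A sketch of that loop (illustrative only; Object is ming's dict-like wrapper, and ValueError is the exception quoted earlier in this thread):

        from ming.base import Object

        def branches_safe(git_branches):
            """Build ref entries, skipping refs whose commit can't be resolved."""
            out = []
            for b in git_branches:
                try:
                    object_id = b.commit.hexsha  # the only call that raises on a bad ref
                except ValueError:
                    # e.g. refs/heads/a-fake pointing at a garbage hash; skip just this ref
                    continue
                out.append(Object(name=b.name, object_id=object_id))
            return out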

    Igor's suggestion of caching the results is also interesting. As he says, we're trying to do direct repo access rather than caching in mongo, but it would be much faster. I'm ok with using mongo storage if we treat it as a smart cache, not as a definitive datastore that must be populated by the "repo refresh" logic before repo webpages are usable. For example: the branches property would know how to get the info from the git repo and would cache the result in mongo and use that cache. Cache invalidation would happen whenever new changes are pushed. That'd be in the "repo refresh" functions, but would be fast and simple.
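
    Roughly, the smart-cache idea could look like this (all names hypothetical; not an actual Allura API):

        class CachedRefs(object):
            """Mongo-backed read-through cache for branch/tag listings."""

            def __init__(self, repo):
                self.repo = repo

            def branches(self):
                cached = self.repo.cache_doc.get('branches')  # hypothetical mongo doc
                if cached is not None:
                    return cached                             # fast path
                fresh = self.repo.read_branches_from_git()    # hypothetical direct read
                self.repo.cache_doc['branches'] = fresh       # populate the cache
                return fresh

            def on_refresh(self):
                # Called from the 'repo refresh' logic after each push: invalidate
                # so the next page view recomputes from the live repo.
                self.repo.cache_doc['branches'] = None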

     
    • Heith Seewald - 2015-06-10

Check out these simple timings:

        _start = time()
        a = self._get_refs(self._git.branches)
        print("Time using fsck: {}".format(time() - _start))

        _start = time()
        b = [Object(name=b.name, object_id=b.commit.hexsha) for b in self._git.branches if b.is_valid()]
        print("Validation time (the old way): {}".format(time() - _start))

        print(a == b)

        ## Results
        # Time using fsck: 1.61762309074
        # Validation time (the old way): 2.93375897408
        # True


Of the time it took to get the list with self._get_refs(self._git.branches), fsck only accounted for 0.121250049591 seconds -- in that time it checked 256 directories and 3498 objects (the result of which is cached via LazyProperty so tags can make use of it too).

The only time the fsck method will be slower is when there is an actual error -- then it's the sum of both the fsck and is_valid times. But that's supposedly an edge case anyway, right?

      If you guys think it'll cause issues down the road, I have no problem going another route, but it seems to scale better than our existing implementation.

      Also, I really like the idea of using mongo as a 'smart cache' and invalidating via repo_refresh!

       
      • Dave Brondsema - 2015-06-10

No is_valid/fsck at all would be even faster, right?

I am still skeptical about fsck. Can you time it on a large repo? The Allura repo might be a good start, but there are much, much bigger ones out there too.

In principle I don't like doing extra work to handle a corner case when a try/except could handle it too. But if it is really fast all the time, it could be ok.

         
  • Heith Seewald - 2015-06-15
    • status: in-progress --> review
     
  • Heith Seewald - 2015-06-15

    Implemented a cache system that is cleared on repo_refresh. Much faster overall. Thanks for the tips :)

     
