#7873 Git branch & tag speedups -- NEEDS INI

v1.3.1
closed
General
2015-06-29
2015-04-17
No

I saw that git pages on forge-allura.apache.org were slow, so I looked at stats.log and saw that the sidebar was the slowest part. I did some additional digging and found 2 specific areas for improvement:

  • in git_main.py, change default_branch_name to a @LazyProperty, since it is called many times inside a loop in RepositoryApp.sidebar_menu
  • since sidebar_menu only requests a certain number of branches, pass that "limit" all the way through to git_repo.py's branches method so that is_valid() is called only the minimum number of times needed (see the sketch after this list)
    • make sure the default-branch logic still puts it at the top of the list (i.e. the default branch always comes first)
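
A minimal sketch of both ideas (hypothetical names and helpers; the real change belongs in ForgeGit's git_main.py / git_repo.py):

    class LazyProperty(object):
        """Compute once per instance, then cache the value on the instance
        so later attribute access skips the descriptor entirely."""
        def __init__(self, func):
            self.func = func

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            value = self.func(obj)
            setattr(obj, self.func.__name__, value)
            return value

    class GitTool(object):  # stand-in, not Allura's actual class
        @LazyProperty
        def default_branch_name(self):
            # The expensive head lookup now runs once, not once per sidebar entry.
            return self._lookup_default_branch()  # hypothetical helper

        def branches(self, limit=None):
            # Validate refs lazily and stop as soon as the caller has enough,
            # instead of validating every branch in the repo up front.
            out = []
            for ref in self._iter_branch_refs():  # hypothetical helper
                if self._is_valid(ref):           # hypothetical validity check
                    out.append(ref)
                    if limit is not None and len(out) >= limit:
                        break
            return out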

In addition to those changes, generalize and apply the same approach to the tags. Also check ForgeHg to see whether Mercurial can benefit in the same way.

Related

Tickets: #7961

Discussion

  • Dave Brondsema - 2015-04-30
    • labels: performance --> performance, sf-current
     
  • Dave Brondsema - 2015-04-30
    • labels: performance, sf-current --> performance, sf-current, sf-4
     
  • Heith Seewald - 2015-05-04
    • status: open --> in-progress
    • assigned_to: Heith Seewald
     
  • Heith Seewald - 2015-05-05
    • status: in-progress --> review
     

    Last edit: Heith Seewald 2015-05-05
  • Heith Seewald - 2015-05-05

    Review at: hs/7873

    TTFB was significantly reduced by removing the logic to locate the 'default_branch_name'.

    Time complexity for displaying the 'master' branch went from O(n) to O(1).

It turns out the issue was unique to the branches processing -- tags and ForgeHg are unaffected.

     

    Last edit: Heith Seewald 2015-05-05
  • Igor Bondarenko - 2015-05-07
    • Reviewer: Igor Bondarenko
     
  • Igor Bondarenko - 2015-05-08
    • status: review --> in-progress
     
  • Igor Bondarenko - 2015-05-08

Found one small issue: if the default branch is in the first 10 branches returned by self.repo.get_branches(), you'll see two buttons for it in the sidebar (this isn't true when you have only one branch).
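
    One way to avoid the duplicate button (an illustrative sketch, not the reviewed diff; names are hypothetical):

        def ordered_branches(branches, default_name):
            # Put the default branch first and drop its duplicate from the rest.
            default = [b for b in branches if b.name == default_name]
            rest = [b for b in branches if b.name != default_name]
            return default + rest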


    And a test failure:

    ======================================================================
    FAIL: forgegit.tests.functional.test_controllers.TestFork.test_merge_request_list_view
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/var/local/env-allura/lib/python2.7/site-packages/nose-1.3.4-py2.7.egg/nose/case.py", line 197, in runTest
        self.test(*self.arg)
      File "/home/ibondarenko/ibondarenko-1154/forge/ForgeGit/forgegit/tests/functional/test_controllers.py", line 585, in test_merge_request_list_view
        assert_equal(r.html.findAll('span')[8].getText(), 'less than 1 minute ago')
    AssertionError: u'foo' != 'less than 1 minute ago'
    -------------------- >> begin captured stdout << ---------------------
    Running setup_app() from allura.websetup
    http://localhost/p/test2/code/do_request_merge
     ==> 302 ==> http://localhost/p/test/src-git/merge-requests/1/
    

Also, I'm not sure there are any performance improvements. I've imported the allura repo itself for testing; I'm looking at stats.log and don't see any improvement in the sidebar timings.

    On master (timings fluctuate around 4200 in most cases):

                "sidebar": 5145, 
                "sidebar": 4145, 
                "sidebar": 4211, 
    

On a branch (sometimes even slightly worse; it seems to spike randomly: 4000, then almost ~6700):

                "sidebar": 6514, 
                "sidebar": 3916, 
                "sidebar": 6730, 
    

    (Hitting /p/allura/allura/ci/master/tree/ page)

Could you describe how you measured it? Maybe I'm missing something?

     
  • Heith Seewald - 2015-05-08

    These are great notes.

My performance numbers were based on the "time to first byte" from the Chrome dev tools. I attached two files showing the before and after from my tests (using ~200 branches).

TTFB, while not the best metric for gauging specific functions, does give an indication of perceived latency.

    I'm going to re-run the tests and check out the stats.log.

    How many branches were your performance tests run with?

     
    • Igor Bondarenko - 2015-05-08

      I tested with 1224 branches and 588 tags.

I've looked into TTFB: for me it's ~7.6 sec and stable across page reloads on master, while on the branch it spikes just like the stats: ~7 sec, then ~13 sec, then ~8 sec.

It's strange that I don't see even a tiny performance improvement when in your case it's almost 5 times faster. Could you try it with more branches and maybe some tags too?

       
  • Heith Seewald - 2015-06-05
    • status: in-progress --> review
     
  • Heith Seewald - 2015-06-05

You were right, Igor: when adding thousands of branches/tags, things became tricky.

The number of branches and tags was high enough that their initial loading was killing performance. Due to the nature of git and symbolic references, it was still necessary to loop through each tag/branch as we had been doing... but by removing the validation of each branch, the number of calls can be significantly reduced:

    [Object(name=b.name, object_id=b.commit.hexsha) for b in self._git.branches]
    

I attached two images of the cProfile breakdown of this loop.

Notice that the cumtime is significantly higher when validation is checked. Given how unlikely it is for a symbolic ref to exist that doesn't point to a commit, this seems like a more than fair trade-off. Even if it does occur, it would have no major impact.
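
    How a breakdown like this can be captured (a sketch; the repo path is hypothetical and build_refs is just a stand-in for the loop above, not an actual Allura function):

        import cProfile
        import pstats

        import git                    # GitPython, which our git support wraps
        from ming.base import Object  # dict subclass with attribute-style access

        def build_refs(branches):
            # The loop under discussion: one Object per branch head.
            return [Object(name=b.name, object_id=b.commit.hexsha) for b in branches]

        repo = git.Repo('/path/to/some/repo.git')  # hypothetical path
        profiler = cProfile.Profile()
        profiler.enable()
        build_refs(repo.branches)
        profiler.disable()
        pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)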

    What do you think?

     
    • Igor Bondarenko - 2015-06-08

      I can't see the attachments. Did you forget to add those? :)

       
      • Heith Seewald - 2015-06-08

Haha, looks like it. Here you go.

         
  • Igor Bondarenko - 2015-06-09
    • status: review --> in-progress
     
  • Igor Bondarenko - 2015-06-09

Ok, I see some timing improvements: TTFB mean went from 9.7s to 8.3s. That's not much, but a lot of the time is spent elsewhere. E.g. if I change RepositoryApp.sidebar_menu to always return an empty list, TTFB is still pretty high (6-7s on a branch and 7-9s on master).

That said, b.commit.hexsha can raise an error if the ref is not valid, so we should handle it gracefully for the (hopefully rare) cases like this:

    ValueError: Failed to parse reference information from u'refs/heads/a-fake'
    

    or

    ValueError: SHA 31189d43d7a3650482c5aa7acda2f105b937acf1 could not be resolved, git returned: '31189d43d7a3650482c5aa7acda2f105b937acf1 missing'
    

     
    You can reproduce it like this:

    echo 'bad hash' > /git/p/t/tee/test/code.git/refs/heads/a-fake
    

    and then refresh the page.

     

    Last edit: Igor Bondarenko 2015-06-09
  • Heith Seewald - 2015-06-09

Yeah, I see what you mean. My thought was that if a repo has a mismatched sha, there are probably bigger data-integrity problems to deal with. But you're right -- we should handle it to be safe.

    I tried catching the BadObject exception in a few places -- but quickly realized that the existing location is the best place for validation (inside the branch/tags list).

    So here's another angle:

What if we used git fsck to check for invalid refs on repo change (via hooks, maybe)? A single check when something changes would be much more performant than checking on every page view... the only question is how to handle the invalid refs.

    What are your thoughts on that?

     
    • Igor Bondarenko - 2015-06-10

I thought we could expand the list comprehension in branches/tags into a full loop and wrap the Object construction in a try..except block; that should be cheaper than calling is_valid on every branch/tag, since the exception-handling logic would only run for repos with integrity problems, which are fairly rare. I'm not sure, though.

We could also cache the results of branches/tags and update the cache on repo change with proper is_valid checks, but since we're moving from keeping repo metadata in mongo to direct access, I'm not sure it's worth it.

The git fsck approach looks like overkill to me: we'd need to parse its output to determine what's wrong and also save the results somewhere so we can skip the invalid branches. Correct me if I'm mistaken.

       
  • Heith Seewald - 2015-06-10

So the Object construction in the branches/tags context is actually just a ming Object (i.e. a dict with object-like attribute access). And is_valid is just doing a simple try/except itself -- so I don't think expanding the list comprehension for our own try/except would be any cheaper, unless...

We use git fsck as the try/except and only fall back to is_valid in the very rare case that it finds an issue. I attached a profile of running self._git.git.fsck(full=True); it's very cheap even with 1000s of branches.

        @LazyProperty
        def branches(self):
            try:
                # One repo-wide integrity check up front...
                self._git.git.fsck(full=True)
            except git.GitCommandError as gce:
                # ...and only pay for per-ref validation when it finds a problem.
                log.debug(u"Found inconsistency in {}: {}".format(self._repo.name, gce.stderr))
                return [Object(name=b.name, object_id=b.commit.hexsha) for b in self._git.branches if b.is_valid()]
            return [Object(name=b.name, object_id=b.commit.hexsha) for b in self._git.branches]
    

This looks like a pretty good trade-off between handling invalid refs and performance. It would have the added benefit of informing us of issues instead of just silently returning valid results.

    I pushed an optimized version of this idea if you want to take a look. hs/7873

     
  • Dave Brondsema - 2015-06-10

The fsck operation checks all objects in a repo, so I think it's going to scale with the size of the repo overall, not just the number of branches. It seems like overkill to me too, with the potential to be slow on big repos. (I also wonder how fsck could be faster than the is_valid checks, since they'd both be going through and checking each ref.)

I think a try/catch in the loop would be worth trying. I'd say change it from a list comprehension to a regular loop so you can put the try/catch around just the ref.commit.hexsha bit that triggers the exception. Something like the sketch below.
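
    A sketch of that loop (illustrative only; Object is ming's dict-like wrapper, and ValueError is the exception quoted earlier in this thread):

        from ming.base import Object

        def branches_safe(git_branches):
            """Build ref entries, skipping refs whose commit can't be resolved."""
            out = []
            for b in git_branches:
                try:
                    object_id = b.commit.hexsha  # the only call that raises on a bad ref
                except ValueError:
                    # e.g. refs/heads/a-fake pointing at a garbage hash; skip just this ref
                    continue
                out.append(Object(name=b.name, object_id=object_id))
            return out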

    Igor's suggestion of caching the results is also interesting. As he says, we're trying to do direct repo access rather than caching in mongo, but it would be much faster. I'm ok with using mongo storage if we treat it as a smart cache, not as a definitive datastore that must be populated by the "repo refresh" logic before repo webpages are usable. For example: the branches property would know how to get the info from the git repo and would cache the result in mongo and use that cache. Cache invalidation would happen whenever new changes are pushed. That'd be in the "repo refresh" functions, but would be fast and simple.
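
    Roughly, the smart-cache idea could look like this (all names hypothetical; not an actual Allura API):

        class CachedRefs(object):
            """Mongo-backed read-through cache for branch/tag listings."""

            def __init__(self, repo):
                self.repo = repo

            def branches(self):
                cached = self.repo.cache_doc.get('branches')  # hypothetical mongo doc
                if cached is not None:
                    return cached                             # fast path
                fresh = self.repo.read_branches_from_git()    # hypothetical direct read
                self.repo.cache_doc['branches'] = fresh       # populate the cache
                return fresh

            def on_refresh(self):
                # Called from the 'repo refresh' logic after each push: invalidate
                # so the next page view recomputes from the live repo.
                self.repo.cache_doc['branches'] = None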

     
    • Heith Seewald - 2015-06-10

Check out these simple timings:

        _start = time()
        a = self._get_refs(self._git.branches)
        print("Time using fsck: {}".format(time() - _start))

        _start = time()
        b = [Object(name=b.name, object_id=b.commit.hexsha) for b in self._git.branches if b.is_valid()]
        print("Validation time (the old way): {}".format(time() - _start))

        print(a == b)

        ## Results
        # Time using fsck: 1.61762309074
        # Validation time (the old way): 2.93375897408
        # True


Of the time it took to get the list with self._get_refs(self._git.branches), fsck only accounted for 0.121250049591 seconds -- in that time it checked 256 directories and 3498 objects (the result of which is cached via LazyProperty so tags can make use of it too).

The only time the fsck method will be slower is when there is an actual error -- then it's the sum of both the fsck and is_valid times. But that's supposedly an edge case anyway, right?

      If you guys think it'll cause issues down the road, I have no problem going another route, but it seems to scale better than our existing implementation.

      Also, I really like the idea of using mongo as a 'smart cache' and invalidating via repo_refresh!

       
      • Dave Brondsema - 2015-06-10

No is_valid/fsck at all would be even faster, right?

I am still skeptical about fsck. Can you time it on a large repo? The Allura repo might be a good start, but there are much, much bigger ones out there too.

In principle I don't like doing extra work to handle a corner case when a try/except could handle it too. But if it is really fast all the time, it could be ok.

         
  • Heith Seewald - 2015-06-15
    • status: in-progress --> review
     
  • Heith Seewald - 2015-06-15

    Implemented a cache system that is cleared on repo_refresh. Much faster overall. Thanks for the tips :)

     
