#7828 Analyze & document usage of repo collections


We've done some work in the past to make our SCM repos "indexless", that is, to use the git/hg/svn repo directly instead of indexing it in mongo and using that. (Storing in mongo can take up a lot of space, and waiting for the indexing process to run adds delay.)

Analyze where each of the following collections (models) is used. Perhaps cross-reference by page or function (e.g. browse repo, view commit, etc; also git/svn/hg). Then we can plan which pages' functionality needs to be updated to be able to remove them.

Collections (with relative size factors based on sf.net data):

  • repo_trees (4x)
  • repo_tree (2x)
  • repo_last_commit (2x)
  • repo_diffinfo (1x)
  • repo_ci (very tiny)
  • repo_commitrun (very tiny)
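
One possible starting point for this analysis is a plain text search over the source tree. The sketch below is only an illustration: `find_references` is a hypothetical helper, not part of Allura, and it assumes the collection names (or model class names) appear literally in `.py` files. Word boundaries keep `repo_tree` from also matching `repo_trees`.

```python
import re
from collections import defaultdict
from pathlib import Path

# Collection names copied from the list in this ticket.
COLLECTIONS = [
    "repo_trees",
    "repo_tree",
    "repo_last_commit",
    "repo_diffinfo",
    "repo_ci",
    "repo_commitrun",
]


def find_references(root, names):
    """Map each name to a list of (relative_path, line_number) hits.

    Hypothetical helper: scans every *.py file under `root` and records
    the lines where a name occurs as a whole word.
    """
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b") for name in names
    }
    hits = defaultdict(list)
    root = Path(root)
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in patterns.items():
                if pattern.search(line):
                    rel = path.relative_to(root).as_posix()
                    hits[name].append((rel, lineno))
    return dict(hits)
```

Calling `find_references(".", COLLECTIONS)` from a checkout would give a raw list of hits per collection. It would not follow indirect usage (for example a template that touches a model through a property), so the results still need manual review.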


Tickets: #7837
Tickets: #8168
Wiki: Goals


  • Dave Brondsema

    Dave Brondsema - 2015-02-09
    • labels: indexless, sf-current --> indexless, sf-current, sf-2
  • Igor Bondarenko

    Igor Bondarenko - 2015-02-10
    • Owner: Anonymous --> Igor Bondarenko
    • Labels: indexless, sf-current, sf-2 --> 42cc, sf-current, sf-2, indexless
    • Status: open --> in-progress
  • Igor Bondarenko

    Igor Bondarenko - 2015-02-17
    • status: in-progress --> review
  • Igor Bondarenko

    Igor Bondarenko - 2015-02-17

    Basically, we populate these collections whenever a repo update happens (via hook and UI) and use them to display info to users, except for the log view, which talks directly to the SCM.

    Query usage pretty much boils down to these controllers:

    • MergeRequestController:
      • index - uses Commit in template 'jinja:allura:templates/repo/merge_request.html' via req.commits
      • do_request_merge_edit - uses Commit
    • BranchBrowser:
      • log - uses Commit
    • CommitBrowser:
      • __init__ - uses Commit
      • index:
        • uses Tree via self._commit.tree
        • DiffInfoDoc via self._commit.paged_diffs
      • basic - DiffInfoDoc (in commit_basic.html via commit.diffs)
    • TreeBrowser:
      • uses Tree (passed to __init__, in lookup uses Tree.__getitem__)
      • LastCommit in Allura/allura/templates/widgets/repo/tree_widget.html (via tree.ls())
    • FileBrowser - uses Tree & Commit


    Git, Hg, SVN (each of them has a version of this):

    • BranchBrowser - uses Commit via repo.latest



    • Macro for including a file from a repo, Allura/allura/lib/macro.py:include_file, uses Commit.get_path, which uses Commit.get_tree, which uses Tree
    • Allura/allura/model/stats.py uses DiffInfoDoc via commit.diffs, but I'm not sure where stats.py is used

    The above includes only places that query one or more of those collections to display something useful to the user. We should be able to get rid of these collections by rewriting the controllers/templates above (I might have missed something, so we should estimate it a bit higher).
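
The controller list above can also be written down as data and inverted, which gives a per-collection view of what would need rewriting. This is a rough sketch transcribed from this comment: the `Commit` to `repo_ci` pairing is an assumption (the ticket never states it), and `controllers_by_collection` is a hypothetical helper, not Allura code.

```python
from collections import defaultdict

# Model -> collection, taken from the section headings in this comment.
# Assumption: Commit is stored in repo_ci; the ticket does not say so.
MODEL_TO_COLLECTION = {
    "Commit": "repo_ci",
    "Tree": "repo_tree",
    "LastCommit": "repo_last_commit",
    "DiffInfoDoc": "repo_diffinfo",
}

# Controller -> models it queries, transcribed from the list above.
CONTROLLER_MODELS = {
    "MergeRequestController": {"Commit"},
    "BranchBrowser": {"Commit"},
    "CommitBrowser": {"Commit", "Tree", "DiffInfoDoc"},
    "TreeBrowser": {"Tree", "LastCommit"},
    "FileBrowser": {"Tree", "Commit"},
}


def controllers_by_collection(controller_models, model_to_collection):
    """Invert controller -> models into collection -> sorted controllers."""
    result = defaultdict(set)
    for controller, models in controller_models.items():
        for model in models:
            result[model_to_collection[model]].add(controller)
    return {name: sorted(users) for name, users in sorted(result.items())}
```

With these tables, `controllers_by_collection(CONTROLLER_MODELS, MODEL_TO_COLLECTION)["repo_diffinfo"]` lists only `CommitBrowser`. The macro and stats.py usages are left out here because they are not controllers.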

    More about overall usage below:

    TreesDoc, repo_trees
    • ForgeSVN/forgesvn/model/svn.py:compute_tree_new (upserts)
      • Commit.get_tree()
        • Commit.tree
          • almost everywhere
        • Commit.get_path()
          • Diff between two revisions of the same file (FileBrowser.diff)
          • Macro for including file from a repo Allura/allura/lib/macro.py:include_file
          • Commit.has_path()
            • nowhere
      • Tree.__getitem__()
        • TreeBrowser controller uses it, maybe also used elsewhere, it's hard to tell
    • Allura/allura/model/repo_refresh.py:refresh_commit_trees (creates)
      • refresh_repo
        • Repository.refresh
        • git hook
        • basically whenever repo refresh happens we create TreesDoc for new commits
    • Allura/allura/model/repo_refresh.py:compute_diffs (queries)
      • refresh_repo
      • script task Allura/allura/scripts/refresh_last_commits.py
      • uses info from TreesDoc to compute diffs between current and parent commits
    • Allura/allura/model/repo_refresh.py:compute_lcds (gets from ModelCache)
      • pretty much the same places as compute_diffs; computes the last commit for each tree (presumably to show it on repo browse pages)


    Tree, TreeDoc, repo_tree
    • Allura/allura/model/repo_refresh.py:trees (query)
      • seems to be unused
    • Allura/allura/model/repo_refresh.py:compute_diffs:_update_cache (query)
      • used only internally, to help calculate diffs
    • GitImplementation.refresh_tree_info (create)
    • HgImplementation.refresh_tree_info (create)
    • CommitBrowser.__init__ via Commit.tree, which always creates new TreeDoc via repo.compute_tree_new
    • SVNImplementation.compute_tree_new (query & upsert)
    • Tree.__getitem__
    • Allura/allura/model/repo_refresh.py:_pull_tree & _update_tree_cache - helpers, so don't really care


    LastCommit, LastCommitDoc, repo_last_commit
    • compute_lcds - produces LastCommit, which is used in:
      • SVNImplementation.compute_tree_new (updates)
      • in tree_widget.html (via tree.ls())


    DiffInfoDoc, repo_diffinfo
    • compute_diffs - produces DiffInfoDoc, which is used in:
      • Commit.paged_diffs (queries)
        • Displays diffs for commit. CommitBrowser.index
        • Commit.diffs
          • Allura/allura/model/stats.py - ?
          • Allura/allura/templates/repo/commit_basic.html (CommitBrowser.basic)
      • Commit.added_paths (queries)
        • LastCommit._prev_commit_id (only as an optimization to exit early if the previous commit doesn't exist)
          • LastCommit._build
      • Allura/allura/scripts/refresh_last_commits.py, Allura/allura/scripts/refreshrepo.py - only deleting
      • SVNImplementation.refresh_commit_info (creates)

    Last edit: Igor Bondarenko 2015-02-17
  • Dave Brondsema

    Dave Brondsema - 2015-02-18
    • status: review --> closed
    • Reviewer: Dave Brondsema
  • Dave Brondsema

    Dave Brondsema - 2015-02-18


    I also did some measurements of a few SVN imports. Optimizing SVN imports is particularly important to me and to how we're using Allura. An import & refresh will fully populate the repo_ci and repo_diffinfo collections. The other collections (repo_trees, repo_tree, repo_last_commit) populate as the repo is browsed. After browsing around a fair bit (wget spidering), the sizes of repo_ci, repo_diffinfo, repo_tree, and repo_last_commit are of the same order of magnitude. Good targets to address first would be repo_ci and repo_diffinfo, since they are created immediately during "refresh". Next could be repo_tree. I'm not sure about tackling repo_last_commit, since the logic is very complex and caching is necessary because the calculations are expensive.
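
For anyone wanting to repeat this kind of comparison, a minimal sketch follows. It assumes `db` is a pymongo `Database` (or any object with a compatible `command()` method) and reads the `size` field from MongoDB's `collstats` command. The helper names are hypothetical, and this is not necessarily how the measurements above were taken.

```python
def collection_sizes(db, names):
    """Return {collection_name: data_size_in_bytes}.

    Assumes `db.command("collstats", name)` returns a document with a
    "size" field, as a pymongo Database does.
    """
    sizes = {}
    for name in names:
        stats = db.command("collstats", name)
        sizes[name] = stats["size"]
    return sizes


def largest_first(sizes):
    """Sort (name, size) pairs from largest to smallest."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

Running `largest_first(collection_sizes(db, [...]))` once right after an import & refresh and again after spidering would show which collections grow at which stage.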

  • Dave Brondsema

    Dave Brondsema - 2015-02-23
    • labels: 42cc, sf-current, sf-2, indexless --> 42cc, sf-2, indexless
  • Dave Brondsema

    Dave Brondsema - 2015-08-10
    • Milestone: unreleased --> v1.3.1
