We've done some work in the past to make our SCM repos "indexless", that is, to use the git/hg/svn repo directly instead of indexing it in mongo and using that. (Storing in mongo can take up a lot of space, and waiting for the indexing process to run adds delay.)
Analyze where each of the following collections (models) is used. Perhaps cross-reference by page or function (e.g. browse repo, view commit, etc; also git/svn/hg). Then we can plan which pages' functionality needs to be updated to be able to remove them.
Collections (with relative size factors based on sf.net data):
Basically, we populate these collections whenever a repo update happens (via hook and UI) and use them to display info to users, except for the log view, which talks directly to the SCM.
Query usage pretty much boils down to these controllers:
- `MergeRequestController`:
  - `index` - uses `Commit` in template 'jinja:allura:templates/repo/merge_request.html' via `req.commits`
  - `do_request_merge_edit` - uses `Commit`
- `BranchBrowser`:
  - `log` - uses `Commit`
- `CommitBrowser`:
  - `__init__` - uses `Commit`
  - `index`:
    - `Tree` via `self._commit.tree`
    - `DiffInfoDoc` via `self._commit.paged_diffs`
  - `basic` - `DiffInfoDoc` (in `commit_basic.html` via `commit.diffs`)
- `TreeBrowser`:
  - `Tree` (passed to `__init__`, in lookup uses `Tree.__getitem__`)
  - `LastCommit` in `Allura/allura/templates/widgets/repo/tree_widget.html` (via `tree.ls()`)
- `FileBrowser` - uses `Tree` & `Commit`.
Git, Hg, SVN (each of them has a version of this):
- `BranchBrowser` - uses `Commit` via `repo.latest`.
Also:
- `Allura/allura/lib/macro.py:include_file` uses `Commit.get_path`, which uses `Commit.get_tree`, which uses `Tree`
- `Allura/allura/model/stats.py` uses `DiffInfoDoc` via `commit.diffs`, but I'm not sure where `stats.py` is used

The above includes only places that query one or more of those collections to display something useful to the user. We should be able to get rid of these collections by rewriting the controllers/templates above (I might have missed something, so we should estimate it a bit higher).
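To help with the "cross-reference by page or function" part, the controller list above can be inverted into a per-model view. This is only a minimal sketch: the `USAGE` dictionary and the `by_model` helper are hypothetical names, and the data is copied by hand from the list above, not extracted from the Allura code.

```python
from collections import defaultdict

# Hand-copied from the controller list above:
# controller (or controller action) -> models it reads.
USAGE = {
    "MergeRequestController.index": ["Commit"],
    "MergeRequestController.do_request_merge_edit": ["Commit"],
    "BranchBrowser.log": ["Commit"],
    "CommitBrowser.__init__": ["Commit"],
    "CommitBrowser.index": ["Tree", "DiffInfoDoc"],
    "CommitBrowser.basic": ["DiffInfoDoc"],
    "TreeBrowser": ["Tree", "LastCommit"],
    "FileBrowser": ["Tree", "Commit"],
}


def by_model(usage):
    """Invert controller -> models into model -> sorted list of controllers."""
    index = defaultdict(list)
    for controller, models in usage.items():
        for model in models:
            index[model].append(controller)
    return {model: sorted(controllers) for model, controllers in index.items()}


if __name__ == "__main__":
    # Print which controllers must be rewritten before each model can go away.
    for model, controllers in sorted(by_model(USAGE).items()):
        print(model, "->", ", ".join(controllers))
```

Reading the result per model gives the set of controllers that would have to be rewritten before the corresponding collection can be dropped.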
More about overall usage below:
`TreesDoc`, `repo_trees`
- `ForgeSVN/forgesvn/model/svn.py:compute_tree_new` (upserts)
  - `Commit.get_tree()`
  - `Commit.tree`
  - `Commit.get_path()`
  - `FileBrowser.diff`
  - `Allura/allura/lib/macro.py:include_file`
  - `Commit.has_path()`
  - `Tree.__getitem__()`
  - `TreeBrowser` controller uses it, maybe also used elsewhere, it's hard to tell
- `Allura/allura/model/repo_refresh.py:refresh_commit_trees` (creates)
  - `refresh_repo`
  - `Repository.refresh`
  - creates `TreesDoc` for new commits
- `Allura/allura/model/repo_refresh.py:compute_diffs` (queries)
  - `refresh_repo`
  - `Allura/allura/scripts/refresh_last_commits.py`
  - queries `TreesDoc` to compute diffs between current and parent commits
- `Allura/allura/model/repo_refresh.py:compute_lcds` (gets from `ModelCache`)
  - `compute_diffs`, computes last commit for each tree (presumably to show it on repo browse pages).
`Tree`, `TreeDoc`, `repo_tree`
- `Allura/allura/model/repo_refresh.py:trees` (query)
- `Allura/allura/model/repo_refresh.py:compute_diffs:_update_cache` (query)
- `GitImplementation.refresh_tree_info` (create)
- `HgImplementation.refresh_tree_info` (create)
- `CommitBrowser.__init__` via `Commit.tree`, which always creates new `TreeDoc` via `repo.compute_tree_new`
- `SVNImplementation.compute_tree_new` (query & upsert)
- `Tree.__getitem__`
- `Allura/allura/model/repo_refresh.py:_pull_tree` & `_update_tree_cache` - helpers, so don't really care.
`LastCommit`, `LastCommitDoc`, `repo_last_commit`
- `compute_lcds` - produces `LastCommit`, which is used in:
  - `SVNImplementation.compute_tree_new` (updates)
  - `tree_widget.html` (via `tree.ls()`).
`DiffInfoDoc`, `repo_diffinfo`
- `compute_diffs` - produces `DiffInfoDoc`, which is used in:
  - `Commit.paged_diffs` (queries)
    - `CommitBrowser.index`
  - `Commit.diffs`
    - `Allura/allura/model/stats.py` - ?
    - `Allura/allura/templates/repo/commit_basic.html` (`CommitBrowser.basic`)
  - `Commit.added_paths` (queries)
    - `LastCommit._prev_commit_id` (only as an optimization to exit early if the prev commit doesn't exist)
    - `LastCommit._build`
- `Allura/allura/scripts/refresh_last_commits.py`, `Allura/allura/scripts/refreshrepo.py` - only deleting
- `SVNImplementation.refresh_commit_info` (creates)

Last edit: Igor Bondarenko 2015-02-17
Thanks.
I also did some measurements of a few SVN imports. Optimizing SVN imports is of particular interest to me, given how we're using Allura. An import & refresh will fully populate the `repo_ci` and `repo_diffinfo` collections. The other collections (`repo_trees`, `repo_tree`, `repo_last_commit`) are populated as the repo is browsed. The sizes of the collections, after browsing around a fair bit (wget spidering), are of the same order of magnitude for `repo_ci`, `repo_diffinfo`, `repo_tree`, and `repo_last_commit`. Good targets to address first would be `repo_ci` and `repo_diffinfo`, since they are created immediately during "refresh". Next could be `repo_tree`. Not sure about tackling `repo_last_commit`, since the logic is very complex and caching is necessary because the calculations are expensive.
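For anyone who wants to repeat this kind of size comparison, here is a rough sketch of one way to do it. It assumes a pymongo-style database object whose `command("collstats", name)` call returns a document with a `size` field (data size in bytes). The `collection_sizes` helper and the `REPO_COLLECTIONS` list are hypothetical names, not part of Allura, and this is not necessarily how the measurements above were taken.

```python
# Collection names as mentioned in this thread.
REPO_COLLECTIONS = [
    "repo_ci",
    "repo_diffinfo",
    "repo_trees",
    "repo_tree",
    "repo_last_commit",
]


def collection_sizes(db, names=REPO_COLLECTIONS):
    """Return {collection name: data size in bytes}, largest first.

    `db` is assumed to behave like a pymongo Database, i.e. to expose
    command("collstats", <collection name>) returning a dict of stats.
    """
    sizes = {}
    for name in names:
        stats = db.command("collstats", name)
        sizes[name] = stats.get("size", 0)
    # Dicts keep insertion order, so the result iterates largest first.
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


# Possible usage (assumes pymongo is installed; the database name is a guess):
#   from pymongo import MongoClient
#   for name, size in collection_sizes(MongoClient()["project-data"]).items():
#       print(name, size)
```

Running it once right after an import & refresh and again after spidering the repo pages would show which collections grow at which stage.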