We've done some work in the past for our SCM repos to be "indexless", that is, to use the git/hg/svn repo directly instead of indexing it in mongo and reading from there. (Storing in mongo can take up a lot of space and also adds a delay while the indexing process runs.)
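As a rough illustration of what "indexless" means for a tree listing, here is a minimal sketch that asks git directly rather than reading a stored tree document. The names `TreeEntry`, `parse_ls_tree` and `list_tree` are hypothetical and are not part of Allura; only the `git ls-tree` default output format (`<mode> <type> <object>TAB<file>`) is taken from git itself.

```python
import subprocess
from typing import List, NamedTuple


class TreeEntry(NamedTuple):
    mode: str
    kind: str  # "blob", "tree", or "commit" (for submodules)
    sha: str
    path: str


def parse_ls_tree(output: str) -> List[TreeEntry]:
    """Parse the default output of `git ls-tree`.

    Each line looks like "<mode> <type> <object>\t<file>".
    Paths with unusual characters are quoted by git unless -z is used;
    this sketch does not handle that case.
    """
    entries = []
    for line in output.splitlines():
        if not line:
            continue
        meta, path = line.split("\t", 1)
        mode, kind, sha = meta.split(" ")
        entries.append(TreeEntry(mode, kind, sha, path))
    return entries


def list_tree(repo_dir: str, commit: str, path: str = "") -> List[TreeEntry]:
    """Hypothetical helper: list a directory by calling git, not mongo."""
    args = ["git", "-C", repo_dir, "ls-tree", commit]
    if path:
        # A trailing slash makes git list the directory's contents.
        args.append(path.rstrip("/") + "/")
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return parse_ls_tree(result.stdout)
```

The trade-off is the one described above: nothing is stored or pre-computed, but every page view pays the cost of talking to the SCM.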
Analyze where each of the following collections (models) is used. Perhaps cross-reference by page or function (e.g. browse repo, view commit, etc; also git/svn/hg). Then we can plan which pages' functionality needs to be updated to be able to remove them.
Collections (with relative size factors based on sf.net data):
Basically, we populate these collections whenever a repo update happens (via hook and UI) and use them to display info to users, except for the log view, which talks directly to the SCM.
Query usage pretty much boils down to these controllers:
- MergeRequestController:
  - index - uses Commit in template 'jinja:allura:templates/repo/merge_request.html' via req.commits
  - do_request_merge_edit - uses Commit
- BranchBrowser:
  - log - uses Commit
- CommitBrowser:
  - __init__ - uses Commit
  - index:
    - Tree via self._commit.tree
    - DiffInfoDoc via self._commit.paged_diffs
  - basic - DiffInfoDoc (in commit_basic.html via commit.diffs)
- TreeBrowser:
  - Tree (passed to __init__, in lookup uses Tree.__getitem__)
  - LastCommit in Allura/allura/templates/widgets/repo/tree_widget.html (via tree.ls())
- FileBrowser - uses Tree & Commit.
Git, Hg, SVN (each of them has a version of this):
- BranchBrowser - uses Commit via repo.latest.
Also:
- Allura/allura/lib/macro.py:include_file uses Commit.get_path, which uses Commit.get_tree, which uses Tree
- Allura/allura/model/stats.py uses DiffInfoDoc via commit.diffs, but I'm not sure where stats.py is used

The above includes only places that query one or more of those collections to display something useful to the user. We should be able to get rid of these collections by rewriting the controllers/templates above (I might have missed something, so we should estimate it a bit higher).
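To get the cross-reference the ticket asks for (which pages block the removal of which collection), the controller list above can be inverted mechanically. This is only a sketch: the `USAGE` table is transcribed by hand from the list above, and the helper name `by_collection` is made up for illustration.

```python
from collections import defaultdict

# Controller (or action) -> models it queries, transcribed from the list above.
USAGE = {
    "MergeRequestController.index": ["Commit"],
    "MergeRequestController.do_request_merge_edit": ["Commit"],
    "BranchBrowser.log": ["Commit"],
    "CommitBrowser.__init__": ["Commit"],
    "CommitBrowser.index": ["Tree", "DiffInfoDoc"],
    "CommitBrowser.basic": ["DiffInfoDoc"],
    "TreeBrowser": ["Tree", "LastCommit"],
    "FileBrowser": ["Tree", "Commit"],
}


def by_collection(usage):
    """Invert controller -> models into model -> sorted list of controllers."""
    inverted = defaultdict(list)
    for action, models in usage.items():
        for model in models:
            inverted[model].append(action)
    return {model: sorted(actions) for model, actions in inverted.items()}


if __name__ == "__main__":
    for model, actions in sorted(by_collection(USAGE).items()):
        print(model)
        for action in actions:
            print("  -", action)
```

Read this way, LastCommit has a single consumer (TreeBrowser), while Commit is touched by most controllers.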
More about overall usage below:
TreesDoc, repo_trees
- ForgeSVN/forgesvn/model/svn.py:compute_tree_new (upserts)
  - Commit.get_tree()
    - Commit.tree
    - Commit.get_path()
      - FileBrowser.diff
      - Allura/allura/lib/macro.py:include_file
    - Commit.has_path()
  - Tree.__getitem__()
    - TreeBrowser controller uses it, maybe also used elsewhere, it's hard to tell
- Allura/allura/model/repo_refresh.py:refresh_commit_trees (creates)
  - refresh_repo
    - Repository.refresh
  - TreesDoc for new commits
- Allura/allura/model/repo_refresh.py:compute_diffs (queries)
  - refresh_repo
  - Allura/allura/scripts/refresh_last_commits.py
  - TreesDoc to compute diffs between current and parent commits
- Allura/allura/model/repo_refresh.py:compute_lcds (gets from ModelCache)
  - compute_diffs
  - computes last commit for each tree (presumably to show it on repo browse pages).
Tree, TreeDoc, repo_tree
- Allura/allura/model/repo_refresh.py:trees (query)
- Allura/allura/model/repo_refresh.py:compute_diffs:_update_cache (query)
- GitImplementation.refresh_tree_info (create)
- HgImplementation.refresh_tree_info (create)
- CommitBrowser.__init__ via Commit.tree, which always creates new TreeDoc via repo.compute_tree_new
- SVNImplementation.compute_tree_new (query & upsert)
- Tree.__getitem__
- Allura/allura/model/repo_refresh.py:_pull_tree & _update_tree_cache - helpers, so don't really care.
LastCommit, LastCommitDoc, repo_last_commit
- compute_lcds - produces LastCommit, which is used in:
  - SVNImplementation.compute_tree_new (updates)
  - tree_widget.html (via tree.ls()).
DiffInfoDoc, repo_diffinfo
- compute_diffs - produces DiffInfoDoc, which is used in:
  - Commit.paged_diffs (queries)
    - CommitBrowser.index
  - Commit.diffs
    - Allura/allura/model/stats.py - ?
    - Allura/allura/templates/repo/commit_basic.html (CommitBrowser.basic)
  - Commit.added_paths (queries)
    - LastCommit._prev_commit_id (only as an optimization to exit early if the prev commit doesn't exist)
    - LastCommit._build
- Allura/allura/scripts/refresh_last_commits.py, Allura/allura/scripts/refreshrepo.py - only deleting
- SVNImplementation.refresh_commit_info (creates)

Last edit: Igor Bondarenko 2015-02-17
Thanks.
I also did some measurements of a few SVN imports. Optimizing for SVN imports is of particular interest to me, given how we're using Allura. An import & refresh will fully populate the repo_ci and repo_diffinfo collections. Other collections (repo_trees, repo_tree, repo_last_commit) will populate as the repo is browsed. The sizes of the collections, after browsing around a fair bit (wget spidering), are the same order of magnitude between repo_ci, repo_diffinfo, repo_tree, and repo_last_commit. Good targets to address first would be repo_ci and repo_diffinfo since they are created immediately during "refresh". Next could be repo_tree. Not sure about tackling repo_last_commit since the logic is very complex and caching is necessary because the calculations are expensive.
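If repo_last_commit were ever replaced, some cache would still be needed for the reason given above. The sketch below shows only the general shape of such a cache, an in-process LRU keyed by (commit id, path). It assumes the result for a given commit and path never changes once computed. `make_cached_last_commit` and the `compute` callback are hypothetical names, not Allura APIs, and an in-process cache is not shared across worker processes the way a mongo collection is.

```python
from functools import lru_cache


def make_cached_last_commit(compute, maxsize=10000):
    """Wrap an expensive (commit_id, path) -> last-commit lookup in an LRU cache.

    `compute` is a hypothetical callback that walks the SCM history.
    """

    @lru_cache(maxsize=maxsize)
    def cached(commit_id, path):
        return compute(commit_id, path)

    return cached


if __name__ == "__main__":
    calls = []

    def slow_lookup(commit_id, path):
        # Stand-in for an expensive history walk.
        calls.append((commit_id, path))
        return "abc123"

    last_commit = make_cached_last_commit(slow_lookup)
    last_commit("deadbeef", "trunk/README")
    last_commit("deadbeef", "trunk/README")  # served from the cache
    print(len(calls))
```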