Apache Allura™ / Tickets / #7828 Analyze & document usage of repo collections

#7828 Analyze & document usage of repo collections

Milestone: v1.3.1

Status: closed

Owner: Igor Bondarenko

Labels: 42cc, sf-2, indexless

Component: General

Reviewer: Dave Brondsema

Updated: 2015-03-02

Created: 2015-02-09

Creator: Dave Brondsema

Private: No

We've done some work in the past to make our SCM repos "indexless", that is, to use the git/hg/svn repo directly instead of indexing it in mongo and reading from there. (Storing in mongo can take up a lot of space and also adds a delay while the indexing process runs.)

Analyze where each of the following collections (models) is used. Perhaps cross-reference by page or function (e.g. browse repo, view commit, etc; also git/svn/hg). Then we can plan which pages' functionality needs to be updated to be able to remove them.

Collections (with relative size factors based on sf.net data):

repo_trees (4x)
repo_tree (2x)
repo_last_commit (2x)
repo_diffinfo (1x)
repo_ci (very tiny)
repo_commitrun (very tiny)

Dave Brondsema - 2015-02-09

labels: indexless, sf-current --> indexless, sf-current, sf-2

Igor Bondarenko - 2015-02-10

Owner: Anonymous --> Igor Bondarenko

Labels: indexless, sf-current, sf-2 --> 42cc, sf-current, sf-2, indexless

Status: open --> in-progress

Igor Bondarenko - 2015-02-17

status: in-progress --> review

Igor Bondarenko - 2015-02-17

Basically, we populate these collections whenever a repo update happens (via hook and UI) and use them to display info to users, except for the log view, which talks directly to the SCM.

Query usage pretty much boils down to these controllers:

MergeRequestController:

index - uses Commit in template 'jinja:allura:templates/repo/merge_request.html' via req.commits

do_request_merge_edit - uses Commit

BranchBrowser:

log - uses Commit

CommitBrowser:

__init__ - uses Commit

index:

uses Tree via self._commit.tree

DiffInfoDoc via self._commit.paged_diffs

basic - DiffInfoDoc (in commit_basic.html via commit.diffs)

TreeBrowser:

uses Tree (passed to __init__, in lookup uses Tree.__getitem__)

LastCommit in Allura/allura/templates/widgets/repo/tree_widget.html (via tree.ls())

FileBrowser - uses Tree & Commit

.

Git, Hg, SVN (each of them has a version of this):

BranchBrowser - uses Commit via repo.latest

.

Also:

Macro for including a file from a repo: Allura/allura/lib/macro.py:include_file uses Commit.get_path, which uses Commit.get_tree, which uses Tree

Allura/allura/model/stats.py uses DiffInfoDoc via commit.diffs, but I'm not sure where stats.py is used

The above includes only places that query one or more of those collections to display something useful to the user. We should be able to get rid of these collections by rewriting the controllers/templates above (I might have missed something, so we should estimate it a bit higher).
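To help plan which pages must be rewritten before a given model can be dropped, the controller list above can be inverted into a model-to-controllers lookup. This is only a sketch: the `CONTROLLER_MODELS` dict is a hand transcription of the list above (it leaves out the `include_file` macro and `stats.py`, which are not controllers), and `controllers_by_model` is a hypothetical helper name, not something that exists in Allura.

```python
from collections import defaultdict

# Models each controller reads, transcribed by hand from the list above.
CONTROLLER_MODELS = {
    "MergeRequestController": {"Commit"},
    "BranchBrowser": {"Commit"},
    "CommitBrowser": {"Commit", "Tree", "DiffInfoDoc"},
    "TreeBrowser": {"Tree", "LastCommit"},
    "FileBrowser": {"Tree", "Commit"},
}


def controllers_by_model(controller_models):
    """Invert controller -> models into model -> sorted list of controllers."""
    inverted = defaultdict(set)
    for controller, models in controller_models.items():
        for model in models:
            inverted[model].add(controller)
    return {model: sorted(names) for model, names in inverted.items()}


# Print one line per model, listing the controllers that depend on it.
for model, names in sorted(controllers_by_model(CONTROLLER_MODELS).items()):
    print(f"{model}: {', '.join(names)}")
```

Read this way, DiffInfoDoc and LastCommit each depend on a single controller in this list, while Commit and Tree are spread across several.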

More about overall usage below:

TreesDoc, repo_trees

ForgeSVN/forgesvn/model/svn.py:compute_tree_new (upserts)

Commit.get_tree()

Commit.tree

almost everywhere

Commit.get_path()

Diff between two revisions of the same file (FileBrowser.diff)

Macro for including file from a repo Allura/allura/lib/macro.py:include_file

Commit.has_path()

nowhere

Tree.__getitem__()

TreeBrowser controller uses it, maybe also used elsewhere, it's hard to tell

Allura/allura/model/repo_refresh.py:refresh_commit_trees (creates)

refresh_repo

Repository.refresh

git hook

basically whenever repo refresh happens we create TreesDoc for new commits

Allura/allura/model/repo_refresh.py:compute_diffs (queries)

refresh_repo

script task Allura/allura/scripts/refresh_last_commits.py

uses info from TreesDoc to compute diffs between current and parent commits

Allura/allura/model/repo_refresh.py:compute_lcds (gets from ModelCache)

pretty much the same places as compute_diffs; computes the last commit for each tree (presumably to show it on repo browse pages)
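For readers unfamiliar with what "compute diffs between current and parent commits" involves, here is a minimal sketch of the general idea. It assumes each commit's tree has been flattened into a `{path: object_id}` dict, which is a simplification for illustration and not the actual TreesDoc schema. `diff_trees` is a hypothetical name, not the code in repo_refresh.py.

```python
def diff_trees(parent_tree, child_tree):
    """Compare two {path: object_id} snapshots (an assumed, simplified shape).

    Returns (added, removed, changed) as sorted lists of paths.
    """
    parent_paths = set(parent_tree)
    child_paths = set(child_tree)
    added = sorted(child_paths - parent_paths)
    removed = sorted(parent_paths - child_paths)
    # A path counts as changed when it exists on both sides with different ids.
    changed = sorted(
        path
        for path in parent_paths & child_paths
        if parent_tree[path] != child_tree[path]
    )
    return added, removed, changed


parent = {"README": "a1", "src/main.py": "b1", "old.txt": "c1"}
child = {"README": "a1", "src/main.py": "b2", "new.txt": "d1"}
print(diff_trees(parent, child))
```

An indexless version would need the same two snapshots, but would read them from the SCM on demand rather than from stored documents.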

.

Tree, TreeDoc, repo_tree

Allura/allura/model/repo_refresh.py:trees (query)

seems to be unused

Allura/allura/model/repo_refresh.py:compute_diffs:_update_cache (query)

well, used only inside, to help calculate diffs

GitImplementation.refresh_tree_info (create)

HgImplementation.refresh_tree_info (create)

CommitBrowser.__init__ via Commit.tree, which always creates new TreeDoc via repo.compute_tree_new

SVNImplementation.compute_tree_new (query & upsert)

Tree.__getitem__

Allura/allura/model/repo_refresh.py:_pull_tree & _update_tree_cache - helpers, so don't really care

.

LastCommit, LastCommitDoc, repo_last_commit

compute_lcds - produces LastCommit, which is used in:

SVNImplementation.compute_tree_new (updates)

in tree_widget.html (via tree.ls())
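As a rough illustration of what a "last commit per path" computation has to do, the sketch below walks a linear history of simplified `{path: object_id}` snapshots and records, for each path, the newest commit in which its object id changed. The data shape and the name `last_commits` are assumptions made for this example; the real compute_lcds logic handles far more (branching history, directories, caching).

```python
def last_commits(history):
    """Map each path in the newest snapshot to the commit that last changed it.

    `history` is a list of (commit_id, {path: object_id}) pairs ordered
    oldest first. This linear-history shape is an assumption of the sketch.
    """
    result = {}
    previous = {}
    for commit_id, tree in history:
        for path, object_id in tree.items():
            # New path, or same path with different content: this commit
            # becomes the latest one to have touched it.
            if previous.get(path) != object_id:
                result[path] = commit_id
        previous = tree
    newest = history[-1][1] if history else {}
    # Drop paths that no longer exist in the newest snapshot.
    return {path: cid for path, cid in result.items() if path in newest}


history = [
    ("c1", {"README": "a1", "app.py": "b1"}),
    ("c2", {"README": "a1", "app.py": "b2"}),
    ("c3", {"README": "a1", "app.py": "b2", "docs.txt": "d1"}),
]
print(last_commits(history))
```

Even in this simplified form the work grows with the number of commits times the number of paths, which gives a feel for why the result is stored rather than recomputed on every page view.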

.

DiffInfoDoc, repo_diffinfo

compute_diffs - produces DiffInfoDoc, which is used in:

Commit.paged_diffs (queries)

Displays diffs for commit. CommitBrowser.index

Commit.diffs

Allura/allura/model/stats.py - ?

Allura/allura/templates/repo/commit_basic.html (CommitBrowser.basic)

Commit.added_paths (queries)

LastCommit._prev_commit_id (only as an optimization to exit early if the prev commit doesn't exist)

LastCommit._build

Allura/allura/scripts/refresh_last_commits.py, Allura/allura/scripts/refreshrepo.py - only deleting

SVNImplementation.refresh_commit_info (creates)

Last edit: Igor Bondarenko 2015-02-17

Dave Brondsema - 2015-02-18

status: review --> closed

Reviewer: Dave Brondsema

Dave Brondsema - 2015-02-18

Thanks.

I also did some measurements of a few SVN imports. Optimizing for SVN imports is a particular priority for me, given how we're using Allura. An import & refresh will fully populate the repo_ci and repo_diffinfo collections. Other collections (repo_trees, repo_tree, repo_last_commit) will populate as the repo is browsed. The sizes of the collections, after browsing around a fair bit (wget spidering), are of the same order of magnitude for repo_ci, repo_diffinfo, repo_tree, and repo_last_commit. Good targets to address first would be repo_ci and repo_diffinfo, since they are created immediately during "refresh". Next could be repo_tree. Not sure about tackling repo_last_commit, since the logic is very complex and caching is necessary because the calculations are expensive.
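For anyone wanting to repeat this kind of measurement, a sketch along the following lines could collect the sizes. It is not the method used for the numbers above. It assumes `db` behaves like a pymongo `Database` and relies on the MongoDB `collstats` command, whose reply includes a `size` field; `collection_sizes` is a hypothetical helper name.

```python
# Collection names taken from the ticket description.
REPO_COLLECTIONS = [
    "repo_ci",
    "repo_diffinfo",
    "repo_tree",
    "repo_trees",
    "repo_last_commit",
    "repo_commitrun",
]


def collection_sizes(db, names=REPO_COLLECTIONS):
    """Return {collection_name: size_in_bytes}, largest first.

    `db` is assumed to behave like a pymongo Database; only its command()
    method is used. How a missing collection is reported depends on the
    server version, and this sketch does not handle that case.
    """
    sizes = {}
    for name in names:
        stats = db.command("collstats", name)
        sizes[name] = stats.get("size", 0)
    # Sort descending by size so the biggest targets come first.
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

With pymongo installed, `db` would come from something like `MongoClient()[database_name]`, where the database name depends on the deployment and is not given in this ticket.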


Dave Brondsema - 2015-02-23

labels: 42cc, sf-current, sf-2, indexless --> 42cc, sf-2, indexless

Dave Brondsema - 2015-08-10

Milestone: unreleased --> v1.3.1
