SCM tools in Allura

The interface API and most of the controller and view structure of code repository type apps is defined in the base classes in the Allura package, which consists of the classes in the following packages:

  • allura.lib.repository for the base application and implementation classes
  • allura.controllers.repository for the base controllers
  • allura.model.repository for the repo metadata models
  • allura.model.repo_refresh for the repo metadata refresh logic

Application and Implementation

The Application structure for SCM apps follows the normal pattern for Allura applications, though they should inherit from allura.lib.repository.RepositoryApp instead of allura.app.Application. The apps are then responsible for implementing subclasses of allura.lib.repository.Repository and allura.lib.repository.RepositoryImplementation.

The Repository subclass is responsible for implementing tool-specific logic using the metadata models and proxying the rest of its logic to the RepositoryImplementation subclass, which talks directly to the underlying SCM tool, such as GitPython, Mercurial, or pysvn.

Historically, more was done in the Repository subclass using the metadata models, but we are trying to move away from those toward making the SCM apps thin wrappers around the underlying SCM tool (see Indexless, below).

Controller / View Dispatch

All of the SCM apps use the base controllers in allura.controllers.repository with only minimal customization through subclassing (primarily to override the template for displaying instructions for setting up a newly created repository), so the dispatch for all SCM apps follows the same pattern.

The root controller for SCM apps is allura.controllers.repository.RepoRootController. This controller has views for repo-level actions, such as forking and merging, and the SCM app attaches a refs and a commits (ci) controller to dispatch symbolic references and explicit commit IDs, respectively. (This should be refactored to be done in the RepoRootController so that the dispatch can be followed more easily.) (Also, ForgeSVN actually eschews this class and uses BranchBrowser directly as its root, in order to tweak the URL slightly, but it monkeypatches the relevant views over, so the dispatch ends up working more or less the same.)

The refs controller, allura.controllers.repository.RefsController, handles symbolic references, and in particular handles custom escape handling to detect what is part of the ref name vs part of the remainder of the URL to dispatch. This is then handed off to the BranchBrowserClass which is a pointer to the implementation within the specific SCM app which handles the empty repo instructions or hands it back to the generic commits controller.

The commits controller, allura.controllers.repository.CommitsController, originally only handled explicit commit IDs, but was modified to allow for persistent symbolic refs in the URL (“/p/allura/git/ci/master/”, e.g.), it was changed to have the same escape parsing logic as the refs controller. Regardless, it just parses out the reference / ID from the URL and hands off to the commit browser.

The commit browser, allura.controllers.repository.CommitBrowser, holds the views related to a specific commit, such as viewing the commit details (message and changes), a log of the commit history starting with the commit, or creating a snapshot of the code as of the commit. It also has a “tree” endpoint for browsing the file system tree as of the commit.

The tree browser, allura.controllers.repository.TreeBrowser, holds the view for viewing the file system contents at a specific path for a given commit. The only complication here is that, instead of parsing out the entire tree path at once, it recursively dispatches to itself to build up the path a piece at a time. Tree browsing also depends on the Last Commit Logic to create the data needed to display the last commit that touched each file or directory within the given directory.

Last Commit Logic

Determining which commit was the last to touch a given set of files or directories can be complicated, depending on the specific SCM tool. Git and Mercurial require manually walking up the commit history to discover this information, while SVN can return it all from a single info2 command (though the SVN call will be significantly slower than any individual call to Git or Mercurial). Because this can sometimes be costly to generate, it is cached via the allura.model.repository.LastCommit model. This will generate the data on demand by calling the underlying SCM tool, if necessary, but the data is currently pre-generated during the post-push refresh logic for Git and Mercurial.

The overall logic for generating this data for Git and Mercurial is as follows:

  1. All items modified in the current commit get their info from the current commit
  2. The info for the remaining items is found:
  • If there is a LastCommit record for the parent directory, all of the remaining items get their info from the previous LastCommit
  • Otherwise, the list of remaining items is sent to the SCM implementation, which repeatedly asks the SCM for the last commit to touch any of the items for which we are missing information, removing items from the list as commits are reported as having modified them
  1. Once all of the items have information, or if a processing timeout is reached, the gathered information is saved in the LastCommit model and returned

Indexless

Currently, there are model classes which encapsulate SCM metadata (such as commits, file system structure, etc) in a generic (agnostic to the underlying tool implementation) way. However, this means that we’re duplicating in mongo a lot of data that is already tracked by the underlying SCM tool, and this data must also be indexed for new repos and after every subsequent push before the commits or files are browsable via the web interface.

To minimize this duplication of data and reduce or eliminate the delay between commits being pushed and them being visible, we are trying to move toward a lightweight API layer that requests the data from the underlying SCM tool directly, with intelligent caching at the points and in the format that makes the most sense to make rendering the SCM pages as fast as possible.