Apache Allura™

Forge software for hosting software projects

Brought to you by: alexluberg, brondsem, ccruz, deshani, and 11 others

#6133 User stats scaling problems & matplotlib/numpy dependency

Milestone: v1.0.0

Status: closed

Owner: Dave Brondsema

Labels: performance (126) sf-1 (616)

Component: General

Reviewer: Cory Johns

Updated: 2015-08-20

Created: 2013-04-18

Creator: Dave Brondsema

Private: No

With over 129000 userstat records in the system, viewing a single user's stats page runs for 15 minutes and then fails. The stat record is https://sourceforge.net/p/allura/pastebin/51701870b9363c6030ce62f7/ which is big (it contains lists of posts and tickets, for example) but not necessarily too big.

I don't know what the scaling problem is necessarily, but the one document could be tested on its own. If that's not the problem, perhaps it is an issue with comparing to all the other users' stats.

Discussion

Dave Brondsema - 2013-04-18

Also, from the mongo server log, some enormous queries: https://sourceforge.net/p/allura/pastebin/51701a1ec4d1041a7a12f793/

Stefano - 2013-04-18

The main problem is the comparison of each user with all the other users of the forge. I think we should remove this part, which is the last section of the userstats profile, so that in the meantime we can work out how to improve performance. If you want, you can try removing it before running the forge, to verify this hypothesis. Let us know if you want us to proceed in this direction.
As for the remaining data, all the queries involve only a single userstats document, which includes a few counters and a list of events registered during the last 30 days. Maybe this could be improved, but it should not cause a similar issue. In any case, we are already thinking about how to reduce its size.
We are, of course, available to help you solve this problem. If you have other relevant data, feel free to share it with us.

Dave Brondsema - 2013-04-18

Removing the comparison makes sense to me. If you want to proceed with that, go ahead. It'll probably be at least several days before I'd have a chance to.

Our production release process does require that we have the changes committed and merged, so we can't do a quick change in production to test it first.

Dave Brondsema - 2013-05-05

summary: User stats scaling problems --> User stats scaling problems & matplotlib/numpy dependency

status: open --> in-progress

assigned_to: Dave Brondsema

Milestone: forge-backlog --> forge-may-17

Dave Brondsema - 2013-05-05

Most of the matplotlib/numpy usage is from the global ranking, so now's a good time to remove those dependencies anyway. Issues with them included: large size of packages, long compile time, g++ required, difficult to install on some systems, and having to manually install numpy before pip install -r requirements.txt for some reason.

Dave Brondsema - 2013-05-05

Changes in db/6133

Dave Brondsema - 2013-05-05

status: in-progress --> code-review

Cory Johns - 2013-05-06

QA: Cory Johns

Cory Johns - 2013-05-06

status: code-review --> closed

Cory Johns - 2013-05-06

Some things to consider if we want to revisit adding these features back in:

The Mongo aggregation framework could be leveraged to remove the need for blanket self.query.find() queries

Additionally or alternatively, all stats should probably be computed ahead of time as opposed to on-demand as these troublesome queries were being done
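As a rough illustration of the two points above, a user's rank for one counter can be obtained by asking the server to count the users ahead of them instead of fetching every userstats document. This is only a sketch: the field name `tot_commits`, the collection name `userstats`, and both helper functions are hypothetical and not taken from the Allura code.

```python
def rank_pipeline(field, value):
    """Build an aggregation pipeline that counts the documents whose
    `field` is strictly greater than `value` (i.e. the users ahead)."""
    return [
        {"$match": {field: {"$gt": value}}},
        {"$group": {"_id": None, "users_ahead": {"$sum": 1}}},
    ]


def rank_from_result(result_docs):
    """Turn the aggregation output into a 1-based rank.

    An empty result means no user is ahead, so the rank is 1.
    """
    docs = list(result_docs)
    users_ahead = docs[0]["users_ahead"] if docs else 0
    return users_ahead + 1


# With pymongo and a live server this might be run roughly as follows
# (an assumption, not Allura code):
#   docs = db.userstats.aggregate(rank_pipeline("tot_commits", 42))
#   rank = rank_from_result(docs)
```

If the result were computed by a periodic job and stored on the user's own stats document, the profile page would only ever need to read that single document.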

Simple bar graphs / histograms or percentage bars can be implemented using tricks like rendering in CSS by setting scaled width or height values of elements with a fill-colored background, to avoid requiring heavy math and graphing libraries (e.g., http://www.xul.fr/en/css/bar-chart.php — since we can render the CSS directly from python, we don't even need javascript).
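A minimal sketch of the CSS approach, assuming Python 3 and plain string templating; the function name, class names, and default sizes are made up for illustration and are not part of Allura:

```python
from html import escape


def render_bar_chart(values, max_width_px=300, color="#4a90d9"):
    """Return an HTML fragment with one horizontal bar per (label, value) pair.

    Bar widths are scaled so that the largest value spans max_width_px.
    """
    largest = max((v for _, v in values), default=0)
    rows = []
    for label, value in values:
        # Scale each bar relative to the largest value; avoid dividing by zero.
        width = round(max_width_px * value / largest) if largest > 0 else 0
        rows.append(
            '<div class="bar-row">'
            f'<span class="bar-label">{escape(str(label))}</span>'
            f'<div class="bar" style="width:{width}px;height:12px;'
            f'background-color:{color};"></div>'
            f'<span class="bar-value">{value}</span>'
            '</div>'
        )
    return "\n".join(rows)
```

Calling `render_bar_chart([("commits", 10), ("tickets", 5)])` yields two `div` rows whose bars are 300px and 150px wide, with no graphing library or JavaScript involved.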

Dave Brondsema - 2013-05-17

Size: --> 1