#6133 User stats scaling problems & matplotlib/numpy dependency

v1.0.0
closed
General
Cory Johns
2015-08-20
2013-04-18
No

With over 129000 userstat records in the system, viewing a single user's stats runs for 15 minutes and then fails. The stat record is https://sourceforge.net/p/allura/pastebin/51701870b9363c6030ce62f7/ which is big (it contains lists of posts and tickets, for example) but not necessarily too big.

I don't know exactly what the scaling problem is, but that one document could be tested on its own. If it isn't the problem, perhaps the issue is the comparison against all the other users' stats.

Discussion

  • Stefano - 2013-04-18

    The main problem is the comparison of each single user with all the other users of the forge. I think we should remove this part, which is the last section of the userstats profile, so that in the meantime we can work out how to improve performance. If you want, you can try removing it before running the forge, to verify this hypothesis. Let us know if you want us to proceed in this direction.
    As for the remaining data, all the queries involve only a single userstats document, which includes a few counters and a list of events registered during the last 30 days. Maybe this could be improved, but it should not cause a similar issue. In any case, we are already thinking about how to reduce its size.
    Of course, we are available to support you in solving this problem. If you have other relevant data, feel free to share it with us.

     
  • Dave Brondsema - 2013-04-18

    Removing the comparison makes sense to me. If you want to proceed with that, go ahead. It'll probably be at least several days before I'd have a chance to.

    Our production release process does require that we have the changes committed and merged, so we can't do a quick change in production to test it first.

     
  • Dave Brondsema - 2013-05-05
    • summary: User stats scaling problems --> User stats scaling problems & matplotlib/numpy dependency
    • status: open --> in-progress
    • assigned_to: Dave Brondsema
    • Milestone: forge-backlog --> forge-may-17
     
  • Dave Brondsema - 2013-05-05

    Most of the matplotlib/numpy usage is from the global ranking, so now's a good time to remove those dependencies anyway (issues with them included: large size of packages, long compile time, g++ required, difficult to install on some systems, had to manually install numpy before pip install -r requirements.txt for some reason).

     
  • Dave Brondsema - 2013-05-05

    Changes in db/6133

     
  • Dave Brondsema - 2013-05-05
    • status: in-progress --> code-review
     
  • Cory Johns - 2013-05-06
    • QA: Cory Johns
     
  • Cory Johns - 2013-05-06
    • status: code-review --> closed
     
  • Cory Johns - 2013-05-06

    Some things to consider if we want to revisit adding these features back in:

    • The Mongo aggregation framework could be leveraged to remove the need for blanket self.query.find() queries
    • Additionally or alternatively, all stats should probably be computed ahead of time as opposed to on-demand as these troublesome queries were being done
    • Simple bar graphs / histograms or percentage bars can be implemented using tricks like rendering in CSS by setting scaled width or height values of elements with a fill-colored background, to avoid requiring heavy math and graphing libraries (e.g., http://www.xul.fr/en/css/bar-chart.php — since we can render the CSS directly from python, we don't even need javascript).
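
A minimal sketch of the CSS-rendered bar idea from the last point, assuming the per-category counts have already been computed elsewhere. The function name `render_bar_chart`, the CSS class names, the color, and the sample numbers are all hypothetical and are not taken from the Allura code base; the sketch uses only the Python standard library, so no matplotlib or numpy is needed.

```python
from html import escape


def render_bar_chart(counts, max_width_px=300):
    """Return an HTML fragment with one CSS-scaled bar per (label, value) pair.

    `counts` is a list of (label, non-negative number) tuples. The widest bar
    gets `max_width_px`; the others are scaled proportionally.
    """
    # Scale against the largest value; fall back to 1 to avoid dividing by zero.
    largest = max((value for _, value in counts), default=0) or 1
    rows = []
    for label, value in counts:
        width = round(max_width_px * value / largest)
        rows.append(
            '<div class="bar-row">'
            f'<span class="bar-label">{escape(label)}</span>'
            f'<div class="bar" style="width: {width}px; height: 12px; '
            f'background-color: #4a90d9;"></div>'
            f'<span class="bar-value">{value}</span>'
            '</div>'
        )
    return '<div class="bar-chart">\n' + "\n".join(rows) + "\n</div>"


if __name__ == "__main__":
    # Made-up sample numbers, not real forge statistics.
    print(render_bar_chart([("commits", 40), ("tickets", 10), ("posts", 0)]))
```

The returned fragment could be dropped into a template as-is. Because the scaling happens on the server, the page needs no JavaScript, which matches the point made above.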
     
  • Dave Brondsema - 2013-05-17
    • Size: --> 1
     
