#1861 Figure out how to deal with big tracker imports

v1.0.0
closed
Tracker
nobody
2015-08-20
2011-03-31
No

When a more or less big (~100 tickets) tracker import is run against the sandbox (with an nginx frontend), it ends with:

$ time python allura_import_test.py data/sf100.json -u https://sf-psokolovsky-3014.sb.sf.net -a tckd18490839ab2c834b16b -s 7c43a517b310ab7477f251922282456fa184c6440f2769c76fecfece04b1a58125063030b166d4e8
Importing 100 tickets
Traceback (most recent call last):
  File "allura_import_test.py", line 93, in <module>
    res = cli.call(url, doc=doc_txt, options=json.dumps(import_options))
  File "allura_import_test.py", line 64, in call
    raise e
urllib2.HTTPError: HTTP Error 504: Gateway Time-out

real    0m31.727s
user    0m0.300s
sys 0m0.048s

I.e. nginx or some other intermediate component times out at 30s. So the current approach of providing import data in ForgePlucker format as one big JSON and then executing the import synchronously (to return status to the client) is not viable and needs replacement/augmentation. The following choices can be proposed:

  1. Instead of submitting JSON with multiple tickets, provide an API call to import tickets one by one (this was one of the solutions originally proposed). We can still use ForgePlucker as the interchange format; it will just be parsed by the import script instead of within Allura.
  2. Go in the other direction and make import a UI feature, where the user uploads JSON to a web page and then uses that page to track progress/get the final status.
  3. An API-esque compromise between 1 & 2: there's one call to post an import job (an import id is returned), the job is then processed as an async task, and there's another API call to check the status of that import id.
  4. Regardless of the approach chosen, how imported tickets are created should also be optimized: they should go straight to Mongo, with as little post- and side-processing as possible. The only thing that should be done on them is SOLR indexing, and even for that it probably makes sense to just queue a complete tracker re-index after the import finishes (unless choice 1 is used).

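The client side of choice 3 could be sketched roughly as follows. This is a minimal illustration only: `submit_import` and `check_status` are hypothetical stand-ins for the two proposed API calls, which do not exist yet.

```python
import time

# Hypothetical polling loop for choice 3: one call posts the import job
# and returns an id, a second call checks the status of the async task.
def wait_for_import(submit_import, check_status, poll_interval=0):
    job_id = submit_import()            # post the job, get an import id back
    while True:
        status = check_status(job_id)   # poll the async task's status
        if status in ('complete', 'error'):
            return status
        time.sleep(poll_interval)

# usage with stub callables simulating a job that finishes on the 3rd poll
states = iter(['queued', 'running', 'complete'])
result = wait_for_import(lambda: 'job-1', lambda _id: next(states))
# result == 'complete'
```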
Choice 1 is by far the easiest to implement: it would just require what #1767 already queued up, with the chunks split down to size 1.
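As a rough sketch of choice 1's client side, assuming a doc whose tickets live under a top-level `artifacts` list (an illustrative assumption, not the real ForgePlucker schema):

```python
import json

# Hypothetical splitter for choice 1: take one big export and emit one
# single-ticket document per API call, so no request runs long enough to
# hit the 30 s proxy timeout. The doc layout (a top-level 'artifacts'
# list) is an illustrative assumption, not the actual ForgePlucker format.
def split_tickets(doc_text):
    doc = json.loads(doc_text)
    for ticket in doc['artifacts']:
        # copy the doc's metadata, but carry only one ticket per chunk
        yield json.dumps(dict(doc, artifacts=[ticket]))

# usage: each yielded chunk would be POSTed as its own import call
big_doc = json.dumps({'forge': 'test', 'artifacts': [{'id': 1}, {'id': 2}]})
chunks = list(split_tickets(big_doc))
# len(chunks) == 2; each chunk contains exactly one ticket
```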

Related

Tickets: #1766
Tickets: #1767

Discussion

  • Paul Sokolovsky

    Paul Sokolovsky - 2011-03-31

    Well, solutions 2 & 3 do have the drawback that there's still a max size limit for the import doc (the request body).

     
  • Dave Brondsema

    Dave Brondsema - 2011-04-01

    Any guesses how much overhead #1 would have? It's a separate request/response for each ticket instead of just one. But it sidesteps the issue you mentioned with 2 & 3 (the file size limit).

    I also like 1 because we can have just one API, not two. We already have an oauth API for POST to /mytickets/new. I'm not sure if it's syntactically similar to ForgePlucker, but I don't think that is necessarily important. We'd just have to add a field or two that can only be set for migration (e.g. 'reported_by').

     
  • Paul Sokolovsky

    Paul Sokolovsky - 2011-04-01

    Discussed this with Dave, and we agreed that choice 1 gives the best solution to the issue. Implementation-wise, the existing infrastructure for the ForgePlucker format should be used; the docs submitted should just contain a single ticket per call. This essentially reduces to [#1767] with a chunk size of 1. And my tests show a big improvement in import time: 1K tickets are now imported in ~14 minutes locally. SOLR indexing still takes about an hour, but now the background indexing tasks are properly queued/scheduled, and there's no bottleneck for import API calls.

    So, closing this ticket; changes will go against [#1767].

    • labels: --> import
    • status: open --> closed
    • size: --> 2
    • milestone: limbo --> apr-7
     

    Related

    Tickets: #1767

