#6192 Research & fix large-scale issues with running ReindexCommand

v1.0.0
closed
nobody
42cc (432)
General
2015-08-20
2013-05-01
No

For upcoming solr schema changes, we will reindex everything in Allura, via the ReindexCommand. This will take quite a while. [#6191] will help, but the ReindexCommand code itself will need some improvements to handle this process. Evaluate these concerns, do further research & testing at small scales where needed, and make code changes.

  • the only options to limit what is selected are --project and --neighborhood. We'll want to run some medium-sized batches, so allowing --project to be a regex will give us more flexibility in the batches we run (I think we do this in some other script or command too). A rough sketch of the new options follows this list.
  • the ref_ids list could get extremely large even for a single project, and too big to save in mongo (when add_artifacts.post records the task). Consider using the BatchIndexer class from forge-classic/sfx/lib/migrate.py, or at least its _post method's technique for splitting tasks into smaller pieces. You can move BatchIndexer into Allura if it makes sense to use it.
  • rebuilding artifact references & shortlinks is also part of ReindexCommand. In theory you can run it with --solr to skip that rebuilding, which would make it go faster. However, I recall things not working completely right when doing that. Research and test this risk.
  • the solr delete step will be unnecessary when we run this against a brand new, empty solr instance. Add an option to skip the solr delete.
  • anything else you can think of for running this on a huge scale?
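
As a rough illustration of the --project and skip-delete points above, the option handling could look something like the following. This is only a sketch, not the actual ReindexCommand code: the --skip-solr-delete name, the filtering helper, and the stand-in data are assumptions.

    import re
    from optparse import OptionParser

    parser = OptionParser()
    parser.add_option('--project', default='',
                      help='regex of project shortnames to reindex (default: all)')
    parser.add_option('--skip-solr-delete', action='store_true', default=False,
                      help='skip the initial solr delete, e.g. when filling a brand new empty solr')
    options, args = parser.parse_args()

    def select_projects(shortnames, pattern):
        """Filter project shortnames by regex; an empty pattern selects everything."""
        if not pattern:
            return list(shortnames)
        regex = re.compile(pattern)
        return [name for name in shortnames if regex.search(name)]

    # Stand-in data; the real command would query mongo for Project documents.
    print(select_projects(['p/allura', 'p/allura-git', 'u/someuser'], options.project))

    if not options.skip_solr_delete:
        pass  # existing behavior: clear the solr index before re-adding documents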

Related

Tickets: #6191
Tickets: #6192

Discussion

  • Igor Bondarenko - 2013-05-02

    Created the following tickets for now:

    • #335: [#6192] ReindexCommand: add new options (1cp) (For --project and --skip-solr-delete options)
    • #336: [#6192] ReindexCommand: better ref_ids handling (2cp)
    • #337: [#6192] ReindexCommand: research artifact references and shortlinks rebuilding (3cp)
     

    Related

    Tickets: #6192

  • Igor Bondarenko - 2013-05-02
    • status: open --> in-progress
     
  • Igor Bondarenko - 2013-05-07
    • status: in-progress --> code-review
     
  • Igor Bondarenko - 2013-05-07

    Closed #335. je/42cc_6192

     
  • Igor Bondarenko - 2013-05-07
    • status: code-review --> in-progress
     
  • Igor Bondarenko - 2013-05-08

    Closed #336. je/42cc_6192

    I don't see the benefit of using the entire BatchIndexer, because it just extends ArtifactSessionExtension and splits tasks into chunks when artifacts have been added or changed (and so need to be re-indexed). But during execution of the re-index command we never change artifacts and, in fact, never flush the artifact session, so that code never gets called.

    However, using BatchIndexer._post's technique for splitting tasks is a good fit here, because we're creating these tasks manually and can split them right away.
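
    For reference, a minimal sketch of that splitting technique, assuming the add_artifacts task keeps its ref_ids signature and that an oversized task document surfaces as an InvalidDocument whose message starts with "BSON document too large" (the helper name is hypothetical):

        from bson.errors import InvalidDocument
        from allura.tasks import index_tasks

        def post_add_artifacts(ref_ids):
            """Post an add_artifacts task; if the resulting task document is too
            large for mongo, split the ref_ids list in half and retry each half."""
            try:
                index_tasks.add_artifacts.post(ref_ids)
            except InvalidDocument as e:
                # Only the "document too large" flavor is recoverable by splitting;
                # anything else should propagate.
                if str(e).startswith('BSON document too large'):
                    mid = len(ref_ids) // 2
                    post_add_artifacts(ref_ids[:mid])
                    post_add_artifacts(ref_ids[mid:])
                else:
                    raise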

    Also, I've tested the --solr option a little bit and it seems to work fine. Do you recall what kind of problems it was causing (invalid 'related artifacts' links, missing index entries, etc.)? It would be helpful to know which direction to dig in for #337. Thanks.


    JFYI: the comment in BatchIndexer.flush is inaccurate:

    # cls.to_delete - contains solr index ids which can easily be over
    #                 100 bytes. Here we allow for 160 bytes avg, plus
    #                 room for other document overhead.
    # cls.to_add - contains BSON ObjectIds, which are 12 bytes each, so 
    #              we can easily put 1m in a doc with room left over.
    

    cls.to_add also contains index ids, just like cls.to_delete, not ObjectIds. So chunks for cls.to_add should be the same size as chunks for cls.to_delete. Bigger chunks still work, because _post splits them anyway, but the comment is a little misleading. Maybe you'll want a low-priority ticket to fix this comment and the corresponding code. Just letting you know :)
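
    For a rough sense of scale under that comment's assumptions (about 160 bytes per solr index id, and mongo's 16 MB document limit), a chunk should stay below roughly 100,000 index ids:

        # back-of-the-envelope: how many ~160-byte index ids fit in a 16 MB document
        >>> 16 * 1024 * 1024 // 160
        104857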

     
  • Dave Brondsema - 2013-05-08

    I don't remember exactly what errors I had with --solr. If it works well for you now, we can leave it at that and do further work if errors come up in the future.

     
  • Igor Bondarenko - 2013-05-09

    Ok, I'm going to do a little more testing for this in #337 (maybe an hour or so) to make sure.

     
  • Igor Bondarenko - 2013-05-10
    • status: in-progress --> code-review
     
  • Igor Bondarenko - 2013-05-10

    Closed #337.

    I've tested a little bit more and it seems like --solr works fine.

    You can now review what's in je/42cc_6192.

     
  • Dave Brondsema - 2013-05-16
    • QA: Dave Brondsema
     
  • Dave Brondsema - 2013-05-16
    • Milestone: forge-backlog --> forge-may-17
     
  • Dave Brondsema - 2013-05-16
    • status: code-review --> closed
     
