taskd should have a way to detect when a large number of tasks have failed. I think this would be most useful counted by task type, across all taskd instances. (Would each taskd instance query mongo for that occasionally, or would a separate script run from cron?)
When a large percentage of tasks have failed, what should happen depends on the type of task and the deployment situation, so this needs to be flexible. Some default behaviors that would be useful: email somebody, or stop processing more events of that type.
For upgrades, we'd want to stop processing upgrades if there are too many failures.
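
To make the idea concrete, here is a rough sketch of the per-type check in Python. Everything in it is an assumption for illustration, not existing taskd code: the task records are plain dicts with hypothetical `type` and `status` fields (as they might come back from a mongo query), and the policy table, thresholds, action names (`email`, `pause`), and `min_tasks` cutoff are made up.

```python
from collections import Counter

# Hypothetical policy table: failure-rate threshold and action per task type.
# Types not listed fall back to DEFAULT_POLICY.
DEFAULT_POLICY = {"threshold": 0.5, "action": "email"}
POLICIES = {"upgrade": {"threshold": 0.2, "action": "pause"}}


def failure_rates(tasks):
    """Return {task_type: fraction of tasks of that type that failed}."""
    totals, failed = Counter(), Counter()
    for task in tasks:
        totals[task["type"]] += 1
        if task["status"] == "failed":
            failed[task["type"]] += 1
    return {task_type: failed[task_type] / totals[task_type] for task_type in totals}


def decide_actions(tasks, policies=POLICIES, default=DEFAULT_POLICY, min_tasks=5):
    """Return {task_type: action} for types whose failure rate exceeds their threshold.

    Types with fewer than min_tasks records are skipped so that one or two
    failures in a tiny sample do not trigger anything.
    """
    totals = Counter(task["type"] for task in tasks)
    actions = {}
    for task_type, rate in failure_rates(tasks).items():
        policy = policies.get(task_type, default)
        if totals[task_type] >= min_tasks and rate > policy["threshold"]:
            actions[task_type] = policy["action"]
    return actions
```

The same function would work whether each taskd instance runs it occasionally or a separate cron script does; the only difference is who supplies the list of recent tasks and who carries out the returned actions.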