If a taskd instance is killed ungracefully, its monq_task document is left in the 'busy' state even though nothing is working on it. Also, a taskd process occasionally gets completely stuck and never finishes its current task. It can stay in this state for many days, which ties up a worker process for nothing.
We should have a taskd cleanup command that queries the monq_task docs for all 'busy' processes assigned to the current hostname. For each one, do something like this (may need some tweaking):
- if the pid doesn't match one running on this host (e.g. run something like pgrep -f '/paster taskd'), then set the task doc's state to error. Put an explanation in the result field.
- for each process that is running, send a USR1 signal to the pid (this logs the current task to allura.log), then watch that log (the entry may take a few seconds to appear) and check whether the task id matches. Some tasks finish quickly, so a single mismatch must not be treated as a miss: the task may simply have completed in the meantime. If the task id is really not found, treat the 'busy' task as having been killed previously: set its state to error and put an explanation in the result field.
- if the process does not log anything at all to allura.log for its current task, we consider it stuck. Have a command-line option to kill stuck tasks and update their state. Otherwise, just report the stuck proc & task.
- you can change taskd to write the USR1 status output to an additional file besides allura.log if that is easier to watch for. The allura.log file can be very busy.
- print to stdout all the actions done, including proc & task details
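
A rough Python sketch of the decision logic above, to make the cases concrete. This is only an illustration under assumptions, not the actual implementation: the task is represented as a plain dict with hypothetical keys `_id` and `pid`, and `read_status` is a hypothetical callable that parses whatever taskd wrote in response to USR1 (from allura.log or a separate status file) and returns the logged task id, or None if nothing was logged. The names `running_taskd_pids`, `probe_with_usr1` and `classify` are made up for this sketch.

```python
import os
import signal
import subprocess
import time


def running_taskd_pids(pattern='/paster taskd'):
    """Return the set of pids that pgrep reports for the pattern."""
    # pgrep exits non-zero when nothing matches; that just yields an empty set.
    proc = subprocess.run(['pgrep', '-f', pattern],
                          stdout=subprocess.PIPE, universal_newlines=True)
    return {int(pid) for pid in proc.stdout.split()}


def probe_with_usr1(pid, read_status, attempts=3, wait=2.0):
    """Send USR1 a few times and collect the task ids the process reports.

    read_status is a hypothetical callable (assumption): given a pid, it
    returns the task id found in the status output, or None if none appeared.
    Returns None if the process never logged anything, else a set of task ids.
    """
    seen = None
    for _ in range(attempts):
        os.kill(pid, signal.SIGUSR1)
        time.sleep(wait)  # the log entry may take a few seconds to appear
        status = read_status(pid)
        if status is not None:
            seen = (seen or set()) | {status}
    return seen


def classify(task, running_pids, probe):
    """Decide what to do with one 'busy' task doc.

    task: dict with hypothetical keys '_id' and 'pid' (assumption).
    probe: callable taking a pid, returning None or a set of task ids.
    Returns (verdict, explanation); verdict is 'ok', 'error' or 'stuck'.
    """
    pid = task['pid']
    if pid not in running_pids:
        return 'error', 'pid %s is not running on this host' % pid
    seen = probe(pid)
    if seen is None:
        return 'stuck', 'pid %s logged nothing in response to USR1' % pid
    if task['_id'] in seen:
        return 'ok', 'pid %s is working on task %s' % (pid, task['_id'])
    return 'error', 'pid %s reported other tasks, not %s' % (pid, task['_id'])
```

A caller would loop over the 'busy' docs for this hostname, call `classify(task, running_taskd_pids(), lambda pid: probe_with_usr1(pid, read_status))`, print the verdict and explanation, and write the explanation to the result field for 'error' verdicts. To avoid false misses on fast tasks, it probably makes sense to re-read the task doc before changing its state and skip it if it is no longer 'busy'.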