A signal to taskd (e.g. SIGHUP for graceful restart) will frequently cause an error if pymongo is currently doing something: http://pastie.org/private/mdiglqtzkcydtyfusja
AutoReconnect
Raised when a connection to the database is lost and an attempt to auto-reconnect will be made.
In order to auto-reconnect you must handle this exception, recognizing that the operation which caused it has not necessarily succeeded. Future operations will attempt to open a new connection to the database (and will continue to raise this exception until the first successful connection is made).
Could we implement an auto-retry option in ming? Probably with a setting that would only be enabled for taskd, initially.
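A minimal sketch of what such an auto-retry option could look like, written as a plain decorator. Nothing here is ming's actual API: `retry_on_autoreconnect`, `max_attempts` and `delay` are hypothetical names, and the local `AutoReconnect` class is only a stand-in for `pymongo.errors.AutoReconnect` so the example runs without pymongo installed. Because the failed operation "has not necessarily succeeded", a wrapper like this is only safe around idempotent operations.

```python
import functools
import time


class AutoReconnect(Exception):
    """Stand-in for pymongo.errors.AutoReconnect (assumption, for illustration only)."""


def retry_on_autoreconnect(max_attempts=3, delay=0.0, exc_type=AutoReconnect):
    """Hypothetical decorator: re-run the wrapped call when exc_type is raised.

    Only appropriate for idempotent operations, since the failed call
    may already have been applied on the server.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller see the error
                    time.sleep(delay)  # brief pause before reconnecting
        return wrapper
    return decorator
```

In real use, `exc_type` would be `pymongo.errors.AutoReconnect`, and the decorator would be applied only where a setting (for example one enabled just for taskd) turns it on.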
So far, most of the exceptions we're seeing are on idle workers, in their monq_task get() method, which calls ming/pymongo's find_and_modify.
Here's a different trace that we should handle too. Not all of the errors are raised from cursor.next() calls in pymongo; this one occurs during an insert. We'll have to figure out a way to wrap all the necessary ming methods.
find_and_modify example
From the documentation here and here, and the discussion here, it seems that calling signal.signal() is overriding the default behavior of transparently restarting system calls and instead setting it to interrupt. I've added the calls to signal.siginterrupt() in allura:cj/4947 and, while it is impossible to test properly, I have verified that the graceful handlers still work and was not able to reproduce the AutoReconnect error after a few minutes of trying, while I was getting them occasionally before the change. At the very least, it doesn't seem to hurt the handlers and, since it is at the system level, it seems like it should avoid the thorny issues with mongo and attempting to retry non-idempotent operations.
It might warrant further discussion, however, so leaving this ticket open for the moment.
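For reference, the signal.siginterrupt() approach described above can be sketched roughly as follows. This is not the code from the branch: `install_handler`, `graceful_restart` and `received` are hypothetical names, and the handler only records the signal. The relevant detail from the Python documentation is that installing a handler with signal.signal() implicitly calls siginterrupt() with a true flag, so the restart behavior has to be set again after each install. The example is Unix-only and must run in the main thread.

```python
import signal

received = []  # hypothetical record of delivered signals


def graceful_restart(signum, frame):
    # Hypothetical handler: a real worker would flag a restart here.
    received.append(signum)


def install_handler(signum, handler):
    """Install handler, then ask the OS to restart interrupted system calls."""
    # signal.signal() implicitly makes system calls interruptible
    # for this signal (as if siginterrupt(signum, True) were called).
    signal.signal(signum, handler)
    # Switch back to restart behavior so a blocking read or write
    # on a socket is resumed instead of failing with EINTR.
    signal.siginterrupt(signum, False)


install_handler(signal.SIGHUP, graceful_restart)
```

On Python 3.5 and later the interpreter also retries most interrupted system calls itself when the handler does not raise, so this matters mainly on older interpreters.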