Trytond-worker silently inoperative after child processes are killed

Dan · March 1, 2022, 6:19pm

Whenever one of its child processes dies for a reason other than a Python exception, e.g. due to a SIGKILL received from the OS, I’d expect the trytond-worker daemon to stop with a non-zero exit code.

Right now this is not the case because the underlying multiprocessing interface used is Pool. The AsyncResult objects it returns are polled in a loop with a short sleep until they are ready, but an async result from a dead child will never be ready, so the program gets stuck in an infinite waiting loop.

This stops any other tasks from being processed with no alert about any dead process, and the situation can go unnoticed until users realise new purchases and sales are not being processed or emails are not being sent.
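The hang can be reproduced outside of trytond with a minimal sketch. This is not the worker's code: `demo` and its `timeout` parameter are hypothetical names, `os._exit` stands in for a child killed by the OS, and the fork start method is assumed (POSIX only).

```python
import multiprocessing
import os
import time


def demo(timeout=2.0):
    """Return whether the AsyncResult of a dead child ever becomes ready."""
    # Assumption: the "fork" start method is chosen explicitly so the sketch
    # does not depend on the platform default.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=1) as pool:
        # os._exit stands in for an abrupt death such as SIGKILL: the child
        # disappears without ever sending a result back to the parent.
        result = pool.apply_async(os._exit, (1,))
        deadline = time.monotonic() + timeout
        # Same polling pattern as the worker loop, but bounded by a deadline
        # so that the sketch terminates.
        while not result.ready() and time.monotonic() < deadline:
            time.sleep(0.1)
        return result.ready()


if __name__ == "__main__":
    print("result ready:", demo())
```

Without the deadline, the loop would keep polling forever, which matches the behaviour described in this post.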

Dan · March 1, 2022, 6:35pm

Given that the task list is already handled manually with a list subclass named TaskList:

class TaskList(list):
    def filter(self):
        for t in list(self):
            if t.ready():
                self.remove(t)
        return self

# ...

while len(tasks.filter()) >= processes:
    time.sleep(0.1)

so there is no advantage in using a multiprocessing.Pool, I think it would make sense to move to bare Process objects. Those objects have an .is_alive() method that would let the processing queue move on whether the children finished their task or died abruptly.
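A rough sketch of the bare Process idea, under stated assumptions: `run_all` is a hypothetical helper, not part of trytond, the fork start method is assumed (POSIX only), and `signal.raise_signal` (Python 3.8+) is used to make a child kill itself.

```python
import multiprocessing
import signal
import time


def run_all(calls, processes=2):
    """Run (target, args) pairs with at most `processes` children at once.

    Returns the exit codes in completion order; a negative value -N means
    the child was terminated by signal N.
    """
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    pending = list(calls)
    running, exitcodes = [], []
    while pending or running:
        # Reap every child that is no longer alive, however it ended.
        for process in list(running):
            if not process.is_alive():
                process.join()
                exitcodes.append(process.exitcode)
                running.remove(process)
        # Fill the free slots with new children.
        while pending and len(running) < processes:
            target, args = pending.pop(0)
            process = ctx.Process(target=target, args=args)
            process.start()
            running.append(process)
        time.sleep(0.05)
    return exitcodes


if __name__ == "__main__":
    calls = [
        (time.sleep, (0.1,)),
        (signal.raise_signal, (signal.SIGKILL,)),  # child kills itself
        (time.sleep, (0.1,)),
    ]
    print(run_all(calls))
```

The killed child shows up as a negative exit code instead of blocking the loop, so the caller can decide whether to carry on or stop with a non-zero exit code. The cost is one fork per task, which is the overhead a pool avoids.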

Alternatively, concurrent.futures.ProcessPoolExecutor has a similar API that would retain all the initializer magic around database connections and save quite a lot of process forking, while still allowing the returned Future objects to be queried about their running state.

Also, once a child has died, further attempts to submit tasks would raise a BrokenExecutor exception, after which the program could either try to recover with a new executor or leave the exception unhandled, so that the main process stops and system administrators can be alerted by their favorite tools.
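To illustrate the executor behaviour, here is a minimal sketch, assuming the fork start method (POSIX only) and using `os._exit` as a stand-in for a child killed by the OS. `demo` is a hypothetical name. The concrete exception raised by ProcessPoolExecutor is BrokenProcessPool, a subclass of BrokenExecutor.

```python
import concurrent.futures
import multiprocessing
import os


def demo():
    """Show how a ProcessPoolExecutor reports an abruptly dead child."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    outcome = {"result_raised": False, "submit_raised": False}
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=1, mp_context=ctx) as executor:
        # The child exits without returning anything, like a killed process.
        future = executor.submit(os._exit, 1)
        try:
            future.result(timeout=10)
        except concurrent.futures.BrokenExecutor:
            # The future fails instead of staying pending forever.
            outcome["result_raised"] = True
        try:
            executor.submit(abs, -1)
        except concurrent.futures.BrokenExecutor:
            # The broken executor refuses any new task.
            outcome["submit_raised"] = True
    return outcome


if __name__ == "__main__":
    print(demo())
```

Leaving either exception unhandled would end the main process with a traceback and a non-zero exit code, which is the alerting behaviour asked for in the first post.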

ced · March 2, 2022, 10:54am

This has already been reported in: Worker pool stop working after all processes crash (#11104) · Issues · Tryton / Tryton · GitLab

The advantage is having a pool of workers ready, which avoids having to initialize the pool for each task.

Maybe. It is annoying that Python has multiple options to manage a pool of processes.

Dan · March 2, 2022, 11:58am

Thanks, I often fail to find issues on that platform.

Absolutely. I was thinking about the manual effort of keeping N tasks running concurrently, which is only one of the many advantages of pools.

Indeed. I guess multiprocessing must be retained because of its widespread usage, but the concurrent module feels like the right way to do things nowadays. My first attempt at replacing the process pool in the worker module was quite straightforward, and I am going to port it to a nasty deployment that keeps running out of memory.

ced · March 2, 2022, 2:13pm

Please make a proposal on the issue, following: Tryton - How to Develop

Dan · March 10, 2022, 4:52pm

I hope I can work on it soon.

On an additional note, deploying this caused an immediate drop in the baseline CPU usage of the worker service.

My guess is that this is due to having replaced the time.sleep(0.1) instruction in the wait loop with

concurrent.futures.wait(tasks, return_when=FIRST_COMPLETED)

but I haven’t profiled it. (See the documentation for concurrent.futures.wait.)

Nonetheless, this could be even more beneficial than expected.
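For reference, a sketch of what such a wait loop could look like. This is an illustration and not the deployed patch: `run_bounded` is a hypothetical helper, and the fork start method is assumed in the usage part (POSIX only).

```python
import concurrent.futures
import multiprocessing
import time


def run_bounded(executor, calls, processes):
    """Submit (func, args) pairs, keeping at most `processes` in flight.

    Returns the number of completed futures.
    """
    tasks = set()
    completed = 0
    for func, args in calls:
        # Block until at least one task finishes instead of waking up
        # every 0.1 seconds to poll the task list.
        while len(tasks) >= processes:
            done, tasks = concurrent.futures.wait(
                tasks, return_when=concurrent.futures.FIRST_COMPLETED)
            completed += len(done)
        tasks.add(executor.submit(func, *args))
    # Drain whatever is still running.
    done, _ = concurrent.futures.wait(tasks)
    return completed + len(done)


if __name__ == "__main__":
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=2, mp_context=ctx) as executor:
        print(run_bounded(executor, [(time.sleep, (0.1,))] * 5, 2))
```

Because wait() blocks until a future changes state, the parent does not wake up periodically while the children are busy, which would be consistent with the lower idle CPU usage, although that remains a guess without profiling.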