Supervisord stopped working on 80% of droplets at the same time

Posted on May 17, 2020

I have supervisord (the latest version, 4.2.0) running on my Ubuntu 18.04 droplets in different regions.

Today I saw that, at exactly the same moment, almost all of my droplets dropped in CPU usage from 20-30% to almost zero. It turned out that supervisord had stopped working.

In the supervisor logs I can see that something sent a SIGTERM:

2020-05-16 12:33:59,831 WARN received SIGTERM indicating exit request

The only relevant answer I could find is https://stackoverflow.com/questions/28440543/supervisor-gets-a-sigterm-for-some-reason-quits-and-stops-all-its-processes

However, I checked, and the date of the unattended upgrade is different, though the minutes are the same, which is suspicious:

Start-Date: 2020-05-15  06:33:13
Commandline: /usr/bin/unattended-upgrade
Upgrade: libjson-c3:amd64 (0.12.1-1.3, 0.12.1-1.3ubuntu0.1)
End-Date: 2020-05-15  06:33:13

Notice that both the hour and the day are different.
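(For reference, the entry above is from apt's history log; these are the stock Ubuntu 18.04 locations for checking when unattended-upgrades last ran:)

grep -A3 'Start-Date' /var/log/apt/history.log
# unattended-upgrades also keeps its own per-run log:
less /var/log/unattended-upgrades/unattended-upgrades.log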

Is it possible to somehow figure out why supervisord stopped working, and how can I prevent this from happening in the future? It's crucial for me; I need it up and running 100% of the time :(




Hi @Akcium,

It seems supervisord may have been killed by the server itself: when the server runs out of memory, the kernel's OOM killer terminates processes. (Strictly speaking the OOM killer sends a SIGKILL rather than the SIGTERM you saw, so this is worth verifying.) I'd recommend checking by 'grep'-ing the system log for oom or kill; on Ubuntu 18.04 that log is /var/log/syslog rather than /var/log/messages. Here is an example:

grep -i oom /var/log/syslog
grep -i kill /var/log/syslog
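Since Ubuntu 18.04 uses systemd, the journal is another place to look; a quick sketch, assuming the unit name supervisor.service used by the stock Ubuntu package:

# kernel messages about the OOM killer (also visible via dmesg)
journalctl -k | grep -i -E 'oom|killed process'

# messages for the supervisor unit around the time of the SIGTERM
journalctl -u supervisor.service --since "2020-05-16 12:30" --until "2020-05-16 12:40"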

Most probably the timestamps in those logs will match what you saw in your supervisord log. Then you'll know the server killed the process, but you'll still need to find out why. If you see OOM-killer entries, it means the server was out of memory. You can confirm this with the sar command, like so:

sar -r 

It will show you what your memory usage was over a given period of time.
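One caveat: sar comes from the sysstat package, which isn't installed or enabled by default on Ubuntu; a minimal sketch of setting it up on 18.04:

sudo apt-get install sysstat
# enable periodic collection, otherwise sar has no history to report
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl restart sysstat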

That would be a good way to start troubleshooting.
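As for keeping supervisord up, that doesn't replace finding the root cause, but since the Ubuntu package runs it under systemd you could add a drop-in override so it's restarted automatically whenever it dies (again assuming the unit is named supervisor.service):

sudo systemctl edit supervisor
# in the editor that opens, add:
# [Service]
# Restart=always
# RestartSec=5
# save and exit; systemd reloads the unit automatically

Bear in mind that if the OOM killer is the culprit, an automatic restart only treats the symptom; the memory pressure itself still needs fixing.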

Regards, KDSys
