Question

Supervisord stopped working on 80% of droplets at the same time

Posted May 17, 2020 662 views
Linux Basics · Ubuntu 18.04

I have supervisord (latest version, 4.2.0) on my ubuntu 18.04 droplets in different regions.

Today I noticed that, at exactly the same time, almost all of my droplets dropped in CPU usage from 20-30% to almost zero. It turned out that supervisord had stopped working.

In the supervisor logs I can see that something sent a SIGTERM:

2020-05-16 12:33:59,831 WARN received SIGTERM indicating exit request

The only relevant answer I googled is https://stackoverflow.com/questions/28440543/supervisor-gets-a-sigterm-for-some-reason-quits-and-stops-all-its-processes

However, I’ve checked, and the date of the unattended upgrade is different, though the minutes match, which is suspicious:

Start-Date: 2020-05-15  06:33:13
Commandline: /usr/bin/unattended-upgrade
Upgrade: libjson-c3:amd64 (0.12.1-1.3, 0.12.1-1.3ubuntu0.1)
End-Date: 2020-05-15  06:33:13

Notice that both the day and the hour are different.

Is it possible to figure out why supervisord stopped working, and how can I prevent this from happening in the future? It’s crucial for me to keep it up and running 100% of the time :(

1 answer

Hi @Akcium,

It seems supervisord was killed by the system. When the server runs out of memory, the kernel’s OOM killer terminates processes with a SIGKILL. I’d recommend checking whether that’s what happened by grep-ing /var/log/messages for oom or kill. Here is an example:

grep -i oom /var/log/messages
grep -i kill /var/log/messages

Most probably the timestamps in those logs will match what you saw in your supervisord log. Then you know the system killed it, but you’ll still need to find out why. If you see OOM entries, it means the server ran out of memory. You can confirm this with the sar command, like so:

sar -r 

It will show you your memory usage over a given period of time.
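For example, to narrow the report to the window around the incident, sar can read a specific day’s data file and time range (the file name and times below are assumptions for illustration; on Ubuntu, sysstat stores one data file per day under /var/log/sysstat/):

```shell
# Memory usage between 12:00 and 13:00 on the 16th of the month.
# saDD is the sysstat data file, where DD is the day of the month.
sar -r -f /var/log/sysstat/sa16 -s 12:00:00 -e 13:00:00
```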

That would be a good way to start troubleshooting.

Regards,
KDSys

  • Thanks for reply!

    1. I don’t have /var/log/messages; I googled and found that /var/log/syslog contains the same info
    2. In /var/log/syslog there is no oom, but there are some SIGTERMs
    3. However, strangely, the dates are all different, and /var/log/syslog contains only a few records for the past day:
    root@node-frankfurt:~# grep -i kill /var/log/syslog
    May 17 06:56:18 node-frankfurt systemd[21690]: Received SIGRTMIN+24 from PID 22053 (kill).
    May 17 06:56:18 node-frankfurt systemd[1]: user@0.service: Killing process 22053 (kill) with signal SIGKILL.
    May 17 06:57:11 node-frankfurt systemd[22126]: Received SIGRTMIN+24 from PID 22566 (kill).
    May 17 06:57:11 node-frankfurt systemd[1]: user@0.service: Killing process 22566 (kill) with signal SIGKILL.
    May 17 07:41:27 node-frankfurt systemd[14056]: Received SIGRTMIN+24 from PID 15382 (kill).
    May 17 07:41:27 node-frankfurt systemd[1]: user@0.service: Killing process 15382 (kill) with signal SIGKILL.
    May 17 07:50:34 node-frankfurt systemd[19830]: Received SIGRTMIN+24 from PID 20487 (kill).
    May 17 07:50:34 node-frankfurt systemd[1]: user@0.service: Killing process 20487 (kill) with signal SIGKILL.
    May 17 07:55:39 node-frankfurt systemd[20490]: Received SIGRTMIN+24 from PID 23389 (kill).
    May 17 07:55:39 node-frankfurt systemd[1]: user@0.service: Killing process 23389 (kill) with signal SIGKILL.
    May 17 08:11:10 node-frankfurt systemd[23392]: Received SIGRTMIN+24 from PID 32316 (kill).
    May 17 08:11:10 node-frankfurt systemd[1]: user@0.service: Killing process 32316 (kill) with signal SIGKILL.
    May 17 08:13:08 node-frankfurt systemd[32539]: Received SIGRTMIN+24 from PID 1076 (kill).
    May 17 08:13:08 node-frankfurt systemd[1]: user@0.service: Killing process 1076 (kill) with signal SIGKILL.
    May 17 08:44:38 node-frankfurt systemd[18340]: Received SIGRTMIN+24 from PID 19028 (kill).
    May 17 08:44:38 node-frankfurt systemd[1]: user@0.service: Killing process 19028 (kill) with signal SIGKILL.
    May 17 09:42:53 node-frankfurt systemd[2998]: Received SIGRTMIN+24 from PID 23470 (kill).
    May 17 09:42:53 node-frankfurt systemd[1]: user@0.service: Killing process 23470 (kill) with signal SIGKILL.
    May 17 09:47:42 node-frankfurt systemd[26397]: Received SIGRTMIN+24 from PID 26889 (kill).
    May 17 09:47:42 node-frankfurt systemd[1]: user@0.service: Killing process 26889 (kill) with signal SIGKILL.
    

    While supervisor log shows:

    149136:2020-05-16 12:33:59,831 WARN received SIGTERM indicating exit request
    

    I thought that maybe my droplets restarted for some reason, but..

    root@node-frankfurt:~# last reboot
    reboot   system boot  4.15.0-99-generi Sat May 16 13:15   still running
    reboot   system boot  4.15.0-99-generi Tue May 12 08:28 - 12:34 (4+04:05)
    reboot   system boot  4.15.0-99-generi Tue May 12 07:15 - 08:26  (01:11)
    reboot   system boot  4.15.0-99-generi Tue May 12 06:40 - 07:08  (00:27)
    
    wtmp begins Sat May  2 08:46:50 2020
    

    I don’t have sar, but I’m going to install and use it.

    Anyway, is there any solution to… supervise the supervisor? :) I mean, if it’s down, what’s the fastest and best way to bring it back up? I thought about scheduling a cron job that checks whether it’s up and running, though I don’t yet know how to do this.

    • Hi @Akcium,

      Hmm, well, your idea of a cron job checking that everything is running is quite good. To be honest, I’m not sure there is anything purpose-built to supervise supervisord.

      The cron job could check the status of the service and act based on the output it gets. Maybe something along the lines of:

      #!/bin/bash
      
      # On Ubuntu the service is named "supervisor"; start it if it is down.
      if service supervisor status > /dev/null 2>&1; then
          echo 'working'
      else
          echo 'Not working'
          service supervisor start
      fi
      

      And you can set up the cron job to run every five minutes using the

      crontab -e
      

      command.
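
      An entry along these lines would run the check every five minutes (the script path is an assumption):

      ```shell
      # m h dom mon dow command
      */5 * * * * /usr/local/bin/check_supervisor.sh
      ```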

      Regards,
      KDSys

      • Oh, thank you so much! I also wrote a script which sends information to the database every 5 minutes, so that I know if everything is working correctly.

        But now supervisor randomly uses 100% of the CPU. I’ve checked previous issues, and this was supposedly fixed in one of the supervisor versions, but it still occurs :(

        I know I’m a bit off-topic, but could you draft a script which restarts supervisor if its CPU usage is high (say 95%)? I know how to set up cron, but I don’t know bash scripting at all. Anyway, I will google that too.
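
        A rough sketch along those lines (the 95% threshold, the service name "supervisor", and reading the value from ps are assumptions, not something from this thread; verify before relying on it):

        ```shell
        #!/bin/bash
        # Hypothetical sketch: restart supervisor if supervisord's
        # CPU usage exceeds a threshold.
        THRESHOLD=95

        # %CPU of the supervisord process; empty if it is not running.
        cpu=$(ps -C supervisord -o %cpu= | head -n 1)
        cpu=${cpu:-0}

        # Strip the decimal part so the shell can compare integers.
        if [ "${cpu%.*}" -ge "$THRESHOLD" ]; then
            service supervisor restart
        fi
        ```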
