By: jonathan

Droplet remains alive but "falls off" network. OR... How to reboot automatically when condition detected?

April 18, 2016
Networking Monitoring Ubuntu

Quickest version of question:
What's the simplest way to detect a network error (say, 8 failed pings in a row) and then trigger a reboot?

A bit more detail

I don't seem to be able to resolve a networking issue (more on that below) and my only choice seems to be a complete rebuild.

Ubuntu 16.04 (Xenial Xerus) is due out on 21st April. Does anyone know how quickly DO gets the new images ready to deploy as a droplet? It's a fairly complex build and I want to move to PHP 7 at the same time, so I don't have time to check everything this week.

In depth - diagnostic level!

In the meantime, I have to keep this droplet up, and this problem has been plaguing my Ubuntu 15.10 4.2.0-34-generic i686 VPS for a couple of months now.

There will be a small (2-3 Mb/s) spike in outgoing network traffic, and then the VPS will cease to respond to any connections on any port, with the exception of DigitalOcean's proprietary hypervisor-level console "which is like plugging a keyboard in". When I log in that way, all services seem to be up. But the ONLY way to get the system back online is a shutdown -r now.

Support say they've investigated and can find no reason for this.

All I know is that downloading any large file triggers this condition, and that at the same time, the time the weekly backups take changed from 32 mins to about 15 mins, despite the droplet remaining the same size.

I've checked that there's no fail2ban weirdness going on. I've tried running without the iptables firewall, I've looked for clues in the nginx webserver logs, and there's nothing I can find. Also, DigitalOcean's control panel keeps logging throughout the outage, and there is no CPU load spike either (so I doubt it's a DDoS).

Anyone got any ideas?

I've got an NFS connection to another machine, and this is what syslog and kern.log are showing at the times the machine last dropped off the network:

syslog

Mar 20 11:09:01 tns2000 CRON[8805]: (root) CMD ( [ -x /usr/lib/php5/sessionclean ] && /usr/lib/php5/sessionclean)
Mar 20 11:17:01 tns2000 CRON[8852]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Mar 20 11:29:00 tns2000 kernel: [79745.504204] nfs: server 46.101.85.XX not responding, timed out
Mar 20 11:32:11 tns2000 kernel: [79935.988165] nfs: server 46.101.85.XX not responding, timed out
Mar 20 11:32:11 tns2000 kernel: [79935.988236] nfs: server 46.101.85.XX not responding, timed out
Mar 20 11:32:33 tns2000 systemd-timesyncd[369]: Timed out waiting for reply from 91.189.94.4:123 (ntp.ubuntu.com).
Mar 20 11:32:43 tns2000 systemd-timesyncd[369]: Timed out waiting for reply from 91.189.89.199:123 (ntp.ubuntu.com).
Mar 20 11:32:53 tns2000 systemd-timesyncd[369]: Timed out waiting for reply from [2001:67c:1560:8003::c7]:123 (ntp.ubuntu.com).
Mar 20 11:35:40 tns2000 kernel: [80145.500142] nfs: server 46.101.85.XX not responding, timed out

kern.log

Mar 20 11:29:00 tns2000 kernel: [79745.504204] nfs: server 46.101.85.XX not responding, timed out
Mar 20 11:32:11 tns2000 kernel: [79935.988165] nfs: server 46.101.85.XX not responding, timed out
Mar 20 11:32:11 tns2000 kernel: [79935.988236] nfs: server 46.101.85.XX not responding, timed out
Mar 20 11:35:40 tns2000 kernel: [80145.500142] nfs: server 46.101.85.XX not responding, timed out

1 Answer

First, I wanted to let you know that we will have images for 16.04 out very quickly after the release. We try to quickly get these to our users after the official launch.

As for automatic reboots or failover, you may want to look at https://www.digitalocean.com/community/tutorials/how-to-create-a-high-availability-setup-with-heartbeat-and-floating-ips-on-ubuntu-14-04 or https://www.digitalocean.com/community/tutorials/how-to-set-up-highly-available-web-servers-with-keepalived-and-floating-ips-on-ubuntu-14-04 to either restart or move traffic to another site while your Droplet is facing issues.

Heartbeat is an open source program that provides cluster infrastructure capabilities—cluster membership and messaging—to client servers, which is a critical component in a high availability (HA) server infrastructure. Heartbeat is typically used in conjunction with a cluster resource manager (CRM), such as Pacemaker, to achieve a complete HA setup. However, in this tutorial, we will demonstrate how to create a 2-node HA server setup by simply using Heartbeat and a DigitalOcean Floating IP.
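
For the literal question (8 failed pings in a row, then a reboot), a small watchdog running on the Droplet itself may be enough as a stopgap. The following is only a rough sketch, not a tested fix for the underlying problem. It assumes that outbound traffic also fails during the outage (the NFS and NTP timeouts in the logs suggest it does), that the script runs as root, and that Python 3 and the standard Linux ping are installed. TARGET, INTERVAL and the function names are placeholders chosen for illustration.

```python
#!/usr/bin/env python3
"""Reboot after N consecutive failed pings. Sketch only; run as root."""
import subprocess
import time

TARGET = "8.8.8.8"   # assumption: any host that is reliably reachable from the Droplet
MAX_FAILURES = 8     # from the question: 8 failed pings in a row
INTERVAL = 30        # assumption: seconds to wait between checks


def ping_once(host, timeout=5):
    """Return True if a single ICMP echo request to host gets a reply."""
    status = subprocess.call(
        ["ping", "-c", "1", "-W", str(timeout), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return status == 0


def reboot():
    """Run the same command the question uses to recover by hand."""
    subprocess.call(["shutdown", "-r", "now"])


def watch(check=ping_once, on_fail=reboot, max_failures=MAX_FAILURES,
          interval=INTERVAL, sleep=time.sleep):
    """Call on_fail once check() has failed max_failures times in a row."""
    failures = 0
    while failures < max_failures:
        if check(TARGET):
            failures = 0          # any success resets the streak
        else:
            failures += 1
        if failures < max_failures:
            sleep(interval)
    on_fail()
```

To use it, add a final line calling watch() and start the script at boot, for example from a systemd unit or an @reboot cron entry. The counter resets on any successful ping, so a single dropped packet will not trigger a reboot. With the values above, the reboot fires roughly four to five minutes after the network goes away.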