Resizing DigitalOcean Droplets that run CoreOS breaks the CoreOS cluster

April 22, 2015
Docker DigitalOcean CoreOS

How to reproduce:

Spin up two CoreOS droplets and join them into a cluster through cloud-config.
In the DigitalOcean dashboard, power off both droplets and resize them.
Power both droplets back on.

SSH into one of the droplets and run fleetctl list-machines.
You should see output like this:

2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 400ms
2015/04/22 21:05:51 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:51 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 800ms
2015/04/22 21:05:51 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:51 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 1s
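
The "connection refused" lines mean nothing accepted a TCP connection on 127.0.0.1:4001, the address fleetctl is dialing in the output above. A minimal sketch for checking that directly, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available. port_open is a hypothetical helper name, not part of fleet or etcd:

```shell
#!/usr/bin/env bash
# Return 0 if something accepts TCP connections on host:port, non-zero otherwise.
port_open() {
  local host="$1" port="$2"
  # /dev/tcp is a bash feature; the child shell exits non-zero if the connect fails.
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# 4001 is the client port taken from the fleetctl errors above.
if port_open 127.0.0.1 4001; then
  echo "something is listening on 127.0.0.1:4001"
else
  echo "nothing is listening on 127.0.0.1:4001"
fi
```

If the second message is printed, the problem is on the etcd side rather than in fleetctl, which matches the failed etcd.service reported by systemd.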

Running
journalctl -u etcd

will show

Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO      | f507c71154cc47b1804558c7298d0313: state changed from 'leader' to 'follower'.
Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO      | f507c71154cc47b1804558c7298d0313: term #7 started.
Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO      | f507c71154cc47b1804558c7298d0313: leader changed from 'f507c71154cc47b1804558c7298d0313' to ''.
Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.257 INFO      | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.258 INFO      | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.546 INFO      | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.547 INFO      | f507c71154cc47b1804558c7298d0313: term #9 started.
Apr 22 14:41:14 test etcd[578]: [etcd] Apr 22 14:41:14.847 INFO      | f507c71154cc47b1804558c7298d0313: snapshot of 10004 events at index 10004 completed
Apr 22 14:53:45 test etcd[578]: [etcd] Apr 22 14:53:45.297 INFO      | f507c71154cc47b1804558c7298d0313: warning: heartbeat near election timeout: 359.350151ms
Apr 22 14:55:22 test etcd[578]: [etcd] Apr 22 14:55:22.381 INFO      | f507c71154cc47b1804558c7298d0313: warning: heartbeat near election timeout: 1.574255587s
Apr 22 15:31:17 test etcd[578]: [etcd] Apr 22 15:31:17.551 INFO      | f507c71154cc47b1804558c7298d0313: snapshot of 10001 events at index 20005 completed
Apr 22 16:19:53 test etcd[578]: [etcd] Apr 22 16:19:53.870 INFO      | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 30012 completed
Apr 22 17:08:00 test etcd[578]: [etcd] Apr 22 17:08:00.254 INFO      | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 40019 completed
Apr 22 17:57:30 test etcd[578]: [etcd] Apr 22 17:57:30.622 INFO      | f507c71154cc47b1804558c7298d0313: snapshot of 10008 events at index 50027 completed
Apr 22 18:48:04 test etcd[578]: [etcd] Apr 22 18:48:04.084 INFO      | f507c71154cc47b1804558c7298d0313: snapshot of 10008 events at index 60035 completed
Apr 22 19:38:37 test etcd[578]: [etcd] Apr 22 19:38:37.641 INFO      | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 70042 completed
Apr 22 20:07:41 test etcd[578]: [etcd] Apr 22 20:07:39.493 INFO      | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.282 INFO      | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.895 INFO      | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.899 INFO      | f507c71154cc47b1804558c7298d0313: term #13 started.
Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.269 INFO      | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.302 INFO      | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.631 INFO      | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.632 INFO      | f507c71154cc47b1804558c7298d0313: term #15 started.
Apr 22 20:11:18 test systemd[1]: Stopping etcd...
Apr 22 20:11:18 test systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 22 20:11:18 test systemd[1]: Stopped etcd.
Apr 22 20:11:18 test systemd[1]: Unit etcd.service entered failed state.
Apr 22 20:11:18 test systemd[1]: etcd.service failed.

and this is the unit file, as shown by
systemctl cat etcd.service

# /usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
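
The journal output above ends with systemd recording an exit status for etcd.service. As a rough sketch, that status can be pulled out of the journal text with sed. etcd_exit_status is a hypothetical helper name, and the pattern assumes the message wording shown in the journal above:

```shell
#!/usr/bin/env bash
# Read journal text on stdin and print the last exit status systemd
# recorded for etcd.service (nothing is printed if no such line exists).
etcd_exit_status() {
  sed -n 's/.*etcd\.service: main process exited, code=exited, status=\([^ ]*\).*/\1/p' | tail -n 1
}

# Only query the journal when journalctl is actually available.
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u etcd --no-pager | etcd_exit_status
fi
```

Fed the journal lines shown above, the helper prints 2/INVALIDARGUMENT. Adding -b to the journalctl call limits the output to the current boot, which may help tell whether etcd logged anything at all after the droplet was powered back on.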

2 comments
  • We have a slightly different scenario, but which also falls under this heading.

    We had a 3-node cluster running a REST API and a Redis cluster. We wanted to increase the resources of the droplet running the API, so we shut it down, resized it, and restarted it. Note that we did not shut down or restart the other 2 nodes at any point.

    After the node started, fleetctl list-machines threw a bunch of errors. After some investigation we found:
    systemctl status -l fleet
    ● fleet.service - fleet daemon
    Loaded: loaded (/usr/lib64/systemd/system/fleet.service; static; vendor preset: disabled)
    Active: inactive (dead)
    systemctl status -l etcd
    ● etcd.service - etcd
    Loaded: loaded (/usr/lib64/systemd/system/etcd.service; static; vendor preset: disabled)
    Active: inactive (dead)

    Not sure why that happened, but when we tried to start the services manually:
    systemctl enable etcd
    Failed to execute operation: Interactive authentication required.
    systemctl start etcd
    Failed to execute operation: Interactive authentication required.

    At this point, the only relevant result on Google was this one. Can anyone shed any light on what went wrong during the resize?

  • You should use "sudo" to fix the "Failed to execute operation" error, e.g. sudo systemctl start etcd.
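
Following the sudo suggestion, here is a minimal sketch of starting both services with root privileges and then checking their state. The unit names come from the status output above. run and start_cluster_services are hypothetical helper names, and DRY_RUN is an assumed convention for this sketch that only prints the commands. Note that both units are listed as "static", which generally means they have no [Install] section, so systemctl enable is not expected to do anything useful for them even with sudo.

```shell
#!/usr/bin/env bash
# Run a command, or just print it when DRY_RUN=1 (assumed convention for this sketch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Start etcd first, then fleet, since fleetctl above fails on its etcd connection.
start_cluster_services() {
  run sudo systemctl start etcd.service
  run sudo systemctl start fleet.service
  # is-active prints one state per unit, e.g. "active" or "failed"
  run systemctl is-active etcd.service fleet.service
}
```

Calling DRY_RUN=1 start_cluster_services prints the three commands without running them. Calling start_cluster_services on its own runs them.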

1 Answer

I just ran into the same issue and I've submitted a support request.
