So after a weekend of testing (creating and destroying) droplets using CoreOS, I think there is an issue with joining clusters across regions.
I can set up a CoreOS machine (in a new cluster with a new discovery.etcd token) in any region and it will work perfectly. If I add another machine to that cluster in the same data center, it will also work perfectly. But if I try to add a new machine to that cluster from another region, it basically won't work. I am not using private_ip; instead I use public_ip.
Below is my #cloud-config file:
#cloud-config
coreos:
  etcd:
    discovery: https://discovery.etcd.io/a34f937bef11618f5322416aef9d5edb
    addr: $public_ipv4:4001
    peer-addr: $public_ipv4:7001
    election_timeout: 3000
    heartbeat_timeout: 3000
  fleet:
    public-ip: $public_ipv4
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start
I have run two tests.
Test 1: https://discovery.etcd.io/6f509596dcb64a8427fefd2712faff11
My machine “t1lon” was set up first and works perfectly. I then spawned “t1ny” with the same discovery token. Then things would break: if I ran “fleetctl list-machines” from “t1lon” it would still only show one machine, and when running the same command from the “t1ny” machine I would get:
2014/10/06 07:52:22 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects
2014/10/06 07:52:22 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
(repeated many times).
Test 2: https://discovery.etcd.io/a34f937bef11618f5322416aef9d5edb
Now I have tested this the other way round, setting up the NY server first (“t2ny”) and then adding a LON server, and I get the same issue in reverse.
Since this issue only really appears when building a cluster across data centers, I can only think it's down to network latency.
Please help me shed some light on this.
I’ve had varying success with setting up cross-data-center clusters. According to the CoreOS documentation, the default settings for etcd are designed for use on a local network:
The default settings in etcd should work well for installations on a local network where the average network latency is low. However, when using etcd across multiple data centers or over networks with high latency you may need to tweak the heartbeat interval and election timeout settings.
They also suggest some guidelines for election_timeout and heartbeat_timeout:
The election timeout should be set based on the heartbeat interval and your network ping time between nodes. Election timeouts should be at least 10 times your ping time so it can account for variance in your network. For example, if the ping time between your nodes is 10ms then you should have at least a 100ms election timeout.
You should also set your election timeout to at least 4 to 5 times your heartbeat interval to account for variance in leader replication. For a heartbeat interval of 50ms you should set your election timeout to at least 200ms - 250ms.
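The rules quoted above can be turned into a quick calculation. This is a hedged sketch: the 150ms ping time is an illustrative value for a London-to-New York round trip, not something measured in the original post.

```shell
# Sketch: derive suggested etcd timeouts from a measured cross-region ping.
# PING_MS is an assumed example value; measure your own with `ping`.
PING_MS=150                       # e.g. round-trip time between LON and NY
HEARTBEAT_MS=$PING_MS             # keep the heartbeat near the ping time
ELECTION_MS=$((PING_MS * 10))     # guideline: at least 10x the ping time

# Also require at least 5x the heartbeat interval; take the larger value.
MIN_FROM_HEARTBEAT=$((HEARTBEAT_MS * 5))
if [ "$MIN_FROM_HEARTBEAT" -gt "$ELECTION_MS" ]; then
  ELECTION_MS=$MIN_FROM_HEARTBEAT
fi

echo "heartbeat_timeout: $HEARTBEAT_MS"
echo "election_timeout: $ELECTION_MS"
```

With a 150ms ping this suggests a 150ms heartbeat and a 1500ms election timeout, which is in the same ballpark as the 3000ms values already in the question's config.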
Playing around with some larger values may help. This guide should point you in the right direction for doing some deeper debugging:
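One thing worth noting about the config in the question: it sets election_timeout and heartbeat_timeout to the same value (3000), while the docs quoted above recommend an election timeout of at least 4-5x the heartbeat interval. A tweaked etcd section, keeping the question's own key names, might look like this (the specific numbers are illustrative, not tested values):

```yaml
coreos:
  etcd:
    discovery: https://discovery.etcd.io/a34f937bef11618f5322416aef9d5edb
    addr: $public_ipv4:4001
    peer-addr: $public_ipv4:7001
    # heartbeat near the cross-region ping time; election at ~5x heartbeat
    heartbeat_timeout: 600
    election_timeout: 3000
```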
Click below to sign up and get $100 of credit to try our products over 60 days!
@rayslakinski I think that always returns a 404; it's not an actual endpoint. Have you tried:
curl -L http://127.0.0.1:4001/v2/stats/leader
Having the same issue, though with a single node. I tried creating a new single node with a new token and got the same issue.