Question

CoreOS - cluster across regions not working

  • Posted October 6, 2014

So after a weekend of testing (creating and destroying droplets) using CoreOS, I think there is an issue with joining clusters across regions.

I can set up a CoreOS machine (in a new cluster with a new discovery.etcd.io URL) in any region and it will work perfectly. If I add another machine to that cluster in the same data center, it also works perfectly. But if I try to add a new machine to that cluster from another region, it basically won't work. I am not using private_ip; instead I use public_ip.

Below is my cloud-config file:

#cloud-config

coreos:
  etcd:
    discovery: https://discovery.etcd.io/a34f937bef11618f5322416aef9d5edb
    addr: $public_ipv4:4001
    peer-addr: $public_ipv4:7001
    election_timeout: 3000
    heartbeat_timeout: 3000
  fleet:
    public-ip: $public_ipv4
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start

I have made 2 examples.

Test 1: https://discovery.etcd.io/6f509596dcb64a8427fefd2712faff11

My machine “t1lon” was set up first and works perfectly well. I then spawned “t1ny” with the same discovery ID, and that is when it broke. If I call “fleetctl list-machines” from “t1lon” it still only shows 1 machine; when running the same command from the “t1ny” machine I get:

2014/10/06 07:52:22 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects 
2014/10/06 07:52:22 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms

(repeated many times).

Test 2: https://discovery.etcd.io/a34f937bef11618f5322416aef9d5edb

Now I have tested this the other way round, setting up the NY server “t2ny” first and then adding a LON server, and I get the same issue in reverse.

Because this issue only arises when making a cluster across data centers, I can only think it's down to network/speed issues.

Please help me shed some light on this.


@rayslakinski I think that always returns a 404; it's not an actual endpoint. Have you tried curl -L http://127.0.0.1:4001/v2/stats/leader instead?

Having the same issue, though with a single node. I tried creating a new single node with a new token and got the same issue:

curl -I http://127.0.0.1:4001/
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
Date: Mon, 06 Oct 2014 14:51:54 GMT


I’ve had varying success with setting up cross data center clusters. According to the CoreOS documentation, the default settings for etcd are designed for working on a local network:

The default settings in etcd should work well for installations on a local network where the average network latency is low. However, when using etcd across multiple data centers or over networks with high latency you may need to tweak the heartbeat interval and election timeout settings.

They also suggest some guidelines for election_timeout and heartbeat_timeout:

The election timeout should be set based on the heartbeat interval and your network ping time between nodes. Election timeouts should be at least 10 times your ping time so it can account for variance in your network. For example, if the ping time between your nodes is 10ms then you should have at least a 100ms election timeout.

You should also set your election timeout to at least 4 to 5 times your heartbeat interval to account for variance in leader replication. For a heartbeat interval of 50ms you should set your election timeout to at least 200ms - 250ms.
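As a rough illustration of how the two quoted guidelines combine, here is a minimal Python sketch. The function name and the default multipliers are my own choices for illustration, not anything etcd provides; it simply takes the larger of the two lower bounds (10 times the ping time, and 5 times the heartbeat interval). Note that the configuration in the question sets both values to 3000, which does not satisfy the 4 to 5 times ratio.

```python
def min_election_timeout_ms(ping_ms, heartbeat_ms,
                            ping_factor=10, heartbeat_factor=5):
    """Return the smallest election timeout (ms) meeting both guidelines.

    Hypothetical helper, not part of etcd:
      - at least `ping_factor` times the ping time between nodes
      - at least `heartbeat_factor` times the heartbeat interval
    """
    if ping_ms <= 0 or heartbeat_ms <= 0:
        raise ValueError("ping_ms and heartbeat_ms must be positive")
    return max(ping_factor * ping_ms, heartbeat_factor * heartbeat_ms)


if __name__ == "__main__":
    # Figures from the quoted documentation: 10ms ping, 50ms heartbeat.
    print(min_election_timeout_ms(10, 50))
    # Assumed cross-region ping of 80ms (measure your own with ping).
    print(min_election_timeout_ms(80, 50))
```

With the documentation's own figures the heartbeat rule dominates; with a higher cross-region ping the ping rule takes over, so it is worth measuring the actual round-trip time between your regions before picking values.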

Playing around with some larger values may help. This guide should point you in the right direction for doing some deeper debugging: