CoreOS - cluster across regions not working

So after a weekend of testing (creating and destroying) droplets using CoreOS, I think there is an issue with joining clusters across regions.

I can set up a CoreOS machine (in a new cluster with a new discovery.etcd token) in any region and it works perfectly. If I add another machine to that cluster in the same data center, it also works perfectly. But if I try to add a new machine to that cluster from another region, it basically won't work. I am not using private_ip; instead I use public_ip.

Below is my #cloud-config file:


    #cloud-config
    coreos:
      etcd:
        discovery: https://discovery.etcd.io/<token>
        addr: $public_ipv4:4001
        peer-addr: $public_ipv4:7001
        election_timeout: 3000
        heartbeat_timeout: 3000
      fleet:
        public-ip: $public_ipv4
      units:
        - name: etcd.service
          command: start
        - name: fleet.service
          command: start

I have run two tests.

Test 1:

My machine “t1lon” was set up first and works perfectly well. I then spawned “t1ny” with the same discovery id, and things broke: running “fleetctl list-machines” from “t1lon” would still show only one machine, and running the same command from the “t1ny” machine gave:

    2014/10/06 07:52:22 INFO client.go:278: Failed getting response from too many redirects
    2014/10/06 07:52:22 ERROR client.go:200: Unable to get result for {Get /}, retrying in 100ms

(repeated many times).

Test 2:

I have also tested this the other way round, setting up the NY server first (“t2ny”) and then adding a LON server, and I get the same issue in the other direction.

Since this issue only really arises when making a cluster across data centers, I can only think it's down to network latency/speed issues.

Please help me shed some light on this.


I’ve had varying success with setting up cross-data-center clusters. According to the CoreOS documentation, the default settings for etcd are designed for working on a local network:

> The default settings in etcd should work well for installations on a local network where the average network latency is low. However, when using etcd across multiple data centers or over networks with high latency you may need to tweak the heartbeat interval and election timeout settings.

They also suggest some guidelines for election_timeout and heartbeat_timeout:

> The election timeout should be set based on the heartbeat interval and your network ping time between nodes. Election timeouts should be at least 10 times your ping time so it can account for variance in your network. For example, if the ping time between your nodes is 10ms then you should have at least a 100ms election timeout.
>
> You should also set your election timeout to at least 4 to 5 times your heartbeat interval to account for variance in leader replication. For a heartbeat interval of 50ms you should set your election timeout to at least 200ms - 250ms.
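As a rough sketch of that arithmetic (the ping value below is an assumption for illustration; measure the real round trip between your droplets with `ping`):

```shell
# Sketch only: derive etcd timeout settings from a measured round-trip time,
# following the guidelines quoted above. PING_MS is an assumed example value;
# replace it with your actual LON <-> NYC ping time.
PING_MS=80

HEARTBEAT_MS=$(( PING_MS ))           # heartbeat at least the ping time
ELECTION_MS=$(( PING_MS * 10 ))       # election timeout >= 10x ping time
MIN_ELECTION=$(( HEARTBEAT_MS * 5 ))  # ...and >= 4-5x the heartbeat interval
if [ "$ELECTION_MS" -lt "$MIN_ELECTION" ]; then
  ELECTION_MS=$MIN_ELECTION
fi

echo "heartbeat_timeout: $HEARTBEAT_MS"
echo "election_timeout: $ELECTION_MS"
```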

Playing around with some larger values may help. This guide should point you in the right direction for doing some deeper debugging:
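One thing that stands out: the cloud-config in the question sets election_timeout and heartbeat_timeout to the same value (3000), whereas the guideline above says the election timeout should be at least 4 to 5 times the heartbeat interval. A fragment along these lines (values purely illustrative, reusing the same keys as the question's config) may be closer to what etcd expects:

```
coreos:
  etcd:
    # illustrative only: heartbeat comfortably above cross-region ping,
    # election timeout ~5x the heartbeat per the etcd guidelines
    heartbeat_timeout: 300
    election_timeout: 1500
```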