CoreOS - cluster across regions not working

October 6, 2014

So after a weekend of testing (creating and destroying droplets) with CoreOS, I think there is an issue with joining clusters across regions.

I can set up a CoreOS machine (in a new cluster with a new discovery.etcd.io token) in any region and it will work perfectly. If I add another machine to that cluster in the same data center, it will also work perfectly. But if I try to add a new machine to that cluster from another region, it basically won't work. I am not using $private_ipv4; instead I use $public_ipv4.

Below is my cloud-config file:

#cloud-config

coreos:
  etcd:
    discovery: https://discovery.etcd.io/a34f937bef11618f5322416aef9d5edb
    addr: $public_ipv4:4001
    peer-addr: $public_ipv4:7001
    election_timeout: 3000
    heartbeat_interval: 3000
  fleet:
    public-ip: $public_ipv4
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start

I have run two tests.

Test 1: https://discovery.etcd.io/6f509596dcb64a8427fefd2712faff11

My machine "t1lon" was set up first and works perfectly well. I then spawned "t1ny" with the same discovery ID. Then things broke: running "fleetctl list-machines" from "t1lon" would still only show one machine, and running the same command from the "t1ny" machine would give:

2014/10/06 07:52:22 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects 
2014/10/06 07:52:22 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms

(repeated many times).
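For what it's worth, the discovery token is itself a readable etcd keyspace, so a plain GET against it should show which peer addresses each machine actually registered (using my Test 1 token here):

curl https://discovery.etcd.io/6f509596dcb64a8427fefd2712faff11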

Test 2: https://discovery.etcd.io/a34f937bef11618f5322416aef9d5edb

I have also tested this the other way round, setting up the NY server ("t2ny") first and then adding a LON server, and I get the same issue in reverse.

Since this issue only really arises when building a cluster across data centers, I can only think it is down to network latency.
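As a rough check of that theory, you can measure the round-trip time between the two droplets directly (the address below is a placeholder for your peer droplet's public IP):

ping -c 10 <other-droplet-public-ip>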

Please help me shed some light on this.

2 comments
  • Having the same issue, though with a single node. I tried creating a new single node with a new token and got the same issue:

    curl -I http://127.0.0.1:4001/
    HTTP/1.1 404 Not Found
    Content-Type: text/plain; charset=utf-8
    Date: Mon, 06 Oct 2014 14:51:54 GMT
    
  • @rayslakinski I think that always returns a 404; it's not an actual endpoint. Have you tried curl -L http://127.0.0.1:4001/v2/stats/leader instead?
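
    Another quick liveness check, if I remember etcd 0.4's endpoints right (/version should always respond when the daemon is up at all):

    curl -L http://127.0.0.1:4001/version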

1 Answer

I've had varying success with setting up cross-data-center clusters. According to the CoreOS documentation, the default settings for etcd are designed for a local network:

The default settings in etcd should work well for installations on a local network where the average network latency is low. However, when using etcd across multiple data centers or over networks with high latency you may need to tweak the heartbeat interval and election timeout settings.

They also suggest some guidelines for the election timeout and heartbeat interval:

The election timeout should be set based on the heartbeat interval and your network ping time between nodes. Election timeouts should be at least 10 times your ping time so it can account for variance in your network. For example, if the ping time between your nodes is 10ms then you should have at least a 100ms election timeout.

You should also set your election timeout to at least 4 to 5 times your heartbeat interval to account for variance in leader replication. For a heartbeat interval of 50ms you should set your election timeout to at least 200ms - 250ms.
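To make those guidelines concrete, here is a minimal sketch assuming a ~100ms ping between regions (illustrative starting values, not tested recommendations): 10x the ping gives an election timeout of at least 1000ms, and picking a 500ms heartbeat interval then pushes the election timeout to at least 2000-2500ms (4-5x the heartbeat):

coreos:
  etcd:
    heartbeat_interval: 500    # comfortably above a ~100ms inter-region ping
    election_timeout: 2500     # 5x the heartbeat interval and >= 10x the ping time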

Playing around with some larger values may help. This guide should point you in the right direction for doing some deeper debugging:

CoreOS is an extremely powerful operating system focused on cluster management, security, and containerized service deployments. However, the unconventional way that the system is set up can make troubleshooting somewhat difficult. In this guide, we'll cover the basics of how to track down issues in your deployment as well as your services.
  • Interesting, so there is some hope then... (apologies for this long post)

    After further reading (thanks for the links), I decided to have another go. I set up two servers in NYC and one in LON.

    The ping time between NYC and LON was about 100ms. Using https://github.com/coreos/etcd/issues/656 for reference, and giving a little bit extra to be sure, I used:

    election_timeout: 7500
    heartbeat_interval: 1500
    
    #cloud-config
    
    coreos:
      etcd:
        name: robocop
        discovery: https://discovery.etcd.io/9e349b96a970f6661552876bbe3a5859
        addr: $public_ipv4:4001
        peer-addr: $public_ipv4:7001
        election_timeout: 7500
        heartbeat_interval: 1500
      fleet:
        public-ip: $public_ipv4
        metadata: region=lon,location=lon1,public_ip=$public_ipv4,host=robocop
      units:
        - name: etcd.service
          command: start
        - name: fleet.service
          command: start
    

    Sadly, this did not work. The NYC servers joined together fine, but the LON server did not attach properly. Running fleetctl list-machines on the LON box gives:

    2014/10/07 10:23:48 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects
    2014/10/07 10:23:48 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
    2014/10/07 10:23:48 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects
    2014/10/07 10:23:48 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
    2014/10/07 10:23:48 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects
    2014/10/07 10:23:48 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 400ms
    2014/10/07 10:23:48 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects
    2014/10/07 10:23:48 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 800ms
    2014/10/07 10:23:49 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects
    2014/10/07 10:23:49 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 1s
    2014/10/07 10:23:50 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: too many redirects
    2014/10/07 10:23:50 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 1s
    

    Looking through journalctl on the LON machine, the trouble started as soon as fleet began "Establishing etcd connectivity":

    Oct 06 23:50:47 robocop.roughcoder.com systemd[1]: Startup finished in 1.034s (kernel) + 2.993s (initrd) + 12.900s (userspace) = 16.928s.
    Oct 06 23:50:47 robocop.roughcoder.com coreos-cloudinit[472]: 2014/10/06 23:50:47 Result of 'restart oem-cloudinit.service': done
    Oct 06 23:50:47 robocop.roughcoder.com fleetd[588]: INFO fleet.go:144: No provided or default config file found - proceeding without
    Oct 06 23:50:47 robocop.roughcoder.com fleetd[588]: INFO server.go:137: Establishing etcd connectivity
    Oct 06 23:50:47 robocop.roughcoder.com fleetd[588]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
    Oct 06 23:50:47 robocop.roughcoder.com fleetd[588]: ERROR client.go:200: Unable to get result for {Update /_coreos.com/fleet/machines/152f5e4477624726833d9b802e674ebb/object}, retrying in 100ms
    Oct 06 23:50:47 robocop.roughcoder.com fleetd[588]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
    Oct 06 23:50:47 robocop.roughcoder.com fleetd[588]: ERROR client.go:200: Unable to get result for {Update /_coreos.com/fleet/machines/152f5e4477624726833d9b802e674ebb/object}, retrying in 200ms
    Oct 06 23:50:47 robocop.roughcoder.com fleetd[588]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
    

    I ran curl -L http://127.0.0.1:4001/v2/stats/self on the LON box, and it returns some of the basic etcd information - the leader is correct:

    {
        "name": "robocop",
        "state": "follower",
        "startTime": "2014-10-06T23:50:47.413895957Z",
        "leaderInfo": {
            "leader": "r2d2",
            "uptime": "10h14m54.128897017s",
            "startTime": "2014-10-07T00:02:24.684201161Z"
        },
        "recvAppendRequestCnt": 1,
        "sendAppendRequestCnt": 0
    }
    

    Running stats against the leader fails, though: curl -L http://127.0.0.1:4001/v2/stats/leader

    {"errorCode": 300, "message": "Raft Internal Error", "index": 566}
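
    (One thing I am not sure about: etcd 0.4 may only answer /v2/stats/leader from the current leader itself, in which case querying this LON follower would fail regardless. Assuming r2d2 is the leader, the same check pointed at its address would be:

    curl -L http://<r2d2-public-ip>:4001/v2/stats/leader

    where <r2d2-public-ip> is a placeholder for that droplet's public address.)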
    

    I can write to etcd, though:

    time etcdctl set /foo bar
    bar
    
    real    0m0.169s
    user    0m0.004s
    sys 0m0.001s
    

    and read the machines in etcd:

    etcdctl ls /_etcd/machines --recursive
    /_etcd/machines/r2d2
    /_etcd/machines/c3po
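
    If etcd 0.4 stores each machine's advertised URLs as the values of those keys (I believe it does, though I am not sure of the exact format), dumping one should show whether a peer registered a public or a private address:

    etcdctl get /_etcd/machines/r2d2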
    

    So, does anybody have any further advice on setting the correct election_timeout/heartbeat_interval, an idea why fleet has issues, or even why curl -L http://127.0.0.1:4001/v2/stats/leader fails for me?

    So many questions.....

  • With a

    curl -L http://127.0.0.1:4001/v2/stats/leader
    

    I get

    {"errorCode":300,"message":"Raft Internal Error","index":110760}
    
  • @neil172251

    By looking at journalctl you should see your new region's machine trying to connect. In this instance I was using private IP addresses that the machine in the remote region could not reach:

    Nov 05 16:52:09 coreos-18 etcd[607]: [etcd] Nov 5 16:52:09.153 INFO | coreos-18 attempted to join via 10.132.189.118:7001 failed: fail checking join ver
    Nov 05 16:52:10 coreos-18 etcd[607]: [etcd] Nov 5 16:52:10.503 INFO | coreos-18 attempted to join via 10.132.189.115:7001 failed: fail checking join ver
    Nov 05 16:52:10 coreos-18 etcd[607]: [etcd] Nov 5 16:52:10.506 INFO | coreos-18 attempted to join via 10.132.192.113:7001 failed: fail checking join ver
    Nov 05 16:52:10 coreos-18 etcd[607]: [etcd] Nov 5 16:52:10.507 INFO | coreos-18 attempted to join via 10.132.191.194:7001 failed: fail checking join ver
    Nov 05 16:52:10 coreos-18 etcd[607]: [etcd] Nov 5 16:52:10.509 INFO | coreos-18 attempted to join via 10.132.189.116:7001 failed: fail checking join ver
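
    To watch these join attempts live while the new machine boots, you can follow the etcd unit's journal:

    journalctl -u etcd -f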
