Private networking seems slow when Kubernetes pulls from my private Docker registry.

November 11, 2019

My Current Setup

1) Kubernetes cluster with two nodes. Each node has 1 vCPU and 2GB of RAM. ($10/mo per node * 2 nodes = $20/mo)
2) A single Droplet with 1 vCPU and 2GB of RAM. ($10/mo)
3) The Droplet has a private IP, and is in the same region as the cluster.

Docker Registry

The Droplet hosts a private Docker registry with TLS enabled. I use a wildcard certificate because I have two subdomains on my domain:

  • subdomain1.mydomain.com points to my Droplet’s public IP, so I can push new images to it.
  • subdomain2.mydomain.com points to my Droplet’s private IP, and the Kubernetes cluster has an image pull secret set so that it pulls images over the private network (see the sketch just below).
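
For reference, the pull secret is the standard Kubernetes docker-registry secret. A minimal sketch of creating one (the name regcred and the credentials are placeholders):

    kubectl create secret docker-registry regcred \
      --docker-server=subdomain2.mydomain.com \
      --docker-username=<user> \
      --docker-password=<password>

Pods then reference the secret via imagePullSecrets in their spec.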

The private Docker registry runs inside a Docker container on the Droplet. It listens on 0.0.0.0:443, which allows it to receive connections on both the private AND the public IP.
I used the Droplet metrics to confirm that the cluster was indeed pulling the images over the private network and not over the public one.
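
Roughly, the registry container is started like the sketch below (the cert/key paths and volume locations are illustrative, not my exact setup; the REGISTRY_HTTP_* variables are the standard overrides for the registry:2 image):

    docker run -d --name registry --restart=always \
      -p 443:443 \
      -v /opt/registry/certs:/certs \
      -v /opt/registry/data:/var/lib/registry \
      -e REGISTRY_HTTP_ADDR=0.0.0.0:443 \
      -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/wildcard.crt \
      -e REGISTRY_HTTP_TLS_KEY=/certs/wildcard.key \
      registry:2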

The Problem

All connections work, but the private communications are quite slow (only about 16 Mbps at best). Additionally, if I launch two pods on the K8s cluster simultaneously, where both pods need to pull an image, one pod will time out while the other successfully pulls its image. The cluster retries the failed pull a couple of minutes later and then succeeds.

I suspect the timeout problem is something I need to fix myself in the Docker registry config, but 16 Mbps seems rather slow for private networking on DigitalOcean.
If I’m going to keep using DigitalOcean for my current project, I’m going to need a strong private network. My current project relies on a TON of communication between nodes in a cluster.

Additional Test

Finally, I did one more test. I temporarily created a second Droplet with private networking enabled (1 vCPU/2GB RAM) and used Docker to pull two images simultaneously from the registry. I managed to pull at a speed of around 25.4 Mbps, and neither pull timed out. Still, a speed of only 25.4 Mbps seems slow.
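
For anyone reproducing this, the test was essentially the sketch below (image names are placeholders). The iperf3 step is a way to separate raw private-network throughput from registry/TLS overhead, assuming iperf3 is installed on both Droplets:

    # on the second droplet: two simultaneous pulls over the private subdomain
    time docker pull subdomain2.mydomain.com/image1:latest &
    time docker pull subdomain2.mydomain.com/image2:latest &
    wait

    # raw bandwidth check: run "iperf3 -s" on the registry droplet first, then:
    iperf3 -c <registry-private-ip> -P 2 -t 20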

5 comments
  • Hi there,

    The CNI provider we use on our clusters is Cilium. It runs as a pod on each node and manages that node’s networking. The small 1 vCPU/2GB nodes are not intended for production workloads; after the containerized infrastructure is deployed, there are very few resources left for user workloads. Can you run your tests on larger nodes to see if Cilium is being choked by resource constraints?
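
    One quick way to check is to watch Cilium’s actual resource usage while a pull is in flight, for example (this assumes metrics-server is available and that the Cilium pods carry the usual k8s-app=cilium label):

        kubectl top pods -n kube-system -l k8s-app=cilium
        kubectl top nodes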

  • @jkwiatkoski

    I switched to the ‘CPU Optimized’ 2 vCPU/4GB nodes (2 of them) and got rid of the cluster of small nodes.
    That seems to have done a lot better. Monitoring shows I peaked at 47.2 Mbps.

    The Kubernetes cluster downloaded 2 separate images of ~650 MB each.
    650 MB * 2 = 1300 MB, downloaded over a total span of 20 seconds.

    Both pods took exactly 12 seconds to download their individual image. One pod was launched 8 seconds after the other (due to my own software, not an issue with DigitalOcean), so their downloads overlapped for 4 seconds.

    So avg download speed for an individual pod: 650 MB * (8 Mb / 1 MB) / 12 s ≈ 433 Mbps
    Avg download speed for both pods combined: 1300 MB * (8 Mb / 1 MB) / 20 s = 520 Mbps

    It’s probably safe to assume the CPU of the small Droplet hosting the registry is the new bottleneck, since htop showed the CPU at 100% while the two images were downloading.
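
    A rough way to sanity-check that theory would be to benchmark how much TLS crypto a single vCPU can push on the registry Droplet, assuming the connections negotiate an AES-GCM cipher:

        openssl speed -evp aes-128-gcm

    That only measures the cipher itself, though; the registry also spends CPU on HTTP handling and storage I/O.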

    Questions I Still Have:
    1) Did I use a large enough machine for the Kubernetes nodes for my test?
    2) Assuming I use appropriately sized droplets and K8s nodes, what upload/download speed can I realistically achieve over the private network?
    3) Why are my Droplet’s graphs only showing a peak of 47.2 Mbps and a CPU peak of 13%? While Kubernetes was downloading the two images, the CPU on the Droplet was a solid 100% for the 20 seconds that one or both images were being downloaded.

  • 1) That would depend on whether you feel the performance is adequate for your application, as this will vary per workload and per user preference. I’ll point out that reaching out to an external Droplet will probably perform quite differently from reaching out to other services within the cluster.

    2) I don’t believe we have an exact number to share here for expected performance.

    3) Can you please clarify which Droplet is being discussed here? Was the 13% spike on the registry or on the DOKS node? Which Droplet was then at 100% for 20 s? Droplet graphs can be misleading, especially when small bursts occur, since the reporting depends heavily on when resource usage happens to be polled.
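
    To catch a burst like that, sampling locally on the Droplet at one-second resolution while the pull is running tends to be more reliable than the graphs. A rough sketch (eth1 is usually a Droplet’s private interface; adjust if yours differs):

        # CPU, once per second for 30 s
        vmstat 1 30

        # private-interface RX/TX byte counters, once per second
        for i in $(seq 30); do
          cat /sys/class/net/eth1/statistics/rx_bytes \
              /sys/class/net/eth1/statistics/tx_bytes
          sleep 1
        done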

  • @jkwiatkoski

    The Droplet that was at 100% CPU was the registry Droplet. It stayed at 100% for the duration that the cluster was downloading the two images, which was about 20 s total. It’s a 1 vCPU/2GB Droplet; it’s not in the cluster, but it is on the private network.

  • I would be curious to see whether resources on the registry are your current bottleneck.
