Question

Kubernetes deployment with external load balancer: zero downtime rollouts

Environment

My Kubernetes cluster, managed by DigitalOcean, has only one node for now.

I deployed a web application that runs as 3 pods, all on that single node, and exposed it outside the cluster with an external DigitalOcean load balancer.

Here are the k8s resource definitions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: shovik-com
  labels:
    app: shovik-com
spec:
  replicas: 3
  selector:
    matchLabels:
      app: shovik-com
  template:
    metadata:
      labels:
        app: shovik-com
    spec:
      containers:
      - name: shovik-com
        image: aspushkinus/shovik:latest
        imagePullPolicy: Always
        ports:
          - containerPort: 80
        envFrom:
          - secretRef:
              name: shovik-com
---
apiVersion: v1
kind: Service
metadata:
  name: shovik-com-balancer
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-tls-ports: "443"
    service.beta.kubernetes.io/do-loadbalancer-certificate-id: "do-cert-id"
    service.beta.kubernetes.io/do-loadbalancer-redirect-http-to-https: "true"
spec:
  type: LoadBalancer
  selector:
    app: shovik-com
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 80
    - name: https
      protocol: TCP
      port: 443
      targetPort: 80

This works great and the website is live: https://shovik.com/

The problem

Whenever I deploy a new version of the app with the standard Kubernetes rolling-update strategy, the site goes down for about a minute and DigitalOcean’s load balancer responds with “503 Service Unavailable”. This happens even though at least 2 pods are in the “Running” state at all times.

Question

How can I implement a zero-downtime deployment using DigitalOcean’s k8s and load balancer? Should I put another NodePort service in front of the LoadBalancer?


Accepted Answer

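This is fixed now. I asked too soon, but I hope this helps someone else: I had to add a readinessProbe and a livenessProbe to my deployment. The kubelet uses the readiness probe to check that a pod is actually ready to accept traffic before the Service routes requests to it, and the liveness probe to restart a container that stops responding. Without a readiness probe, a pod is marked ready as soon as its container starts, so the rolling update was sending traffic to pods whose app was not serving yet.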

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

Updated deployment resource:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: shovik-com
  labels:
    app: shovik-com
spec:
  replicas: 3
  selector:
    matchLabels:
      app: shovik-com
  template:
    metadata:
      labels:
        app: shovik-com
    spec:
      containers:
      - name: shovik-com
        image: aspushkinus/shovik:latest
        imagePullPolicy: Always
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 15
        ports:
          - containerPort: 80
        envFrom:
          - secretRef:
              name: shovik-com
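
A possible further hardening step, sketched under assumptions rather than taken from the answer above: make the rollout strategy explicit and give old pods time to drain. With 3 replicas the default 25% values already resolve to maxUnavailable 0 and maxSurge 1, so the strategy block mainly documents that intent; minReadySeconds additionally requires a new pod to stay ready for a while before it counts as available. The preStop sleep targets a separate race: when an old pod is terminated, it can receive SIGTERM before the endpoint change reaches kube-proxy and the load balancer, so some requests may still be routed to it. The 10-second values are illustrative, not tuned, and the `sleep` command assumes the image contains a `sleep` binary. These fields are meant to be merged into the Deployment manifest above; the fragment is not a complete manifest.

```yaml
# Fragment to merge into the shovik-com Deployment (not a complete manifest).
spec:
  # A new pod must stay Ready for 10s before it counts as available.
  minReadySeconds: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never go below the desired replica count
      maxSurge: 1         # create at most one extra pod at a time
  template:
    spec:
      # Total time from termination start to SIGKILL; it covers the
      # preStop sleep plus the app's own shutdown after SIGTERM.
      terminationGracePeriodSeconds: 30
      containers:
      - name: shovik-com
        lifecycle:
          preStop:
            exec:
              # Assumption: the image ships a `sleep` binary. The delay gives
              # endpoint updates time to propagate before SIGTERM is sent.
              command: ["sleep", "10"]
```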

Hi there!

Thank you for providing your solution. I am happy to see this resolved your issue.

To reduce the potential for downtime, I would highly recommend adding at least a second node, ideally a third. Single-node clusters are not a solid foundation for production workloads. I would argue that rollouts are the less pressing concern here, since you control when they happen. Node and infrastructure outages (patching, upgrades, maintenance, failures, etc.) are bound to happen, so preparing for them may be a more valuable step toward better uptime. Just food for thought.

Regards,

John Kwiatkoski, Senior Developer Support Engineer
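
Once the cluster has more than one node, a PodDisruptionBudget is one way to keep a minimum number of replicas running during voluntary disruptions such as a `kubectl drain`, which node upgrades and maintenance typically involve. The sketch below assumes the `app: shovik-com` label from the manifests above; the name `shovik-com-pdb` and `minAvailable: 2` are illustrative choices. It does not protect against involuntary disruptions such as a node crashing, and on a single-node cluster it would simply cause a drain to stall, because evicted pods have no other node to move to. Spreading the replicas across nodes (for example with `topologySpreadConstraints` keyed on `kubernetes.io/hostname`) is a complementary step.

```yaml
# Hypothetical PodDisruptionBudget for the shovik-com pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: shovik-com-pdb
spec:
  # Voluntary evictions (e.g. during a node drain) are refused if they
  # would leave fewer than 2 matching pods available.
  minAvailable: 2
  selector:
    matchLabels:
      app: shovik-com
```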
