Experiencing random sporadic 504 timeouts running Laravel on DigitalOcean Kubernetes

Question

I’m running a Laravel application on a DigitalOcean Kubernetes cluster. The application consists of three pods running on one of three cluster nodes and has traffic routed to it via nginx-ingress and a DigitalOcean Load Balancer. The application uses DigitalOcean managed databases for the backend, and I’m using Grafana/Prometheus for monitoring and log collation.

Each application pod consists of three containers:

Laravel Application (PHP-FPM and Nginx sidecar) - I’m aware it’s bad practice to group multiple services together in a single pod but I did this for simplicity until I get my head around the K8s architecture. I was experiencing some major performance issues with using shared volumes for both the nginx and fpm containers to access the codebase. Pods were taking minutes to spin up because the files had to be copied from the Docker image to the shared volume. (If anyone has a better solution to this then I’m all ears!)
Nginx Metrics
FPM Metrics

I’m using Helm3/Flux to deploy the application and the .yaml files look like this:

Ingress:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: {{ template "app.fullname" . }}
  labels:
    app: {{ template "app.name" . }}
    chart: {{ template "app.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  tls:
    - hosts:
        - [REDACTED].com
      secretName: app-tls
  rules:
    - host: {{ .Values.ingress.host }}
      http:
        paths:
          - path: {{ .Values.ingress.path }}
            backend:
              serviceName: {{ template "app.fullname" . }}
              servicePort: http

Service:

apiVersion: v1
kind: Service
metadata:
  name: {{ template "app.fullname" . }}
  labels:
    app: {{ template "app.name" . }}
    chart: {{ template "app.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
    - port: 9253
      targetPort: 9253
      protocol: TCP
      name: fpmmetrics
    - port: 9113
      targetPort: 9113
      protocol: TCP
      name: nginxmetrics
  selector:
    app: {{ template "app.name" . }}
    release: {{ .Release.Name }}

With the above config I’m able to access the site via the public hostname. The vast majority of the time the site performs as expected; however, around one in 10 requests never resolves and returns a 504 if I leave it long enough. If I initiate a refresh before the 504 is returned, chances are that the page loads almost instantly.

I’m aware that I haven’t given enough details for anyone to diagnose this, but I’m at a loss as to where to start debugging. I’ve checked the logs and nothing stands out as abnormal, and when the site does load it performs as expected, which leads me to believe that it’s an ingress/networking issue.

How would I go about starting to diagnose this?

John Kwiatkoski · Answer

Hi there!

Thanks for reaching out. I would check the following:

Laravel App logs -> Did the app itself return a 504? Is the request even logged here, which would verify that it’s getting this far?

Ingress controller logs -> Did the nginx controller log that the connection attempt occurred? Did the ingress controller return the 504? Does it provide any data about which ingress rules were applied to the request, i.e. what it did with the request?
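When reading nginx logs at either hop, the `upstream:` field of a timeout entry shows which backend nginx was waiting on. Below is a minimal Python sketch, assuming the standard nginx error-log layout seen in the log line quoted later in this thread. The function name `parse_upstream_timeout` is hypothetical.

```python
import re


def parse_upstream_timeout(line):
    """Return the key/value fields of an 'upstream timed out' entry, or None."""
    if "upstream timed out" not in line:
        return None
    fields = {}
    for name in ("client", "server", "request", "upstream", "host"):
        # Values are either double-quoted or run up to the next comma.
        match = re.search(rf'\b{name}: (?:"([^"]*)"|([^,]+))', line)
        if match:
            quoted, bare = match.group(1), match.group(2)
            fields[name] = quoted if quoted is not None else bare.strip()
    return fields
```

An `upstream` value such as `fastcgi://127.0.0.1:9000` would mean the nginx sidecar inside the pod was waiting on PHP-FPM. An upstream pointing at a pod IP in the ingress controller's log would suggest the controller was waiting on the pod.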

Other steps you can take:

Isolation is key

Does this happen for any other applications? If it occurs for all or several other applications, it’s not likely an issue with your Laravel app but with a piece of the infrastructure involved in all your apps’ connections (ingress controller, LB, containerized infrastructure, nodes, internal DNS, etc.).

Does this behavior also occur when you remove the LB from the mix and access the ingress controller service via its NodePort? You can do this with curl by forcing the hostname to resolve to the public IP of one of your nodes and using the service’s NodePort. Keeping the real hostname ensures the request matches the proper ingress rules, for example if you are using host-based routing.
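With curl, this is what the `--resolve host:port:address` option does. The same idea as a Python sketch: dial one node's IP and NodePort while still presenting the site's hostname for TLS SNI and the Host header, so the host-based ingress rule matches. The names `PinnedHTTPSConnection` and `probe_via_node` are hypothetical, and the 90-second default timeout is an assumption chosen to outlast nginx's default 60-second proxy read timeout.

```python
import http.client
import socket
import ssl
import time


class PinnedHTTPSConnection(http.client.HTTPSConnection):
    """HTTPS connection that dials a fixed IP but keeps the hostname for SNI."""

    def __init__(self, host, port, pinned_ip, timeout, context):
        super().__init__(host, port, timeout=timeout, context=context)
        self.pinned_ip = pinned_ip
        self.tls_context = context

    def connect(self):
        # Connect to the node directly, then handshake as the real hostname.
        raw = socket.create_connection((self.pinned_ip, self.port), self.timeout)
        self.sock = self.tls_context.wrap_socket(raw, server_hostname=self.host)


def probe_via_node(host, node_ip, node_port, path="/", timeout=90.0, tls=True):
    """Request `path` through one node's NodePort; return (status, seconds)."""
    if tls:
        conn = PinnedHTTPSConnection(host, node_port, node_ip, timeout,
                                     ssl.create_default_context())
        headers = {}  # http.client derives the Host header from `host`
    else:
        conn = http.client.HTTPConnection(node_ip, node_port, timeout=timeout)
        headers = {"Host": host}  # override the IP-based default
    started = time.monotonic()
    try:
        conn.request("GET", path, headers=headers)
        status = conn.getresponse().status
    finally:
        conn.close()
    return status, time.monotonic() - started
```

If requests sent this way never return a 504 while requests through the load balancer do, that would point at the LB hop rather than the ingress controller or the application.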

Does the issue occur when you take the ingress controller out of the mix? Can you make similar requests to those mentioned above using curl, but hitting your Laravel app directly? Does it eventually return the 504 as well? This would tell you whether it’s an issue with your Laravel application.
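Because only about one request in ten fails, a single request per hop proves little. Repeating the request at each hop (load balancer, NodePort, direct to the pod) and tallying the outcomes makes the comparison concrete. A sketch, assuming a caller-supplied `fetch` callable that returns `(status, seconds)`; `tally` and `slow_after` are hypothetical names.

```python
from collections import Counter


def tally(fetch, attempts=50, slow_after=5.0):
    """Call fetch() repeatedly; count outcomes by status code and slow responses."""
    statuses = Counter()
    slow = 0
    for _ in range(attempts):
        try:
            status, seconds = fetch()
        except OSError as exc:  # refused connections, socket timeouts, TLS errors
            statuses[type(exc).__name__] += 1
            continue
        statuses[status] += 1
        if seconds > slow_after:
            slow += 1
    return {"statuses": dict(statuses), "slow": slow, "attempts": attempts}
```

For the direct test, `fetch` could wrap any HTTP client call against a local port opened with `kubectl port-forward` to one pod. A hop that shows 504s or slow responses while the hop behind it stays clean is the likely place to dig further.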

I hope these steps help!

Regards,

John Kwiatkoski Senior Developer Support Engineer - Kubernetes

Ekion2 · Answer

Hi @jkwiatkoski,

Thanks for your reply. After debugging this over the last few days, it appears that the issue is with PHP. I took the following steps to reach this conclusion:

I launched a simple non-Laravel podinfo deployment on the same cluster, using the same ingress and the same load balancer, and it performs flawlessly. Granted, this application is not making calls to a database, but it at least eliminates the ingress/LB as the cause.

When spinning up my pods I’m calling php artisan commands (passport:keys, migrate --force, config:cache), and this sometimes makes the pod take 3-4 minutes to start up. If I run these commands in the containers themselves, they can take a long time to complete.

To test the theory further, I ran a simulated load test against the homepage of the application (before login), and the application slows to a crawl, to the point where it’s not responsive at all.

The next step was to swap out the application for a fresh Laravel install (using the same webdevops/php-nginx image). When performing a load test on this application (single pod, 10 PHP-FPM workers), the FPM pool fails and 0 workers are available for a good few minutes. I’m only starting to learn the intricacies of PHP-FPM, but shouldn’t any additional requests be queued rather than killing the process entirely?

Digging into the FPM logs doesn’t reveal anything other than the request timing out:

2020-07-04 12:49:02 2020/07/04 11:49:02 [error] 89#89: *67 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.244.0.67, server: [REDACTED].com, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "[REDACTED].com"

I’m really at a loss as to how to stop this from happening. I know that this isn’t the original question, but I’m pretty certain that they’re linked. Any thoughts?
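On the queueing question: PHP-FPM does queue connections beyond its worker count, but nginx only waits `fastcgi_read_timeout` (60 seconds by default) for a response. Once the queue delay passes that, nginx logs an "upstream timed out" entry like the one above and returns a 504. Below is a toy Python model of that effect. It is not a measurement of this pool; `simulate_pool` is a hypothetical name and all parameter values are illustrative assumptions.

```python
import heapq


def simulate_pool(workers, service_s, arrivals_per_s, duration_s, timeout_s=60.0):
    """Toy FIFO model of a fixed worker pool; returns (served, timed_out)."""
    free_at = [0.0] * workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    served = timed_out = 0
    total = int(arrivals_per_s * duration_s)
    for i in range(total):
        arrival = i / arrivals_per_s        # evenly spaced arrivals
        start = max(arrival, free_at[0])    # wait for the earliest idle worker
        finish = start + service_s
        # Assumption: the worker still does the work even if the client gave up.
        heapq.heapreplace(free_at, finish)
        if finish - arrival > timeout_s:
            timed_out += 1
        else:
            served += 1
    return served, timed_out
```

For example, `simulate_pool(workers=10, service_s=2.0, arrivals_per_s=10, duration_s=300)` models a pool that completes 5 requests per second while 10 arrive. Latency then grows by roughly one second per second of load, so requests arriving after about the first minute all exceed the timeout. In a real pool, `pm.max_children`, `listen.backlog`, and the FPM status page (`pm.status_path`) are the places to look.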

gmarineau · Answer

Hi @ekion2, we have the same problem. Have you found a solution?
