Load Balancer issues: 504 gateway errors and improper load balancing

August 16, 2019
DigitalOcean Nginx Load Balancing Ubuntu 16.04

I originally had a single droplet running Ubuntu 16.04, nginx, Unicorn, and Rails. I followed the tutorial here to set up TLS with Let’s Encrypt: https://www.digitalocean.com/community/tutorials/how-to-secure-nginx-with-let-s-encrypt-on-ubuntu-16-04

I then wanted to create another instance of this droplet and put a load balancer in front of the 2 droplets. To do this, I followed this advice from the load balancer docs:

All of the Droplets you use with your load balancer need to have the same SSL certificate. After your setup works with one backend server, you can create an image of the first Droplet to use to create additional instances.

I then pointed my original DNS record at the load balancer’s IP rather than the first droplet’s IP. Note that my DNS is hosted with Google Domains, not DigitalOcean; I’m not sure whether that is significant.

I have my load balancer set to SSL passthrough and round robin.

I am seeing 2 issues now:

  1. A large majority of my requests are going to the first droplet (per tailing the nginx access logs) rather than being distributed evenly between the two. I am testing this by rapidly sending requests with Postman and Swagger; see the curl sketch after this list for a more controlled test.

  2. For the 2nd droplet, I’m getting only 504 gateway timeout errors. Specifically:

==> error.log <==
2019/08/16 00:41:49 [error] 14126#14126: *243 connect() to unix:/home/gc-portal/app/shared/unicorn.sock failed (111: Connection refused) while connecting to upstream, client: 167.71.106.89, server: my.domain.io, request: "POST /v2/recipes HTTP/1.1", upstream: "http://unix:/home/gc-portal/app/shared/unicorn.sock:/v2/recipes", host: "my.domain.io"
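
Regarding the first issue, a more controlled way to generate test traffic than Postman is a small curl loop (a sketch; my.domain.io stands in for my real hostname):

# Fire 100 one-shot requests at the load balancer; each curl invocation
# opens a fresh connection, so keep-alive can't pin requests to one backend.
for i in $(seq 1 100); do
    curl -s -o /dev/null -w "%{http_code}\n" https://my.domain.io/
done

Tailing the nginx access log on each droplet while this runs shows how the requests are actually split.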

My nginx conf for this app looks like this. Note, though, that the nginx conf and the unicorn conf for both droplets are identical, since I cloned the first one.

Nginx conf:

upstream gc-portal {
        server unix:/home/gc-portal/app/shared/unicorn.sock fail_timeout=0;
}

server {
        server_name my.domain.io;

        root /home/gc-portal/app/current/public;

        location /assets/  {
                gzip_static on; # serve pre-gzipped version
                expires 1M;
                add_header Cache-Control public;
        }

        location / {
                try_files $uri @app;
        }

        location @app {
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;
                proxy_set_header Host $http_host;
                proxy_redirect off;
                proxy_pass http://gc-portal;
        }

        # location /pubsub {
        #         proxy_pass http://gc-portal/pubsub;
        #         proxy_http_version 1.1;
        #         proxy_set_header Upgrade websocket;
        #         proxy_set_header Connection Upgrade;
        #         proxy_set_header X-Real-IP $remote_addr;
        #         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        #         proxy_set_header Host $http_host;
        # }

    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/my.domain.io/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/my.domain.io/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}

server {
    if ($host = my.domain.io) {
        return 301 https://$host$request_uri;
    } # managed by Certbot

    listen 80;
    server_name my.domain.io;
    return 404; # managed by Certbot
}

Any thoughts on what the issue might be here?

2 Answers

Hello,

I think that the problem might be with your unicorn service on the second droplet. Can you make sure that unicorn is running and that the socket exists at /home/gc-portal/app/shared/unicorn.sock?
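
For example, something like this on the second droplet would confirm both (a sketch; the socket path is taken from your nginx conf, and the last check assumes a curl build with --unix-socket support):

ps aux | grep '[u]nicorn'                      # master and worker processes should be listed
ls -l /home/gc-portal/app/shared/unicorn.sock  # should exist and show type 's' (socket)
curl --unix-socket /home/gc-portal/app/shared/unicorn.sock http://localhost/  # should return your app's response, bypassing nginx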

Also make sure that unicorn is set to start on boot so that if you have to restart your droplet, the unicorn service also starts.
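
If unicorn is managed as a systemd service (an assumption; the unit name unicorn below is a placeholder for whatever your setup uses):

sudo systemctl status unicorn   # is it running right now?
sudo systemctl enable unicorn   # make it start on boot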

Once the two droplets are healthy, the load balancing should work as normal and distribute the load equally between them.

Hope that this helps!
Let me know how it goes.
Regards,
Bobby

  • The unicorn processes are running on both droplets. For the working droplet, ps aux | grep 'unicorn' yields:

    gc-port+  6176  0.0  0.2  86324 21732 ?        Sl   Aug15   0:01 unicorn master -c /home/gc-portal/app/shared/config/unicorn.conf.rb
    gc-port+  7187  0.0  1.7 599172 144224 ?       Sl   Aug16   0:56 unicorn worker[0] -c /home/gc-portal/app/shared/config/unicorn.conf.rb
    gc-port+  7190  0.0  1.7 601400 144820 ?       Sl   Aug16   0:54 unicorn worker[1] -c /home/gc-portal/app/shared/config/unicorn.conf.rb
    

    and for the problem droplet:

    gc-port+  1083  0.0  0.0  12944   964 pts/0    S+   22:51   0:00 grep --color=auto unicorn
    gc-port+ 14838  0.0  0.1  86336 21968 ?        Sl   Aug16   0:01 unicorn master -c /home/gc-portal/app/shared/config/unicorn.conf.rb
    gc-port+ 15304  0.0  0.8 591456 143648 ?       Sl   Aug16   1:03 unicorn worker[0] -c /home/gc-portal/app/shared/config/unicorn.conf.rb
    gc-port+ 15307  0.0  0.8 591872 144064 ?       Sl   Aug16   0:41 unicorn worker[1] -c /home/gc-portal/app/shared/config/unicorn.conf.rb
    
    

    The unicorn configuration files are exactly the same, and the path to the socket is the same on both.

    The load balancer is seeing both droplets as healthy; the problem droplet is showing many of these calls, which I presume are coming from the load balancer:

    167.71.108.163 - - [18/Aug/2019:22:54:25 -0400] "GET / HTTP/1.0" 404 178 "-" "-"
    167.71.106.89 - - [18/Aug/2019:22:54:28 -0400] "GET / HTTP/1.0" 404 178 "-" "-"
    

    Yet, actual API calls very rarely go to that droplet, and when they do, they receive the 504 error.

    • This sounds like an issue with your Unicorn setup. Can you share your unicorn.conf content here so that we can advise further?

      It is normal for the Load Balancer not to send traffic to a node that is not healthy. Once you’ve sorted out the unicorn problem, the load balancing will get back to normal.

      • unicorn.conf.rb is:

        app_path = File.expand_path(File.join(File.dirname(__FILE__), '../../'))
        
        #listen '127.0.0.1:4000'
        listen File.join(app_path, 'shared/unicorn.sock'), :backlog => 64
        
        worker_processes 2
        
        working_directory File.join(app_path, 'current')
        pid File.join(app_path, 'shared/unicorn.pid')
        stderr_path File.join(app_path, 'current/log/unicorn.log')
        stdout_path File.join(app_path, 'current/log/unicorn.log')
        
        ENV['RACK_ENV'] = "deployment"
        
        • Hello,

          Here are a few questions that could point you in the right direction:

          • If you visit the IP of the problematic droplet in your browser what do you get?

          • Are there any errors in the unicorn.log?

          • Does the /home/gc-portal/app/shared/unicorn.sock socket exist?

          • Do the sock file and the /home/gc-portal/app/shared directory both have read permissions for user, group and other? (See the snippet after this list.)

          • Have you tried restarting unicorn manually?
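
          For the socket and permission checks, something along these lines should work (a sketch; the path comes from your nginx conf, and www-data is an assumption for the user nginx runs as):

          ls -l /home/gc-portal/app/shared/unicorn.sock     # should be type 's' (socket)
          namei -l /home/gc-portal/app/shared/unicorn.sock  # permissions on every path component
          sudo -u www-data test -w /home/gc-portal/app/shared/unicorn.sock && echo writable  # nginx needs write access on the socket (and execute on the directories above it) to connect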

            • Visiting the IP of the problematic droplet redirects to example.com and shows:

            “Example Domain
            This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.”

            • Yes, in fact. There are errors in the unicorn log for the problem server that do not appear in the healthy one:
            E, [2019-08-20T23:30:46.509841 #2869] ERROR -- : worker=1 PID:13934 timeout (61s > 60s), killing
            E, [2019-08-20T23:30:46.519658 #2869] ERROR -- : reaped #<Process::Status: pid 13934 SIGKILL (signal 9)> worker=1
            I, [2019-08-20T23:30:46.519908 #2869]  INFO -- : worker=1 spawning…
            I, [2019-08-20T23:30:46.521481 #14608]  INFO -- : worker=1 spawned pid=14608
            I, [2019-08-20T23:30:46.521669 #14608]  INFO -- : Refreshing Gem list
            I, [2019-08-20T23:30:50.153154 #14608]  INFO -- : worker=1 ready
            167.71.106.89 - - [20/Aug/2019:23:34:25 -0400] "GET /404 HTTP/1.0" 404 1722 0.0061
            E, [2019-08-20T23:35:29.806072 #2869] ERROR -- : worker=0 PID:13927 timeout (61s > 60s), killing
            E, [2019-08-20T23:35:29.821142 #2869] ERROR -- : reaped #<Process::Status: pid 13927 SIGKILL (signal 9)> worker=0
            I, [2019-08-20T23:35:29.821454 #2869]  INFO -- : worker=0 spawning…
            I, [2019-08-20T23:35:29.823911 #14628]  INFO -- : worker=0 spawned pid=14628
            I, [2019-08-20T23:35:29.824164 #14628]  INFO -- : Refreshing Gem list
            I, [2019-08-20T23:35:33.411407 #14628]  INFO -- : worker=0 ready
            
            

            This is probably the issue. Every time I make a request, the call seems to hang until the unicorn timeout is reached; then I see that the unicorn worker is reaped.

            The very peculiar thing is that this doesn’t happen on the other droplet. I’ve even powered down the healthy droplet to rule out things like duplicate attempts at database connections, connections to a logging server, etc. Even when the problem droplet is the only one running, the unicorn workers still time out and are reaped. This problem droplet was made from an image, so the code and all the configs should be identical!
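
            For reference, attaching strace to a worker while a request hangs shows which system call it is blocked on (assuming strace is installed; <worker_pid> is a placeholder):

            ps aux | grep '[u]nicorn worker'   # note a worker PID
            sudo strace -p <worker_pid>        # a worker stuck on an outbound connection will sit in connect()/poll()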

          • Hi @davidbf1f5deefd1bab899a7f8

            After seeing the errors and the additional information that you’ve provided, my guess would be that you are using an external database server and your new droplet’s IP is not allowed/whitelisted on the database server, so whenever unicorn tries to connect to the database, the connections just time out and the workers are killed.

            If this is the case, I would suggest grabbing your new droplet’s IP address and making sure that it is allowed on your database server.
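
            How you allow it depends on your database setup; if the database server’s firewall is ufw, for instance (an assumption; the IP and port below are placeholders for your new droplet’s IP and your database port), it would look something like:

            # run on the database server itself
            sudo ufw allow from 203.0.113.10 to any port 5432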

            You could test the connection manually by logging into your new droplet and using the telnet command:

            telnet your.db.server.ip your.db.port
            

            If the connection times out, then this is most certainly the case.
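
            If telnet isn’t installed, netcat does the same check with a bounded wait (same placeholders as above):

            nc -zv -w 5 your.db.server.ip your.db.port   # -z: connect only, -v: verbose, -w 5: five-second timeout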

            Let me know how it goes!
            Regards,
            Bobby

@bobbyiliev

You were totally right about the IP whitelisting for the database; things are working now as expected. Not the first time this has happened… it’s easily overlooked! Thanks for pointing out the answer hiding in plain sight.

  • Hello,

    No problem at all. I’m happy to hear that we’ve managed to get to the bottom of the issue and that it’s all working now!

    Regards,
    Bobby
