My applications were down but the servers were running fine

November 23, 2015
Linux Commands, Server Optimization, Networking, Monitoring, Nginx, Redis, Ruby, Ruby on Rails, PostgreSQL, CentOS

Hi, my current stack is CentOS 6.4, NGINX with Passenger, PostgreSQL, Redis, and Rails.
Today, from 1:22 AM to 1:25 AM EST, all of my servers were reported as down by ping checks, even though I had active SSH sessions open from PuTTY the whole time. The failed pings also led to downtime for my websites and applications for those 5 minutes.

Did DigitalOcean have network maintenance during that period on November 23? If not, how can I find out the exact cause?

My PostgreSQL server takes a dump every 2 hours, so the dump should not be the cause; otherwise my sites would go down every 2 hours, or at least frequently.

I need detailed help on how to track down the issue, especially which logs or status output to check. Kindly help.

1 Answer

I am not sure whether PostgreSQL locks everything while a dump is running. If it does, that could cause problems for any database transactions in progress at the time. If you have questions about downtime, you can look at the status page and/or contact support. Looking at the status of my own droplets (all in NYC3), I had 100% uptime on the 23rd. You could have experienced a routing issue or something similar. There are many possible causes of downtime, but the first place to check is your logs.
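As a starting point for the log check, here is a minimal sketch that pulls out only the syslog lines written around the outage. It assumes the traditional syslog timestamp format (for example `Nov 23 01:22:03`), which is what CentOS 6 writes to `/var/log/messages` by default. The function name `lines_in_window` is made up for this illustration, and because the server clock may not be set to EST, the window in the usage comment is deliberately wider than the reported 1:22 to 1:25.

```python
from datetime import datetime


def lines_in_window(lines, start, end, year):
    """Return the syslog lines whose timestamp falls within [start, end].

    Traditional syslog timestamps carry no year, so the caller supplies it.
    """
    matches = []
    for line in lines:
        try:
            # The first 15 characters hold the timestamp, e.g. "Nov 23 01:22:03".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, e.g. a continuation line
        if start <= stamp <= end:
            matches.append(line.rstrip("\n"))
    return matches


# Hypothetical usage (run as root on the server; adjust the window to the
# server's own timezone):
#
#   with open("/var/log/messages", errors="replace") as log:
#       for hit in lines_in_window(log,
#                                  datetime(2015, 11, 23, 1, 20),
#                                  datetime(2015, 11, 23, 1, 27),
#                                  2015):
#           print(hit)
```

Anything from the kernel, the network service, or the OOM killer inside that window would be worth a closer look. If the window comes back empty, that in itself suggests the server was healthy and the problem was somewhere on the network path.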

  • Well, the Postgres dump runs every two hours. If it were causing the downtime, the downtime would happen several times a week. I have 10 servers at DigitalOcean, and all of them reported ping failures several times on the 23rd.

    Can you please guide me on what to look for in the logs, and especially in which log file?

    For the routing issue, let me know what to share and I'll post it for more clarification.

  • @islammanjurul Sorry about the delay. DigitalOcean didn't send me my hourly updates.

    The best place to dig for logs is /var/log. I can't tell you specifically what caused it because it could really be anything. If the issue is resolved now, there shouldn't be anything to really worry about. But if you are really interested, check the kernel logs and system logs. If you have any resource monitoring (New Relic, NodeQuery, etc.), you could see whether usage spiked and left the server too busy to answer a ping.

    • I am really worried about one thing: several site outages are being reported in a single day, and ruxit reports both local and global outages from its different locations. One of them is New York, where my droplets are located. See the links below for screenshots:

      shot 1

      shot 2

      shot 3

      shot 4

      I have many other screenshots, but at the moments the downtime happened I couldn't capture traceroute or mtr output. I am having difficulty fixing this issue. I have another server hosted at OVH that did not show similar behavior.

    • @islammanjurul I am sure it was nothing and you have nothing to worry about. It could've been a temporary routing issue.
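Since the outages were too short to run traceroute or mtr by hand, one option is to capture the route automatically the moment a ping fails. The sketch below is only an illustration under stated assumptions: it expects the Linux `ping` and `mtr` commands to be installed, the function names `run_command` and `check_once` are invented here, and the address `192.0.2.1` in the usage comment is a placeholder rather than a real target.

```python
import subprocess
from datetime import datetime


def run_command(cmd):
    """Run a command and return (exit_code, combined stdout and stderr)."""
    done = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return done.returncode, done.stdout


def check_once(host, runner=run_command, now=datetime.now):
    """Ping host once; on failure return a timestamped mtr report, else None."""
    # -c 1: send a single probe, -W 2: wait at most 2 seconds for the reply.
    code, _ = runner(["ping", "-c", "1", "-W", "2", host])
    if code == 0:
        return None
    # --report prints a summary and exits instead of opening the live display.
    _, report = runner(["mtr", "--report", "--report-cycles", "5", host])
    return f"--- ping to {host} failed at {now().isoformat()} ---\n{report}"


# Hypothetical usage, e.g. from a cron job that runs every minute:
#
#   result = check_once("192.0.2.1")
#   if result is not None:
#       with open("/var/tmp/route-failures.log", "a") as out:
#           out.write(result + "\n")
```

Running it in both directions, from a droplet toward an outside host and from an outside machine toward the droplet, would help show which hop drops the packets. Output like that is also the sort of evidence a support team can usually act on.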
