By: spaxe

CoreOS shows failed units at login, what does it mean?

January 9, 2016

When I logged into my CoreOS machine recently, I noticed an accumulated list of "failed units":
$ ssh core@188.166.209.155
Last login: Fri Jan 8 14:36:34 2016 from 101.176.57.46
CoreOS stable (835.9.0)
Failed Units: 5
sshd@16493-188.166.209.155:22-222.186.15.79:2573.service
sshd@1756-188.166.209.155:22-151.25.198.11:59295.service
sshd@1766-188.166.209.155:22-151.25.198.11:55062.service
sshd@23027-188.166.209.155:22-111.74.239.61:2490.service
sshd@23028-188.166.209.155:22-111.74.239.61:2039.service

The banner gives no explanation of what they are. Can someone explain? They seem to be a list of broken SSH connections...?

3 Answers

First, investigate the failed units. These are frequently records of failed SSH connection attempts, or of dropped sessions from legitimate users.

Then clear them:

sudo systemctl reset-failed
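If you want to see who is behind the failed units before clearing them, a small shell function can summarize them by remote address. This is a minimal sketch: the name failed_sshd_remote_ips is made up for illustration, and it assumes IPv4 unit names of the form sshd@<n>-<server-IP>:<server-port>-<remote-IP>:<remote-port>.service, as seen in the login banner.

```bash
#!/usr/bin/env bash
# Count failed per-connection sshd units by remote IP (hypothetical helper).
# Reads systemctl-style lines on stdin and picks out tokens of the assumed form
#   sshd@<n>-<server-IP>:<server-port>-<remote-IP>:<remote-port>.service
# Names that do not match that IPv4 pattern pass through sed unchanged.
failed_sshd_remote_ips() {
  grep -o 'sshd@[^ ]*\.service' |
    sed -E 's/^sshd@[0-9]+-[^-]+-([^:]+):[0-9]+\.service$/\1/' |
    LC_ALL=C sort | uniq -c |
    awk '{print $1, $2}' |          # normalize to "<count> <ip>"
    LC_ALL=C sort -k1,1nr -k2,2     # most frequent first, then by address
}
```

On the server you could pipe systemctl --failed --no-legend into it, look at which addresses show up, and then run sudo systemctl reset-failed. Because grep -o extracts the unit name wherever it appears on a line, any status marker systemctl prints in front of it should not matter.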

This happened to me as well. Join me on my journey to find an answer below.

Phrasing the Question

This happens when some systemd unit fails on the base system. (This has nothing to do with Docker or Docker images.)

Presumably, the sshd server is failing, which may or may not be related to a dropped connection.

You can use these commands to see more information:

$ systemctl --failed
$ systemctl status sshd@16493-188.166.209.155:22-222.186.15.79:2573.service

For reference, I'm getting the following error:

Failed to run 'start' task: Transport endpoint is not connected

Finding the Answer

Here is my output from systemctl list-units, which finally helped me figure this out:

$ systemctl list-units | grep sshd
  sshd-keygen.service                                                           loaded active exited    Generate sshd host keys
  sshd@1296-104.236.58.186:22-137.22.169.192:43537.service                       loaded active running   OpenSSH per-connection server daemon (98.22.169.192:43537)
● sshd@532-104.236.58.186:22-277.186.58.136:1283.service                        loaded failed failed    OpenSSH per-connection server daemon (222.186.58.136:1283)
  system-sshd.slice                                                             loaded active active    system-sshd.slice
  sshd.socket                                                                   loaded active listening OpenSSH Server Socket

It looks like sshd.socket is where the system listens for new SSH connections, starting a separate sshd@ service for each one. The sshd@ lines show one failed and one running connection (in my case). Their names appear to take the form:

sshd@1234-<server-IP>:<server-port>-<remote-IP>:<remote-port>.service
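One way to check this naming scheme is to take a unit name apart. The bash sketch below assumes the IPv4 form shown above (IPv6 addresses are untested), and decode_sshd_unit is a hypothetical helper name, not something CoreOS ships.

```bash
#!/usr/bin/env bash
# Decode a per-connection sshd unit name (hypothetical helper), assuming
#   sshd@<n>-<server-IP>:<server-port>-<remote-IP>:<remote-port>.service
# Prints: <n> <server-IP> <server-port> <remote-IP> <remote-port>
decode_sshd_unit() {
  local rest=${1#sshd@}        # strip the template prefix
  rest=${rest%.service}        # strip the unit-type suffix
  local n=${rest%%-*}          # connection number, before the first '-'
  rest=${rest#*-}
  local server=${rest%%-*}     # "<server-IP>:<server-port>"
  local remote=${rest#*-}      # "<remote-IP>:<remote-port>"
  # Split each endpoint at its last ':' into address and port.
  printf '%s %s %s %s %s\n' "$n" \
    "${server%:*}" "${server##*:}" "${remote%:*}" "${remote##*:}"
}
```

For example, decode_sshd_unit sshd@16493-188.166.209.155:22-222.186.15.79:2573.service prints 16493 188.166.209.155 22 222.186.15.79 2573, which points to 222.186.15.79 as the client that opened that connection.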

This explains what happened: I tried logging in from work the other day and did not have the correct SSH key. So, for me at least, this represents a failed login. While that is a security crisis averted (and I'm grateful that CoreOS brought it to my attention), I haven't figured out how to clear it yet.

Postscript

Good luck with your new web service!

  • I figured out how to clear it, but it means downtime! From the droplet's command line, type:

    $ sudo shutdown -r now
    

    That's not ideal, but it does clear the failed units message.

    • Thanks @jpaugh. That seems really helpful. And I'm glad you tried out my server to prove that you did the right thing :)

      I contacted DO's customer support a while back, and one of the agents suspects the failed systemd/sshd units come from automated botnet attacks; in this case, all 5 IPs appear to be from China.

      Eventually the system rebooted (I managed to crash my own server at one point by overloading it with requests, oops) and the messages went away.

      Next time I'll know more!

      • Oh, wow! Good thing CoreOS is pretty locked down by default (it disables remote root login and remote password login).

        I had a friend who thought root password login was okay. He found out that he had been hacked when a major website sent him a cease-and-desist letter: apparently, a botnet had already infiltrated his server and was hacking others.

It means somebody (not necessarily you; perhaps a bad guy) tried to log in via SSH and failed.

To protect the host from compromise or resource exhaustion, you may want to change the SSH port and disable password authentication in sshd_config. Many script kiddies stop trying when they find that password authentication is not available.
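On CoreOS, sshd is socket-activated (note the sshd.socket unit in the list-units output in the previous answer), so the listening port comes from sshd.socket rather than from a Port line in sshd_config. One way to change it is a systemd drop-in that clears and replaces ListenStream. The helper below is a sketch that only prints such a drop-in; its name and the example port 2222 are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Print a systemd drop-in that moves socket-activated sshd to another port
# (hypothetical helper). The empty "ListenStream=" clears the value(s)
# inherited from sshd.socket before the new port is added.
sshd_port_dropin() {
  local port=$1
  # Accept only 1-65535, written without leading zeros.
  if ! [[ $port =~ ^[1-9][0-9]{0,4}$ ]] || (( port > 65535 )); then
    echo "invalid port: $port" >&2
    return 1
  fi
  printf '[Socket]\nListenStream=\nListenStream=%s\n' "$port"
}
```

A possible way to apply it: sudo mkdir -p /etc/systemd/system/sshd.socket.d, then sshd_port_dropin 2222 | sudo tee /etc/systemd/system/sshd.socket.d/10-port.conf (the file name is arbitrary), followed by sudo systemctl daemon-reload and sudo systemctl restart sshd.socket. Keep your current session open until you have confirmed you can connect on the new port.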
