Post Mortem 2021-06-06

Incident summary

Due to a hardware defect on a-hel-fi we got a service degredation in various systems

Leadup

The server a-hel-fi had some strange behavior for about a week.

Fault

The datacenter reported faulty hardware.

Impact

  • build.voidlinux.org was down
  • docs.voidlinux.org was down
  • alpha.de.repo.voidlinux.org was down
  • man.voidlinux.org was down
  • package search on voidlinux.org was unavailable

Detection

The issue was reported in IRC 4 minutes after monitoring and automation detected the fault. No automatic alerts were raised.

Response

  • hardware reset from hetzners robot webinterface
  • hardware reset to rescue system from hetzners robot webinterface
  • ticket was opened

Recovery

The datacenter moved the hdds to new hardware.

What went well

Communication with the datacenter was good. From the initial report to the fix took only an hour and most of the delay was caused at our side.

The handling of the incident was good and the response time was fast. We also shared the state of the incident via twitter and reddit, which helped to let users show understanding for the downtime.

What could be done better

It was just luck, that Gottox was available. He was the only one that was able to interact with the webinterface.

Lessons learned

  • putting to many services on one host isn't the best idea
  • the access to the webinterface of Hetzner should be accessible for more people

Timeline

Timestamps are GMT+00

  • 2021-06-06 09:30: The machine stopped replying to heartbeats.
  • 2021-06-06 09:54: Issue was reported on IRC by maldridge
  • 2021-06-06 09:58: Hardware reset was issued by Gottox from the robot webinterface.
  • 2021-06-06 10:13: Hardware reset to rescue system was issued by Gottox from the robot webinterface.
  • 2021-06-06 10:23: maldridge was provided with access to the robot webinterface for that specific server
  • 2021-06-06 10:26: Ticket was opened at the datacenter.
  • 2021-06-06 10:31: A remote power button press was initiated as the webinterface reported 'power off' by maldridge
  • 2021-06-06 10:58: A hardware reset was initiated by the Hetzner support
  • 2021-06-06 11:14: Reporting back to the support, that the server is still no reachable by Gottox
  • 2021-06-06 11:24: Server started pinging again
  • 2021-06-06 11:25: Hetzner support reported back, that the server was hanging in post and that they replaced the hardware.
  • 2021-06-06 11:25: restart of the following services as reported by maldridge, done by Gottox: wireguard, unbound, consul, nomad
  • 2021-06-06 12:08: restart of nginx as firefox reported cert issues, done by Gottox