Post Mortem 2021-06-06

Incident summary

Due to a hardware defect on a-hel-fi we got a service degredation in various systems


The server a-hel-fi had some strange behavior for about a week.


The datacenter reported faulty hardware.


  • was down
  • was down
  • was down
  • was down
  • package search on was unavailable


The issue was reported in IRC 4 minutes after monitoring and automation detected the fault. No automatic alerts were raised.


  • hardware reset from hetzners robot webinterface
  • hardware reset to rescue system from hetzners robot webinterface
  • ticket was opened


The datacenter moved the hdds to new hardware.

What went well

Communication with the datacenter was good. From the initial report to the fix took only an hour and most of the delay was caused at our side.

The handling of the incident was good and the response time was fast. We also shared the state of the incident via twitter and reddit, which helped to let users show understanding for the downtime.

What could be done better

It was just luck, that Gottox was available. He was the only one that was able to interact with the webinterface.

Lessons learned

  • putting to many services on one host isn't the best idea
  • the access to the webinterface of Hetzner should be accessible for more people


Timestamps are GMT+00

  • 2021-06-06 09:30: The machine stopped replying to heartbeats.
  • 2021-06-06 09:54: Issue was reported on IRC by maldridge
  • 2021-06-06 09:58: Hardware reset was issued by Gottox from the robot webinterface.
  • 2021-06-06 10:13: Hardware reset to rescue system was issued by Gottox from the robot webinterface.
  • 2021-06-06 10:23: maldridge was provided with access to the robot webinterface for that specific server
  • 2021-06-06 10:26: Ticket was opened at the datacenter.
  • 2021-06-06 10:31: A remote power button press was initiated as the webinterface reported 'power off' by maldridge
  • 2021-06-06 10:58: A hardware reset was initiated by the Hetzner support
  • 2021-06-06 11:14: Reporting back to the support, that the server is still no reachable by Gottox
  • 2021-06-06 11:24: Server started pinging again
  • 2021-06-06 11:25: Hetzner support reported back, that the server was hanging in post and that they replaced the hardware.
  • 2021-06-06 11:25: restart of the following services as reported by maldridge, done by Gottox: wireguard, unbound, consul, nomad
  • 2021-06-06 12:08: restart of nginx as firefox reported cert issues, done by Gottox