Post Mortem 2021-06-06

Incident summary

Due to a hardware defect on a-hel-fi we got a service degredation in various systems

Leadup

The server a-hel-fi had some strange behavior for about a week.

Fault

The datacenter reported faulty hardware.

Impact

build.voidlinux.org was down
docs.voidlinux.org was down
alpha.de.repo.voidlinux.org was down
man.voidlinux.org was down
package search on voidlinux.org was unavailable

Detection

The issue was reported in IRC 4 minutes after monitoring and automation detected the fault. No automatic alerts were raised.

Response

hardware reset from hetzners robot webinterface
hardware reset to rescue system from hetzners robot webinterface
ticket was opened

Recovery

The datacenter moved the hdds to new hardware.

What went well

Communication with the datacenter was good. From the initial report to the fix took only an hour and most of the delay was caused at our side.

The handling of the incident was good and the response time was fast. We also shared the state of the incident via twitter and reddit, which helped to let users show understanding for the downtime.

What could be done better

It was just luck, that Gottox was available. He was the only one that was able to interact with the webinterface.

Lessons learned

putting to many services on one host isn't the best idea
the access to the webinterface of Hetzner should be accessible for more people

Timeline

Timestamps are GMT+00

2021-06-06 09:30: The machine stopped replying to heartbeats.
2021-06-06 09:54: Issue was reported on IRC by maldridge
2021-06-06 09:58: Hardware reset was issued by Gottox from the robot webinterface.
2021-06-06 10:13: Hardware reset to rescue system was issued by Gottox from the robot webinterface.
2021-06-06 10:23: maldridge was provided with access to the robot webinterface for that specific server
2021-06-06 10:26: Ticket was opened at the datacenter.
2021-06-06 10:31: A remote power button press was initiated as the webinterface reported 'power off' by maldridge
2021-06-06 10:58: A hardware reset was initiated by the Hetzner support
2021-06-06 11:14: Reporting back to the support, that the server is still no reachable by Gottox
2021-06-06 11:24: Server started pinging again
2021-06-06 11:25: Hetzner support reported back, that the server was hanging in post and that they replaced the hardware.
2021-06-06 11:25: restart of the following services as reported by maldridge, done by Gottox: wireguard, unbound, consul, nomad
2021-06-06 12:08: restart of nginx as firefox reported cert issues, done by Gottox

InfraDocs