Post Mortem 2021-06-06

Incident summary

For approximately 2 days the Nomad managed tasks on a-fsn-de were unavailable, including repository management tasks. Resolution was and continues to be hampered by a distinct and ongoing outage.

Ammended 2021-07-31

CAS service was restored on 2021-07-22, any point below that mentions and ongoing outage was written while the incident was ongoing.

At this time we do not know what caused the reboot of a-fsn-de, however we do know that a parallel and ongoing incident was occurring at the same time. Void uses a centralized authentication service (CAS) to manage access to our machines, and like many secure services this one relies on TLS certificates. This certificate expired without being noticed, which prevented what would have otherwise been a quick recovery of logging in and bouncing some services.

Additionally, when the CAS is unavailable, we maintain a break-glass login capability for a handful of extremely trusted maintainers (1:1 with the people that have access to the package signing key). This access was discovered to be impaired by one developer's key missing entirely, and one developer having failed to rotate their key. The third developer was on vacation but was able to log in and rectify the keying problem.

Fault

The Nomad outage was caused by an unexpected restart of a-fsn-de. When Nomad hosts reboot there is a known defect in that runit may bring up the services in a race that results in Nomad not being usable until services are restarted in a specific order.

The unavailability of the CAS system means that we still cannot log in to all hosts normally.

The issues with the break-glass keys caused the recovery of both the specific Nomad host and the CAS server to be slow.

Impact

Publicly Visible:

Repository signing unavailable
Builds not completing normally
musl and aarch64 appeared to lag behind glibc

Internally Visible:

CAS logins not available
Detailed failure logs not available in Grafana (requires CAS login)
Couldn't make control requests to Nomad (requires Vault CAS integration)

Excellent internal communication kept everyone in the loop as to what was broken, what was being done to fix it, and who was responsible for taking action.
Break-glass access, once used, did work effectively.

What could be done better

External communication was not great. A post went up on voidlinux.org, but no twitter notification was made, and the post was not widely shared on our other channels.
Break glass connectivity existed, but did not work.
Initially recalling critical team members from vacation was an ad-hoc process.

Lessons learned

Having the capability isn't enough. Break glass needs to be regularly tested to be effective.
For foundational infrastructure that has very infrequent updates, such as long lived TLS certificates, we should ensure multiple people are aware of the expiry date, and make use of multiple calendars to ensure critical life-cycle events are not missed.

Timeline

No timeline is provided for this incident.

InfraDocs

Post Mortem 2021-06-06

Incident summary

Ammended 2021-07-31

Leadup

Fault

Impact

Detection

Response

Recovery

What went well

What could be done better

Lessons learned

Timeline