For approximately 2 days the Nomad managed tasks on a-fsn-de were unavailable, including repository management tasks. Resolution was and continues to be hampered by a distinct and ongoing outage.
CAS service was restored on 2021-07-22, any point below that mentions and ongoing outage was written while the incident was ongoing.
At this time we do not know what caused the reboot of a-fsn-de, however we do know that a parallel and ongoing incident was occurring at the same time. Void uses a centralized authentication service (CAS) to manage access to our machines, and like many secure services this one relies on TLS certificates. This certificate expired without being noticed, which prevented what would have otherwise been a quick recovery of logging in and bouncing some services.
Additionally, when the CAS is unavailable, we maintain a break-glass login capability for a handful of extremely trusted maintainers (1:1 with the people that have access to the package signing key). This access was discovered to be impaired by one developer's key missing entirely, and one developer having failed to rotate their key. The third developer was on vacation but was able to log in and rectify the keying problem.
The Nomad outage was caused by an unexpected restart of a-fsn-de. When Nomad hosts reboot there is a known defect in that runit may bring up the services in a race that results in Nomad not being usable until services are restarted in a specific order.
The unavailability of the CAS system means that we still cannot log in to all hosts normally.
The issues with the break-glass keys caused the recovery of both the specific Nomad host and the CAS server to be slow.
- Repository signing unavailable
- Builds not completing normally
- musl and aarch64 appeared to lag behind glibc
- CAS logins not available
- Detailed failure logs not available in Grafana (requires CAS login)
- Couldn't make control requests to Nomad (requires Vault CAS integration)
The Nomad failure was detected by external observation that an update
less package was not signed, and was preventing installs
The CAS failure was discovered a few days before the signing issue but was deemed non-critical as it was an inconvenience that could be fixed within a week.
The ability to merge changes in GitHub was restricted to prevent new builds from running that might further complicate recovery efforts.
@the-maldridge and @Gottox were recalled from vacation to recover access to the system.
@Gottox used break-glass access to both restore break-glass access for @the-maldridge and @duncaen, and to restart the stuck signing process.
Excellent internal communication kept everyone in the loop as to what was broken, what was being done to fix it, and who was responsible for taking action.
Break-glass access, once used, did work effectively.
External communication was not great. A post went up on voidlinux.org, but no twitter notification was made, and the post was not widely shared on our other channels.
Break glass connectivity existed, but did not work.
Initially recalling critical team members from vacation was an ad-hoc process.
Having the capability isn't enough. Break glass needs to be regularly tested to be effective.
For foundational infrastructure that has very infrequent updates, such as long lived TLS certificates, we should ensure multiple people are aware of the expiry date, and make use of multiple calendars to ensure critical life-cycle events are not missed.
No timeline is provided for this incident.