Introduction

Welcome to InfraDocs. This is the manual for Void systems management. The manual is organized by the systems that manage each part of the infrastructure. Knowledge of Void or basic Linux administration is assumed.

Terraform

Not all infrastructure owned by the Void project is hosted on our infrastructure or integrated into our systems. For some infrastructure we need to mirror data out to third-party systems. This is done with HashiCorp Terraform.

Terraform files end in .tf and live in the terraform subdirectory of the infrastructure repo. There is currently no automation that pushes Terraform state to remote systems.

Important!

It is VERY IMPORTANT that only one Terraform push be in progress at a time. We use a central state and lock server to ensure this, but occasionally there are changes that have been pushed but not yet merged. Always ensure that the diff Terraform offers is what you expected it to be.

Setting Up

Terraform is configured to use remote state. One-time configuration is required to access this state:

Ensure that your NetAuth user is a member of the appropriate NetAuth group for the project you want to act on. Presently, all projects are in the prod namespace, and membership in the netauth/terrastate-prod group is required. Without access to this group you will not be able to access the Terraform state.

Export the following variables to authenticate your access to the remote state storage. These are your NetAuth credentials:

export TF_HTTP_USERNAME=<entity-id>
export TF_HTTP_PASSWORD=<entity-pw>

Change to the Terraform project directory and run the following command:

$ terraform init

Obtaining Control Authority

Having access to state isn't sufficient. Depending on which projects you wish to manage, you may need any of the following additional credentials:

  • GitHub Personal Access Token (PAT) exported as GITHUB_TOKEN
  • Fastly API Token exported as FASTLY_API_KEY
  • DigitalOcean API Token exported as DIGITALOCEAN_API_TOKEN
  • Vault Token at either ~/.vault-token or VAULT_TOKEN
  • Nomad Token exported as NOMAD_TOKEN
  • Consul Token exported as CONSUL_HTTP_TOKEN

These variables and keys are in addition to the state access, which must be initialized individually per project.
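
As an illustrative example, preparing a shell to manage the GitHub project might look like the following; the project subdirectory name is an assumption here, check the terraform/ tree for the actual layout:

$ export TF_HTTP_USERNAME=<entity-id>
$ export TF_HTTP_PASSWORD=<entity-pw>
$ export GITHUB_TOKEN=<your-pat>      # control credential for the GitHub project
$ cd terraform/github                 # hypothetical project path
$ terraform init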

GitHub

GitHub only provides an interface to sync data from LDAP, and even then only in the enterprise version. Since Void is an open source project and doesn't use this option, we don't sync data. The organization at github.com/void-linux has very little state, primarily users and groups.

Groups

There are currently three groups that gate access into GitHub resources:

pkg-committers

Members of this group have broad commit access and can generally push to any Void-owned repo. The primary reason to grant access to this group is the ability to push package templates. Access to this group should be assumed to include the ability to trigger builds that will eventually be signed for inclusion in the main repo.

void-ops

Membership in this group is highly restricted and should generally not be authorized without sign-off from an infrastructure lead or maldridge@. This group gates access to the infrastructure repo itself; access is restricted to prevent accidental breakage from pushing a change that is later overwritten by automation that performs change detection against the state of the repo.

doc-writers

Members of this group have access to push changes to the void-docs repository, which holds all content that appears in our handbook.

Adding and Removing Members

Adding and removing members takes place in github_members.tf. This file contains a stanza for every user and every group they are in. To change the membership of a group, add or remove a stanza, then apply the state transformation to GitHub.

This file is manually formatted; take care to maintain lexical sort order and indentation. For example, if a new committer with username voidfu were to be added, a stanza like the following would be added to the file:

resource "github_team_membership" "pkg-committers_voidfu" {
  team_id = "${github_team.pkg-committers.id}"
  role = "maintainer"
  username = "voidfu"
}

The name in the resource label should always be lower case. The name in the username field should exactly match the username shown on the user's profile page.

Pushing state changes

Pushing a state change can only be done by organization owners. To request a push of Terraform state, request action from one of:

* the-maldridge
* gottox
* duncaen

It is very important that only one push be in progress at a time. To this end, anyone making a push should endeavor to confirm that no other changes are in motion, whether manual or Terraform-driven.

Authenticating to GitHub for Push

GitHub requires authentication to authorize the push, in the form of a personal access token. The token must have sufficient permissions to add and remove people from the organization, add and remove repositories, and add and remove groups. The token should be stored in the environment variable GITHUB_TOKEN.

Pushing the Changes

Pushing the changes is done in two phases. The first is a planning phase, in which you call terraform as shown:

$ terraform plan

Verify that the output is sane; it provides a diff of every action Terraform wants to take. It should be easy to understand what is going to happen, because you shouldn't push large changes; prefer a succession of small, incremental changes.

When you are satisfied with the planned actions, apply them:

$ terraform apply

You'll be asked to confirm the application of state. If you're satisfied, apply it. Terraform is not like Ansible: be careful not to remove people from the organization or clear permissions that you can't restore without assistance.
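
If you want extra assurance that what you apply is exactly the plan you reviewed, Terraform supports saving the plan to a file and applying that file verbatim; this is standard Terraform behavior rather than anything specific to Void's setup:

$ terraform plan -out=push.tfplan
$ terraform apply push.tfplan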

DigitalOcean

DigitalOcean generously sponsors significant infrastructure components, all of which are managed with Terraform.

To apply changes to the account you need API keys for both the regular API services and the Spaces API. The API is not meaningfully namespaced, so only organization owners have access at this time.

Fastly

Fastly sponsors a CDN which fronts the main package collection and other artifacts we serve from the mirrors.

The distribution is configured to answer to repo-fastly.voidlinux.org and uses a certificate from Let's Encrypt. The origin data is served from all mirrors by requesting the repo-fastly name from nginx, which is the same virtual host as the other tier-one mirrors.

Ansible

Ansible is the primary configuration system for Void's hardware. Ansible is a standard technology currently owned by Red Hat, and complete information can be found at the Ansible website.

Operation of Ansible is beyond the scope of this manual, but a short introduction is nonetheless provided.

Installation

To apply the Ansible playbooks you need to install Ansible. To do this, make sure you have Python and virtualenv available. Void's configuration expects Python from the 3.x branch on the control host. Any version of Python is acceptable on the target machines, but for consistency the system Python is the 2.7 series.

The first step is to install and check Ansible. Within the ansible/ directory, run the following commands. The virtual environment must live in a subdirectory called venv, as this path is referenced by ansible.cfg.

$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Once installed, verify that ansible is available on your PATH:

$ ansible --version

The version should match exactly:

$ ansible --version
ansible 2.5.5

Secrets

Almost all of Void's configuration data is public. Things that are not public are referred to as "secrets". This includes information such as the buildbot login file, signing keys, and various tokens that authenticate services. This data lives in ansible/secret, and you must obtain a copy of this directory before trying to push to any machine.

Files in the secret directory should be plain text for string secrets, or the native file format of the secret in question. Secret names should be of the form ROLENAME_SECRET; for example, a token signing key for the netauth service should be named netauth_token.key in the secret directory, and its file format should be PEM-encoded key data.
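
A sketch of what this directory might contain; apart from netauth_token.key, the entries shown here are hypothetical examples of the naming convention:

ansible/secret/
├── buildbot_users       # buildbot login file, in its native format
├── fastly_token         # plain-text string secret
└── netauth_token.key    # PEM-encoded token signing key for the netauth role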

Storing Secrets

Secrets should be encrypted when at rest. It is advised to store secrets in an EncFS directory which is mounted into the appropriate location as needed. Any encrypted system is acceptable here assuming that it provides a normal filesystem view and supports strong cryptography.
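
For example, an EncFS workflow might look like the following; the directory names are illustrative:

$ encfs ~/.secrets.encrypted ~/void-infrastructure/ansible/secret   # mount before a push
$ fusermount -u ~/void-infrastructure/ansible/secret                # unmount afterwards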

Obtaining Secrets

Secrets should be held by as few people as possible, but no fewer than 2 at any given time. Secrets should only be exchanged after positive confirmation of identity. Transfer should then be done via secure means, such as a copy made to a Void host and then made visible via Unix file permissions.

Deploying a Playbook

Deploying a playbook to the Ansible managed infrastructure is done with the ansible-playbook command. An example invocation to update the buildmaster is shown below:

$ ansible-playbook -DK build.yml --limit vm1.a-lej-de.m.voidlinux.org

Breaking down the above command line:

  • -D: Provide a diff of the changes that are made.
  • -K: Prompt for the sudo password.
  • build.yml: The playbook that we want to run.
  • --limit: Restrict this playbook run to the following host(s).
  • vm1.a-lej-de.m.voidlinux.org: The hostname of the specific server that runs the buildmaster.

Here's what a full run of this command looks like:

$ ansible-playbook -DK build.yml --limit vm1.a-lej-de.m.voidlinux.org
SUDO password: 

PLAY [buildmaster] ***************************************************************************

TASK [Gathering Facts] ***********************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Configure hosts] *************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Configure hostname] **********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Install iptables] ************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Install iptables-reload command] *********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Configure dhcpcd] ************************************************************
--- before: /etc/dhcpcd.conf
+++ after: /home/maldridge/.ansible/tmp/ansible-local-3289rle0c9_z/tmp_37j0_hr/dhcpcd.conf.j2
@@ -10,7 +10,7 @@
 
 	noipv6
 interface eth1
-	nopipv4
+	noipv4
 
 	static ip6_address=2a01:4f8:212:34cc::01d:b/64
 	static domain_name_servers=2a01:4f8:0:a0a1::add:1010

changed: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Enable dhcpcd] ***************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Enable wpa_supplicant hook] **************************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Add dhcpcd iptables hook] ****************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Enable dhcpcd iptables hook] *************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Make iptables.d] *************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Configure base rules for IPv4 firewall] **************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Make ip6tables.d] ************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [network : Configure base rules for IPv6 firewall] **************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Install acmetool] ***********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Create acmetool data root] **************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Create acmetool directories] ************************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=accounts)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=certs)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=conf)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=desired)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=keys)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=live)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=tmp)

TASK [acmetool : Install acmetool responses file] ********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Check for quickstart flag] **************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Run quickstart] *************************************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Install acmetool configuration] *********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Configure wanted certificates] **********************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=build.voidlinux.eu)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=sources.voidlinux.eu)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=repo.voidlinux.eu)

TASK [acmetool : Ensure cron.d exists] *******************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Install renewal crontab] ****************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [acmetool : Install acmetool firewall rules] ********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install nginx] *****************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Configure nginx] ***************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install dhparam.pem] ***********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Create the webroot] ************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Create sites-available] ********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Create sites-enabled] **********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Enable nginx] ******************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Configure nginx firewall rules] ************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Configure nginx firewall rules] ************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [crond : Install cronie] ****************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [crond : Enable cronie] *****************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Create the void-repo group] **********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install the buildmaster firewall rules] **********************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install the buildmaster firewall rules (v6)] *****************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install virtualenv & deps] ***********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Create the BuildBot Master user] *****************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Create the BuildMaster Root Directory] ***********************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install Buildbot] ********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Make Buildbot More Terse] ************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Create BuildMaster Subdirectories] ***************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=scripts)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=public_html)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=templates)

TASK [buildmaster : Copy un-inheritable Buildbot Assets] *************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=bg_gradient.jpg)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=default.css)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=favicon.ico)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=robots.txt)

TASK [buildmaster : Copy Buildbot Bootstrap Database] ****************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install GitHub Webhook Password] *****************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Configure BuildMaster] ***************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install Static Scripts] **************************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=__init__.py)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=ShellCommandChangeList.py)

TASK [buildmaster : Install Buildbot Master Configuration] ***********************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : include_vars] ************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : include_vars] ************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Configure BuildSlave References] *****************************************
--- before: //home/void-buildmaster//buildmaster/scripts/user_settings.py
+++ after: /home/maldridge/.ansible/tmp/ansible-local-3289rle0c9_z/tmpuaj4n11l/user_settings.py.j2
@@ -9,7 +9,7 @@
         'BootstrapArgs': '-N',
         'slave_name': 'x86_64_void',
         'slave_pass': 'REDACTED',
-        'admin': 'xtraeme'
+        'admin': 'gottox'
     },
     {
         'name': 'i686-primary',
@@ -21,7 +21,7 @@
         'BootstrapArgs': '-N',
         'slave_name': 'i686_void',
         'slave_pass': 'REDACTED',
-        'admin': 'xtraeme'
+        'admin': 'gottox'
     },
     {
         'name': 'armv6l-primary',
@@ -33,7 +33,7 @@
         'BootstrapArgs': '-N',
         'slave_name': 'cross-rpi_void',
         'slave_pass': 'REDACTED',
-        'admin': 'xtraeme'
+        'admin': 'gottox'
     },
     {
         'name': 'armv7l-primary',
@@ -45,7 +45,7 @@
         'BootstrapArgs': '-N',
         'slave_name': 'cross-armv7l_void',
         'slave_pass': 'REDACTED',
-        'admin': 'xtraeme'
+        'admin': 'gottox'
     },
     {
         'name': 'x86_64-musl-primary',

changed: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install BuildBot Service (1/2)] ******************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install BuildBot Service (2/2)] ******************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Enable BuildBot Service] *************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Configure webserver] *****************************************************

TASK [nginx : Create folder for external nginx locations] ************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install site descriptor] *******************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Enable site] *******************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install firewall rules for resolvers] ******************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install firewall v6 rules for resolvers] ***************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install root location block] *********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Create the Signing User] *************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Create .ssh for void-repomaster] *****************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install Signing Key] *****************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Create bin/ directory for void-repomaster] *******************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install Signing and Repo-Management Scripts] *****************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=xbps-sign-repos)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=xbps-clean-repos)

TASK [buildmaster : Install Signing Cronjob] *************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install rsync] ***********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildmaster : Install Sync Keys] *******************************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=None)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=None)
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [root-mirror-shim : Create Repo Directory] **********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [root-mirror-shim : Create xlocate group] ***********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [root-mirror-shim : Create Static Mirror Directories] ***********************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=distfiles)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=live)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=logos)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=static)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=current)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=xlocate)

TASK [root-mirror-shim : Mount the package filesystem into the mirror] ***********************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [void-updates : Install void-updates] ***************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [void-updates : Create the voidupdates user] ********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [void-updates : Install Update Check Cron Job] ******************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [void-updates : Link Results] ***********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [mirror-base : Create the reposync group] ***********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [mirror-base : Create the reposync user] ************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [rsyncd : Install rsync] ****************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [rsyncd : Install rsync firewall rules] *************************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=iptables.d)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=ip6tables.d)

TASK [rsyncd : Create rsyncd.conf.d] *********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [rsyncd : Template rsyncd.conf] *********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [rsyncd : Enable rsyncd] ****************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Install Prerequisites] ***************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Create the mirror dataroot directory] ************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Configure firewall rules] ************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Configure webserver] *****************************************************

TASK [nginx : Create folder for external nginx locations] ************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install site descriptor] *******************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Enable site] *******************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install firewall rules for resolvers] ******************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install firewall v6 rules for resolvers] ***************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Include rsyncd user secrets] *********************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Configure rsyncd] ********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Configure rsyncd secrets] ************************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Install sync service secret] *********************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Install mirror sync service (1/4)] ***************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Install mirror sync service (2/4)] ***************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Install mirror sync service (3/4)] ***************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [live-mirror : Install mirror sync service (4/4)] ***************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [sources_site : Create sources link] ****************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [sources_site : Configure webserver] ****************************************************

TASK [nginx : Create folder for external nginx locations] ************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install site descriptor] *******************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Enable site] *******************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install firewall rules for resolvers] ******************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [nginx : Install firewall v6 rules for resolvers] ***************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

RUNNING HANDLER [network : dhcpcd] ***********************************************************
changed: [vm1.a-lej-de.m.voidlinux.org]

PLAY [buildslave] ****************************************************************************

TASK [Gathering Facts] ***********************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Install BuildBot Slave and Dependencies] **********************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Create Buildslave user (void-buildslave)] *********************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Create Buildsync user (void-buildsync)] ***********************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Create void-buildsync .ssh] ***********************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Install sync key] *********************************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Create Builder Directories] ***********************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=x86_64)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=i686)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv6l)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv7l)

TASK [buildslave : Enforce permissions on hostdir] *******************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=DE-1)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=DE-1)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=DE-1)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=DE-1)

TASK [buildslave : include_vars] *************************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Configure buildbot-slave] *************************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=x86_64)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=i686)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv6l)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv7l)

TASK [buildslave : Create buildbot-slave info directories] ***********************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=x86_64)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=i686)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv6l)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv7l)

TASK [buildslave : Configure buildbot host description] **************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=x86_64)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=i686)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv6l)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv7l)

TASK [buildslave : Configure buildbot admin description] *************************************
--- before: //home/void-buildslave//void-builder-x86_64/info/admin
+++ after: /home/maldridge/.ansible/tmp/ansible-local-3289rle0c9_z/tmpq5hidygh/admin.j2
@@ -1 +1 @@
-Juan RP <xtraeme@voidlinux.eu>
+Enno Boland <gottox@voidlinux.eu>

changed: [vm1.a-lej-de.m.voidlinux.org] => (item=x86_64)
--- before: //home/void-buildslave//void-builder-i686/info/admin
+++ after: /home/maldridge/.ansible/tmp/ansible-local-3289rle0c9_z/tmpny2jz3zs/admin.j2
@@ -1 +1 @@
-Juan RP <xtraeme@voidlinux.eu>
+Enno Boland <gottox@voidlinux.eu>

changed: [vm1.a-lej-de.m.voidlinux.org] => (item=i686)
--- before: //home/void-buildslave//void-builder-armv6l/info/admin
+++ after: /home/maldridge/.ansible/tmp/ansible-local-3289rle0c9_z/tmpb6hfxdu2/admin.j2
@@ -1 +1 @@
-Juan RP <xtraeme@voidlinux.eu>
+Enno Boland <gottox@voidlinux.eu>

changed: [vm1.a-lej-de.m.voidlinux.org] => (item=armv6l)
--- before: //home/void-buildslave//void-builder-armv7l/info/admin
+++ after: /home/maldridge/.ansible/tmp/ansible-local-3289rle0c9_z/tmpaw5ppogt/admin.j2
@@ -1 +1 @@
-Juan RP <xtraeme@voidlinux.eu>
+Enno Boland <gottox@voidlinux.eu>

changed: [vm1.a-lej-de.m.voidlinux.org] => (item=armv7l)

TASK [buildslave : Configure xbps-src] *******************************************************
ok: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Configure local build mirror] *********************************************
skipping: [vm1.a-lej-de.m.voidlinux.org]

TASK [buildslave : Create Service Directories] ***********************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=x86_64)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=i686)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv6l)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv7l)

TASK [buildslave : Configure Runit] **********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=x86_64)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=i686)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv6l)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv7l)

TASK [buildslave : Enable BuildSlave] ********************************************************
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=x86_64)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=i686)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv6l)
ok: [vm1.a-lej-de.m.voidlinux.org] => (item=armv7l)

PLAY RECAP ***********************************************************************************
vm1.a-lej-de.m.voidlinux.org : ok=115  changed=4    unreachable=0    failed=0

The end of the play will always have a "Play Recap" showing which hosts finished in which state. Always check it, and re-apply as necessary for any hosts in a failed state.
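
When in doubt about a playbook's effect, Ansible's built-in check mode can be combined with the diff flag to preview changes without applying them; this is stock Ansible behavior:

$ ansible-playbook -DKC build.yml --limit vm1.a-lej-de.m.voidlinux.org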

Applying Ansible playbooks requires unrestricted root or access to a service user on each node. Members of netauth/dante can run playbooks manually.

Certificate Authority

Void operates a private certificate authority based on Cloudflare's cfssl tool. The configuration data for this CA lives in the CA/ directory of the infrastructure repo.

The certificates can be generated using the bin/gencerts.sh script. They should be copied to the appropriate location in the ansible/secret directory after being generated. Once copied, use bin/shred.sh in the CA/ directory to clean up.
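
A typical session might look like the following sketch; the exact output paths of gencerts.sh are not documented here, so the copy step is illustrative:

$ cd CA/
$ ./bin/gencerts.sh                       # generate the certificates
$ cp <generated-cert> ../ansible/secret/  # copy to the appropriate secret name
$ ./bin/shred.sh                          # clean up generated material in CA/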

LetsEncrypt vs Void CA

When the option exists to obtain a certificate dynamically from Let's Encrypt, it should be used. Additionally, any certificate that will be visible to an end user must have a valid trust root. Since Void's CA isn't automatically trusted by anything, user-facing certificates MUST be issued by an external CA.

Void's CA should be used for infrastructure needs that require certificates for authentication, or long-lived certificates for channel integrity. Void Operations should be consulted before adding any new certificate configurations or adjusting the CA configuration.

Network Architecture

Void's global network is based on a WireGuard mesh between all servers. The network does not support ECMP. The mesh is statically computed by Ansible and is installed across the fleet when machines are added or removed.

The primary use case of the mesh is to let us run services that expect to be inside a single broadcast domain without expending significant effort on machine-to-machine connectivity. Most services should therefore be accessed via HTTP proxies rather than by connecting to the mesh directly.
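
On any mesh member, the state of the mesh can be inspected with the standard WireGuard tooling, assuming root access on the host:

$ wg show    # list interfaces, peers, and last handshake times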

HashiStack

Void's fleet size and appetite for services make it impractical to ask every contributor to understand how the Ansible playbooks are laid out and how the fleet is architected. It is also brittle, from a services perspective, to host important things on only one machine with no fallback capability. To solve these problems Void uses the HashiCorp stack to schedule dynamic workloads across the fleet, and to decouple service updates from package updates, leading to a more stable infrastructure ecosystem.

More information about the individual services can be found on their specific pages. The remainder of this document speaks in general terms about the architecture of Void's cluster.

The Global Namespace

Void runs a single large global namespace, rather than segmenting our fleet into "datacenters". This is largely a result of the small number of machines we have in each region, but also reflects that Void's fleet is viewed as a single large pool of computing power, rather than segmented computing power.

Presently, the control plane is located in the US, and the larger build machines are located in Hetzner datacenters in the EU. Small service machines are colocated in the US with the control plane. Where possible traffic egresses at the nearest edge to the source data to avoid our traffic needing to transit multiple regions.

The Control Plane

The control plane is composed of the Consul, Nomad, and Vault servers which are colocated on a set of machines hosted in the DigitalOcean SFO3 region. These machines are the point from which the fleet is commanded, and they operate as a highly available trio to ensure fault tolerance.

Consul

Void uses Consul as our service discovery layer. As most of our services are not actually clustered or service discovery aware, this primarily means that we use Consul's DNS mechanism to provide internal DNS that is topology aware.

Our Consul system is configured with default-deny ACLs, so any application that needs to make use of Consul must have a corresponding ACL that grants it access.

More information can be found on the Consul Website.

Vault

Vault serves as a dedicated storage point for secret values such as the signing keys for repos, authentication tokens to 3rd party services, and as an authentication nexus between short lived API tokens and long lived NetAuth credentials.

To log into Vault you will need a NetAuth account; you may then log in using the following commands:

export VAULT_ADDR=https://vault.voidlinux.org
vault login -method=ldap username=<you>

This will prompt you for your password and request a vault token valid for a maximum of 12 hours. You don't need to do anything special to use the token; all Vault-aware software will pick up your authority natively. After the 12 hours have expired you will need to log in again to refresh your session.
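
You can check how much time remains on your current token with Vault's standard token inspection command:

$ vault token lookup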

If you wish to explicitly revoke your token, for example when logging off for the day, you may do so with vault token revoke -self, which requests immediate revocation.

Nomad

Nomad is a cluster-level job scheduler. Nomad lets us treat our various machines as one large pool of compute, so that as we add and remove machines we don't have to edit as many task definitions to account for the change in fleet resources.

Nomad also allows us to carve up machines so that different service groups can be managed by different people, such as debuginfod having a different level of access than the cron job that signs packages. Complete documentation for Nomad can be found at the upstream docs site.

To work with Nomad you will need a Nomad token, which you can obtain from Vault:

export NOMAD_ADDR=https://nomad.voidlinux.org
vault read nomad/creds/<role>

By default Nomad tokens are valid for 1 hour. You can renew your token until your Vault session expires with vault lease renew <lease ID>, where the lease ID is the value provided with the initial token.
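
Put together, a minimal session might look like the following sketch; the role name is a placeholder, and -field=secret_id is standard Vault CLI syntax for extracting a single field from the response:

$ export NOMAD_ADDR=https://nomad.voidlinux.org
$ NOMAD_TOKEN=$(vault read -field=secret_id nomad/creds/<role>)
$ export NOMAD_TOKEN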

Tips and Tricks

There are some things that can be done to make working with the HashiStack more streamlined, and some general tricks that have been learned over time.

Aliases for Common Functions

You can source the following file to set up convenient aliases for logging in and getting tokens for Nomad and Consul:

export NOMAD_ADDR=https://nomad.voidlinux.org
export VAULT_ADDR=https://vault.voidlinux.org
export CONSUL_HTTP_ADDR=https://consul.voidlinux.org
export NOMAD_NAMESPACE='*'

# Log into Vault via LDAP; substitute your own NetAuth username.
vlogin() {
    vault login -method=ldap username=maldridge
}

# Fetch a Nomad token from Vault if the cached one is no longer valid.
ntok() {
    if ! nomad acl token self -token "$(jq -r .data.secret_id < ~/.nomad-token)" > /dev/null 2>&1 ; then
        vault read -format json nomad/creds/management > ~/.nomad-token
    fi

    NOMAD_TOKEN=$(jq -r .data.secret_id < ~/.nomad-token)
    export NOMAD_TOKEN
}

# Renew the Nomad token's lease every 5 minutes until the Vault session ends.
nkeepalive() {
    ntok

    while vault lease renew "$(jq -r .lease_id < ~/.nomad-token)" > /dev/null 2>&1 ; do
        sleep 300
    done
}

# Fetch a Consul token from Vault if the cached one is no longer valid.
ctok() {
    if ! consul acl token read -id "$(jq -r .data.accessor < ~/.consul-token)" > /dev/null 2>&1 ; then
        vault read -format json consul/creds/root > ~/.consul-token
    fi

    CONSUL_HTTP_TOKEN=$(jq -r .data.token < ~/.consul-token)
    export CONSUL_HTTP_TOKEN
}

Note that if you are not a member of netauth/dante you will likely need to change the roles used in these functions, and you will need to change the username in vlogin() to match your NetAuth username.

You can also background the keepalive function above to keep a nomad token going for the entire lifetime of your vault token:

$ nkeepalive &
[1] 8679

Take note of this number, as it is how you can stop the keepalive process later should you wish to end it preemptively. If you use bash, you can also find the job in the output of jobs.
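
For example, to stop it by PID or by bash job number:

$ kill 8679    # by the PID printed when backgrounding
$ kill %1      # or by job number, per the output of jobs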

Debugging a Service Task

Service tasks operate just like tasks running on the host, and can be attached to remotely using the nomad CLI. Here's an example of attaching to a running service container and looking around:

  1. First you need the allocation ID, which you can get by checking the status of the top-level job.

    $ nomad job status minio
    ID            = minio
    Name          = minio
    Submit Date   = 2021-01-22T00:45:33-08:00
    Type          = service
    Priority      = 50
    Datacenters   = VOID
    Namespace     = infrastructure
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group  Queued  Starting  Running  Failed  Complete  Lost
    app         0       0         1        109     13        2
    
    Allocations
    ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
    187e21b6  e27ec674  app         8        run      running  4d12h ago  4d12h ago
    
  2. The allocation ID in this case is 187e21b6. We can now connect to the job remotely using nomad alloc exec:

    $ nomad alloc exec -i -t 187e21b6 /bin/sh
    #
    

    The -i option specifies that this is an interactive session with standard I/O connected, and -t requests terminal behavior. The executable invoked must exist in the task's namespace and must be specified by absolute path.

  3. When finished with the interactive session you can exit by closing the shell, which will return you to your local prompt. Note that the shell depends on the validity of your nomad token, so you may need to renew your token if you expect to remain attached to a debug session for a long interval.

Debugging a batch/periodic task

The steps for debugging a batch/periodic task are slightly different from debugging a service task. You need to change the entrypoint to keep the container running while you're attached to it:

                image = "eeacms/rsync"
                -        args = [
                -          "rsync", "-vurk",
                -          "--delete-after",
                -          "-e", "ssh -i /secrets/id_rsa -o UserKnownHostsFile=/local/known_hosts",
                -          "void-buildsync@b-hel-fi.node.consul:/mnt/data/pkgs/", "/pkgs/"
                -        ]
                +        entrypoint = ["/bin/sleep", "3600"]
                +        # args = [
                +        #   "rsync", "-vurk",
                +        #   "--delete-after",
                +        #   "-e", "ssh -i /secrets/id_rsa -o UserKnownHostsFile=/local/known_hosts",
                +        #   "void-buildsync@b-hel-fi.node.consul:/mnt/data/pkgs/", "/pkgs/"
                +        # ]

This changes the entrypoint to simply sleep. After submitting the job to the cluster, and optionally forcing a periodic launch with nomad job periodic force <job>, you can inspect and attach to the job as shown above. Note that this only gives you an hour to debug; if you need more time, change the value in the sleep command.
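
A debugging session under these assumptions might proceed as follows; the job file name is hypothetical, and buildsync-musl is used as an example job:

$ nomad job run buildsync.nomad            # submit the modified job
$ nomad job periodic force buildsync-musl  # optionally force a periodic launch
$ nomad job status buildsync-musl          # find the new allocation ID
$ nomad alloc exec -i -t <alloc-id> /bin/sh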

Services

Void operates a number of services across the managed fleet. This section documents the various services and their appropriate care and feeding.

Services are mapped onto physical or virtual hosts by the Ansible configuration. This mapping is encapsulated in the ansible/inventory file. Some services are replicated or distributed. In many cases, services take additional configuration values, which are stored in either host_vars or group_vars depending on the appropriate variable scope.

acmetool

All SSL certificates for Void are provided by Let's Encrypt via acmetool. Configuration of the names requested in acmetool certificates is done through various host variables.

Acmetool is configured to run under snooze and attempts to renew certificates once a day. Certificates with more than 30 days remaining will not be renewed. Acmetool does not automatically restart services that consume certificates; in the case of web services, it is assumed that pushes restart services frequently enough that this will not be an issue.
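
If a renewal needs to happen outside the scheduled run, acmetool's reconcile subcommand can be invoked by hand on the host; this is a manual step outside the Ansible-managed flow:

$ acmetool reconcile    # request or renew certificates as needed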

In the future we may include certificate checks to restart services that do not support dynamic certificate reloading.

BuildBot

BuildBot is our legacy build scheduler.

The buildbot master runs at build.voidlinux.org and provides unified scheduling to all other build tasks in the fleet. BuildBot also exposes a web interface.

The current status of the build infrastructure can be found on the build waterfall. This view shows what each of the buildslaves is doing right now, and uses traffic light colors for build state. A purple builder is usually a reason to contact void-ops and figure out what's wrong with the build host.

Authenticated users of the buildbot can restart builds that have failed without needing to push a new commit. Not all committers have access to restart failed builds this way. If you believe that you should have this access, contact maldridge@.

Moving a buildslave

Don't.

In the event that this is unavoidable, all builds need to be paused until the move is complete. In the event the builder that needs to be moved is in the musl cluster, all musl builders will need to be moved with it. Similarly, the aarch64 builders must always move as a pair.

EOL

BuildBot is slated for replacement this fall/winter. The system will be replaced by the Distributed XBPS Package Builder (dxpb), which will resolve many of the buildbot's long-standing problems.

DevSpace

Sometimes maintainers wish to distribute files to each other or to external users for testing. Good examples are new builds of the major browsers, and builds unsuitable for inclusion in the main repo in their current state, such as when a new technology is being trialed. This service is available at https://devspace.voidlinux.org.

For these purposes Void Linux maintains the world's worst webhost, Void DevSpace. This is a webserver and SFTP server combination that derives authentication from NetAuth, so its users are divorced from users on the physical host. The hosting service provided is extremely limited, and its only feature is auto-indexing of the filesystem tree.

If you are a currently active maintainer and wish to have an account on this system, contact maldridge@. Once you have an account you may connect via SFTP to devspace-sftp.voidlinux.org on port 2022. Expect the following key fingerprints:

3072 SHA256:kQvGWsG7SGP4qTHn11RtifPJIxDchzdWDqoYcW9obrw [devspace-sftp.voidlinux.org]:2022 (RSA)
256 SHA256:/1lubJnK04FUqH+NJH9QXRyzuK1BDq2baRa21K/OzzQ [devspace-sftp.voidlinux.org]:2022 (ECDSA)
256 SHA256:E/VvL7jVtAGutQDyswxm/dL639i56wEHiDJgS5L+QQ8 [devspace-sftp.voidlinux.org]:2022 (ED25519)
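
For example, connecting with OpenSSH's sftp client:

$ sftp -P 2022 <username>@devspace-sftp.voidlinux.org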

Loki

Loki is the central log server, which we wish to make publicly accessible. Note that not all logs are available there: because Loki may be queried by anybody on the internet, we do not push all syslog data to it.

Loki is from Grafana, and there are two main ways to interact with it, both of which use LogQL. If you are familiar with Prometheus, LogQL should be fairly straightforward, as it follows the same general design as PromQL.

LogQL Cheat Sheet

LogQL expressions always start with a stream selector in curly braces. The stream selector must select at least one stream by the labels that are available. By default the following labels are available for all tasks:

  • nomad_job: The top level ID of the job that is running.
  • filename: The filename that the log was read from.

Here's an example LogQL query that gives you the contents of the buildsync-musl log:

{nomad_job="buildsync-musl"}

This will pull all streams that are labelled with buildsync-musl as the Nomad job. To select only stderr output, amend the stream selector as follows:

{nomad_job="buildsync-musl", filename="/alloc/logs/rsync.stderr.0"}

Sometimes you might want to use regular expressions to match multiple labels at once, such as when you want to pull logs from a rotated set of files. By default Nomad rotates its logs every 10MB.

{nomad_job=~"buildsync-(musl|aarch64)", filename=~"/alloc/logs/rsync.std(err|out).*"}

The above example demonstrates the use of a regular expression to match multiple log streams simultaneously. The operators for matching are as follows:

  • =: Exactly Equal
  • !=: Does Not Equal
  • =~: Regex Matches
  • !~: Regex Does Not Match

You can also perform filtering once the log stream has been selected. If you wanted to match only lines containing the strings xbps or repodata, you could extend the above query to the following:

{nomad_job=~"buildsync-(musl|aarch64)", filename=~"/alloc/logs/rsync.std(err|out).*"} |~ "(xbps|repodata)

A full list of expressions and matchers is available in the Upstream Loki Documentation.

Querying Logs With Grafana

Users with Grafana credentials can use the "Explore" page of the Grafana web interface to query logs. If you find a particularly useful log query, consider adding a new dashboard so you can quickly refer to the query again.

Querying Logs With LogCLI

All users can query Loki directly from the command line using logcli, from the loki package. LogCLI can run the first query from the cheat sheet above as follows:

$ export LOKI_ADDR=https://loki.voidlinux.org
$ logcli query '{nomad_job="buildsync-musl"}'
Common labels: {filename="/alloc/logs/rsync.stdout.0", nomad_group="rsync", nomad_job="buildsync-musl", nomad_namespace="build", nomad_task="promtail"}
2021-03-07T20:30:59-08:00 {} debug/xlunch-dbg-4.7.0_1.armv6l-musl.xbps
2021-03-07T20:30:59-08:00 {} debug/udftools-dbg-2.3_1.armv6l-musl.xbps
2021-03-07T20:30:59-08:00 {} debug/opensp-dbg-1.5.2_9.armv7l-musl.xbps
2021-03-07T20:30:59-08:00 {} debug/mDNSResponder-dbg-1310.80.1_1.armv6l-musl.xbps
2021-03-07T20:30:59-08:00 {} debug/libhunspell1.7-dbg-1.7.0_3.armv6l-musl.xbps
2021-03-07T20:30:59-08:00 {} debug/libflac-dbg-1.3.3_2.armv7l-musl.xbps
2021-03-07T20:30:59-08:00 {} debug/libaspell-dbg-0.60.8_4.armv6l-musl.xbps
2021-03-07T20:30:58-08:00 {} debug/icu-libs-dbg-67.1_2.armv7l-musl.xbps
2021-03-07T20:30:58-08:00 {} debug/icu-dbg-67.1_2.armv7l-musl.xbps
2021-03-07T20:30:58-08:00 {} debug/hunspell-dbg-1.7.0_3.armv6l-musl.xbps
2021-03-07T20:30:58-08:00 {} debug/flac-dbg-1.3.3_2.armv7l-musl.xbps
2021-03-07T20:30:58-08:00 {} debug/clucene-dbg-2.3.3.4_9.armv6l-musl.xbps
2021-03-07T20:30:58-08:00 {} debug/aspell-dbg-0.60.8_4.armv6l-musl.xbps
2021-03-07T20:30:58-08:00 {} debug/armv7l-musl-repodata
2021-03-07T20:30:58-08:00 {} debug/armv6l-musl-repodata
2021-03-07T20:30:58-08:00 {} xlunch-4.7.0_1.armv6l-musl.xbps
2021-03-07T20:30:58-08:00 {} udftools-2.3_1.armv6l-musl.xbps
2021-03-07T20:30:58-08:00 {} opensp-devel-1.5.2_9.armv7l-musl.xbps
2021-03-07T20:30:58-08:00 {} opensp-1.5.2_9.armv7l-musl.xbps
2021-03-07T20:30:57-08:00 {} nomad-1.0.4_1.armv6l-musl.xbps
2021-03-07T20:30:57-08:00 {} mDNSResponder-1310.80.1_1.armv6l-musl.xbps
2021-03-07T20:30:57-08:00 {} libhunspell1.7-1.7.0_3.armv6l-musl.xbps
2021-03-07T20:30:57-08:00 {} libflac-devel-1.3.3_2.armv7l-musl.xbps
2021-03-07T20:30:57-08:00 {} libflac-1.3.3_2.armv7l-musl.xbps
2021-03-07T20:30:57-08:00 {} libaspell-0.60.8_4.armv6l-musl.xbps
2021-03-07T20:30:56-08:00 {} icu-libs-67.1_2.armv7l-musl.xbps
2021-03-07T20:30:55-08:00 {} icu-devel-67.1_2.armv7l-musl.xbps
2021-03-07T20:30:55-08:00 {} icu-67.1_2.armv7l-musl.xbps
2021-03-07T20:30:55-08:00 {} hunspell-devel-1.7.0_3.armv6l-musl.xbps
2021-03-07T20:30:55-08:00 {} hunspell-1.7.0_3.armv6l-musl.xbps

NetAuth

NetAuth provides all the authentication and authorization information to systems within Void's managed fleet. NetAuth is an open source project with a website at https://netauth.org.

Full documentation and usage information for NetAuth can be found at docs.netauth.org.

Architecture

Void's deployment has a NetAuth server hosted on a dedicated VM which uses certificates from the Void CA for transport security. The server is configured to use the ProtoDB storage engine and is backed up regularly by manual action. Automatic backups are not deemed necessary at this time since the information changes infrequently.

The primary NetAuth server can be reached at netauth.voidlinux.org on port 8443 and uses TLS for all connections.

Remote Linux Systems

Linux systems that need to derive authentication and authorization information are configured with a combination of pam_netauth and nsscache to provide the required services. Authentication information is cached locally on use by the PAM Policycache and refreshed periodically. Group and authorization information is cached to disk on all machines every 30 minutes. Keys for systems such as SSH are requested on demand via a helper binary, netkeys, which does not perform any caching.

While less than ideal, Void could operate for an extended period of time without the primary NetAuth server running.

Basic Administration

NetAuth uses a capability-based system for its own administration. Members of the group dante have permission to make changes on behalf of other users, and should generally be the only people making changes to the directory.

Adding a New User

When adding a new user, make sure to specify the username and number, ensuring the number is in the range that will be cached by nsscache.

$ netauth entity create <username> --number <number>

Making an entity a valid shell user

Shell users have additional required attributes; these can be set separately:

$ netauth entity update <username> --primary-group netusers --shell /bin/bash

For all users the primary group should be netusers and the shell should generally be /bin/bash. Additional fields may be set as needed.

Adding an entity to a group

Groups are used to gate access to all resources across the fleet. For example, to add a new build operator who can unwedge the buildslaves, the following command sets the appropriate group:

$ netauth entity membership <username> ADD build-ops
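
Removal is symmetric; a removal verb is used in place of ADD (DROP in current NetAuth releases; consult the netauth CLI help to confirm for your version):

$ netauth entity membership <username> DROP build-ops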

Adding and removing SSH keys

Adding and removing SSH keys is done with the netauth command. The default key type is SSH. When adding and removing keys, the key content needs to be quoted to avoid word splitting by the shell. When removing keys the server matches keys on substrings, so the key comment alone should be sufficient to remove a key if the comment is unique.

$ netauth entity key add SSH "<key>"

Basic user interaction

Initial configuration

An initial config file for NetAuth can be obtained from the void-infrastructure repository. It can be stored in ~/.netauth/config.toml, for example, and should be modified so that the tls.certificate key points to a file containing the certificate for the netauth.voidlinux.org domain. The certificate can be obtained in one of two ways, shown below:

$ openssl s_client -showcerts -connect netauth.voidlinux.org:1729 </dev/null | openssl x509 -outform pem

or

$ cfssl certinfo -domain netauth.voidlinux.org:1729 | jq --raw-output .pem
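
Either command prints the PEM certificate on stdout; redirect it into the file your tls.certificate key points to, for example (the destination path here is illustrative):

$ openssl s_client -showcerts -connect netauth.voidlinux.org:1729 </dev/null \
    | openssl x509 -outform pem > ~/.netauth/netauth.pem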

At that point, the password can be set with netauth auth change-secret.

Setting the entity ID

NetAuth uses the system username as the entity ID for NetAuth operations. In some cases, the NetAuth entity ID for a user may differ from the system username. To override this, use the --entity flag or set the NETAUTH_ENTITY environment variable.
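
For example, if your NetAuth entity differs from your system username:

$ export NETAUTH_ENTITY=<entity-id>
$ netauth auth change-secret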

nginx

Void's preferred webserver is nginx, using drop-in config fragments. All nginx instances are managed by Ansible and have an Apache2-style sites-available and sites-enabled directory structure in /etc/nginx/. Additionally, an /etc/nginx/locations.d/ directory exists for each site to provide location {} fragments that may not be owned by the same task that created the original site.

When possible, it is preferable to proxy web services through nginx to terminate TLS and abstract certificate handling away from backend services. Services that communicate via protocols that use HTTP as a transport, such as gRPC services, do not need to use nginx as a proxy.

PopCorn

PopCorn is Void's package popularity system, similar to Debian's popularity contest system popcon, which inspired the name.

Information in depth about PopCorn can be found at the project's GitHub repository.

Querying Stats

Stats from PopCorn are available to anyone who wishes to query the system. The server is live at popcorn.voidlinux.org, with report services on port 8000 and the stats repository available on port 8003.

Getting the day's stats

You can download the current raw stats at any time. These are the stats that are written to the per-day files at the PopCorn site.

$ popcornctl --server popcorn.voidlinux.org --port 8001 report

If no file is specified, output.json will be written to the current directory. If a file is specified by passing --file <path> to report, the output will be written to the named file.
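
For example, to write the day's stats to a named file (the output filename here is arbitrary):

$ popcornctl --server popcorn.voidlinux.org --port 8001 report --file today.json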

Finding out versions of a package

The versions for a package can be queried from the stats repository with popcornctl. By default, the stats are queried over the most recent 30-day interval. To get known versions, use the following query:

$ popcornctl --server popcorn.voidlinux.org --port 8003 pkgstats --pkg <pkg>

Additional formatting options are available by specifying --format. Useful alternate formats are date and csv, which provide information about versions seen over time.
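
For example, to see the versions of a package over time in CSV form (the package name is just an illustration):

$ popcornctl --server popcorn.voidlinux.org --port 8003 pkgstats --pkg bash --format csv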

rsyncd

All managed mirrors provide unauthenticated rsync. Like nginx, rsyncd is configured with drop-in files, read from /etc/rsync.conf.d.

rsync is the preferred way to mirror large amounts of package data between two locations, even for ad-hoc migrations. For persistent sync the rsync protocol (rsync://) is preferred.
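
For example, an ad-hoc sync from a mirror might look like the following (the hostname, module path, and destination are illustrative):

$ rsync -aH --delete rsync://<mirror>/voidlinux/current/ /srv/void/current/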

void-updates

The void-updates site provides a daily text file listing all package maintainers and all packages with known updates.

Because of the mechanism by which void-updates works, care must be taken not to let it run unthrottled. We have configured it to scrape for updates once per day, and this seems to be infrequent enough to keep most webmasters happy.

While a manual run of void-updates can be triggered, be aware that this can cause instability in the output data and is discouraged.

xlocate

The xlocate service provides the data source consumed by the xlocate command from the xtools package. This task is responsible for regenerating, once a day, the search index used for all package files.

Required Colocation

Because the xlocate indexing task requires running xbps-query over all packages in x86_64, this task must be colocated with a package mirror. The mirror must also be configured locally so that xlocate does not unnecessarily load a webserver on the same host.
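
One way to achieve this is a drop-in xbps configuration, for example /etc/xbps.d/00-local-repo.conf, that points at the local copy of the repository (the paths are illustrative):

repository=file:///srv/void/current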

xq-api

xq-api is a server that reads local repodata and serves it over HTTP. It is used to provide package search and package information to users.

Information about xq-api, such as paths served by xq-api and the data it returns, can be read in its man page, xq-api(8), and at its GitHub repository.
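
For example, a package search against the live instance might look like the following (the host and route are assumptions; consult xq-api(8) for the actual paths):

$ curl -s "https://xq-api.voidlinux.org/v1/query/x86_64?q=<pkg>" | jq .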

Organization

Void Linux is a controlled anarchy. This is working as intended, and we like it this way. We've decided that it's better to have flexible workflows that can adapt to new situations as they arise rather than needing to consult detailed documentation or request authorization in advance.

Rather than determining processes for every action, we instead choose to trust our members to think on their feet and come up with reasonable solutions.

We still need some processes though, and we need consistency in the way that people think about problems. The processes described in this section aim to keep the organization running.

Onboarding

This section explains how new members are proposed, approved, and new permissions are assigned.

Proposing a New Member

Any existing member of the organization may propose a candidate. Candidates are expected to have been an active part of the organization for some time prior to being proposed.

The proposal should take the form of written notice to the existing team via some private channel. Email is a good option for this. The mail should include the following information:

  • Candidate's name or well-known username
  • Candidate's contributions to Void, ideally as bullet points
  • A deadline by which comments should be provided, at least 1 week

After the comment period expires, the comments and final statements should be reviewed. Organization owners will have the final ability to approve a proposal and to dismiss comments or objections to the proposal.

If the proposal is approved, the candidate should be contacted; if they agree to join the organization, proceed to the next section.

Onboarding an Approved Candidate

Onboarding the candidate requires, at minimum, a member of netauth/terraform to run the final commands that add the new candidate. During the review and approval process, a Void Ops lead should have agreed to handle the proposal; they will ensure the following steps are done.

A patch should be created for the void-infrastructure repo which adds the candidate to the github_members.tf file. This patch should be sent to the repo as a pull request. The pull request should contain the same information as the email to internal team members, including a comment period of at least one week.

At the conclusion of the comment period, a final decision will be made to submit the patch or to rescind the invitation. This decision is made by agreement of the organization owners. The default is to approve.

Unless the new member objects, a post should be made to voidlinux.org welcoming them to the organization (a new member may not wish to have this kind of attention immediately).

Additional Powers

Once added to the organization, additional powers may need to be delegated. This should be discussed in the PR that added the individual to the organization.

Additional powers include:

  • NetAuth system account
  • NetAuth mail account
  • NetAuth group memberships

Consult the service specific documentation for how to apply additional permissions as needed.

Offboarding

Sometimes people leave the project, either of their own volition or with a helping hand. It's very important to ensure that things are done right and according to this process when someone leaves, so please read it in full.

Cause for following these offboarding guidelines includes, non-exhaustively, any of the following:

  • the member acted maliciously
  • the member has become inactive
  • the member wishes to be removed
  • any other reason that justifies the removal

Proposing a Removal

Any current member of the Void Linux Organization can propose a removal. There are two ways this process may go, depending on whether the removal candidate is also the requesting entity.

Removing Yourself

If you'd like to remove yourself, contact an administrator of the organization and state your request plainly.

  1. Contact an administrator via a private channel.
  2. The administrator will confirm the request via your registered email address.
  3. The administrator will file a PR to the void-infrastructure repo to be processed by void-ops.

If you're leaving the project permanently, attempting to find a successor to maintain your packages is greatly appreciated.

Removing Someone Else

Removing someone else is a more involved process and may require more discussion.

If you're proposing a security removal, escalate to void-ops directly; these are processed immediately.

  1. Send an email to an administrator at their registered email address. This email should name the person you believe should be removed and explain why.
  2. The administrator should contact the named individual and notify them of the removal request. This should be done via the registered email address. This email must also contain a deadline of at least one week for the individual to make a statement.
  3. Any statement should be considered and discussed amongst the admins of the organization. This may require pushing the deadline back.
  4. Once an agreement has been reached, the removal will either proceed or be dropped. If no agreement can be reached after a reasonable amount of time and a sincere effort from both the admins and the individual, dropping membership is the default resolution.
  5. A ticket is created for void-ops to process the removal.

Ops Removal Checklist

Removing a member of the organization follows this checklist, which can be copy/pasted into a markdown-aware ticket (a markdown form is shown after the list).

  • Remove access from the GitHub organization via Terraform
  • Remove all memberships in NetAuth (removing the NetAuth principal itself is optional, but discouraged)
  • Remove any moderator bits in the wiki
  • Remove any existing manually provisioned mail aliases
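
For example, as a markdown task list:

- [ ] Remove access from the GitHub organization via Terraform
- [ ] Remove all memberships in NetAuth (keep the principal itself)
- [ ] Remove any moderator bits in the wiki
- [ ] Remove any existing manually provisioned mail aliases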

Preserving NetAuth entities may seem a bit unusual, but it prevents accidentally re-provisioning the name later, and the default groups grant no non-public access.

Keeping the Lights On

There are a handful of manual actions that must be taken from time to time to keep Void running. These are manual either because the automation technology is not available, or because the complexity of that technology would not be worth the investment.

Distributable Images

Void prepares and distributes multiple live images. These are prepared manually due to the need for full root authority during build, and for the need to sign them after building.

Building the Images

The images should be built using GitHub CI in the void-mklive repository. This can be triggered on GitHub or by using the release.sh script in void-mklive:

$ ./release.sh start

By default, this will build:

  • Live ISOs with base and xfce variants for x86_64* and i686
  • ROOTFSes for x86_64*, i686, aarch64*, armv7l*, and armv6l*
  • PLATFORMFSes for aarch64*, armv7l*, and armv6l* Raspberry Pis
  • SBC images for aarch64*, armv7l*, and armv6l* Raspberry Pis

This will take approximately 2 hours for the default settings. To ensure all images have the same datecode, the datecode is cached at the beginning of the run. The CI workflow will also generate sha256sum.txt for the built images.

Collecting the Images

Once all images are built, they need to be collected from the GitHub CI artifacts. This can be done via the GitHub CI web interface, on the "Summary" tab of the CI run; alternatively, void-mklive's release.sh can download them to a directory called void-live-<date> with:

$ ./release.sh dl

Note: this currently assumes the latest successful CI run is the one to download.

Once downloaded, verify all sha256sums match:

$ cd void-live-<date>
$ sha256sum -c sha256sum.txt

The images can then be uploaded to DevSpace or the mirrors for testing.

Signing the Images

Signing the images is done after all the images have been checked and validated, and after the decision has been made to promote the set to current.

Generate a new signing key:

$ export DATECODE=<date>
$ pwgen -cny 25 1 > void-release-$DATECODE.key
$ cat void-release-$DATECODE.key void-release-$DATECODE.key | \
	minisign -G -p void-release-$DATECODE.pub -s void-release-$DATECODE.sec \
	-c "This key is only valid for images with date $DATECODE." \

Copy the public half of this key to the void-release-keys package in void-packages and make a release. Copy the passphrase (.key), privkey (.sec), and pubkey (.pub) to secret/releng/image-keys/<date>/{passphrase,privkey,pubkey} in Vault and ensure that the copy has been completed successfully.

Copy the sha256sum.txt file to your local workstation and sign it with the appropriate key.

$ minisign -S -x sha256sum.sig -s void-release-$DATECODE.sec \
	-c "This key is only valid for images with date $DATECODE." \
	-t "This key is only valid for images with date $DATECODE." \
	-m sha256sum.txt < void-release-$DATECODE.key

Alternatively, key generation and signing can be done with release.sh in void-mklive, which will generate the proper keys and sign the files as described above:

$ ./release.sh sign <date> sha256sum.txt

Copy the signed file back up to the master mirror and change the current symlink to point to the now-signed ISOs.

Once you have confirmed that the link has updated, post an update to the website and arrange for the new key to be distributed as widely as possible.

Post Mortem

In this section we collect post mortem documentation to incrementally harden and improve our infrastructure.

Post Mortem 2021-06-06

Incident summary

Due to a hardware defect on a-hel-fi, we experienced service degradation across various systems.

Leadup

The server a-hel-fi had been behaving strangely for about a week.

Fault

The datacenter reported faulty hardware.

Impact

  • build.voidlinux.org was down
  • docs.voidlinux.org was down
  • alpha.de.repo.voidlinux.org was down
  • man.voidlinux.org was down
  • package search on voidlinux.org was unavailable

Detection

The issue was reported in IRC 4 minutes after monitoring and automation detected the fault. No automatic alerts were raised.

Response

  • Hardware reset from Hetzner's Robot web interface
  • Hardware reset into the rescue system from Hetzner's Robot web interface
  • A ticket was opened

Recovery

The datacenter moved the HDDs to new hardware.

What went well

Communication with the datacenter was good. From the initial report to the fix took only about an hour, and most of the delay was on our side.

The handling of the incident was good and the response time was fast. We also shared the state of the incident via Twitter and Reddit, which helped users be understanding about the downtime.

What could be done better

It was just luck that Gottox was available; he was the only one able to interact with the web interface.

Lessons learned

  • putting too many services on one host isn't the best idea
  • the Hetzner web interface should be accessible to more people

Timeline

Timestamps are GMT+00

  • 2021-06-06 09:30: The machine stopped replying to heartbeats.
  • 2021-06-06 09:54: Issue was reported on IRC by maldridge.
  • 2021-06-06 09:58: Hardware reset was issued by Gottox from the Robot web interface.
  • 2021-06-06 10:13: Hardware reset into the rescue system was issued by Gottox from the Robot web interface.
  • 2021-06-06 10:23: maldridge was provided with access to the Robot web interface for that specific server.
  • 2021-06-06 10:26: Ticket was opened at the datacenter.
  • 2021-06-06 10:31: A remote power button press was initiated by maldridge, as the web interface reported 'power off'.
  • 2021-06-06 10:58: A hardware reset was initiated by Hetzner support.
  • 2021-06-06 11:14: Gottox reported back to support that the server was still not reachable.
  • 2021-06-06 11:24: Server started pinging again.
  • 2021-06-06 11:25: Hetzner support reported back that the server had been hanging in POST and that they had replaced the hardware.
  • 2021-06-06 11:25: Restart of the following services, as reported by maldridge and done by Gottox: wireguard, unbound, consul, nomad.
  • 2021-06-06 12:08: Restart of nginx, as Firefox reported certificate issues; done by Gottox.

Post Mortem 2021-07-18

Incident summary

For approximately 2 days, the Nomad-managed tasks on a-fsn-de were unavailable, including repository management tasks. Resolution was, and continues to be, hampered by a distinct and ongoing outage.

Amended 2021-07-31

CAS service was restored on 2021-07-22; any point below that mentions an ongoing outage was written while the incident was still in progress.

Leadup

At this time we do not know what caused the reboot of a-fsn-de; however, we do know that a parallel and ongoing incident was occurring at the same time. Void uses a centralized authentication service (CAS) to manage access to our machines, and like many secure services it relies on TLS certificates. This certificate expired without being noticed, which prevented what would otherwise have been a quick recovery of logging in and bouncing some services.

Additionally, when the CAS is unavailable, we maintain a break-glass login capability for a handful of extremely trusted maintainers (1:1 with the people that have access to the package signing key). This access was discovered to be impaired: one developer's key was missing entirely, and another developer had failed to rotate their key. The third developer was on vacation but was able to log in and rectify the keying problem.

Fault

The Nomad outage was caused by an unexpected restart of a-fsn-de. When Nomad hosts reboot, a known defect means that runit may bring the services up in a race, leaving Nomad unusable until the services are restarted in a specific order.

The unavailability of the CAS system means that we still cannot log in to all hosts normally.

The issues with the break-glass keys caused the recovery of both the specific Nomad host and the CAS server to be slow.

Impact

Publicly Visible:

  • Repository signing unavailable
  • Builds not completing normally
  • musl and aarch64 appeared to lag behind glibc

Internally Visible:

  • CAS logins not available
  • Detailed failure logs not available in Grafana (requires CAS login)
  • Couldn't make control requests to Nomad (requires Vault CAS integration)

Detection

The Nomad failure was detected by external observation that an update to the less package was not signed, which was preventing installs from proceeding.

The CAS failure was discovered a few days before the signing issue but was deemed non-critical as it was an inconvenience that could be fixed within a week.

Response

The ability to merge changes in GitHub was restricted to prevent new builds from running that might further complicate recovery efforts.

@the-maldridge and @Gottox were recalled from vacation to recover access to the system.

Recovery

@Gottox used break-glass access to both restore break-glass access for @the-maldridge and @duncaen, and to restart the stuck signing process.

What went well

  • Excellent internal communication kept everyone in the loop as to what was broken, what was being done to fix it, and who was responsible for taking action.

  • Break-glass access, once used, did work effectively.

What could be done better

  • External communication was not great. A post went up on voidlinux.org, but no Twitter notification was made, and the post was not widely shared on our other channels.

  • Break glass connectivity existed, but did not work.

  • Initially recalling critical team members from vacation was an ad-hoc process.

Lessons learned

  • Having the capability isn't enough. Break glass needs to be regularly tested to be effective.

  • For foundational infrastructure that has very infrequent updates, such as long lived TLS certificates, we should ensure multiple people are aware of the expiry date, and make use of multiple calendars to ensure critical life-cycle events are not missed.

Timeline

No timeline is provided for this incident.