---
gitea: none
title: Flaky infrastructure, declaratively
include_toc: yes
lang: en
---
# C3D2 infrastructure based on NixOS
## Setup
### Enable nix flakes user wide
Add the setting to the user nix.conf. Only do this once!
```bash
echo 'experimental-features = nix-command flakes' >> ~/.config/nix/nix.conf
```
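Because the command above appends unconditionally, running it twice duplicates the line. A minimal sketch of an idempotent variant follows; the helper name `ensure_nix_conf_line` is hypothetical and not part of this repo.

```bash
# Hypothetical helper: append a line to a nix.conf only if that exact
# line is not already present, creating the file and directory if needed.
ensure_nix_conf_line() {
  local conf="$1" line="$2"
  mkdir -p "$(dirname "$conf")"
  touch "$conf"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  grep -qxF -- "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
}
```

It could be called as `ensure_nix_conf_line ~/.config/nix/nix.conf 'experimental-features = nix-command flakes'`, and calling it again leaves the file unchanged.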
### Enable nix flakes system wide (preferred for NixOS)
Add this to your NixOS configuration:
```nix
nix.settings.experimental-features = [ "nix-command" "flakes" ];
```
### nixpkgs/nixos
The nixpkgs/nixos input used lives at <https://github.com/supersandro2000/nixpkgs/tree/nixos-23.05>.
We are using a fork managed by sandro to make backports, cherry-picks and custom fixes dead easy.
If you want to have an additional backport, cherry-pick or other change, please contact sandro.
### nixos-modules repo
The nixos-modules repo lives at <https://github.com/supersandro2000/nixos-modules> and is mirrored to <https://gitea.c3d2.de/c3d2/nixos-modules>.
Auto generated documentation about all options is available at <https://supersandro2000.github.io/nixos-modules/>.
It contains options sandro shares between his private nixos configs and the C3D2 one.
It sets many options by default and when searching for a particular setting you should always grep this repo, too.
When in doubt, ask sandro and consider improving the documentation with comments and readme explanations.
Should something be changed, added or removed? Please create a PR or start a conversation with your ideas.
### secrets repo
The secrets repo is absolutely deprecated!
Everything new must be done through sops and everything old should be migrated.
If you don't have secrets access ask sandro or astro to get onboarded.
### SSH access
If people should get root access to *all* machines, their keys should be added to ``ssh-public-keys.nix``.
## Deployment
### Deploy to a remote NixOS system
For every host that has a `nixosConfiguration` in our Flake, there are two scripts that can be run for deployment via ssh.
- `nix run .#HOSTNAME-nixos-rebuild switch`
  Copies the current state to the target system and builds it there.
  This may fail due to resource limits on e.g. Raspberry Pis.
- `nix run .#HOSTNAME-nixos-rebuild-local switch`
  Builds everything locally, then uses `nix copy` to transfer the new NixOS system to the target.
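
When several hosts need the same treatment, the two scripts above can be wrapped in a loop. The following is only a sketch: the function name `deploy_hosts` and the `DRY_RUN` variable are assumptions made for illustration, not something this flake provides. With `DRY_RUN=1` it prints the commands instead of running them, e.g. `DRY_RUN=1 deploy_hosts nixos-rebuild-local HOSTNAME`.

```bash
# Hypothetical helper: deploy several hosts one after another,
# stopping at the first failure.
# $1 is the script variant: "nixos-rebuild" or "nixos-rebuild-local".
# DRY_RUN=1 (a convention of this sketch) only prints the commands.
deploy_hosts() {
  local variant="$1"; shift
  local host
  for host in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "nix run .#${host}-${variant} switch"
    else
      nix run ".#${host}-${variant}" switch || return 1
    fi
  done
}
```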
To use the cache from Hydra, set the following nix options, similar to enabling flakes:
```
trusted-public-keys = nix-cache.hq.c3d2.de:KZRGGnwOYzys6pxgM8jlur36RmkJQ/y8y62e52fj1ps=
trusted-substituters = https://nix-cache.hq.c3d2.de
```
This can also be set with the `c3d2.addBinaryCache` option from the [c3d2-user-module](https://gitea.c3d2.de/c3d2/nix-user-module).
### Checking for updates
```shell
nix run .#list-upgradable
```
![list-upgradable output](doc/list-upgradable.png)
Checks all hosts with a `nixosConfiguration` in `flake.nix`.
### Update from [Hydra build](https://hydra.hq.c3d2.de/jobset/c3d2/nix-config#tabs-jobs)
This is the fastest way to update a system and a manual alternative to setting
`c3d2.autoUpdate = true;`.
Just run:
```shell
update-from-hydra
```
### Deploy a MicroVM
#### Build a microvm remotely and deploy
```shell
nix run .#microvm-update-HOSTNAME
```
#### Build microvm locally and deploy
```shell
nix run .#microvm-update-HOSTNAME-local
```
#### Update MicroVM from our Hydra
Our Hydra runs `nix flake update` daily in the `updater.timer`,
pushing it to the `flake-update` branch so that it can build fresh
systems. This branch is set up as the source flake in all the MicroVMs,
so the following is all that is needed on a MicroVM-hosting server:
```shell
microvm -Ru $hostname
```
## Cluster deployment with Skyflake
### About
[Skyflake](https://github.com/astro/skyflake) provides Hyperconverged
Infrastructure to run NixOS MicroVMs on a cluster. Our setup unifies
networking with one bridge per VLAN. Persistent storage is replicated
with Cephfs.
You can recognize a nixosConfiguration that belongs to our Skyflake deployment by the
`self.nixosModules.cluster-options` module being included.
### User interface
We use the less-privileged `c3d2@` user for deployment. This flake's
name on the cluster is `config`. Other flakes can coexist in the same
user so that we can run separately developed projects like
*dump-dvb*. *leon* and potentially other users can deploy Flakes and
MicroVMs without name clashes.
#### Deploying
**git push** this repo to any machine in the cluster, preferably to
Hydra because there building won't disturb any services.
You don't deploy all MicroVMs at once. Instead, Skyflake allows you to
select NixOS systems by the branches you push to. **You must commit
before you push!**
**Example:** deploy nixosConfigurations `mucbot` and `sdrweb` (`HEAD` is your
current commit)
```bash
git push c3d2@hydra.serv.zentralwerk.org:config HEAD:mucbot HEAD:sdrweb
```
This will:
1. Build the configuration on Hydra, refusing the branch update on
broken builds (through a git hook)
2. Copy the MicroVM package and its dependencies to the binary cache
that is accessible to all nodes with Cephfs
3. Submit one job per MicroVM into the Nomad cluster
*Deleting* a nixosConfiguration's branch will **stop** the MicroVM in Nomad.
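The branch-per-MicroVM convention described above can be sketched as a small helper that assembles the `git push` command. The function name `skyflake_push_cmd` is hypothetical. It only prints the command so that it can be reviewed before running, and it relies on standard git refspec syntax, where an empty source (`:branch`) deletes the remote branch.

```bash
# Hypothetical helper: print the git push command that deploys or stops
# the given MicroVMs. $1 is "deploy" or "stop", the rest are
# nixosConfiguration names.
skyflake_push_cmd() {
  local action="$1"; shift
  local remote="c3d2@hydra.serv.zentralwerk.org:config"
  local refspecs=() name
  if [ "$#" -eq 0 ]; then
    echo "usage: skyflake_push_cmd deploy|stop NAME..." >&2
    return 1
  fi
  for name in "$@"; do
    case "$action" in
      deploy) refspecs+=("HEAD:${name}") ;;  # push the current commit to the branch
      stop)   refspecs+=(":${name}") ;;      # an empty source deletes the remote branch
      *) echo "unknown action: $action" >&2; return 1 ;;
    esac
  done
  echo "git push ${remote} ${refspecs[*]}"
}
```

For example, `skyflake_push_cmd deploy mucbot sdrweb` prints the same command as the example above, and `skyflake_push_cmd stop mucbot` prints the push that deletes the `mucbot` branch.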
#### Updating
**TODO:** how would you like it?
#### MicroVM status
```bash
ssh c3d2@hydra.serv.zentralwerk.org status
```
### Debugging for cluster admins
#### Nomad
##### Check the cluster state
```shell
nomad server members
```
Nomad *servers* **coordinate** the cluster.
Nomad *clients* **run** the tasks.
##### Browse in the terminal
[wander](https://github.com/robinovitch61/wander) and
[damon](https://github.com/hashicorp/damon) are nice TUIs that are
preinstalled on our cluster nodes.
##### Browse with a browser
First, tunnel TCP port `:4646` from a cluster server:
```bash
ssh -L 4646:localhost:4646 root@server10.cluster.zentralwerk.org
```
Then, visit https://localhost:4646 for the full graphical web UI.
##### Reset the Nomad state on a node
After upgrades, Nomad servers may fail to rejoin the cluster. Do this
to make a *Nomad server* behave like a newborn:
```shell
systemctl stop nomad
rm -rf /var/lib/nomad/server/raft/
systemctl start nomad
```
## Secrets management
### Secrets Management Using `sops-nix`
#### Adding a new host
Edit `.sops.yaml`:
1. Add an AGE key for this host. Comments in this file tell you how to do it.
2. Add a `creation_rules` section for `host/$host/*.yaml` files
#### Editing a host's secrets
Edit `.sops.yaml` to add files for a new host and its SSH pubkey.
```bash
# Get sops
nix develop
# Decrypt, start an $EDITOR, encrypt
sops hosts/.../secrets.yaml
# Push
git commit -a -m "Adding new secrets"
git push origin
```
### Secrets management with PGP
Add your gpg-id to the .gpg-id file in secrets and let somebody reencrypt it for you.
Maybe this works for you, maybe not. I did it somehow:
```bash
PASSWORD_STORE_DIR=`pwd` tr '\n' ' ' < .gpg-id | xargs -I{} pass init {}
```
Your gpg key has to have the Authenticate flag set. If it does not, update it, push it to a keyserver and wait.
This is necessary so that you can log in to any machine with your gpg key.
## Laptops / Desktops
In the past, this repo could be used as a module. While that is still technically possible, it is not recommended
because the number of flake inputs has grown considerably and the modules are not designed with that use in mind.
For end user modules take a look at the [c3d2-user-module](https://gitea.c3d2.de/c3d2/nix-user-module).
For the deployment options take a look at [deployment](https://gitea.c3d2.de/c3d2/deployment).
## File system setup
Set the `disko` options for the machine and run:
```shell
$(nix build --print-out-paths --no-link -L '.#nixosConfigurations.HOSTNAME.config.system.build.disko')
```
When adding new disks, use the paths under ``/dev/disk/by-id/`` so that the script stays idempotent across device restarts.
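As a quick sanity check before putting a device path into the `disko` options, something like the following could be used. It is a minimal sketch, and the function name `is_stable_disk_path` is hypothetical.

```bash
# Hypothetical check: succeed only for paths under /dev/disk/by-id/,
# which stay the same across restarts, unlike names such as /dev/sda.
is_stable_disk_path() {
  case "$1" in
    /dev/disk/by-id/?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

The available names can be listed with `ls -l /dev/disk/by-id/`, which also shows which kernel device each one currently points to.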
## Install new server
- Copy the nix files from an existing, similar host.
- Disable all secrets until after the installation is finished.
- Set the `simd.arch` option to the output of ``nix shell nixpkgs#gcc -c gcc -march=native -Q --help=target | grep march`` and update the comment next to it.
  - If that returns `x86_64`, use a search engine to find the `ark.intel.com` entry for the processor; the processor model is listed in ``/proc/cpuinfo``.
- Generate `networking.hostId` with ``head -c4 /dev/urandom | od -A none -t x4`` according to the option's description.
- Boot live ISO
  - If your ssh key is not baked into the ISO, set a password for the `nixos` user with `passwd` to be able to log in over ssh.
- `rsync` this directory into the live system.
- Generate and apply the disk layout with disko (see above).
- Generate `hardware-configuration.nix` with ``sudo nixos-generate-config --no-filesystems --root /mnt``.
  - If LUKS disks should be decrypted in initrd over ssh, enable DHCP in the `hardware-configuration.nix` for the interfaces that should be used for that.
- Install the NixOS system with ``sudo nixos-install --root /mnt --no-channel-copy --no-root-passwd --flake .#HOSTNAME``.
- After a reboot, add the host's age key to sops-nix: enter ``nix shell nixpkgs#ssh-to-age`` and run ``ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub``.
- Add ``/etc/machine-id`` and the LUKS password to the sops secrets.
- Enable and deploy secrets again.
- Improve the new machine setup by automating the easy-to-automate steps and documenting the others.
- Commit everything and push.
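
The `networking.hostId` step above prints the value with leading whitespace from `od`. A small sketch that strips the padding and checks the result is shown below. Both function names are hypothetical, and the check assumes that the option expects exactly eight lowercase hexadecimal digits.

```bash
# Hypothetical helper: generate a random host ID with the same pipeline
# as in the list above, with od's whitespace padding removed.
generate_host_id() {
  head -c4 /dev/urandom | od -A none -t x4 | tr -d ' \n'
}

# Hypothetical check (assumption: the option wants exactly eight
# lowercase hex digits).
is_valid_host_id() {
  [[ "$1" =~ ^[0-9a-f]{8}$ ]]
}
```

For example, `id="$(generate_host_id)" && is_valid_host_id "$id" && echo "$id"` prints a value that can be pasted into the host's configuration.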