November 7, 2025
Hi. This post is about my home Kubernetes cluster. I figured I would write some words about it, if for no other reason than to keep some notes for my own reference. For completeness, some of this is going to be “standard Linuxy stuff” and “standard networky stuff”.
I began running a home Kubernetes cluster because I needed to know more about Kubernetes for work, and I already had some home server bits and pieces in among Docker Compose files. Plus, I am happy doing Recreational Sysadmin.
Hardware
At the core of it all are three Raspberry Pis. These are the “control-plane” “master” nodes running etcd. They are creatively named pi5a, pi5b, and pi5c. Each one has the standard Raspberry Pi cooler, a high-durability 64 GB microSDXC card, and a Pimoroni NVMe Base carrying a 1 TB NVMe drive.
The NVMe drives are either Kingston KC3000 or NV3; I originally got all KC3000s, but one of them was refunded under warranty (hence the NV3). They have decently low write latency but seem to run hot, so behind the Raspberry Pis in the cupboard are two 60mm fans blowing outwards from the cupboard. Behind the fans are three “5V USB-C UPS” units, each with USB-C power in and 2x 18650 lithium cells that I scrounged from old USB battery banks. These units are strapped to a ceramic tile and each powered by a standard Raspberry Pi 27W power supply.
I also have a couple of other computers joined to the cluster. These are variously specced, but generally getting on a bit (manufactured 2017-2019). All of them have at least 16 GB RAM, 1 TB of NVMe storage of whatever brand, and 1 Gb/s Ethernet. They have a mix of arm64 and amd64 CPUs with 4 or 8 cores.
Networking
Networking the nodes together is a Mikrotik CSS610 switch. I originally had all the nodes plugged into the Mikrotik CRS326 router shared with the rest of the home network, but moved to the separate switch for three reasons:
- I already had the switch.
- The router is getting on a bit and takes several minutes to reboot, and RouterOS gets more frequent software updates. The switch, on the other hand, only takes seconds to reboot.
- A 10G copper Ethernet SFP+ module in the router runs over 90°C. The same module, with the same host on the other end, runs no hotter than 70°C in the switch. This may be because the Raspberry Pi fans are on the same level as the switch and not the router, but I suspect the switch has better thermal design overall.
The networking gear, the nbn NTD in another room, and the WiFi are all backed by a DC UPS of one kind or another.
The Raspberry Pis are running Raspbian Lite and the other nodes run Debian 13 (stable). The networking equipment is on its standard operating system (RouterOS 7, swOS Lite).
Kubernetes distribution
I use k3s, since it was very easy to get started with and promises to be lightweight, which is important, since my hardware is also lightweight. K3s has opinions, in the form of bundled components, and I was happy to try running k3s as vanilla as possible.
But the combination of Traefik/ServiceLB/Flannel is already a fair bit more complicated than just slapping Caddy on a server. Plus I already had Caddy working great. Particularly, out of the box, k3s lacks automated Let’s Encrypt TLS certificates. But it was easy to put Caddy in front of the cluster while I started migrating things into Kubernetes, and I could figure out getting rid of Caddy once everything was across and the certificate situation was resolved.
Preparing to install
For networking, I settled on statically assigning addresses (IPv4 and IPv6) for “important nodes”, since it would make a lot of other things easier (like firewall rules). I also decided that my cluster would definitely support IPv6 (shouldn’t that, like, be the default everywhere now? haha).
For the non-server nodes, I went with fixed DHCP reservations anyway. Long ago I carved out a space of 30ish addresses from the local range for static assignment. I strongly suspect dynamic addressing might not even be supported for k3s in cluster mode. It’s probably a bad idea.
Anyway. Configuring static addressing in /etc/network/interfaces on the Raspberry Pis (yes, Raspbian/Debian still use ifupdown, yes Raspbian still uses ethX-style interface naming):
allow-hotplug eth0
iface eth0 inet static
    address 192.168.[REDACTED]/24
    gateway 192.168.[REDACTED]
iface eth0 inet6 static
    address 2404:e80:18b::[REDACTED]/64
    accept_ra 2
    scope global
accept_ra 2 is needed for both Tailscale and Kubernetes reasons (I believe). The only router on the network sending RAs should be the CRS326, so I’m comfortable with that.
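A quick way to confirm the setting took effect after bringing the interface up is to read the sysctl directly; 2 means “accept RAs even when forwarding is enabled”, which matters once Kubernetes and Tailscale turn forwarding on. Nothing clever here, just a check:
$ # 2 = accept router advertisements even with IP forwarding enabled
$ sysctl net.ipv6.conf.eth0.accept_ra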
Aside from the usual software updates (sudo apt update && sudo apt upgrade --auto-remove && sudo rpi-eeprom-update -a && ...) I had to format and mount the NVMe drives into the filesystem. I didn’t want to put everything on NVMe though, just use it as storage (described later on).
This fstab stuff is very standard, but this post felt incomplete without it.
$ # Just do everything as root
$ sudo -s
# # Make an /nvme directory to mount to
# mkdir /nvme
# # Make an EXT4 partition on nvme0n1 with a terminal UI
# cfdisk /dev/nvme0n1
# # Format the newly created partition
# mkfs.ext4 /dev/nvme0n1p1
# # Get the UUID of /dev/nvme0n1p1
# blkid
# # Append a new line to /etc/fstab
# echo "UUID=(the uuid) /nvme ext4 defaults,noatime,nofail 0 2" >> /etc/fstab
Some mount options I learned while setting that up:
- defaults - use the default options
- noatime - a small performance tweak
- nofail - don’t block system startup if the mount fails
nofail is useful because these machines run headless. I would rather the system at least start up from the microSD card, so that I could SSH in and poke around to find out what went wrong.
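To pick up the new fstab entry without a reboot, the usual incantation works (still as root; systemd likes a daemon-reload after fstab edits):
# systemctl daemon-reload
# mount -a
# # Check it mounted where expected
# findmnt /nvme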
Installing k3s
I have a repository in my Forgejo which has various notes I’ve written and commands I’ve run and manifests I’ve deployed. (Taking notes from the beginning of a project is very worthwhile and highly recommended. Ask me how I know.)
Here’s how I installed k3s servers on the Raspberry Pis.
$ # Download the install script so I can look at it before running
$ curl -o k3s-install.sh -sfL https://get.k3s.io
$ chmod +x k3s-install.sh
$
$ # In my password manager: create a K3S_TOKEN to use in the commands below.
$
$ # Run the script with --cluster-init on one of the Raspberry Pis
$ K3S_TOKEN='[REDACTED]' ./k3s-install.sh server \
--cluster-init \
--tls-san=pi5a.internal \
--cluster-cidr=10.[REDACTED].0.0/16,2404:e80:18b:[REDACTED]::/56 \
--service-cidr=10.[REDACTED].0.0/16,2404:e80:18b:[REDACTED]::/112
$
$ # Run the script on the others, joining to the cluster.
$ K3S_TOKEN='[REDACTED]' ./k3s-install.sh server \
--server https://pi5a.internal:6443 \
--tls-san=pi5b.internal \
--cluster-cidr=10.[REDACTED].0.0/16,2404:e80:18b:[REDACTED]::/56 \
--service-cidr=10.[REDACTED].0.0/16,2404:e80:18b:[REDACTED]::/112
$
$ K3S_TOKEN='[REDACTED]' ./k3s-install.sh server \
--server https://pi5a.internal:6443 \
--tls-san=pi5c.internal \
--cluster-cidr=10.[REDACTED].0.0/16,2404:e80:18b:[REDACTED]::/56 \
--service-cidr=10.[REDACTED].0.0/16,2404:e80:18b:[REDACTED]::/112
Now I have a high-availability dual-stack k3s cluster!
Regarding --cluster-cidr and --service-cidr - the k3s docs clearly call out that:
Dual-stack networking must be configured when the cluster is first created. It cannot be enabled on an existing cluster once it has been started as IPv4-only.
I ignored this the first time, and resolved it by… nuking the cluster and starting again from scratch. Learn from my mistake!
Next, I got the cluster credentials onto my laptop so I could run kubectl and k9s. These are in /etc/rancher/k3s/k3s.yaml. It’s the same format as ~/.kube/config, and it was pretty straightforward to manually integrate with my existing config in a text editor.
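If you’d rather script that than hand-edit, here’s a rough sketch of the same thing. It assumes GNU sed on the laptop and passwordless sudo on the Pi; note that k3s writes server: https://127.0.0.1:6443 by default, so it needs pointing at a real node, and you may want to rename the context first so it doesn’t collide with an existing one called “default”:
$ # k3s.yaml is only readable by root on the server
$ ssh pi5a.internal sudo cat /etc/rancher/k3s/k3s.yaml > k3s.yaml
$ sed -i 's|https://127.0.0.1:6443|https://pi5a.internal:6443|' k3s.yaml
$ # Merge with the existing kubeconfig and swap it in
$ KUBECONFIG=~/.kube/config:$PWD/k3s.yaml kubectl config view --flatten > config.merged
$ mv config.merged ~/.kube/config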
One final tweak - because the NVMe drives are so much snappier than even the best microSD card, I shut down k3s and moved the whole of /var/lib/rancher/k3s to /nvme/k3s, replacing /var/lib/rancher/k3s with a symlink to /nvme/k3s, before restarting k3s:
$ sudo -s
# systemctl stop k3s
# mv /var/lib/rancher/k3s /nvme/k3s
# ln -s /nvme/k3s /var/lib/rancher/k3s
# systemctl start k3s
Relocating the k3s directory is probably possible via a config file option or flag, but I couldn’t get it to work that way. The etcd metrics confirm that disk writes are about 10x faster.
Once the cluster was up, I could install the other nodes. I was happy that k3s.io wasn’t serving malware via scripts, so on these nodes I ran:
curl -sfL https://get.k3s.io | \
K3S_URL=https://pi5a.internal:6443 \
K3S_TOKEN='[REDACTED]' \
sh -
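Once the agents come up, everything should appear from the laptop with a quick check; the Pis report the control-plane/etcd roles and the agents show none:
$ # Servers show control-plane,etcd,master roles; agents show <none>
$ kubectl get nodes -o wide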
Spegel
k3s ships an embedded container image mirror called Spegel, which is supposed to work in a peer-to-peer fashion, serving container images out of the nodes’ local containerd caches. This was appealing to me as I wanted to try making the cluster resilient to Docker Hub rate limits and registry outages (including my own).
There are multiple steps for enabling it, some of which have to happen on the servers and some on all nodes (I think).
On all the server nodes, create / add to the config file /etc/rancher/k3s/config.yaml:
embedded-registry: true
On all nodes (server and agent), create / add to the registries file /etc/rancher/k3s/registries.yaml:
mirrors:
  "*":
With those in place, I made it pick up the changes with systemctl restart k3s or systemctl restart k3s-agent.
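I don’t have a rigorous way to prove Spegel is doing its thing, but the k3s docs say the embedded registry needs TCP port 5001 open between nodes, so at minimum there should be a listener there after the restart. A rough check (the port number is an assumption based on those docs):
$ # Something k3s-ish should be listening on 5001 on each node
$ sudo ss -tlnp | grep 5001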
Load balancers and reverse proxies
By default, each node in the k3s cluster has ports exposed on the LAN corresponding to every Kubernetes Service of type LoadBalancer.
There’s no particular reason I needed to change this, but rather than have every node expose ports with ServiceLB, I picked the four nodes with static IP addresses and some kind of UPS power (the Raspberry Pis and one other) to run ServiceLB. This is accomplished by labelling those nodes:
$ kubectl label node pi5a svccontroller.k3s.cattle.io/enablelb=true
$ kubectl label node pi5b svccontroller.k3s.cattle.io/enablelb=true
$ kubectl label node pi5c svccontroller.k3s.cattle.io/enablelb=true
$ kubectl label node asdf svccontroller.k3s.cattle.io/enablelb=true
As soon as you label one node this way, k3s figures out not to run ServiceLB on any other nodes.
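ServiceLB shows up as svclb-* pods in kube-system, so it’s easy to check that they only land on the labelled nodes (the grep is just a convenience):
$ kubectl -n kube-system get pods -o wide | grep svclb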
By default, the Traefik Helm chart deploys a Service covering the usual HTTP ports (80 and 443). Traefik then reverse-proxies HTTP(S) requests to other Services (typically ClusterIP ones) as defined by Ingress resources.
I wanted HTTP/3 support, which Traefik supports! But k3s (or perhaps rather the default Traefik Helm chart) does not enable HTTP/3 by default. So I kubectl apply -f-ed this:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    ports:
      websecure:
        http3:
          enabled: true
          advertisedPort: 443
Alas, because of Kubernetes shenanigans, enabling the same port on both UDP and TCP requires creating the Service with both at the same time, and if Traefik’s default Service is already deployed, the same port on the other layer 4 protocol won’t be opened. The fix is to delete the Helm release, bounce k3s, and apply the HelmChartConfig again (but with some change to the content for it to detect a change and install):
$ helm delete traefik -n kube-system # once
$ sudo systemctl restart k3s # on each ServiceLB node
$ kubectl apply -f traefik/config.yaml
I also eventually learned that for Traefik to be able to see real source IP addresses of requests, I would need to tweak a few other options. There are two ways, but after a bit of research I believe the externalTrafficPolicy: Local plus “Traefik on same node as ServiceLB” method is superior to host-networking mode.
I also learned of a few other Traefik options that I thought would be cool. Here’s the entire Traefik HelmChartConfig resource I use (with comments):
# Applying this manifest will kick off the helm-controller and
# cause Traefik to be reinstalled.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    # <3 Traefik
    global:
      sendAnonymousUsage: true
    # Don't check servers in the cluster for TLS, they're all mostly
    # plain HTTP and AFAICT the NSA hasn't? MitMed my home network...yet?
    globalArguments:
      - "--serversTransport.insecureSkipVerify=true"
    # Run 4 instances of Traefik (one for each ServiceLB node)
    deployment:
      replicas: 4
    # Expose Prometheus /metrics
    metrics:
      prometheus:
        entryPoint: metrics
        addRoutersLabels: true
        addServicesLabels: true
    # Only run Traefik pods on the LB-enabled nodes
    nodeSelector:
      svccontroller.k3s.cattle.io/enablelb: "true"
    # Enable HTTP/3 on the standard port
    ports:
      websecure:
        http3:
          enabled: true
          advertisedPort: 443
    # Let Traefik see the real source IP of requests.
    # This may also need Traefik on the same node as ServiceLB, hence
    # running 4 Traefiks across the nodes with ServiceLB enabled.
    service:
      spec:
        externalTrafficPolicy: Local
    # Avoid putting multiple instances of Traefik on the same node
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: '{{ template "traefik.name" . }}'
                app.kubernetes.io/instance: '{{ .Release.Name }}-{{ include "traefik.namespace" . }}'
            topologyKey: kubernetes.io/hostname
Instead of Ingress resources for defining routes in Traefik, you can use Traefik custom resources, or enable Gateway API support, but for now I am sticking with standard Kubernetes Ingresses because of reasons I will get to shortly.
More networking stuff
At home I have a fairly conventional nbn internet connection. But I got a static IPv4 address and a static IPv6 /48 range, which is pretty sweet.
Access from the LAN IPv4 range to the internet is accomplished with the usual Source NAT (SNAT, or src-nat). That is, the router (which is configured as the “gateway” or default route on the LAN) replaces the source IP address in outgoing packets with its address (my one static IPv4 address). It keeps a record of the connection (what TCP or UDP port it sent the packet from), and when it receives a reply packet, replaces the destination IP address with the source from the original outgoing packet before sending back to that host on the LAN.
In the form of Mikrotik RouterOS commands, that src-nat rule looks like:
/ip firewall nat add chain=srcnat src-address=192.168.[LAN]/24 action=src-nat to-addresses=103.205.28.157 out-interface=WAN
Routers with a dynamic public IP address typically use a masquerade action in place of src-nat, but I have a static IP address, so that’s a detail I don’t have to worry about.
“Port forwarding” from the internet back into the local network needs Destination NAT (DNAT/dstnat). A host on the internet is sending the router a packet. Instead of the router going “oh that’s for me”, it notices that it’s for a destination port configured for dstnat, then replaces the destination IP address of the packet with the LAN host. It also tracks the connection similarly to srcnat, so that reply packets are matched to the same connection. Here it is for HTTPS (port 443):
/ip firewall nat add chain=dstnat action=dst-nat dst-address=103.205.28.157 dst-port=443 to-addresses=192.168.[SERVER] to-ports=443 protocol=tcp
But there’s typically a problem with this, in the form of “I want to connect to the port-forwarded service from within the LAN”. When computer A sends a packet to computer B, it expects responses from computer B. If the router has replaced the destination with the address of computer C, which replies to A from itself, then A will get a reply from C, which it’s not expecting, and will typically ignore.
Resolving this requires combining both dst-nat and src-nat into a hairpin NAT or “NAT loopback”. This solves the problem by replacing both addresses in the packet: computers A and C both think they are communicating with B.
Before the dst-nat rule we add a src-nat rule:
/ip firewall nat
add action=masquerade chain=srcnat dst-address=192.168.[SERVER] out-interface=LAN protocol=tcp src-address=192.168.[LAN]/24
With both rules in place, when the router receives a LAN packet addressed to its external address on a “forwarded” port, it:
- replaces the source address with its own address
- replaces the destination address with the server’s address
That’s great when you have only one server you are port-forwarding to, but I have four. I did it the complicated way, with per-connection classifier rules that mark connections as either 1st_conn, 2nd_conn, 3rd_conn, or 4th_conn depending on a hash of source address and port, and then repeated dst-nat rules for each connection mark. (Plus some address lists to make things a little neater.)
;;; Classifier rules (shared among all forwarded ports)
/ip/firewall/mangle
add chain=prerouting action=mark-connection new-connection-mark=1st_conn passthrough=yes per-connection-classifier=src-address-and-port:4/0
add chain=prerouting action=mark-connection new-connection-mark=2nd_conn passthrough=yes per-connection-classifier=src-address-and-port:4/1
add chain=prerouting action=mark-connection new-connection-mark=3rd_conn passthrough=yes per-connection-classifier=src-address-and-port:4/2
add chain=prerouting action=mark-connection new-connection-mark=4th_conn passthrough=yes per-connection-classifier=src-address-and-port:4/3
/ip/firewall/nat
;;; hairpin tcp -> servers; shared among all forwarded ports
add chain=srcnat action=masquerade protocol=tcp src-address-list=not_in_internet dst-address-list=web
;;; hairpin udp -> servers; shared among all forwarded ports
add chain=srcnat action=masquerade protocol=udp src-address-list=not_in_internet dst-address-list=web
;;; HTTPS port forwarding
add chain=dstnat action=dst-nat to-addresses=192.168.[SERVER1] to-ports=443 protocol=tcp dst-address-list=public_ip connection-mark=1st_conn dst-port=443
add chain=dstnat action=dst-nat to-addresses=192.168.[SERVER2] to-ports=443 protocol=tcp dst-address-list=public_ip connection-mark=2nd_conn dst-port=443
add chain=dstnat action=dst-nat to-addresses=192.168.[SERVER3] to-ports=443 protocol=tcp dst-address-list=public_ip connection-mark=3rd_conn dst-port=443
add chain=dstnat action=dst-nat to-addresses=192.168.[SERVER4] to-ports=443 protocol=tcp dst-address-list=public_ip connection-mark=4th_conn dst-port=443
;;; (...several other ports forwarded to 4 destinations...)
Then I did basically the same thing (minus the unnecessary src-nat) for IPv6. I could simply have put the IPv6 addresses of the servers in the DNS records for the domains they’re serving, but this way I can have some router-level influence over load-balancing.
cert-manager
cert-manager lets you get TLS certs from Let’s Encrypt without having to centralise certificate handling (e.g. Caddy style). The certs are stored in Kubernetes (as Secrets), and Traefik knows how to use a Secret for TLS. cert-manager takes care of intercepting the ACME validation requests and renewing the certs, and it knows which things to get certs for by looking at Ingresses (among other ways).
The guide on K3s Rocks confused me a bit at first, but now it all makes sense. In the end I think it was very helpful.
After installing cert-manager (by deploying its manifest), you have to do a few things, but after that it takes care of itself:
- cert-manager needs to know that it can request certificates from Let’s Encrypt. That’s what the ClusterIssuer resource is for (a minimal sketch follows this list). Don’t forget to replace ${EMAIL} with your email address. You only have to create this once in the cluster.
- Sometimes you want Traefik to redirect all HTTP requests to HTTPS. That’s what the Middleware is for. Again, the middleware only has to be created once in the cluster.
- Finally, whoami-ingress-tls.yaml is an example of an Ingress for which cert-manager figures out that you want a certificate. The important parts are the cert-manager.io/cluster-issuer: letsencrypt-prod annotation, which lets cert-manager know to use the ClusterIssuer created above, and the tls block, which lets Traefik know to use TLS. The traefik.ingress.kubernetes.io/router.middlewares: default-redirect-https@kubernetescrd annotation is an example of using an annotation to tell Traefik to apply the redirect-https Middleware that exists in the default namespace.
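For completeness, the ClusterIssuer ends up looking roughly like this. It is a sketch from memory of the K3s Rocks version, so treat the privateKeySecretRef name and the http01/Traefik solver as illustrative rather than gospel:
$ kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: ${EMAIL}  # replace with your email address
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-account-key  # name is arbitrary
    solvers:
      - http01:
          ingress:
            class: traefik
EOF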
With that out of the way, I created TLS-enabled Ingresses for the various servers I’m hosting based on the example, as I migrated them from Caddy to Traefik.
Anubis
Anubis is a Web AI Firewall Utility (nice) that I’ve been using to slow down the terribly-written “AI” scrapers. Here are a few things I learned running several Anubises across a few domains:
- Actually read the docs. There are a lot of cool config options now.
- Set ED25519_PRIVATE_KEY_HEX. It’s worth it, just trust me. I guess it’s OK to share the same value across all the domains you might be hosting? Anyway, I have a Secret with this value that every Anubis container gets access to via envFrom (see the sketch just after this list).
- Set COOKIE_DOMAIN. I found out that, without this, running separate Anubises for subdomains and the primary domain (such as this site you are reading right now and gitea.drjosh.dev) leads to rejected challenge solutions.
- As soon as you have more than 1 Anubis running for a domain (e.g. your Deployment has an Anubis and another container in the pod template, and you decide to set replicas: 3 or something for “high availability”), set up storage. This requires(?) writing a policy file, which is a tad more complicated than just deploying the Anubis container with a few env vars, and it seems you can’t just skip sections where you’re happy with the defaults. But it is worth it, such as later on when Anubis is blocking something that you don’t want it to.
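Creating that shared key Secret is a one-liner. If I remember right, the Anubis docs suggest generating the seed with openssl; the Secret name below matches what the Deployment later in this post pulls in via envFrom:
$ kubectl create secret generic shared-anubis-ed25519-key \
    --from-literal=ED25519_PRIVATE_KEY_HEX="$(openssl rand -hex 32)"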
I use a shared Valkey install for all domains (again, this seems fine?). And there is a Helm chart for Valkey, so I use that:
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: anubis-valkey
  namespace: kube-system
spec:
  targetNamespace: anubis
  createNamespace: true
  chart: valkey
  repo: https://valkey.io/valkey-helm/
  version: 0.7.7
  set:
    replicaCount: 3
All my Anubises share a policy file, set via a ConfigMap. The main purpose of the policy file is to get it to use Valkey, but I added a couple of custom rules too:
apiVersion: v1
kind: ConfigMap
metadata:
  name: shared-anubis-policy
  namespace: default
data:
  policy.yaml: |
    bots:
      # Anubis 1.23 default config *should work*
      - import: (data)/meta/default-config.yaml
      # Allow "Gitea" RSS feeds
      - import: (data)/apps/gitea-rss-feeds.yaml
      # ...(more rules)...
    dnsbl: false
    store:
      backend: valkey
      parameters:
        url: "redis://anubis-valkey.anubis.svc:6379/0"
    # ...(more stuff copied from the default policy file)...
Putting it all together
In configuration terms, a typical website that I host now looks like this manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: awakeman-dot-com
  labels:
    app: awakeman-dot-com
spec:
  replicas: 3
  selector:
    matchLabels:
      app: awakeman-dot-com
  template:
    metadata:
      labels:
        app: awakeman-dot-com
        proxy: anubis
    spec:
      containers:
        - name: awakeman-dot-com
          image: gitea.drjosh.dev/josh/awakeman.com:2025.1103.0@sha256:c55a3da7166f106dc2ea3025f5298a5bb63bce506943241a88565aa69d43627e
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 80
          livenessProbe:
            httpGet:
              port: http
              path: /
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 1
            timeoutSeconds: 15
            successThreshold: 1
            failureThreshold: 8
        - name: anubis
          image: ghcr.io/techarohq/anubis:v1.23.0@sha256:04d7d67f1d09e8fa87ba42108f920eb487bb75b43922a009ea87bbcf1b5b5c0b
          imagePullPolicy: IfNotPresent
          env:
            - name: BIND
              value: ":8080"
            - name: COOKIE_DOMAIN
              value: awakeman.com
            # - name: DIFFICULTY
            #   value: "3"
            - name: METRICS_BIND
              value: ":9090"
            - name: POLICY_FNAME
              value: /etc/anubis/policy.yaml
            # - name: SERVE_ROBOTS_TXT
            #   value: "true"
            - name: TARGET
              value: "http://localhost:80"
          envFrom:
            - secretRef:
                name: shared-anubis-ed25519-key
          ports:
            - name: proxy
              containerPort: 8080
            - name: anubis-metrics
              containerPort: 9090
          livenessProbe:
            httpGet:
              port: anubis-metrics
              path: /healthz
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 15
            successThreshold: 1
            failureThreshold: 8
          volumeMounts:
            - name: shared-anubis-policy
              mountPath: /etc/anubis
      volumes:
        - name: shared-anubis-policy
          configMap:
            name: shared-anubis-policy
---
apiVersion: v1
kind: Service
metadata:
  name: awakeman-dot-com-frontend
spec:
  selector:
    app: awakeman-dot-com
  ports:
    - name: web
      port: 80
      # targetPort: http
      targetPort: proxy
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: redirect-to-awakeman-dot-com
spec:
  redirectRegex:
    regex: '^https?://www\.awakeman\.com/(.*)'
    replacement: "https://awakeman.com/${1}"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: awakeman-dot-com-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    traefik.ingress.kubernetes.io/router.middlewares: default-redirect-https@kubernetescrd,default-redirect-to-awakeman-dot-com@kubernetescrd
spec:
  tls:
    - secretName: awakeman-dot-com-tls
      hosts:
        - awakeman.com
        - www.awakeman.com
  rules:
    - host: awakeman.com
      http: &http
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: awakeman-dot-com-frontend
                port:
                  number: 80
    - host: www.awakeman.com
      http: *http
Remote access and VPN
I’ve been a Tailscale user for a long time, and of course I wanted to continue that tradition. I avoid exposing SSH to the internet by instead letting Tailscale expose itself. (Though I reckon I could make an OpenSSH server pretty secure if I wanted to.)
Setting up Tailscale on the underlying hosts is easy. Setting up the Tailscale Operator lets you access Services via the tailnet. After mucking around with the static manifest and manual Helm installs, these days I deploy the Tailscale Operator via the k3s inbuilt helm-controller.
# I already have the operator-oauth secret in the
# tailscale namespace.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: tailscale-operator
  # The HelmChart needs to go in kube-system in order for k3s's
  # inbuilt helm-controller to see it.
  namespace: kube-system
spec:
  targetNamespace: tailscale
  createNamespace: true # this line not needed anymore
  chart: tailscale-operator
  repo: https://pkgs.tailscale.com/helmcharts
  version: 1.90.5
Walking through what this does:
- The HelmChart resource is created in the kube-system namespace.
- k3s’s inbuilt helm-controller sees it, and then runs Helm commands to fetch and install the Helm chart.
- Helm expands the chart (which is a collection of templated resources) and creates those.
- Some of those resources (Deployment, StatefulSet, etc.) then create more resources (Pod, etc.). (A quick way to watch all of this happen is sketched below.)
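Watching that chain is mostly a matter of knowing where to look. The resource names here follow from the manifest above, and the helm-install job name is roughly what k3s generates:
$ # The HelmChart resource and the job the helm-controller runs for it
$ kubectl -n kube-system get helmcharts
$ kubectl -n kube-system get jobs | grep tailscale
$ # ...and eventually the operator lands in the target namespace
$ kubectl -n tailscale get pods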
With the operator installed, exposing a service to the tailnet is then a matter of creating a Service with type: LoadBalancer and loadBalancerClass: tailscale. (There are other ways but I find this the easiest.) For example, here’s how I expose Forgejo’s SSH port to the tailnet, which (since I’m the only committer) lets me not expose the SSH port to the internet:
apiVersion: v1
kind: Service
metadata:
  name: forgejo-tailscale-frontend
  labels:
    app: forgejo
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app: forgejo
  ports:
    - name: ssh
      protocol: TCP
      port: 22
      targetPort: ssh
Walking through how that works:
- The Service is created.
- The Tailscale Operator notices that the Service has been created.
- The Tailscale Operator compares the current state of the cluster (is there a pod running Tailscale for this service?) against its desired state (there should be a pod running Tailscale for this service!).
- The Tailscale Operator creates / deletes a pod running Tailscale, as needed. (Checking the result is sketched below.)
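The end result is visible on the Service itself: the proxy’s tailnet address (and MagicDNS name, depending on your operator settings) shows up as the external IP, and that is what goes into the git remote for SSH over the tailnet:
$ # EXTERNAL-IP shows the tailnet address once the proxy pod is up
$ kubectl get service forgejo-tailscale-frontend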
This is starting to feel delightfully like a Rube-Goldberg device!
Storage
For storage I stuck with Rancher projects, and installed Longhorn. This has mostly been great, but has also been the source of a few headaches.
4 GB of RAM is the minimum recommended for Longhorn, but I believe that taking this at face value has been the main source of my misery. On nodes with only 4 GB of RAM, even though there was hardly any observable memory pressure or even swap usage, I noticed that Longhorn components (anything with the longhorn-manager binary) got OOM-killed a lot.
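For what it’s worth, the kernel log on the affected node is the quickest place to spot those kills (standard Linuxy stuff again):
$ # OOM kills are logged by the kernel on the node in question
$ sudo dmesg -T | grep -i 'out of memory'
$ # or, via the journal
$ journalctl -k | grep -i oom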
I also learned that if a Longhorn config option is marked “Experimental”, stay away! I tried enabling “RWX Volume Fast Failover”, thinking that it might help one day. That was a mistake - etcd was swamped with traffic from Longhorn until I turned it back off.
Some other Longhorn lessons I’ve learned:
- Keep volumes as small as possible. Volumes can be expanded later on. Writes will eventually fill the empty space, and reclaiming that space requires a filesystem trim, which might not free anything. Reducing the total possible space a volume takes up speeds up rebuilding, moving replicas, and restoring backups.
- For manually created volumes, where you want to stick some files in yourself, don’t set RWX mode (ReadWriteMany) until you’ve finished setting up the volume. I’m not sure what maintenance mode does, but it doesn’t seem very useful.
- “Default Data Locality” set to disabled is fine, actually. Longhorn spends less time moving data onto the node that just had a pod scheduled. “Default Data Locality” is probably good if there’s one replica? If there are multiple replicas of the volume, writes wait on network anyway. (Right?)
- “Replica Auto Balance” set to best-effort is probably a better way to spread replicas around nodes than “Default Data Locality”.
- “Pod Deletion Policy When Node is Down” set to delete-both-statefulset-and-deployment-pod is also fine. Useful when you want new pods using volumes to be recreated, rather than just be stuck.
- A system backup by default includes all the volumes. I already have configured backups for the volumes I care about. So I have to remember to disable “Volume Backup Policy” every time.
- Be absolutely sure before upgrading to a new version and don’t skip the release notes. Downgrading is at worst impossible and at best a full delete-and-restore-from-backup. (Ask me how I know.)
Turning stuff off
I fiddle with nodes in the cluster a lot, and sometimes I want to take something offline (for a while or permanently). My MOP for this is now basically:
- Go to the Longhorn UI and edit the node. Disable “Node Scheduling” and enable “Eviction Requested”. This evicts all the volumes from the node.
- Wait until the node has no more Longhorn volumes on it. This happens when enough replicas are running on the other nodes.
- kubectl drain [NODE] --ignore-daemonsets --delete-emptydir-data (but first be sure that your emptyDir data is as unimportant as mine!). This cordons the node (preventing new workloads) and evicts all pods. New pods should be spun up elsewhere (because that’s how Deployments and StatefulSets work).
- SSH to the node and systemctl disable k3s-agent. This stops it rejoining the cluster later on.
That’s it for now! I’m out of words. See you next time!
☆