My self-hosting journey

Once upon a time, I decided that I really wanted to start self-hosting many of my services, specifically my photos and files. I did this for a few reasons:

  1. I want control of my data
  2. I was not happy with existing options (Google Drive, Proton Drive, etc. all have slow syncing and don’t have ignore rules)

A First Version: Monolith

I took a bunch of old PC parts I had lying around and scraped together a mid-tower build. I set up Arch Linux on it and ran a few things in Docker containers:

  1. NextCloud for files, photos, and CalDAV
  2. Tailscale Serve to expose services under a public URL that still requires users to be on my Tailnet (sketch below)

    • creates domains like <service>.<tailnet name>.ts.net
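
For example, exposing a single container looked roughly like this (the port is made up, and the exact flag syntax has changed between Tailscale versions, so check tailscale serve --help):

# serve a local HTTP service over HTTPS at this node's ts.net domain;
# --bg keeps it running in the background
tailscale serve --bg localhost:8080

# see what this node is currently serving
tailscale serve status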

Tailscale made remote management easy. But I had a problem: my disk is encrypted, so what happens if my server reboots? It’ll just sit at the cryptlvm passphrase prompt. To fix this I added mkinitcpio-dropbear, which bakes a Dropbear SSH server into the initramfs on /boot and runs it before the disk is decrypted, so I can SSH in and unlock it remotely.
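
The relevant pieces, roughly (the netconf, dropbear, and encryptssh hooks come from the mkinitcpio-netconf, mkinitcpio-dropbear, and mkinitcpio-utils AUR packages; treat the exact HOOKS line as a sketch):

# /etc/mkinitcpio.conf -- netconf brings up the NIC, dropbear starts the SSH
# server, and encryptssh asks for the LUKS passphrase over that session.
# They must run before the filesystems hook.
HOOKS=(base udev autodetect modconf block netconf dropbear encryptssh filesystems fsck)

# public keys allowed to log in during early boot
cat ~/.ssh/id_ed25519.pub >> /etc/dropbear/root_key

# rebuild the initramfs so the hooks get baked in
mkinitcpio -P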

I Hate NextCloud

NextCloud continuously frustrated me with how complex, slow, and overall just bad it was. File syncing was a horrible experience and it wouldn’t let me add ignore rules (ex. for node_modules, target). It felt over-engineered for what I was doing.

I tried a few competitors to NextCloud, but none of them really had what I wanted. So, I decided to scrap the frontend entirely, opting to use Syncthing for my files. This was genuinely an amazing experience. Not only did it support proper ignore rules, it was also super fast. Combining it with GNU Stow let me sync my dotfiles between my laptop and PC.
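
To give a flavor of both (paths and package names here are just examples), ignore rules live in a .stignore file at the root of each synced folder, and Stow symlinks each dotfile package into $HOME:

# Syncthing ignore rules -- the (?d) prefix lets Syncthing delete ignored
# files when they are the only thing blocking a directory removal
cat > ~/sync/code/.stignore <<'EOF'
(?d)**/node_modules
(?d)**/target
EOF

# dotfiles: one directory per program, mirroring $HOME, linked in with Stow
cd ~/sync/dotfiles
stow -t ~ nvim fish git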

More Services

I added:

  1. Immich for photos (very much like Google Photos)
  2. Radicale for CalDAV
  3. Kanidm for OIDC (single sign-on)
  4. linkding, mealie, SearXNG, and more

As I started adding services, I needed a way to manage them. I had a folder in /home which contained folders for each service. Each service had its own docker-compose.yml. I created a variety of bash scripts to update and manage all of them.
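
The scripts were nothing fancy; the core of the update script looked something like this (the ~/services path is illustrative, and older installs would call docker-compose instead of docker compose):

#!/usr/bin/env bash
# pull new images and recreate containers for every service folder
set -euo pipefail

for dir in "$HOME"/services/*/; do
  echo "==> $(basename "$dir")"
  (
    cd "$dir"
    docker compose pull    # fetch newer images
    docker compose up -d   # recreate anything that changed
  )
done

docker image prune -f      # drop superseded images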

While this worked fine, I wanted something more purpose-built for what I was doing. Furthermore, I just did not feel good enough about the security of this setup to expose services to the internet.

Moving to Kubernetes: Blizzard

Building a Rack

I took apart my server PC and sold each component. With this money, I was able to find some great Facebook Marketplace deals for old servers.

I made a rack in my closet with the following:

  1. Ubiquiti Cable Internet (Cable Modem)
  2. Ubiquiti UDM Pro
  3. Ubiquiti 10G switch (from my Dad)
  4. 2x Dell PowerEdge R430

    • dual E5-2640v3
    • 128GB ECC DDR4 2133MHz
    • 4x 750GB SAS SSDs
    • Intel(R) 10GbE 2P X710 Adapter
    • PERC H730P Mini (RAID controller I bought separately)
    • Dual 550W power supplies
    • Dual (redundant) SD card module (IDSDM)
  5. 1x Intel Skull Canyon NUC (from my Dad)
  6. Rack mount UPS

OS

I installed Talos Linux on both Dell servers and the NUC. All three were configured as both control plane and worker nodes.

Talos Linux is a super cool operating system, meant solely to run Kubernetes (k8s). It’s an immutable OS with a very small attack surface. You just configure each node with a YAML file and boot it. It doesn’t even have SSH!
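
Bringing a node up is just a handful of talosctl commands; roughly (cluster name and IPs are made up):

# generate machine configs for a new cluster
talosctl gen config blizzard https://10.0.0.10:6443

# push a config to a node booted from the Talos ISO
talosctl apply-config --insecure --nodes 10.0.0.11 --file controlplane.yaml

# bootstrap etcd exactly once, on one control-plane node
talosctl bootstrap --nodes 10.0.0.11 --endpoints 10.0.0.11

# no SSH: everything afterwards goes through the Talos API
talosctl dashboard --nodes 10.0.0.11 --endpoints 10.0.0.11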

Services

I used Helm and helmfile to configure my cluster.
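
Every service was a Helm release declared in helmfile.yaml, and the day-to-day loop looked something like:

helmfile diff    # show what would change across every release
helmfile apply   # diff, then sync only the releases that changed
helmfile sync    # force the cluster to match helmfile.yaml exactly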

I ran many of the same services I had on the old setup, along with a few new additions.

Getting all my old services running here took a lot of time. Very few self-hosters run Kubernetes, so almost none of these services make it easy to deploy there.

Results

Overall I was super happy with this setup. The security model felt solid, it had HA, and I exposed it publicly to the internet on <service>.liamsnow.com domains.

However, it limited what services I could run. I gave up trying to run Jellyfin (and accompanying services). It was also quite a pain to manage, because I am just not a Kubernetes expert. In many ways it taught me I never wanted to work in infrastructure lol.

The Blizzard Revolution

I made a giant mistake in this cluster: since Talos is immutable, I placed the OS on the dual redundant SD cards and used the RAID array entirely for mayastor.

While I was studying abroad in Morocco, all my services randomly went down. After a long investigation, I think I understand what happened:

  1. Flake 1 (a Dell server) lost IDSDM SD card redundancy when one of its SD cards failed. Soon after, it lost the other one.

    • While Talos is immutable, it keeps an ephemeral partition for logs, container data, and etcd data (in this case, on the IDSDM)
    • SD cards degrade quickly under that kind of write load
    • This never alerted me because I hadn’t finished setting up iDRAC alerts over ntfy
  2. Flake 1 didn’t go out cleanly. It was holding on for dear life:

    • Its corrupted SD card made etcd’s write-ahead log (WAL) inconsistent. etcd lost track of what it had actually persisted, so its view of the Raft log diverged from what it had already communicated to Flake 2 & 3
    • Before fully dying, it was sending stale or malformed AppendEntries RPCs to the other nodes, poisoning their view of the log
    • It kept timing out as leader or failing heartbeats, triggering re-elections
    • Once Flake 1 was fully down, Flake 2 & 3 were left disagreeing with each other on the state of the cluster

So, I worked on getting Flake 1 back up and running. I had to:

  1. SSH tunnel iDRAC through my PC so I could remotely manage Flake 1 (sketch after this list)
  2. Split the RAID array into two parts (one for Talos, one for mayastor)
  3. Reinstall Talos using virtual media in iDRAC
  4. Add the node back into the cluster
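
The tunnel from step 1 was plain SSH port forwarding, roughly like this (hostnames are made up; iDRAC serves its web UI on 443 and the virtual console/media on 5900):

# forward the iDRAC web UI and virtual console through my home PC
ssh -N \
  -L 8443:idrac-flake1.lan:443 \
  -L 5900:idrac-flake1.lan:5900 \
  home-pc

# then open https://localhost:8443 in a browser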

But this didn’t work. The cluster was broken, and Flake 1 couldn’t repair the damage it had done.

I was frustrated and wanted to give up. But I couldn’t go back to a bunch of Docker containers..

NixOS

NixOS seemed like everything I wanted: everything is defined in config files, adding services is easy, and services are properly separated without needing Docker.

It was a great experience. I basically got Immich running with just:

services.immich.enable = true;

I could secure the machine easily, add anything I wanted with ease, rollback when things broke, add automatic updating, and more.
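
Since the whole machine is one declarative config, applying changes and undoing them are each a single command (the hostname and flake path here are assumptions):

# build and activate the configuration
sudo nixos-rebuild switch --flake /etc/nixos#flake2

# something broke? switch back to the previous generation
sudo nixos-rebuild switch --rollback

# list the generations available to roll back to
sudo nix-env --list-generations --profile /nix/var/nix/profiles/system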

But I soon discovered that NixOS is a double-edged sword:

  1. Services work great when they’re popular: their nixpkgs packages and modules are maintained and constantly upgraded
  2. Unpopular services either have no package in nixpkgs, or the package is out of date or completely broken

I do love self-hosting, but I can’t always dedicate that much time to it, which means I’ve never gotten very good at writing Nix derivations for the services I want to run. That makes the experience pretty horrible sometimes.

Even with its faults, I still choose to host most of my services on NixOS.

Helios

I am planning to apply to Oxide. So, I want to familiarize myself with Helios, Oxide’s OS built on illumos.

Since I am just running NixOS on Flake 2, I have Flake 1 free to be my experimentation server. I managed to get Helios running on it with no trouble at all. In fact, I would say it was easier to set up than Arch Linux.

Learning more about illumos has been a great experience. I had some trouble getting a few things running (ex. my fish shell fork), but nothing was too hard.

I have moved my reverse proxy (Caddy) and this website to Helios, with each in their own zone.
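
Zones are illumos’s built-in lightweight containers, managed with zonecfg and zoneadm. A rough sketch of creating one for Caddy (names and paths are made up, and the exact brand/template defaults on Helios may differ):

# define the zone
zonecfg -z caddy 'create; set zonepath=/zones/caddy; set autoboot=true; commit'

# install and boot it
zoneadm -z caddy install
zoneadm -z caddy boot

# get a shell inside the zone
zlogin caddy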

Conclusion

I am super happy I decided to start self-hosting. It has taught me so many invaluable skills and let me take back control of my data.

I’m excited for what the future holds for my homelab!