IV

Oxide Notes

The hardware and software behind Oxide computers

Hardware

Rack

Oxide sells a rack as the unit (as opposed to traditional servers, which are individual units that must be placed into a rack). This has many benefits:

  • Customers don’t have to spend time (and dedicate staff to) building racks, which is much more time-consuming than you’d expect

  • Oxide can control everything & isn’t bound by the constraints of traditional racks

The rack contains:

  • Compute Sleds (the servers)

    • Look more like blades than traditional servers

    • Half-width & placed side-by-side in the rack

    • A rack holds 32x 1st-gen sleds, or 24x 2nd-gen

  • 2x Network Switches

    • Designed by them!

    • Sit in the middle

    • 12.8 Tbit/s

  • 2x Power Shelves

    • Converts AC → 54V DC

    • 1+1 redundant or 2+0 non-redundant

The Sleds themselves do not have power supplies and instead get power from the power shelves (via the bus bar). This allows for:

  • More efficient power supplies

  • Bigger fans → quieter

The Sleds simply slot in (blind mate), which connects them to power and networking (without having to manually wire cables). This makes both setup and maintenance easier.

Compute Sled

Each sled has only 1 processor (single-socket). Why?

  • Simpler, cheaper, faster, more power efficient

  • Never have to be concerned about NUMA

  • A single socket already provides 64-192 cores; dual-socket would only make sense if you truly need to exceed this

  • (I assume) many customers are running distributed workloads on Oxide racks anyway, so a dual-socket system has less of a benefit

10x U.2/U.3 NVMe 2.5-inch (15mm) Bays

Software

Cheat-Sheet

  • Helios: distro of illumos

  • Hubris: embedded OS

  • Propolis: VMM userspace

  • Crucible: block storage service (fulfills a similar use case to mayastor)

  • Omicron: control plane

Helios

TODO

Propolis

A Rust VMM userspace for illumos bhyve (GitHub).

While other illumos distros use bhyve on both the kernel and userspace sides, Oxide only uses it for the kernel side.

  • bhyve ("beehive") is a type 2 (hosted) hypervisor / VMM. It consists of both a kernel module and a userspace process.

Why use a type 2 hypervisor, not a type 1 (bare-metal)? I assume:

  • Hardware support: guest OSes don’t need to understand Oxide’s non-standard hardware (only Helios does)

  • To enable monitoring, debugging, and management

  • Provide security
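
To make the kernel/userspace split concrete, here is a minimal, self-contained sketch of the pattern a userspace VMM like Propolis follows: the kernel-side hypervisor runs guest vCPUs until something (e.g. device I/O) needs userspace attention, and the userspace process emulates the device and resumes. All names here (VcpuExit, KernelVmm, vmm_loop) are illustrative inventions, not the real bhyve or Propolis API.

```rust
/// Reasons a vCPU might exit back to userspace (heavily simplified).
enum VcpuExit {
    /// Guest touched an I/O port the kernel can't handle itself.
    Io { port: u16, write: Option<u8> },
    /// Guest halted.
    Halt,
}

/// Stand-in for the kernel module (bhyve's vmm driver): it runs guest code
/// until something needs userspace attention.
trait KernelVmm {
    fn run_vcpu(&mut self) -> VcpuExit;
}

/// The userspace side: device emulation, policy, monitoring.
fn vmm_loop(kernel: &mut dyn KernelVmm) {
    loop {
        match kernel.run_vcpu() {
            VcpuExit::Io { port, write } => match write {
                // Emulate the device the guest touched, e.g. a serial port.
                Some(byte) if port == 0x3f8 => print!("{}", byte as char),
                _ => { /* read or unhandled port: return emulated data */ }
            },
            VcpuExit::Halt => break,
        }
    }
}

/// A mock "kernel" that emits a couple of exits, just to drive the loop.
struct MockVmm {
    steps: Vec<VcpuExit>,
}

impl KernelVmm for MockVmm {
    fn run_vcpu(&mut self) -> VcpuExit {
        self.steps.pop().unwrap_or(VcpuExit::Halt)
    }
}

fn main() {
    let mut kernel = MockVmm {
        steps: vec![
            VcpuExit::Halt,
            VcpuExit::Io { port: 0x3f8, write: Some(b'i') },
            VcpuExit::Io { port: 0x3f8, write: Some(b'h') },
        ],
    };
    vmm_loop(&mut kernel);
    println!();
}
```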

TODO

Holistic Boot

Oxide’s sleds boot very differently from traditional servers (and desktops). RFC

What Traditional Boot Looks Like

Old computers booted with BIOS; modern ones use UEFI. Both present options, like picking which operating system to boot.

Booting requires I/O, so the BIOS/UEFI must also do initialization. But an OS assumes that nothing is initialized; to solve this, the firmware must "send the machine backwards," making the machine look like it hasn’t been booted.

  • We spent all this time initializing, just to undo it, and have the OS redo it

  • "Sending backwards" is not always perfect and leaves artifacts

Additionally, traditional servers also have:

  • System Management Mode (SMM) (ring -2): handles low-level hardware management (fan control, thermal protection, firmware operations). Can interrupt the kernel or hypervisor at any point. Massive security risk — malware planted here is invisible to the OS.

  • Baseboard Management Controller (BMC): separate service-processor chip on the motherboard for remote management (ex. iDRAC). Full Linux system — large attack surface.

What Oxide Does Instead

An Oxide sled has no:

  • BIOS or UEFI

  • SMM: if SMM is reached → system panics

  • BMC; instead, there is a service processor (SP):

    • 400MHz STM32 (much less powerful than a BMC)

    • Connected to the host CPU over UART

    • Runs Hubris

A Two-Phase, Single-Kernel Boot

The kernel loaded from phase #1 stays resident (no handoff). All hardware initialization happens once (no "sending backwards").

  • Phase #1: Stored on SPI NOR flash. Contains the kernel, boot archive, and enough of the OS to reach phase #2 (network stack, NVMe drivers, etc.)

  • Phase #2: Stored on NVMe SSD. Contains the rest of the OS as a ramdisk image with a small ZFS pool

BSU Redundancy

Every sled has two independent Boot Storage Units:

  • Each BSU = 1 SPI flash + 1 NVMe

  • SP controls which BSU is active

  • Enables updates and recovery
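
A minimal sketch of the A/B-slot pattern that two BSUs make possible. This is my reading of why the redundancy matters, not Oxide’s actual SP logic: keep booting the active slot while it works, and fall back to the other slot if a freshly updated slot fails. The names (Bsu, select_boot_slot) are illustrative.

```rust
// A/B boot-slot selection: write a new image to the inactive slot, switch,
// and fall back to the known-good slot if the new one fails to boot.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Bsu {
    A,
    B,
}

impl Bsu {
    fn other(self) -> Bsu {
        match self {
            Bsu::A => Bsu::B,
            Bsu::B => Bsu::A,
        }
    }
}

/// Decide which BSU the host should boot from, given which slot is marked
/// active and whether the last boot attempt from it succeeded.
fn select_boot_slot(active: Bsu, last_boot_from_active_ok: bool) -> Bsu {
    if last_boot_from_active_ok {
        active // normal case: keep booting the active slot
    } else {
        active.other() // recovery: the freshly updated slot failed, fall back
    }
}

fn main() {
    // After an update, slot B is active; it booted fine, so keep using it.
    assert_eq!(select_boot_slot(Bsu::B, true), Bsu::B);
    // A bad update to slot B: fall back to the known-good slot A.
    assert_eq!(select_boot_slot(Bsu::B, false), Bsu::A);
    println!("fallback logic ok");
}
```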

The Boot Sequence

  1. Stage 0:

  2. phbl (Pico Host Boot Loader), sketched after this list:

    • A "necessary evil" forced by x86 reset vector semantics

    • Decompresses the phase #1 boot archive

    • Locates and loads the kernel ELF image

    • Jumps to kernel entry point

  3. Helios kernel: stays resident, gets hardware info from the SP, mounts phase #2
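
A hosted, simplified sketch of what phbl’s "locate and load the kernel ELF" step involves: parse the ELF64 header, copy each PT_LOAD segment to its load address, and hand back the entry point to jump to. This is not the real phbl code (which runs bare-metal), just the general shape of the procedure; the fake image built in main is only there to exercise it.

```rust
const PT_LOAD: u32 = 1;

fn u64le(b: &[u8], off: usize) -> u64 {
    u64::from_le_bytes(b[off..off + 8].try_into().unwrap())
}
fn u16le(b: &[u8], off: usize) -> u16 {
    u16::from_le_bytes(b[off..off + 2].try_into().unwrap())
}

/// Parse an ELF64 image, "load" each PT_LOAD segment into `phys` (a stand-in
/// for physical memory), and return the entry-point address.
fn load_kernel(elf: &[u8], phys: &mut [u8]) -> u64 {
    assert!(elf.starts_with(b"\x7fELF"), "not an ELF image");
    let entry = u64le(elf, 24);
    let phoff = u64le(elf, 32) as usize;
    let phentsize = u16le(elf, 54) as usize;
    let phnum = u16le(elf, 56) as usize;

    for i in 0..phnum {
        let ph = &elf[phoff + i * phentsize..];
        if u32::from_le_bytes(ph[0..4].try_into().unwrap()) != PT_LOAD {
            continue;
        }
        let offset = u64le(ph, 8) as usize;
        let vaddr = u64le(ph, 16) as usize;
        let filesz = u64le(ph, 32) as usize;
        // Copy the segment's file contents to its load address.
        phys[vaddr..vaddr + filesz].copy_from_slice(&elf[offset..offset + filesz]);
        // (A real loader also zeroes the memsz - filesz tail for BSS.)
    }
    entry // phbl would now jump here; we just return it
}

fn main() {
    // Build a tiny fake ELF: one PT_LOAD segment of 4 bytes at vaddr 0x10.
    let mut elf = vec![0u8; 128];
    elf[0..4].copy_from_slice(b"\x7fELF");
    elf[24..32].copy_from_slice(&0x10u64.to_le_bytes()); // e_entry
    elf[32..40].copy_from_slice(&64u64.to_le_bytes()); // e_phoff
    elf[54..56].copy_from_slice(&56u16.to_le_bytes()); // e_phentsize
    elf[56..58].copy_from_slice(&1u16.to_le_bytes()); // e_phnum
    let ph = 64;
    elf[ph..ph + 4].copy_from_slice(&PT_LOAD.to_le_bytes()); // p_type
    elf[ph + 8..ph + 16].copy_from_slice(&120u64.to_le_bytes()); // p_offset
    elf[ph + 16..ph + 24].copy_from_slice(&0x10u64.to_le_bytes()); // p_vaddr
    elf[ph + 32..ph + 40].copy_from_slice(&4u64.to_le_bytes()); // p_filesz
    elf[120..124].copy_from_slice(b"oxid"); // segment contents

    let mut phys = vec![0u8; 0x20];
    let entry = load_kernel(&elf, &mut phys);
    assert_eq!(&phys[0x10..0x14], &b"oxid"[..]);
    println!("kernel loaded, entry = {:#x}", entry);
}
```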

Why this Architecture?

TODO write about LinuxBoot

Hubris

A lightweight kernel for embedded systems (like the SP). GitHub Docs

  • Cannot load foreign programs at runtime; instead, "tasks" are configured and compiled in

  • Memory-protected

  • Message-passing
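
A hosted simulation of the messaging pattern those bullets describe: a fixed set of tasks, and a client that sends a request and blocks until the server replies. Real Hubris tasks use its syscall-level IPC and are declared in a compile-time app configuration, so everything below (Message, the channels, a thread per task) is only an analogy in std Rust.

```rust
use std::sync::mpsc;
use std::thread;

/// A request from one task to another, carrying a channel for the reply
/// (this models "the sender blocks until the server replies").
struct Message {
    op: u32,
    payload: Vec<u8>,
    reply: mpsc::Sender<Vec<u8>>,
}

fn main() {
    // The set of tasks and their connections is fixed up front, echoing
    // Hubris's compile-time task configuration.
    let (to_server, server_inbox) = mpsc::channel::<Message>();

    // Server task: loop on recv, handle the operation, reply.
    let server = thread::spawn(move || {
        while let Ok(msg) = server_inbox.recv() {
            let response = match msg.op {
                1 => msg.payload.iter().rev().cloned().collect(), // toy "driver" op
                _ => vec![],
            };
            let _ = msg.reply.send(response);
        }
    });

    // Client task: send a request and block on the reply channel,
    // mirroring a synchronous send.
    let (reply_tx, reply_rx) = mpsc::channel();
    to_server
        .send(Message { op: 1, payload: b"sled".to_vec(), reply: reply_tx })
        .unwrap();
    let reply = reply_rx.recv().unwrap();
    println!("client got reply: {:?}", String::from_utf8_lossy(&reply));

    drop(to_server); // closing the channel lets the server task exit
    server.join().unwrap();
}
```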

Humility

TODO

Crucible

TODO

Omicron

TODO