Oxide Notes
The hardware and software behind Oxide computers
Hardware
Rack
Oxide sells a rack as the unit (as opposed to traditional servers, which are individual units that must be placed into a rack). This has many benefits:
- Customers don’t have to spend time (and dedicate employees to) building racks — which is much more time-consuming than you’d expect
- Oxide can control everything and is not bound to the constraints of traditional racks
The rack contains:
- Compute Sleds (the servers)
  - Look more like blades than traditional servers
  - Half-width and placed side-by-side in the rack
  - A rack can hold 32x 1st gen, or 24x 2nd gen
- 2x Network Switches
  - Designed by Oxide themselves!
  - Sit in the middle of the rack
  - 12.8 Tbit/s
- 2x Power Shelves
  - Convert AC → 54V DC
  - 1+1 redundant or 2+0 non-redundant
  - The sleds themselves do not have power supplies and instead draw power from the power shelves (via the bus bar). This allows for:
    - More efficient power supplies
    - Bigger fans → quieter
The sleds simply slot in (blind mate), which connects them to power and networking without having to manually wire cables. This makes both setup and maintenance easier.
Compute Sled
Each sled has only one processor. Why?
- Simpler, cheaper, faster, and more power efficient
- Never have to be concerned about NUMA
- A single socket already provides 64-192 cores — dual-socket would only make sense if you truly need to exceed this
- (I assume) many customers are running distributed workloads on Oxide racks anyway, so a dual-socket system has less of a benefit
Each sled also has 10x U.2/U.3 NVMe 2.5-inch (15mm) bays.
Software
Cheat-Sheet
- Helios: distro of #link("illumos")[illumos]
- Hubris: embedded OS
- Propolis: userspace VMM
- Crucible: block storage service (fulfills a similar use case to Mayastor)
- Omicron: control plane
Helios
TODO
Propolis
A userspace VMM for illumos bhyve, written in Rust (GitHub).
While other illumos distros use bhyve on both the kernel and userspace sides, Oxide only uses it on the kernel side.
- bhyve ("beehive") is a type 2 (hosted) hypervisor / VMM. It consists of both a kernel module and a userspace process.
Why use a type 2 hypervisor rather than a type 1 (bare-metal)? I assume:
- Hardware support: guest OSes don’t need to understand Oxide’s non-standard hardware (only Helios does)
- Enables monitoring, debugging, and management
- Provides security
TODO
Holistic Boot
Oxide’s sleds boot very differently from traditional servers (and desktops). RFC
What Traditional Boot Looks Like
Old computers booted with a BIOS; newer ones use UEFI. Both give you options, like picking which operating system to boot.
Booting requires I/O, so the BIOS must also do hardware initialization. But an OS assumes that nothing is initialized — to solve this, the BIOS must "send the machine backwards," making the machine look like it hasn’t been booted.
- We spend all this time initializing, just to undo it and have the OS redo it
- "Sending backwards" is not always perfect and leaves artifacts
Additionally, traditional servers also have:
- System Management Mode (SMM) (ring -2): handles low-level hardware management (fan control, thermal protection, firmware operations). Can interrupt the kernel or hypervisor at any point. Massive security risk — malware planted here is invisible to the OS.
- Baseboard Management Controller (BMC): a separate service-processor chip on the motherboard for remote management (e.g. iDRAC). A full Linux system, so a large attack surface.
What Oxide Does Instead
It has no:
- BIOS or UEFI
- SMM: if SMM is ever entered → the system panics
- BMC; instead there is a service processor (SP):
  - 400MHz STM32 (much less powerful than a BMC)
  - Connected to the host CPU via UART
  - Runs Hubris
A Two-Phase, Single-Kernel Boot
The kernel loaded in phase 1 stays resident (no handoff). All hardware initialization happens once (no "sending backwards").
- Phase 1: stored on SPI NOR flash. Contains the kernel, boot archive, and enough of the OS to reach phase 2 (network stack, NVMe drivers, etc.)
- Phase 2: stored on an NVMe SSD. Contains the rest of the OS as a ramdisk image with a small ZFS pool
BSU Redundancy
Every sled has two independent Boot Storage Units:
- Each BSU = 1 SPI flash + 1 NVMe
- The SP controls which BSU is active
- Enables updates and recovery
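A sketch of what SP-side BSU selection could look like. The types, names, and fallback policy here are my assumptions, not Oxide's actual Hubris code:

```rust
// Hypothetical sketch of BSU selection with failover. Names and
// policy are assumptions, not Oxide's implementation.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Bsu {
    A,
    B,
}

#[derive(Clone, Copy)]
struct BsuState {
    /// Did the last boot from this BSU succeed?
    last_boot_ok: bool,
    /// Is this BSU currently being rewritten by an update?
    update_in_progress: bool,
}

/// Prefer the currently active BSU if it is healthy; otherwise fall
/// back to the other one. Returns None if neither is bootable.
fn select_bsu(active: Bsu, a: BsuState, b: BsuState) -> Option<Bsu> {
    let state_of = |bsu: Bsu| if bsu == Bsu::A { a } else { b };
    let other = if active == Bsu::A { Bsu::B } else { Bsu::A };
    [active, other].into_iter().find(|&bsu| {
        let s = state_of(bsu);
        s.last_boot_ok && !s.update_in_progress
    })
}

fn main() {
    let healthy = BsuState { last_boot_ok: true, update_in_progress: false };
    let updating = BsuState { last_boot_ok: true, update_in_progress: true };
    let broken = BsuState { last_boot_ok: false, update_in_progress: false };

    // While BSU A is being rewritten by an update, boot from BSU B —
    // this is what makes safe updates possible.
    assert_eq!(select_bsu(Bsu::A, updating, healthy), Some(Bsu::B));
    // Normally, stick with the active BSU.
    assert_eq!(select_bsu(Bsu::A, healthy, healthy), Some(Bsu::A));
    // If neither BSU is bootable, report failure.
    assert_eq!(select_bsu(Bsu::A, broken, broken), None);
    println!("ok");
}
```

The two independent copies mean an interrupted update can never brick a sled: the SP just boots the other BSU.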
The Boot Sequence
- Stage 0:
  - AMD Platform Security Processor (PSP):
    - Internal initialization
    - Releases the bootstrap processor (BSP)
    - Loads phbl from SPI flash
  - SP selects the BSU and prepares for later steps
- phbl (Pico Host Boot Loader):
  - A "necessary evil" forced by x86 reset-vector semantics
  - Decompresses the phase 1 boot archive
  - Locates and loads the kernel ELF image
  - Jumps to the kernel entry point
- Helios kernel: stays resident, gets hardware info from the SP, mounts phase 2
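The sequence can be sketched as a simple stage chain — names are my own, purely to make the ordering and the "no handoff" property explicit; this is not Oxide's code:

```rust
// Illustrative stage chain for the boot sequence described above.
// Variant names are assumptions, not Oxide's actual code.

#[derive(Clone, Copy, Debug, PartialEq)]
enum BootStage {
    /// AMD PSP: internal init, releases the BSP, loads phbl from SPI flash.
    Stage0Psp,
    /// phbl: decompresses the phase 1 archive, loads the kernel ELF,
    /// jumps to its entry point.
    Phbl,
    /// Helios kernel: stays resident — no further handoff.
    HeliosKernel,
}

/// Each stage hands off to exactly one successor; the kernel never
/// hands off (it stays resident).
fn next_stage(current: BootStage) -> Option<BootStage> {
    match current {
        BootStage::Stage0Psp => Some(BootStage::Phbl),
        BootStage::Phbl => Some(BootStage::HeliosKernel),
        BootStage::HeliosKernel => None,
    }
}

fn main() {
    // Walk the whole chain from power-on.
    let mut stage = BootStage::Stage0Psp;
    let mut chain = vec![stage];
    while let Some(next) = next_stage(stage) {
        stage = next;
        chain.push(stage);
    }
    assert_eq!(
        chain,
        [BootStage::Stage0Psp, BootStage::Phbl, BootStage::HeliosKernel]
    );
    println!("{:?}", chain);
}
```

Contrast with traditional boot, where the chain would end with the firmware "handing off" to (and hiding from) the OS rather than a kernel that simply stays.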
Why this Architecture?
TODO write about LinuxBoot
Hubris
Crucible
TODO
Omicron
TODO