Control plane architecture

Note
Much of this material originally came from [RFD 48] and [RFD 61]. This is now the living documentation for all the material covered here.
Note
The RFD references in this documentation may be Oxide-internal. Where possible, we’re trying to move relevant documentation from those RFDs into docs here.

1. What is the control plane

In software systems the terms data plane and control plane are often used to refer to the parts of the system that directly provide resources to users (the data plane) and the parts that support the configuration, control, monitoring, and operation of the system (the control plane). Within the Oxide system, we say that the data plane comprises those parts that provide CPU resources (including both the host CPU and hypervisor software), storage resources, and network resources. The control plane provides the APIs through which users provision, configure, and monitor these resources and the mechanisms through which these APIs are implemented. Also part of the control plane are the APIs and facilities through which operators manage the system itself, including fault management, alerting, software updates for various components of the system, and so on.

Broadly, the control plane must provide:

  • an externally-facing API endpoint described in [RFD 4] through which users can provision elastic infrastructure backed by the system. This includes APIs for compute instances, storage, networking, as well as supporting resources like organizations, users, groups, ssh keys, tags, and so on. This API may be used by developers directly as well as the developer console backend. See [RFD 30].

  • an externally-facing API endpoint for all operator functions. This is a long list, including configuration and management of hardware and software components and monitoring.

  • implementation of lifecycle activities, like initial system setup; adding, removing, or replacing servers or other components; and the like.

  • facilities for remote support by Oxide, including secure access to crash dumps, core files, log files, and system consoles.

2. Fundamental properties

Note
These are design goals. They have not all been fully implemented yet.

Availability. Availability of the control plane refers to the property that requests to provision resources succeed when the underlying resources are available within the system and requests to reconfigure or monitor resources succeed as long as they are well-formed. Unavailability refers to request failure due to hardware or software failure.

Important
Generally, the control plane is expected to remain available in the face of any two hardware or software failures, including transient failures of individual compute sleds, power rectifiers, switches, or the like.

Durability. Along the same lines, resources created in the control plane are expected to be durable unless otherwise specified. That is, if the whole system is powered off and on again ("cold start"), the system should converge to a point where all instances, disks, and networking resources that were running before the power outage are available as they were from the user’s perspective before the event. Similarly, if a compute server is lost (either through graceful decommissioning or otherwise), it should be possible to resume service of resources that were running on that server (e.g., instances, disks) on other servers in the system. There may be additional constraints on how many servers can fail permanently before data is lost, but in no case should it be possible to permanently lose an instance, disk, or other resource after the permanent failure of two compute sleds.

Important
Resources created by users should generally survive permanent failure of any two hardware or software components.

Consistency. Generally, users can expect strong consistency for resources within some namespace. The bounds of the namespace for a particular resource may vary as described in [RFD 24]. For example, if a user creates an instance, another user with appropriate permissions should immediately see that instance. In terms of CAP, the system is generally CP, with an emphasis on avoiding partitions through reliable software and hardware.

Important
The API namespace is generally expected to provide strong consistency.

Scalability and performance. The API is designed with a scheme for naming and pagination that supports operating on arbitrarily large collections, so in principle it’s expected to support arbitrary numbers of most resources. In practice, the system is intended to support on the order of 100 servers in a rack and 10,000 VMs in a rack. While these numbers are unlikely to change drastically in the future, the long-term goal of providing a single view over multiple racks means the system will need to support much larger numbers of servers and other resources. To avoid catastrophic degradation in performance (to the point of unavailability) as the system is scaled, aggressive limits will be imposed on the numbers of most resources. Operators may choose to raise these limits but will be advised to test the system’s performance at the new scale.

Important
The API should support arbitrarily large systems. The system itself should be clear about its target scale and avoid catastrophic degradation due to users consuming too many resources.
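To make the pagination scheme concrete, here is a minimal sketch (illustrative only, not the actual API implementation) of the cursor style described above: clients pass a limit and an opaque token identifying where the previous page ended, so no request ever has to materialize the whole collection.

[source,rust]
----
// Illustrative cursor-style pagination over a name-sorted collection.
#[derive(Debug)]
struct Page<T> {
    items: Vec<T>,
    next_page_token: Option<String>, // resume point for the next request
}

/// List one page of a lexicographically sorted collection, resuming after `token`.
fn list_page(all_names_sorted: &[String], token: Option<&str>, limit: usize) -> Page<String> {
    let start = match token {
        // Resume strictly after the last name the client saw.
        Some(t) => all_names_sorted.partition_point(|n| n.as_str() <= t),
        None => 0,
    };
    let items: Vec<String> =
        all_names_sorted[start..].iter().take(limit).cloned().collect();
    let next_page_token = if start + items.len() < all_names_sorted.len() {
        items.last().cloned()
    } else {
        None
    };
    Page { items, next_page_token }
}

fn main() {
    let names: Vec<String> = (0..10).map(|i| format!("instance-{i:02}")).collect();
    let first = list_page(&names, None, 4);
    let second = list_page(&names, first.next_page_token.as_deref(), 4);
    println!("{:?}\n{:?}", first.items, second.items);
}
----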

Security. Older versions of [RFD 6] discussed control plane security in great detail. That content needs to be extracted from the history and probably put here.

Supportability and debuggability. Effective customer support includes rapidly diagnosing issues and releasing fixes with low-risk updates. To achieve this, all the software in the system, including the control plane, must be built with supportability in mind: we must be able to collect enough information about failures to diagnose them from their first occurrence in the field wherever possible, and we must be able to update software with low risk to the system. Details will be covered in an RFD to-be-named-later.

3. Parts of the control plane

3.1. Crash course on hardware architecture

For our purposes, an Oxide rack comprises three types of boards (systems):

  • Up to 32 compute sleds (servers). These are sometimes called Gimlets, though "Gimlet" technically refers to a particular hardware generation. Within the sled, the host system is the x86 box we generally think of as "the server".

  • 1 or 2 switches, each attached via PCIe to one of the 32 compute sleds. (The switches are also connected to each of the 32 sleds for networking; the PCIe connection mentioned here is for control of the switch itself, which is handled by only one sled.) The chassis that house the switches are sometimes called Sidecars, though "Sidecar" technically refers to a particular hardware generation. Sleds that are attached to switches are often called Scrimlets. (The name is a little unfortunate: it obviously comes from "Gimlet", but a Scrimlet is not necessarily a Gimlet, since "Gimlet" refers to a specific hardware generation.)

  • 1-2 power shelves, each with a Power Shelf Controller (PSC) that provides basic monitoring and control for the rectifiers that make up the power shelf.

Each type of system (Gimlet, Sidecar, and PSC) contains a service processor (SP) that’s responsible for basic monitoring and control, typically including power control and thermal management.

[RFD 210] discusses service processors in more detail.

3.2. Components that run alongside specific hardware

Figure 1. Overview of the control plane

At the "bottom" of the stack, we have a few basic components that reside alongside the specific pieces of hardware that they manage:

  • On each sled, the sled agent manages instances, storage, networking, and the sled’s other resources. Sled agent also collects information about hardware and reports it to Nexus. Each sled also runs either a boundary NTP or internal NTP service to synchronize the sled’s clock. More on boundary NTP below.

  • On the two Scrimlets, a "switch zone" provides additional functionality related to the switch:

    • Dendrite provides APIs for configuring the switch itself (e.g., populating various tables used for packet forwarding, NAT, etc.).

    • Management Gateway Service (MGS) provides APIs for communicating with all the rack’s service processors (including those on the sleds, Sidecars, and PSCs). See [RFD 210] for details.

    • Wicket and its associated service wicketd provide a text user interface (TUI) that’s accessible over the rack’s technician ports. Wicket is used for initial system setup (before networking has been configured) and for support.

    • Boundary NTP provides NTP service for all sleds in the rack based on upstream NTP servers provided by the customer.

Table 1. Components deployed alongside specific hardware
Component | How it’s deployed | Availability/scalability

Sled agent | One per sled, tied to that specific sled | N/A

Internal DNS | One zone per non-Scrimlet sled | N/A

Boundary NTP | One zone per Scrimlet. Both instances within a rack are fungible. | There are two. Short-term failure (on the order of hours or even days) is unlikely to affect anything, since sled clocks do not drift that quickly.

Dendrite | Part of the switch zone (one per Scrimlet), tied to that specific switch | Unavailability of either instance results in loss of the ability to configure and monitor the corresponding switch.

Management Gateway | Part of the switch zone (one per Scrimlet). Both instances within one rack are fungible. | Only one of the two instances is generally required to maintain service.

Wicket | Part of the switch zone (one per Scrimlet). Both instances within one rack are fungible. | Wickets operate independently. Failure of one means unavailability of the TUI over that technician port.

3.3. Higher-level components

Most other components:

  • are deployed in illumos zones

  • don’t care where they run and can even be deployed multiple times on the same sled

  • can be deployed multiple times for availability, horizontal scalability, or both

They are:

  • Nexus provides primary control for the whole control plane. Nexus hosts all user-facing APIs (both operator and customer), the web console, and internal APIs for other control plane components to report inventory, generate alerts, and so on. Nexus is also responsible for background control plane activity, including utilization management, server failure detection and recovery, and the like. Persistent state is stored elsewhere (in CockroachDB), which allows Nexus to be scaled separately.

  • CockroachDB provides a replicated, strongly-consistent, horizontally scalable database that stores virtually all control plane data. See [RFD 53] and [RFD 110] for details.

  • Clickhouse provides storage and querying services for metric data collected from all components in the rack. See [RFD 125] for more information.

  • Oximeter collects metric data from the other components and stores it in Clickhouse. See [RFD 162] for more information.

  • External DNS operates authoritative DNS nameservers for end users and operators. These are authoritative nameservers for whatever DNS name the customer specifies. They currently just provide DNS names for the external API and web console.

  • Internal DNS provides DNS names for all control plane components. This is how most of the control plane discovers its dependencies. (See [RFD 206] and [RFD 248].)

Table 2. Hardware-agnostic components
Component | How it’s deployed | Horizontal scalability | Availability

Nexus | Using zones, as many as needed. Instances are fungible. | Not architecturally limited. State provided by CockroachDB. | With N instances needed to handle load and M instances deployed, can survive M - N failures.

CockroachDB | Using zones, as many as needed. Instances are fungible. | Required, provided by CockroachDB cluster expansion. | Required, provided by CockroachDB range replication.

Clickhouse | Using zones, as many as needed. Instances are fungible. | TBD | Required, provided by Clickhouse replication (see [RFD 468]).

Oximeter | Using zones, as many as needed. | Yes. Configuration managed by Nexus, stored in CockroachDB, and cached in local storage for improved availability when other components are down. | TBD.

External DNS | Using zones, as many as needed. Instances are fungible. | Not architecturally limited. Generally limited by the number of external DNS server IP addresses provided by the customer, which is usually 2-5. | Generally, only one is needed for service.

Internal DNS | Using zones, as many as needed. Instances are fungible. | Hardcoded limit of 5. | With N instances needed to handle load and M instances deployed, can survive M - N failures.

4. Design principles

4.1. Basics

As much as possible, components are deployed in illumos zones. These are lightweight containers that act as their own complete systems (e.g., with their own dedicated networking stack with its own interfaces, IPs, etc.).

Oxide-produced components are written in Rust. They communicate over HTTP using APIs managed via OpenAPI using Dropshot. HTTP may not provide the best latency, but we don’t expect the throughput of API requests to be so high or the target latency so low that the overhead of HTTP internally will noticeably impact the customer experience. Using OpenAPI enables us to leverage investments in OpenAPI libraries, tooling, and documentation that we need for the external API. Rigorous use of OpenAPI, including automatically generating OpenAPI specifications from server implementations, allows us to automatically identify potentially breaking API changes. This information will eventually be included in metadata associated with each component’s update images so that the upgrade software can use this to ensure that only compatible combinations of components are deployed.
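As a concrete illustration, a Dropshot endpoint looks roughly like the following. This is a minimal sketch with hypothetical types and paths, not code from the actual services; the point is that the same definitions drive both request routing and the generated OpenAPI document that is compared across versions to find breaking changes.

[source,rust]
----
use dropshot::{endpoint, ApiDescription, HttpError, HttpResponseOk, RequestContext};
use schemars::JsonSchema;
use serde::Serialize;

// Hypothetical response type. Deriving JsonSchema is what lets Dropshot
// describe it in the generated OpenAPI document.
#[derive(Serialize, JsonSchema)]
struct SledSummary {
    id: String,
    online: bool,
}

// A hypothetical internal endpoint. The attribute carries the method and
// path that end up in both the router and the OpenAPI spec.
#[endpoint { method = GET, path = "/sleds" }]
async fn sled_list(
    _rqctx: RequestContext<()>,
) -> Result<HttpResponseOk<Vec<SledSummary>>, HttpError> {
    Ok(HttpResponseOk(vec![SledSummary { id: "sled-0".to_string(), online: true }]))
}

fn main() {
    // Endpoints are registered into an ApiDescription, from which Dropshot
    // can both serve requests and emit the OpenAPI description used for
    // compatibility checking.
    let mut api: ApiDescription<()> = ApiDescription::new();
    api.register(sled_list).unwrap();
}
----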

Service discovery happens via DNS. See [RFD 206] and [RFD 248].
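For example, a client can find all instances of a service by looking up an SRV record in the internal DNS zone. The sketch below uses the hickory-resolver crate (the 0.24-era API, where TokioAsyncResolver::tokio returns the resolver directly) and an illustrative SRV name; the actual zone and service labels are defined in [RFD 248].

[source,rust]
----
use hickory_resolver::config::{ResolverConfig, ResolverOpts};
use hickory_resolver::TokioAsyncResolver;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative resolver configuration; within the rack, components are
    // configured to point at the internal DNS servers instead.
    let resolver =
        TokioAsyncResolver::tokio(ResolverConfig::default(), ResolverOpts::default());

    // Illustrative SRV name for "all Nexus instances".
    let srv = resolver.srv_lookup("_nexus._tcp.control-plane.oxide.internal.").await?;
    for record in srv.iter() {
        println!("nexus at {}:{}", record.target(), record.port());
    }
    Ok(())
}
----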

4.2. Nexus, data flow

Nexus is the place where system-wide decisions get made. CockroachDB is the source of truth for all configuration.

Nexus stores all of its state in CockroachDB. It’s the only component that communicates directly with CockroachDB.

Nexus instances operate independently, without directly coordinating with each other except through CockroachDB.

Generally, when a change gets made, the process is:

  1. Nexus receives a request to make the change (e.g., via the external API)

  2. Nexus validates the requested change

  3. Nexus stores the information into CockroachDB. (This is the point where change is serialized against any concurrent changes.)

  4. Nexus propagates the change to other components that need to know about it.
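Schematically, that sequence looks like the following sketch (hypothetical types and function names; the real logic lives in Nexus and its CockroachDB data layer).

[source,rust]
----
// Hypothetical request type for some user-visible change.
#[derive(Clone, Debug)]
struct InstanceCreate {
    name: String,
    ncpus: u16,
}

#[derive(Debug)]
struct InvalidRequest(String);

// Steps 1-2: receive and validate the requested change.
fn validate(req: &InstanceCreate) -> Result<(), InvalidRequest> {
    if req.ncpus == 0 {
        return Err(InvalidRequest("ncpus must be nonzero".to_string()));
    }
    Ok(())
}

// Step 3: persist to CockroachDB. The database write is where this change
// is serialized against any concurrent changes to the same resources.
fn db_insert_instance(req: &InstanceCreate) -> InstanceCreate {
    req.clone()
}

// Step 4: notify the components that need to act on the new record. If this
// is interrupted, it can be redone later from the database copy.
fn propagate(record: &InstanceCreate) {
    println!("notifying sled agent about {:?}", record);
}

fn handle_instance_create(req: InstanceCreate) -> Result<(), InvalidRequest> {
    validate(&req)?;
    let record = db_insert_instance(&req);
    propagate(&record);
    Ok(())
}

fn main() {
    handle_instance_create(InstanceCreate { name: "web0".to_string(), ncpus: 4 }).unwrap();
}
----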

There are a few basic contexts in Nexus:

  • API requests from either the external or internal API. Here, Nexus is latency-sensitive. When we make database queries or other requests in this context, we usually do not retry transient failures, but leave that to callers (See "end-to-end principle"). API request handlers may kick off sagas or activate background tasks.

  • Distributed sagas are a design pattern for carrying out multi-step operations in a distributed system. Saga actions generally do retry transient errors indefinitely.

  • Background tasks are periodic or event-triggered activities that manage everything else that has to happen in the system (e.g., change propagation, CockroachDB cluster management, fault tolerance, etc.). Nexus has a framework for background tasks that’s oriented around the "reconciler" pattern (see [RFD 373]). In this context, we also usually don’t retry individual operations — instead, the entire activity will be retried on a periodic basis. Background tasks are structured to re-evaluate the state of the world each time they’re run and then determine what to do, on the assumption that things may have changed since the last time they ran.
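The reconciler shape is roughly the following (a schematic sketch with hypothetical state types, not the actual background-task framework).

[source,rust]
----
use std::{thread, time::Duration};

// Hypothetical stand-ins for "what the database says should exist" and
// "what currently exists".
struct DesiredState { dns_generation: u64 }
struct ActualState { dns_generation: u64 }

fn read_desired_state() -> DesiredState { DesiredState { dns_generation: 3 } }
fn read_actual_state() -> ActualState { ActualState { dns_generation: 2 } }
fn propagate(desired: &DesiredState) {
    println!("propagating DNS generation {}", desired.dns_generation);
}

fn main() {
    // Each activation re-reads the state of the world and converges toward
    // the desired state, assuming nothing from previous runs. Individual
    // failures are not retried here; the whole task simply runs again on
    // the next periodic or explicit activation.
    loop {
        let desired = read_desired_state();
        let actual = read_actual_state();
        if actual.dns_generation < desired.dns_generation {
            propagate(&desired);
        }
        thread::sleep(Duration::from_secs(60));
    }
}
----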

It’s essential that components provide visibility into what they’re doing for debugging and support. Software should be able to exonerate itself when things are broken.

  • API requests are short-lived. The Nexus log is currently the only real way to see what these have done.

  • Sagas are potentially long-lived. Without needing any per-saga work, the saga log provides detailed information about which steps have run, which steps are in-progress, and the results of each step that completed.

  • Background tasks are continuous processes. They can provide whatever detailed status they want to, including things like: activity counters, error counters, ringbuffers of recent events, data produced by the task, etc. These can be viewed with omdb.

4.3. Backwards Compatibility

4.3.1. Rules for data compatibility across versions (read this)

These rules are the most important things to focus on in terms of backwards compatibility, because they are the ad-hoc steps not covered by our infrastructure. Following these 2 rules should help make migrations safe and seamless, which will allow the production and test code to use the latest version of the given data structures at all times.

  1. Ensure that the code to perform an upgrade / backfill is in one location. This makes it easier to find and remove once it is no longer needed. It also makes it easier to test in isolation, and to understand the complete change.

  2. When doing a migration from an old format to a new format, prefer to do it up front during some kind of startup operation so that the rest of the system can operate only in the new world. The system should not try to backfill data during normal operation, as that forces code to support both the old and new formats simultaneously and creates more code bifurcation for testing.
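A minimal sketch of these two rules in practice, for a hypothetical serde-serialized structure (illustrative types and values, not actual sled-agent code):

[source,rust]
----
use serde::{Deserialize, Serialize};

// Old on-disk format (hypothetical).
#[derive(Deserialize)]
struct ConfigV1 {
    address: String,
}

// New format; the new field gets a fixed default when upgrading old data.
#[derive(Serialize, Deserialize)]
struct ConfigV2 {
    address: String,
    port: u16,
}

// Rule 1: all upgrade/backfill logic lives in this one function.
// Rule 2: it runs once at startup, so everything downstream sees only ConfigV2.
fn load_config(raw: &str) -> Result<ConfigV2, serde_json::Error> {
    match serde_json::from_str::<ConfigV2>(raw) {
        Ok(v2) => Ok(v2),
        Err(_) => {
            let v1: ConfigV1 = serde_json::from_str(raw)?;
            Ok(ConfigV2 { address: v1.address, port: 12345 })
        }
    }
}

fn main() {
    let old = r#"{ "address": "fd00:1122:3344::5" }"#;
    let cfg = load_config(old).unwrap();
    // The caller would immediately write the upgraded form back to disk so
    // the old format never needs to be read again.
    println!("{}", serde_json::to_string(&cfg).unwrap());
}
----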

4.3.2. Rationale and Background (read this if you care)

Network services

In general, backwards compatibility between services will be provided at the API level as described in [RFD 421]. Most internal control plane service APIs are Dropshot-based and can therefore use the same strategy. Some other services, such as trust quorum and Crucible, operate over TCP with custom protocols and have their own mechanisms for backwards compatibility. The introduction of new services of this type should be largely unnecessary for the foreseeable future.

Database state

While runtime compatibility between services is largely taken care of semantically, we still have to worry about compatibility of data on persistent storage. As a distributed system that cannot be atomically updated, a rack may have different versions of software running on different sleds, with each sled containing persistent state in a slightly different format. Furthermore, the structure of this data may be different across different customer racks depending upon when they were first set up. We have various categories of persistent state. Some of it is stored in database management systems (DBMS), where schemas are concrete and well-defined. For these scenarios, we can rely on our schema migration strategy as defined in [RFD 527]. After much discussion, this is largely a "solved" problem.

Ad-hoc persistent state

Slightly more concerning are things like serde serialized data types stored in JSON on various drives in the system. Most of these are stored in Ledgers across both M.2 drives and are only read by the sled-agent. These ledgers are used to store things such as the initial rack plan, key shares for trust quorum, and networking (bootstore data) for early cold boot support. We have largely been dealing with these in an ad-hoc manner. In most cases, new code in sled-agent reads the old structure and writes the new version to disk on sled-agent startup. This largely works fine, but in some instances has caused problems during upgrade when this was not done properly. This seems to be a reliable strategy so far for this limited set of ledger data, and it is unlikely we will need to change it. We do have to carefully test our upgrade paths, but we should be doing that anyway, and our support on this front is being worked on currently. An additional concern is to remember to prune old version support once all customers are past the point of needing it.

It is also important to note why the previous strategy works well and is largely foolproof. Each of these structures is coupled to a local process and only written and read by that process in a controlled manner. Format modifications are only made during update and are only visible locally. And most importantly, the code to perform those reads and writes is largely centralized in a single method, or at least single file per ledger. This makes it easy to reason about and unit test.

Migration of state from sled-agent to nexus

Now we get to what has been the hairiest of the problems for data compatibility across versions. As we add more features and make our system more consistent in its promise that Nexus manages state for the control plane instead of sled-agent, we have realized that Nexus sometimes doesn’t have enough information to take over this responsibility. In such cases, when performing RSS handoff to Nexus, we have had to add new state to the handoff message so that Nexus can create a blueprint to drive the rest of the system to its desired state via Reconfigurator. However, this only works for new rack deployments when we actually run RSS. For existing deployments that have already gone through initial rack setup, the new Nexus code does not have enough information to proceed with running Reconfigurator. In this case we must backfill that information. This can be, and has been, done in a variety of ways. We sometimes may have to add new data to CRDB, and sometimes modify a schema and backfill columns. Other times, we may need to retrieve important data from sled-agent and store it in existing placeholders in blueprints. In any event, doing this is tricky and influences how legible the code is, how testable it is, and how correct it is under all circumstances. It’s for this reason that we proposed the rules for data compatibility in the prior section, which largely align with how we do ledger updates.

5. Cold start

"Cold start" refers to starting the control plane from a rack that’s completely powered off. Achieving this requires careful consideration of where configuration is stored and how configuration changes flow through the system.

We’ll start from the point where sleds are powered on, even though a lot happens with the rectifiers, service processors, Sidecars, etc. before that point. Once host systems are powered on:

  • Sled agents start up, communicate with each other, and form a trust quorum that enables each of them to decrypt their local storage. This local storage includes:

    • a bootstore containing basic network configuration needed to bring up the rack

    • information about what control plane services are running on this sled

  • Sled agents apply any needed network configuration and start any services they’re supposed to be running:

    • On Scrimlets, the switch zone and boundary NTP are started. Boundary NTP synchronizes time from the customer-provided NTP servers.

    • On non-Scrimlets, internal NTP is started. The rest of cold boot waits until time has been synchronized from the boundary NTP instances.

    • Once time is synchronized, internal DNS services are started so that components can find each other.

    • Once internal DNS is available, all other services are started concurrently.

      • CockroachDB nodes start up, discover the rest of the cluster via DNS, and form a cluster.

      • Nexus starts up and waits for CockroachDB to become available.

      • All other services start up and wait for their dependencies to become available.
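The "wait for dependencies" step is essentially a retry loop. A rough sketch (hypothetical readiness check, not the actual startup code):

[source,rust]
----
use std::{thread, time::Duration};

// Hypothetical readiness check. In practice this would be something like
// resolving the dependency in internal DNS and probing its API.
fn cockroachdb_ready(attempt: u32) -> bool {
    attempt >= 3
}

fn main() {
    // Each service retries until its dependencies appear, so cold-start
    // ordering falls out of the dependency graph rather than being driven
    // by a central orchestrator.
    let mut attempt = 0;
    let mut delay = Duration::from_millis(100);
    while !cockroachdb_ready(attempt) {
        attempt += 1;
        thread::sleep(delay);
        delay = (delay * 2).min(Duration::from_secs(5)); // capped backoff
    }
    println!("dependency available after {attempt} attempts; starting service");
}
----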

For this to work:

  • The bootstore must contain enough information to configure networking on the switches and on each host so that they can reach other services within the rack as well as the outside world (for NTP).

  • Internal DNS must be able to come up without any external dependencies, meaning it stores a complete copy of all DNS data locally.

However, Nexus is the place where all changes to configuration are made, and CockroachDB is the source of truth for all configuration. As a result, when changing bootstore contents or internal DNS, the change is first made at Nexus, stored into CockroachDB, and then propagated to all sleds and internal DNS instances for local persistent storage so that it’s available on cold start (of the sled) without the rest of the control plane being up.

This is a very rough approximation, but gives an idea of the dependencies associated with cold start.

References

Unfortunately, most of these RFDs are not yet public.