Refile

Outline

Introduction

What is this book about?
Who is this book for?
Why read this book?
How this book is organized
Notes, etc., on programming languages

What is an Operator?

Why? Why Not?

Operator Frameworks

Kubebuilder Operator SDK client-go

Convergence and Level-Triggering

Interacting with the Kubernetes API

Case Study: ??

Solving Problems with Operators

This chapter needs to be meaty, but the preceding one shouldn’t be too simplistic - otherwise it’ll hit like a truck…
Examples here should be cluster-internal, or mostly so
Probably want somewhere between 5-10 examples
Backing up a database
Cluster autoscaler
Scaling a database cluster with orchestration

This might be the most important chapter.

Working with External Systems

Operator API Design

Performance and Scalability

Expanding the state of the world

Case Study: Deployment Controller

“Meta-operators”

Advanced Topics

Generations Admission Controllers Finalizers Owner References

Testing?

Chapters

Introduction

Welcome, and thanks for reading! As the title suggests, this is a book about programming “operators”: software that manages software running on the Kubernetes containerization platform. It’s a book I wish I had when building my own operators-based control systems as well as when helping others build theirs. It’s also a book about “why” - in part because I love asking “why” and getting to first principles; but also because although capital-O Operators and Kubernetes are this year’s models, control loops, level-triggering and convergence stand on their own as valuable tools for thinking about and designing distributed systems. And distributed systems are forever.

The rapid growth of Kubernetes <<>>

The purpose of this book is not to discuss “consuming” operators as part of deploying a software stack on Kubernetes. Operators as a software delivery mechanism are secondary. Rather, this is about using the operator pattern to build (and operate!) production-quality systems.

The ideas underpinning operators are not specific to Kubernetes, but Kubernetes is where this book’s focus is. This is largely a practical matter: Kubernetes is popular, and has good support for operator-style systems, and there aren’t a lot of folks doing this kind of thing elsewhere. Regardless, I hope that readeres gain an intuition for building convergence-based distributed systems that is useful in other domains. At some point in the future the cool kids won’t know what Kubernetes is, but they’ll still need to write correct distributed systems.

Regarding nomenclature: I’ll use the term “operator” (with a lower-case “o”) and “operator pattern” in this book to refer to the sorts of convergence-based systems discussed throughout. You’ll get the idea. “Operator” is a strange term. Kubernetes generally doesn’t use it: it has a `controller-manager`, not `operator-manager`. But the word is a useful shorthand. I explicitly don’t intend the terms to refer to RedHat’s Operator Framework or Operator Maturity Model, except where otherwise noted.

Regarding code: this is a book about programming. You should expect to read code while reading the book, specifically Go code. Go is the lanugage best supported by the Kubernetes API, which makes it far easier to build working operators without yak shaving on frameworks. It’s certainly possible to write operators in Python, C, or Bash, but it’s just more work. If you aren’t familiar with Go, don’t worry - there’s not a whole lot to it, and I don’t use any esoteric features of the language. If you’re familiar with any other imperative language, you’ll probably do fine.

Finally, all code in this book should be functional and mostly idiomatic as of Kubernetes 1.16. Kubernetes moves quickly, and the CRD interfaces are still evolving as of this writing, but the community takes backwards-compatibility seriously, so I expect that the gist of the projects here will continue to function for quite some time.

What is an Operator?

There has been an explosion in the world’s collective dependence on software in the twenty-first century. “Software is eating the world”, more and more. Our cars, our thermostats, our power grids, our communications, our businesses, our shopping - in many parts of the world it’s difficult to find an aspect of modern life that doesn’t depend deeply on software. “The Cloud” has compounded this phenomenon. Even a task as mundane and nonsocial as doing your taxes is typically done in a client-server model via a web browser instead of a standalone application, let alone visiting a brick-and-mortar office and talking to another human.

These factors in turn put a greater burden on the shoulders of the developers and maintainers of this software. Modern software must be highly scalable and highly available. Economic concerns demand that this be done as cheaply as possible. Programmers are push to do more with less.

Operators are a technology designed to reduce this burden by making it easier to automate management of systems - particularly those that resist automation by traditional means. They fill a gap in the average programmer’s toolkit for managing distributed stateful systems at scale.

You might wonder what makes operators special, why they’re different from “traditional means”, what that gap is, and how they fill it. To answer these questions, let’s take a brief detour through the history of computer systems management and look at how we got here and what we may have left behind.

A brief history of managing computer systems

N.B.: what follows is a rational reconstruction; it is not intended to be a comprehensive history by any means. There are loads of great books out there on the history of computing and computers; this isn’t one of them.

Beginnings: early electromechanical machines

The earliest days of modern computing are almost unrecognizable - no stored programs, no programming languages, no displays, machines that take up entire rooms - but the computers of the 1940s such as the Harvard Mark I and the ENIAC are unmistakably computers in the modern sense. The fact that they are so different from the world of the twenty-first century provides a useful starting point for our investigation. From a systems management perspective, several points are worth noting:

Precision is important, too: In the context of the 1940s, one of the key features of these early machines was their calculation precision. The analog computers they replaced were precise only to two or three significant digits. The Harvard Mark I could calculate precisely to twenty-three digits. Unfortunately, unlike in modern systems where failures typically manifest as availability loss, component failure in an electromechanical computer could compromise precision as well, resulting in silent corruption of results.
Component failures really matter: These machines had good reason to take up entire rooms. Without microchips, integrated circuits, or even transistors, they were composed of millions of discrete components, each subject to independent failure. There is an interesting distinction between the Harvard Mark I and the later ENIAC here. The Mark I engineers traded performance for reliability by using multipole relays in place of vacuum tubes, resulting in a machine that was able to operate nearly continuously for sixteen years. ENIAC, on the other hand, relied on nearly 18,000 vacuum tubes and rarely ran for more than a few days without blowing one out. It was a far faster machine, but only achieved approximately 50% uptime.
Sysadmins, too: These machines were extremely expensive, specialized pieces of hardware, usually dedicated to military purposes. Access to them was tightly controlled. ENIAC, for example, was run by a very small team that developed both the programming techniques as well as sophisticated operational procedures such as “don’t turn it off because it’ll blow tubes when it boots up”.

Finally, no discussion of systems management on electromechanical copmuters would be complete without mentioning the Semi-Automatic Ground Environment, or SAGE. SAGE was a Cold War-era missile defense system designed to coordinate and aggregate radar data across North America. It composed hundreds of radar systems feedling data to a set of twenty four “Direction Centers”. Every Direction Center housed a pair of AN/FSQ-7 computers each of which individually was the largest discrete computer ever built: it weighed 250 tons, required a 1 megawatt power supply, housed 49,000 vacuum tubes. These machines operated 24x7 in an active/standby pair delivering 99.95% uptime. Data from radars across the continent was fed to the Direction Centers on phone lines at 1300 bps. From a management perspective SAGE was a landmark. Not only was it tackling harder problems by introducing network failures and their attendant distributed systems problems, it also set a higher bar in attempting to overcome them.

Transistors and commercialization

While some commercial machines did exist in the vacuum tube era the mass productionization of transistors in the early 1960s marked both the beginning of a boom in commercial computing and a revolution in systems reliability. With solid-state transistors replacing vacuum tubes at the core of the machines, reliability increased tremedously. The attendant revoluion in operational cost along with the diminished physical scale and capital requirements of the machines made computing available to a much wider audience than ever before. This led roughly down two paths. One, which we’ll return to later on, starts with the TX-0 at MIT and proceeds through to the DEC PDPs and the Internet. The other, which we’ll trace here, remains in the realm of “big iron” computing - impersonal business machines built along the lines of their first-generation predecessors but taking advantage of transistors’ scale and reliability to achieve results far surpassing the electromechanical systems.

These machines - for example, the IBM 7000 series and the CDC 6000 series - were the first “mainframes”; some variants were the first to be labeled “supercomputers”. Applications were still largely local and batch-oriented, not interactive. Administration of the systems was still a specialized task, solidifying the “high priesthood of computing” trope. Reliability concerns were still typically directed toward single-system uptime via component testing and local redundancy. Reliability across multiple systems was usually achieved by running identical workloads on active/active pairs of machines at great cost. This line of thinking - “I’ve got this one system running my application, make it as fast and reliable as possible” may seem quaint to today’s programmers steeped in commodity hardware approaches to reliability, but in some circles it’s still preached as high art - IBM does, after all, continue to produce new Z-series mainframes year after year.

SABRE

The overriding theme in second-generation/transistorized computing of qualitative change, not quantitative change: new technology reduced the cost of ownership of computers to the point that they became economical for businesses, so the sorts of computing activities performed for military and scientific purposes in the first generation were adapted to business problems. This played out writ large with SABRE, the “Semi-Automated Business Research Environment,” whose name should be reminiscent of the “Semi-Automatic Ground Environment” - i.e., SAGE. (It seems that “SABE” didn’t have the same ring to it).

The seat reservation system at American Airlines in the 1950s was fundamentally unscalable. The entire process bottlenecked in a room with eight humans standing around a lazy susan in Little Rock, Arkansas, USA, and in the worst case it took upwards of three hours to reserve a seat. With air traffic rapidly growing, this situation was untenable. The obvious solution from IBM’s perspective was to take the principles of SAGE and apply them to American’s problem.

SAGE was fundamentally a distributed system: a series of relatively dumb terminals networked to a centralized store of radar data. Replace radar readouts with reservation agents’ terminals, and radar data with a seat reservation database and you can quickly get the idea behind SABRE. The first revision, deployed in 1960 at an unadjusted cost of more than $40 million, was built on the back of a pair of IBM 7090s, networked to more than 1000 reservation terminals in 50 cities via 10,000 miles of leased lines.

Reservation systems built on a similar architecture followed, and the pattern began to spread to other industries. Visa developed its electronic authorization and settlement systems, called BASE-I and BASE-II, in a similar architecture, though notably the “terminals” in BASE-I were actually DEC PDP-11s - far more powerful machines than the teletypes used by SABRE.

With the development of these systems, system reliability became geographical problem impacting thousands or millions of consumers every day. The centralized computers in these systems were required by the businesses they supported to be available; Lamport’s famous quip on distributed system - “[…] one in which the failure of a computer you didn’t even know existed can render your own computer unusable” - is nonsense before the advent of systems like SABRE and SAGE.

Fast forward to the 21st century, and these systems - more more accurately, their descendants, are still in production. IBM’s Transaction Processing Facility, the descendant of SABRE, is still in development and available to Z-series mainframe customers. In some ways this style of architecture seems like a dead end - it’s really the antithesis of the currently dominant commodity approach to systems design - but there is much to be learned from it, especially for those interested in design and operations of stateful systems.

A different path: consumer Internet and commodity hardware

Previous revision

The modern consumer expects Internet services that are always-on, fast, and cheap. These expectations place a burden on the shoulders of implementors and operators of these services, to deliver them in an economical fashion. Over the past twenty years there has been an explosion of tools, both technical and nontechnical, designed to meet this challenge. DevOps, Site Reliability Engineering, infrastructure-as-code, cattle-not-pets, containerization, NoSQL, Kubernetes, serverless - all of these have played a part in the evolution of the state of the art. To some extent the future is here (Kubernetes would be near-unthinkable by the average engineer in 2000, but is commonplace in 2020) but it’s not evenly distributed.

Many of these tools focus to near-exclusivity on the problems of stateless services, such as API servers and other so-called “microservices”. This makes some sense; stateless services are relatively simple from an run-time point of view. But it’s a rare stateless service that exists without the backing of a stateful system, such as a database or a message queue. And stateful systems are another beast entirely.

Managing stateful systems effectively and economically - scaling, upgrading, reconfiguring, backing up, recoving from failure - requires careful application of domain knowledge that cannot be encoded in Kubernetes. This isn’t to say that it’s not possible to deploy stateful systems to Kubernetes, or that “operable” stateful systems don’t exist. But it is the case that as a community of practice we do not have a shared understanding of how to do this in a repeatable, reliable, and economical fashion. The closest things we have come from the observability space, such as runbooks and alerts, and they don’t scale.

Operators are an answer to this problem. They are a tool for delivering and managing stateful systems running on Kubernetes. An operator is higher-level software that implements the behaviors to manage a stateful system in a robust manner on top of the Kubernetes runtime.

The heart of an operator is a controller. The controller periodically inspects “the world” and compares to a representation of the desired state of that world. If any action is necessary to make the world more closely reflect the desired state, the controller takes that action. This loop is performed in perpetuity. Think of a thermostat: every so often, it observes the state of “the world” - in this case, the ambient temperature - and compares it to the user’s desired state of that world - say, seventy degrees Farenheit. If the world is too cold, the furnace is powered on. If the world is too hot, perhaps the furnace will be powered off. Some time later, the thermostat will inspect the state of the world again, compare to the user’s desired state (perhaps seventy-one degrees this time!) and again take necessary action.

Despite the apparent triviality of the thermostat example it turns out that this pattern can be effectively applied to solve a broad range of systems management problems. This is not an accident; the fundamentals of controllers have been well-explored and systematized in the field of control theory which dates back to Maxwell in the nineteenth century. There’s a big difference between an operator that provisions an object storage bucket and a PID controller that auto-tunes a pool of web servers, but the basic idea is pretty similar, once you’ve got a control theory lens on the world.

In the context of of Kubernetes, a minimal operator comprises two components: a special type of resource called a CustomResourceDefinition, or CRD and a controller that uses instances of that CRD as the specification of the desired state of the world.

An instance of a CRD is a Kubernetes resource like any other. It includes standard ObjectMeta fields such as Name, CreationTimestamp, ResourceVersion, and Generation, as well as Spec and Status fields whose underlying structure is in the hands of the implementor. For example, the manifest below defines a custom resource called ”Foo ” in the programmingoperators.com/v1 namespace:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: foos.programmingoperators.com
spec:
  group: programmingoperators.com
  names:
    plural: foos
    singular: foo
    kind: Foo
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        # status enables the status subresource.
        status: {}

And here’s a sample manifest of an instance of that CRD:

apiVersion: "programmingoperators.com/v1"
kind: Foo
metadata:
  name: my-foo
spec:
  [ ... ]
status:
  [ ... ]

A controller associated with this resource uses the Kubernetes APIs to watch instances of foos.programmingoperators.com/v1 and take action against “the world” when necessary. It might create or mutate other Kubernetes resources, reconfigure a database, or launch missiles; beyond some basic limitations, Kubernetes doesn’t much care.

func Sync(foo *v1.Foo) {
    TODO
}

That’s it, really. “Operator” may imply big ideas about autonomous, self-healing distributed systems and thousands of SREs on the streets looking for jobs, but it doesn’t entail them. An operator can be as big or as small as the role it needs to fill with a system.

Most of the rest of this book is about how to turn this trivial example in to something worthwhile and useful. There’s a lot to cover. To dip in just a bit deeper let’s touch on two big ideas.

The first is related to the fact that Operators exist at the behest of the Kubernetes API, and that API has very particular semantics. One of these semantics, and perhaps the most important for operator programmers, is that there is no way to observe every change to a resource. Whatever your operator needs to do, it neds to do based on the current state of the resource at any given time. This idea is sometimes called “level-triggering”, and it’s essential to writing operators that do what they say on the box. We’ll be talking about it quite a bit.

The second big idea is convergence. Given a resource with a certain specification, the controller should eventually converge to a stable state where no more action is taken. “Eventually” could be a very long time - hundreds of iterations, or hours in the real-world - but the controller should be working to align the state of the world with the state specified in the resources it manages. If that specified state changes - even before the world has been converged - the controller should start moving towards the new state.

Operator Frameworks

Convergence and Level-Triggering

While this book is largely focused on operators as a pattern for building systems on Kubernetes, it’s worth keeping in mind that the basic ideas here don’t have much to do with Kubernetes at all. In this chapter we’ll take a step back from the pragmatic and get a little more grounding in the theory behind the practice. We’ll swing back around to the “real thing” soon enough, newly armed with a deeper understand of the fundamentals. The key concepts are declarativity, convergence, and level-triggering. Maybe these are already familiar to you, or perhaps you have more of a practical bias; if that’s the case, don’t hesitate to skip forward. This chapter stands alone, and you can always come back to it.

INPROGRESS Declarativity: what, not how

If you’ve been around programming circles for a while, there’s a good chance you’re already familiar with the the idea of “declarativity”, and declarative vs. imperative. Declarative is about “what”, and imperative is about “how”, right? It’s the difference between the elegance of mergesort in Haskell and the mechanical grind (and the beauty!) of Timsort in C. In the world of operators, declarativity is of supreme importance

Convergence

Level-triggering

Interacting with the Kubernetes API

Case Study: ??

Solving Problems with Operators

We’ve now covered all of the basics; it’s time to dive in to some more realistic problems and discuss how to solve them with operators. This chapter will go in-depth with several real world problems and sketch out solutions.

Working with External Systems

One of the challenges of managing complex stateful systems is that they tend to be hard to “contain”: their correct behavior depends on a range of technologies such as:

distributed consensus stores (etcd, chubby)
block storage volumes
object storage buckets

These systems probably aren’t running on top of Kubernetes and they’re unlikely to be managed by the same operator as the system in question. To build a proper Kubernetes-native management tool for systems with these sorts of dependencies, the operator will need to interact with systems outside of its direct purview.

The basic “stateful” primitives native to Kubernetes are a useful first step. The PersistentVolume~/~PersistentVolumeClaim~/~CSI interfaces are more or less exactly what we’re talking about: an bridge between the Kubernetes world and and the external world. It turns out that the Kubernetes authors have written the controllers themselves, so working with these abstractions is somewhat abstracted from what one might call and operator. But under the hood, it’s a similar idea - a Kubernetes-native, resource-based interface to the lifecycle of an external stateful service.

But what about other resources? What if we have to write the system ourselves, controller and all? Let’s take as a concrete example an object storage bucket such as Amazon S3 or IBM’s Cloud Object Storage.

Operator API Design

As is common with much of computing, user experience is an oft-neglected part of building effective operators. Controller behavior really is the heart of the pattern, but it must not be forgotten that the CustomResourceDefinitions exposed by an operator are an interface. They’re the entrypoint for humans and computers in to your system; if they’re a mess, no one is going to be eager to use it. This goes for both inputs - the spec - and outputs - the status - of your CustomResourceDefinitions. If this chapter we’ll discuss several considerations in the interface arena:

Use more CRDs
Use built-in types
Make generations meaningful
Use labels and annotations judiciously
Hide unnecessary complexity
Follow Kubernetes conventions
Don’t accidentally make imperative interfaces

Fat CRDs, God Objects, and RBAC

When surveying the state of the art in open-source operators it’s common to see a one-to-one correspondence between an operator and a CustomResourceDefinition. Assuming these operators are all doing something useful, this is unfortunate for a couple of reasons.

First, off, analogous to the “god object” problem in object-oriented programming, “fat” CRDs provide interfaces (and require implementations!) that unecessarily couple systems across unrelated concerns. This can make programming against your operator difficult; “everything knows about everything” and its associated brittleness will leak in to your users’ applications.

Consider, for example, an operator for managing a PostgreSQL database that supports both provisioning the database and managing its backups. One way to express this would be to use a PostgreSQLInstance CRD with a Backups sub-field inside Status:

apiVersion: "api.programmingoperators.com/v1"
kind: PostgreSQLInstance
metadata:
  name: my-psql
spec:
  replicas: 2
status:
  backups:
  - 1581742748
  - 1581656378

This will work, more or less, but it’s definitely not ideal. For one, it leaves the user with no interface to work with an individual backup, e.g., to delete it. A more useful structure would abstract out a PostgreSQLBackup CRD and support operations directly against it, instead:

apiVersion: "api.programmingoperators.com/v1"
kind: PostgreSQLBackup
metadata:
  name: my-first-backup
spec:
  expires: 1584161978
status:
  state: completed

The operator can then expose a declarative interface for the backups - creating a PostgreSQLBackup causes a backup to be taken with the given specification, deleting one causes it to be deleted, etc.

In general, looking to object-oriented programming patterns as a model for CRD design isn’t a bad idea, though keep in mind they’re not free - the single responsibility principle might be taking things a bit too far.

Carrying on a bit deeper with this PostgreSQL example, let’s talk about RBAC. Kubernetes’ role-based access control system is a tremendously important part of operators. In what other system can you get a fully-baked authz model for your distributed system, with all the bells, whistles, and integrations you could want, for free, out of the box? It is critical not to overlook RBAC when designing CRDs.

In the first example above, where backups are embedded in the PostgreSQLInstance CRD, it’s impossible for a user to write a tool to delete a backup that cannot also delete the entire PostgreSQL instance, perhaps with all of its data! On the other hand, with the dedicated PostgreSQLBackup CRD, users can define fine-grained RBAC rules that permit manipulation of backups independent of permissions on instances.

Users need to implement services with minimal responsibilities, and the tools to enable them are right in front of you. Use them!

Use built-in types liberally

Kubernetes is a pretty “big” system. It covers quite a lot of conceptual ground in the data represented by its basic functionality, from abstract concepts like Deployment and Pod to more concrete ones such as bytes of memory. When designing your CRD specifications, you can achieve a much more “native” look-and-feel if you adopt types from the upstream libraries directly.

For example, if you’re representing a resource, such as memory, don’t ever do this:

type FooSpec struct {
    // How much memory to allocate to a Foo
    MemoryInBytes int32
}

Instead, use a resource.Quantity, just like Kubernetes resources do:

type FooSpec struct {
    // How much memory to allocate to a Foo
    Memory resource.Quantity
}

This approach allows both you and your users to take advantage of the work the Kubernetes authors have already done - parsing human-readable strings, efficient conversions under the hood, etc. And your users are already familiar with it - minimize cognitive burden!

Make generations meaningful

From an API consumer’s point of view, one of the more challenging poarts of working with operators is understanding when a specific action is completed. This flows naturally from the declarative nature of Kubernetes: a successful response to an API call represents only the acceptance of a request - not its execution. For all the caller knows, that request may not be possible to satisfy, or the relevant controller may be out to lunch, or it may take a week to complete by design - it’s not immediately obvious. This is a general problem for Kubernetes controllers. For example, try figuring out when an image change on a Deployment is completely rolled-out - it’s not straightforward. Many controllers don’t even bother to communicate convergence progress; those that do generally do is in an ad-hoc manner. It is unfortunately up to the user to understand the semantics of each resource-controller pair.

As the designed of CRD-based operators this is a problem for you, too. And of course, there’s no “right” answer - if there were, it’d be standardized. One thing that you can do, however, is to use resource generations. It’s quite simple, and you should probably be using them anyway (see the chapter on performance).

Resource generations are part of the ObjectMeta struct embedded in every Kubernetes resource. Every time a resource’s specification - not status, not labels, not annotations, but specifically the Spec field - is mutated, the generation is atomically incremented by the API server.

For historical reasons, this works slightly differently for custom resources. By default, generations are never incremented for custom resources. In order to enable them, the “status” subresource must be enabled on the CustomResourceDefinition, like this:

apiVersion: programmingoperators.com/v1
kind: CustomResourceDefinition
metadata:
  name: foo.programmingoperators.com
spec:
  group: programmingoperators.
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        # status enables the status subresource.
        status: {}

With the status resource enabled, writes to the custom resource - barring .metadata and .status fields - will increment the resource’s .metadata.generation field.

Note: if you’re working operators in the wild, and want to migrate existing resources to the support the status subresource, it is possible to mutate existing CustomResourceDefinitions to enable it. Do so with care, however: doing so will irrevocably destroy all existing data at .status.

With the status subresource enabled we need one more datum to tie the story together here: observedGeneration. A resource’s “observed generation” - .status.observedGeneration typically represents an indication from the resource’s controller that the resource’s specification at that generation has been reflected in the state of the world. That is, from a caller’s point of view, after mutating a resource and noting the returned resource’s .metadata.generation, one must simply wait for .metadata.observedGeneration >= .metadata.generation. At this point, the caller’s request has been “satisfied” in some sense.

Do note that this mechanism is ad-hoc; a controller’s definition of having “observed” a generation may be completely idiosyncratic. It’s up to you as the controller author to clearly document - and make consistent - the semantics of .metadata.observedGeneration to help your users make effective use of your operators.

Collapse file tree

Files

programmingoperators.org

Latest commit

History

programmingoperators.org

File metadata and controls

Refile

Outline

Introduction

What is an Operator?

Operator Frameworks

Convergence and Level-Triggering

Interacting with the Kubernetes API

Case Study: ??

Solving Problems with Operators

Working with External Systems

Operator API Design

Performance and Scalability

Expanding the state of the world

Case Study: Deployment Controller

“Meta-operators”

Advanced Topics

Testing?

Chapters

Introduction

What is an Operator?

A brief history of managing computer systems

Beginnings: early electromechanical machines

Transistors and commercialization

SABRE

A different path: consumer Internet and commodity hardware

Previous revision

Operator Frameworks

Convergence and Level-Triggering

INPROGRESS Declarativity: what, not how

Convergence

Level-triggering

Interacting with the Kubernetes API

Case Study: ??

Solving Problems with Operators

Working with External Systems

Operator API Design

Fat CRDs, God Objects, and RBAC

Use built-in types liberally

Make generations meaningful

Performance and Scalability

Expanding the state of the world

Case Study: Deployment Controller

“Meta-operators”

Advanced Topics

Testing?

Ideas

Meta

Export code