31 changes: 27 additions & 4 deletions README.md
@@ -4,7 +4,25 @@ The official CAST AI kubernetes cluster controller written in Go

## Installation

Check our official helm charts repo https://github.com/castai/castai-helm-charts
Check our official helm charts repo <https://github.com/castai/castai-helm-charts>

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_KEY` | CAST AI API key (required) | - |
| `API_URL` | CAST AI API URL (required) | - |
| `CLUSTER_ID` | CAST AI cluster ID (required) | - |
| `DRAIN_VOLUME_DETACH_TIMEOUT` | Default timeout for waiting for VolumeAttachments to detach during node drain | `60s` |
Contributor

Is this a good idea? Feel free to ignore if it's not.

Suggested change
| `DRAIN_VOLUME_DETACH_TIMEOUT` | Default timeout for waiting for VolumeAttachments to detach during node drain | `60s` |
| `FALLBACK_DRAIN_VOLUME_DETACH_TIMEOUT` | Default timeout for waiting for VolumeAttachments to detach during node drain | `60s` |

Contributor Author

Hmm... how about `DRAIN_VOLUME_DETACH_DEFAULT_TIMEOUT`? That makes it clearer to the user that it's a default value which can be overridden by something else (the API payload in this case).

| `INFORMER_CACHE_SYNC_TIMEOUT` | Timeout for informer cache sync at startup | `1m` |

### VolumeAttachment Wait Feature

The cluster-controller supports waiting for VolumeAttachments to be deleted after draining a node. This helps prevent Multi-Attach errors when CSI drivers need time to clean up volumes.

This feature is controlled per-action via the API. The `DRAIN_VOLUME_DETACH_TIMEOUT` environment variable provides the default timeout when the API doesn't specify a custom value.

## Testing

@@ -19,11 +37,13 @@ Deploy cluster-controller to already connected remote cluster.
*NOTE*: Make sure your kubectl context is pointing to your remote cluster.

Have a configured `gcloud`. Make sure to docker login with

```shell
gcloud auth configure-docker gcr.io
```

Clone https://github.com/castai/castai-helm-charts adjacent to repo root folder. It will be used by our scripts
Clone <https://github.com/castai/castai-helm-charts> adjacent to repo root folder. It will be used by our scripts

```shell
cd <cluster-controller-parent-directory>
git clone https://github.com/castai/castai-helm-charts gh-helm-charts
@@ -56,8 +76,10 @@ The cluster-controller can be tested locally with a full e2e flow using `kind`:
Setup a `kind` cluster with a local docker registry by running the `./hack/kind/run.sh` script.

Option 1. Deploy controller in Kind cluster.

* Build your local code and push it to the local registry with `./hack/kind/build.sh`.
* Deploy the chart to the `kind` cluster with

```shell
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
@@ -69,12 +91,13 @@ Option 1. Deploy controller in Kind cluster.
```

### Load tests

See [docs](loadtest/README.md)

## Community

- [Twitter](https://twitter.com/cast_ai)
- [Discord](https://discord.gg/4sFCFVJ)
* [Twitter](https://twitter.com/cast_ai)
* [Discord](https://discord.gg/4sFCFVJ)

## Contributing

45 changes: 45 additions & 0 deletions cmd/controller/run.go
@@ -33,9 +33,11 @@ import (
"github.com/castai/cluster-controller/internal/controller/logexporter"
"github.com/castai/cluster-controller/internal/controller/metricexporter"
"github.com/castai/cluster-controller/internal/helm"
"github.com/castai/cluster-controller/internal/informer"
"github.com/castai/cluster-controller/internal/k8sversion"
"github.com/castai/cluster-controller/internal/metrics"
"github.com/castai/cluster-controller/internal/monitor"
"github.com/castai/cluster-controller/internal/volume"
"github.com/castai/cluster-controller/internal/waitext"
)

@@ -132,13 +134,38 @@ func runController(

log.Infof("running castai-cluster-controller version %v, log-level: %v", binVersion, logger.Level)

informerOpts := []informer.Option{
informer.WithCacheSyncTimeout(cfg.Informer.CacheSyncTimeout),
informer.WithDefaultVANodeNameIndexer(),
}
if cfg.Informer.EnableNode {
informerOpts = append(informerOpts, informer.EnableNodeInformer())
}
if cfg.Informer.EnablePod {
informerOpts = append(informerOpts, informer.EnablePodInformer())
}

informerManager := informer.NewManager(
log,
clientset,
12*time.Hour, // resync period: how often the whole informer cache is refreshed
informerOpts...,
)
if err := informerManager.Start(ctx); err != nil {
return fmt.Errorf("starting informer manager: %w", err)
}
defer informerManager.Stop()

vaWaiter := getVADetachWaiter(log, cfg, clientset, informerManager)

actionHandlers := actions.NewDefaultActionHandlers(
k8sVer.Full(),
cfg.SelfPod.Namespace,
log,
clientset,
dynamicClient,
helmClient,
vaWaiter,
)

actionsConfig := controller.Config{
@@ -365,6 +392,24 @@ func runningOnGKE(clientset *kubernetes.Clientset, cfg config.Config) (bool, err
return isGKE, err
}

func getVADetachWaiter(log *logrus.Entry, cfg config.Config, clientset kubernetes.Interface, informerManager *informer.Manager) volume.DetachmentWaiter {
if cfg.Drain.DisableVolumeDetachWait {
log.Info("VA wait feature disabled by configuration")
return nil
}

vaIndexer := informerManager.GetVAIndexer()
if vaIndexer == nil {
log.Info("VolumeAttachment informer not enabled, VA wait feature will be disabled even if requested by API")
return nil
}

vaWaiter := volume.NewDetachmentWaiter(clientset, vaIndexer, 5*time.Second, cfg.VolumeAttachment.DefaultTimeout)
log.Info("VolumeAttachment informer synced, VA wait feature enabled")

return vaWaiter
}

func saveMetadata(clusterID string, cfg config.Config, log *logrus.Entry) error {
metadata := monitor.Metadata{
ClusterID: clusterID,
4 changes: 3 additions & 1 deletion internal/actions/actions.go
@@ -9,6 +9,7 @@ import (

"github.com/castai/cluster-controller/internal/castai"
"github.com/castai/cluster-controller/internal/helm"
"github.com/castai/cluster-controller/internal/volume"
)

type ActionHandlers map[reflect.Type]ActionHandler
@@ -20,10 +21,11 @@ func NewDefaultActionHandlers(
clientset *kubernetes.Clientset,
dynamicClient dynamic.Interface,
helmClient helm.Client,
vaWaiter volume.DetachmentWaiter,
) ActionHandlers {
return ActionHandlers{
reflect.TypeFor[*castai.ActionDeleteNode](): NewDeleteNodeHandler(log, clientset),
reflect.TypeFor[*castai.ActionDrainNode](): NewDrainNodeHandler(log, clientset, castNamespace),
reflect.TypeFor[*castai.ActionDrainNode](): NewDrainNodeHandler(log, clientset, castNamespace, vaWaiter),
reflect.TypeFor[*castai.ActionPatchNode](): NewPatchNodeHandler(log, clientset),
reflect.TypeFor[*castai.ActionCreateEvent](): NewCreateEventHandler(log, clientset),
reflect.TypeFor[*castai.ActionChartUpsert](): NewChartUpsertHandler(log, helmClient),