Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External Cluster Environments #1244

Open
wants to merge 32 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 21 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
49985f5
add remote cluster functionality
Shrinjay Jan 24, 2023
d491508
Add context selection to helm and passthrough
Shrinjay Jan 25, 2023
686ed39
Remove label
Shrinjay Jan 25, 2023
7e90611
Put back comments
Shrinjay Jan 25, 2023
2daf2cc
Fix names
Shrinjay Jan 25, 2023
ff5aab5
Fix docstrings, generator, and various other
Shrinjay Jan 25, 2023
20a5c56
black fixes
Shrinjay Jan 25, 2023
0d43bec
Add remote autoconfiguration and role
Shrinjay Jan 26, 2023
875fafa
Fix env var checking
Shrinjay Jan 26, 2023
5d1d732
move to namespaced roles for remote clusters
Shrinjay Jan 27, 2023
19f92c7
Formatting and comments
Shrinjay Jan 27, 2023
e0df4e2
Add documentation and cleanup
Shrinjay Jan 27, 2023
5e23382
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 27, 2023
c9b90ce
Fix doc requirements
Shrinjay Jan 30, 2023
0e0ae6f
Merge branch 'feature/remote-cluster' of https://github.com/Shrinjay/…
Shrinjay Jan 30, 2023
9a9416e
Merge branch 'main' into feature/remote-cluster
Shrinjay Jan 30, 2023
9ba9ab7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 30, 2023
0a4c6fb
Change to external cluster naming
Shrinjay Jan 31, 2023
d8b55b6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 31, 2023
94ab66e
Chart lint fixes
Shrinjay Jan 31, 2023
3514129
Merge branch 'feature/remote-cluster' of https://github.com/Shrinjay/…
Shrinjay Jan 31, 2023
4fbe8b7
Update docs/source/operators/deploy-kubernetes.md
Shrinjay Apr 14, 2023
51256b9
Update docs/source/operators/deploy-kubernetes.md
Shrinjay Apr 14, 2023
8107dac
Update enterprise_gateway/services/processproxies/crd.py
Shrinjay Apr 14, 2023
26cba1a
Apply suggestions from code review
Shrinjay Apr 14, 2023
55065ca
Update docs/source/operators/deploy-kubernetes.md
Shrinjay Apr 14, 2023
3282e14
Update etc/kubernetes/helm/enterprise-gateway/templates/deployment.yaml
Shrinjay Apr 14, 2023
047538e
progress
Shrinjay Apr 14, 2023
7c06f2b
throw warning
Shrinjay Apr 14, 2023
c909106
merge latest
Shrinjay Apr 14, 2023
a1a9e46
Merge branch 'main' into feature/remote-cluster
Shrinjay Apr 14, 2023
8554343
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 14, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions docs/source/operators/deploy-kubernetes.md
Original file line number Diff line number Diff line change
Expand Up @@ -520,6 +520,10 @@ can override them with helm's `--set` or `--values` options. Always use `--set`
| `kip.pullPolicy` | Determines whether the Kernel Image Puller will pull kernel images it has previously pulled (`Always`) or only those it hasn't yet pulled (`IfNotPresent`) | `IfNotPresent` |
| `kip.criSocket` | The container runtime interface socket, use `/run/containerd/containerd.sock` for containerd installations | `/var/run/docker.sock` |
| `kip.defaultContainerRegistry` | Prefix to use if a registry is not already specified on image name (e.g., elyra/kernel-py:VERSION) | `docker.io` |
| `externalCluster.enable` | Launch kernels in a remote cluster. Used for multi-cluster environments. **Must place a kubeconfig file in the `config/` folder of the helm chart**. | `false` |
| `externalCluster.configPath` | Path to mount kubeconfig at | `/etc/kube/config` |
| `externalCluster.configFilename` | Filename to kubeconfig file inside `config/` directory of chart | `kubeconfig` |
| `externalCluster.autoconfigureRemote` | Automatically create service account in remote cluster | `false` |

## Uninstalling Enterprise Gateway

Expand Down Expand Up @@ -955,6 +959,29 @@ Of particular importance is the mapping to port `8888` (e.g.,`32422`). If you ar

The value of the `JUPYTER_GATEWAY_URL` used by the gateway-enabled Notebook server will vary depending on whether you choose to define an external IP or not. If and external IP is defined, you'll set `JUPYTER_GATEWAY_URL=<externalIP>:8888` else you'll set `JUPYTER_GATEWAY_URL=<k8s-master>:32422` **but also need to restart clients each time Enterprise Gateway is started.** As a result, use of the `externalIPs:` value is highly recommended.

## Multi-Cluster Environments

### Overview

With `externalCluster.enabled` set to `true`, Enterprise Gateway can be used on multi-cluster environments where the jupyter enterprise gateway pods and kernel pods are launched on separate clusters. To configure this:

1. Ensure your two clusters have interconnceted networks. Pods in the two clusters must be able to communicate with each other over pod IP alone.
1. Provide a kubeconfig file for use in the `config/` subdirectory of `etc/kubernetes/helm/enterprise-gateway` chart.
1. Set `externalCluster.enabled` to `true`.

Enterprise Gateway will now launch kernel pods in whichever cluster you have set to default in your kubeconfig.

### Resources in Remote Clusters

For Enterprise Gateway to work accross clusters, Enterprise Gateway must create the following resources in the cluster your kernels will be launched on.

- The kernel resource.
- A service account for the kernel pods (if `externalCluster.autoconfigureRemote` is set to `true`).
- A namespaced role for the namespace where your kernel pods will be launched.
- A role binding between your namespaced role and your service account.

The role resource is defined in the `templates/kernel-role.yaml` template of the helm chart. Permissions can be set there.

## Kubernetes Tips

The following items illustrate some useful commands for navigating Enterprise Gateway within a kubernetes environment.
Expand Down
Empty file.
5 changes: 5 additions & 0 deletions enterprise_gateway/services/external/k8s_client.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
"""Instantiates a static global factory and a single atomic client"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is more about the introduction of the services/external directory. One thing we need to keep in mind is that for EG 4.0, most all of these changes will move to gateway_provisioners. Since it ONLY has the equivalent of services/processproxies and since this doesn't seem like an "external service", I'm inclined to suggest we simply include these files alongside services/processproxies/k8s.py. I don't think they warrant their own location within the package since they are solely used by KubernetesProcessProxy and its subclasses (and the launcher). Would there be an issue with moving these into services/processproxies?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I think that's fine, I can move them into processproxies and go from there

from enterprise_gateway.services.external.k8s_client_factory import KubernetesClientFactory

KUBERNETES_CLIENT_FACTORY = KubernetesClientFactory()
kubernetes_client = KUBERNETES_CLIENT_FACTORY.get_kubernetes_client()
45 changes: 45 additions & 0 deletions enterprise_gateway/services/external/k8s_client_factory.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
"""Contains factory to create kubernetes api client instances using a single confguration"""
import os

from kubernetes import client, config
from traitlets.config import SingletonConfigurable

from ..utils.envutils import is_env_true


class KubernetesClientFactory(SingletonConfigurable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this! Thank you for making this better!

"""Manages kubernetes client creation from environment variables"""

def __init__(self) -> None:
"""Maintain a single configuration object and populate based on environment"""
super().__init__()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem necessary at the moment and would recommend its removal unless we find it necessary. Sometimes the linting can demand this kind of thing (which is annoying).


def get_kubernetes_client(self, get_remote_client: bool = True) -> client.ApiClient:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't get_remote_client default to False since EG_USER_REMOTE_CLUSTER does the same?

"""Get kubernetes api client with appropriate configuration

Args:
get_remote_client (bool): Return a client for the remote cluster if configured. Else, return incluster config. Defaults to True.

Returns:
ApiClient: Kubernetes API client for appropriate cluster
"""
kubernetes_config: client.Configuration = client.Configuration()
if os.getenv("KUBERNETES_SERVICE_HOST"):
# Running inside cluster
if is_env_true('EG_USE_REMOTE_CLUSTER') and get_remote_client:
kubeconfig_path = os.getenv(
'EG_REMOTE_CLUSTER_KUBECONFIG_PATH', '/etc/kube/config/kubeconfig'
)
context = os.getenv('EG_REMOTE_CLUSTER_CONTEXT', None)
config.load_kube_config(
client_configuration=kubernetes_config,
config_file=kubeconfig_path,
context=context,
)
else:
config.load_incluster_config(client_configuration=kubernetes_config)
else:
config.load_kube_config(client_configuration=kubernetes_config)

self.log.debug(f"Created kubernetes client for host {kubernetes_config.host}")
return client.ApiClient(kubernetes_config)
13 changes: 12 additions & 1 deletion enterprise_gateway/services/processproxies/crd.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,12 @@

from __future__ import annotations

import os
from typing import Any

from kubernetes import client

from ..external.k8s_client import kubernetes_client
from ..kernels.remotemanager import RemoteKernelManager
from .k8s import KubernetesProcessProxy

Expand Down Expand Up @@ -37,6 +39,13 @@ async def launch_process(
kwargs["env"]["KERNEL_CRD_VERSION"] = self.version
kwargs["env"]["KERNEL_CRD_PLURAL"] = self.plural

use_remote_cluster = os.getenv("EG_USE_REMOTE_CLUSTER")
if use_remote_cluster:
kwargs["env"]["EG_USE_REMOTE_CLUSTER"] = 'true'
kwargs["env"]["EG_REMOTE_CLUSTER_KUBECONFIG_PATH"] = os.getenv(
"EG_REMOTE_CLUSTER_KUBECONFIG_PATH"
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this code is in CustomResourceProcessProxy and not KubernetesProcessProxy since the former is a subclass of the latter and it strikes me as necessary for KubernetesProcessProxy.

await super().launch_process(kernel_cmd, **kwargs)
return self

Expand All @@ -48,7 +57,9 @@ def delete_managed_object(self, termination_stati: list[str]) -> bool:

Note: the caller is responsible for handling exceptions.
"""
delete_status = client.CustomObjectsApi().delete_namespaced_custom_object(
delete_status = client.CustomObjectsApi(
api_client=kubernetes_client
).delete_namespaced_custom_object(
self.group,
self.version,
self.kernel_namespace,
Expand Down
94 changes: 81 additions & 13 deletions enterprise_gateway/services/processproxies/k8s.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,10 +10,14 @@
from typing import Any

import urllib3
from kubernetes import client, config
import yaml
from kubernetes import client
from kubernetes.utils.create_from_yaml import create_from_yaml_single_item

from ..external.k8s_client import kubernetes_client
from ..kernels.remotemanager import RemoteKernelManager
from ..sessions.kernelsessionmanager import KernelSessionManager
from ..utils.envutils import is_env_true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we refactor this filename to include an underscore for word separation?

Suggested change
from ..utils.envutils import is_env_true
from ..utils.env_utils import is_env_true

from .container import ContainerProcessProxy

urllib3.disable_warnings()
Expand All @@ -29,8 +33,6 @@
share_gateway_namespace = bool(os.environ.get("EG_SHARED_NAMESPACE", "False").lower() == "true")
kpt_dir = os.environ.get("EG_POD_TEMPLATE_DIR", "/tmp") # noqa

config.load_incluster_config()


class KubernetesProcessProxy(ContainerProcessProxy):
"""
Expand Down Expand Up @@ -81,7 +83,7 @@ def get_container_status(self, iteration: int | None) -> str | None:
# is used for the assigned_ip.
pod_status = None
kernel_label_selector = "kernel_id=" + self.kernel_id + ",component=kernel"
ret = client.CoreV1Api().list_namespaced_pod(
ret = client.CoreV1Api(api_client=kubernetes_client).list_namespaced_pod(
namespace=self.kernel_namespace, label_selector=kernel_label_selector
)
if ret and ret.items:
Expand Down Expand Up @@ -117,7 +119,7 @@ def delete_managed_object(self, termination_stati: list[str]) -> bool:
# Deleting a Pod will return a v1.Pod if found and its status will be a PodStatus containing
# a phase string property
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.21/#podstatus-v1-core
v1_pod = client.CoreV1Api().delete_namespaced_pod(
v1_pod = client.CoreV1Api(api_client=kubernetes_client).delete_namespaced_pod(
namespace=self.kernel_namespace, body=body, name=self.container_name
)
status = None
Expand Down Expand Up @@ -164,7 +166,7 @@ def terminate_container_resources(self) -> bool | None: # noqa
body = client.V1DeleteOptions(
grace_period_seconds=0, propagation_policy="Background"
)
v1_status = client.CoreV1Api().delete_namespace(
v1_status = client.CoreV1Api(api_client=kubernetes_client).delete_namespace(
name=self.kernel_namespace, body=body
)
status = None
Expand Down Expand Up @@ -227,7 +229,6 @@ def _determine_kernel_pod_name(self, **kwargs: dict[str, Any] | None) -> str:
return pod_name

def _determine_kernel_namespace(self, **kwargs: dict[str, Any] | None) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The following applies to lines L240-L254 below...)

  1. Today, if the user is bringing their own namespace via KERNEL_NAMESPACE it is their responsibility to ensure proper operation within the namespace. With this PR, and if EG_USE_REMOTE_CLUSTER is True, I'm assuming the same requirement holds. Are there any pieces of new functionality that would preclude users from bringing their own namespace (where that namespace resides in the remote cluster)?

  2. Clearly, both EG_SHARED_NAMESPACE = True and EG_USE_REMOTE_CLUSTER = True conflict. Perhaps we should check for EG_SHARED_NAMESPACE = True when obtaining the k8s client and either log and return the local client or raise an exception?

((Just to note, in gateway_provisioners the default value for EG_SHARED_NAMESPACE is True to allow for easier deployments outside of something like since they are available to any jupyter_client application.))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Something I realize should be documented better is that remote cluster operation requires the namespace to be already created before we launch a kernel there. This was mostly a security concern, as long as operations are isolated within a namespace in a remote cluster it's fine, but creating a namespace is a cluster level operation which can be scary for operators to enable.

  2. Agreed on this, we should log a warning then ignore it if EG_USE_REMOTE_CLUSTER is enabled


# Since we need the service account name regardless of whether we're creating the namespace or not,
# get it now.
service_account_name = KubernetesProcessProxy._determine_kernel_service_account_name(
Expand Down Expand Up @@ -280,10 +281,17 @@ def _create_kernel_namespace(self, service_account_name: str) -> str:

# create the namespace
try:
client.CoreV1Api().create_namespace(body=body)
client.CoreV1Api(api_client=kubernetes_client).create_namespace(body=body)
self.delete_kernel_namespace = True
self.log.info(f"Created kernel namespace: {namespace}")

# If remote cluster is being used, service account may not be present, create before role binding
# If creating service account is disabled, operator must manually create svc account
if is_env_true('EG_USE_REMOTE_CLUSTER') and os.getenv('EG_CREATE_REMOTE_SVC_ACCOUNT'):
self._create_service_account_if_not_exists(
namespace=namespace, service_account_name=service_account_name
)

# Now create a RoleBinding for this namespace for the default ServiceAccount. We'll reference
# the ClusterRole, but that will only be applied for this namespace. This prevents the need for
# creating a role each time.
Expand All @@ -307,14 +315,66 @@ def _create_kernel_namespace(self, service_account_name: str) -> str:
body = client.V1DeleteOptions(
grace_period_seconds=0, propagation_policy="Background"
)
client.CoreV1Api().delete_namespace(name=namespace, body=body)
client.CoreV1Api(api_client=kubernetes_client).delete_namespace(
name=namespace, body=body
)
self.log.warning(f"Deleted kernel namespace: {namespace}")
else:
reason = f"Error occurred creating namespace '{namespace}': {err}"
self.log_and_raise(http_status_code=500, reason=reason)

return namespace

def _create_service_account_if_not_exists(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is strictly for external clusters, could we rename this to something like: _create_external_service_account_if_not_exists or _create_remote_service_account_if_not_exists. Since the other configurable options refer to "external", I guess I'd prefer the former.

self, namespace: str, service_account_name: str
) -> None:
"""If service account doesn't exist in target cluster, create one. Occurs if a remote cluster is being used."""
service_account_list_in_namespace: client.V1ServiceAccountList = client.CoreV1Api(
api_client=kubernetes_client
).list_namespaced_service_account(namespace=namespace)

service_accounts_in_namespace: list[
client.V1ServiceAccount
] = service_account_list_in_namespace.items
service_account_names_in_namespace: list[str] = [
svcaccount.metadata.name for svcaccount in service_accounts_in_namespace
]

if service_account_name not in service_account_names_in_namespace:
service_account_metadata = {"name": service_account_name}
service_account_to_create: client.V1ServiceAccount = client.V1ServiceAccount(
kind="ServiceAccount", metadata=service_account_metadata
)

client.CoreV1Api(api_client=kubernetes_client).create_namespaced_service_account(
namespace=namespace, body=service_account_to_create
)

self.log.info(
f"Created service account {service_account_name} in namespace {namespace}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to access the "name" of the external cluster? Seems really helpful to include that here.

Suggested change
f"Created service account {service_account_name} in namespace {namespace}"
f"Created service account {service_account_name} in namespace {namespace} of external cluster {external_cluster}"

)

def _create_role_if_not_exists(self, namespace: str) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar name change to that suggested previously would be nice here.

"""If role doesn't exist in target cluster, create one. Occurs if a remote cluster is being used"""
role_yaml_path = os.getenv('EG_REMOTE_CLUSTER_ROLE_PATH')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should role_yaml_path be validated for None and the file's existence?

This line (and any validation) should be moved within the if kernel_cluster_role not in remote_cluster_role_names: block at L379 so that it's only performed if needed.


# Get Roles in remote cluster
remote_cluster_roles: client.V1RoleList = client.RbacAuthorizationV1Api(
api_client=kubernetes_client
).list_namespaced_role(namespace=namespace)
remote_cluster_role_names = [role.metadata.name for role in remote_cluster_roles.items]

# If the kernel Role does not exist in the remote cluster.
if kernel_cluster_role not in remote_cluster_role_names:
with open(role_yaml_path) as f:
role_yaml = yaml.safe_load(f)
role_yaml["metadata"]["namespace"] = namespace
create_from_yaml_single_item(yml_object=role_yaml, k8s_client=kubernetes_client)

self.log.info(f"Created role {kernel_cluster_role} in namespace {namespace}")
else:
self.log.info(f"Found role {kernel_cluster_role} in namespace {namespace}")

def _create_role_binding(self, namespace: str, service_account_name: str) -> None:
# Creates RoleBinding instance for the given namespace. The role used will be the ClusterRole named by
# EG_KERNEL_CLUSTER_ROLE.
Expand All @@ -327,9 +387,17 @@ def _create_role_binding(self, namespace: str, service_account_name: str) -> Non
role_binding_name = kernel_cluster_role # use same name for binding as cluster role
labels = {"app": "enterprise-gateway", "component": "kernel", "kernel_id": self.kernel_id}
binding_metadata = client.V1ObjectMeta(name=role_binding_name, labels=labels)
binding_role_ref = client.V1RoleRef(
api_group="", kind="ClusterRole", name=kernel_cluster_role
)

# If remote cluster is used, we need to create a role on that cluster
if is_env_true('EG_USE_REMOTE_CLUSTER'):
self._create_role_if_not_exists(namespace=namespace)
# We use namespaced roles on remote clusters rather than a ClusterRole
binding_role_ref = client.V1RoleRef(api_group="", kind="Role", name=kernel_cluster_role)
else:
binding_role_ref = client.V1RoleRef(
api_group="", kind="ClusterRole", name=kernel_cluster_role
)

binding_subjects = client.V1Subject(
api_group="", kind="ServiceAccount", name=service_account_name, namespace=namespace
)
Expand All @@ -341,7 +409,7 @@ def _create_role_binding(self, namespace: str, service_account_name: str) -> Non
subjects=[binding_subjects],
)

client.RbacAuthorizationV1Api().create_namespaced_role_binding(
client.RbacAuthorizationV1Api(api_client=kubernetes_client).create_namespaced_role_binding(
namespace=namespace, body=body
)
self.log.info(
Expand Down
Empty file.
7 changes: 7 additions & 0 deletions enterprise_gateway/services/utils/envutils.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
"""""Utilities to make checking environment variables easier"""
import os


def is_env_true(env_variable_name: str) -> bool:
"""If environment variable is set and value is "TRUE" or "true", then return true. Else return false"""
return bool(os.getenv(env_variable_name, "False").lower() == "true")
Loading