Description
Recently I’ve observed that EG cannot handle 30 concurrent kernel start requests. Here’s the itest code:
import logging
import unittest
from multiprocessing import Pool

from gateway_client import GatewayClient  # EG itest helper; actual import path may differ

LOG = logging.getLogger(__name__)


def scale_test(kernelspec, example_code, _):
    """Test function for the scalability test."""
    res = True
    gateway_client = GatewayClient()
    kernel = None
    try:
        # The scalability test is a kind of stress test, so expand launch_timeout
        # to our service request timeout.
        kernel = gateway_client.start_kernel(kernelspec)
        if example_code:
            res = kernel.execute(example_code)
    finally:
        if kernel is not None:
            gateway_client.shutdown_kernel(kernel)
    return res


class TestScale(unittest.TestCase):

    def _scale_test(self, spec, test_max_scale):
        LOG.info('Spawn {} {} kernels'.format(test_max_scale, spec))
        example_code = ['print("Hello World")']
        with Pool(processes=test_max_scale) as pool:
            children = []
            for i in range(test_max_scale):
                children.append(pool.apply_async(scale_test,
                                                 (self.KERNELSPECS[spec],
                                                  example_code,
                                                  i)))
            test_results = [child.get() for child in children]
        for result in test_results:
            self.assertRegex(result, "Hello World")

    def test_python3_scale_test(self):
        test_max_scale = int(self.TEST_MAX_PYTHON_SCALE)
        self._scale_test('python3', test_max_scale)

    def test_spark_python_scale_test(self):
        test_max_scale = int(self.TEST_MAX_SPARK_SCALE)
        self._scale_test('spark_python', test_max_scale)
I’ve set LAUNCH_TIMEOUT to 60 seconds and used kernelspecs that were already pulled on the node. In the case of the Spark kernel, the situation got worse because the spark-submit processes launched by EG cause process starvation between the EG process and the other spark-submit processes.
When I ran the test, CPU utilization rose above 90% (on a 4-core, 8 GiB memory instance).
I know that there’s work in progress for HA, but it looks like an Active/Stand-by mode. With that approach we can’t scale EG out, only up. However, “scale up” always has the limitation that we cannot expand the instance beyond the size of the node EG is running on.
For those reasons, I want to start working on the scalability of EG, and I need your opinion on the following ideas. (Let me just assume that EG is running on k8s.)
- Process starvation
  - In order to resolve process starvation in the EG instance, I have two ideas (sketches for both follow this list):
    - Spawn a spark-submit pod and a launch-kubernetes pod instead of launching local processes. Using containers isolates the spark-submit processes from the EG instance.
    - Create another submitter pod. The submitter pod queues the requests from EG and launches processes with a limited process pool. This submitter pod is also scalable even while EG is not yet, because EG always just passes the parameters for launching a process.
- Session Persistence (duplicate of High Availability - session persistence on bare metal machines #562 and Implementing HA Active/Active with distributed file system #594)
  - This is actually a very delicate issue with a high chance of side effects. My idea is to move all session objects into an external in-memory DB such as Redis, so that every time EG needs to access a session, it reads the session information from Redis. I cannot estimate how much of the source would have to be modified since I haven’t looked into the code yet, but I’m guessing I would have to change the Session Manager and Kernel Manager classes (a sketch follows this list). Could anybody give me feedback on this?
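For the first idea, here is a minimal sketch, assuming the official kubernetes Python client, of running spark-submit in its own pod so it cannot starve the EG process. The image name, namespace, and resource limits are placeholders I made up for illustration:

from kubernetes import client, config

def submit_via_pod(kernel_id, spark_args):
    """Run spark-submit in a dedicated pod instead of as a local child process."""
    config.load_incluster_config()  # assumes EG itself runs inside the cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name='spark-submit-{}'.format(kernel_id)),
        spec=client.V1PodSpec(
            restart_policy='Never',
            containers=[client.V1Container(
                name='spark-submit',
                image='my-registry/spark-submit:latest',  # hypothetical image
                command=['spark-submit'] + spark_args,
                # resource limits keep one submit from starving its neighbors
                resources=client.V1ResourceRequirements(
                    limits={'cpu': '1', 'memory': '1Gi'}))]))
    client.CoreV1Api().create_namespaced_pod(namespace='enterprise-gateway', body=pod)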
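For the submitter-pod idea, a small sketch of the queueing behavior: a fixed-size pool drains launch requests, so at most MAX_CONCURRENT_SUBMITS spark-submit processes run at once while the rest wait in the executor’s internal queue. The names and pool size are assumptions:

import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_SUBMITS = 4  # bounded pool; excess requests simply wait

# ThreadPoolExecutor keeps an internal work queue, so requests beyond the
# pool size queue up instead of spawning more spark-submit processes.
submit_pool = ThreadPoolExecutor(max_workers=MAX_CONCURRENT_SUBMITS)

def enqueue_submit(launch_params):
    """Called per EG launch request; launch_params is the argv EG would
    otherwise exec directly, e.g. ['spark-submit', '--master', 'k8s://...']."""
    return submit_pool.submit(subprocess.run, launch_params, check=True)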
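For session persistence, a rough sketch of reading and writing session state through Redis, so any EG replica can recover a kernel’s session. The key layout and stored fields are assumptions; the real change would live in EG’s session and kernel manager classes:

import json
import redis

r = redis.Redis(host='redis', port=6379)  # hypothetical Redis service name

def save_session(kernel_id, session_info):
    """Persist session state externally instead of in EG process memory."""
    r.set('eg:session:{}'.format(kernel_id), json.dumps(session_info))

def load_session(kernel_id):
    """Any EG replica can rebuild the session from the shared store."""
    raw = r.get('eg:session:{}'.format(kernel_id))
    return json.loads(raw) if raw else None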
Through those two resolutions, I think we can scale EG instances out. Any advice will be appreciated.
Thanks.
Environment
Enterprise Gateway Version 2.x with the asynchronous kernel start feature (#580)
Platform: Kubernetes
Others: nb2kg latest