Skip to content

Process starvation in concurrent kernel starts, and scaling out JEG for scalability. #732

Open
@esevan

Description

@esevan

Description

Recently I’ve monitored EG cannot handle concurrent 30 kernel start requests. Here’s the itest code.

    def scale_test(kernelspec, example_code, _):
        """test function for scalability test"""
        res = True
        gateway_client = GatewayClient()

        kernel = None
        try:
            # scalability test is a kind of stress test, so expand launch_timeout to our service request timeout.
            kernel = gateway_client.start_kernel(kernelspec)
            if example_code:
                res = kernel.execute(example_code)
        finally:
            if kernel is not None:
                gateway_client.shutdown_kernel(kernel)

        return res

    class TestScale(unittest.TestCase):
        def _scale_test(self, spec, test_max_scale):
            LOG.info('Spawn {} {} kernels'.format(test_max_scale, spec))
            example_code = []
            example_code.append('print("Hello World")')

            with Pool(processes=test_max_scale) as pool:
                children = []
                for i in range(test_max_scale):
                    children.append(pool.apply_async(scale_test,
                                                     (self.KERNELSPECS[spec],
                                                      example_code,
                                                      i)))

                test_results = [child.get() for child in children]

            for result in test_results:
                self.assertRegexpMatches(result, "Hello World")

        def test_python3_scale_test(self):
            test_max_sacle = int(self.TEST_MAX_PYTHON_SCALE)
            self._scale_test('python3', test_max_scale)

        def test_spark_python_scale_test(self):
            test_max_sacle = int(self.TEST_MAX_SPARK_SCALE)
            self._scale_test('spark_python', test_max_scale)

I’ve set LAUNCH_TIMEOUT to 60 seconds, and used kernelspecs already pulled in the node. In case of Spark kernel, the situation got worse because spark-submit processes launched by EG makes process starvation among EG process and other spark-submit processes.

When I did the test, CPU utilization rose up to more than 90%. (4 core, 8GiB memory instance)
image

I know that there’s work for HA in progress, but it looks like Active / Stand-by mode. In that approach, we couldn’t make EG scale-out, but scale-up. However, “Scale Up” always has limitations in that we cannot expand our instance to the size bigger than the node EG is running on.

In those reasons, I want to start to increase the scalability of EG, and need your opinion about the following idea. (Let me just assume that EG is running on k8s)

  • Process starvation

    • In order to resolve process starvation in EG instance, I have two ideas.
      • Spawn spark-submit pod and launch-kubernetes pod instead of launching processes. Using container, isolate the spark-submit process from EG instance.
      • Create another submitter pod. submitter pod queues the requests from EG, and launch processes with limited process pool. This submitter pod is also scalable while EG is not scalable yet because EG always passes the parameters for launching a process.
  • Session Persistence (duplicate of High Availability - session persistence on bare metal machines #562, Implementing HA Active/Active with distributed file system #594 )

    • This is actually very flaky issue which has high possibility to make a lot of side effects. My idea is to move all session objects into external in-memory db such as Redis. So everytime EG needs to access session, it reads the session information from the Redis. I cannot estimate how many sources should be modified and never looked into the source code yet. But I’m guessing I have to change Session Manager and Kernel Manager classes. Could anybody give me a feedback about this?

Through those two resolutions, I think we can scale out EG instances. Any advices will be appreciated.

Thanks.

Environment

Enterprise Gateway Version 2.x with Asynchronous kernel start feature (#580)
Platform: Kubernetes
Others: nb2kg latest

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions