Process starvation in concurrent kernel starts, and scaling out JEG for scalability.

# Description

Recently I observed that EG cannot handle 30 concurrent kernel start requests. Here’s the itest code.
```
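    import unittest
    from multiprocessing import Pool

    # GatewayClient and LOG are assumed to be provided by the surrounding itest module.

    def scale_test(kernelspec, example_code, _):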
        """test function for scalability test"""
        res = True
        gateway_client = GatewayClient()

        kernel = None
        try:
            # scalability test is a kind of stress test, so expand launch_timeout to our service request timeout.
            kernel = gateway_client.start_kernel(kernelspec)
            if example_code:
                res = kernel.execute(example_code)
        finally:
            if kernel is not None:
                gateway_client.shutdown_kernel(kernel)

        return res

    class TestScale(unittest.TestCase):
        def _scale_test(self, spec, test_max_scale):
            LOG.info('Spawn {} {} kernels'.format(test_max_scale, spec))
            example_code = []
            example_code.append('print("Hello World")')

            with Pool(processes=test_max_scale) as pool:
                children = []
                for i in range(test_max_scale):
                    children.append(pool.apply_async(scale_test,
                                                     (self.KERNELSPECS[spec],
                                                      example_code,
                                                      i)))

                test_results = [child.get() for child in children]

            for result in test_results:
                # assertRegex replaces the deprecated assertRegexpMatches alias
                self.assertRegex(result, "Hello World")

        def test_python3_scale_test(self):
            test_max_scale = int(self.TEST_MAX_PYTHON_SCALE)
            self._scale_test('python3', test_max_scale)

        def test_spark_python_scale_test(self):
            test_max_scale = int(self.TEST_MAX_SPARK_SCALE)
            self._scale_test('spark_python', test_max_scale)
```
I set `LAUNCH_TIMEOUT` to 60 seconds and used kernelspecs that had already been pulled onto the node. With Spark kernels the situation got worse, because the `spark-submit` processes launched by EG compete for CPU with the EG process and with each other, which causes process starvation.

During the test, CPU utilization rose above 90% (on a 4-core, 8 GiB memory instance).
![image](https://user-images.githubusercontent.com/8223765/63936539-fecaf900-ca9a-11e9-8b57-86b3ab46d07e.png)

I know that HA work is in progress, but it looks like an Active/Standby mode. With that approach we cannot scale EG out, only up. However, scaling up always has a limit: we cannot grow the EG instance beyond the size of the node it is running on.

For those reasons, I want to start improving the scalability of EG, and I would like your opinion on the following ideas. (Let me assume that EG is running on k8s.)

* Process starvation  
  * To resolve process starvation in the EG instance, I have two ideas.
    * Spawn a `spark-submit` pod and a `launch-kubernetes` pod instead of launching processes. Running them in containers isolates the `spark-submit` process from the EG instance.
    * Create a separate `submitter` pod. The submitter pod queues the requests from EG and launches processes from a limited process pool. The submitter pod can be scaled out even while EG itself is not scalable yet, because EG always passes the parameters for launching a process.

* Session Persistence (duplicate of #562, #594)
  * This is actually a very tricky issue that is likely to cause a lot of side effects. My idea is to move all session objects into an external in-memory DB such as Redis, so that every time EG needs to access a session, it reads the session information from Redis. I cannot estimate how much code would have to be modified, since I have not looked into the source code yet, but I am guessing I would have to change the Session Manager and Kernel Manager classes. Could anybody give me feedback on this?
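
To make the session persistence idea a bit more concrete, here is a minimal sketch of what I have in mind. It is not based on the actual EG source: `InMemorySessionStore` and `GatewayInstance` are hypothetical names, and a plain dict stands in for Redis so the snippet runs on its own. A Redis-backed store would presumably expose the same `save` / `load` / `delete` methods.

```python
import json
from typing import Dict, Optional


class InMemorySessionStore:
    """Stand-in for an external store such as Redis (hypothetical interface)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, kernel_id: str, session: dict) -> None:
        # Sessions are stored as JSON so that any gateway instance can read them.
        self._data[kernel_id] = json.dumps(session)

    def load(self, kernel_id: str) -> Optional[dict]:
        raw = self._data.get(kernel_id)
        return json.loads(raw) if raw is not None else None

    def delete(self, kernel_id: str) -> None:
        self._data.pop(kernel_id, None)


class GatewayInstance:
    """Hypothetical gateway replica that keeps no session state of its own."""

    def __init__(self, name: str, store: InMemorySessionStore) -> None:
        self.name = name
        self.store = store

    def start_kernel(self, kernel_id: str, kernelspec: str) -> None:
        # Record the session in the shared store instead of in local memory.
        self.store.save(kernel_id, {"kernelspec": kernelspec, "started_by": self.name})

    def lookup(self, kernel_id: str) -> Optional[dict]:
        return self.store.load(kernel_id)


store = InMemorySessionStore()
eg_a = GatewayInstance("eg-a", store)
eg_b = GatewayInstance("eg-b", store)

# A kernel started through one instance is visible to the other one.
eg_a.start_kernel("kernel-1", "spark_python")
session_seen_by_b = eg_b.lookup("kernel-1")
print(session_seen_by_b)
```

The point of the sketch is only that each instance goes through the store on every access, so a request for a given kernel can land on any instance.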

With those two solutions, I think we can scale out EG instances. Any advice will be appreciated.
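
For the `submitter` idea, the queueing behaviour could look roughly like the sketch below. This is only an illustration under my own assumptions: `MAX_CONCURRENT_LAUNCHES`, `launch` and `submit_all` are made-up names, and a trivial child Python process stands in for `spark-submit`. The executor queues any request beyond the pool size instead of forking all of them at once.

```python
import subprocess
import sys
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_LAUNCHES = 4  # assumed limit, e.g. the core count of the node

_lock = threading.Lock()
_running = 0
peak_running = 0  # highest number of launches observed running at the same time


def launch(request_id: int) -> str:
    """Run one launch command; a trivial child process stands in for spark-submit."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    try:
        completed = subprocess.run(
            [sys.executable, "-c", f"print('launched {request_id}')"],
            capture_output=True,
            text=True,
            check=True,
        )
        return completed.stdout.strip()
    finally:
        with _lock:
            _running -= 1


def submit_all(num_requests: int) -> list:
    # Requests beyond max_workers wait in the executor's internal queue.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_LAUNCHES) as pool:
        futures = [pool.submit(launch, i) for i in range(num_requests)]
        return [future.result() for future in futures]


results = submit_all(30)
print(len(results), "launches, peak concurrency", peak_running)
```

With 30 requests, at most `MAX_CONCURRENT_LAUNCHES` child processes exist at any moment, so the launches no longer starve each other or the process that accepted the requests.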

Thanks.

# Environment

Enterprise Gateway Version 2.x with Asynchronous kernel start feature (#580)
Platform: Kubernetes
Others: nb2kg latest
