Replies: 5 comments 6 replies
-
Ah, yes, I think I alluded to this in my post too - sorry Justin, it would probably have been better posted here in reply to your comment.
-
Hi Justin,
I think the answer is a combination of yes and no. I'll elaborate...

Nothing in this spec is intended to dictate a push vs. pull model for getting work to a worker, but you are right that the Sessions concept implies either some additional logic in the scheduler/worker communications, or another solution in its place. The concept of a Session was generalized from a similar and useful runtime optimization in Thinkbox's Deadline product. The idea is that you can perform expensive setup operations just once, and then run more than one task against that setup to save compute/wall-clock time, and thus money. We've seen this be particularly useful with applications like Maya, where it can take a very long time just to open the application and load the scene; being able to leverage that load time for more than one task can speed things up significantly.

That all said, it is an optimization, and optimizations should be optional. If a scheduler and worker are stateful and designed for a Sessions-compatible state machine, then hopefully there wouldn't be an issue. If not, then imagine having available a CLI that could be given a job template, a step name, and a series of task numbers (if we linearize the step's parameter space). If that CLI were available on the worker host, as any application would be, then a submission of an OpenJD job to the render management system could be translated into the system's own native internal form by having the tasks in that system simply run the CLI.

Concretely, say you have a job template with one Step containing 200 Tasks, and the artist/TD submits it to the farm. As part of that submission, suppose the tool they use to translate OpenJD to the native form decides to pre-chunk the work into 5 Tasks per Session. Each command that the native render manager runs on the farm would then invoke the CLI with one chunk of task numbers, and when that command runs on the worker, the CLI executes its chunk of tasks within a single session.

How does that sound to you?
-
The answer I posted in #21 (comment) is related, and echoes ideas similar to Daniel's.
-
I can see how the initial decisions are based on the features of a concrete render farm implementation. Looking at the reply and the details of Sessions more, I can see a similarity to what our internal Job Description Framework (Kenobi) does for job state management. First I want to comment on the more general aspect, and then circle back to the state management...

It does appear that OpenJobDescription wants to be more of a framework and not just a specification language. That is clear from the suggestion to rely on CLI tooling like the one Daniel described.

So, all that being said, I can see how OpenJobDescription is really trying to become a Job Description Framework and not just a yaml spec. For the extra features to really work, they have to be enabled behind tooling that accompanies the framework. If this were Kenobi, one could see it implemented for other render farms as a backend for Deadline, with an appropriate state management module if an existing one were not suitable.
-
I don't want to hijack the previous comment from @justinfx too much (#20 (comment))
For this part in particular, I wanted to connect this line of thought to our recent 2/13 releases of the openjd-cli and openjd-sessions packages. I'm hoping these packages provide more context. I know @justinfx you had been trying to get more details of Kenobi up for discussion. Even if that's not possible, I'm hoping we can continue the discussion using the openjd packages and the way they work as examples. openjd-sessions in particular codifies how we run job templates.

Please give these packages a try, @justinfx, @jvanns, and let us know what you think! Also, I wanted to say thanks, because your enthusiasm helped us stay on track to get these in front of you sooner rather than later :)
-
When I was reading through the information on Sessions, I kept wondering whether the job spec implies implementation details for render farm software. The spec seems to take the perspective of decentralized render farms, where a worker node asks for work and says "ok, I am going to start servicing Job 123", then keeps asking the job for tasks until it knows there is nothing left. That makes sense where the worker node has clear boundaries for when to establish and tear down the working location for a job.
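In that decentralized picture, the session lifecycle falls out naturally from the worker's own pull loop. A minimal sketch of what I mean (all names here are illustrative, not from the spec):

```python
# Sketch of the pull model described above: the worker claims a job, does
# the expensive session setup once, pulls tasks until none remain, then
# tears down. The queue and callbacks are stand-ins for real scheduler RPCs.

def service_job(job_id, task_queue, run_task, log):
    """Run every queued task for job_id inside one session lifetime."""
    log.append(f"setup session for {job_id}")    # e.g. create temp dir, load scene
    while task_queue:                            # worker keeps asking for tasks
        task = task_queue.pop(0)
        run_task(task)
    log.append(f"teardown session for {job_id}") # clear boundary: queue is empty

log, done = [], []
service_job("Job 123", [0, 1, 2], done.append, log)
```

The worker itself knows when the job starts and ends, so setup/teardown need no extra coordination.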
Our render farm at Weta, Plow, is a centralized implementation. The worker nodes are told what to do via RPC and aren't really aware of the bounds of any given job. A worker just receives instructions like "run this task with these resources" and actively reports status. So in order to support this spec and satisfy the requirement for Sessions, does that mean we have to build logic into our server side and communicate job boundaries to participating worker nodes, to support the Session temp location on each worker?
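To make the question concrete, one way a push-model worker could approximate Session semantics is to lazily create per-job session state on the first pushed task and rely on the server to push an explicit end-of-job signal for teardown. This is only a sketch of the extra server/worker logic being asked about; every name in it is hypothetical:

```python
# Sketch: a push-model worker that doesn't know job boundaries up front.
# It lazily creates a session workspace per job, and the central server
# must explicitly signal end-of-job so the worker can tear it down.

class PushWorker:
    def __init__(self):
        self.sessions = {}  # job_id -> session workspace (e.g. a temp dir path)

    def run_task(self, job_id, task):
        # Lazy setup: the first task pushed for a job establishes the session.
        if job_id not in self.sessions:
            self.sessions[job_id] = f"/tmp/session-{job_id}"  # placeholder setup
        return (self.sessions[job_id], task)

    def end_of_job(self, job_id):
        # The server-side scheduler must tell the worker the job is done.
        self.sessions.pop(job_id, None)  # teardown the workspace

worker = PushWorker()
worker.run_task("123", 0)
worker.run_task("123", 1)   # reuses the existing session workspace
worker.end_of_job("123")    # explicit boundary pushed from the server
```

The lazy setup is cheap to add worker-side; it's the teardown signal that genuinely requires new server-side logic, which is the crux of the question above.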
I just wonder how much of the spec assumes a particular architecture for any given render farm. There are other similar aspects, like the embedded files feature. Does that imply that any supporting render farm needs to support the injection of arbitrary file data with the job submission?