Hello and a few initial thoughts! #21

jvanns · 2024-01-12T10:40:49Z

jvanns
Jan 12, 2024

Hi there OJD community (a very new and young one at that!). Well this is interesting to see :D A great idea surely pondered by many over the decades, myself included. I've read through these pages and have caught up on Pauline's ASWF presentation, all of which I was unaware of before (I mostly keep quiet over here in the UK). However, having experimented many times before in this domain (Framestore, ILM and at home!) you've piqued my interest again. I love the idea of a standardised submission/description language that is the interface into to whats likely to be a set of very different farm management systems at each studio. Although, and I could be wrong, it did sound like there was a desire within this spec to also impose implementation or interchange within the farm system post-submission? I think you're referring to this as the 'Session' in the runtime description page. I think that could be a tricky thing to agree on since its possibly too far into the weeds of studio specifics. Sure, a decent interchange format or API (remember DRMAA from Grid days?) that presents an abstraction for a job should be achievable but beyond that, once its in the system, I'm unsure how adoptable that would be - especially if it begins to imply an environment or protocol between [running] task and scheduler/dispatcher or whatever component a studio has for managing tasks.

I could certainly get onboard with this. FWIW, since I'm never going to finish it(!), I thought I'd share an example of some humble beginnings I've drafted before. Its quite intentionally not YAML or JSON since the idea was to first focus on something declarative and readable to a regular user - artist or TD. I relied upon ANTLR for the parsing and code generation, which I then used to get objects ready for submission etc. (with a few to having these objects be used by farm submission APIs). As you can see its very wordy with lots of redundancy to make it feel more of a natural language (basically English!). It sits only in your 'How Jobs Are Constructed' remit rather than how they're run (lets see how markdown ruins this!). You could be forgiven for thinking it reminds you of Alfred ;)

# Job Annotation Format
define map "attributes" as {
   set shot to "ayz421"
   set phasers to "stun"
   set number to 1234567
}

define job "foobar" as {
   expect tasks # 'tasks' will form a variable

   include values from map "attributes"
   set title to "Tomy My First JAF job"
}

define task "baz" as {
   expect frames

   define command "echo" as {
      expect frame
      set pi to 3.14159
      requires resources "cpu=1,ram=1"

      execute "echo %frame%" # JAF will perform substitution
   }

   define command "bogus" as {
       # Demonstrates extended resource description format - a separate idea
       requires resources "cpu=1,gpu=1[mem=4096;vendor=nvidia;driver>430.26]"
   }

   loop command "echo" over frames
}

clone task "baz" as "zip"

assign frames from range 1-999 to task "zip"
assign frames from range 1000-2000 by 5 to task "baz"
assign tasks from task "baz" to job "foobar"

That aside, I think the idea of trying to draft a submission format/API/interchange is a great one and it'd be interesting to see how easy or difficult it would be for different studios with different workflows and systems are able to contribute or even where they share commonality.

ddneilson · 2024-01-12T16:31:42Z

ddneilson
Jan 12, 2024

Welcome, James. Happy to have you on board. :-)

Any insight you can offer from your wealth of experience is very much appreciated.

gpu=1[mem=4096;vendor=nvidia;driver>430.26]

This caught my eye; specifically the reference to a driver version. I'd be interested to hear your thoughts on whether the host requirements part of OpenJD might serve this use-case. The OpenJD solution that we're imaging for this sort of thing is to use customer-defined attribute requirements. The idea is that you'd define an attribute that represents the host configuration -- abstracting the specific software and software versions available. We're basically assuming here that a studio with 1000 hosts doesn't have 1000 unique snowflake hosts, but rather a smaller set of host images/configurations that they deploy to those hosts. That would allow you to put something like this in to your Step definition:

hostRequirements
  attributes:
  - name: attr.framestore.london.config
    anyOf:
    - ConfigA
    - ConfigB

Where the values ConfigA/ConfigB are abstracting some set of software that is configured on the host -- say nvidia driver + gcc version + ...

1 reply

jvanns Feb 20, 2024
Author

Personally, I think that may end up getting far too wordy and just look like a sea of YAML! It isn't too dissimilar to something we have at ILM now and its quite an unwieldy wall of text where, in extreme cases, many dozens of lines defining a set of required resources could be represented succinctly in 1 line and a few dozen characters ;) There may not be 1k snowflake hosts but among 4k hosts and many generations of them, there are definitely many batches of different ones - and DRMs like to carve them up even more ;)

mwiebe · 2024-01-12T18:30:06Z

mwiebe
Jan 12, 2024
Collaborator

Thanks for taking a look, it's great to read your thoughts!

Although, and I could be wrong, it did sound like there was a desire within this spec to also impose implementation or interchange within the farm system post-submission? I think you're referring to this as the 'Session' in the runtime description page. I think that could be a tricky thing to agree on since its possibly too far into the weeds of studio specifics.

The problem that this part of the spec is solving for, is how to express a pattern within a job that looks like:

Start a process, and have it load the scene into memory. For example think of a game engine startup vs frame render time.
Schedule many frame renders or other tasks, that run within that process.
Shut down that process and clean up

For job templates that use this structure, a render farm implementation has a choice of how to map that job into its own structure. One option is to use chunking, so that each task runs as a Session, encapsulating the environment enters, task runs, and environment exits as a single schedulable entity. This doesn't impose implementation on the farm system, because all of that logic can live inside the task definition itself. The other option is to make the render farm scheduler aware of the Environment, Step, and Task entities, and to represent the Session directly in the scheduler.

In order for a general job description schema to support both of these implementation choices, it needs to support specifying the job in a way that can be mapped to either, and that's what we've attempted to do. By having clearly defined semantics as documented by the runtime description, I believe that this kind of mapping is possible to create without requiring studio-specific changes that are too disruptive. This needs validation, of course, and I'm curious what examples everyone can come up with from their experience at different studios.

I did a pass through your example to try expressing it as OpenJD yaml, and here's what I came up with. It's definitely more verbose, and it looks like the GPU driver constraint isn't expressed well in what we have now, but I think the mapping worked well. Your syntax reads more like code to me than a data structure, and we avoided that in order to make job templates more easily consumed by GUI and pipeline tooling. Curious what you think of that.

specificationVersion: 'jobtemplate-2023-09'

name: "Tomy My First JAF job"
decription: |
  This is an attempt to translate jvanns' example job into OpenJD format, to help see
  how well it holds up.

parameterDefinitions:
- name: "shot"
  type: STRING
  default: "ayz421"
- name: "phasers"
  type: STRING
  default: "stun"
- name: "number"
  type: INT
  default: 1234567
- name: "frames"
  type: STRING
  default: "1-999"
- name: "frames_zip"
  type: STRING
  default: "1000-2000:5"

steps:
- name: "baz"
  parameterSpace:
    taskParameterDefinitions:
    - name: frame
      type: INT
      range: "{{Param.frames}}"
    - name: pi
      type: FLOAT
      range: [3.14159]
  script:
    actions:
      onRun:
        command: '{{Task.File.echo}}'
    embeddedFiles:
      - name: echo
        runnable: true
        type: TEXT
        data: |
          #!/bin/env bash
          set -euo pipefail

          echo "{{Task.Param.frame}}"
  hostRequirements:
    attributes:
    - name: attr.worker.os.family
      anyOf:
      - linux

- name: "zip"
  parameterSpace:
    taskParameterDefinitions:
    - name: frame
      type: INT
      range: "{{Param.frames_zip}}"
    - name: pi
      type: FLOAT
      range: [3.14159]
  script:
    actions:
      onRun:
        command: '{{Task.File.echo}}'
    embeddedFiles:
      - name: echo
        runnable: true
        type: TEXT
        data: |
          #!/bin/env bash
          set -euo pipefail

          echo "{{Task.Param.frame}}"
  hostRequirements:
    attributes:
    - name: attr.worker.os.family
      anyOf:
      - linux
    amounts:
    - name: amount.worker.vcpu
      min: 1
    - name: amount.worker.memory
      min: 1024


    - name: amount.worker.gpu
      min: 1

- name: "bogus"
  script:
    actions:
      onRun:
        command: 'bash'
        args: ["-c", ""]
  hostRequirements:
    attributes:
    - name: attr.gpu.driver
      anyOf:
      - "430.26"
    amounts:
    - name: amount.worker.vcpu
      min: 1
    - name: amount.worker.gpu
      min: 1
    - name: amount.worker.memory
      min: 4096

0 replies

jvanns · 2024-01-15T11:34:43Z

jvanns
Jan 15, 2024
Author

Hi. Thanks for taking the time to read through and respond! Yes, I see the benefit (and indeed have written & used systems that demonstrate this) of the collaboration between running task (environment) and scheduler where a comms channel can enable richer features such as real-time metrics or load-once-exec-many. In fact, this is also often required for DCC features too (e.g. old netrenders and others similar to it where there were a set of workers and one coordinator, spread across the farm). I guess I just wasn't ready to approach standardising a way of supporting that between different systems that may or may not require that level of sophistication! But everyone needs to get their job into a system, so I'm down with a common description for that! My goal was not so much 'code vs. data' (perhaps language) but rather easily readable to humans hence the redundant, declarative format but overall far less verbose due to more expressive statements. The example I gave was definitely contrived though - more to demonstrate the idea of the language than a bona fide job!

The resource description is actually a separate parser project (extended from a simpler format I wrote for a Mesos-based farm scheduler) but happens to fit nicely into a string in 'JAF'. I don't have a name for it, but its pretty expressive. Here's a few more examples (as you can see they're lifted directly from C code comments where I'd written the example parser & type/value generator rather than rely upon ANTLR as I had for the job description);

// Resource type description for a global catalogue
"maya=500[integer;consumable{shared:user}]";
"nuke=500[integer;consumable{shared:node}]";
"uber-sim=100[integer;consumable{shared:core}]";
"site=3[set{lon,gcp,aws};static]";

// Resouce type description for hosts
"cpu=[real;mandatory;consumable]";
"mem=[real;mandatory;consumable]"; // unit = MiB
"/var/tmp=[integer;consumable]";   // unit = MiB
"gpu=[integer;consumable]";
"latency=[integer;static]";        // unit = milliseconds
"ubuntu=[boolean;static]";
"kernel=[text;static]";

// Resource requirements description for a task
"cpu=8,mem=512,/var/tmp=8192,kernel=5.3.4,ubuntu[optional]";
"cpu=1,mem=256,gpu=1[mem=4096;driver>430.26;vendor=nvidia]";
"cpu=64[ghz>3;avx;sse2;L3=12],mem=65536,site![aws;gcp]";
"cpu=2,mem=2048,latency<2";
"cpu=16,mem=16384,gpu=1[mem=4096;driver>430.26;vendor=ati],redhat";
"cpu=56,mem=262144,/var/tmp=1048576,kernel=5.3.4,ubuntu=T";
"cpu=64,mem=65536,site=aws";

It's really just demonstrable food-for-thought since it isn't used in any production system but rather written as an isolated example for a strict but flexible type+catalogue system that could be easily extended to support new farm features but without a lot of code adjustments.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hello and a few initial thoughts! #21

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Hello and a few initial thoughts! #21

Uh oh!

jvanns Jan 12, 2024

Replies: 3 comments · 1 reply

Uh oh!

ddneilson Jan 12, 2024

Uh oh!

jvanns Feb 20, 2024 Author

Uh oh!

mwiebe Jan 12, 2024 Collaborator

Uh oh!

jvanns Jan 15, 2024 Author

jvanns
Jan 12, 2024

Replies: 3 comments 1 reply

ddneilson
Jan 12, 2024

jvanns Feb 20, 2024
Author

mwiebe
Jan 12, 2024
Collaborator

jvanns
Jan 15, 2024
Author