This document describes the high-level architecture of the Intel(R) Graphics Compute Runtime (NEO): how the source tree is organized, how the major components relate to each other, and how a request travels from an application API call down to a hardware submission. It is intended as an entry point for readers who want to understand the project as a whole before diving into a specific subsystem.
For build instructions see BUILD.md. For coding and contribution rules see GUIDELINES.md and CONTRIBUTING.md. For feature-specific deep dives see the Programmers Guide.
- Layered structure
- Top-level source layout
- The shared core
- Object and ownership model
- Hardware abstraction
- Operating-system abstraction
- Memory management
- Command submission path
- Compiler interface
- End-to-end runtime flows
- Testing layout
NEO is a single codebase that exposes two compute APIs on top of a common core. There are three top-level layers:
+-----------------------------+ +-----------------------------+
| opencl/ | | level_zero/ |
| OpenCL(TM) frontend | | oneAPI Level Zero frontend|
| (cl_* entry points, | | (ze*/zes* entry points, |
| CommandQueue, MemObj, | | CommandList, Module, |
| Kernel, Program, ...) | | Kernel, Event, ...) |
+--------------+--------------+ +--------------+--------------+
| |
+----------------+----------------+
|
+----------------v----------------+
| shared/ |
| Common runtime core: |
| Device, CommandStreamReceiver,|
| MemoryManager, command |
| container/encoder, compiler |
| interface, helpers, HW + OS |
| abstraction |
+----------------+----------------+
|
+----------------v----------------+
| OS kernel-mode driver + HW |
| (Linux i915/xe DRM, Windows |
| WDDM) |
+---------------------------------+
opencl/andlevel_zero/are API frontends. They translate API objects and calls into operations on shared-core objects. They do not talk to the hardware or OS directly.shared/is the runtime core. It owns device enumeration, memory management, command generation, submission, synchronization, and all hardware- and OS-specific behavior.
A key consequence: most non-trivial functional behavior lives in shared/, and
the two frontends are comparatively thin. When changing behavior, prefer
implementing it once in shared/ and exposing it through both frontends rather
than duplicating it.
| Path | Responsibility |
|---|---|
shared/source/ |
The runtime core (see The shared core). |
shared/test/ |
Unit tests (ULTs) for shared code. |
shared/offline_compiler/ |
ocloc offline kernel compiler tool. |
opencl/source/ |
OpenCL API implementation. |
opencl/test/ |
OpenCL ULTs. |
level_zero/core/ |
Level Zero core API (devices, command lists/queues, modules, kernels, events). |
level_zero/sysman/ |
Level Zero System Management (power, frequency, temperature, RAS, ...). |
level_zero/tools/ |
Level Zero tools API (metrics, debug, sysman tools). |
level_zero/experimental/ |
Experimental Level Zero extensions. |
programmers-guide/ |
Feature-oriented design and optimization guides. |
third_party/ |
Vendored external headers and sources. |
manifests/ |
Component revision pins used by the build / CI. |
Within each API frontend the object directories follow the API's own
vocabulary: OpenCL has platform, cl_device, context, command_queue,
mem_obj, program, kernel, event, sampler; Level Zero has driver,
device, context, cmdqueue, cmdlist, module, kernel, event,
fence, image, sampler.
The most important directories under shared/source/:
| Directory | Responsibility |
|---|---|
execution_environment/ |
Process-wide root object and per-root-device environment. |
device/ |
Device, RootDevice, SubDevice and device capabilities. |
command_stream/ |
CommandStreamReceiver (CSR) - command buffer assembly, submission, synchronization. |
command_container/ |
CommandContainer and the Encode* command encoders that emit hardware commands. |
commands/ |
Hardware command and structure definitions. |
memory_manager/ |
MemoryManager, GraphicsAllocation, GPU address-space partitioning, host-pointer tracking. |
direct_submission/ |
Ring-buffer submission path that bypasses per-submission kernel-mode calls. |
os_interface/ |
OS abstraction (OSInterface, device discovery, contexts) with Linux and Windows backends. |
compiler_interface/ |
Interface to the Intel Graphics Compiler (IGC) and kernel build flow. |
built_ins/ |
Built-in kernels (copy, fill, ...) and the System Routine (SIP) for debugging. |
device_binary_format/ |
Parsing/packing of device binaries (zebin, patchtokens). |
program/ |
Program/kernel binary handling shared by both APIs. |
helpers/ |
Hardware- and product-aware helper classes (see below). |
release_helper/ |
Per-IP-release feature/workaround switches. |
gmm_helper/ |
Integration with GmmLib (tiling, cache settings, page tables). |
debug_settings/ |
Debug keys / environment-variable controlled behavior. |
debugger/ |
GPU debugger (tools) support. |
gen12lp/, xe_hpg_core/, xe_hpc_core/, xe2_hpg_core/, xe3_core/, xe3p_core/ |
Per-graphics-family hardware code (see Hardware abstraction). |
NEO builds a tree of long-lived objects during initialization. Ownership flows top-down; child objects hold non-owning back-pointers to their parents.
ExecutionEnvironment (one per process)
|- MemoryManager (one, shared by all root devices)
|- DirectSubmissionController
|- OsEnvironment
|- rootDeviceEnvironments[] (one RootDeviceEnvironment per root device)
|- HardwareInfo (platform, feature table, capabilities)
|- OSInterface (DRM or WDDM backend)
|- GfxCoreHelper / ProductHelper / ReleaseHelper / CompilerProductHelper
|- GmmHelper, CompilerInterface, BuiltIns, Debugger, ...
|- MemoryOperationsHandler
Device (RootDevice) (created per root device, references its RootDeviceEnvironment)
|- SubDevice[] (one per tile on multi-tile parts)
|- allEngines : EngineControl[] (one entry per engine instance the device uses)
|- CommandStreamReceiver (owns command buffers, tag/fence, residency)
|- OsContext (the OS-level submission context for an engine)
ExecutionEnvironmentis the process-wide root. It owns the singleMemoryManagerand the vector ofRootDeviceEnvironmentobjects. It is reference-counted so it outlives every device that references it.RootDeviceEnvironmentholds everything that is per-physical-device but not per-API-object: theHardwareInfo, theOSInterface, and the whole set of hardware/product helper objects.Deviceis the runtime's notion of a usable device. ARootDevicemay exposeSubDevicechildren (one per tile). ADeviceowns its engines.EngineControlpairs aCommandStreamReceiverwith anOsContext. Each engine instance (e.g. a compute engine, a copy engine) that the device uses has its ownEngineControl.OsContextcarries the engine type and usage (regular, low/high priority, internal, cooperative).
The API frontends wrap these: an OpenCL ClDevice or a Level Zero Device
(L0) holds a pointer to the shared Device; a CommandQueue/CommandList
ultimately routes work to an EngineControl's CommandStreamReceiver.
NEO supports many graphics families from a single codebase using template
specialization on a GfxFamily type plus a file-naming convention. This is
the most distinctive structural pattern in the tree and is described in detail
in Hardware Abstraction. In short:
- Family-independent algorithms are written once as templates
(
class CommandStreamReceiverHw<GfxFamily>,struct EncodeDispatchKernel<GfxFamily>, ...). TheGfxFamilytype supplies the concrete hardware command structs (e.g.GfxFamily::PIPE_CONTROL). - Implementation bodies live in
.inlfiles whose suffix encodes the range of families they apply to (e.g._base.inl,_xehp_and_later.inl,_from_xe_hpc_to_xe3.inl). Per-family.cppfiles in the family directories instantiate the templates for a concrete family. - Behavioral differences that are not worth a full template specialization are
expressed through helper objects selected at runtime:
GfxCoreHelper- architecture-level (per graphics family) behavior.ProductHelper- product/SKU-level behavior and workarounds.ReleaseHelper- per-IP-release feature and workaround switches.CompilerProductHelper- product-specific compiler-facing details.AILConfiguration(Application Intelligence Layer) - per-application behavior tweaks based on the running process.
OS-specific behavior is isolated behind OSInterface and a set of backends.
Two operating systems are supported from the same sources: Linux (using the
i915 or xe DRM kernel-mode driver) and Windows (using WDDM). The Linux path is
the one that is buildable from this repository; the Windows path shares the
codebase but is built elsewhere.
The convention used throughout shared/source/os_interface/ (and mirrored in
other directories) selects the backend by file-name suffix:
*_drm.cpp- Linux/DRM implementation.*_wddm.cpp- Windows/WDDM implementation.*_drm_or_wddm.cpp- selects between the two at build time when a single binary may target either.linux/andwindows/subdirectories hold larger backend-specific bodies.
The same suffix convention appears in direct_submission/, os_interface/, and
elsewhere. The OS layer is responsible for device discovery, allocating and
mapping memory through the kernel-mode driver, creating submission contexts
(OsContext), and submitting command buffers.
A single MemoryManager (owned by ExecutionEnvironment) serves all root
devices. Allocations are represented by GraphicsAllocation objects; a
MultiGraphicsAllocation groups the per-root-device copies of one logical
allocation.
Key concepts:
- Allocation type (
allocation_type.h) - the purpose of an allocation (buffer, image, kernel ISA, internal heap, command buffer, ...). It drives placement, caching, and residency decisions. GfxPartition- partitions the GPU virtual address space into heaps (e.g. standard, standard64K, SVM, heap for allocations addressable in 32 bits). See Allocations greater than 4GB.- Host-pointer management (
HostPtrManager) - tracks user-provided host memory that must be made resident for the GPU. - Residency - before a submission, every
GraphicsAllocationthe work touches must be made resident in the engine's context. The CSR collects these into a residency container and the OS layer commits them. - Deferred deletion - allocations still referenced by in-flight work are retired asynchronously rather than freed immediately.
For Unified Shared Memory placement details see System Memory Allocations.
Work reaches the hardware through the CommandStreamReceiver (CSR). The CSR is
the heart of submission and synchronization:
- The frontend records work into a command buffer (a
LinearStreambacked by aGraphicsAllocation), using theEncode*encoders incommand_container/to emit family-correct hardware commands. - State that must precede the work (state base address, pipeline select,
preemption, scratch space, ...) is programmed by the CSR via
flushTask/flushImmediateTask. - Allocations touched by the work are added to the residency set.
- The batch buffer is flushed: either through the classic per-submission
path (a kernel-mode submit call) or through direct submission, where the
driver keeps a ring buffer resident and rings a doorbell, avoiding a
kernel-mode call per submit (see
shared/source/direct_submission/). - Completion is tracked through a monotonically increasing task count written by the GPU to a tag allocation; host waits poll or block on this tag (with an optional kernel-mode notify fallback).
CommandContainer owns the command buffer and the indirect heaps (surface
state, dynamic state) for a sequence of commands. The Encode* structs (e.g.
EncodeDispatchKernel, EncodeStateBaseAddress, EncodeSemaphore) are the
single place where hardware command bits are written, which keeps command
generation consistent across both APIs.
For the immediate-command-list submission model see Immediate Commandlist.
Kernels are compiled by the Intel Graphics Compiler (IGC), reached through
compiler_interface/. The runtime passes source (OpenCL C, or SPIR-V for Level
Zero) and build options to IGC and receives a device binary.
device_binary_format/ parses the result (zebin or legacy patchtokens) into a
KernelDescriptor describing arguments, binding-table layout, required SLM,
and other dispatch metadata. The runtime also ships built-in kernels
(built_ins/) for operations like copy and fill, and a System Routine
(SIP) used during debugging. Compiled binaries can be cached on disk; see
NEO Compiler Cache.
Initialization / device discovery
ExecutionEnvironmentis created andprepareRootDeviceEnvironmentsis called.- The OS layer discovers devices (
device_factory.cpp/discover_devices_*), creating oneRootDeviceEnvironmentper device and populatingHardwareInfoandOSInterface. - Helper objects (
GfxCoreHelper,ProductHelper,ReleaseHelper,CompilerProductHelper) are constructed for each device's family/product. - The shared
MemoryManageris initialized. Device/RootDeviceobjects are created, sub-devices and engines are set up, and each engine gets aCommandStreamReceiver+OsContext.- The API frontend wraps the shared devices and exposes them through the API.
Memory allocation
- The frontend requests an allocation from the
MemoryManagerwithAllocationProperties(size, type, flags). - The manager chooses a heap via
GfxPartition, applies product/GMM cache and tiling settings, and asks the OS layer to back and map the memory. - A
GraphicsAllocation(or per-device set in aMultiGraphicsAllocation) is returned and later made resident at submission time.
Kernel dispatch and submission
- The frontend resolves the kernel (already compiled into a
KernelDescriptor) and the dispatch parameters (global/local sizes). - Commands are encoded into the command buffer via
EncodeDispatchKerneland related encoders; required heaps and state are programmed. - Touched allocations are added to the residency set.
- The CSR flushes the batch buffer (classic or direct submission).
- The host synchronizes on the resulting task count when it needs the results.
Hot dispatch paths (zeCommandListAppendLaunchKernel,
clEnqueueNDRangeKernel) are performance-critical; new non-default behavior
should be gated behind early-outs so these paths stay fast (see
GUIDELINES.md).
Tests are unit-level tests (ULTs) co-located with the code they validate:
shared/test/unit_test- shared core.level_zero/core/test/unit_tests- Level Zero.opencl/test/unit_test- OpenCL.
Multi-threaded code is tested in the dedicated mt_tests suites. AUB tests
validate the exact command stream against a simulator. See
GUIDELINES.md for testing conventions.