Modern GPUs have dedicated hardware for moving data around (in particular between system memory and device memory), called "copy engines", or "transfer queues" in Vulkan parlance. These can work completely independently from the main graphics pipeline: while the GPU is busy rendering triangles, it can also move texture data or buffer contents in parallel. APIs like Vulkan and DirectX explicitly expose this hardware to the programmer; OpenGL does not. The NVIDIA Quadro Dual Copy Engines white paper describes how to achieve some form of parallelism on professional visualization-class graphics cards, but the approach comes with a couple of downsides:
- Multiple OpenGL contexts: Separate OpenGL contexts had to be created and managed for different operations
- Multithreading required: Each context had to be bound to its own thread, resulting in complex thread management
- Driver roulette: Not all OpenGL copy paths were fully parallelized in the driver, potentially leading to implicit serialization across all OpenGL contexts in play, even if the program did all the right things.
This sample demonstrates the use of Vulkan interop to implement "Asynchronous Copy Engines" for OpenGL. These copy engines allow performing buffer and texture transfers fully in parallel to OpenGL's rendering. Timeline semaphores are used to synchronize the various timelines: CPU, main rendering context and potentially multiple copy engines, all working concurrently.
The AsyncCopyEngine API is designed with specific goals in mind:
- Familiar to OpenGL developers: The API follows OpenGL conventions and naming patterns, making it easy for OpenGL developers to understand and use
- Hide Vulkan complexity: Vulkan and its intricacies are completely hidden from the user - no Vulkan knowledge is required
- Drop-in ready: Designed to be integrated with minimal effort into existing OpenGL applications by minimizing dependencies
The goal is to provide the performance benefits of Vulkan's transfer queues while maintaining the simplicity and familiarity of OpenGL.
The AsyncCopyEngineBuilder class initializes the Vulkan context and offers functionality to create and destroy buffers, textures, semaphores and copy engines. Internally, these objects are created by means of Vulkan/OpenGL interop, allowing both Vulkan and OpenGL to share resources and operate on them at the same time.
AsyncCopyEngine encapsulates a single Vulkan transfer queue, offering functionality to copy data to and from buffers and textures. Naturally, it can only operate on (interop) objects created by the AsyncCopyEngineBuilder.
The execution model of AsyncCopyEngine is that of Vulkan: commands are recorded into command buffers, but not executed immediately. Only after a call to AsyncCopyEngine::flush() do the copy engines get to work. flush() is also the only opportunity to synchronize the copy workload with the main OpenGL graphics context or other copy engines. flush() can insert a semaphore wait for prior work to finish before starting the copy workload, and signal the completion of the copy work.
Internally, AsyncCopyEngine manages two or more Vulkan command buffers, with one command buffer always being in 'recording' state and potentially multiple other command buffers still in flight, performing the copy work. When all command buffers are in use, flush() will wait for the oldest copy workload to finish. Since this would introduce a CPU stall, the number of command buffers AsyncCopyEngine may work on is flexible and can be adjusted.
Neither AsyncCopyEngineBuilder nor AsyncCopyEngine will disturb any OpenGL binding points (or any other OpenGL state for that matter).
Timeline semaphores are used throughout AsyncCopyEngine to track command buffer lifetime and work progress as well as synchronizing the copy engine work back and forth with the OpenGL command stream. They offer multiple advantages over binary semaphores:
- Multiple waiters: Multiple operations can wait for the same timeline value
- Ease of use: Enable complex multi-buffered schemes where the CPU can produce commands ahead of the GPU
- Complex execution schemes: Support sophisticated multi-engine workflows with complex dependencies
Simple example of using the copy engine to download a texture into system memory:
// Initialize the builder (creates Vulkan context and discovers transfer queues)
AsyncCopyEngineBuilder builder;
builder.init();
// Create a copy engine; multiple engines can be created for different workloads
auto copyEngine = builder.createCopyEngine(4);
// Create shared resources
GLuint buffer = builder.createBuffer(size, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
GLuint texture = builder.createTexture(GL_TEXTURE_2D, 1, GL_RGBA8, m_renderSize.x, m_renderSize.y);
GLuint renderDoneSemaphore = builder.createSemaphore(0);
GLuint copyDoneSemaphore = builder.createSemaphore(0);
// Map buffer for CPU access
void* mappedPtr = builder.mapBuffer(buffer, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
Example of copying a texture into a mapped buffer object:
// Record copy engine command(s)
// The call to getTextureSubImage() is placed here to show that it is possible to record
// copy operations out of line with the OpenGL command stream. Only the call to flush()
// down below embeds the copy with the OpenGL stream via glSignalSemaphore() and glWaitSemaphore()
copyEngine->getTextureSubImage(texture, 0, 0, 0, 0, width, height, 1, buffer, 0);
// Submit copy engine work to be executed in parallel to OpenGL rendering
GLuint dstLayout = GL_LAYOUT_TRANSFER_SRC_EXT;
builder.glSignalSemaphore(renderDoneSemaphore, waitValue, 1, &buffer, 1, &texture, &dstLayout);
// execute the copy
copyEngine->flush(renderDoneSemaphore, waitValue++, copyDoneSemaphore, signalValue);
... do something else in the meantime ...
// make GL wait for the copy
builder.glWaitSemaphore(copyDoneSemaphore, signalValue++, 1, &buffer, 1, &texture, &dstLayout);
// copy is done here, we can do some more rendering with the texture
// builder.clientWaitSemaphore() can be used to wait for the copy on the CPU for further
// processing of the texture data

Example of copying data from a mapped staging buffer object into a device-local buffer:
// Create shared resources
GLuint mappedBuffer = builder.createBuffer(size, GL_MAP_WRITE_BIT |GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
GLuint deviceBuffer = builder.createBuffer(size, 0);
GLuint renderDoneSemaphore = builder.createSemaphore(0);
GLuint copyDoneSemaphore = builder.createSemaphore(0);
// Map buffer for CPU access
void* mappedPtr = builder.mapBuffer(mappedBuffer, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
...
// Make sure the previous copy has finished before overwriting the staging buffer
builder.clientWaitSemaphore(copyDoneSemaphore, copyDoneTimelineValue++);
memcpy(mappedPtr, srcData, srcDataSize);
// Record copy engine command(s)
copyEngine->copyBufferSubData(mappedBuffer, deviceBuffer, 0, 0, srcDataSize);
// Submit copy engine work to be executed in parallel to OpenGL rendering.
// Make sure prior OpenGL rendering is done with 'deviceBuffer'
GLuint buffers[] = {mappedBuffer, deviceBuffer};
builder.glSignalSemaphore(renderDoneSemaphore, renderDoneTimelineValue, 2, buffers);
// execute the copy
copyEngine->flush(renderDoneSemaphore, renderDoneTimelineValue++, copyDoneSemaphore, copyDoneTimelineValue);
...
// make GL wait for the copy
builder.glWaitSemaphore(copyDoneSemaphore, copyDoneTimelineValue, 2, buffers);
glBindBuffer(GL_ARRAY_BUFFER, deviceBuffer);
... render stuff ...
Timeline semaphores are a modern, flexible and very convenient way to track GPU workload progress. They are part of core Vulkan 1.2 and exposed to OpenGL through the NV_timeline_semaphore extension.
AsyncCopyEngineBuilder::glSignalSemaphore() and AsyncCopyEngineBuilder::glWaitSemaphore() wrap the OpenGL use of
NV_timeline_semaphore conveniently. In addition, AsyncCopyEngineBuilder::clientWaitSemaphore() brings the ability to
wait on a semaphore on the CPU side.
// OpenGL signals data is ready
builder.glSignalSemaphore(uploadSemaphore, frameNumber, 1, &buffer, 0, nullptr, nullptr);
// Upload engine waits for data, then signals completion
uploadEngine->flush(uploadSemaphore, frameNumber, uploadCompleteSemaphore, frameNumber);
// render something else while the upload happens
glDrawElements(...);
// dovetail OpenGL back with upload timeline
builder.glWaitSemaphore(uploadCompleteSemaphore, frameNumber, 1, &buffer, 0, nullptr, nullptr);
By multi-buffering the staging and vertex buffer VBO as well as the render texture, multiple frames' worth of work can be recorded by the CPU ahead of time. The timelines of the three engines at work can overlap, effectively increasing render throughput by 'hiding' the upload of frame N's vertex data in frame N-1's rendering time, while overlapping frame N's texture download with frame N+1's rendering.
Here's how the engines work together:
Frame N: [Upload Engine] [Graphics Engine] [Download Engine]
Frame N+1: [Upload Engine] [Graphics Engine] [Download Engine]
Frame N+2: [Upload Engine] [Graphics Engine] [Download Engine]
Each engine operates independently, with timeline semaphores coordinating when data is ready for the next stage. So while the upload engine is busy copying data for frame N+1, the graphics engine can be rendering frame N, and the download engine can be reading back frame N-1.
The choice between single and dual copy engines, along with the number of staging buffers, significantly impacts the achievable parallelism between copy operations and rendering:
Single copy engine: Essentially serializes all download and upload activity, regardless of how many staging buffers are available. However, if the combined upload and download time is less than the render time, all copy activity can be 'hidden' in the render time: downloading the last frame and uploading the next frame effectively run in parallel to rendering the current frame (as long as there are at least two staging buffers).

1 Copy Engine, 1 Staging buffer. No achievable overlap, since all operations have to be serialized.

1 Copy Engine, 2 Staging buffers. Still no achievable overlap, since all copy operations are executed serially, which in turn serializes rendering with the copies.

1 Copy Engine, 2 Staging buffers, Next-Frame-Upload. Now the copy operations can overlap with rendering.
Dual copy engines: Upload and download can actually happen at the same time (PCIe is full duplex), but if there are fewer than two staging buffers, the copy activity cannot overlap with the render activity, because the GPU depends on having the render data and render texture ready.

2 Copy Engines, 1 Staging buffer. Copies can overlap each other, but rendering is serialized with the copies.

2 Copy Engines, 2 Staging buffers. Overlap of rendering, vertex buffer upload and texture download.

2 Copy Engines, 2 Staging buffers, Next-Frame-Upload. Upload frame N+1's data during frame N. Copy operations overlap with each other and with rendering.
More than 2 staging buffers has no effect on the copy and render overlap. It merely helps the CPU to queue up more frames ahead of time.
In conclusion, 2 staging buffers, 1 copy queue and "upload next frame data" is the least resource-intensive scheme that allows for maximum parallel activity of graphics engine, copy engine and CPU, but it requires a slightly trickier setup and may not be suitable for all use cases.
- Use multiple engines: Separate upload and download operations into different engines if you can make them overlap and can afford multiple copy engines.
- Overlap operations: Start next frame's upload while current frame is rendering
- Batch operations: Group related operations in the same command buffer and try to queue as many copy operations as possible for each call to flush(). A flush triggers a Vulkan vkQueueSubmit() call, which can be expensive.
- Use Nsight Systems: Nsight Systems is an excellent tool for profiling OpenGL and Vulkan applications; in particular, it shows the timelines of the various GPU engines and the CPU in relation to each other. It helps assess the achieved parallelism and overlap within GPU/CPU frames as well as between the copy engines and the graphics engine.
Important: The current implementation assumes that the user allocates only a moderate number of staging buffers and
textures. Each buffer and texture allocated through AsyncCopyEngineBuilder uses a dedicated Vulkan allocation. This
approach is suitable for applications with dozens of interop resources, but may cause issues with hundreds or thousands
of allocations. It is not advised to allocate all of the application's textures and buffers through the AsyncCopyEngine
class. Interop textures and buffers severely limit OpenGL's ability to manage these resources under memory pressure.
Instead, consider keeping only a few interop staging textures and buffers around that you use to move the data across
the PCIe bus using the AsyncCopyEngine. Once the data is on the GPU, you can move it to its final destination
using commands like glCopyBufferSubData(), glTexSubImage*() and glBlitFramebuffer(). Conversely, if you want to read many textures
back, consider keeping two interop buffers around (one on the device, one mapped), perform the texture readback into
the on-device buffer, followed by a download copy into the mapped buffer:
// Create shared resources
GLuint deviceBuffer = builder.createBuffer(size, 0);
GLuint hostBuffer = builder.createBuffer(size, GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT |
GL_MAP_COHERENT_BIT);
uint8_t *hostPointer = (uint8_t *)builder.mapBuffer(hostBuffer, GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT |
GL_MAP_COHERENT_BIT);
...
imageDownloadSemaphoreValue = imageDownloadSemaphoreValue + 1;
// Copy texture on-device into deviceBuffer
// Make sure the last frame's copy is done and has relinquished 'deviceBuffer'
builder.glWaitSemaphore(imageDownloadSemaphore, imageDownloadSemaphoreValue-1);
glBindBuffer(GL_PIXEL_PACK_BUFFER, deviceBuffer);
glGetTextureSubImage(texture, 0, 0, 0, 0, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, size, 0);
builder.glSignalSemaphore(imageReadySemaphore, imageReadySemaphoreValue);
// Download across PCIe using async copy engine
uploadEngine->copyBufferSubData(deviceBuffer, hostBuffer, 0, 0, size);
uploadEngine->flush(imageReadySemaphore, imageReadySemaphoreValue, imageDownloadSemaphore, imageDownloadSemaphoreValue);
// Do some more rendering in parallel to the download
postProcess();
// in a second thread, we might want to wait for the copy to be done and do some more CPU processing like writing to disk
builder.clientWaitSemaphore(imageDownloadSemaphore, imageDownloadSemaphoreValue);
writeToDisk(hostPointer, size);
// let the main thread know it can write to the hostBuffer again
signalMainThreadWritingDone();
The AsyncCopyEngine API supports multi-threaded usage with specific synchronization requirements.
- AsyncCopyEngineBuilder: Object creation and destruction must be externally synchronized. Multiple threads cannot safely create or destroy resources simultaneously.
- AsyncCopyEngine instances: Each engine can be operated on in different threads, including submitting work via flush().
Note that it is not strictly necessary to utilize multithreading for maximum performance. The AsyncCopyEngine sample itself does not make use of extra threads. The CPU just needs to be able to schedule enough work ahead of time for the GPU to execute.
// Thread 1: Resource creation (must be synchronized)
AsyncCopyEngineBuilder builder;
builder.init();
auto engine1 = builder.createCopyEngine(4);
auto engine2 = builder.createCopyEngine(4);
// Thread 2: Dedicated to engine1
engine1->copyBufferSubData(src1, dst1, 0, 0, size1);
engine1->flush(waitSem1, waitVal1, signalSem1, signalVal1);
// Thread 3: Dedicated to engine2
engine2->copyBufferSubData(src2, dst2, 0, 0, size2);
engine2->flush(waitSem2, waitVal2, signalSem2, signalVal2);

Note: AsyncCopyEngineBuilder::glWaitSemaphore and AsyncCopyEngineBuilder::glSignalSemaphore operate on the GL context bound to the current thread.
The sample's UI lets one play with these scenarios and shows the impact on the timeline of operations on the GPU immediately. The workload for each engine (upload/render/download) can be artificially increased to observe the impact it has on the framerate.
- next frame upload: better scheduling of transfers reduces dependencies and allows for more parallelism between transfers and rendering
- single vs dual copy engines: in transfer-limited scenarios, it can make a big difference that transfers can happen in both directions at the same time
- number of active staging buffers: the difference between single- and double buffered staging resources can be stark
- upload repeat count: artificially increase upload workload
- download repeat count: artificially increase download workload. Changing the render resolution will have a similar effect.
- Torus Grid settings: artificially increase or decrease rendering workload
Core Vulkan 1.2 functionality and a couple of platform specific extensions are used:
- Windows: VK_KHR_external_memory_win32, VK_KHR_external_semaphore_win32
- Linux: VK_KHR_external_memory_fd, VK_KHR_external_semaphore_fd
Required OpenGL extensions:
- GL_EXT_memory_object
- GL_EXT_semaphore
- GL_NV_timeline_semaphore
- Platform-specific extensions for handle import: EXT_external_objects_fd, EXT_external_objects_win32
OpenGL, Vulkan, interop, asynchronous copy engine, timeline semaphores
