Modern GPUs have dedicated hardware for moving data around (in particular between system memory and device memory), called "copy engines", or "transfer queues" in Vulkan parlance. These can work completely independently from the main graphics pipeline: while the GPU is busy rendering triangles, it can also move texture data or buffer contents in parallel. APIs like Vulkan and DirectX explicitly expose this hardware to the programmer; OpenGL does not. The NVIDIA Quadro Dual Copy Engines white paper describes how to achieve some form of parallelism on professional visualization-class graphics cards, but the approach comes with a couple of downsides:
- Multiple OpenGL contexts: Separate OpenGL contexts had to be created and managed for different operations
- Multithreading required: Each context had to be bound to its own thread, resulting in complex thread management
- Driver roulette: Not all OpenGL copy paths were fully parallelized in the driver, potentially leading to implicit serialization across all OpenGL contexts in play, even if the program did all the right things.
This sample demonstrates the use of Vulkan interop to implement "Asynchronous Copy Engines" for OpenGL. These copy engines allow performing buffer and texture transfers fully in parallel to OpenGL's rendering. Timeline semaphores are used to synchronize the various timelines: CPU, main rendering context and potentially multiple copy engines, all working concurrently.
The AsyncCopyEngine API is designed with specific goals in mind:
- Familiar to OpenGL developers: The API follows OpenGL conventions and naming patterns, making it easy for OpenGL developers to understand and use
- Hide Vulkan complexity: Vulkan and its intricacies are completely hidden from the user - no Vulkan knowledge is required
- Drop-in ready: Designed to be integrated with minimal effort into existing OpenGL applications by minimizing dependencies
The goal is to provide the performance benefits of Vulkan's transfer queues while maintaining the simplicity and familiarity of OpenGL.
The AsyncCopyEngineBuilder class initializes the Vulkan context and offers functionality to create and destroy buffers, textures, semaphores and copy engines. Internally, these objects are created by means of Vulkan/OpenGL interop, allowing both Vulkan and OpenGL to share resources and operate on them at the same time.
AsyncCopyEngine encapsulates a single Vulkan transfer queue, offering functionality to copy data to and from buffers and textures. Naturally, it can only operate on (interop) objects created by the AsyncCopyEngineBuilder.
The execution model of AsyncCopyEngine is that of Vulkan: commands are recorded into command buffers, but not executed immediately. Only after a call to AsyncCopyEngine::flush() do the copy engines get to work. flush() is also the only opportunity to synchronize the copy workload with the main OpenGL graphics context or other copy engines. flush() can insert a semaphore wait for prior work to finish before starting the copy workload, and signal the completion of the copy work.
Internally, AsyncCopyEngine manages two or more Vulkan command buffers, with one command buffer always being in 'recording' state and potentially multiple other command buffers still in flight, performing the copy work. When all command buffers are in use, flush() will wait for the oldest copy workload to finish. Since this would introduce a CPU stall, the number of command buffers AsyncCopyEngine may work on is flexible and can be adjusted.
Neither AsyncCopyEngineBuilder nor AsyncCopyEngine will disturb any OpenGL binding points (or any other OpenGL state for that matter).
Timeline semaphores are used throughout AsyncCopyEngine to track command buffer lifetime and work progress as well as synchronizing the copy engine work back and forth with the OpenGL command stream. They offer multiple advantages over binary semaphores:
- Multiple waiters: Multiple operations can wait for the same timeline value
- Ease of use: Enable complex multi-buffered schemes where the CPU can produce commands ahead of the GPU
- Complex execution schemes: Support sophisticated multi-engine workflows with complex dependencies
Simple example of using the copy engine to download a texture into system memory:
// Initialize the builder (creates Vulkan context and discovers transfer queues)
AsyncCopyEngineBuilder builder;
builder.init();
// Create a copy engine; multiple engines can be created for different workloads
auto copyEngine = builder.createCopyEngine(4);
// Create shared resources
GLuint buffer = builder.createBuffer(size, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
GLuint texture = builder.createTexture(GL_TEXTURE_2D, 1, GL_RGBA8, m_renderSize.x, m_renderSize.y);
GLuint renderDoneSemaphore = builder.createSemaphore(0);
GLuint copyDoneSemaphore = builder.createSemaphore(0);
// Map buffer for CPU access
void* mappedPtr = builder.mapBuffer(buffer, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
Example of copying a texture into a mapped buffer object:
// Record copy engine command(s)
// The call to getTextureSubImage() is placed here to show that it is possible to record
// copy operations out of line with the OpenGL command stream. Only the call to flush()
// down below embeds the copy with the OpenGL stream via glSignalSemaphore() and glWaitSemaphore()
copyEngine->getTextureSubImage(texture, 0, 0, 0, 0, width, height, 1, buffer, 0);
// Submit copy engine work to be executed in parallel to OpenGL rendering
GLuint dstLayout = GL_LAYOUT_TRANSFER_SRC_EXT;
builder.glSignalSemaphore(renderDoneSemaphore, waitValue, 1, &buffer, 1, &texture, &dstLayout);
// execute the copy
copyEngine->flush(renderDoneSemaphore, waitValue++, copyDoneSemaphore, signalValue);
... do something else in the meantime ...
// make GL wait for the copy
builder.glWaitSemaphore(copyDoneSemaphore, signalValue++, 1, &buffer, 1, &texture, &dstLayout);
// copy is done here, we can do some more rendering with the texture
// builder.clientWaitSemaphore() can be used to wait for the copy on the CPU for further
// processing of the texture data

Example of copying data from a mapped staging buffer object into a device-local buffer:
// Create shared resources
GLuint mappedBuffer = builder.createBuffer(size, GL_MAP_WRITE_BIT |GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
GLuint deviceBuffer = builder.createBuffer(size, 0);
GLuint renderDoneSemaphore = builder.createSemaphore(0);
GLuint copyDoneSemaphore = builder.createSemaphore(0);
// Map buffer for CPU access
void* mappedPtr = builder.mapBuffer(mappedBuffer, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
...
// Make sure the previous copy has finished before overwriting the staging buffer
builder.clientWaitSemaphore(copyDoneSemaphore, copyDoneTimelineValue++);
memcpy(mappedPtr, srcData, srcDataSize);
// Record copy engine command(s)
copyEngine->copyBufferSubData(mappedBuffer, deviceBuffer, 0, 0, srcDataSize);
// Submit copy engine work to be executed in parallel to OpenGL rendering.
// Make sure prior OpenGL rendering is done with 'deviceBuffer'
GLuint buffers[] = {mappedBuffer, deviceBuffer};
builder.glSignalSemaphore(renderDoneSemaphore, renderDoneTimelineValue, 2, buffers);
// execute the copy
copyEngine->flush(renderDoneSemaphore, renderDoneTimelineValue++, copyDoneSemaphore, copyDoneTimelineValue);
...
// make GL wait for the copy
builder.glWaitSemaphore(copyDoneSemaphore, copyDoneTimelineValue, 2, buffers);
glBindBuffer(GL_ARRAY_BUFFER, deviceBuffer);
... render stuff ...
Timeline semaphores are a modern, flexible and very convenient way to track GPU workload progress. They are part of core Vulkan 1.2 and exposed to OpenGL through the NV_timeline_semaphore extension.
AsyncCopyEngineBuilder::glSignalSemaphore() and AsyncCopyEngineBuilder::glWaitSemaphore() wrap the OpenGL use of
NV_timeline_semaphore conveniently. In addition, AsyncCopyEngineBuilder::clientWaitSemaphore() brings the ability to
wait on a semaphore on the CPU side.
// OpenGL signals data is ready
builder.glSignalSemaphore(uploadSemaphore, frameNumber, 1, &buffer, 0, nullptr, nullptr);
// Upload engine waits for data, then signals completion
uploadEngine->flush(uploadSemaphore, frameNumber, uploadCompleteSemaphore, frameNumber);
// render something else while the upload happens
glDrawElements(...);
// dovetail OpenGL back with upload timeline
builder.glWaitSemaphore(uploadCompleteSemaphore, frameNumber, 1, &buffer, 0, nullptr, nullptr);
By multi-buffering the staging and vertex buffer VBO as well as the render texture, multiple frames' worth of work can be recorded by the CPU ahead of time. The timelines of the three engines at work can overlap, effectively increasing render throughput by 'hiding' the upload of frame N's vertex data in frame N-1's rendering time, while overlapping frame N's texture download with frame N+1's rendering.
Here's how the engines work together:
Frame N: [Upload Engine] [Graphics Engine] [Download Engine]
Frame N+1: [Upload Engine] [Graphics Engine] [Download Engine]
Frame N+2: [Upload Engine] [Graphics Engine] [Download Engine]
Each engine operates independently, with timeline semaphores coordinating when data is ready for the next stage. So while the upload engine is busy copying data for frame N+1, the graphics engine can be rendering frame N, and the download engine can be reading back frame N-1.
The choice between single and dual copy engines, along with the number of staging buffers, significantly impacts the achievable parallelism between copy operations and rendering:
Single copy engine: Essentially serializes all download and upload activity, regardless of how many staging buffers are available. However, if the combined upload and download time is less than the render time, all copy activity can be 'hidden' in the render time: downloading the last frame and uploading the next frame effectively run in parallel to rendering the current frame (as long as there are at least two staging buffers).

1 Copy Engine, 1 Staging buffer. No achievable overlap, since all operations have to be serialized.

1 Copy Engine, 2 Staging buffers. Still no achievable overlap, since all copy operations are executed serially, which in turn serializes rendering with the copies.

1 Copy Engine, 2 Staging buffers, Next-Frame-Upload. Now the copy operations can overlap with rendering.
Dual copy engines: Upload and download can actually happen at the same time (PCIe is full duplex), but if there are fewer than two staging buffers, the copy activity cannot overlap with the render activity, because the GPU depends on having the render data and render texture ready.

2 Copy Engines, 1 Staging buffer. Copies can overlap each other, but rendering is serialized with the copies.

2 Copy Engines, 2 Staging buffers. Overlap of rendering, vertex buffer upload and texture download.

2 Copy Engines, 2 Staging buffers, Next-Frame-Upload. Upload frame N+1's data during frame N. Copy operations overlap with each other and with rendering.
More than 2 staging buffers has no effect on the copy and render overlap. It merely helps the CPU to queue up more frames ahead of time.
In conclusion, 2 staging buffers, 1 copy queue and "upload next frame data" is the least resource-intensive scheme that allows for maximum parallel activity of graphics engine, copy engine and CPU, but it requires a slightly trickier setup and may not be suitable for all use cases.
- Use multiple engines: Separate upload and download operations into different engines if you can make them overlap and can afford multiple copy engines.
- Overlap operations: Start next frame's upload while current frame is rendering
- Batch operations: Group related operations in the same command buffer and try to queue as many copy operations as possible for each call to flush(). A flush triggers a Vulkan vkQueueSubmit() call, which can be expensive.
- Use Nsight Systems: Nsight Systems is an excellent tool for profiling OpenGL and Vulkan applications; in particular, it shows the timelines of the various GPU engines and the CPU in relation to each other. It helps assess the achieved parallelism and overlap within GPU/CPU frames as well as between the copy engines and the graphics engine.
Important: The current implementation assumes that the user allocates only a moderate number of staging buffers and
textures. Each buffer and texture allocated through AsyncCopyEngineBuilder uses a dedicated Vulkan allocation. This
approach is suitable for applications with dozens of interop resources, but may cause issues with hundreds or thousands
of allocations. It is not advised to allocate all of the application's textures and buffers through the AsyncCopyEngine
class. Interop textures and buffers severely limit OpenGL's ability to manage these resources under memory pressure.
Instead, consider keeping only a few interop staging textures and buffers around that you use to move the data across
the PCIe bus using the AsyncCopyEngine. Once the data is on the GPU, you can move it to its final destination
using commands like glCopyBufferSubData(), glTexSubImage*() and glBlitFramebuffer(). Conversely, if you want to read many textures
back, consider keeping two interop buffers around (one on the device, one mapped), perform the texture readback into
the on-device buffer, followed by a download copy into the mapped buffer:
// Create shared resources
GLuint deviceBuffer = builder.createBuffer(size, 0);
GLuint hostBuffer = builder.createBuffer(size, GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT |
GL_MAP_COHERENT_BIT);
uint8_t *hostPointer = (uint8_t *)builder.mapBuffer(hostBuffer, GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT |
GL_MAP_COHERENT_BIT);
...
imageDownloadSemaphoreValue = imageDownloadSemaphoreValue + 1;
// Copy texture on-device into deviceBuffer
// Make sure the last frame's copy is done and has relinquished 'deviceBuffer'
builder.glWaitSemaphore(imageDownloadSemaphore, imageDownloadSemaphoreValue-1);
glBindBuffer(GL_PIXEL_PACK_BUFFER, deviceBuffer);
glGetTextureSubImage(texture, 0, 0, 0, 0, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, size, 0);
builder.glSignalSemaphore(imageReadySemaphore, imageReadySemaphoreValue);
// Download across PCIe using async copy engine
uploadEngine->copyBufferSubData(deviceBuffer, hostBuffer, 0, 0, size);
uploadEngine->flush(imageReadySemaphore, imageReadySemaphoreValue, imageDownloadSemaphore, imageDownloadSemaphoreValue);
// Do some more rendering in parallel to the download
postProcess();
// in a second thread, we might want to wait for the copy to be done and do some more CPU processing like writing to disk
builder.clientWaitSemaphore(imageDownloadSemaphore, imageDownloadSemaphoreValue);
writeToDisk(hostPointer, size);
// let the main thread know it can write to the hostBuffer again
signalMainThreadWritingDone();
The AsyncCopyEngine API supports multi-threaded usage with specific synchronization requirements.
- AsyncCopyEngineBuilder: Object creation and destruction must be externally synchronized. Multiple threads cannot safely create or destroy resources simultaneously.
- AsyncCopyEngine instances: Each engine can be operated on in different threads, including submitting work via flush().
Note that it is not strictly necessary to utilize multithreading for maximum performance. The AsyncCopyEngine sample itself does not make use of extra threads. The CPU just needs to be able to schedule enough work ahead of time for the GPU to execute.
// Thread 1: Resource creation (must be synchronized)
AsyncCopyEngineBuilder builder;
builder.init();
auto engine1 = builder.createCopyEngine(4);
auto engine2 = builder.createCopyEngine(4);
// Thread 2: Dedicated to engine1
engine1->copyBufferSubData(src1, dst1, 0, 0, size1);
engine1->flush(waitSem1, waitVal1, signalSem1, signalVal1);
// Thread 3: Dedicated to engine2
engine2->copyBufferSubData(src2, dst2, 0, 0, size2);
engine2->flush(waitSem2, waitVal2, signalSem2, signalVal2);

Note: AsyncCopyEngineBuilder::glWaitSemaphore and AsyncCopyEngineBuilder::glSignalSemaphore operate on the GL context bound to the current thread.
The sample's UI lets one play with these scenarios and shows the impact on the timeline of operations on the GPU immediately. The workload for each engine (upload/render/download) can be artificially increased to observe the impact it has on the framerate.
- next frame upload: better scheduling of transfers reduces dependencies and allows for more parallelism between transfers and rendering
- single vs dual copy engines: in transfer-limited scenarios, it can make a big difference that transfers can happen in both directions at the same time
- number of active staging buffers: the difference between single- and double buffered staging resources can be stark
- upload repeat count: artificially increase upload workload
- download repeat count: artificially increase download workload. Changing the render resolution will have a similar effect.
- Torus Grid settings: artificially increase or decrease rendering workload
Core Vulkan 1.2 functionality and a couple of platform specific extensions are used:
- Windows: VK_KHR_external_memory_win32, VK_KHR_external_semaphore_win32
- Linux: VK_KHR_external_memory_fd, VK_KHR_external_semaphore_fd
Required OpenGL extensions:
- GL_EXT_memory_object
- GL_EXT_semaphore
- GL_NV_timeline_semaphore
- Platform-specific extensions for handle import: EXT_external_objects_fd, EXT_external_objects_win32
OpenGL, Vulkan, interop, asynchronous copy engine, timeline semaphores
