OpenCL: Hiding data transfer behind GPU kernel runtime

The source code used here can be found at: https://github.com/ravkum/openclDataAndKernelParallelExecution

This blog assumes that the reader has an understanding of OpenCL and is familiar with host and device programs. I am using AMD devices as an example, but the concept applies to all OpenCL devices.

There are two kinds of GPU devices:

1) APU, where the GPU is integrated with the CPU (and so is called an iGPU). An example of this is the AMD Ryzen series of APUs. Here the CPU and GPU share the virtual memory address space, so memory transfer is not required; ZeroCopy buffers should be used in this case. I will explain how to use ZeroCopy buffers in another post.

2) dGPU, a discrete GPU. Here the input data has to be transferred from CPU DRAM to dGPU VRAM, and the output data has to be read back from dGPU VRAM to CPU DRAM. This blog post is about these data transfers between the CPU and the GPU and explains how the DMA engine can be utilized to hide the data transfer behind the kernels running on the GPU.
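
For reference, a naive host program for a dGPU looks roughly like the sketch below (the names ctx, queue, kernel, host_src and host_dst are placeholders and error handling is omitted). Everything goes through one queue and both transfers block, so the GPU idles during the copies and the DMA engine idles during the kernel run; the rest of this post is about removing exactly that serialization.

```c
#include <CL/cl.h>

/* Naive baseline: one queue, blocking transfers, fully serialized. */
void run_serialized(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                    const void *host_src, void *host_dst, size_t bytes, size_t gws)
{
    cl_int err;
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    /* Blocking write: the CPU waits until the input has reached VRAM. */
    clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, bytes, host_src, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

    /* Blocking read: the CPU waits again until the result is back in DRAM. */
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, bytes, host_dst, 0, NULL, NULL);

    clReleaseMemObject(d_in);
    clReleaseMemObject(d_out);
}
```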

Let us use a typical example problem to explain this. Assume we have to apply a few filters to a set of input data.

The problem application pipeline looks like this:

[Figure: the application pipeline]

In brief, to hide the data transfer time behind the kernel runtime, we need to do the following:

  1. Create the CPU buffers in pinned CPU DRAM and create corresponding device memory buffers in the GPU VRAM.
  2. Have at least two sets of input/output buffers and kernels. I normally create ’n’ batches of input/output buffers and kernels and set the kernel arguments up front so we don’t have to do it repeatedly in the main pipeline loop. Kernels are run in round-robin fashion.
  3. Have three command queues: one for data writes, one for kernel enqueues and one for data reads. PCIe gives us duplex transfer capability, so reads and writes can happen at the same time; that is why three command queues are needed.
  4. The main pipeline loop should consist only of async calls. A secondary host thread should keep the pinned memory data ready. This can be done using an event-based callback function: the callback can memcpy from the source into pinned host memory, or use any other method of getting the latest input into pinned host memory.
  5. Use cl_event to synchronize work between the queues.
  6. Profile and verify that things are working as expected. If not, something is broken in the above steps; fix it.

The data flow would look something like this:

[Figure: the pipelined data flow across batches]

A few things to consider:

  1. The kernel works on the previous batch’s input data.
  2. The next set of input data should be sent to the device once Filter1 of the previous batch finishes.
  3. The output can be read back once the previous batch’s Filter2 run is over.
  4. Data sends, data receives and kernel runs should all be pipelined.

Now each step in detail with code:

The below code snippet is for Steps 1 and 2:

  1. Create the CPU buffers in pinned CPU DRAM and create corresponding device memory buffers in the GPU VRAM.
  2. Have at least two sets of input/output buffers and kernels. I normally create ’n’ batches of input/output buffers and kernels and set the kernel arguments up front so we don’t have to do it repeatedly in the main pipeline loop. Kernels are run in round-robin fashion.
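
A minimal sketch of this setup is below. The kernel names (filter1, filter2), buffer sizes and variable names are illustrative assumptions rather than the exact code from the linked repository; ctx, program and a command queue for the initial mapping are assumed to exist, and error handling is omitted. Pinned host memory is obtained here with CL_MEM_ALLOC_HOST_PTR plus clEnqueueMapBuffer, a common way to get pre-pinned staging buffers on AMD's runtime.

```c
#define NUM_BATCHES 2   /* 'n' in the text; at least two to overlap transfer and compute */

typedef struct {
    cl_mem    pinned_in, pinned_out;    /* host-side staging buffers in pinned DRAM */
    void     *mapped_in, *mapped_out;   /* CPU pointers into those pinned buffers   */
    cl_mem    dev_in, dev_tmp, dev_out; /* device-side buffers in VRAM              */
    cl_kernel filter1, filter2;         /* one kernel pair per batch                */
} batch_t;

batch_t batch[NUM_BATCHES];

void create_batches(cl_context ctx, cl_program program, cl_command_queue queue, size_t bytes)
{
    cl_int err;
    for (int i = 0; i < NUM_BATCHES; i++) {
        /* Step 1a: pinned host buffers in CPU DRAM, mapped once to get CPU pointers. */
        batch[i].pinned_in  = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
        batch[i].pinned_out = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
        batch[i].mapped_in  = clEnqueueMapBuffer(queue, batch[i].pinned_in,  CL_TRUE,
                                                 CL_MAP_WRITE, 0, bytes, 0, NULL, NULL, &err);
        batch[i].mapped_out = clEnqueueMapBuffer(queue, batch[i].pinned_out, CL_TRUE,
                                                 CL_MAP_READ, 0, bytes, 0, NULL, NULL, &err);

        /* Step 1b: corresponding device buffers in GPU VRAM. */
        batch[i].dev_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
        batch[i].dev_tmp = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
        batch[i].dev_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

        /* Step 2: one kernel instance per batch, arguments set once up front. */
        batch[i].filter1 = clCreateKernel(program, "filter1", &err);
        batch[i].filter2 = clCreateKernel(program, "filter2", &err);
        clSetKernelArg(batch[i].filter1, 0, sizeof(cl_mem), &batch[i].dev_in);
        clSetKernelArg(batch[i].filter1, 1, sizeof(cl_mem), &batch[i].dev_tmp);
        clSetKernelArg(batch[i].filter2, 0, sizeof(cl_mem), &batch[i].dev_tmp);
        clSetKernelArg(batch[i].filter2, 1, sizeof(cl_mem), &batch[i].dev_out);
    }
}
```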

3. Have three command queues: one for data writes, one for kernel enqueues and one for data reads. PCIe gives us duplex transfer capability, so reads and writes can happen at the same time; that is why three command queues are needed.
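
A sketch of the queue setup, assuming ctx and device already exist; CL_QUEUE_PROFILING_ENABLE is optional, but it helps with the profiling step at the end:

```c
cl_command_queue writeQueue, kernelQueue, readQueue;   /* illustrative names */

void create_queues(cl_context ctx, cl_device_id device)
{
    cl_int err;
    /* One queue per concern, so uploads, kernel runs and read-backs can overlap. */
    writeQueue  = clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, &err);
    kernelQueue = clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, &err);
    readQueue   = clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, &err);
}
```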

The below code snippet is the core pipeline and tackles both steps 4 and 5:

4. The main pipeline loop should consist only of async calls. A secondary host thread should keep the pinned memory data ready. This can be done using an event-based callback function: the callback can memcpy from the source into pinned host memory, or use any other method of getting the latest input into pinned host memory. The idea here is that the host prepares the next batch's input while the device is still busy with the current one, so the pipeline never stalls waiting for data.

5. Use cl_event to synchronize work between queues.
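
Below is a hedged sketch of that loop rather than the repository's exact code. It reuses the batch_t array and the three queues from the earlier sketches; prepare_next_input() is a hypothetical host-side helper standing in for the "memcpy from source to pinned host memory" step, and event releases and error handling are mostly omitted for brevity.

```c
void prepare_next_input(void *pinned_dst);   /* hypothetical: copies the next frame into pinned DRAM */

/* Callback fired by the OpenCL runtime once Filter1 of a batch has consumed its
 * input; the host is now free to refill that batch's pinned input buffer. */
void CL_CALLBACK input_consumed(cl_event ev, cl_int status, void *user_data)
{
    batch_t *b = (batch_t *)user_data;
    prepare_next_input(b->mapped_in);
}

void run_pipeline(int num_frames, size_t bytes, size_t gws)
{
    cl_event writeDone[NUM_BATCHES]   = {0};
    cl_event filter1Done[NUM_BATCHES] = {0};
    cl_event filter2Done[NUM_BATCHES] = {0};
    cl_event readDone[NUM_BATCHES]    = {0};

    for (int frame = 0; frame < num_frames; frame++) {
        int b = frame % NUM_BATCHES;   /* round-robin over the batch slots */

        /* Reusing this slot? Its previous read-back must have finished first. */
        if (readDone[b]) {
            clWaitForEvents(1, &readDone[b]);
            clReleaseEvent(readDone[b]);
        }

        /* Async upload of this batch's input from pinned DRAM to VRAM (queue 1). */
        clEnqueueWriteBuffer(writeQueue, batch[b].dev_in, CL_FALSE, 0, bytes,
                             batch[b].mapped_in, 0, NULL, &writeDone[b]);

        /* Kernels are gated on the upload by cl_event, not by the host (queue 2). */
        clEnqueueNDRangeKernel(kernelQueue, batch[b].filter1, 1, NULL, &gws, NULL,
                               1, &writeDone[b], &filter1Done[b]);
        clEnqueueNDRangeKernel(kernelQueue, batch[b].filter2, 1, NULL, &gws, NULL,
                               1, &filter1Done[b], &filter2Done[b]);

        /* Once Filter1 has consumed this input, let the host prepare the next one. */
        clSetEventCallback(filter1Done[b], CL_COMPLETE, input_consumed, &batch[b]);

        /* Async read-back on the third queue, gated on Filter2 (queue 3). */
        clEnqueueReadBuffer(readQueue, batch[b].dev_out, CL_FALSE, 0, bytes,
                            batch[b].mapped_out, 1, &filter2Done[b], &readDone[b]);

        /* Flush so the runtime submits the work while the loop moves on. */
        clFlush(writeQueue);
        clFlush(kernelQueue);
        clFlush(readQueue);
    }

    clFinish(writeQueue);
    clFinish(kernelQueue);
    clFinish(readQueue);
}
```

The key property is that the loop itself never blocks on a transfer: the CPU only waits when it is about to reuse a batch slot whose results have not been read back yet.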

That is it. This should do the job of hiding the data transfer behind the kernel run.

Final step: confirm using a profiler:

[Figure: CodeXL Application Timeline Trace showing overlapping writes, reads and kernel runs]

Here we can see that the data writes, data reads and kernel runs are happening in parallel and are not serialized.

The above trace was taken using CodeXL (Application Timeline Trace) on the dGPU that I have. Please note that CodeXL is no longer supported; newer GPUs and drivers are supported by RGP (Radeon GPU Profiler).

With RGP, profiling an OpenCL application is even simpler, and you can easily see this timeline trace without any issues.
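
If you want a quick sanity check without a GUI profiler, event timestamps can also show the overlap. This is an alternative check, not how the trace above was captured; it assumes the queues were created with CL_QUEUE_PROFILING_ENABLE and reuses the hypothetical event names from the pipeline sketch, after those events have completed (e.g. after clFinish).

```c
/* Compare the upload of batch b with the previous slot's Filter1 run.
 * If the two intervals intersect, transfer and compute really did overlap. */
cl_ulong wStart, wEnd, kStart, kEnd;
int p = (b + NUM_BATCHES - 1) % NUM_BATCHES;   /* previous batch slot */

clGetEventProfilingInfo(writeDone[b],   CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &wStart, NULL);
clGetEventProfilingInfo(writeDone[b],   CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &wEnd,   NULL);
clGetEventProfilingInfo(filter1Done[p], CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &kStart, NULL);
clGetEventProfilingInfo(filter1Done[p], CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &kEnd,   NULL);

if (wStart < kEnd && kStart < wEnd)
    printf("batch %d upload overlapped the previous batch's Filter1\n", b);
```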

Do comment and let me know if you have any questions or concerns. Feel free to use the code in any way required.
