OpenCL: Hiding data transfer behind GPU kernel runtime

Ravi Kumar
8 min read · Feb 15, 2021

The source code used here can be found at: https://github.com/ravkum/openclDataAndKernelParallelExecution

This blog assumes that the reader has an understanding of OpenCL and is familiar with host and device programs. I am using AMD devices as an example, but the concept applies to all OpenCL devices.

There are two kinds of GPU devices:

1) APU, where the GPU is integrated with the CPU (and so is called an iGPU). An example of this is the AMD Ryzen series APU. Here the CPU and GPU share the virtual memory address space, so memory transfer is not required; ZeroCopy buffers should be used in this case. I will explain how to use ZeroCopy buffers in detail in another post (a rough sketch follows this list).

2) dGPU, Discrete GPU. Here the input data has to be transferred from CPU DRAM to dGPU VRAM. Similarly, the output data has to be read back from dGPU VRAM to the CPU DRAM. This blog post is about these data transfers between CPU and GPU devices and explains how the DMA engine can be utilized to hide the data transfer behind the kernels being run on the GPU.
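Before moving on to the dGPU case, here is a rough illustration of the ZeroCopy idea from point 1 (the full treatment is deferred to the other post). A zero-copy buffer is typically created with CL_MEM_ALLOC_HOST_PTR and then mapped, so the CPU fills memory the integrated GPU can read directly, with no DMA copy. Names like ctx, queue, and fill_input are placeholders, and error checks are trimmed:

//Illustrative zero-copy sketch for an APU (not the dGPU pipeline below)
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            numBytes, NULL, &err);

//On an APU the map returns a pointer into memory the GPU can access directly
float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, numBytes, 0, NULL, NULL, &err);
fill_input(ptr);    //hypothetical producer writing the input in place
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
//A kernel can now read buf without any clEnqueueWriteBuffer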

Let us use a simple example problem to explain this. Assume we have to apply a few filters to a set of input data.

The application pipeline looks like this: copy the input to the GPU, run Filter1, run Filter2, then read the output back.

In brief, to hide the data transfer time behind the kernel runtime, we need to do the following:

  1. Create the CPU buffers in pinned CPU DRAM and create corresponding device memory buffers in the GPU VRAM.
  2. Have at least two sets of input/output buffers and kernels. I normally create 'n' batches of input/output buffers and kernels and set the kernel arguments up front, so we don't have to do it repeatedly in the main pipeline loop. The kernels are run in round-robin fashion.
  3. Have three command queues: one for data writes, one for kernel enqueues, and one for data reads. PCIe gives us full-duplex transfer capability, so a read and a write can happen at the same time; hence three command queues are needed.
  4. The main pipeline loop should make only async calls. A secondary host thread should keep the pinned memory data ready. This can be done using an event-based callback function. The callback can memcpy from the source into pinned host memory, or use any other method of having the latest input ready in pinned host memory.
  5. Use cl_event to synchronize work between the queues.
  6. Profile and see whether things are working as expected. If not, something is broken in the above steps; fix it.

The data flow would look something like this: while the kernels work on batch n, the write queue uploads the input for the next batch and the read queue brings back the output of the batch before it.

A few things to consider:

  1. The kernels work on the previously sent input data.
  2. The next set of input data should be sent to the device once Filter1 of the previous batch has finished.
  3. The output can be read back once the previous batch's Filter2 run is over.
  4. Data writes, data reads, and kernel runs should all be pipelined.

Now each step in detail with code:

The below code snippet is for Steps 1 and 2:

  1. Create the CPU buffers in pinned CPU DRAM and create corresponding device memory buffers in the GPU VRAM.
  2. Have at least two sets of input/output buffers and kernels. I normally create 'n' batches of input/output buffers and kernels and set the kernel arguments up front, so we don't have to do it repeatedly in the main pipeline loop. The kernels are run in round-robin fashion.
#define BATCH_KERNELS 2
...

//Kernels
cl_kernel kernel_1[BATCH_KERNELS];
cl_kernel kernel_2[BATCH_KERNELS];

//Declare device cl_mem buffers
cl_mem device_input_cl[BATCH_KERNELS];
cl_mem device_tmp_input_cl[BATCH_KERNELS];
cl_mem device_output_cl[BATCH_KERNELS];

//Declare host cl_mem buffers
cl_mem host_input_cl[BATCH_KERNELS];
cl_mem host_output_cl[BATCH_KERNELS];

...

//Allocate memory.
//Host memory is allocated in pinned CPU memory using CL_MEM_ALLOC_HOST_PTR
for (int i = 0; i < BATCH_KERNELS; i++) {
    device_input_cl[i] = clCreateBuffer(infoDeviceOcl.mCtx, CL_MEM_READ_ONLY,
        2 * paddedNumBytes, NULL, &err);
    CHECK_RESULT(err != CL_SUCCESS, "clCreateBuffer failed with %d\n", err);

    host_input_cl[i] = clCreateBuffer(infoDeviceOcl.mCtx, CL_MEM_ALLOC_HOST_PTR,
        2 * paddedNumBytes, NULL, &err);
    CHECK_RESULT(err != CL_SUCCESS, "clCreateBuffer failed with %d\n", err);

    device_tmp_input_cl[i] = clCreateBuffer(infoDeviceOcl.mCtx, CL_MEM_READ_WRITE,
        2 * tmpOutputpaddedNumBytes, NULL, &err);
    CHECK_RESULT(err != CL_SUCCESS, "clCreateBuffer failed with %d\n", err);

    device_output_cl[i] = clCreateBuffer(infoDeviceOcl.mCtx, CL_MEM_WRITE_ONLY,
        2 * numBytes * UPSCAL_FACTOR, NULL, &err);
    CHECK_RESULT(err != CL_SUCCESS, "clCreateBuffer failed with %d\n", err);

    host_output_cl[i] = clCreateBuffer(infoDeviceOcl.mCtx, CL_MEM_ALLOC_HOST_PTR,
        2 * numBytes * UPSCAL_FACTOR, NULL, &err);
    CHECK_RESULT(err != CL_SUCCESS, "clCreateBuffer failed with %d\n", err);
}

...

//Create kernels
for (int i = 0; i < BATCH_KERNELS; i++) {
    kernel_1[i] = clCreateKernel(programFilter, FILTER_1_KERNEL, &err);
    CHECK_RESULT(err != CL_SUCCESS, "clCreateKernel failed with Error code = %d", err);
    kernel_2[i] = clCreateKernel(programFilter, FILTER_2_KERNEL, &err);
    CHECK_RESULT(err != CL_SUCCESS, "clCreateKernel failed with Error code = %d", err);
}

//Set kernel arguments once, up front
int cnt = 0;
for (int i = 0; i < BATCH_KERNELS; i++) {
    cnt = 0;
    err = clSetKernelArg(kernel_1[i], cnt++, sizeof(cl_mem), &(device_input_cl[i]));
    err |= clSetKernelArg(kernel_1[i], cnt++, sizeof(cl_mem), &(device_tmp_input_cl[i]));
    err |= clSetKernelArg(kernel_1[i], cnt++, sizeof(cl_mem), &(filter1_coeff));
    CHECK_RESULT(err != CL_SUCCESS, "clSetKernelArg failed with Error code = %d", err);

    cnt = 0;
    err = clSetKernelArg(kernel_2[i], cnt++, sizeof(cl_mem), &(device_tmp_input_cl[i]));
    err |= clSetKernelArg(kernel_2[i], cnt++, sizeof(cl_mem), &(device_output_cl[i]));
    err |= clSetKernelArg(kernel_2[i], cnt++, sizeof(cl_mem), &(filter2_coeff));
    CHECK_RESULT(err != CL_SUCCESS, "clSetKernelArg failed with Error code = %d", err);
}

3. Have three command queues: one for data writes, one for kernel enqueues, and one for data reads. PCIe gives us full-duplex transfer capability, so a read and a write can happen at the same time; hence three command queues are needed.

infoDeviceOcl->mQueue = clCreateCommandQueue(infoDeviceOcl->mCtx, infoDeviceOcl->mDevice, 0, &err);
CHECK_RESULT(err != CL_SUCCESS, "clCreateCommandQueue mQueue failed. Err code = %d", err);

infoDeviceOcl->mReadQueue = clCreateCommandQueue(infoDeviceOcl->mCtx, infoDeviceOcl->mDevice, 0, &err);
CHECK_RESULT(err != CL_SUCCESS, "clCreateCommandQueue mReadQueue failed. Err code = %d", err);

infoDeviceOcl->kQueue = clCreateCommandQueue(infoDeviceOcl->mCtx, infoDeviceOcl->mDevice, 0, &err);
CHECK_RESULT(err != CL_SUCCESS, "clCreateCommandQueue kQueue failed. Err code = %d", err);
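A side note: clCreateCommandQueue has been deprecated since OpenCL 2.0. On newer headers, the same queues would be created with clCreateCommandQueueWithProperties. A minimal sketch, reusing the infoDeviceOcl fields above:

cl_queue_properties props[] = { 0 };   //no special properties, equivalent to passing 0 above
infoDeviceOcl->mQueue = clCreateCommandQueueWithProperties(
    infoDeviceOcl->mCtx, infoDeviceOcl->mDevice, props, &err);
CHECK_RESULT(err != CL_SUCCESS, "clCreateCommandQueueWithProperties failed. Err code = %d", err);

//If you want to read event timestamps in Step 6, request profiling instead:
//cl_queue_properties prof_props[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };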

The below code snippet is the core pipeline and tackles both Steps 4 and 5:

4. The main pipeline loop should make only async calls. A secondary host thread should keep the pinned memory data ready. This can be done using an event-based callback function. The callback can memcpy from the source into pinned host memory, or use any other method of having the latest input ready in pinned host memory. The idea here is that the main thread never blocks: it only enqueues asynchronous work, while the secondary thread keeps the pinned buffers filled with fresh input (a sketch of the callback approach follows the pipeline code below).

5. Use cl_event to synchronize work between queues.

//Pointers to store mapped pointers
cl_float *pinned_input[BATCH_KERNELS];
cl_float *pinned_output[BATCH_KERNELS];

//Events to synchronize
cl_event input_event[BATCH_KERNELS];
cl_event output_event[BATCH_KERNELS];
cl_event kernel2_event[BATCH_KERNELS];
cl_event kernel1_event[BATCH_KERNELS];

Getting the mapped host pointers for the pinned buffers:

//Getting host pointers for all host pinned buffers
for (int i = 0; i < BATCH_KERNELS; i++) {
    pinned_input[i] = (float *)clEnqueueMapBuffer(infoDeviceOcl.mQueue, host_input_cl[i],
        CL_TRUE, CL_MAP_WRITE, 0, 2 * paddedNumBytes, 0, NULL, NULL, &status);
    CHECK_RESULT(status != CL_SUCCESS, "Error in clEnqueueMapBuffer. Status: %d\n", status);

    //My input data doesn't change. In a real application, this memcpy should be done by
    //the secondary thread, based on an event callback function or other such means,
    //so the main thread is not blocked
    memcpy(pinned_input[i], data, 2 * paddedNumBytes);

    pinned_output[i] = (float *)clEnqueueMapBuffer(infoDeviceOcl.mQueue, host_output_cl[i],
        CL_FALSE, CL_MAP_READ, 0, 2 * numBytes * UPSCAL_FACTOR, 0, NULL, NULL, &status);
    CHECK_RESULT(status != CL_SUCCESS, "Error in clEnqueueMapBuffer. Status: %d\n", status);
}

//Send the first set of data in
status = clEnqueueWriteBuffer(infoDeviceOcl.mQueue, device_input_cl[0], CL_TRUE, 0,
    2 * paddedNumBytes, pinned_input[0], 0, NULL, &input_event[0]);
CHECK_RESULT(status != CL_SUCCESS, "Error in clEnqueueWriteBuffer. Status: %d\n", status);

//Main pipeline loop
for (iter = 0; iter < iteration; iter++) {
    if (iter < BATCH_KERNELS) {
        //This branch is required as the first BATCH_KERNELS iterations trigger all the
        //events needed from the second set of runs onwards
        set = iter % BATCH_KERNELS;
        input_cl = &device_input_cl[set];
        output_cl = &device_output_cl[set];

        //Work on the previously sent data
        err = clEnqueueNDRangeKernel(infoDeviceOcl.kQueue, kernel_1[set], 1, NULL,
            globalWorkSize, localWorkSize, 1, &input_event[set], &kernel1_event[set]);
        CHECK_RESULT(err != CL_SUCCESS, "clEnqueueNDRangeKernel failed with Error code = %d", err);

        err = clEnqueueNDRangeKernel(infoDeviceOcl.kQueue, kernel_2[set], 1, NULL,
            globalWorkSize, localWorkSize, 0, NULL, &kernel2_event[set]);
        CHECK_RESULT(err != CL_SUCCESS, "clEnqueueNDRangeKernel failed with Error code = %d", err);
        clFlush(infoDeviceOcl.kQueue);

        //The host buffer is pinned, so the MapBuffer step was needed only once, at the beginning.
        //Reading half the output, as the final stage in the pipeline is supposed to be a
        //compression kernel with a 2:1 compression factor
        status = clEnqueueReadBuffer(infoDeviceOcl.mReadQueue, *output_cl, CL_FALSE, 0,
            2 * numBytes, pinned_output[set], 1, &kernel2_event[set], &output_event[set]);
        CHECK_RESULT(status != CL_SUCCESS, "Error in clEnqueueReadBuffer. Status: %d\n", status);

        //Send the next set of data
        status = clEnqueueWriteBuffer(infoDeviceOcl.mQueue, *input_cl, CL_FALSE, 0,
            2 * paddedNumBytes, pinned_input[(set + 1) % BATCH_KERNELS], 0, NULL,
            &input_event[(set + 1) % BATCH_KERNELS]);
        CHECK_RESULT(status != CL_SUCCESS, "Error in clEnqueueWriteBuffer. Status: %d\n", status);
        clFlush(infoDeviceOcl.mReadQueue);
    }
    else {
        set = iter % BATCH_KERNELS;
        input_cl = &device_input_cl[(iter + 1) % BATCH_KERNELS];
        output_cl = &device_output_cl[set];

        //Work on the previously sent data.
        //kernel_1 can start execution only after its input is available
        err = clEnqueueNDRangeKernel(infoDeviceOcl.kQueue, kernel_1[set], 1, NULL,
            globalWorkSize, localWorkSize, 1, &input_event[set], &kernel1_event[set]);
        CHECK_RESULT(err != CL_SUCCESS, "clEnqueueNDRangeKernel failed with Error code = %d", err);

        //kernel_2 can start execution only after the previous output of this set has been
        //read back, so it has to wait for output_event[set]
        err = clEnqueueNDRangeKernel(infoDeviceOcl.kQueue, kernel_2[set], 1, NULL,
            globalWorkSize, localWorkSize, 1, &output_event[set], &kernel2_event[set]);
        CHECK_RESULT(err != CL_SUCCESS, "clEnqueueNDRangeKernel failed with Error code = %d", err);
        clFlush(infoDeviceOcl.kQueue);

        //Reading half the output, as the final stage in the pipeline is supposed to be a
        //compression kernel with a 2:1 compression factor
        status = clEnqueueReadBuffer(infoDeviceOcl.mReadQueue, *output_cl, CL_FALSE, 0,
            2 * numBytes, pinned_output[set], 1, &kernel2_event[set], &output_event[set]);
        CHECK_RESULT(status != CL_SUCCESS, "Error in clEnqueueReadBuffer. Status: %d\n", status);

        //Send the next set of data. The write waits on kernel1_event so the device buffer
        //is not overwritten before kernel_1 has finished reading it
        status = clEnqueueWriteBuffer(infoDeviceOcl.mQueue, *input_cl, CL_FALSE, 0,
            2 * paddedNumBytes, pinned_input[(iter + 1) % BATCH_KERNELS], 1,
            &kernel1_event[(iter + 1) % BATCH_KERNELS],
            &input_event[(iter + 1) % BATCH_KERNELS]);
        CHECK_RESULT(status != CL_SUCCESS, "Error in clEnqueueWriteBuffer. Status: %d\n", status);
        clFlush(infoDeviceOcl.mReadQueue);
    }
}
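The loop above assumes the pinned input buffers are already filled. Here is a minimal sketch of the event-callback approach from Step 4, with hypothetical names (PipelineCtx, next_frame, pipeline_ctx): the callback fires on a driver thread once kernel_1 has consumed a buffer, so the main loop never blocks. In production you would typically keep the callback tiny and just signal a worker thread that then does the memcpy:

#include <string.h>

typedef struct {
    float *pinned_dst;   //mapped pinned host buffer to refill
    size_t num_bytes;    //bytes to copy
} PipelineCtx;

extern const float *next_frame(void);   //hypothetical input source

void CL_CALLBACK refill_pinned_input(cl_event e, cl_int exec_status, void *user_data)
{
    PipelineCtx *ctx = (PipelineCtx *)user_data;
    if (exec_status == CL_COMPLETE) {
        //kernel_1 has finished reading this buffer; safe to refill it now
        memcpy(ctx->pinned_dst, next_frame(), ctx->num_bytes);
    }
}

//In the main loop, right after enqueueing kernel_1 for this set:
err = clSetEventCallback(kernel1_event[set], CL_COMPLETE,
                         refill_pinned_input, &pipeline_ctx[set]);
CHECK_RESULT(err != CL_SUCCESS, "clSetEventCallback failed with Error code = %d", err);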

That is it. This should do the job of hiding the data transfer behind the kernel run.

Final step: confirm using a profiler.

In the timeline trace we can see that the data writes, data reads, and kernel runs are happening in parallel and are not serialized.

The above trace was taken using CodeXL (Application Timeline Trace) on the dGPU that I have. Please note that CodeXL is no longer supported; newer GPUs and drivers support RGP (Radeon GPU Profiler).

With RGP, profiling an OpenCL application is even simpler, and you can easily see this timeline trace without any issues.
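If you want numbers in addition to the visual trace, OpenCL's built-in event profiling can confirm the overlap as well. A minimal sketch (note: this requires the queues to be created with CL_QUEUE_PROFILING_ENABLE, which the snippets above do not do); compare the start/end window of a write or read event against the kernel events of the neighbouring batch, and if the windows overlap, the transfer is hidden:

cl_ulong t_start, t_end;
clWaitForEvents(1, &kernel2_event[set]);
clGetEventProfilingInfo(kernel2_event[set], CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(kernel2_event[set], CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);
printf("kernel_2 window: %llu -> %llu ns\n",
       (unsigned long long)t_start, (unsigned long long)t_end);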

Do comment and let me know if you have any questions or concerns. Feel free to use the code in any way required.
