If you are used to programming GPU applications with the CUDA runtime API and have a clear understanding of the CUDA architecture, porting an application to OpenCL may be a little confusing at first, but it is not difficult.

To demonstrate the differences between the CUDA runtime and OpenCL concretely, a downloadable vector addition example is attached at the end of the tutorial.

The concepts of threads, blocks, and kernels are the same. One of the major differences, however, is how a kernel is launched and the number of API calls required to do so. OpenCL is more or less the same as the CUDA driver API, but in this article we will show how to write an OpenCL program equivalent to a CUDA runtime application. Listed below are the terms and API calls used in a CUDA runtime application, along with their OpenCL counterparts.

General

| CUDA | OpenCL |
| --- | --- |
| Thread | Work Item |
| Block | Work Group |
| Global Memory | Global Memory |
| Shared Memory | Local Memory |
| Constant Memory | Constant Memory |
| Local Memory | Private Memory |
| Stream | Command Queue |
| Texture | Image |
| Surface | Writable Image |

Qualifiers

| Terminology | CUDA | OpenCL |
| --- | --- | --- |
| Kernel | `__global__` | `__kernel` |
| Device Function | `__device__` | None (no qualifier required) |
| Constant Memory | `__constant__` | `__constant` |
| Device Variable | `__device__` | `__global` |
| Shared Memory | `__shared__` | `__local` |

Basic API Calls

| CUDA | OpenCL |
| --- | --- |
| `cudaMalloc` | `clCreateBuffer` |
| `cudaMemcpy` (Host to Device) | `clEnqueueWriteBuffer` |
| `cudaMemcpy` (Device to Host) | `clEnqueueReadBuffer` |
| `cudaMemcpy` (Device to Device) | `clEnqueueCopyBuffer` |
| `cudaMemcpy2D` (Host to Device) | `clEnqueueWriteBufferRect` |
| `cudaMemcpy2D` (Device to Host) | `clEnqueueReadBufferRect` |
| `cudaMemcpy2D` (Device to Device) | `clEnqueueCopyBufferRect` |
| `cudaFree` | `clReleaseMemObject` |
| `cudaStreamCreate` | `clCreateCommandQueue` |
| `cudaStreamDestroy` | `clReleaseCommandQueue` |

Kernel Call

| Terminology | CUDA | OpenCL |
| --- | --- | --- |
| Grid / block variables | `dim3 blockSize(x,x,x);` `dim3 gridSize(x,x,x);` | `size_t work_group_size[] = {x,x,x};` `size_t total_work_items[] = {x,x,x};` |
| Kernel launch | `myKernel<<<gridSize, blockSize>>>(arg1, arg2, …, argn);` | `clSetKernelArg(myKernel, arg1);` `clSetKernelArg(myKernel, arg2);` … `clSetKernelArg(myKernel, argn);` `clEnqueueNDRangeKernel(myKernel);` (arguments abbreviated) |
| Synchronization | `cudaDeviceSynchronize()`, `cudaStreamSynchronize()` | `clFinish()` |

Built-in Indexing

| Terminology | CUDA | OpenCL |
| --- | --- | --- |
| Number of blocks | `gridDim` | `get_num_groups()` |
| Size of block | `blockDim` | `get_local_size()` |
| Index of block | `blockIdx` | `get_group_id()` |
| Index of thread in a block | `threadIdx` | `get_local_id()` |
| Global thread index | `blockIdx * blockDim + threadIdx` | `get_global_id()` |
| Total threads | `gridDim * blockDim` | `get_global_size()` |

In-Device Synchronization

| CUDA | OpenCL |
| --- | --- |
| `__syncthreads()` | `barrier()` |
| `__threadfence()` | N/A |
| `__threadfence_block()` | `mem_fence()` |
| N/A | `read_mem_fence()` |
| N/A | `write_mem_fence()` |
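To make the built-in indexing correspondence concrete, here is a minimal host-side sketch in plain C++ that emulates a one-dimensional vector addition launch. It is not device code and does not call either API. The helper names `global_index` and `emulate_vector_add` are hypothetical and exist only for illustration, and the sketch assumes a zero global work offset on the OpenCL side.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: global index of one thread / work-item in one dimension.
//   CUDA:   blockIdx.x * blockDim.x + threadIdx.x
//   OpenCL: get_global_id(0), assuming a zero global work offset
std::size_t global_index(std::size_t group_id, std::size_t local_size,
                         std::size_t local_id) {
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

// Hypothetical host-side emulation of a 1D vector addition launch.
// The outer loop plays the role of blockIdx.x / get_group_id(0),
// the inner loop plays the role of threadIdx.x / get_local_id(0).
std::vector<float> emulate_vector_add(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      std::size_t local_size) {
    assert(a.size() == b.size());
    assert(local_size > 0);
    const std::size_t n = a.size();
    // gridDim.x / get_num_groups(0): enough groups to cover all n elements.
    const std::size_t num_groups = (n + local_size - 1) / local_size;
    std::vector<float> c(n, 0.0f);
    for (std::size_t group_id = 0; group_id < num_groups; ++group_id) {
        for (std::size_t local_id = 0; local_id < local_size; ++local_id) {
            const std::size_t gid = global_index(group_id, local_size, local_id);
            // Bounds check: the last group may extend past the end of the data.
            if (gid < n) {
                c[gid] = a[gid] + b[gid];
            }
        }
    }
    return c;
}
```

Calling `emulate_vector_add(a, b, 2)` on two five-element vectors runs three groups of two work-items each. The sixth work-item falls outside the data and is skipped by the bounds check, which is the same guard a real kernel would need in either API.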

Launching the Kernel

Apart from these and a few other differences, an important concept to grasp is how the kernel is launched. A CUDA kernel launch requires the user to specify the following:

  1. Block size, i.e. the number of threads per block.
  2. Grid size, i.e. the number of blocks.
  3. Amount of dynamically allocated shared memory (optional).
  4. The CUDA stream on which the kernel is launched (optional).

An OpenCL kernel launch, in contrast, requires the following:

  1. The OpenCL command queue on which the kernel is launched.
  2. Number of index dimensions (1, 2 or 3).
  3. Total number of work-items (threads) in each dimension, not the number of work-groups.
  4. Work-group size, i.e. work-items per work-group (optional).
  5. Event associated with the kernel (optional).
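The difference between the two launch configurations is easy to get wrong when porting: CUDA takes the grid size in blocks, while OpenCL takes the total number of work-items. The sketch below, in plain C++ with no API calls, shows one way the conversion could be written. `Dim3`, `to_global_work_size`, and `round_up_to_multiple` are hypothetical names introduced here for illustration. The rounding helper assumes an OpenCL 1.x style launch, where the global size in each dimension must be a multiple of the work-group size.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3, so the sketch compiles without CUDA.
struct Dim3 {
    std::size_t x, y, z;
};

// CUDA launches take (gridSize, blockSize); OpenCL takes the total number of
// work-items per dimension, i.e. gridSize * blockSize in each dimension.
std::array<std::size_t, 3> to_global_work_size(const Dim3& grid,
                                               const Dim3& block) {
    return {{grid.x * block.x, grid.y * block.y, grid.z * block.z}};
}

// When starting from a problem size n instead of a grid size, round n up to
// the next multiple of the work-group size (assumption: OpenCL 1.x rules).
// The kernel then needs a bounds check for the padded work-items.
std::size_t round_up_to_multiple(std::size_t n, std::size_t local_size) {
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}
```

For example, a CUDA launch with `gridSize(4, 2, 1)` and `blockSize(256, 1, 1)` corresponds to `total_work_items[] = {1024, 2, 1}` and `work_group_size[] = {256, 1, 1}`. For a one-dimensional problem of 1000 elements with work-groups of 256, `round_up_to_multiple(1000, 256)` gives a global size of 1024.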

The CUDA runtime API is a higher-level, easier-to-use interface built on top of the CUDA driver API. In a CUDA runtime application, a default context and a default stream are created on the first CUDA API call. The application is compiled ahead of time by the NVIDIA compiler nvcc (for device code) together with a host compiler (for the C++ host code). The device code is written in C++.

In contrast, OpenCL provides a low-level API, so the context and command queue have to be created explicitly. We have to write more code, but in return we get more control over the execution of the program. OpenCL device code is based on a subset of the C99 standard, so we cannot use C++ features such as templates and function overloading in kernels. Kernels are compiled dynamically at runtime to provide cross-platform, cross-device behavior. So instead of being compiled ahead of time along with the host code, OpenCL device code is supplied in either of the following ways:

  1. As a string literal, hardcoded inside the host source file.
  2. In a separate .cl or .clc source file, loaded by the host program and compiled by the OpenCL runtime.
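Both options can be sketched on the host side in plain C++. The kernel name `vector_add`, the constant `kVectorAddSource`, and the helper `load_kernel_source` are hypothetical names chosen for this illustration and are not taken from the downloadable example. The sketch stops at obtaining the source text. In a real program that text would then be passed to `clCreateProgramWithSource` and built with `clBuildProgram`.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Option 1: the kernel as a string literal inside the host source file.
// A C++11 raw string literal avoids escaping quotes and newlines.
const char* const kVectorAddSource = R"CLC(
__kernel void vector_add(__global const float* a,
                         __global const float* b,
                         __global float* c,
                         const unsigned int n)
{
    size_t i = get_global_id(0);
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
)CLC";

// Option 2: the kernel in a separate .cl file, read into a string at runtime.
// Throws std::runtime_error if the file cannot be opened.
std::string load_kernel_source(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open kernel file: " + path);
    }
    std::ostringstream contents;
    contents << file.rdbuf();  // copy the whole file into the string stream
    return contents.str();
}
```

Keeping the kernel in a separate file makes it easier to edit and reuse, while the string literal keeps the application self-contained in a single binary. Either way, the OpenCL runtime compiles the same source text at runtime.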

You can download this vector addition example to get a clear idea of the differences between CUDA and OpenCL.