If you are used to programming GPU applications with the CUDA runtime API and have a clear understanding of the CUDA architecture, porting an application to OpenCL may be a little confusing, but it is not difficult. To demonstrate the differences between the CUDA runtime and OpenCL precisely, a downloadable vector-addition example is attached at the end of this tutorial.
The concepts of threads, blocks, and kernels are the same; one of the major differences, however, is how the kernel is launched and the number of API calls required to do so. OpenCL is closer in spirit to the CUDA driver API, but in this article we will show how an OpenCL program equivalent to a CUDA runtime application can be written. The following are the terminologies and API calls used in a CUDA runtime application, along with their OpenCL counterparts.
**General**

| CUDA | OpenCL |
| --- | --- |
| Thread | Work item |
| Block | Work group |
| Global memory | Global memory |
| Shared memory | Local memory |
| Constant memory | Constant memory |
| Local memory | Private memory |
| Stream | Command queue |
| Texture | Image |
| Surface | Writable image |

**Qualifiers**

| | CUDA | OpenCL |
| --- | --- | --- |
| Kernel | `__global__` | `__kernel` |
| Device function | `__device__` | None |
| Constant memory | `__constant__` | `__constant` |
| Device variable | `__device__` | `__global` |
| Shared memory | `__shared__` | `__local` |

**Basic API Calls**

| CUDA | OpenCL |
| --- | --- |
| `cudaMalloc` | `clCreateBuffer` |
| `cudaMemcpy` (host to device) | `clEnqueueWriteBuffer` |
| `cudaMemcpy` (device to host) | `clEnqueueReadBuffer` |
| `cudaMemcpy` (device to device) | `clEnqueueCopyBuffer` |
| `cudaMemcpy2D` (host to device) | `clEnqueueWriteBufferRect` |
| `cudaMemcpy2D` (device to host) | `clEnqueueReadBufferRect` |
| `cudaMemcpy2D` (device to device) | `clEnqueueCopyBufferRect` |
| `cudaFree` | `clReleaseMemObject` |
| `cudaStreamCreate` | `clCreateCommandQueue` |
| `cudaStreamDestroy` | `clReleaseCommandQueue` |

**Kernel Call**

| | CUDA | OpenCL |
| --- | --- | --- |
| Grid/block variables | `dim3 blockSize(x,x,x); dim3 gridSize(x,x,x);` | `size_t work_group_size[] = {x,x,x}; size_t total_work_items[] = {x,x,x};` |
| Kernel launch | `myKernel<<<gridSize, blockSize>>>(arg1, arg2, …, argn);` | `clSetKernelArg(myKernel, 0, …); clSetKernelArg(myKernel, 1, …); … clEnqueueNDRangeKernel(queue, myKernel, …);` |
| Synchronization | `cudaDeviceSynchronize()`, `cudaStreamSynchronize()` | `clFinish()` |

**Built-in Indexing**

| | CUDA | OpenCL |
| --- | --- | --- |
| Number of blocks | `gridDim` | `get_num_groups()` |
| Size of block | `blockDim` | `get_local_size()` |
| Index of block | `blockIdx` | `get_group_id()` |
| Index of thread in a block | `threadIdx` | `get_local_id()` |
| Global thread index | `blockIdx * blockDim + threadIdx` | `get_global_id()` |
| Total threads | `gridDim * blockDim` | `get_global_size()` |

**In-Device Synchronization**

| CUDA | OpenCL |
| --- | --- |
| `__syncthreads()` | `barrier()` |
| `__threadfence()` | N/A |
| `__threadfence_block()` | `mem_fence()` |
| N/A | `read_mem_fence()` |
| N/A | `write_mem_fence()` |
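As a concrete illustration of the qualifier and indexing mappings above, here is a minimal vector-addition kernel written both ways (the kernel name `vecAdd` and its parameters are our own example, not taken from the attached download):

```c
/* CUDA: one thread per element. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

/* OpenCL: one work item per element. Note the __global qualifiers on
 * buffer arguments and the built-in indexing function. */
__kernel void vecAdd(__global const float *a,
                     __global const float *b,
                     __global float *c,
                     int n)
{
    int i = get_global_id(0);  /* global work-item index */
    if (i < n)
        c[i] = a[i] + b[i];
}
```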
Launching the Kernel
Apart from these and a few other differences, an important concept to grasp is how the kernel is launched. A CUDA kernel launch requires the user to specify the following:
- Block size, i.e. the number of threads per block.
- Grid size, i.e. the number of blocks.
- Amount of dynamically allocated shared memory (optional).
- The CUDA stream on which the kernel is launched (optional).
An OpenCL kernel launch, on the other hand, requires the following:
- The OpenCL command queue on which the kernel is launched.
- The number of indexing dimensions (1, 2, or 3).
- The total number of work items (threads).
- The work-group size, i.e. work items per work-group (optional).
- An event associated with the kernel (optional).
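To make the difference concrete, the two launch sequences look roughly like this for a hypothetical `vecAdd` kernel (error checking omitted; `queue`, `myKernel`, and the device buffers are assumed to have been created already):

```c
/* CUDA runtime: grid/block sizes go into the <<<...>>> launch syntax. */
int n = 1024;
dim3 blockSize(256);
dim3 gridSize((n + blockSize.x - 1) / blockSize.x);
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
cudaDeviceSynchronize();

/* OpenCL: arguments are bound one by one, then the kernel is enqueued
 * on a command queue with a global and (optional) local work size. */
size_t localSize  = 256;
size_t globalSize = ((n + localSize - 1) / localSize) * localSize; /* total work items */
clSetKernelArg(myKernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(myKernel, 1, sizeof(cl_mem), &d_b);
clSetKernelArg(myKernel, 2, sizeof(cl_mem), &d_c);
clSetKernelArg(myKernel, 3, sizeof(int), &n);
clEnqueueNDRangeKernel(queue, myKernel, 1, NULL,
                       &globalSize, &localSize, 0, NULL, NULL);
clFinish(queue);
```

Note that in OpenCL the global size is the *total* number of work items, whereas in CUDA the grid size counts *blocks*; hence the rounding of `globalSize` up to a multiple of the work-group size.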
The CUDA runtime API is a high-level (and much easier to use) interface built on top of the CUDA driver API. In a CUDA runtime application, a default context and a default stream are created on the first CUDA API call. The application is statically compiled by the NVIDIA compiler nvcc (for device code) together with a host compiler (for the C++ host code). Device code may use C++ features.
In contrast, OpenCL provides a low-level API, so the context and command queue have to be created explicitly. This requires more code, but it gives more control over the execution of the program. OpenCL device code implements a subset of the C99 standard, so C++ features such as templates and overloading cannot be used in kernels. Kernels are compiled dynamically at runtime to provide cross-platform, cross-device behavior. So instead of being compiled statically, OpenCL device code is provided in either of the following ways:
- As a string literal hardcoded inside the source file.
- In a .cl or .clc source file, loaded at runtime and compiled by the OpenCL runtime.
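A minimal sketch of the second approach, assuming a `context` and `device` have already been obtained (the file name `vecadd.cl` and kernel name `vecAdd` are placeholders):

```c
/* Read the kernel source from a .cl file into a string. */
FILE *f = fopen("vecadd.cl", "rb");
fseek(f, 0, SEEK_END);
size_t size = ftell(f);
rewind(f);
char *src = malloc(size + 1);
fread(src, 1, size, f);
src[size] = '\0';
fclose(f);

/* Compile the source at runtime for the target device,
 * then look up the kernel by name. */
cl_program program = clCreateProgramWithSource(context, 1,
                                               (const char **)&src, &size, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "vecAdd", NULL);
free(src);
```

Because compilation happens at runtime, the same application binary can run on any vendor's OpenCL device; the trade-off is the extra build step and the need to handle build errors (via `clGetProgramBuildInfo`) at runtime.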
You can download this vector addition example to get a clear idea of the differences between CUDA and OpenCL.