If you are used to programming GPU applications with the CUDA runtime API and have a clear understanding of the CUDA architecture, porting an application to OpenCL may be a little confusing, but it is not difficult. To demonstrate the differences between the CUDA runtime and OpenCL precisely, a downloadable vector-addition example is attached at the end of this tutorial.
The concepts of threads, blocks, and kernels are the same; one of the major differences, however, is how the kernel is launched and the number of API calls required to do so. OpenCL is closer in spirit to the CUDA driver API, but in this article we will show how an OpenCL program equivalent to a CUDA runtime application can be written. The following are the terminologies and API calls used in a CUDA runtime application, along with their OpenCL counterparts.
**General**

| CUDA | OpenCL |
| --- | --- |
| Thread | Work item |
| Block | Work group |
| Global memory | Global memory |
| Shared memory | Local memory |
| Constant memory | Constant memory |
| Local memory | Private memory |
| Stream | Command queue |
| Texture | Image |
| Surface | Writable image |

**Qualifiers**

| | CUDA | OpenCL |
| --- | --- | --- |
| Kernel | `__global__` | `__kernel` |
| Device function | `__device__` | None |
| Constant memory | `__constant__` | `__constant` |
| Device variable | `__device__` | `__global` |
| Shared memory | `__shared__` | `__local` |

**Basic API Calls**

| CUDA | OpenCL |
| --- | --- |
| `cudaMalloc` | `clCreateBuffer` |
| `cudaMemcpy` (host to device) | `clEnqueueWriteBuffer` |
| `cudaMemcpy` (device to host) | `clEnqueueReadBuffer` |
| `cudaMemcpy` (device to device) | `clEnqueueCopyBuffer` |
| `cudaMemcpy2D` (host to device) | `clEnqueueWriteBufferRect` |
| `cudaMemcpy2D` (device to host) | `clEnqueueReadBufferRect` |
| `cudaMemcpy2D` (device to device) | `clEnqueueCopyBufferRect` |
| `cudaFree` | `clReleaseMemObject` |
| `cudaStreamCreate` | `clCreateCommandQueue` |
| `cudaStreamDestroy` | `clReleaseCommandQueue` |

**Kernel Call**

| | CUDA | OpenCL |
| --- | --- | --- |
| Grid/block variables | `dim3 blockSize(x,x,x); dim3 gridSize(x,x,x);` | `size_t work_group_size[] = {x,x,x}; size_t total_work_items[] = {x,x,x};` |
| Kernel launch | `myKernel<<<gridSize, blockSize>>>(arg1, arg2, …, argn);` | `clSetKernelArg(myKernel, 0, …); clSetKernelArg(myKernel, 1, …); … clEnqueueNDRangeKernel(queue, myKernel, …);` |
| Synchronization | `cudaDeviceSynchronize()`, `cudaStreamSynchronize()` | `clFinish()` |

**Built-in Indexing**

| | CUDA | OpenCL |
| --- | --- | --- |
| Number of blocks | `gridDim` | `get_num_groups()` |
| Size of block | `blockDim` | `get_local_size()` |
| Index of block | `blockIdx` | `get_group_id()` |
| Index of thread in a block | `threadIdx` | `get_local_id()` |
| Global thread index | `blockIdx * blockDim + threadIdx` | `get_global_id()` |
| Total threads | `gridDim * blockDim` | `get_global_size()` |

**In-Device Synchronization**

| CUDA | OpenCL |
| --- | --- |
| `__syncthreads()` | `barrier()` |
| `__threadfence()` | N/A |
| `__threadfence_block()` | `mem_fence()` |
| N/A | `read_mem_fence()` |
| N/A | `write_mem_fence()` |
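As a concrete illustration of the qualifier and indexing mappings above, here is a minimal vector-addition kernel written both ways (the kernel name `vecAdd` and its parameters are our own example, not taken from the attached download):

```c
/* CUDA: one thread per element. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

/* OpenCL: one work item per element. Note the __global qualifiers on
 * buffer arguments and the built-in indexing function. */
__kernel void vecAdd(__global const float *a,
                     __global const float *b,
                     __global float *c,
                     int n)
{
    int i = get_global_id(0);  /* global work-item index */
    if (i < n)
        c[i] = a[i] + b[i];
}
```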
Launching the Kernel
Apart from these and a few other differences, an important concept to grasp is how the kernel is launched. A CUDA kernel launch requires the user to specify the following:
- Block size, i.e. the number of threads per block.
- Grid size, i.e. the number of blocks.
- Amount of dynamically allocated shared memory (optional).
- The CUDA stream on which the kernel is launched (optional).
An OpenCL kernel launch, on the other hand, requires the following:
- The OpenCL command queue on which the kernel is launched.
- The number of indexing dimensions (1, 2, or 3).
- The total number of work items (threads).
- The work-group size, i.e. work items per work-group (optional).
- An event associated with the kernel (optional).
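To make the difference concrete, the two launch sequences look roughly like this for a hypothetical `vecAdd` kernel (error checking omitted; `queue`, `myKernel`, and the device buffers are assumed to have been created already):

```c
/* CUDA runtime: grid/block sizes go into the <<<...>>> launch syntax. */
int n = 1024;
dim3 blockSize(256);
dim3 gridSize((n + blockSize.x - 1) / blockSize.x);
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
cudaDeviceSynchronize();

/* OpenCL: arguments are bound one by one, then the kernel is enqueued
 * on a command queue with a global and (optional) local work size. */
size_t localSize  = 256;
size_t globalSize = ((n + localSize - 1) / localSize) * localSize; /* total work items */
clSetKernelArg(myKernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(myKernel, 1, sizeof(cl_mem), &d_b);
clSetKernelArg(myKernel, 2, sizeof(cl_mem), &d_c);
clSetKernelArg(myKernel, 3, sizeof(int), &n);
clEnqueueNDRangeKernel(queue, myKernel, 1, NULL,
                       &globalSize, &localSize, 0, NULL, NULL);
clFinish(queue);
```

Note that in OpenCL the global size is the *total* number of work items, whereas in CUDA the grid size counts *blocks*; hence the rounding of `globalSize` up to a multiple of the work-group size.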
The CUDA runtime API is a high-level (and much easier to use) interface built on top of the CUDA driver API. In a CUDA runtime application, a default context and a default stream are created on the first CUDA API call. The application is statically compiled by the NVIDIA compiler nvcc (for device code) together with a host compiler (for the C++ host code). Device code may use C++ features.
In contrast, OpenCL provides a low-level API, so the context and command queue have to be created explicitly. This requires more code, but it gives more control over the execution of the program. OpenCL device code implements a subset of the C99 standard, so C++ features such as templates and overloading cannot be used in kernels. Kernels are compiled dynamically at runtime to provide cross-platform, cross-device behavior. So instead of being compiled statically, OpenCL device code is provided in either of the following ways:
- As a string literal hardcoded inside the source file.
- In a .cl or .clc source file, loaded at runtime and compiled by the OpenCL runtime.
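A minimal sketch of the second approach, assuming a `context` and `device` have already been obtained (the file name `vecadd.cl` and kernel name `vecAdd` are placeholders):

```c
/* Read the kernel source from a .cl file into a string. */
FILE *f = fopen("vecadd.cl", "rb");
fseek(f, 0, SEEK_END);
size_t size = ftell(f);
rewind(f);
char *src = malloc(size + 1);
fread(src, 1, size, f);
src[size] = '\0';
fclose(f);

/* Compile the source at runtime for the target device,
 * then look up the kernel by name. */
cl_program program = clCreateProgramWithSource(context, 1,
                                               (const char **)&src, &size, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "vecAdd", NULL);
free(src);
```

Because compilation happens at runtime, the same application binary can run on any vendor's OpenCL device; the trade-off is the extra build step and the need to handle build errors (via `clGetProgramBuildInfo`) at runtime.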
You can download this vector addition example to get a clear idea of the differences between CUDA and OpenCL.