Many CUDA beginners learn how to write, test, and profile CUDA kernels, but most of the time they use randomly generated input data. When faced with a real-world problem, they are unsure how to acquire the input data and process it on the GPU.

In this tutorial, I will show you how to acquire input images on the host using OpenCV and then pass them to a CUDA kernel for processing. For this tutorial, I will write a basic CUDA function that converts a color input image to a grayscale image. I assume that you have the CUDA Toolkit and OpenCV installed on your system. Here’s a good tutorial on setting up OpenCV on your machine with Visual Studio.

We start by writing a CUDA kernel that converts an input BGR image to a grayscale image. The kernel will look something like this:

__global__ void bgr_to_gray_kernel(unsigned char* input, 
					 unsigned char* output, 
					 int width,
					 int height,
					 int colorWidthStep,
					 int grayWidthStep)
{
	//2D Index of current thread
	const int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
	const int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

	//Only valid threads perform memory I/O
	if((xIndex < width) && (yIndex <  height))
	{
		/* Kernel Code Here */
	}
}
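The body of the kernel is left out above. As a rough sketch of what it might contain, the block below fills it in under a few assumptions that are not stated in the tutorial: the input stores 3 bytes per pixel in B, G, R order, each row starts at a multiple of the row step (in bytes), and the gray value is the usual 0.299/0.587/0.114 weighted sum of red, green, and blue. The helpers `bgr_to_gray_pixel` and `convert_pixel` and the `HOST_DEVICE` macro are hypothetical names introduced here so that the per-pixel logic can also be checked on the host without a GPU.

```cpp
// Expands to CUDA qualifiers under nvcc and to nothing under a plain C++
// compiler, so the per-pixel logic can be exercised on the host as well.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: weighted sum of the three channels.
// The weights are an assumption (common luma weights), not taken from the tutorial.
HOST_DEVICE inline unsigned char bgr_to_gray_pixel(unsigned char blue,
                                                   unsigned char green,
                                                   unsigned char red)
{
    const float gray = 0.114f * blue + 0.587f * green + 0.299f * red;
    return static_cast<unsigned char>(gray + 0.5f); // round to nearest
}

// Hypothetical helper: convert the pixel at column x, row y.
// colorWidthStep and grayWidthStep are the row sizes in bytes, padding included.
HOST_DEVICE inline void convert_pixel(const unsigned char* input,
                                      unsigned char* output,
                                      int x, int y,
                                      int colorWidthStep,
                                      int grayWidthStep)
{
    const int color_tid = y * colorWidthStep + 3 * x; // 3 bytes per BGR pixel
    const int gray_tid  = y * grayWidthStep + x;      // 1 byte per gray pixel

    output[gray_tid] = bgr_to_gray_pixel(input[color_tid],      // blue
                                         input[color_tid + 1],  // green
                                         input[color_tid + 2]); // red
}

#ifdef __CUDACC__
// Same signature as the kernel in the tutorial, with the body filled in.
__global__ void bgr_to_gray_kernel(unsigned char* input,
                                   unsigned char* output,
                                   int width,
                                   int height,
                                   int colorWidthStep,
                                   int grayWidthStep)
{
    // 2D index of the current thread
    const int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    const int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    // Only threads that map to a pixel inside the image touch memory
    if ((xIndex < width) && (yIndex < height))
    {
        convert_pixel(input, output, xIndex, yIndex, colorWidthStep, grayWidthStep);
    }
}
#endif
```

Indexing with the row step rather than with `width` matters because OpenCV may pad each row, in which case `3 * width` bytes per row would point at the wrong pixel.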

The next step is to create a wrapper function that calls the CUDA kernel above. Since we are using OpenCV as the front end, it is convenient to use OpenCV’s image data structures as the wrapper function’s arguments, like this:

void convert_to_gray(const cv::Mat& input, cv::Mat& output);

Inside the wrapper function, allocate device memory for the input and output images and copy the data from the OpenCV Mat to the input device buffer.

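//Total bytes per image: bytes per row (step, including any padding) times the number of rows
const size_t colorBytes = input.step * input.rows;
const size_t grayBytes = output.step * output.rows;

//Allocate device memory for the input and output images
unsigned char *d_source, *d_destination;
cudaMalloc(&d_source,colorBytes);
cudaMalloc(&d_destination,grayBytes);

//Copy the input image from the OpenCV Mat to the device
cudaMemcpy(d_source,input.ptr(),colorBytes,cudaMemcpyHostToDevice);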

Once the memory copy is complete, launch the CUDA kernel to perform the color conversion, then copy the results back to the output OpenCV Mat. This is all it takes to create a CUDA wrapper for a host-side library. We can pass an OpenCV Mat acquired from disk or from a camera as an argument to this wrapper function, and the processing will be done on the GPU. The philosophy behind the OpenCV GPU module is more or less the same, except that it uses the fancy GpuMat wrappers for device memory and its functionality is highly optimized to run on GPUs.
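Putting the steps together, the wrapper could look roughly like the sketch below. It is not the code from the downloadable example. It assumes that `input` is an 8-bit, 3-channel BGR Mat and that `output` has already been allocated by the caller as an 8-bit, single-channel Mat of the same size. The 16x16 block size is an arbitrary choice, and `blocks_to_cover` is a hypothetical helper introduced here. The CUDA and OpenCV parts are guarded by `__CUDACC__` so that the helper alone also compiles as plain C++.

```cpp
// Hypothetical helper: number of blocks of size blockSize needed to cover
// `length` elements, rounding up so the last partial block is included.
inline int blocks_to_cover(int length, int blockSize)
{
    return (length + blockSize - 1) / blockSize;
}

#ifdef __CUDACC__
#include <cstdio>
#include <cuda_runtime.h>
#include <opencv2/core/core.hpp>

// The kernel from the tutorial, defined elsewhere in the same project.
__global__ void bgr_to_gray_kernel(unsigned char* input, unsigned char* output,
                                   int width, int height,
                                   int colorWidthStep, int grayWidthStep);

void convert_to_gray(const cv::Mat& input, cv::Mat& output)
{
    // Bytes per image: row step (padding included) times the number of rows
    const size_t colorBytes = input.step * input.rows;
    const size_t grayBytes  = output.step * output.rows;

    // Allocate device buffers and upload the input image
    unsigned char *d_source = nullptr, *d_destination = nullptr;
    cudaMalloc(&d_source, colorBytes);
    cudaMalloc(&d_destination, grayBytes);
    cudaMemcpy(d_source, input.ptr(), colorBytes, cudaMemcpyHostToDevice);

    // One thread per pixel; the grid is rounded up to cover the whole image
    const dim3 block(16, 16); // assumed block size
    const dim3 grid(blocks_to_cover(input.cols, block.x),
                    blocks_to_cover(input.rows, block.y));

    bgr_to_gray_kernel<<<grid, block>>>(d_source, d_destination,
                                        input.cols, input.rows,
                                        static_cast<int>(input.step),
                                        static_cast<int>(output.step));

    // Wait for the kernel to finish and report any launch or execution error
    const cudaError_t status = cudaDeviceSynchronize();
    if (status != cudaSuccess)
    {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
    }

    // Download the result into the output Mat and release device memory
    cudaMemcpy(output.ptr(), d_destination, grayBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_source);
    cudaFree(d_destination);
}
#endif
```

A caller would typically load the image with `cv::imread`, create the output as `cv::Mat output(input.rows, input.cols, CV_8UC1)`, and then call `convert_to_gray(input, output)`. Rounding the grid size up is what makes the bounds check inside the kernel necessary, since the threads in the last row and column of blocks can fall outside the image.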

You can download the complete, working example from the link below. Make sure to run the example code in a CUDA C++ project.

Download Source