A simple CUDA kernel + mex file (part 1)

In the previous post, I introduced the various problems that I have encountered when I had to install some external CUDA library on our Mac workstations. In this post and the next, I will take a simple example (moving data from array A to array B through the GPU) to show how I finally achieved this installation process. While the application is not really useful, it demonstrates all the various tips and tricks that we applied.

The big picture

So, let’s say we want to copy array A to array B, through the GPU and from Matlab. What exactly do we need, how do we organize the code ?

File layout

As mentioned previously, our goal is to isolate the CUDA-specific part (that will be compiled with CUDA’s nvcc) and the Matlab-specific part (that will be compiled using the mex script provided by Mathworks). Hence, we embed all the CUDA function calls inside C-wrappers, and we build a small library out of that. Then, we pass this library to the linkage option of mex, and voilĂ  !

(At least, it’s how it should work…)

So, we’re going to have :

  • for the C+CUDA part : a file dataloop.h that contains only C-compliant function declarations, and a file dataloop.cu that will implement these functions and the CUDA kernels, eventually compiled into a static library (.a, see below)
  • for the Matlab part : a mex_dataloop.cpp file that implements the gateway function required by Matlab and links against the said library
  • for sanity checks : a main.cpp file that executes the code of dataloop.cu outside Matlab, just to be sure that everything works.

Dynamic vs. Static linking ?

A first Mac caveat here.

MacOS does not have exactly the same X-server model as Linux systems have ; there is no kind of “default” behind-the-scene terminal like the one you access with the Ctrl-Alt-F7 keys. Graphical applications can be (and are usually) launched without reading the environment values that may be set in your .profile. Furthermore, the latest Matlab releases 1 seem to not launch an Xterm and an X11 instance anymore on Mac platforms, hence we’re almost sure that environment variables set in the .profile are not read.

Since I wasn’t sure that Matlab would be able to find the necessary dynamic libraries (even when using standard locations such as /usr/local/lib or modifying the DYLD_LIBRARY_PATH 2 variable) I preferred static linkage over dynamic linkage for our task, and we added the appropriate linker command (with ar) in the Makefile.

Finally : the code explained

The kernel

The CUDA kernel is really simple and just copies the data from one memory position to another :

__global__ void dataloop(float *src, float *dest)
    int tid = blockIdx.x;
    dest[tid] = src[tid];

As stated in the previous post, all the CUDA-specific calls are embedded in a regular C wrapper, so that we do not expose any CUDA additional syntax in the public interface :

void process_data_with_cuda(float *host_src, float *host_dest, int N)
    float *d_src = NULL;
    float *d_dest = NULL;

    memset(host_dest, 0, N*sizeof(float));

    // Allocate on device
    cudaMalloc((void**)&d_src, N*sizeof(float));
    cudaMalloc((void**)&d_dest, N*sizeof(float));

    // Transfer src to device
    cudaMemcpy(d_src, host_src, N*sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel
    dataloop<<<N, 1>>>(d_src, d_dest);

    // Fetch data back
    cudaMemcpy(host_dest, d_dest, N*sizeof(float), cudaMemcpyDeviceToHost);

    // Release memory

Again, this code is fairly straightforward to understand (and not really useful) :

  1. we allocate memory on the GPU with cudaMalloc,
  2. we transfer the data from the CPU to the GPU with cudaMemcopy + cudaMemcpyHostToDevice,
  3. we use the CUDA kernel to copy this data from the source GPU slot to the target GPU memory slot,
  4. we fetch the data back to the CPU memory (cudamemcpy again),
  5. finally, we gently release the memory we allocated on the device with cudaFree.

Anatomy of the mex file

#include <iostream>
#include <cassert>

#include "dataloop.h"
#include "mex.h"
#include "matrix.h"

int const kNDims = 2;

// Gateway //
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, mxArray const *prhs[])
    std::cerr << __FUNCTION__ << " with args:\n";
    std::cerr << "Left: " << nlhs << " | " << plhs << std::endl;
    std::cerr << "Right: " << nrhs << " | " << prhs << std::endl;

    std::cerr << "Reading input...\n";
    mxArray const *src = prhs[0];
    float *p_src = (float*)mxGetData( src );
    mwSize const *srcSize = mxGetDimensions( src );
    int N = (srcSize[0] == 1 ? srcSize[1] : srcSize[0]);

    std::cerr << "Src: " << srcSize[0] << "x" << srcSize[1] << std::endl;
    std::cerr << "N = " << N << std::endl;

    std::cerr << "Configuring output...\n";

    plhs[0] = mxCreateNumericArray(kNDims, srcSize, mxSINGLE_CLASS, mxREAL);
    float *p_dest = (float*)mxGetData( plhs[0] );

    std::cerr << "CUDA calling ! (There's a tribute here...)\n";
    process_data_with_cuda(p_src, p_dest, N);

    std::cerr << "Back from GPU. Check your Post-Traumatic-Data-Disorder syndrom.\n";
    std::cerr << std::endl;

    for (int i = 0; i < N; ++i) {}
        std::cerr << "(" << i << ")\t@in = " << *(p_src+i)
                  << "\t@out = " << *(p_dest+i)
                  << std::endl;

The file dataloop.h contains our public interface, i.e. only the declaration of the process_data_with_cuda() function. The files mex.hand matrix.h declare the functions and structures that Matlab uses, and the prototype of mexFunction() is also imposed by Matlab.

Then, we log some information to the standard error stream (it should happen in read in your Matlab command-line window). This information includes the size of the input matrix. Note that the code handle both rows and column vectors by setting N to the greatest value between the number of rows and columns of the input.

Then, we allocate memory for the output vector, using the size read from the input data. Note that we create an array of floating-point values by passing mxSINGLE_CLASS to the mxCreateNumericArray function : while Matlab natively uses double values (8 bytes per real number), most GPUs still only handle float types (4 bytes). On the other side, in the Matlab call to our mex file, you also need to convert the input data do float with the command single().

When everything is setup, we call our main function process_data_with_cuda() that performs the actual computation on the GPU. This function takes as inputs the pointers to the source and destination data, and knows nothing about Matlab, while nothing about CUDA is exposed to Matlab and the mex script in this file.

And the Makefile ?

Since this post is already long enough, I postpone the explanation of the Makefile to the next post, coming very soon.

If you already now about Makefile syntax, CUDA compiling and mex script usage, then you can probably guess almost everything that’s inside this file. There are however a couple of caveats that are Mac-platform specific, that I will detail in this future post.

If you don’t want to miss the next post, you can register to the blog’s feed or follow me on Twitter!

  1. We installed the R2012a. ^
  2. Mac’s LD_LIBRARY_PATH ^