# CUDA: lessons from the trenches

I’ve done a lot of casual (unfortunately not thesis-writing-related) stuff this summer, including a lot of coding. Among other things, we had the chance to host a software consultant for a few days, to test and deploy some CUDA code that they had been optimizing over the past year.

As the official Nerd PhD Student (on optical flow; see the Reference section), I was picked by a post-doc to help the consultant integrate their CUDA code on our Macs. And this is how I went through 3 days of complete madness, anger, and tears.

## Spoiler alert: starting with the end

If you’re in a hurry, here is the solution to all our problems: we bought a brand new desktop PC (not a Mac [1]) with a decent, recent NVIDIA GPU.

On my side, I was able (after the steps described below) to compile and run all the external CUDA code and to execute it from Matlab. However, the results were always garbage: we suspect a hardware incompatibility, since the card’s compute capability (1.1) is much too low.

We strongly suspect the hardware because I did manage to get a simple CUDA kernel running correctly from a mex-file. In a follow-up post, I’ll comment on this code and show how to compile it with a Makefile.

## Cross-platform development

Let’s start with something easy: the company’s code was developed and run only on Windows. It consists of a bunch of files, ultimately compiled into a DLL, which Matlab then uses through the loadlibrary and calllib functions, plus a lot of complicated boilerplate code to shape the data and hook correctly into the DLL.

A DLL is nothing mysterious: it’s just the Windows version of a dynamic library! Yes, just like a libfoo.so on Linux or a libfoo.dylib on Mac. However, a DLL has to export its symbols using DLL-specific syntax (typically `__declspec(dllexport)`), which our compiler (Apple’s version of LLVM) does not recognize.

The fix was easy: guard the Windows-only parts of the include files (*.h) using the preprocessor. I chose to use:

#if !defined(_WIN32) && !defined(_WIN64)


to separate the Windows-only parts from the Mac/Linux-only ones.
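To make this concrete, here is a minimal sketch of such a guard. The macro names `FOO_API` and `FOO_BUILD_DLL` and the function `foo_add` are hypothetical and only there for illustration; the consultants’ actual headers are not shown in this post.

```cpp
// Hypothetical export header: hides the Windows-only export syntax
// from compilers that do not understand it.
#if !defined(_WIN32) && !defined(_WIN64)
    // Mac/Linux: no DLL export keyword needed, so the macro expands to nothing.
    #define FOO_API
#else
    // Windows: export the symbol when building the DLL, import it otherwise.
    #ifdef FOO_BUILD_DLL
        #define FOO_API __declspec(dllexport)
    #else
        #define FOO_API __declspec(dllimport)
    #endif
#endif

// C linkage keeps the symbol name unmangled on every platform.
extern "C" FOO_API int foo_add(int a, int b);

// Normally this definition lives in its own source file, compiled with
// FOO_BUILD_DLL defined on Windows. It is placed here to keep the sketch
// self-contained.
extern "C" int foo_add(int a, int b) {
    return a + b;
}
```

With this in place, the same header can be included unchanged on both sides: the Windows build still sees its export keywords, and the Mac build never does.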

## It’s 2012. Let’s write a Makefile!

Of course, VC++ project files (used by the consultants) don’t work outside their original world [2]. So we needed another way to build the project. Enter the make tool and Makefile editing.

If you’ve done it before, then

• either you’ve kept doing it every day, and you feel quite comfortable with it
• or (like me and our consultant) you learned it years ago at school, then very carefully copied, pasted, and modified a skeleton Makefile (usually a compact one written by someone else) that you always keep with you.

So, we had some 30 to 60 minutes of fun editing an old Makefile to adapt it to our project. This part was a bit tedious, but manageable with the help of emacs [3] to fill the indentation with the correct number of tabs and spaces.

## Calling CUDA from a mex-function

In order to easily call (and compile!) the CUDA part from Matlab, I wrote a mex file as an interface between these two worlds. I will provide an example in a later post, but here is a rough description:

• the mex file is a vanilla C/C++ file that respects the gateway syntax expected by Matlab, for example super_mex.cpp
• it includes a vanilla C .h file, for example foo.h, that again does not exhibit any CUDA peculiarity
• all the CUDA code is placed inside C wrappers in foo.cu : CUDA kernel calls happen only inside C/C++ functions
• foo.h is included in both foo.cu and super_mex.cpp.

Then, to compile, proceed as follows:

1. use nvcc to compile the .cu file to an object file (.o), then make a library from the result
2. use the mex compiler to compile the .cpp part and link it with the output of the previous step.
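
As a rough sketch, the file layout and the two build steps above could look like the listing below, collapsed into one block so that it builds with a plain C++ compiler. The names `foo_scale` and `call_from_gateway` are hypothetical, and a CPU loop stands in for the kernel launch that the real foo.cu would contain. The commands in the comments assume default tool names and will probably need extra paths or flags on your machine.

```cpp
// Build steps (assumed command lines, adapt to your setup):
//   1. nvcc -c foo.cu -o foo.o   then   ar rcs libfoo.a foo.o
//   2. mex super_mex.cpp -L. -lfoo -lcudart   (plus -L<CUDA lib dir> if needed)

// ---- foo.h: plain C interface, no CUDA types or keywords ----
#ifdef __cplusplus
extern "C" {
#endif
// Hypothetical wrapper: writes in[i] * factor to out[i] for n floats.
// Returns 0 on success, 1 on invalid arguments.
int foo_scale(const float* in, float* out, int n, float factor);
#ifdef __cplusplus
}
#endif

// ---- foo.cu (stand-in) ----
// In the real foo.cu, this body would allocate device memory, copy `in`
// to the GPU, launch a __global__ kernel, and copy the result back.
// A CPU loop is used here so the sketch compiles without nvcc.
int foo_scale(const float* in, float* out, int n, float factor) {
    if (in == nullptr || out == nullptr || n < 0) {
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
    return 0;
}

// ---- super_mex.cpp (caller side) ----
// The Matlab gateway function would unpack its input arrays and then make
// this kind of call. It only ever sees foo.h, never any CUDA code.
int call_from_gateway(const float* data, float* result, int n) {
    return foo_scale(data, result, n, 2.0f);
}
```

The point of the split is that only foo.cu ever needs nvcc: the mex compiler sees ordinary C declarations and a library to link against.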

### That’s all… for now!

In the next post, I will detail the content of the Makefile. While most of it is standard, it contains a few lines that are really Mac-specific and can hardly be guessed if you’ve never seen them before. So, stay tuned and don’t miss the next post!

### Reference

Fast TV-L1 Optical Flow for Interactivity, D’Angelo, Emmanuel; Paratte, Johan; Puy, Gilles; Vandergheynst, Pierre, IEEE International Conference on Image Processing (ICIP) 2011, Brussels, Belgium, September 11-14, 2011


1. We’re not a Mac-only lab anymore :-(
2. Or at least I don’t know how to do it.
3. It’s included by default with Mac OS X; you can launch it from Terminal.app.