Python, OpenGL and CUDA/CL

@for Developers
@author Kai Ruhl
@since 2011-09

Python is a nice language for prototyping things. It can wrap C++ libraries (required for performance sensitive parts) quite well, as evident e.g. by the fast numpy (array manipulation) library. Additionally, writing programs in CUDA or OpenCL also has awesome performance. In this article I describe how to setup a nice Python environment.

1 Python basics

To start with Python [tutorial], I start with ipython, a nice command line interface that works a bit like a super-calculator. The numpy library [docs] (partially replaces scipy) for array manipulation is a must, as are the boost library [docs] for C++ interoperability, and pyopengl [docs] with SDL (pygame [docs]) for visualization.

apt-get install build-essential python-dev python-ctypeslib ipython python-setuptools python-pip
apt-get install python-scipy-dbg python-numpy-dbg python-h5py python-opencv
apt-get install libboost-all-dev libboost-python-dev libboost-thread-dev

To work with "float images" (not the usual byte images), which are required for optical flow fields, I install OpenEXR [docs].

apt-get install openexr openexr-viewers libopenexr-dev libilmbase-dev
easy_install -U openexr

For nicer development, install Eclipse for C++ Developers [download], and from the update manager, pydev [url:] [info]. This allows code completion, syntax checking etc. within an IDE, which is maybe preferable to command line. If you are debugging in ipython, any file can be included via %run -i command.

2 PyOpenGL

Since we want OpenGL [intro1] [intro2] [api] [mistakes] interoperability with CUDA and OpenCL, we have to install the pyopengl (and pygame) first. We can either use the Ubuntu version (more stable) or the one from pypi [site] (faster, but somewhat likely to give unwanted exceptions in Python):

apt-get install python-opengl
easy_install -U pyopengl
easy_install -U pyopengl-accelerate

apt-get install python-pygame

To test the OpenGL, we execute a minimal example [src]:

from OpenGL.GL import *; from OpenGL.GLUT import *; from OpenGL.GLU import *; import random

def display():
    glClear(GL_COLOR_BUFFER_BIT); glBegin(GL_POINTS)
    x = [0.0,640.0,320.0]; y = [0.0,0.0,480.0];
    curx = 0; cury = 320; glVertex2f(curx,cury)
    for i in range(0,500000):
        idx=random.randint(0,2); curx=(curx+x[idx])/2.0; cury=(cury+y[idx])/2.0
    glEnd(); glFlush()

if __name__ == '__main__':
    glutInitWindowSize(640, 480); glutCreateWindow("Sierpinski"); glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB)
    glClearColor(1.0,1.0,1.0,0.0); glColor3f(0.0,0.0, 0.0); glPointSize(1.0)
    glMatrixMode(GL_PROJECTION); glLoadIdentity(); gluOrtho2D(0.0,640.0,0.0,480.0)

Note that GLUT is the most horrible of OpenGL frameworks; QT/PySide [docs] would be the most full-blown. A middle ground and more compact framework is SDL/pygame, tested as follows (with ball texture):

import pygame
from pygame.locals import KEYDOWN, K_ESCAPE

if __name__ == "__main__":
    pygame.init(); keep_running = True
    size = width, height = 320, 240; speed = [2, 2]; black = 0, 0, 0
    screen = pygame.display.set_mode(size, pygame.HWSURFACE)
    ball = pygame.image.load("ball.gif"); ballrect = ball.get_rect()
    while keep_running:
        for event in pygame.event.get():
            if event.type == pygame.QUIT or (event.type == KEYDOWN and event.key == K_ESCAPE):
                pygame.display.set_caption("Shutting down...")
                keep_running = False
        # Move the ball, bounce back on borders.
        ballrect = ballrect.move(speed)
        speed[0] = -speed[0] if ballrect.left < 0 or ballrect.right > width else speed[0]
        speed[1] = -speed[1] if < 0 or ballrect.bottom > height else speed[1]
        # Fill the background screen buffer, then flip.
        screen.fill(black); screen.blit(ball, ballrect)
        pygame.display.flip(); pygame.time.wait(10)

Having that setup, we can continue to compile PyCUDA and PyOpenCL with OpenGL interop.

3 CUDA (the C part)

Using PyCUDA [docs] requires an appropriate Nvidia graphics card [check] and installation of the Nvidia CUDA software: A recent driver with CUDA toolkit [download] (install as root) and the GPU computing SDK (install as user).

In my standard installation, I install the CUDA toolkit either to /opt/cuda or use the x-swat [home] Ubuntu Nvidia drivers; then I install the GPU computing SDK [ubuntu guide] (a set of CUDA examples) to somewhere in my home. Note: CUDA uses the gcc of one generation behind, this requires some configuration. Nouveau Linux graphics card driver may conflict with Nvidia.

nano -w /etc/modprobe.d/nouveau.conf
    blacklist nouveau

# these are necessary to build.
apt-get install cutils freeglut3-dev libxi-dev libxmu-dev

./ && export CUDA_ROOT=/opt/cuda

apt-get install gcc-4.4 cpp-4.4
mkdir /opt/cuda/gcc-4.4 && cd /opt/cuda/gcc-4.4
ln -s /usr/bin/cpp-4.4 cpp
ln -s /usr/bin/gcc-4.4 gcc
ln -s /usr/bin/g++-4.4 g++

nano -w /opt/cuda/bin/nvcc.profile
    compiler-bindir = /opt/cuda/gcc-4.4

nano -w /etc/

add-apt-repository ppa:ubuntu-x-swat/x-updates
apt-get update && apt-get upgrade

apt-get install nvidia-cuda-toolkit && export CUDA_ROOT=/usr

nano -w /etc/nvcc.profile
    INCLUDES += -I/usr/lib/nvidia-cuda-toolkit/include

nano -w /etc/

su <username> && ./
    # path: /local/ruhl/devel/cuda-sdk-4.0

After this somewhat painful setup, it is time to test the installation. For this, compile the GPU computing examples and start e.g. the nbody app.

cd /local/ruhl/devel/cuda-sdk-4.0/C

nano -w common/
    CUDA_INSTALL_PATH ?= /opt/cuda
nano -w common/

cd bin/linux/release/

And voila: Your first step towards parallel computing with CUDA [docs].

4 PyCUDA (the Python part)

After having CUDA up and running, download PyCUDA [install] [ubuntu install] and install (as user):

# only if using Ubuntu nvidia drivers:
ln -s /usr/lib/nvidia-current/ /usr/lib/

git clone tar -xvzpf pycuda-2012.1.tar.gz

./ --cuda-root=/opt/cuda --cuda-enable-gl --boost-compiler=gcc44

python build
sudo python install

For some reason, OpenGL interoperability is not enabled by default. Also, must reside in /usr/lib, regardless of what you tell After installation, you may have to edit, in Eclipse, Window -> Preferences -> Pydev -> Interpreter - Python to include the appropriate path (usually /usr/local/lib/python2.7/dist-packages). then test the installation with a minimal example of PyCUDA [script], not using OpenGL yet [tutorial]:

from numpy import *; import pycuda.autoinit;
import pycuda.driver as cuda; from pycuda.compiler import SourceModule

cuda_lib = SourceModule("""
__global__ void doublify(float *a_ram)
  int idx = threadIdx.x + threadIdx.y*4;
  a_ram[idx] *= 2;

a_ram = random.randn(4, 4).astype(float32)          # create 4x4 image.
a_gpu = cuda.to_device(a_ram);                      # push 4x4 image to GPU.
func = cuda_lib.get_function("doublify")            # get CUDA function.
func(a_gpu, block=(4, 4, 1))                        # call 4x4x1 threads on CUDA function.
a_res = cuda.from_device(a_gpu, (4, 4), float32)    # poll result from GPU.
passed = (a_ram * 2 == a_res).all()                 # compare to CPU calculation.
print "Test passed: %s" % (passed)

All the initialisation pain is taken away by importing pycuda.autoinit, and all cleanup is done by the Python garbage collector. Functions can be defined inline, as e.g. in shader programming. A block (either 1D, 2D or 3D) defines the number of concurrent threads, a grid the number of concurrent/sequential blocks. Arrays in RAM have to be copied to GPU ("device") memory, and after computation back again.

The interoperability of PyCUDA and OpenGL works over buffer objects. In particular, you create an OpenGL buffer object with glGenBuffers(), then bind the same thing to PyCUDA with the

Details can be found in the example, which accesses a RGBA char texture buffer from both OpenGL and CUDA. The example is a bit low on performance (render into framebuffer, copy into texture buffer, apply CUDA operation, copy back to screen), but gives a good general idea.

5 PyOpenCL

CUDA (or the ATI equivalent) is required for PyOpenCL, so install CUDA first, and the pertinent OpenCL implementation.

apt-get install opencl-headers nvidia-opencl-dev ocl-icd-opencl-dev

After that, download or checkout PyOpenCL [install] [ubuntu install] and install:

# Depending on whether you trust releases:

git clone tar -xvzpf pyopencl-2012.1.tar.gz

./ --cl-enable-gl --boost-compiler=gcc44

python build
sudo python install

Again, GL interop must be enabled manually with the --cl-enable-gl flag. If on Ubuntu 13.04 (raring), specify OpenCL 1.1 as target by adding CL_PRETEND_VERSION="1.1" to [forum]. To test pure OpenCL [api], take a test tour with a minimal example [script]:

import pyopencl as cl; import numpy as np; import numpy.linalg as la

ctx = cl.create_some_context();
                                       # create OpenCL context
prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
    __global const float *b, __global float *c)
      int gid = get_global_id(0);
      c[gid] = a[gid] + b[gid];

queue = cl.CommandQueue(ctx); mf = cl.mem_flags                        # create OpenCL queue
a_ram = np.random.rand(50000).astype(np.float32)                       # create buffers
b_ram = np.random.rand(50000).astype(np.float32)
c_ram = np.empty_like(a_ram)
a_gpu = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_ram) # upload/allocate buffers on GPU
b_gpu = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_ram)
c_gpu = cl.Buffer(ctx, mf.WRITE_ONLY, b_ram.nbytes)

prg.sum(queue, a_ram.shape, None, a_gpu, b_gpu, c_gpu)                 # call GPU function
cl.enqueue_copy(queue, c_ram, c_gpu)                                   # download result to RAM

passed = 0.0 == la.norm(c_ram - (a_ram + b_ram))                       # compare to CPU calculation.
print "Test passed: %s" % (passed,)

You need to explicitely create a context to compile a program; then a queue to run that program; and up/download the RAM data to GPU similar to CUDA. And that would be it.

6 C++ within Python

Of course, some tasks require sequential memory access and are not suitable to GPU computing. For these cases, scipy.weave is your friend: It allows C++ access to numpy arrays.

import numpy as np

cc_src = """
float v;
for (int i = 0; i < wid; i++) {
    for (int j = 0; j < hei; j++) {
        v = src(j, i);
        dst(j, i) = v;

# Prepare Python variables to use.
src = np.random.random((3, 4)).astype(np.float32)
dst = np.zeros_like(src)
hei, wid = src.shape

# Use variables in weave Blitz C++.
from scipy import weave
weave.inline(cc_src, ['src', 'dst', 'wid', 'hei'], type_converters=weave.converters.blitz, compiler='gcc')

print "Test passed: %s" % (np.all(src == dst),)

Two notable things are that you access numpy arrays with () not [], and that you cannot define functions. So this is really more for ad-hoc acceleration of slow Python source code. I use it often.

7 Debugging

When using ipython, I use ipdb [pypi] for debugging. Depending on your tastes, the CLI pudb [pypi] might also be good choice, and works exactly the same way.

apt-get install python-ipdb

In the python script, set a breakpoint with the following line:

import ipdb; ipdb.set_trace();

If QT is used, the QT event queue must be massaged beforehand.

import PyQt4.QtCore; PyQt4.QtCore.pyqtRemoveInputHook(); import ipdb; ipdb.set_trace();

And then, within ipython, run the script in interactive mode.

%run -i

When the breakpoint is encountered,  an ipdb> prompt will allow you to investigate the current state. h is help, s is single step, c is continue program.

EOF (Oct:2015)