Recursion in OpenCL kernels

I overheard a colleague the other day waxing lyrical about how OpenCL didn’t support recursion and how this prevented the implementation of recursive algorithms on the GPU. However, this isn’t quite the case.

While it is true that OpenCL doesn’t currently support recursive function calls, it is not true that you can’t implement recursion…. you’ve just got to do it yourself 🙁

Calling Though The Stack

You may or may not know how a function is calling in CPU-land so I’ll just give a quick recap here. When you call a function:

  1. The function arguments are pushed onto the local stack.
  2. The current program counter is pushed onto the stack.
  3. The program counter is set to the beginning of the function being called.

There may be slight variations depending on calling convention (e.g. the order that the function arguments are pushed onto the stack may change) and architecture (e.g. most platforms have a single instruction that will push the current program counter and jump to a new program counter location) but this is enough for now.

The function will the proceed doing whatever it needs to do. When it needs to return to where it was called:

  1. The return value will be placed in a register.
  2. The return program counter will be popped off the stack.
  3. The program counter will be set to this popped value.
  4. The calling piece of code will pop the function arguments off the stack.

Again, there will be variations in the details depending on calling conventions (e.g. step 4 may be done by the called function rather than the caller).

An OpenCL Stack

OpenCL doesn’t have an STL equivalent (AMD seem to be working on something similar called Bolt) so we’ll need our own stack structure to use. Here’s a little stack that I whipped up that lives in local memory.

typedef struct
{
    __local int* stackData;
    size_t stackSize;
    size_t maxStackSize;
} Stack;

void stackInit( Stack* pStack, __local int* pStackData, size_t maxStackSize )
{
    pStack->stackData = pStackData;
    pStack->stackSize = 0;
    pStack->maxStackSize = maxStackSize;
}

void stackPush( Stack* pStack, int value )
{
    if(pStack->stackSize < pStack->maxStackSize)
    {
        pStack->stackData[pStack->stackSize] = value;
        pStack->stackSize++;
    }
}

void stackPop( Stack* pStack, int numValuesToPop )
{
    if( pStack->stackSize > (numValuesToPop - 1) )
    {
        pStack->stackSize -= numValuesToPop;
    }
}

int stackTop( Stack* pStack, int offset )
{
    if( pStack->stackSize > (0 + offset) )
    {
        return pStack->stackData[pStack->stackSize - 1 - offset];
    }
    return 0;
}

It’s not incredibly beautiful and makes me yearn for C++ in OpenCL (templates, default argument values…. I hope AMD’s Static C++ Kernel Language Extension gets some cross vendor traction) but it’ll do for now.

One note is the choice of using local memory for the stack. If we put the stack in private memory it’ll be fast but will take up more registers as we allow our stack to become bigger which will reduce the number of work items per work group. Conversely, global memory will be really slow. Local memory allows fast access and for the stack to be a reasonable size.

One thought on “Recursion in OpenCL kernels

Leave a Reply

Your email address will not be published. Required fields are marked *