Using OpenCL’s Global Work Offset

In my last post I explained why you may want to slow your OpenCL program down a little to allow the rest of the system to remain responsive. However, I didn't describe an easy way to do that, or investigate how big a performance hit your program would take. I will now rectify this 🙂

The Simple Way To Split

Okay, let’s say you’ve got a simple OpenCL kernel dispatch set up something like

status = clEnqueueNDRangeKernel( queue,
                                 kernel,
                                 workDim,
                                 NULL,        // global work offset (not used yet)
                                 globalSize,
                                 localSize,
                                 0,
                                 NULL,
                                 &myEvent );

and it’s taking too long to execute on the GPU, so we want to split it up into multiple calls to clEnqueueNDRangeKernel(). The easiest way to do this is through the global work offset parameter to clEnqueueNDRangeKernel().

What this parameter does is alter the values returned by get_global_id() in your kernel. Assuming this is how your kernel locates itself in the overall dispatch, we can make each smaller partial dispatch appear to the kernel as if it were still part of the original, larger dispatch. Our dispatch now looks something like

for( size_t z = 0; z < numDispatch[2]; z++ )
{
    globalOffset[2] = partialSize[2] * z;
    for( size_t y = 0; y < numDispatch[1]; y++ )
    {
        globalOffset[1] = partialSize[1] * y;
        for( size_t x = 0; x < numDispatch[0]; x++ )
        {
            globalOffset[0] = partialSize[0] * x;

            // Each partial dispatch covers partialSize work items,
            // starting at globalOffset within the original global range.
            status = clEnqueueNDRangeKernel( queue,
                                             kernel,
                                             workDim,
                                             globalOffset,
                                             partialSize,
                                             localSize,
                                             0,
                                             NULL,
                                             NULL );
        }
    }
}

What’s happened here is that the global dispatch has been split up into numDispatch blocks, each with a size of partialSize. We then put the clEnqueueNDRangeKernel() call in a loop, updating the global offset as we go.

Two things have been assumed here to make things clearer in this example:

  1. The original globalSize can be exactly split into multiples of partialSize. Generally this won’t be the case and you’ll have to handle the edges with smaller dispatches.
  2. partialSize is a multiple of localSize as required by the OpenCL specification.

In reality you’ll probably want to choose partialSize as some multiple of localSize and then calculate numDispatch from there, rather than the other way around (a rough sketch is below). More detail on how to calculate this splitting will have to be left for another time 🙂
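
To make that a little more concrete, here’s a rough one-dimensional sketch of the kind of calculation I mean. The variable names (chunkGroups in particular) are made up for illustration, and the leftover work items are handled with the smaller edge dispatch mentioned in point 1 above.

// Sketch: derive partialSize from localSize for one dimension, then work
// out how many full partial dispatches are needed and what is left over.
size_t globalSize  = 1 << 20;                  // total work items (example)
size_t localSize   = 64;                       // work group size
size_t chunkGroups = 256;                      // groups per partial dispatch (tuning knob)

size_t partialSize = chunkGroups * localSize;  // a multiple of localSize
size_t numFull     = globalSize / partialSize; // full-sized partial dispatches
size_t remainder   = globalSize % partialSize; // leftover work items

for( size_t i = 0; i < numFull; i++ )
{
    size_t offset = i * partialSize;
    // enqueue a partial dispatch of partialSize work items at offset
}

if( remainder != 0 )
{
    size_t offset = numFull * partialSize;
    // enqueue a final, smaller dispatch of remainder work items;
    // remainder is still a multiple of localSize as long as
    // globalSize itself is a multiple of localSize
}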

The Subtleties

There are a few things to watch out for with this approach, and some things we can safely ignore:

  1. As each work group is effectively independent of any other work group in a dispatch, we are free to split work groups into different dispatches. This is what allows this multi-dispatch technique to work.
  2. You can no longer use get_global_size() in your kernel to find out the overall size of what you’re doing, for example when you’re processing an image and using get_global_size() to find out its dimensions. Instead, those dimensions must be passed in as kernel arguments (see the kernel sketch after this list).
  3. The same goes for get_num_groups(). You will have to either calculate this within your kernel or pass it in explicitly.
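
To make points 2 and 3 concrete, here’s a minimal kernel sketch for the image-processing example; the argument names width and height and the simple copy operation are just placeholders I’ve made up. Note that get_global_id() already includes the global work offset, so each work item still sees its correct position within the full image.

// Sketch: kernel for an image processed with split dispatches.
// width and height replace what get_global_size() would have returned
// for a single, unsplit dispatch.
__kernel void process_image( __global const float* src,
                             __global float*       dst,
                             int                   width,
                             int                   height )
{
    // get_global_id() includes the global work offset, so these are the
    // work item's coordinates in the full image, not the partial dispatch.
    int x = get_global_id( 0 );
    int y = get_global_id( 1 );

    if( x >= width || y >= height )
        return;

    // Placeholder operation: copy the pixel through unchanged.
    dst[ y * width + x ] = src[ y * width + x ];
}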

The Speed

I previously asserted that multiple dispatches aren’t much slower than a single dispatch. Backing that up with some numbers: a single clEnqueueNDRangeKernel() call seems to cost about 10 to 100 microseconds, which is very little overhead against a kernel dispatch that takes 30k microseconds to execute. That 30k microseconds is chosen because it is roughly the frame time of a 30 frames-per-second display.
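
If you want to check numbers like these on your own hardware, one way (a sketch only, not how the figures above were measured) is to attach an event to one of the partial dispatches from the loop above and read its timestamps back with clGetEventProfilingInfo(). This requires the command queue to have been created with CL_QUEUE_PROFILING_ENABLE.

// Sketch: time a single partial dispatch with OpenCL event profiling.
// Reuses queue, kernel, workDim, globalOffset, partialSize and localSize
// from the loop above; the queue must have CL_QUEUE_PROFILING_ENABLE set.
cl_event evt;
status = clEnqueueNDRangeKernel( queue, kernel, workDim,
                                 globalOffset, partialSize, localSize,
                                 0, NULL, &evt );
clWaitForEvents( 1, &evt );

cl_ulong queued = 0, started = 0, ended = 0;
clGetEventProfilingInfo( evt, CL_PROFILING_COMMAND_QUEUED,
                         sizeof( queued ), &queued, NULL );
clGetEventProfilingInfo( evt, CL_PROFILING_COMMAND_START,
                         sizeof( started ), &started, NULL );
clGetEventProfilingInfo( evt, CL_PROFILING_COMMAND_END,
                         sizeof( ended ), &ended, NULL );

// Profiling timestamps are in nanoseconds.
printf( "launch overhead: %.1f us, execution: %.1f us\n",
        ( started - queued ) / 1000.0,
        ( ended - started ) / 1000.0 );

clReleaseEvent( evt );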

The caveat here is that your individual dispatches must still have enough work items in them to fully occupy the GPU and allow it to hide latency. As we started out on this venture with a mammoth-sized dispatch that was stalling the GPU for seconds at a time, there should be plenty of work items to go around. Once the GPU becomes starved of work items, i.e. you’ve split the dispatch up too much, there will be a large performance hit.

The Conclusion

Hopefully I’ve given enough technical detail to allow you to use the global work offset in your own code. If some bits aren’t clear, just drop a comment.
