Be a good OpenCL citizen

You’ve been told to make your algorithm run fast. You saw the GPU as the ideal way to achieve this and waded in to code up some hardcore OpenCL. Your code now utilizes the GPU hardware to the max but your users are still complaining that your application is slow.

Getting new, more appreciative users is one option. The other is to trade in some of your hard-won performance for more apparent performance.

Humour your non-pre-emptive GPU

One of the main problems with running code on the GPU is that there is no pre-emptive multitasking. We’ve become very used to this feature on CPUs, and the thought of old Win16 code that would let one application hog the entire CPU without giving anyone else a look-in now leaves us shocked. However, once a piece of code has been sent to the GPU it will sit there until it has finished, blocking all other pending tasks. This has two immediate effects:

  1. Stuttering GUI. Chances are the system is running its graphics (i.e. your GUI) on the same GPU as your OpenCL code. This means there will be no screen updates while your compute task is running, as the graphics shaders cannot execute.
  2. Windows may reset your GPU. Windows has a built-in GPU timeout (Timeout Detection and Recovery, around two seconds by default), so if your task runs longer than this Windows will reset the GPU on the assumption that it has hung.

Number 1 makes your application appear slow, while number 2 makes it appear unstable – even worse! As a developer, it is also tricky to catch number 2 on your own system: you probably have a super-flash-wizz-bang GPU installed that crunches your compute task in a few seconds, while your long-suffering users have weeny GPUs that take ages to complete the same task.

Luckily, both these problems can be solved fairly easily: just split your compute workload into smaller chunks. To do this, call clEnqueueNDRangeKernel() multiple times, altering the kernel arguments or the global work offset each time, rather than doing one big enqueue. The downside is that your compute task will take a little longer to complete overall, but while it is completing your users can get on with other things.

Avoid host-side selfishness

Above we tried to become less selfish with regard to what executes on the GPU, but in the quest for speed we may have become selfish on the CPU side too. The most obvious trick is bumping your process priority up. Just don’t, okay 🙂 The only person who thinks your application is the most important thing running on the system is you. Others are simply not amused. The other thing you may be doing, either deliberately or accidentally, is polling for an event to finish like this:

/* Busy-wait (anti-pattern): spins a CPU core at 100% until the event
   completes – and spins forever if the command fails, since error
   statuses are negative, not CL_COMPLETE */
cl_int status;
do
{
    clGetEventInfo( myEvent,
                    CL_EVENT_COMMAND_EXECUTION_STATUS,
                    sizeof(status),
                    &status,
                    NULL );
} while( status != CL_COMPLETE );

Suddenly you’ve got one CPU core at 100% doing nothing useful. True, this is probably the fastest way to wait for an event, as it avoids any thread switching, but it’s just obnoxious and will make the overall system feel sluggish. Much better to call clWaitForEvents(), which blocks the calling thread in the driver and releases the core, and wait a little longer.

Another way to look at this is that you really shouldn’t be sending tiny workloads to the GPU where the time it takes to thread switch for the wait is the critical path. There are many ways to chain together OpenCL commands which allow you to set up a whole bunch of work for the GPU to chew away at while your CPU can sit idle or do something more useful.

Don’t listen to me – try it

So there you have it. If you split up your kernel execution and wait for events politely your application will appear faster and more responsive. And who sits with a stopwatch taking exact timings? Except reviewers. And me. And your competitors. Errrr…… 😛
