OpenCL™ is often criticized for not being “performance portable”. While it is true that the absolute best performance is only achieved by tailoring host code and kernels directly to the target platform, significant improvements can be made just by tweaking some of the knobs that the OpenCL framework already provides. The point is that OpenCL is a programming model that was specifically conceived to target different types of hardware in a portable fashion. This discussion by Comportability, the OpenCL user group, puts this nicely into perspective.
Load Balancing as an Optimization Knob
One of OpenCL’s optimization knobs is the ability to target multiple devices and divide the workload of a single kernel amongst multiple OpenCL-capable cores. Given that modern SoCs are increasingly heterogeneous multi-core environments, being able to target multiple devices at once empowers the OpenCL user to do the following:
- Analyze how different devices’ capabilities are best combined for the highest performance. This is often the primary goal in HPC.
- Determine the power footprint of different load balancing schemes and, as a result, find the lowest power consumption for a given performance target. This is often the goal on mobile or embedded platforms.
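To make the idea of dividing one kernel's workload concrete, here is a minimal sketch in plain C. The helper below is illustrative, not part of CodeBench or the OpenCL API: it splits a one-dimensional global work size into two contiguous sub-ranges, one per device. Each sub-range's offset and size would then be passed to the respective device's command queue (e.g. as the `global_work_offset` and `global_work_size` arguments of `clEnqueueNDRangeKernel`).

```c
#include <stddef.h>

/* A contiguous slice of a 1-D NDRange: where it starts and how many
 * work items it covers. */
typedef struct { size_t offset; size_t size; } sub_range;

/* Hypothetical helper: divide global_size work items between a CPU and
 * a GPU device. cpu_fraction (0.0 .. 1.0) is the share of work items
 * assigned to the CPU; the GPU gets the remainder. */
static void split_ndrange(size_t global_size, double cpu_fraction,
                          sub_range *cpu, sub_range *gpu)
{
    cpu->offset = 0;
    cpu->size   = (size_t)((double)global_size * cpu_fraction);
    gpu->offset = cpu->size;
    gpu->size   = global_size - cpu->size;
}
```

For example, splitting 1024 work items with `cpu_fraction = 0.5` yields the ranges {0, 512} for the CPU and {512, 512} for the GPU.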
Because of the way OpenCL was defined, it is possible to use OpenCL drivers for different platforms – e.g. a CPU and a discrete GPU – within the same program. This means that, in principle, a computation may be offloaded to multiple devices which could each be on different platforms. Using multiple devices from different platforms is an equally interesting case, but will be explored in a separate article.
The type of load balancing enabled by using multiple devices in one platform has been possible for quite a while on AMD APUs. Recently, Intel also announced its OpenCL 1.2 toolkit, which supports both the Intel CPU and HD Graphics simultaneously. As a result, load balancing is now possible on third-generation Intel Core platforms. In the mobile space, the Qualcomm Snapdragon S4 platform, for example, also supports heterogeneous CPU+GPU compute within one platform, albeit with an embedded profile.
Generating a Load Balancing Sweep in OpenCL CodeBench
OpenCL CodeBench has a feature called “load balancing sweep” which will partition a kernel execution, distribute it over multiple devices, and perform a sweep of the possible load variations. The runtime is measured for each iteration of the sweep, providing useful insights into the optimal load distribution for a given performance and/or power goal. With this data, static decision tables may be constructed to auto-tune a kernel on a given platform. Alternatively, the primitives generated by OpenCL CodeBench may be used to make load balancing decisions dynamically at run time, depending on power or performance measurements, changes in system load, or changes in data set sizes.
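In outline, such a sweep amounts to the loop below. This is a hand-written sketch, not the code CodeBench generates: `measure_ms` is a stand-in for caller-supplied code that partitions the kernel across the devices with the given CPU share, runs it, and returns the measured wall-clock time.

```c
#include <float.h>

/* Sketch of a load-balancing sweep: try CPU shares from 0% to 100% in
 * fixed steps and keep the split with the lowest measured runtime.
 * measure_ms is a caller-supplied function that partitions the kernel,
 * runs it on both devices, and returns the elapsed time in ms. */
static double best_cpu_fraction(double (*measure_ms)(double cpu_fraction),
                                double step)
{
    double best_frac = 0.0, best_time = DBL_MAX;
    for (double f = 0.0; f <= 1.0 + 1e-9; f += step) {
        double t = measure_ms(f);
        if (t < best_time) {
            best_time = t;
            best_frac = f;
        }
    }
    return best_frac;
}
```

The measured times for each step of the sweep are exactly the data points from which a graph or static decision table can be built.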
Median Filtering Example
Below is one simple but common example: image filtering. The filter type is a so-called median filter. A median filter replaces every pixel with the median of the RGB values of its N surrounding pixels. It is well suited as a pre-processing step – e.g. to remove noise – before applying computer vision algorithms.
The following images are a visual before/after example using an artificially polluted picture.
The interesting part about median filtering is that it has a data-parallel aspect (the same operation is performed for every pixel), but every filter operation also needs to run a sorting algorithm. So it may not be immediately clear which CPU/GPU split – if any – would be most appropriate.
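To illustrate that structure, here is a plain-C sketch of the per-pixel operation on a single 8-bit channel, using the same naive bubble-sort selection described further below. The real example operates on RGB data and runs as an OpenCL kernel, so treat this only as an illustration of the data-parallel outer loops plus the per-pixel sort:

```c
/* Naive 3x3 median filter on one 8-bit channel (grayscale sketch).
 * For each interior pixel, gather the 3x3 neighborhood, bubble-sort
 * the 9 samples, and write back the middle value. Border pixels are
 * copied unchanged to keep the sketch short. */
static void median3x3(const unsigned char *src, unsigned char *dst,
                      int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
                dst[y * width + x] = src[y * width + x]; /* copy border */
                continue;
            }
            unsigned char win[9];
            int n = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    win[n++] = src[(y + dy) * width + (x + dx)];
            /* simplistic bubble sort of the 9-sample window */
            for (int i = 0; i < 8; i++)
                for (int j = 0; j < 8 - i; j++)
                    if (win[j] > win[j + 1]) {
                        unsigned char t = win[j];
                        win[j] = win[j + 1];
                        win[j + 1] = t;
                    }
            dst[y * width + x] = win[4]; /* median of 9 */
        }
    }
}
```

The two outer loops are embarrassingly parallel – in the OpenCL kernel they map onto the NDRange – while the inner sort is exactly the serial, branchy portion that makes the best CPU/GPU split non-obvious.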
Given this, the question is: does it make sense to utilize both the CPU and the GPU, and if so, what is the best workload division for a given image size? Additionally, how does this change under system load?
The answer is in the following graph, which was generated directly from the output of the load balancing sweep that OpenCL CodeBench implemented automatically, running on an AMD A8 APU:
For this particular example it is clear that a 50/50 CPU/GPU split yields the lowest execution time, resulting in a 1.5x speedup compared to just using the CPU. While this improvement cannot be called dramatic, it came entirely for free, without modifying anything in our kernel code. In order to turn the load balancing knob in our host code, all we had to do was check the “load balancing sweep” option in the OpenCL CodeBench code generation wizard. The resulting code that we used for the example may be downloaded here.
The attentive reader may point out that one could obtain a much higher speedup by rewriting the kernel and making the sorting algorithm more GPU-friendly. The naive implementation used for the test above indeed uses a simplistic bubble sort algorithm to find the median. Had we chosen a bitonic sort algorithm instead, the results would have been heavily biased towards putting much more load on the GPU. The point, however, was not to come up with the most efficient implementation, but rather:
- To show how you can make an existing implementation run more optimally on given hardware by utilizing all the devices available.
- To demonstrate how tools like OpenCL CodeBench can automate the analysis required to determine what the best load split is.
We have analyzed load balancing results for different problems from the computer vision space, such as SIFT feature matching, k-nearest neighbor search, and even good old matrix multiplication, on platforms ranging from desktops with discrete GPUs to mobile handheld devices. The conclusions are invariably the same:
- The optimal – from a performance point of view – GPU/CPU split is often a function of the problem data set size.
- The split varies with the system load on the CPU. This brings up the need for dynamic load balancing for certain classes of algorithms.
- The best load distribution depends on the hardware platform, as we observed on a variety of desktop and embedded platforms.
- Often, even for a fixed hardware platform and system load, a single load distribution will not guarantee the best performance. Sometimes the algorithm's dynamics require different load splits during different stages of the algorithm. This is often the case with iterative, tree-based, or search-oriented problems where the data set size changes as the algorithm progresses.
OpenCL CodeBench made it possible to implement load balancing sweeps for many different kernels very quickly, without having to dive into the details of the required host code. If you want to try it for yourself, request a 30-day free trial.
Before writing this article, we presented a poster session on the same subject at the first International Workshop on OpenCL at Georgia Tech in Atlanta earlier this year. For this poster session we worked together with Prof. Dan Connors of CU Denver, who suggested taking a more in-depth look at load balancing of computer vision algorithms and also assisted in running some of the experiments on various hardware. His group is doing some interesting research on adaptive OpenCL techniques.