OpenCL Compliance and Performance Portability

One of the benefits of OpenCL as an open standard is code portability across a number of vendors and compute platforms. CUDA is proprietary and tightly linked to NVIDIA GPUs, DirectCompute was defined for Microsoft platforms, and RenderScript is a Google creation, but OpenCL is a true industry standard developed in an open, collaborative manner by over 20 companies. The OpenCL standard defines a common language format that supports multiple classes of devices: GPUs, DSPs, FPGAs, CPUs and custom accelerators. Device compliance testing through Khronos, a not-for-profit industry consortium, ensures that silicon platforms and their runtimes adhere to the standard. Code written for one compliant device will run on any other compliant device.

But what exactly does compliance mean?

In the case of OpenCL, compliance means functional compliance. Code will run, and barriers and synchronization will be respected. Given common input code, the compute results returned by an OpenCL-compliant FPGA are the same as those generated by an OpenCL-compliant DSP or GPGPU. Accelerators look like black boxes to the system and are interchangeable. Communication between the host and the accelerator is standardized. The host code sets up the data and enqueues the kernels in the manner specified by the developer. The accelerator runtime interprets and executes the OpenCL kernels on the accelerator. Since each runtime is optimized for a specific platform, it generates device-specific executable code.

Although all accelerators work in the same manner, compliance excludes any claims about performance or power consumption. Under OpenCL, data is transferred to the accelerator, operations are enqueued on the accelerator and, after execution, the results can be read back by the host. OpenCL code executed on one accelerator may exhibit better or worse performance when run on an accelerator with different architectural features. Although the code is portable, OpenCL provides no power consumption guarantees when code is moved from one accelerator to another. Power consumption is very device specific and highly dependent on both the efficiency of the silicon and the mapping of the algorithm to the architecture.

Differences in memory types (including, but not limited to, caches), memory width and speed, and the organization of compute units all play a role in compute performance. The translation of OpenCL code into native executable code by the runtime is half the battle. The other half is feeding the device with data so that its device-specific computational resources are used optimally. This is done by using the right type of memory (e.g. pinned memory), accessing data in a coalesced fashion, choosing the right workgroup sizes and overlapping compute with data transfer where possible. Memory access patterns, implemented in the host code and aligned with the kernel code (e.g. manual caching of data, workgroup size tuning and the use of vector types), work best when organized for a specific architecture and will be less than optimal when used on a different accelerator.
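Two of the tuning points above, workgroup sizing and access patterns, can be sketched in plain C without any OpenCL calls. This is a minimal illustration, not a tuned implementation: the function names are hypothetical, and the only OpenCL fact assumed is that OpenCL 1.x requires the global work size to be evenly divisible by the local (workgroup) size.

```c
#include <assert.h>
#include <stddef.h>

/* Round the number of work-items up to a whole number of workgroups.
 * OpenCL 1.x rejects a global size that is not a multiple of the
 * local size, so host code commonly pads the global size like this
 * and lets the kernel ignore the extra work-items. */
size_t round_up_global_size(size_t n_items, size_t local_size)
{
    assert(local_size > 0);
    size_t groups = (n_items + local_size - 1) / local_size;
    return groups * local_size;
}

/* Interleaved ("coalesced") layout: at each step, neighbouring
 * work-items touch neighbouring elements. */
size_t coalesced_index(size_t work_item, size_t step, size_t n_work_items)
{
    return step * n_work_items + work_item;
}

/* Blocked layout: each work-item walks its own contiguous chunk, so
 * neighbouring work-items are elems_per_item elements apart. */
size_t blocked_index(size_t work_item, size_t step, size_t elems_per_item)
{
    return work_item * elems_per_item + step;
}
```

For example, `round_up_global_size(1000, 64)` pads 1000 work-items to 1024. Which of the two index functions performs better is exactly the kind of device-specific choice discussed above: GPUs generally favour the interleaved layout, while a CPU with per-core caches may do better with the blocked one.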

Looking at the broader picture, it seems there could and should be a better solution. If the runtime kernel compiler is target specific and can generate code unique to each accelerator, the remaining problem revolves around keeping all the compute units in the accelerator busy at all times while organizing and presenting the data appropriately. This is the function of the host code.
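As a rough sketch of what target-aware host code might look like, the helper below picks a workgroup size from a small device description. The `device_profile` struct and `choose_local_size` function are hypothetical names introduced for illustration only. In a real host program the two limits could be obtained from `clGetDeviceInfo` (`CL_DEVICE_MAX_WORK_GROUP_SIZE`) and `clGetKernelWorkGroupInfo` (`CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`); here they are passed in by hand so the example needs no OpenCL installation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of the limits that matter for workgroup sizing. */
typedef struct {
    size_t max_work_group_size;  /* hard upper limit for this device    */
    size_t preferred_multiple;   /* sizes that map well to the hardware */
} device_profile;

/* Clamp the requested workgroup size to the device limit, then round
 * it down to the preferred multiple when it is large enough to do so.
 * Always returns at least 1. */
size_t choose_local_size(const device_profile *dev, size_t requested)
{
    assert(dev != NULL);
    assert(dev->max_work_group_size > 0 && dev->preferred_multiple > 0);

    size_t size = requested < dev->max_work_group_size
                      ? requested
                      : dev->max_work_group_size;
    if (size >= dev->preferred_multiple)
        size -= size % dev->preferred_multiple;
    if (size == 0)
        size = 1;
    return size;
}
```

With an assumed GPU-like profile of `{256, 64}`, a request for 200 work-items per group becomes 192, while an assumed CPU-like profile of `{8192, 1}` keeps the requested 200. The same host logic therefore produces different launch parameters per target, without touching the kernel source.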

If an engineer were able to define the data operations in an abstracted manner, it seems there should be a host code tool that could understand the memory and data structures of the target device and then map the data to the target more effectively. If these host code data structures were optimized, better performance and lower power consumption should result.

This abstraction-driven model is the one built into OpenCL CodeBench from Amdahl Software. The host code is defined through an abstraction layer, either an Eclipse wizard or an XML file. OpenCL CodeBench builds from the user's intentions and automatically generates host code at the click of a button. Abstracting the input data enables an engineer to create host code in a fraction of the time it would take to write the code by hand, and with knowledge of the OpenCL platforms, this host code can be easily retargeted. Today, OpenCL CodeBench creates fully commented C and C++ host code for the variants of the OpenCL standard. The code also includes error handling and kernel stub generation, avoiding any mismatch between the kernel interface and the host code API calls while setting up the input data and retrieving the results. Other included options are test bench generation and the choice of either a standalone main or inclusion of the generated code within an existing program. It is now practical for developers to rapidly test and trial OpenCL programs using different data types and structures and to create truly portable OpenCL code. Releases of OpenCL CodeBench later this year will add capabilities for creating and retargeting OpenCL code to specific devices and architectures.

OpenCL CodeBench provides many more capabilities, including a full OpenCL parser and editor with Eclipse integration. We invite you to visit to learn more or experience the tool through a no-obligation, 30-day trial.