
GPGPU programming is entering the mainstream

GPGPU programming was once an arena entered only by programming experts. A programmer with intimate knowledge of a GPU device, graphics programming and the software application could decide to use the GPU to accelerate a sequential application. This was hacking at its finest. To enable GPGPU compute, the engineer would gather the data, make it look like graphics information to the GPU, and then write the acceleration program using graphics primitives. High-level language support through languages such as HLSL, GLSL and Cg was somewhat primitive, lacking many of the features that exist in today’s standards. Debugging and synchronization were manual tasks, which increased the potential for error. Since each algorithm was handcrafted for the GPU, code reuse and portability were not terms typically associated with this programming style. In most cases, the expertise required, the benefit provided and the time involved in getting the application to work properly made GPU compute acceleration cost-prohibitive.

The OpenCL standard and GPGPU support from semiconductor manufacturers have opened the door to GPGPU acceleration for the mainstream engineer. With OpenCL, the engineer no longer needs to understand the graphics pipeline or ISA in detail, just the high-level capabilities of the acceleration device, which the programmer can query through OpenCL. The engineer writes the control code in C or C++ and the kernel, or acceleration code, that will run on the GPU in OpenCL C, a high-level language variant of C. With OpenCL, an engineer can choose to precompile code or use a run-time compiler provided by the system vendor or semiconductor manufacturer. OpenCL code is functionally portable and, if conformant to the Khronos OpenCL standard, will run on any OpenCL-compatible device. As with most high-level languages, portability does not guarantee performance: OpenCL code written optimally for one device may not perform as well on devices with different memory structures or compute capabilities.

The promise of OpenCL sounds interesting, but performance gains have to be substantial for the standard to enter the mainstream. They typically are! It is not uncommon to see accelerations of 10-50X, along with system power savings, when using OpenCL and GPGPU acceleration. With limited training, engineers can quickly start implementing OpenCL and using GPGPUs, DSPs or other heterogeneous compute devices to accelerate their applications. While straightforward, OpenCL does require a significant amount of programming.

Accelerating one simple kernel with OpenCL takes approximately 200 lines of host code in C or C++, without any error handling. In more intricate scenarios, it can expand to 500 lines or more. This code is not complex, but there is a lot of it, with many parameters in every call, parameters that have to be populated accurately and must match the kernel’s contract. The host code initializes the accelerator and defines the memory structures and data types. It also dictates how the compute resources are used within the GPU and manages the system data flow and GPU error messages. Although host code is management and control code, it plays a significant role in GPGPU performance because it dictates the data structures, task queues and compute resources to be used.
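One small instance of host code dictating how compute resources are used is the choice of work sizes handed to clEnqueueNDRangeKernel. In OpenCL 1.x the global work size must be a multiple of the work-group (local) size, so host code commonly rounds the element count up. The following is a minimal sketch of that arithmetic in plain C. It does not call the OpenCL API, and the helper names are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is >= n.
 * The result is what the host would pass as the global work size. */
static size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = n % local_size;
    return remainder == 0 ? n : n + (local_size - remainder);
}

/* Hypothetical helper: number of work-groups in that launch. */
static size_t work_group_count(size_t n, size_t local_size)
{
    return round_up_global_size(n, local_size) / local_size;
}
```

For 1000 elements and a work-group size of 64 (an arbitrary choice here), this gives a global size of 1024, or 16 work-groups. The kernel then needs a bounds check so that the 24 padding work-items do nothing, which is one example of host parameters having to match the kernel’s contract.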

The OpenCL kernel code is the inner loop or key computational algorithm (hence the term kernel) to be accelerated on the GPU. This code is concise, optimized and closely paired with the resources defined in the host code. The OpenCL language is compact, so kernel functions are typically fewer than 100 lines of code and rarely over 200 lines. OpenCL kernels can work on individual data elements, process vector data using SIMD or SIMT, or even process 1D, 2D and 3D image arrays with only a few commands. OpenCL is well written and, thanks to the numerous companies that have contributed to its development, is a robust standard.
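To make the idea of a kernel working on individual data elements concrete, here is a minimal sketch of a vector addition. The OpenCL C form appears in the comment. The compiled code is a plain C emulation in which a sequential loop stands in for the runtime launching one work-item per element. The function names are hypothetical, and a real device would run the work-items in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* In OpenCL C the kernel would read roughly:
 *
 *   __kernel void vec_add(__global const float *a,
 *                         __global const float *b,
 *                         __global float *out)
 *   {
 *       size_t i = get_global_id(0);
 *       out[i] = a[i] + b[i];
 *   }
 *
 * Below, the same body is a plain C function, and the index that
 * get_global_id(0) would return is passed in explicitly as gid. */
static void vec_add_work_item(size_t gid, const float *a,
                              const float *b, float *out)
{
    out[gid] = a[gid] + b[gid];
}

/* Hypothetical stand-in for a 1D kernel launch: calls the kernel body
 * once per work-item, sequentially, for gid = 0 .. global_size - 1. */
static void launch_1d(size_t global_size, const float *a,
                      const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t gid = 0; gid < global_size; ++gid) {
        vec_add_work_item(gid, a, b, out);
    }
}
```

Note that the kernel body contains no loop over the data: the loop lives in the launch, which is why the kernel itself stays so short.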

OpenCL sounds useful, and for software developers a first thought might be to find an optimized library and call its functions when better performance is required. Failing that, there should be examples or templates that can be used as a starting point and modified appropriately. For the more generic algorithms, optimized platform-specific libraries are starting to emerge. Prominent open source examples include OpenCV, the Bullet Physics library and ViennaCL. Because performance is platform and system specific, optimized libraries can provide a real benefit.

AMD, NVIDIA, Intel and other GPGPU providers have also made examples available, and many designs start from these templates and adapt them to accelerate user-defined algorithms. Even with libraries, good design examples and user expertise, software development tools are a necessity: libraries will never have enough functionality, and design examples or templates will need to be modified. An article in the July 2012 issue of Design News, quoting the Ganssle Group, states:

"Development tools are a key to success, largely because time is so costly. A good embedded engineer is going to cost $150,000 to $200,000 a year, that's why you have to be willing to spend…, on tools. By comparison to an experienced engineer, the cost of tools is insignificant"–The Ganssle Group

Amdahl Software provides such development tools for OpenCL. Its product, OpenCL CodeBench, abstracts and simplifies OpenCL code creation. The goal is to streamline OpenCL code generation, eliminating most of the verbose coding often associated with it, so that all engineers can leverage the capabilities of GPGPU-enabled systems. OpenCL CodeBench enables true code portability through higher-level tools and automated abstraction in the design process.

For host code generation, OpenCL CodeBench guides the user through an Eclipse wizard. After just a few screens, the OpenCL project, the host code, the unit test bench and the kernel stub are auto-generated. Moving through the wizard takes minutes, and the resulting code is available immediately upon completion. The wizard includes validation, ensuring the engineer does not mix incompatible data types or create uncompilable or problematic OpenCL host code. For the command-line user, OpenCL CodeBench provides XML as an alternative data entry path. This enables integration into command-line driven or makefile-oriented environments, or even automated testing frameworks.

Kernel code, written in the OpenCL C language and used to program the GPU, is often application specific. As mentioned earlier, examples and libraries help and might cover some of the use cases, but the code is often unique to the application being accelerated. To assist with kernel code development, Amdahl Software has created a comprehensive OpenCL language editor with all the productivity features one might expect under Eclipse. Navigation, OpenCL semantic checking with quick fixes, code completion, code coloring, rename refactoring, hover, code annotation and comprehensive error messages are only some of the features. Code written or generated within the OpenCL CodeBench framework will be OpenCL compliant and functionally correct before it is launched in the runtime environment on the target platform.

Amdahl Software is working with semiconductor and system vendors to align the tools for platform-specific guidance and optimizations. Other advanced research within Amdahl Software includes exploration of auto-tuning capabilities, the definition and generation of more complex OpenCL-based processing pipeline models, and the auto-generation of host code through abstraction from the kernel code functionality.

If you haven’t tried using OpenCL or looked at the program structure, it is not difficult to get started. Most PCs with modern graphics cards can be used as a development target. Visit the Amdahl Software web site and review the case study provided under OpenCL CodeBench, then take a 30-day, no-obligation free trial of OpenCL CodeBench and modify our code example or create your own.
