
Matrix Multiplication using OpenCL Images

We consider matrix multiplication the “Hello World” example of multi-core computing. Unlike the traditional “Hello World”, however, matrix multiplication is actually useful: it is a basic building block for many algorithms.

In this article we will look at how matrix multiplication can be accelerated on the GPU using an OpenCL™ implementation. While matrix multiplication serves as the use case, the purpose is not to show yet another clever implementation of the algorithm. Instead, we want to give you a feel for how much easier working with OpenCL becomes when OpenCL CodeBench performs the following tasks:

  • quickly generate and change host code, using our XML Accelerator description, for different OpenCL™-based implementations of a matrix multiplication algorithm
  • create a self-contained unit test suite for a kernel
  • extract OpenCL event profiling information

Note that in the discussion below we only use the OpenCL CodeBench command line. The same tasks can also be carried out through the Eclipse wizard, so do not worry if XML files and command-line tools are not your cup of tea. If Eclipse is your preferred environment, a wizard will guide you and, as one would expect, make sure that whatever you enter is consistent and does not contradict other settings. For example, in the context of image types, the wizard knows which combinations of channel orders and channel types are allowed.

A first, naive approach to matrix multiplication might be to take the inner loop of the corresponding C program and pour it into a kernel, mapping it onto a work item that generates one result value of the target matrix:

#define LENGTH (400)

kernel void krnl_matmul_buf(global const float *a,
                            global const float *b,
                            global float *c)
{
  int i = get_global_id(0);  /* row index in c */
  int j = get_global_id(1);  /* column index in c */
  float tmp = 0.0f;
  for (int k = 0; k < LENGTH; ++k) {
    tmp += a[i*LENGTH + k] * b[k*LENGTH + j];
  }
  c[i*LENGTH + j] = tmp;
}

This works fine, and for decent matrix sizes – where the data-transfer cost is small compared to the compute cost – it may already yield a speedup on the order of 15x, depending on your platform.
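For reference, the plain C program this kernel is lifted from boils down to the classic triple loop. Here is a size-parameterized sketch (the function name is ours):

```c
/* Plain C reference for the kernel above: one result per (i, j),
 * assuming square row-major matrices of size n x n. */
void matmul_ref(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float tmp = 0.0f;
            for (int k = 0; k < n; ++k)
                tmp += a[i*n + k] * b[k*n + j];
            c[i*n + j] = tmp;
        }
    }
}
```

The kernel simply drops the two outer loops, letting the OpenCL runtime iterate over (i, j) via the global work size.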

In order to generate the host code corresponding to this small OpenCL kernel, we use the following XML configuration file for the OpenCL CodeBench command line:

<?xml version="1.0" ?>
<accel name="matmul_buf">
  <target select="any">
    <device type="gpu"/>
  </target>
  <interface>
    <in type="float" name="x"><dim>400</dim><dim>400</dim></in>
    <in type="float" name="y"><dim>400</dim><dim>400</dim></in>
    <out type="float" name="z"><dim>400</dim><dim>400</dim></out>
  </interface>
  <testsuite>
    <testcase name="basics" time="1" iters="100">
      <in name="x" file="x.csv" embed="0"/>
      <in name="y" file="y.csv" embed="0"/>
      <out name="z" file="z.csv" embed="0"/>
    </testcase>
  </testsuite>
</accel>

The OpenCL CodeBench command line can then be invoked as follows:

> amdl_gen_ocl -m -e matmul_buf.xml

The -m option instructs the tool to generate a standalone main that can be compiled and run. Since the XML specification contains a unit test suite, the main simply invokes the test-suite code. The test suite is generated using the data found in the specified .csv files: x.csv and y.csv for the inputs, and z.csv for the reference output against which the result is checked. The iters attribute causes the test to be executed 100 times, while the time attribute causes code to be generated that measures and prints the elapsed time.

The -e option causes the kernel to be embedded as a string in the generated source code. Since the code generator finds a kernel file named matmul_buf.cl in the current directory, it copies in the entire file; otherwise it would have generated an empty kernel with the correct parameters. The two generated files are:

  • matmul_buf.h: a header file containing a simple procedural interface to our matrix multiplication
  • matmul_buf.c: an implementation file containing all the low-level OpenCL code to offload the matrix multiplication to the GPU.

Writing the kernel and the corresponding XML specification file took less than five minutes of our time, while generating the host code was instantaneous.
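The test data itself is just three CSV files. Producing them is straightforward; a tiny generator might look like the sketch below (the helper name and the exact CSV layout – one matrix row per line, comma-separated values – are our assumptions, not something the tool prescribes):

```c
#include <stdio.h>

/* Hypothetical helper: dump a rows x cols row-major float matrix
 * as CSV, one matrix row per line, values separated by commas. */
int write_csv(const char *path, int rows, int cols, const float *m)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j)
            fprintf(f, j ? ",%g" : "%g", m[i*cols + j]);
        fputc('\n', f);
    }
    return fclose(f);
}
```

With this, x.csv and y.csv can be filled with test inputs and z.csv with the output of a trusted reference multiplication.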

We can now compile the generated files. For example, on Linux, the command line would be:

> gcc -I. -I/opt/AMDAPP/include -std=c99 matmul_buf.c -lOpenCL

Running the application on a mainstream laptop with an AMD A8-3500M APU with Radeon HD 6620G graphics shows that 100 invocations of the accelerator function take about 9 seconds.

There are various ways in which the above implementation can be optimized to better exploit the capabilities of a GPU. One way – although not necessarily the best for every device – is to take advantage of the OpenCL image types. Devices with image support allow opaque objects of type image2d_t and image3d_t to be passed to and from kernels. Using these types in a kernel can bring significant benefits because image accesses go through the texture caches. Additionally, one element of an image can hold up to 4 values, e.g. the Red, Green, Blue and Alpha (transparency) components of a pixel, so a two-dimensional array can be represented by an equivalent image as follows:

[Figure: an 8×8 float matrix represented as an equivalent 2D image of 8×2 float4 elements]

Once an 8×8 matrix, as in the picture, is represented as a 2D image of 8×2 float4 elements, the door is open for all sorts of optimizations that combine the benefits of texture caching, different tiling schemes and the OpenCL vector types.
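Note that with the CL_RGBA channel order, each texel packs 4 consecutive elements of a row, so for a row-major matrix the packed image has exactly the same byte layout as the original float array. A small sketch of our own (helper and type names are hypothetical) makes this concrete:

```c
/* A float4 texel with RGBA channel order holds 4 consecutive floats. */
typedef struct { float r, g, b, a; } float4_t;

/* Copy a row-major rows x cols float matrix into rows x (cols/4)
 * RGBA texels; cols must be a multiple of 4. */
void pack_rgba(const float *m, float4_t *img, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols / 4; ++j) {
            const float *p = &m[i*cols + j*4];
            img[i*(cols/4) + j] = (float4_t){ p[0], p[1], p[2], p[3] };
        }
}
```

Because the packing follows the rows, this is effectively a memcpy: uploading a row-major matrix to a CL_RGBA/CL_FLOAT image requires no data shuffling at all.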

The AMD APP SDK contains an example that shows how to implement matrix multiplication using the image type. The corresponding kernel file is called MatrixMulImage_Kernels.cl and can be found in the samples subdirectory of the SDK distribution. We will look at the first kernel in the file, which uses a 4×4 tiling to calculate 16 values (4 × float4) per work item.
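Stepping through the full tiled AMD kernel is beyond our scope here, but a deliberately simplified, untiled sketch of our own shows how images and float4 arithmetic combine. Each work item reads texels with read_imagef and produces a single RGBA texel, i.e. 4 adjacent values of one output row (the kernel name and geometry are ours, not the SDK's):

```c
/* Simplified, untiled image-based matmul sketch (not the AMD kernel).
 * widthA is given in float elements, as in the XML description. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

kernel void krnl_matmul_img_sketch(read_only  image2d_t a,
                                   read_only  image2d_t b,
                                   write_only image2d_t c,
                                   uint widthA)
{
  int tx  = get_global_id(0);   /* texel column in c */
  int row = get_global_id(1);   /* row in c */
  float4 acc = (float4)(0.0f);
  for (uint k4 = 0; k4 < widthA / 4; ++k4) {
    /* 4 consecutive elements of row 'row' of a */
    float4 a4 = read_imagef(a, smp, (int2)(k4, row));
    /* the 4 matching rows of b at texel column tx */
    acc += a4.x * read_imagef(b, smp, (int2)(tx, 4*k4 + 0));
    acc += a4.y * read_imagef(b, smp, (int2)(tx, 4*k4 + 1));
    acc += a4.z * read_imagef(b, smp, (int2)(tx, 4*k4 + 2));
    acc += a4.w * read_imagef(b, smp, (int2)(tx, 4*k4 + 3));
  }
  write_imagef(c, (int2)(tx, row), acc);
}
```

Since this sketch produces only one texel per work item, it would need a 100×400 global work size for our 400×400 matrices; the AMD kernel computes a 4×4 block of texels per work item, which is why a 100×100 work size suffices there.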

Although the AMD example comes with its own C++ host code based on the SDKSample base class, we will use only the provided kernel. While the AMD examples are a great starting point, and it would certainly be possible to integrate the host code directly into our own project, there are various reasons why you might prefer a different approach:

  • Your compiler may not support C++
  • You may not be interested in modifying the SDK example class to make it fit your needs.
  • You may not want to pull in any external libraries like the SDKUtil library

In our case, we obviously want to showcase OpenCL CodeBench, and the benefits of just pulling in the example kernel from the AMDAPP SDK are:

  • We can generate the same functional API for our kernel as with the previous kernel.
  • The same test suite as before is quickly re-generated from the same test data, although the target data structures are different.
  • Overall we save time because the only thing we need to modify is a few lines in the XML description.

Going back to our own matrix multiplication example, we can integrate this sample kernel from the AMD APP SDK very easily by modifying our accelerator XML description as follows:

<?xml version="1.0" ?>
<accel name="matmul_img_opt">
  <target select="any">
    <device type="gpu"/>
  </target>
  <worksize>
    <dim>100</dim>
    <dim>100</dim>
  </worksize>
  <interface>
    <in type="image2d_t" chan_type="CL_FLOAT" chan_order="CL_RGBA" name="x">
      <dim>100</dim>
      <dim>400</dim>
    </in>
    <in type="image2d_t" chan_type="CL_FLOAT" chan_order="CL_RGBA" name="y">
      <dim>100</dim>
      <dim>400</dim>
    </in>
    <out type="image2d_t" chan_type="CL_FLOAT" chan_order="CL_RGBA" name="z">
      <dim>100</dim>
      <dim>400</dim>
    </out>
    <in type="uint" name="widthA"/>
    <in type="uint" name="widthB"/>
  </interface>
  <testsuite>
    <testcase name="basics" time="1" iters="100">
      <in name="x" file="x.csv" embed="0"/>
      <in name="y" file="y.csv" embed="0"/>
      <out name="z" file="z.csv" embed="0"/>
      <in name="widthA"><val>400</val></in>
      <in name="widthB"><val>400</val></in>
    </testcase>
  </testsuite>
</accel>

Note that we essentially just changed the type definitions of the x, y and z parameters, indicating that we want image2d_t types with floating-point channel values and the RGBA channel order, which lets us store 4 values per image element. We also modified the global work size: due to the tiling, every kernel invocation now produces a 4-by-4 block of 16 values. Additionally, the example kernel requires two width parameters for the input matrices.

To regenerate the code, we invoke the following command:

> amdl_gen_ocl -m matmul_img_opt.xml

This time we didn’t pass the -e option, so the generated code reads the matmul_img_opt.cl kernel file at run time instead of having it embedded as a string. We mark the kernel we want to use from this file by renaming it to krnl_matmul_img_opt. Again, the following files are generated:

  • matmul_img_opt.h: a header file containing a simple procedural interface to our matrix multiplication
  • matmul_img_opt.c: an implementation file containing all the low-level OpenCL code to offload the matrix multiplication to the GPU, and the test suite.

Adapting our first XML accelerator specification to match the expectations of the AMD kernel took less than a minute, while re-generating the host code and the same test as before was instantaneous. The test passes, and we can observe that the run time for 100 iterations has now gone down to about 1 second – a performance improvement of 9x!

In the previous discussion we have seen that OpenCL can be used to take advantage of the GPU to obtain significant speedups of algorithms. Furthermore, by using vector data types and texture caches for the data, an even greater speedup can be achieved. During this exercise, we have also shown that OpenCL CodeBench can make you much more productive by:

  • quickly generating all the host code for the function you want to accelerate
  • easily adding a unit test using pre-generated test-data
  • providing support for OpenCL event profiling
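On the last point, OpenCL itself exposes per-event timestamps through the standard profiling API; querying a kernel's execution time looks roughly like the fragment below (a sketch of ours, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and an event evt obtained from clEnqueueNDRangeKernel):

```c
/* Read the device timestamps, in nanoseconds, bracketing the
 * kernel execution associated with event 'evt'. */
cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof start, &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof end, &end, NULL);
printf("kernel took %.3f ms\n", (end - start) * 1e-6); /* ns -> ms */
```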

OpenCL CodeBench can be used in command line mode (as shown in this article) or from within the Eclipse IDE. Within Eclipse, OpenCL CodeBench also provides a powerful editor that understands the OpenCL C99 syntax, and adds syntax coloring, syntax and semantic checking with quick-fixes and navigation in the same fashion as Eclipse users are used to when editing other languages.

If you would like to test-drive OpenCL CodeBench yourself, request a 30-day free trial, without any obligation.
