
GPU Compute and OpenCL: an introduction.

Feature articles | By eeNews Europe



OpenCL is introduced in this context in this first part of the article; a sample application is presented in the second instalment, together with guidance on setting up the development environment and on building and running the application on a Freescale i.MX6-based platform.

Editor’s note: if you have come first to this part 1 of the article, a PDF of parts 1 & 2, complete, is available by clicking here.

Basics of today’s GPU hardware architecture

…and the forces that created it (e.g. the graphics pipeline).

We cannot discuss OpenCL/GPU Compute, even in the context of a highly integrated System on Chip (SoC), without at least a basic understanding of GPU architecture and the factors that shaped it. So let’s first take an overview of the subject.

While pervasive today, and a big selling point for many everyday objects, the complex and elaborate GUIs (graphical user interfaces) we use rely on GPUs (graphics processing units) that are the culmination of more than 30 years of evolution – driven first by a nascent graphics industry, then mostly by PC and console gaming, and today strongly influenced by the needs of low-power/high-performance SoCs.

Starting with the first monochrome video display controllers at incredibly low resolutions (“up to” 64 x 128 for the CDP1861 in the mid ’70s), and moving to EGA adapters with 16 colours at 640 x 350 in the mid ’80s (VGA/SVGA, with up to 800 x 600, would come soon after), we had to wait until the early ’90s for the first products that started to shape the GPU architectures in use today.

By 1996, the major players that would dominate the 3D graphics world for almost two decades were in place (most of them defunct by now), making furious progress driven primarily by the gaming world but with a direct impact on all the devices that surround us. The APIs that would enable the industry to move forward were also established (OpenGL and Direct3D), replacing the proprietary ones we see so often in nascent industries (one exception, Glide from 3dfx, endured, and was open-sourced too late – contributing to the death of this much-loved 3D pioneer). All this rapid development helped establish a graphics processing pipeline that, at a high enough level, remains largely the same today, and is depicted in Figure 1.

Figure 1 The basic elements of what has become the “standard” graphics processing pipeline



The first GPUs were used to accelerate the processing that happened after the Transformation and Lighting (T&L) stage (which was handled by the CPU) – stages better suited to a static, non-programmable implementation (yet configurable through inputs representing vertex sets, light positions and characteristics, and transformation matrices). As the evolution of CMOS process nodes allowed cramming more and more transistors onto the GPU die, a few noteworthy steps were taken:

– T&L processing accelerated by the GPU offloaded the CPU and provided significant performance gains. This was first achieved at the end of the ’90s, and was very quickly adopted by all 3D GPU providers.

– Programmability of the vertex processing (vertex shading) and pixel processing (pixel shading). This opened up new possibilities, and new control, for graphics designers in terms of scene design and effects. The move from a fixed-function implementation towards a programmable pipeline was made through programmable engines called “shaders”:

– The programming model is defined by a “Shader model” that specifies instructions, registers and operations.

– Shader models and shader definitions were initially different for vertex processing and pixel processing, but were unified starting with Shader Model 4.0 (in Direct3D)/the Unified Shader Model (OpenGL). The first generation of GPUs implementing a unified shader model reached the market in the mid-2000s.

– Data formats have evolved from integer-only to floating point, with half, full and even double precision.

While the shader engines have evolved from a very limited instruction set, they are still to a significant extent purpose-built, and the balance between flexibility and performance/efficiency is heavily tilted towards the latter. In terms of data format support, one relevant aspect is IEEE 754 compliance – this is sometimes traded off for additional performance.

The shader engines in many implementations can be likened to Digital Signal Processors (DSPs), highly optimised for the mathematical operations the GPU pipeline needs to support. Similar techniques and architectural approaches from the DSP domain are in use today (or have been in the past) in premier GPU cores: VLIW, SIMD, vector processing.



With a unified shader model, the GPU processing pipeline for today’s designs can be represented as seen in Figure 2. Note that this time we show the memory, and the fact that the shaders have read/write access to it (typically dedicated video memory for video cards, or system memory for GPUs integrated in SoCs).

Figure 2 The main elements of a typical processing pipeline for a current graphics design

The advent of unified shader cores resulted in a new measure of the relative performance of a GPU: the number, and frequency, of these cores. These two parameters combined are typically used to give a measure of the maximum compute capacity of the GPU, expressed in floating-point operations per second (FLOPS).
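As a rule of thumb (our simplification, not a vendor formula), the peak figure can be estimated as:

peak FLOPS ≈ shader cores × SIMD lanes per core × FLOPs per lane per cycle × clock frequency

with the FLOPs-per-cycle term typically being 2, a multiply-add counting as two operations.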

As an example of the graphics support in a mainstream SoC, Figure 3 shows the three GPUs present on Freescale’s i.MX6Q/D. In its most powerful instantiation on the i.MX6 product family, the 3D GPU has 4 shader cores, with a compute performance rated at 24 GFLOPS. Note the presence of two additional GPU cores performing specialised functions, for increased efficiency in typical embedded applications: accelerating layer composition, and vector graphics.

Figure 3 Freescale’s i.MX6Q/D hosts three distinct GPUs



Shaders, and ideal GPU tasks

A bit more now on the shader units of the GPU, and the types of workload best suited to the GPU pipeline. Let’s start by listing the defining characteristics of the typical processing tasks run by the GPU shaders presented above:

– The same set of operations (translated into instruction sequences) has to be applied to a significant number of data inputs:

• Objects with tens of thousands of polygons undergo the same T&L operations; the same goes for pixels/texels.

– The input data samples are processed independently.

– The processing flow is mostly linear (little to no branching).

– Input data for the processing flow exhibits very good spatial locality.

• The code accesses the memory in a linear fashion, typically traversing 1D/2D/3D arrays.

– The processing task is not latency-sensitive:

• A good-enough frame rate is 25-30 fps, or 33-40 msec/frame. As long as all frame-related processing is done within this interval, the order and latency of any particular step in the processing chain are not relevant.

Consequently, when choosing how to invest the available die size/transistor budget, GPU architects:

– Will maximise compute capacity versus caching capacity

– Will maximise opportunities for thread and data level parallelism

Figure 4 Make-up of the Unified Shaders Unit

Thus, typically, the unified shaders unit of a GPU comprises a number of identical cores, local cache, a thread management unit (more on this a bit later), and a texture access unit (for conversion and sampling). In many implementations, each shader core has Single Instruction Multiple Data (SIMD) capabilities. The number of cores varies with the target market and performance, going from one to a few shader cores for low-power/highly efficient SoCs, to hundreds, sometimes thousands, of cores for high-end PC/professional graphics cards. For high-end GPUs, a base unified shaders unit is replicated multiple times – we show only one, with four shader cores, but a high-end GPU would have several unified shader blocks with tens/hundreds of shader cores each.



A note on the cache: local cache is still needed and present, but the ratio of cache size to compute capacity is significantly below what a traditional CPU would show. Look up CPU and GPU die photos and try to compare the two ratios visually – the difference is very obvious.

With the architecture presented above, maximum performance is achieved when each shader core executes the same instruction on different data points in a related set (pixels in a certain screen area that need the same processing applied, vertices of a certain object that need the same translation applied). This minimises instruction fetch and decode operations per processed input data point, and makes caching of the data operated on most efficient. It is the job of the thread management block to efficiently schedule the thread groups that are ready to run.

Thread grouping and scheduling is very important from the perspective of the programming paradigm used for exploiting GPUs for general compute operations, and is thoroughly reflected in the OpenCL environment. Also, each GPU typically has an optimal thread group size (i.e. the number of shader cores that will execute the same instructions in parallel), and this should be taken into account during application design for maximum performance.
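OpenCL lets an application query this optimal grouping at runtime rather than hard-coding it. The sketch below, which assumes an OpenCL 1.1 or later implementation and that the kernel and device handles already exist, asks the driver for the preferred work-group size multiple:

```c
#include <CL/cl.h>

/* Ask the driver for the work-group size multiple it prefers for this
   kernel on this device; sizing work groups as a multiple of this value
   keeps all lanes of the shader cores busy. */
static size_t preferred_wg_multiple(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    return multiple;
}
```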

One important note on branching: the hardware will execute every branch path taken by any shader core in a group on all shader cores of that group. Branching thus has the potential to significantly reduce the efficiency of the entire thread group’s execution, and should be avoided where possible.
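As an illustration (a sketch of the idea, not code from our sample application), the two hypothetical kernels below compute the same thresholding result. The first contains a data-dependent branch that can diverge within a thread group; the second uses the OpenCL C built-in select() so every work-item follows the same instruction stream:

```c
/* Divergent version: work-items whose data disagrees on the condition
   force the hardware to run both paths for the whole group. */
__kernel void threshold_branchy(__global const float *in,
                                __global float *out, const float t)
{
    size_t i = get_global_id(0);
    if (in[i] > t)
        out[i] = 1.0f;
    else
        out[i] = 0.0f;
}

/* Branch-free version: select() evaluates both values but keeps all
   work-items on a single instruction stream. */
__kernel void threshold_select(__global const float *in,
                               __global float *out, const float t)
{
    size_t i = get_global_id(0);
    out[i] = select(0.0f, 1.0f, in[i] > t);
}
```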

With all the background presented above we can conclude, not surprisingly, that the types of workloads able to exploit the GPU need to exhibit a similar set of characteristics:

– are easily parallelisable

– operate on large data sets with little or no processing interdependence, and,

– are latency insensitive.

Plenty of compute domains satisfy these requirements: image processing, signal processing, physical modelling, crypto (key and certificate) processing, and much more.




The need for a HW-agnostic programming language and framework – enter OpenCL.

When trying to exploit the compute capacity of the GPU, the pioneers of the field started by using OpenGL. While restricted to OpenGL’s required representation of the input data, and to OpenGL primitives, various compute operations can be performed. Image processing is one domain where such techniques are frequently used – treating images as textures enables one to easily perform a variety of filtering operations. However, to promote GPU Compute to the status of a mainstream technology, something much better was needed: a standardised framework, supported by many (if not all) GPU providers, and ideally familiar and relatively easy to use for experienced programmers. This is OpenCL, and while it clearly hits the mark as a well-adopted standard with widespread support, we will let the reader decide how familiar it seems and how easy it is to use.

OpenCL is an open specification API from the Khronos Group (the same standards body that controls OpenGL, OpenVG and OpenMAX) and enables asynchronous multi-core programming for cross-platform, heterogeneous computing environments. It is supported today on GPUs, CPUs (typically exploiting SIMD capabilities), DSPs and FPGAs. The first revision of the specification (1.0) was published in late 2008, while the latest at the time of writing this article is revision 2.0, published in November 2013. Note the existence of a Full Profile (FP) and an Embedded Profile (EP): some features that are mandatory in FP are optional in EP, and certain limits and constraints are relaxed in EP.

What OpenCL tries to achieve is very challenging: offer a portable framework that allows one to efficiently exploit various types of underlying HW architectures – data-, instruction- and/or thread-parallel. For example, a GPU is typically instruction- and data-parallel, while a multi-core CPU with SIMD capabilities (basically any CPU these days, be it for a PC or a recent handheld device) is data- and thread-parallel.

In this section of the article we will try to give a high-level overview of the OpenCL framework; in the subsequent sections, through the OpenCL sample application, the reader will get a (hopefully) clear picture of how OpenCL fulfils its purpose.



A first distinction that OpenCL makes is that between the “Host” and the “Compute device”: the Compute device is the number-crunching machine abstraction (and there can be more than one in a system), while the Host is the part of the system best suited to running complex/control code. The Host manages the application execution, feeding the Compute devices in the system and ensuring proper synchronisation of the tasks they execute.

A view of the Host and Compute Device representation for a typical system is presented in Figure 5. Note that while we only represent one, there can be a multitude of Compute Devices in the system, and these can be of different types (e.g. multiple GPUs and/or multiple DSPs and/or multiple FPGAs managed by the same CPU).

The Host is typically programmed in standard C/C++, but the Host API allows bindings to virtually any other programming language, with Java, Python, Ruby and OCaml also in use. In addition, the Host exposes the platform layer API (providing a HW abstraction layer over the compute devices in the system) and the runtime API, which allows managing the compute devices (discovery, enumeration and configuration, submitting routines for execution, synchronisation, resource allocation). A minimal discovery sequence is sketched after Figure 5.

Figure 5 View of the Host and Compute Device representation for a typical system
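The sketch below shows the platform and runtime APIs at work – error handling is trimmed, and the OpenCL 1.x clCreateCommandQueue call is used, matching the era of the platforms discussed here:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;

    /* Platform layer API: discover a platform and a GPU compute device. */
    err  = clGetPlatformIDs(1, &platform, NULL);
    err |= clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Runtime API: create a context and a command queue for the device. */
    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    char name[128];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Using device: %s\n", name);

    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```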

The Compute device is programmed using OpenCL C – essentially a subset of ISO C99 with language extensions. The restrictions (fixed-length arrays, no recursion, no pointers to functions, no bit fields, among others) as well as the extensions (new keywords and pragmas, array notations, among others) are introduced to ease task and vector parallelism at runtime and in the compilation process respectively. Functions intended to run on the Compute device are called “Kernels” – a minimal example follows.
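Here is such a minimal kernel (an illustrative sketch, not the sample application of part 2): one instance runs per element of the input, and __kernel, __global and get_global_id() are among the OpenCL C extensions mentioned above.

```c
/* Scale every element of an array in place; one kernel instance
   (work-item) processes one element. */
__kernel void scale(__global float *data, const float factor)
{
    size_t i = get_global_id(0);  /* this work-item's position */
    data[i] *= factor;
}
```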




For the next steps to make sense, we need to dissect the Compute device further, which gives us the following:

Processing element: the smallest execution unit. Typically this would operate on a vector operand; in our example above, this is one shader core. Later in the article we will see that other architectures exist, where the processing element is mapped differently.

Compute unit: the smallest block usable at the thread level. This would typically be the group of processing elements that sit behind a single thread management unit, implying that the group executes the same flow of instructions in lockstep. In our example above, the Compute unit is represented by one block of unified shaders.

Figure 6 Structure of work to be processed

It is important to understand the processing flow for OpenCL. Let’s use as an example a matrix (OpenCL can process 1D/2D/3D arrays) that is to be processed by a kernel.

The input structure to be processed, in its entirety, is called the “Index space”, with the “NDRange” giving its size in each dimension of the input data block – in our case, Gx = 4 and Gy = 4 (note, however, that the x, y and z sizes do not need to be equal). The Index space will be processed in “Work group” segments: a Work group is the group of work items that will be processed on a Compute unit (multiple Compute units can process different Work groups in parallel). Simplifying, we can assume that the number of work items in a work group processed on the Compute unit is equal to the number of Processing elements in the said Compute unit – thus in our example we show a Work group that contains 4 elements (reality is quite a bit more complex, as readers who come to be intimate with OpenCL will soon realise).

Each element of the input array (part of a particular Work group), called a “Work item”, will be processed by a kernel instantiation running on a Processing element, and is identified by a unique “Global ID”. The Global ID is given by the position of the Work item in the Index space – as each Work group is unique and has a well-defined position in the Index space, the Global ID of a particular Work item can be determined from its position within its Work group. We should also introduce the concept of a “Local ID”, which is the index of the work item relative to its work group. This concept is the basis for splitting the computation over groups and making better use of local memory (see definition below).
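To make the ID machinery concrete, here is a sketch matching the 4 x 4 example above; the queue and kernel handles are assumed to exist, and the 2 x 2 work-group split is purely illustrative:

```c
/* Kernel side: each work-item locates itself through its IDs. */
__kernel void ids_demo(__global int *out, const int width)
{
    size_t gx = get_global_id(0);   /* column in the Index space */
    size_t gy = get_global_id(1);   /* row in the Index space    */
    /* The Global ID decomposes into group and local IDs:
       gx == get_group_id(0) * get_local_size(0) + get_local_id(0) */
    out[gy * (size_t)width + gx] = (int)get_local_id(0);  /* record Local ID */
}

/* Host side (a separate compilation unit): launch over a 4 x 4 Index
   space (NDRange) split into 2 x 2 Work groups. */
size_t global_size[2] = {4, 4};   /* Gx, Gy */
size_t local_size[2]  = {2, 2};   /* Work group size (illustrative) */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size,
                       0, NULL, NULL);
```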



An important component of the processing flow is the utilisation of memory. OpenCL defines the following memory types (each maps onto an OpenCL C address-space qualifier, as sketched after the list):

Global Memory: accessible to both the Host and all Compute devices in the system. Physically, it can be the system memory, or on-die memory (as long as it can be accessed by the compute units).

Constant Memory: same characteristics as the global memory, but it is read-only.

Local Memory: memory which resides close to the processing elements. Local memory is specific to a work-group, and is accessible only by work-items belonging to that work group.

Private Memory: accessible to a single kernel instance/work-item, not visible to other work-items.
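In OpenCL C these memory types appear as address-space qualifiers on kernel arguments and variables. The hypothetical kernel below (a sketch, not part of the sample application) touches all four:

```c
__kernel void memory_demo(__global float *data,       /* Global memory   */
                          __constant float *coeffs,   /* Constant memory */
                          __local float *scratch)     /* Local memory    */
{
    float acc;                       /* Private memory (per work-item) */
    size_t lid = get_local_id(0);

    scratch[lid] = data[get_global_id(0)];  /* stage data into local memory */
    barrier(CLK_LOCAL_MEM_FENCE);           /* synchronise the work group   */

    acc = scratch[lid] * coeffs[0];
    data[get_global_id(0)] = acc;
}
```

Note that the size of the __local scratch buffer is supplied by the Host when setting the kernel argument (clSetKernelArg with a size and a NULL value).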

Passing data between the Host and the Compute devices is done with the help of the following (a minimal host-side sketch follows the list):

– APIs that allow the Host to create memory objects in global memory, accessible to both the Host and the Compute devices.

– Mapping memory regions such that these areas are read/write accessible for both the Host and Compute Devices, allowing data exchanges as necessary. Note that performance of “Mapping” versus “Copy” may vary from case to case, due to cache maintenance and Host-Device interface speed.

– APIs allowing control of memory operation ordering and synchronisation using events.
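A minimal host-side sketch of the first two mechanisms – the context, queue and error variable are assumed to exist, and the buffer size is illustrative:

```c
/* Create a memory object in Global memory, visible to Host and device. */
size_t bytes = 16 * sizeof(float);
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

/* Map the buffer so the Host can write the input data directly. */
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, bytes, 0, NULL, NULL, &err);
for (size_t i = 0; i < 16; i++)
    p[i] = (float)i;                      /* host fills the input data */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

/* ...enqueue the kernel, then map again with CL_MAP_READ for results. */
```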

We now have all the elements of the process flow: the kernel instance, the input data (Index space) structured down to the work item that gets worked on by a Processing element, the ability to exchange data between the Host and the Compute devices, and the ability to synchronise between these two.

Figure 7

In closing this section, let’s look at the details of the GC2000 3D GPU of the i.MX6 SoC from the OpenCL perspective, as this platform will be used in the following sections for our “Hello world” OpenCL application (coming in the second part of this article).

Compared with the generic architecture we used as an example before, each shader in the GC2000 is a SIMD processor, four single-precision floating-point lanes wide. Thus each shader has 4 processing elements, and each shader has an independent thread management unit, allowing the execution of independent kernels on each – each shader therefore represents a Compute unit. The Compute device (that is, the entire GPU core) allows the processing of 16 data elements in parallel, so the work group size that provides maximum efficiency is 16 elements. Any work group size below 4 will imply some level of inefficiency in the processing, as the group will still be processed on a shader, and thus at least one processing element will go unused.
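For the curious: the rated figure is consistent with the rule-of-thumb formula given earlier – 16 processing elements × 2 FLOPs per cycle (counting a multiply-add as two operations, our assumption) × a clock of roughly 750 MHz ≈ 24 GFLOPS. The clock value here is back-calculated to match the rated figure, not taken from a datasheet.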
