16 The Quasar runtime system
The Quasar environment contains both a compiler (see
Chapter 17↓) and a runtime system. The runtime system manages the computation devices and also makes sure that Quasar code is efficiently executed on the selected/available devices. Note that the compiler and runtime system are not entirely separated: the runtime system exploits information (metadata) generated at compile-time, e.g., to improve scheduling options.
Most data structures (e.g., primitive types, user-defined types, functions, arrays, matrices, …) support automatic transfer to the GPU. For the run-time system, this implies some “bookkeeping” (overhead) at runtime. For example, to keep track of the most “recent” version of a data structure, both a CPU pointer and a GPU pointer may be required, plus a set of dirty bits.
Two versions of the runtime system exist and both are in use:
-
Standard engine (v1 runtime): offers single GPU CUDA and OpenMP CPU support.
-
Hyperion engine (v2 runtime): offers OpenCL, multi-GPU CUDA/OpenCL and two-layer threaded OpenMP CPU support. For the Hyperion engine, most built-in core routines (such as the sum and prod functions and the implementation of the matrix operators) are implemented in Quasar itself. The Hyperion engine is therefore easily extensible to future accelerator devices. It supports multi-GPU configurations and allows combining OpenCL GPUs with CUDA GPUs (for example, on NVIDIA Optimus laptops, combining the Intel HD Graphics device with the NVIDIA GeForce GPU).
Compared to the Standard engine, the Hyperion engine has a more efficient object graph management system: objects can be clustered into subgraphs, which allows several objects to be transferred to the GPU simultaneously. It also offers better management and support of GPU memory models (pinned memory, unified memory, etc.).
Apart from implementation differences, there are no functional differences between the two runtime engines, i.e., a program developed on runtime v1 works on runtime v2 (and vice versa).
The runtime system can be selected through the device selector, for example in the Redshift IDE. Hyperion devices are listed explicitly as v2, e.g., CUDA v2, OpenCL v2, etc.
In general, the runtime system performs several tasks:
-
Program interpretation and execution
-
Providing an abstraction layer for computation device (e.g., CUDA, OpenCL, OpenMP)
-
Performing object management (reference counting, grouping memory transfers etc.)
-
Performing memory management (allocation, transfer from CPU -> device, device -> CPU, memory pinning)
-
Load balancing/runtime scheduling of parallel tasks
These components are discussed in more detail in the following subsections.
16.1 Program interpretation and execution
A Quasar program consists of a sequence of high-level “instructions”, each of which often involves an expression evaluation. These instructions are either compiled to .Net bytecode or interpreted by the Quasar interpreter.
The actual time-consuming calculations are performed by invoking kernel or device functions. In particular, a kernel function is a function that acts on one element of the data and that is applied (in parallel) to all other elements. Consider the point-wise multiplication of matrices, A .* B. A kernel function then computes the product of two elements at a given position ‘pos’: Y[pos] = A[pos] * B[pos]. This kernel function is internally defined in Quasar as follows:
pw_multiply = __kernel__ (A:mat,B:mat,Y:mat,pos:ivec2) -> Y[pos]=A[pos]*B[pos]
Now, when the interpreter asks the computation engine to calculate the product of two matrices, the kernel function ‘pw_multiply’ is scheduled. The same occurs for the assignment Y[pos]=...: this operation is also performed in a kernel function. Eventually, the runtime system obtains a “stream” of kernel function invocations:
kernel1 -> kernel2 -> kernel3
The runtime system is then responsible for launching these kernels on the data.
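As an illustration, the following minimal sketch launches the pw_multiply kernel defined above explicitly from user code (the matrix sizes are arbitrary); the runtime then receives the same kind of kernel invocation as it does for the built-in expression A .* B:
A = ones(512, 512)
B = ones(512, 512)
Y = zeros(512, 512)
% launch the kernel over all positions of Y; device selection and any
% required memory transfers are handled by the runtime system
parallel_do(size(Y), A, B, Y, pw_multiply)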
16.2 Abstraction layer for computation devices
The run-time system has several back-ends: for CUDA, OpenCL and OpenMP. For every kernel function, the Quasar compiler typically generates code for different targets. However, the high-level instructions of a Quasar program are target-independent (even when compiled to .Net bytecode), which allows a Quasar program to be easily retargeted to other devices at runtime.
A set of run-time parameters allows controlling the different run-time back-ends:
-
Enable concurrent kernel execution (CKE). This mode has a specific meaning depending on the back-end (a short illustration is given after this list):
-
For CUDA back-ends, the CKE mode uses CUDA streams and the CUDA asynchronous programming model. The streaming is performed completely automatically by the run-time system (by tracking dependencies). When necessary, inter-stream event synchronization is also performed.
-
For OpenCL back-ends, CKE relies on the OpenCL event system to specify which kernel functions need to be executed in parallel.
-
For CPU back-ends (Runtime v1), the CKE mode has no effect.
-
For CPU back-ends (Runtime v2 - Hyperion engine), the CKE mode causes kernel functions to be executed in different CPU threads, in a two-level concurrency model: kernel functions are executed in parallel on different threads, while each individual kernel function may use a fixed number of OpenMP threads. For example, on an 8-core processor with hyperthreading enabled, 2 kernel functions can be executed in parallel, each running its calculations on 4 OpenMP threads, enabling efficient use of the CPU. These configuration parameters can be set in the Hyperion device configuration file.
-
Enable CUDA pinned host memory: if enabled, pinned host memory is used automatically for critical memory transfers.
-
Enable CUDA 16-bit floating point textures: the use of 16-bit floating point textures reduces storage space. Note that this setting requires the use of the ’hwtex_* modifiers (see Section 9.4↑).
-
Enable OpenMP multi-threading with N CPU threads: enables OpenMP multi-threading on multicore processors, with the specified number of OpenMP threads
-
GPU scheduling mode: controls the runtime behavior in circumstances of GPU sharing and/or high memory usage. The following modes are available:
-
Maximize Stability: in this mode, certain operations may be slowed down when the GPU is running out of memory (e.g., by automatically copying memory back to the CPU in order to free device memory)
-
Maximize Performance: in this mode, this slowdown is avoided and a run-time error is generated instead
-
GPU memory model: controls how aggressively the runtime allocates memory. The following options are available:
-
Small footprint: the run-time memory manager is extremely conservative in memory allocations
-
Medium footprint: default mode for the memory manager
-
Large footprint: allocates device memory aggressively, leaving limited GPU memory to other processes. Use this mode when you intend to allocate many very large memory blocks.
-
Default image codec provider: selects between GDI+ and GTK+ for encoding and decoding images. GDI+ and GTK+ support different sets of image codecs: some image formats are better supported by GDI+, other formats are better supported by GTK+ (using libjpeg, libtiff, libpng, etc.).
-
Use OpenGL for visualization: when enabled, all rendering is accelerated using OpenGL. Note that OpenGL is currently not enabled in most terminal sessions (e.g. Windows remote desktop, Linux SSH -X)
-
OpenGL anti-aliasing mode: specifies the anti-aliasing mode to be used for the OpenGL context (mainly useful when rendering lines and points).
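To illustrate the concurrent kernel execution mode mentioned above, consider the following minimal sketch (assuming that matrices A1, B1, A2 and B2 have been defined earlier). The two element-wise products are independent, so with CKE enabled the runtime may execute their kernels concurrently; the dependency of Z on both results is tracked automatically:
Y1 = A1 .* B1   % first kernel launch
Y2 = A2 .* B2   % second kernel launch, independent of the first
% Z depends on both Y1 and Y2; the runtime inserts the required
% synchronization (e.g., inter-stream events on CUDA back-ends)
Z = Y1 + Y2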
16.3 Object management
Quasar has a custom object system to represent the internal data. There are different built-in object types:
-
Scalar types (scalar, int)
-
Matrix types (real-valued, complex-valued)
-
Function types (functions are first-class variables and can also store data through function closures).
-
Cell matrix types: a structured way for storing variables of all other types (similar to Matlab/Octave cell matrices)
-
Untyped objects: user-defined data types, useful for rapid prototyping
-
Typed objects: user-defined data types, same but more powerful
-
String types: limited support currently.
Objects can contain references to other objects. Circular references are also allowed (although their handling is not yet fully implemented). An important feature is that all object types are transparently accessible from all computation devices: in contrast to many other GPU libraries, Quasar makes no distinction between CPU data types and GPU data types. For efficiency reasons, some restrictions apply; for example, untyped objects cannot be accessed from the device, because runtime type checking would cause a significant amount of overhead. Instead, the runtime decides whether a given object should reside in CPU memory, GPU memory, or both. This means that data needs to be transferred from/to the device without the user even knowing it (of course, this information is still available through profiling, and the run-time system can be controlled programmatically, see
Section 16.7↓). The object and memory management systems are responsible for these transfers. The object system then has to track the references between the objects. Typically, when one object is accessed in device memory, all objects that are referenced also need to be copied to the same device memory. To make sure that this is performed efficiently, the objects are “clustered” so that the complete object graph can be transferred in one memory copy operation.
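As a minimal sketch of this transparency (assuming a GPU device is selected), the same matrix variable can be written by a kernel function and subsequently read on the host, without any explicit upload or download in the code:
img = ones(256, 256)
double_it = __kernel__ (x : mat, pos : ivec2) -> x[pos] = 2 * x[pos]
% the kernel may run on the GPU; img is transferred to device memory automatically
parallel_do(size(img), img, double_it)
% host access: the runtime transparently copies the data back when needed
print sum(img)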
16.4 Memory management
When executing kernels on a given data set, the operands of the kernel function may or may not reside in the correct memory for the device (e.g., GPU memory). This requires memory management.
The memory manager is responsible for allocating memory in device/CPU memory and for transferring data from the CPU to the device. When the device is running out of memory, it may also be necessary to transfer data back from the device to the CPU.
The runtime scheduler then has to ensure that 1) the memory manager is doing the correct task (so that all required data is in the correct memory before a given kernel is launched) and 2) that the kernels are properly executed according to the data dependencies.
Both tasks of the runtime scheduler actually interfere - for example, it is possible that the CPU requests to release memory that is still used in one of the asynchronously running kernels. The same applies to data that needs to be transferred from or to the GPU. The runtime scheduler then has to ensure that both the CPU and the different kernels have a consistent view on the data.
Therefore, Quasar has a distributed memory system: an object may reside in the memory of different computation devices at the same time. As long as read accesses are considered, this does not pose any problems on its own. The main difficulties are rather: 1) dealing with memory allocation requests when there is not sufficient device memory available, 2) avoiding memory fragmentation and 3) making sure that no memory blocks are released that are still in use by an asynchronous kernel.
Different memory allocation algorithms are automatically used by the runtime system. The CPU engine uses a garbage collector with 3 generations (short term, mid term and long term). Because of the (relatively) scarce GPU device memory, the CUDA computation engine uses a memory allocation algorithm that relies on reference counting instead of garbage collection. This allocation algorithm is also optimized for medium (>1024) to large (>100MB) data allocations.
In some cases, the device may run out of memory. If this happens, some (idle) memory blocks are selected to be transferred back to CPU memory. For this selection, a least-recently-used (LRU) strategy is adopted (taking into account which memory blocks are currently in use by kernels), and an optimization takes place to minimize the amount of memory that needs to be copied to the CPU.
However, in some cases, the memory fragmentation and size of the working set for a kernel are too large so that moving memory becomes impractical. In this case, an error report is currently displayed:
Amount of pinned memory: 398,131,200 bytes
Freelist size: 6 memory blocks
Largest free block: 67,665,920 bytes, insufficient to hold request size of 132,710,400 bytes
Process total: 736,493,568, Inuse: 580,608,000 bytes, Free: 155,885,568 bytes;
Device total: 1,010,368,512, device free: 167,772,160
Chunk 0 size 67,108,864 bytes: Fragmentation: 47.8%, pinned: 0 bytes (0.0%), free: 17,342,464 bytes (25.0%)
Chunk 1 size 134,217,728 bytes: Fragmentation: 0.0%, pinned: 132,710,400 bytes (98.0%), free: 1,507,328 bytes (1.0%)
Chunk 2 size 268,435,456 bytes: Fragmentation: 0.0%, pinned: 132,710,400 bytes (49.0%), free: 36,192,256 bytes (13.0%)
Chunk 3 size 266,731,520 bytes: Fragmentation: 32.9%, pinned: 132,710,400 bytes (49.0%), free: 100,843,520 bytes (37.0%)
The solution is then to modify the Quasar program such that smaller (or fewer) memory blocks are used (for example, by splitting images into smaller tiles). An alternative is to change the GPU memory model to “Large footprint”. This way, Quasar will allocate large memory chunks and the probability that a block does not fit in the chunks is decreased.
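The following minimal sketch illustrates such a tile-wise approach (the function process is a hypothetical placeholder for the actual per-tile computation, and the image dimensions are assumed to be multiples of the tile size):
tilesize = 512
num_ty = int(size(img, 0) / tilesize)
num_tx = int(size(img, 1) / tilesize)
for i = 0..num_ty-1
    for j = 0..num_tx-1
        % process one tile at a time, so that the working set of each kernel stays small
        tile = img[i*tilesize..(i+1)*tilesize-1, j*tilesize..(j+1)*tilesize-1]
        img[i*tilesize..(i+1)*tilesize-1, j*tilesize..(j+1)*tilesize-1] = process(tile)
    end
end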
16.5 Load balancing and runtime scheduling
The runtime scheduler defines the order of the kernel launches and whether kernels can be executed concurrently. For this purpose, the scheduler also tracks the data dependencies between subsequent kernels. At the core of the runtime scheduler is the command queue: a first-in first-out queue containing individual kernel launches and memory transfer operations. When kernel and device functions are scheduled, load balancing is performed and an appropriate computation device is selected based on several parameters.
The runtime scheduler aims at distributing the kernel functions over the available computation devices, taking full advantage of their possibilities.
16.6 Optimizing memory transfers with const and nocopy
To avoid unnecessary memory copies, kernel and device function parameters can be annotated with the modifiers ’const and ’nocopy. The meaning of these modifiers is as follows:
-
’const: Indicates that the parameter is constant and will not be changed inside the kernel function.
-
’nocopy: Indicates that the input values of the parameter (for example, of a vector or matrix type) are not used. This is useful when implementing functions that completely overwrite the content of a vector/matrix without reading its original values.
The ’const modifier prevents function parameters from being copied back to the host (or, equivalently, to another GPU) after execution of the kernel/device function. The ’nocopy modifier, on the other hand, prevents the memory from being copied from the host to the device (or, equivalently, between GPUs).
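A minimal sketch of a kernel function using both modifiers (the kernel itself is only illustrative; x and y are assumed to be matrices of equal size):
% x is only read ('const: no copy back to the host afterwards);
% y is completely overwritten ('nocopy: its previous contents are not copied to the device)
scale2 = __kernel__ (x : mat'const, y : mat'nocopy, pos : ivec2) -> y[pos] = 2 * x[pos]
parallel_do(size(y), x, y, scale2)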
The compiler automatically detects parameters that can be annotated with ’const and ’nocopy, so there is no direct need to use these modifiers in user code. However, the modifiers may still be of use for documenting the calling interface of a function: the caller then knows, for example, that the function does not change the value of one of the parameters, or that the input value of one of the parameters is not used directly.
Finally, the reader may have noted that a modifier indicating ’nocopy at both input and output could also be useful; such a parameter could serve as local “scratch memory” for the kernel function (e.g., as used in the implementation of various linear algebra routines). In Quasar, there is no special modifier for this purpose; it suffices to not declare any access modifier. As long as the scratchpad memory is not used by another function, the memory transfer will not be performed.
16.7 Controlling the runtime system programmatically
Sometimes it is useful to control the runtime system from Quasar code. This can be achieved using the following code attributes:
-
{!sched mode=cpu|gpu|fpga}: sets the scheduling mode manually, instructing the scheduler to run the following kernels on the specified device.
-
{!sched gpu_index=1}: sets the GPU index (within the range 0..num_gpus-1) for the next kernel function. This feature is useful for multi-GPU programming (see Chapter 11↑).
-
{!sched}: switches back to automatic scheduling
-
{!alloc mode=auto|cpu|gpu}: sets the memory allocation mode for the next allocation request.
-
{!alloc type=auto|pinnedhostmem|unifiedmem|texturemem}: sets the allocation type for the next allocation request. This attribute only has effect when the allocation mode is GPU. Pinned host memory is memory that avoids paging and that can be transferred to/from the GPU more efficiently. Unified memory is automatically page-mapped between CPU and GPU (requires CUDA 6.0 or higher). Texture memory (to be used with the ’hwtex_* access modifiers, see Section 9.3↑) often allows more efficient data access patterns.
-
{!transfer vars=a; target=cpu|gpu}: instructs the runtime system to transfer the specified variable to the CPU (or GPU). vars may refer to an expression (for example, a[index], in which case the specified element of the cell array is transferred).
-
{!transfer vars=a; target=cpu|gpu; gpu_index=1}: instructs the runtime system to transfer the specified variable to the GPU with the specified index.
Note that the above code attributes are provided for fine-tuning implementations. They should be used with care, because inappropriate usage may degrade the performance and cause unexpected errors in some cases. Several aspects, such as the memory location of a variable, can be visualized and analyzed in the variables window of the Redshift IDE. A short usage example is given below.
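A minimal sketch (assuming matrices A and B have been defined earlier):
{!sched mode=gpu}
C = A .* B                        % executed on the GPU due to the preceding attribute
{!sched}                          % switch back to automatic scheduling
{!transfer vars=C; target=cpu}    % bring the result back to CPU memory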