
16 The Quasar runtime system

The Quasar environment contains both a compiler (see Chapter 17↓) and a runtime system. The runtime system manages the computation devices and also makes sure that Quasar code is efficiently executed on the selected/available devices. Note that the compiler and runtime system are not entirely separated: the runtime system exploits information (metadata) generated at compile-time, e.g., to improve scheduling options.
Most data structures (e.g., primitive types, user-defined types, functions, arrays, matrices, etc.) support automatic transfer to the GPU. For the runtime system, this implies some “bookkeeping” (overhead) at runtime. For example, to keep track of the most “recent” version of a data structure, both a CPU pointer and a GPU pointer may be required, plus a set of dirty bits.
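This bookkeeping can be illustrated with a minimal sketch (in Python, with hypothetical names; the actual runtime is implemented very differently). A dirty bit per side records which copy holds the newest version, so transfers only happen when a stale copy is actually accessed:

```python
class TrackedObject:
    """Sketch of CPU/GPU version tracking with dirty bits (hypothetical)."""

    def __init__(self, cpu_data):
        self.cpu_data = cpu_data      # host-side copy
        self.gpu_data = None          # device-side copy (None = not resident)
        self.cpu_dirty = True         # CPU currently holds the newest version
        self.gpu_dirty = False

    def read_on_gpu(self):
        # Transfer only when the GPU copy is stale or absent.
        if self.gpu_data is None or self.cpu_dirty:
            self.gpu_data = list(self.cpu_data)  # simulated host-to-device copy
            self.cpu_dirty = False
        return self.gpu_data

    def write_on_gpu(self, index, value):
        self.read_on_gpu()[index] = value
        self.gpu_dirty = True         # the CPU copy is now stale

    def read_on_cpu(self):
        if self.gpu_dirty:
            self.cpu_data = list(self.gpu_data)  # simulated device-to-host copy
            self.gpu_dirty = False
        return self.cpu_data


obj = TrackedObject([1, 2, 3])
obj.write_on_gpu(0, 42)           # triggers one host-to-device transfer
print(obj.read_on_cpu())          # triggers one device-to-host transfer: [42, 2, 3]
```

Note how a second `read_on_cpu` would perform no transfer at all: both dirty bits are clear, so the cached copy is reused.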
Two versions of the runtime system exist and are both in use: the Standard engine (runtime v1) and the Hyperion engine (runtime v2).
Compared to the Standard engine, the Hyperion engine has a more efficient object graph management system that clusters objects into subgraphs, allowing simultaneous transfers of objects to the GPU, and it offers better management and support of GPU memory models (pinned memory, unified memory, etc.).
Apart from implementation differences, there are no functional differences between the two runtime engines, i.e., a program developed on runtime v1 works on runtime v2 (and vice versa).
The runtime system can be selected through the device selector, for example in the Redshift IDE. Hyperion devices are listed explicitly as v2, e.g. CUDA v2, OpenCL v2, etc.
In general, the runtime system performs several tasks: program interpretation and execution, providing an abstraction layer for the computation devices, object management, memory management, and load balancing/runtime scheduling.
These components will now be discussed in more detail in the following subsections.

16.1 Program interpretation and execution

A Quasar program consists of a sequence of high-level “instructions”, where an instruction often involves an expression evaluation. These instructions are either compiled to .Net byte-code or interpreted by the Quasar interpreter.
The actual time-consuming calculations are performed by invoking kernel or device functions. In particular, a kernel function is a function that acts on one element of the data and that is applied (in parallel) to all other elements. Consider the point-wise multiplication of two matrices, A .* B. A kernel function then computes the product of two elements at a given position ‘pos’: Y[pos] = A[pos] * B[pos]. This kernel function is internally defined in Quasar as follows:
pw_multiply = __kernel__ (A:mat,B:mat,Y:mat,pos:ivec2) -> Y[pos]=A[pos]*B[pos]
Now, when the interpreter asks the computation engine to calculate the product of two matrices, the kernel function ‘pw_multiply’ is scheduled. The same occurs for the assignment Y[pos]=...: this operation is also performed in a kernel function. Eventually, the runtime system obtains a “stream” of kernel function invocations:
kernel1 -> kernel2 -> kernel3
The runtime system is then responsible for launching these kernels on the data.
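The interpreter/runtime split above can be sketched as follows (a hypothetical Python model, not the actual implementation): the interpreter side appends kernel launches to a FIFO stream, and the runtime side drains the stream and applies each kernel to every position of the output. Here the positions are processed sequentially; on a GPU they would run in parallel:

```python
from collections import deque

def pw_multiply(A, B, Y, pos):
    # Per-element kernel, analogous to the Quasar pw_multiply above.
    Y[pos] = A[pos] * B[pos]

stream = deque()

def schedule(kernel, args, size):
    # Interpreter side: enqueue a kernel launch instead of running it.
    stream.append((kernel, args, size))

def run_stream():
    # Runtime side: drain the stream of kernel invocations in order.
    while stream:
        kernel, args, size = stream.popleft()
        for pos in range(size):     # a GPU would process positions in parallel
            kernel(*args, pos)

A, B = [1, 2, 3], [4, 5, 6]
Y = [0, 0, 0]
schedule(pw_multiply, (A, B, Y), 3)
run_stream()
print(Y)   # [4, 10, 18]
```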

16.2 Abstraction layer for computation devices

The run-time system has several back-ends: for CUDA, OpenCL and OpenMP. For every kernel function, the Quasar compiler typically generates code for different targets. However, the high-level instructions of a Quasar program are target-independent (even when compiled to .Net bytecode), which allows a Quasar program to be easily retargeted at runtime to other devices.
There is a set of run-time parameters that allows controlling the different run-time back-ends:

16.3 Object management

Quasar has a custom object system to represent the internal data. There are different built-in object types:
Objects can contain references to other objects. Circular references are also allowed (although their handling is not fully implemented at this time). An important feature is that all object types are transparently accessible from all computation devices: in contrast to many other GPU libraries, Quasar makes no distinction between CPU data types and GPU data types. For efficiency, some restrictions apply; for example, untyped objects cannot be accessed from the device, because runtime type checking would cause a significant amount of overhead.
The runtime decides whether a given object should reside in CPU memory, in GPU memory, or in both. This means that data may be transferred from/to the device without the user even knowing it (of course, this information is still available through profiling, and the run-time system can be controlled programmatically, see Section 16.7↓). The object and memory management systems are responsible for these transfers.
The object system also has to track the references between objects. Typically, when one object is accessed in device memory, all objects that it references also need to be copied to the same device memory. To make sure this is performed efficiently, objects are “clustered” so that the complete object graph can be transferred in one memory copy operation.
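The clustering idea can be sketched as a reachability computation over the object graph (a hypothetical Python model; names like `cluster` and `transfer_to_device` are invented for illustration). Everything reachable from the root is gathered first, then moved in one operation instead of one copy per object:

```python
class Obj:
    """A toy object that may reference other objects."""
    def __init__(self, name, refs=()):
        self.name = name
        self.refs = list(refs)

def cluster(root):
    """Collect all objects reachable from root (cycle-safe traversal)."""
    seen, stack = [], [root]
    while stack:
        obj = stack.pop()
        if obj not in seen:
            seen.append(obj)
            stack.extend(obj.refs)
    return seen

def transfer_to_device(root, device_memory):
    # One simulated 'memory copy' moves the entire cluster at once.
    for obj in cluster(root):
        device_memory.add(obj.name)

leaf = Obj("leaf")
node = Obj("node", [leaf])
root = Obj("root", [node, leaf])
leaf.refs.append(root)            # a circular reference: traversal still terminates

gpu = set()
transfer_to_device(root, gpu)
print(sorted(gpu))   # ['leaf', 'node', 'root']
```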

16.4 Memory management

When executing kernels on a given data set, the operands of the kernel function may or may not reside in the correct memory for the device (e.g., GPU memory). This requires memory management.
The memory manager is responsible for allocating memory in device/CPU memory and for transferring data from the CPU to the device. When the device is running out of memory, it may also be necessary to transfer data back from the device to the CPU.
The runtime scheduler then has to ensure 1) that the memory manager performs the right operations (so that all required data is in the correct memory before a given kernel is launched) and 2) that the kernels are executed in an order that respects the data dependencies.
Both tasks of the runtime scheduler can interfere: for example, the CPU may request the release of memory that is still in use by one of the asynchronously running kernels. The same applies to data that needs to be transferred from or to the GPU. The runtime scheduler then has to ensure that the CPU and the different kernels have a consistent view of the data.
Therefore, Quasar has a distributed memory system: an object may reside in the memory of different computation devices at the same time. As long as only read accesses are involved, this poses no problems by itself. The main difficulties are rather: 1) dealing with memory allocation requests when there is not sufficient device memory available, 2) avoiding memory fragmentation and 3) making sure that no memory blocks are released that are still in use by an asynchronous kernel.
Different memory allocation algorithms are automatically used by the runtime system. The CPU engine uses a garbage collector with 3 generations (short term, mid term and long term). Because of the (relatively) scarce GPU device memory, the CUDA computation engine uses a memory allocation algorithm that relies on reference counting instead of garbage collection. This allocation algorithm is also optimized for medium (>1024) to large (>100MB) data allocations.
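The advantage of reference counting for scarce device memory is that a block is reclaimed the instant its last owner releases it, rather than at some later garbage collection cycle. A minimal sketch (hypothetical Python model; the class and method names are invented):

```python
class DeviceAllocator:
    """Sketch of reference-counted device allocation (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.refcounts = {}   # block id -> number of owners
        self.sizes = {}       # block id -> allocation size

    def alloc(self, block_id, size):
        if self.used + size > self.capacity:
            raise MemoryError("out of device memory")
        self.used += size
        self.refcounts[block_id] = 1
        self.sizes[block_id] = size

    def add_ref(self, block_id):
        self.refcounts[block_id] += 1

    def release(self, block_id):
        self.refcounts[block_id] -= 1
        if self.refcounts[block_id] == 0:
            # Freed immediately -- no waiting for a GC cycle.
            self.used -= self.sizes[block_id]
            del self.refcounts[block_id], self.sizes[block_id]


gpu = DeviceAllocator(capacity=100)
gpu.alloc("A", 60)
gpu.add_ref("A")      # a second owner, e.g. a pending asynchronous kernel
gpu.release("A")      # first owner done: the block stays alive
gpu.release("A")      # last owner done: memory reclaimed at once
print(gpu.used)       # 0
```

The extra reference held by a pending kernel is also how the runtime avoids releasing blocks that an asynchronous kernel still uses (difficulty 3 above).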
In some cases, the device may run out of memory. If this happens, some (idle) memory blocks are selected to be transferred back to CPU memory. To this end, a least-recently-used (LRU) strategy is adopted (taking into account which memory blocks are currently in use by kernels), and an optimization takes place to minimize the amount of memory that needs to be copied to the CPU.
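The victim selection can be sketched as follows (a hypothetical Python model of the policy described above, not the actual implementation): blocks are considered oldest-first, blocks pinned by running kernels are skipped, and the scan stops as soon as enough memory has been freed, so no more data than necessary is copied back:

```python
def select_victims(blocks, needed):
    """Pick idle blocks to evict, oldest first.

    blocks: list of (name, size_bytes, last_used_tick, in_use_by_kernel)
    needed: number of bytes that must be freed on the device
    """
    freed, victims = 0, []
    for name, size, _, in_use in sorted(blocks, key=lambda b: b[2]):
        if in_use:
            continue           # cannot evict: a kernel still reads/writes it
        victims.append(name)
        freed += size
        if freed >= needed:
            break              # stop early to minimize copy-back volume
    return victims, freed


blocks = [
    ("A", 40, 1, False),   # oldest, idle -> first candidate
    ("B", 50, 2, True),    # old but in use by a kernel -> skipped
    ("C", 30, 3, False),
    ("D", 20, 4, False),   # most recently used -> spared
]
victims, freed = select_victims(blocks, needed=60)
print(victims, freed)      # ['A', 'C'] 70
```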
However, in some cases the memory fragmentation and the size of the working set for a kernel are so large that moving memory becomes impractical. In this case, an error report such as the following is displayed:
Amount of pinned memory: 398,131,200 bytes
Freelist size: 6 memory blocks
Largest free block: 67,665,920 bytes, insufficient to hold request size of 132,710,400 bytes
Process total: 736,493,568, Inuse: 580,608,000 bytes, Free: 155,885,568 bytes;
Device total: 1,010,368,512, device free: 167,772,160
Chunk 0 size 67,108,864 bytes: Fragmentation: 47.8%, pinned: 0 bytes (0.0%), free: 17,342,464 bytes (25.0%)
Chunk 1 size 134,217,728 bytes: Fragmentation: 0.0%, pinned: 132,710,400 bytes (98.0%), free: 1,507,328 bytes (1.0%)
Chunk 2 size 268,435,456 bytes: Fragmentation: 0.0%, pinned: 132,710,400 bytes (49.0%), free: 36,192,256 bytes (13.0%)
Chunk 3 size 266,731,520 bytes: Fragmentation: 32.9%, pinned: 132,710,400 bytes (49.0%), free: 100,843,520 bytes (37.0%)
The solution is then to modify the Quasar program so that smaller (or fewer) memory blocks are used (for example, by splitting images into smaller tiles). An alternative is to change the memory allocation model to “huge memory footprint”: Quasar will then allocate larger memory chunks, which decreases the chance that a block does not fit in the chunks.
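The tiling workaround can be sketched in a few lines (hypothetical Python; tile sizes and image dimensions are arbitrary): by processing the image tile by tile, every individual allocation stays small enough to fit in a fragmented chunk, even when one whole-image buffer would not:

```python
def process_in_tiles(height, width, tile, process_tile):
    """Invoke process_tile(row, col, h, w) for each tile of an image."""
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            # Edge tiles are clipped to the image boundary.
            process_tile(r, c, min(tile, height - r), min(tile, width - c))


allocation_sizes = []   # record the per-tile working-set size (in elements)
process_in_tiles(1080, 1920, 512,
                 lambda r, c, h, w: allocation_sizes.append(h * w))

# The per-tile peak is far below the full-image buffer size:
print(max(allocation_sizes), 1080 * 1920)   # 262144 2073600
```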

16.5 Load balancing and runtime scheduling

The runtime scheduler defines the order of the kernel launches and whether kernels can be executed concurrently. To do so, the scheduler also tracks the data dependencies between subsequent kernels. At the core of the runtime scheduler is the command queue: a first-in first-out queue whose entries are either individual kernel launches or memory transfer operations. When kernel and device functions are scheduled, load balancing is performed and an appropriate computation device is selected based on several parameters.
The runtime scheduler aims at distributing the kernel functions over the available computation devices, taking full advantage of their capabilities.
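The dependency check that decides whether two queued kernels may overlap can be sketched with read/write sets (a hypothetical Python model; the kernel names are invented examples): two kernels conflict if either writes data that the other reads or writes:

```python
def can_run_concurrently(k1, k2):
    """k1, k2: (reads, writes) pairs of sets of object names."""
    r1, w1 = k1
    r2, w2 = k2
    # Write-after-read, read-after-write and write-after-write all conflict;
    # shared reads are harmless.
    return not (w1 & (r2 | w2) or w2 & r1)


gauss  = ({"img"}, {"tmp"})     # reads img, writes tmp
thresh = ({"tmp"}, {"out"})     # reads tmp, writes out
hist   = ({"img"}, {"hist"})    # reads img, writes hist

print(can_run_concurrently(gauss, thresh))  # False: thresh reads what gauss writes
print(can_run_concurrently(gauss, hist))    # True: they only share a read
```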

16.6 Optimizing memory transfers with const and nocopy

To avoid unnecessary memory copies, kernel and device function parameters can be annotated with the modifiers ’const and ’nocopy. The meaning of these modifiers is as follows:
The ’const modifier avoids that, after execution of the kernel/device function, function parameters are copied back to the host (or, equivalently, to another GPU). Conversely, ’nocopy avoids that the memory is copied from the host to the device (or, equivalently, between GPUs).
The compiler automatically detects parameters that can be annotated with ’const and ’nocopy, so there is no direct need to use these modifiers in user code. However, the modifiers may still be useful for documenting the calling interface of the function: the calling function then knows that, e.g., the given function does not change the value of one of the parameters, or that the input value of one of the parameters is not used directly.
Finally, the reader may have noted that a modifier combining nocopy at both input and output could also be useful; this would declare local “scratch memory” for a kernel function (e.g., as used in the implementation of various linear algebra routines). In Quasar, there is no special modifier for this purpose: it suffices not to declare any access modifier. As long as the scratchpad memory is not used by another function, the memory transfer will not be performed.
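The transfer behavior of the modifiers can be summarized in a small decision sketch (hypothetical Python; the `transfers` helper is invented for illustration). ’const skips the copy-back to the host, ’nocopy skips the copy-in to the device, and an unannotated parameter defaults to both directions (though, as noted above, the runtime can elide both when the data serves purely as scratch memory):

```python
def transfers(modifier):
    """Return (copy_in, copy_out) for a kernel parameter.

    copy_in:  host -> device transfer before the kernel launch
    copy_out: device -> host transfer after the kernel completes
    """
    copy_in = modifier != "nocopy"    # 'nocopy: input value is never read
    copy_out = modifier != "const"    # 'const: parameter is never modified
    return copy_in, copy_out


print(transfers("const"))    # (True, False): input-only parameter
print(transfers("nocopy"))   # (False, True): output-only parameter
print(transfers(None))       # (True, True): default, both directions
```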

16.7 Controlling the runtime system programmatically

Sometimes it is useful to control the runtime system from Quasar code. This can be achieved using the following code attributes:
Note that the above code attributes are provided for fine-tuning implementations. They should be used with care, because inappropriate usage may degrade performance and, in some cases, cause unexpected errors. Several aspects, such as the memory location of a variable, can be visualized and analyzed in the variables window of the Redshift IDE.