
8 Special programming patterns

8.1 Matrix/vector expressions

Operations on large matrices are grouped and automatically converted into a kernel function (sometimes called broadcasting). For example:
x = randn(512,512,64)
y = 0.1 + (0.8 * 255 * sin(x/255)) + 10 * w
will automatically be translated to:
function [_out:cube]=opt__auto_optimize1(x:cube,w:cube)
function [] = __kernel__ opt__auto_optimize1_kernel _
(_out:cube'unchecked,x:cube'unchecked,w:cube'unchecked,pos:ivec3)
_out[pos]=((0.1+(204*sin(x[pos]/255)))+(10*w[pos]))
endfunction
_out = uninit(size(x))
parallel_do(size(x),_out,x,w,opt__auto_optimize1_kernel)
endfunction
reduction (x:cube, w:cube) -> ((0.1+(204*sin(x/255)))+(10*w)) = _
opt__auto_optimize1(x,w)
x = randn(512,512,64)
y = opt__auto_optimize1(x, w)
which is faster, because intermediate results are computed directly in local memory, without accessing the global memory (see Subsection 2.4.3↑). Note that this procedure depends on the success of the type inference. In some cases, it may be necessary to give the compiler a hint about the types of certain variables, through assert(type(var,"typename")) (see Section 2.2↑). Also, the expression optimizer generates a reduction to deal with expressions of the form ((0.1+(204*sin(x/255)))+(10*w)). When the same expression appears several times in the code, even in a slightly modified form (e.g. sin(sinc(x)/255) instead of sin(x/255)), the generated __kernel__ function will be re-used.
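For instance, if the type of w could not be inferred automatically, a type hint can be placed before the expression. The following is a minimal sketch, reusing the variables of the example above:
assert(type(w,"cube"))   % type hint: w is a 3D array, so kernel generation can proceed
y = 0.1 + (0.8 * 255 * sin(x/255)) + 10 * w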
The expression optimization can be configured using the following pragma:
#pragma expression_optimizer (on|off)
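For example, to exclude a single expression from automatic kernel generation, the optimizer can be switched off and on around it. This is a sketch; it assumes the pragma applies to the statements that follow it:
#pragma expression_optimizer (off)
y1 = 0.1 + (0.8 * 255 * sin(x/255)) + 10 * w   % evaluated with intermediate matrices
#pragma expression_optimizer (on)
y2 = 0.1 + (0.8 * 255 * sin(x/255)) + 10 * w   % fused into a single kernel function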

8.2 Loop parallelization/serialization

In Quasar, loop parallelization consists of 1) the detection of parallelizable (or serializable) loops and 2) the automatic generation of kernel functions for these loops. The automatic loop parallelizer (ALP) attempts to parallelize for-loops, starting with the outermost loops first. The ALP automatically recognizes one, two and three dimensional for-loops, and also maximizes the dimensionality of the loops subject to parallelization. The ALP extracts the loops and converts them to kernel functions (see Subsection 2.4.1↑), which are subsequently optimized using target-specific code transformations (see Chapter 17↓). However, not every for-loop can be parallelized and executed on a GPU; a number of restrictions exist:
  1. All variables that are being used inside the loop must have a static type (i.e. explicitly typed as vec, mat, cube, see Section 2.2↑) or a static type can be inferred from the context (type inference or through explicit/implicit specialization, see Chapter 6↑). Practically speaking: only types that can be used inside __kernel__ or __device__ functions are allowed.
  2. For the best performance, for slicing operations such as A[a..b,2], the boundaries a and b must be constant and known at compile time (either specified explicitly, or obtained through constant propagation); a small sketch is given after this list. When these constants are not known, a nested kernel function will be generated. Depending on the context, this may or may not involve dynamic kernel memory (see Section 8.3↓).
  3. The for-loop nest must be perfect (no code in between subsequent for statements, no dependencies between the loop boundaries, no break, no continue). For example:
    for m=0..size(x,0)-1
    for n=5..size(y,0)-1
    endfor
    endfor
    This is an example of an imperfect loop:
    for m=0..size(x,0)-1
    for n=m..size(y,0)-m
    endfor
    endfor
    (Note: certain types of imperfect loops can also be handled using the imperfect loop function transform, this requires placing {!function_transform enable="imperfectloop"} inside the for-loop)
  4. Only types that can be used inside kernel or device functions are allowed (e.g., types based on class and mutable class, but not dynamic class).
  5. When host (i.e. non kernel/device) functions are called from a loop, the loop is not eligible for parallelization/serialization. In particular, only a limited number of built-in functions are supported. Functions that interact with/take variables with unspecified types (such as print, load, save, ...) are not supported.
  6. Data dependencies/conflicts between different iterations are detected and not allowed. In case a dependency is detected, the loop can be serialized. In this case, the for-loop will be natively compiled (in C++) and executed on the CPU in single-threaded mode.
  7. Advanced kernel function features such as shared memory and thread synchronization (see Subsection 2.4.4↑) are not supported, because these functions often require low-level access to the block position (blkpos) and dimensions (blkdim).
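To illustrate restriction 2, the following sketch (with hypothetical data) uses slice boundaries that are literal constants, so the compiler can parallelize the loop directly:
x = randn(256,64)
y = zeros(256,8)
{!parallel for}
for m=0..size(x,0)-1
y[m,0..7] = 2*x[m,0..7]   % the slice bounds 0 and 7 are known at compile time
endfor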
In case one of these conditions is violated, an error message is generated and the code is not parallelized. By default, this causes the code to be interpreted by the Quasar interpreter (often resulting in significantly slower execution). It is therefore recommended to resolve the issues raised by the automatic for-loop parallelizer.
The ALP can be configured using the pragmas (see Section 17.3↓): #pragma loop_parallelizer (on|off) and the global configuration setting COMPILER_AUTO_FORLOOP_PARALLELIZATION (see Table 17.3↓).
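As an illustration of the pragma, the loop parallelizer can be switched off around a code fragment. This is a sketch under the assumption that the pragma affects the loops that follow it:
x = randn(1024)
y = zeros(size(x))
#pragma loop_parallelizer (off)
for i=0..size(x,0)-1
y[i] = x[i] + 1   % this loop is not considered by the ALP
endfor
#pragma loop_parallelizer (on)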
Individual for-loop parallelization can also be controlled by placing a code attribute (see Section 4.11↑) in front of the for-loop definition. The available code attributes include {!parallel for} and {!serial for}.
An example of ALP code attributes is given below:
im = imread("image.tif","rgb")
{!parallel for}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
im[m,n,0..2] = 255-im[m,n,0..2]
endfor
endfor
Most of the time, {!parallel for} is not required because the compiler is able to detect the parallelism automatically. However, {!parallel for} ensures that the subsequent loop will be parallelized, generating a compiler error when this fails.

8.2.1 While-loop serialization

The ALP is also able to serialize while loops, as in the following example:
finished = false
index = 0
{!serial for}
while !finished
x = a[0,index]
y = a[1,index]
index += 1
if x.^2 + y.^2 < 1
finished = true
endif
endwhile

8.2.2 Example: gamma correction

The following example illustrates a gamma correction operation:
im = imread("image.tif")
im_out = zeros(size(im))
gamma = 1.1
tic()
{!parallel for}
for i=0..size(im,0)-1
for j=0..size(im,1)-1
for k=0..size(im,2)-1
im_out[i,j,k] = im[i,j,k]^gamma
endfor
endfor
endfor
toc()
fig1 = imshow(im)
fig2 = imshow(im_out)
fig1.connect(fig2)
When no code attribute ({!parallel for}, {!serial for}) is used, the compiler analyzes the code, inspects the variable dependencies and decides whether the loop can be parallelized or serialized. It is therefore not strictly necessary to specify these code attributes; however, when they are specified, the compiler will generate an error in case the parallelization/serialization fails (e.g., due to a data dependency). Resolving these errors may allow the code to be parallelized.
Warning: when {!parallel for} is placed in front of a loop, the loop will be parallelized irrespective of any data dependencies. Consequently, when {!parallel for} is used carelessly, the program behavior is affected and wrong results may be obtained!
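For example, the following loop (a hypothetical illustration) has a loop-carried dependency: each iteration reads values that other iterations write. Forcing parallelization makes the result depend on the order in which the iterations happen to execute:
x = randn(1024)
% forcing parallelization despite the loop-carried dependency:
{!parallel for}
for i=1..size(x,0)-2
x[i] = 0.5*(x[i-1] + x[i+1])   % x[i-1] may already have been updated by another iteration
endfor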

8.3 Dynamic kernel memory

Very often, it is desirable to construct (non-fixed length) vector or matrix expressions within a for-loop (or a kernel function). Before Jan. 2014, this resulted in a compilation error “function cannot be used within the context of a kernel function” or “loop parallelization not possible because of function XX”. The transparent handling of vector or matrix expressions within kernel functions requires some special (and sophisticated) handling by the Quasar compiler and runtime. In particular, what is needed is dynamic kernel memory: memory that is allocated on the GPU (or CPU) while the kernel is running. The dynamic memory is disposed (freed) either when the kernel function terminates or at a later point.
There are a few use cases for dynamic kernel memory:
In all these cases, dynamic memory can be used, simply by calling the zeros, ones, eye or uninit functions. One may also use slicing operators (A[0..255, 2]) in order to extract a sub-matrix. The slicing operations then take the current boundary access mode (e.g. mirroring, circular) into account. Dynamic kernel memory has the following advantages:
However, there are also some drawbacks:
Because of these drawbacks, an optimization warning is generated by the compiler. It is recommended to use dynamic kernel memory sparingly and only when there is no direct alternative. Different approaches exist to avoid the use of dynamic kernel memory:

8.3.1 Examples

The following program transposes 16x16 blocks of an image, creating a cool tiling effect. Firstly, a kernel function version is given and secondly a loop version. Both versions are equivalent: in fact, the second version is internally converted to the first version.

Kernel version

function [] = __kernel__ kernel (x : mat, y : mat, B : int, pos : ivec2)
r1 = pos[0]*B..pos[0]*B+B-1 % creates a dynamically allocated vector
r2 = pos[1]*B..pos[1]*B+B-1 % creates a dynamically allocated vector

y[r1, r2] = transpose(x[r1, r2]) % matrix transpose; x[r1, r2] creates a dynamically allocated matrix
endfunction
x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size
parallel_do(size(x,0..1) / B,x,y,B,kernel)

Loop version

x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size
{!parallel for}
for m = 0..B..size(x,0)-1
for n = 0..B..size(x,1)-1
A = x[m..m+B-1,n..n+B-1] % creates a dynamically allocated matrix
y[m..m+B-1,n..n+B-1] = transpose(A) % matrix transpose
endfor
endfor

8.3.2 Memory models

To accommodate the widest range of algorithms, two memory models are currently provided (more may be added in the future).
  1. Concurrent memory model (default)
    In the concurrent memory model, the computation device (e.g. GPU) autonomously manages a separate memory heap that is reserved for dynamic objects. The size of the heap can be configured in Quasar and is typically 32MB.
    The concurrent memory model is extremely efficient when all threads (e.g. ≥ 512) request dynamic memory at the same time. The memory allocation is done by a specialized parallel allocation algorithm that significantly differs from traditional sequential allocators.
    For efficiency, there are some internal limitations on the size of the allocated blocks:
    • The total amount of kernel dynamic memory is by default 128 MB, but can be configured (see Redshift/program settings/Runtime/Memory management)
    • The maximum size is by default 64 KiB (65536 bytes), but can be configured (see Redshift/program settings/Runtime/Memory management).
    • The minimum size is by default 64 bytes and scales linearly with the setting of maximum size. When the allocation request size is smaller than this minimum size, the request size is rounded up to the minimum size.
    • All Quasar modules need to use the same kernel dynamic memory settings: this is necessary so that dynamic memory blocks can be accessed correctly (allocated, deallocated) in between different modules
    Also note that the maximum size also incorporates the management overhead (reference counting, block sizes, etc.), so if you intend to allocate a 128 × 128 matrix in 32-bit floating point precision (128 × 128 × 4 bytes = 64 KiB), it is required to select a maximum size of 128 KiB.
    The concurrent memory model is primarily designed to enable a large number of concurrent allocations of relatively small memory blocks (for example to perform a local matrix calculation). It is generally not memory-efficient to allocate large memory blocks from a kernel function in the concurrent memory model. This is because often thousands of threads may be launched on the GPU, and correspondingly, thousands of dynamically allocated memory blocks may be active for a certain local calculation. These temporary memory blocks may easily consume several megabytes of GPU memory, which may restrict the amount of GPU memory available for other purposes. For example, 2800 concurrent threads each using 128 KiB of data already amount to 350 MiB of GPU memory!
    Therefore, for large dynamic memory allocations, consider using the cooperative memory model.
  2. Cooperative memory model
    In the cooperative memory model, the kernel function requests memory directly from the Quasar allocator. This way, there are no limitations on the size of the allocated memory. Also, the allocated memory is automatically garbage collected. To enable the cooperative memory model for a kernel function, place {!kernel memorymodel="cooperative"} inside the kernel function (a short sketch is given after this list).
    Because the GPU cannot launch callbacks to the CPU, this memory model requires the kernel function to be executed on the CPU.
    Advantages:
    • The maximum block size and the total amount of allocated memory only depend on the available system resources.
    Limitations:
    • Can only be used from CPU kernel functions.
    • The Quasar memory allocator uses locking (to a limited extent), so simultaneous memory allocations on all processor cores may be expensive.
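The following minimal sketch (with hypothetical data and function names) uses the cooperative memory model to allocate one full image row per thread; as noted above, such a kernel is executed on the CPU:
function [] = __kernel__ copy_row(x : mat, y : mat, pos : int)
{!kernel memorymodel="cooperative"}
row = x[pos,0..size(x,1)-1]   % potentially large dynamic allocation (one full row)
y[pos,0..size(x,1)-1] = row
endfunction

x = randn(1024,4096)
y = zeros(size(x))
parallel_do(size(x,0),x,y,copy_row)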

8.3.3 Features

8.3.4 Performance considerations

Dynamic kernel memory can greatly improve the expressiveness of Quasar programs; however, there are also a number of downsides that need to be taken into account.

8.4 Map and Reduce pattern

A popular parallel programming model is the Map/Reduce model. In this model, a Map() function is applied to the data and subsequently the data is aggregated using a Reduce() function. In its simplest form, a sequential implementation would be as follows:
total = 0.0
for m=0..511
for n=0..511
total = total + map(im[m,n])
endfor
endfor
If this loop were parallelized directly, it would lead to data races, because the summation (total = total + ...) is not atomic. Luckily, the Quasar compiler atomizes this summation operation behind the scenes, leading to the following parallel loop:
total = 0.0
{!parallel for}
for m=0..511
for n=0..511
total += map(im[m,n])
endfor
endfor
For an overview of how the compiler atomizes operations, see Table 8.1↓. Note that this programming pattern may occur with multiple result values: the result can be a scalar value, a vector or even a matrix:
A = zeros(2,2)
{!parallel for}
for i=0..255
A[0,0] += x[i,0]*y[i,0]
A[0,1] += x[i,0]*y[i,1]
A[1,0] += x[i,1]*y[i,0]
A[1,1] += x[i,1]*y[i,1]
endfor
Direct implementation of this loop in OpenCL and CUDA would give a poor performance on GPU devices, due to all adds being serialized in the hardware (all threads need to write to the same location in memory, so there is a spin-lock that basically serializes all the memory write accesses). The performance is often much worse than performing all operations sequentially on a CPU!
The obvious solution is the use of shared memory and thread synchronization in combination with a parallel reduction (see Section 10.6↓). In general, it is quite hard to write these kinds of algorithms while taking all side effects into consideration, such as register pressure and shared memory pressure. Therefore, the Quasar compiler now detects the above pattern and converts it into an efficient parallel reduction algorithm. There are some considerations:
Table 8.1 Atomization table of operators

    Non-atomized       Atomized    Description
    a = a + b          a += b      Addition
    a = a - b          a -= b      Subtraction
    a = a * b          a *= b      Multiplication
    a = max(a, b)      a ^^= b     Maximum
    a = min(a, b)      a __= b     Minimum
    a = and(a, b)      a &= b      Bitwise AND
    a = or(a, b)       a |= b      Bitwise OR
    a = xor(a, b)      a ~= b      Bitwise XOR

8.5 Cumulative maps (prefix sum)

A related, frequently occurring programming pattern is the cumulative map, for example:
{!parallel for}
for m=0..511
for n=0..511
im[m,n] = im[m-1,n] + map(im[m,n])
endfor
endfor
Here, we assume that the 'safe access modifier is used (see Subsection 2.4.1↑), so that im[m-1,n] equals 0 for m=0. Some applications of the cumulative map pattern are cumulative sums (e.g., integral images), infinite impulse response (IIR) filters, etc. The cumulative map pattern is closely related to the map and reduce pattern, with the difference that the intermediate calculated values also need to be stored in memory.
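As an illustration, a 2D integral image can be computed with two cumulative passes, one per dimension. The sketch below assumes the 'safe access modifier (so that out-of-range reads return 0) and assumes that the compiler recognizes the cumulative pattern along either dimension:
im = imread("image.tif")[:,:,0]
{!parallel for}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
im[m,n] = im[m,n-1] + im[m,n]   % cumulative sum along each row
endfor
endfor
{!parallel for}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
im[m,n] = im[m-1,n] + im[m,n]   % cumulative sum along each column
endfor
endfor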
The parallel algorithm for efficiently calculating a cumulative map is the prefix sum. By static analysis, the Quasar compiler is able to recognize the above pattern and convert it into a parallel prefix sum algorithm. The conditions are mostly the same as in the case of the parallel reduction pattern:

8.6 Meta functions

Note: this section gives more advanced information about how internal routines of the compiler can be accessed from user code. Normally, these functions do not need to be used directly; however, this information can still be useful for certain operations.
Quasar has a special set of built-in functions that are aimed at manipulating expressions at compile time (although in the future the implementation may also allow them to be used at run time). These functions are special because they do not follow the regular evaluation order (i.e. they can be evaluated from the outside to the inside of the expression, depending on the context). To distinguish them from the host functions, these functions start with the prefix $.
For example, x is an expression, as well as x+2 or (x+y)*(3*x)^99. A string can be converted (at runtime) to an expression using the function eval. This is useful for runtime processing of expressions, for example expressions entered by the user. However, the opposite is also possible:
print $str((x+y)*(3*x)^99) % Prints "(x+y)*(3*x)^99"
This is similar to the stringizing macro operator (#) in C:
#define str(x) #x
However, there are a lot of other things that can be done using meta functions. For example, an expression can be evaluated at compile-time using the function $eval (which differs from eval)
print $eval(log(pi/2)) % Prints 0.45158273311699, but the result is computed at compile-time.
The $eval function also works when constant variables are referenced (i.e. variables whose values are known at compile time). Although this seems quite trivial, this technique opens new doors for compile-time manipulation of expressions, in a way that is completely different from C/C++ but somewhat similar to Maple or LISP macros.
Below is a small overview of the meta functions in Quasar:
Notes:
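A reduction of roughly the following form (a hedged reconstruction based on the reduction syntax shown in Section 8.1; the where-clause syntax is an assumption, and the part that restricts the reduction to __device__ functions is not reproduced here)
reduction (x : scalar) -> log(x) = x - 1 where abs(x - 1) < 1e-1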
means that the reduction for log(x) may only be applied inside __device__ functions, when the condition abs(x - 1) < 1e-1 is met. Here, this is simply a linear approximation of the logarithm around x==1.

Example: copying the type and assumptions from one variable to another

It is possible to write statements such as "assume the same about variable 'b' as what is assumed about 'a'". This includes the type of the variable (as in Quasar, a type specification is nothing more than a predicate).
a : int
assert(0 <= a && a < 1)
b : ??
assert($subs(a=b,$assump(a)))
print $str($assump(b)) % Prints "type(b,"int") && 0 <= b && b < 1"

