13 Best practices
13.1 Use “main” functions
Quasar programs are executed from top to bottom. This means that, if there are statements in between function definitions, these statements will also be executed. This can be handy for defining symbols at the global level, such as constants, lambda expressions, etc. However, it is advisable to put the main program logic in one function, the function “main”. The function “main” is called automatically by the runtime when the .q file is loaded. An example of a main function is as follows:
function [] = main()
    img = imread("lena_big.tif")
    imshow(img)
endfunction
The main function may contain fixed and optional parameters:
function [] = main(required_param1, opt_param1=4.0)
The required parameters must then be specified via the command line (or via the “set command line arguments” dialog box in Redshift). For example:
Quasar.exe myprog.q 1 2
When too few (or too many) parameters are specified, a run-time error is generated. Practically, only two parameter types are allowed: scalar and string. In order to pass values of other types (e.g. matrices), it is currently best to wrap them in a string, and to convert the string to the right data type using the function eval. The following example illustrates this:
function [] = main(matrix_string : string)
    matrix : mat = eval(matrix_string)
    print matrix
endfunction
% Command line
Quasar.exe myprog.q "[[1,0],[1,-1]]"
% Runtime system calls the main function as:
main("[[1,0],[1,-1]]")
Additionally, the main function can be made to accept a variable number of parameters, by defining it as a variadic function (see Section 4.8↑):
function [] = main(arg1, arg2, ...other_args)
    print arg1
    print arg2
    for i=0..numel(other_args)-1
        print other_args[i]
    endfor
endfunction
This permits great flexibility when passing various parameters to Quasar programs.
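For illustration, a hypothetical invocation with five numeric arguments could look as follows (the remaining command-line values are collected in other_args):
% Command line
Quasar.exe myprog.q 1 2 3 4 5
% Runtime system calls the main function as:
main(1, 2, 3, 4, 5)   % arg1 = 1, arg2 = 2, other_args collects 3, 4 and 5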
Important remark: the function “main” has special behavior when the .q file is imported (using the import keyword, see earlier): the function definition is skipped entirely, as if no “main” function were present in the file. Hence, for .q modules that are only intended to be imported, the “main” function can contain testing code.
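For instance, a module intended for importing could be structured as follows (a sketch with a hypothetical file name mymodule.q and helper function invert):
% mymodule.q - a module intended to be imported by other programs (using the import keyword)
function y : mat = invert(x : mat)
    y = 255 - x
endfunction

% testing code: runs when mymodule.q is executed directly,
% but is skipped entirely when the file is imported
function [] = main()
    img = imread("lena_big.tif")
    imshow(invert(img))
endfunction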
13.2 Shared memory usage
Shared memory (see Subsection 2.4.4↑) is on-chip and fast; however, for the CUDA computation engine, recent GPU devices have a global memory cache with roughly the same efficiency as shared memory. Consequently, the best practice is to use shared memory only when it is really needed, for example when communication is required between the different threads of a kernel function that run in parallel on the same block. The reason is that copying from global memory to shared memory also has a performance cost, and because shared memory is limited, kernel functions often need to be restructured so that everything fits into the shared memory. This “restructuring cost” often outweighs the benefits of using shared memory. So use shared memory only when it is really necessary.
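As an illustrative sketch (with hypothetical kernel and variable names), shared memory is justified when the threads within a block need each other’s data, for example to subtract the block-wise mean from every element. For simplicity, the sketch assumes the vector length is a multiple of the block size:
function [] = __kernel__ subtract_block_mean(x : vec, y : vec,
        pos : int, blkpos : int, blkdim : int)
    vals = shared(blkdim)      % one shared entry per thread in the block
    vals[blkpos] = x[pos]
    syncthreads                % make all loaded values visible to the whole block
    total = 0.0
    for i=0..blkdim-1          % every thread computes the block sum
        total += vals[i]
    endfor
    y[pos] = x[pos] - total/blkdim
endfunction

x = rand(1024)
y = zeros(size(x))
parallel_do(numel(x), x, y, subtract_block_mean)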
13.3 Loop parallelization
Often, you may want to parallelize nested loops, such as:
for m=0..M-1
    for n=0..N-1
        for k=0..K-1
            for l=0..L-1
                ...
            endfor
        endfor
    endfor
endfor
The question is: which loops to parallelize? The answer is actually problem-specific (depends on the dimensions of the variables and their dependencies), but in general, it is recommended to parallelize the outer loops as much as possible, because this minimizes communication and synchronization with the computing device (e.g. GPU). For example, the above loops would be best parallelized as follows:
function [] = __kernel__ my_kernel(...)
    for l=0..L-1
        ...
    endfor
endfunction
parallel_do([M,N,K],...,my_kernel)
However, in many cases it is not necessary to perform this parallelization yourself: the Quasar compiler has an efficient built-in auto-parallelization routine, which checks the variables and their dependencies, and chooses the parallelization strategy that has the most benefit for the particular problem. For more info, see Subsection 17.1.1↓.
Using the {!parallel for; dim=N} code attribute, it is also possible to specify how many of the outer for-loops need to be parallelized. Special care needs to be taken when selecting N: when N is too small, the kernel may end up being executed on the CPU, because the data dimensions of the parallel loops are too small to match the GPU requirements well. See also Section 8.2↑.
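For example, a sketch in which only the two outer loops are parallelized (assuming x is an M × N × K data cube and y an M × N matrix that accumulates along the third dimension):
M = 256
N = 256
K = 8
x = rand(M, N, K)
y = zeros(M, N)

{!parallel for; dim=2}
for m=0..M-1
    for n=0..N-1
        for k=0..K-1
            y[m,n] += x[m,n,k]   % inner loop over k remains serial per thread
        endfor
    endfor
endfor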
13.4 Output arguments
Functions can have multiple output arguments, as shown in the following example:
function [band1 : mat, band2 : mat] = subband_decomposition(input : mat)
    band1 = input .* G    % G and H are filter matrices defined elsewhere (e.g. globally)
    band2 = input .* H
endfunction
Alternatively, the matrices are passed by reference (see Section 1.3↑), and this can also be exploited for returning processing results:
band1 = input                 % copy reference (band1 aliases the input memory)
band2 = zeros(size(input))
function [] = subband_decomposition(band1 : mat, band2 : mat)
    band2[:,:] = band1 .* G   % compute band2 first, while band1 still holds the input
    band1[:,:] = band1 .* H   % then overwrite band1 in place
endfunction
subband_decomposition(band1, band2)
It is preferable to use the first approach (for readability of the code); however, the second approach is also useful in some cases. The difference lies in memory usage: in the first approach, memory needs to be allocated for input, band1 and band2, while in the second approach, memory is only needed for band1 and band2 (hence one memory allocation is eliminated). For applications relying on huge matrix sizes (for example applications working with digital camera images, or real-time video applications), it is recommended to use the second approach.
Remark that simply using “band2 = band1 .* G” in the second approach would not give the correct result: even though band2 contains a pointer to the matrix memory, the value of band2 itself is still passed by value, so the assignment would only rebind the local variable. Instead, writing band2[:,:] ensures that the result is written into the existing memory of band2 (no new memory is allocated), so the caller sees the result.
When, after a call, an output argument is not needed in the subsequent code, it can be captured using a placeholder:
[band1,_] = subband_decomposition(input)
This way, in future versions, the compiler may optionally specialize the function subband_decomposition by generating a version in which the second output parameter is not calculated.
As mentioned in Subsection 4.8.3↑, the output arguments of functions can be chained using the spread operator “...”. For example:
function [band1_out, band2_out] = process(band1, band2)
    ...
endfunction
process(...subband_decomposition(input))
This way, it becomes unnecessary to store the output arguments in intermediate variables.
13.5 Writing numerically stable programs
Here, we consider numerical stability and software program stability. To ensure numerical stability, programs may need to make use of the following functions:
- isfinite(x): checks whether the variable x is finite (i.e. not infinite and not NaN, “not a number”).
- isinf(x): returns true only if the variable x is infinite. Infinities are used to represent overflow and divide-by-zero.
- isnan(x): returns true only if the variable x is “not a number”. NaN-encoded floating point numbers have no numerical value; they are produced by operations that have no meaningful result, like infinity minus infinity.
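For example, a small sketch (with a hypothetical function name) that replaces non-finite values by zero before further processing:
function y : mat = sanitize(x : mat)
    y = zeros(size(x))
    for m=0..size(x,0)-1
        for n=0..size(x,1)-1
            if isfinite(x[m,n])
                y[m,n] = x[m,n]
            else
                y[m,n] = 0.0     % replace NaN/Inf entries by 0
            endif
        endfor
    endfor
endfunction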
Remark that, by default, GPU computation engines flush denormal floating point values to 0. Practically, this means that if the result of a single-precision floating-point operation, before rounding, is in the range −2^−126 to +2^−126 (approximately −1.175 × 10^−38 to +1.175 × 10^−38), it is replaced by 0 (also see Subsection 2.2.1↑). To avoid potential underflows, it may be necessary to pre-scale the input data to a good “working” range before numerical operations are performed. CPU computation engines may allow denormal numbers (depending on the compiler settings, whether SIMD instructions are used, etc.), yielding more accurate numerical results but at decreased performance: working with denormal floating point numbers can be up to 100 times slower than working with normalized numbers. Hence, in case numerical problems are an issue, it may be good to compare the results of the CPU and GPU computation engines.
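A minimal sketch of such pre-scaling (assuming the processing step is linear so that the scaling can simply be undone afterwards; scale_factor and process are hypothetical names):
scale_factor = 1e20                  % brings very small input values into the normal range
x_scaled = x * scale_factor
y_scaled = process(x_scaled)         % hypothetical (linear) processing step
y = y_scaled / scale_factor          % undo the scaling afterwards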
Software stability: there are four causes for a Quasar program to be interrupted:
- Errors (generated using the error statement or by Quasar runtime functions). Currently, error handling (e.g. try-catch blocks) is not supported yet. Hence, when an error is generated, the program is automatically terminated.
- Out-of-memory: when the system (or GPU) does not have enough memory, the program will be halted. By default, Quasar attempts to move memory from the GPU to the system memory when it detects that a memory allocation may result in an out-of-memory error. In some cases, this may not be possible (e.g., a __kernel__ function that uses more memory than available on the GPU).
- Stack overflow: usually occurs when a recursive function calls itself in an endless loop. For example, the function f = x -> f(x) will result in a stack overflow error.
- Abusing ’unchecked modifiers: the ’unchecked modifier (see Subsection 2.4.1↑) is introduced for memory accesses where it is completely certain that a kernel function will not go out of the bounds of the vectors/matrices/etc. This gives a performance benefit of up to 30% or more for certain functions. When the kernel function breaches the boundaries, the program may either result in an error (e.g. cudaUnknownError) or crash. To prevent this kind of problem, one can either 1) remove the ’unchecked modifiers from the kernel function arguments, or 2) run the program using the CPU computation engine with the flag COMPILER_PERFORM_BOUNDSCHECKS=true (see Table 17.3↓). In the second case, Quasar will report an error and some information on the variables that violate the boundary conditions, so that abuses of the ’unchecked modifier can be fixed.
An alternative solution is to temporarily replace ’unchecked by ’checked; this instructs Quasar to always perform bounds checking for the specified variable, irrespective of the COMPILER_PERFORM_BOUNDSCHECKS setting.
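As an illustration (a sketch, not taken verbatim from an existing program), the modifiers are attached to the argument types of a kernel function:
% fast version: no bounds checks - only safe when pos is guaranteed to stay in range
function [] = __kernel__ scale_fast(x : vec’unchecked, y : vec’unchecked, pos : int)
    y[pos] = 2*x[pos]
endfunction

% debugging version: bounds checks are always performed for these arguments
function [] = __kernel__ scale_checked(x : vec’checked, y : vec’checked, pos : int)
    y[pos] = 2*x[pos]
endfunction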
To catch errors, it may be useful to place assertions inside kernel or device functions:
function [] = __kernel__ testkernel(pos : ivec3)
    b = 2
    assert(b==3)
endfunction
In this example, the assertion obviously fails. Quasar breaks with the following error message:
(parallel_do) testkernel - assertion failed: line 23
13.6 Writing deterministic kernels
When writing kernels involving atomic operations, the outcome may depend on the order in which the instructions are executed. This is because floating point arithmetic is generally not associative: in general, (A + B) + C != A + (B + C). Due to the parallelism, the order of the instructions cannot be controlled. This nondeterministic behavior may not be a desired property of the algorithm. The following parallel reduction implementation exhibits this problem.
function y : scalar = __kernel__ parallel_reduction(x : vec’unchecked,
        acc : accumulator, val0 : scalar, pos : int, blkpos : int, blkdim : int)
    val = val0
    bins = shared(blkdim)
    for m=pos..32*blkdim..numel(x)-1 % step 1 - parallel sum
        val = acc(val, x[m])
    endfor
    bins[blkpos] = val
    syncthreads % step 2 - reduction
    bit = 1
    while bit < blkdim
        index = 2*bit*blkpos
        if index + bit < blkdim
            bins[index] = acc(bins[index], bins[index + bit])
        endif
        syncthreads
        bit *= 2
    endwhile
    % write output
    if blkpos == 0
        y += bins[0]
    endif
endfunction
The problem here is that y += bins[0] is an atomic update operation. It is performed only once per block, but multiple blocks may interfere, causing the order of the addition operations to be altered. The solution here is to avoid the atomic update altogether and to store the result per block in an output vector (with length equal to the number of blocks used for the input vector). Then, in a separate launch of parallel_reduction, the block sums are summed separately. To reduce the size of the block sums vector, grid-strided loops can be used (see Subsubsection 9.2.3.1↑).
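A sketch of this remedy (assuming block_sums is an extra kernel argument, a vector with one entry per block): the final atomic update is replaced by a per-block store, after which the small block_sums vector is summed in a fixed order.
% inside the kernel, replacing "y += bins[0]":
if blkpos == 0
    blk = int(pos/blkdim)        % index of the current block (blkpos == 0 here)
    block_sums[blk] = bins[0]    % one deterministic store per block
endif

% afterwards: sum the (small) block_sums vector in a fixed order,
% e.g. sequentially on the host or via a second reduction pass as described above
total = 0.0
for i=0..numel(block_sums)-1
    total += block_sums[i]
endfor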
Thanks to the specific way the updates are performed in shared memory, and due to the thread synchronization, the updates of bins are guaranteed to be deterministic. This shows that it is possible to restructure a kernel function so that its output is deterministic.
When accumulation in shared memory with thread synchronization (or, similarly, warp shuffling) is not an option for a specific kernel, an alternative is to use integer or fixed-point representations. This way, we have successfully solved determinism problems in depth map calculation for computer vision applications.
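A sketch of the fixed-point idea (the scaling factor, the accumulator allocation and the kernel name are assumptions, and it is assumed that += on an integer vector element translates to an atomic integer addition): because integer addition is associative, the accumulated result no longer depends on the order of the atomic updates.
SCALE = 65536.0     % fixed-point scaling factor (assumes the value range is known in advance)

function [] = __kernel__ accum_fixedpoint(x : vec, y_int : vec[int], pos : int)
    y_int[0] += int(round(x[pos]*SCALE))   % associative integer accumulation
endfunction

% after the kernel has run, convert the accumulator back to floating point
% (y_int is a one-element integer vector; its allocation is omitted here)
result = y_int[0]/SCALE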