6 Generic programming
Often, functions need to be duplicated for different container types (e.g. ‘vec[int8]’, ‘vec[scalar]’, ‘vec[cscalar]’). To avoid this duplication there is support for generic programming in Quasar.
Consider the following program that extracts the diagonal elements of a matrix and that is supposed to deal with arguments of either type ‘mat’ or type ‘cmat’:
function y : vec = diag(x : mat)
assert(size(x,0)==size(x,1))
N = size(x,0)
y = zeros(N)
parallel_do(size(y), __kernel__ (x:mat, y:vec, pos:int) > y[pos] = x[pos,pos])
endfunction
function y : cvec = diag(x : cmat)
assert(size(x,0)==size(x,1))
N = size(x,0)
y = czeros(N)
parallel_do(size(y), __kernel__ (x:cmat, y:cvec, pos : int) > y[pos] = x[pos,pos])
endfunction
Although function overloading here greatly solves part of the problem (at least from the user’s perspective), there is still duplication of the function ‘diag’. In general, we would like to specify functions that can “work”irrespective of their underlying type.In Quasar, this is fairly easy to do:
function y = diag[T](x : mat[T])
assert(size(x,0)==size(x,1))
N = size(x,0)
y = vec[T](N)
parallel_do(size(y), __kernel__ (pos) > y[pos] = x[pos,pos])
endfunction
As you can see, the types of the function signature have simply be omitted. The same holds for the ‘__kernel__’ function.
In this example, the type parameter ‘
T’ is required because it is needed for the construction of vector ‘
y‘ (through the ‘
vec[T]’ constructor). If ‘
T==scalar’, ‘
vec[T]’ reduces to ‘
zeros’, while if ‘
T==cscalar’, ‘
vec[T]’ reduces to ‘
czeros’ (complexvalued zero matrix). In case the type parameter is not required, it can be dropped, as in the following example:
function [] = copy_mat(x, y)
assert(size(x)==size(y))
parallel_do(size(y), __kernel__ (pos) > y[pos] = x[pos])
endfunction
Remarkably, this is still a generic function in Quasar; no special syntax is needed here.
Note that in previous versions of Quasar, all kernel function parameters needed to be explicitly typed. This is now no longer the case: the compiler will deduce the parameter types by calls to ‘diag’ and by applying the internal type inference mechanism. The same holds for the ‘__device__’ functions.
When calling ‘diag’ with two different types of parameters (for example once with ‘x:mat’ and a second time with ‘x:cmat’), the compiler will make two generic instantiations of ‘diag’. Internally, the compiler may either:

Keep the generic definition (type erasion) \begin_inset Separator latexpar\end_inset
function y = diag(x)

Make two instances of ‘diag’ (reification):\begin_inset Separator latexpar\end_inset
function y : vec = diag(x : mat)
function y : cvec = diag(x : cmat)
The compiler will combine these two techniques in a transparent way, such that: 1) for kernelfunctions explicit code is generated for the specific data types and 2) for less performancecritical host code type erasion is used (to avoid code duplication).
The selection of the code to run is made at compiletime, so correspondingly the Quasar Spectroscope debugger has special support for this. Of course, when calling the ‘diag’ function with a variable of type that cannot be determined at compiletime, a compiler error is generated:
The type of the arguments (’op’) needs to be fully defined for this function call!
6.1 Parametrized functions
As already mentioned, generic functions can be defined by just omitting the type declarations for the function parameters. For example, consider adding an item to a list (represented by a vector) at a given position.
function new_list = add_item(list, item, pos)
...
endfunction
However, very often it is desirable that the type relation between
list and
item is specified. For example, the type of list is ’
vec[T]’ where
T is some type. This can be achieved using parametrized functions:
function new_list = add_item[T](list : vec[T], item : T, pos)
...
endfunction
then, when the function
add_item is called, the compiler (or the runtime system) will check whether the types of
list and
item match. The type variable is available within the context of
add_item. This easily allows variables to be defined with the same type as
item or
list:
function new_list = add_item[T](list : vec[T], pos, item : T)
if pos > numel(list)
% extend the list with new items
new_list = vec[T](pos+1)
new_list[0..numel(list)1] = list
else
new_list = list
endif
new_list[pos] = item
endfunction
list1 = vec[int](10)
list2 = vec[string](8)
add_item(list1, 5, 4) %OK
add_item(list1, 4, "text") %Type mismatch
add_item(list2, 2, "let’s try again") % OK
In some cases (for example when
T only determines the output parameters), we wish to select the “version” of
add_item that will be called. This can be done by filling in
T explicitly (through a technique called generic function instantiation):
add_item[int](list1, 5, 4)
add_item[string](list2, 2, "let’s try again")
This is particularly useful when defining functions that return generic objects:
function list = create_list[T](initial_length : int)
list = vec[T](initial_length)
endfunction
my_list = create_list[int](10)
Parametric function themselves are variables and they have a certain type. In the above examples, the types of
add_item and
create_list are:
add_item : [(vec[??], ??, ??) > vec[??]]
add_item[int] : [(vec[int], ??, int) > vec[int]]
add_item[string] : [(vec[string], ??, string) > vec[string]]
create_list[int] : [int > vec[int]]
Remarks:

Kernel and device functions can also be parametric. For the devicespecific code, only the reification technique is used. The compiler will therefore rely on its type inference techniques to determine the types of all function parameters.

Functions can have multiple type parameters. When one of the type parameters is not used, a compiler warning is given.

In essence, generic programming in Quasar allows the programmer to write programs in which none of the data types needs to be specified. Consider the following example:
x = imread("lena_big.tif")
function [] = __kernel__ gamma_correction(x, pos)
x[pos] = 255*(x[pos]/255)^0.5
endfunction
parallel_do(size(x), x, gamma_correction)
Here, the compiler is able to determine the type of x (’cube’) and from this information the compiler finds that the type of the parameter ’pos’ in ’gamma_correction’ is ’ivec3’. When the function ’gamma_correction’ is later used in combination with a matrix of a different type (e.g. ’mat[uint8]’), the compiler will create a second version of ’gamma_correction’ where ’pos’ will be of type ’ivec2’.
6.2 Parametrized reductions
Similar to functions, reductions can also be parametrized. This relieves the programmer from the extra work in replicating reductions for different data types. Parametrized reductions frequently occur in combination with parametrized functions.
Suppose that we have a highly efficient multiplication function that works on a matrix (with an arbitrary data type) and vectors (with the same element data type as the matrix). Then we want to define an operator * in order to map expressions A*B onto this generic function. This can be achieved as follows:
function y = matrix_multiplication(A : mat[T], B : vec[T])
...
endfunction
reduction (T, A : mat[T], B : vec[T]) > A*B = matrix_multiplication(A,B)
I.e., the type parameter acts as nothing more than an extra input parameter for the reduction. Optionally, the reduction may include a where clause to impose additional constraints on T.
6.3 Parametrized types
In a type erasure approach, generic types can be obtained by not specifying the types of the members of a class:
type stack : mutable class
tab
pointer
endtype
However, this limits the type inference, because the compiler cannot make any assumptions w.r.t. the type of ‘tab’ or ‘pointer’. When objects of the type ‘stack’ are used within a forloop, the automatic loop parallelizer will complain that insufficient information is available on the types of ‘tab’ and ‘po(like type)inter’. This problem can be solved by using parametrized types:
type stack[T] : mutable class
tab : vec[T]
pointer : int
endtype
An object of the type ‘stack’ can then be constructed as follows:
obj = stack[int] % or even:
obj = stack[stack[cscalar]]
Parametric classes are similar to template classes in C++.
It is also possible to define methods for parametric classes:
function [] = __device__ push[T](self : stack[T], item : T)
cnt = (self.pointer += 1) % atomic add for thread safety
self.tab[cnt  1] = item
endfunction
Methods for parametric classes can be ‘__device__’ functions as well, so that they can be used on both the CPU and the GPU. This allow us to create threadsafe and lockfree implementations of common data types, such as sets, lists, stacks, dictionaries etc. within Quasar.
6.4 Generic memory allocation functions and casting
When programming generic functions, it is often useful to allocate vectors or matrices of generic types. This can be accomplished using generic extensions of the memory allocation functions: uninit[T](), zeros[T](), shared[T](), shared_zeros[T](), as in the following example:
function y:’unchecked = generic_multiply[T](a : mat[T]’unchecked, b : mat[T]’unchecked) concealed
assert(size(a,1)==size(b,0),"Matrix multiplication: dimensions do not match:",size(a,0..1),"x",size(b,0..1),"!")
y = mat[T](size(a,0),size(b,1))
for m = 0..size(y,0)1
for n = 0..size(y,1)1
result = cast(0.0, T)
for k = 0..size(a,1)1
result += a[m,k] * b[k,n]
endfor
y[m,n] = result
endfor
endfor
endfor
Here, the type T is only known when either the function is specialized (i.e., at compiletime) or at runtime.
Additionally, often it is required to initialize scalar variables based on the generic type. This is achieved using a type cast:
result = cast(0.0, T)
Note that type casts should be avoided as much as possible and should only be used when dealing with generic functions.
6.5 Explicit specialization through metafunctions
Normally, generic functions are automatically specialized (which is called implicit specialization). This is a compilerdecision that relies on a number of heuristics. However, there is also the possibility of explicitly indicating that a given function need to be specialized and also in which way. This can be achieved using the metafunction $specialize:$specialize(function_name, constraint1 && ... && constraintN)
In Quasar, there are three levels of genericity (for which specialization can be done):

Type constraints: a type constraint binds the type of an input argument of the function.

Value constraints: gives an explicit value to the value of an input argument

Logic predicates: additional assumptions on the input arguments (see Chapter 5↑) that are not type or value constraints
As an example, consider the following generic function:
function y = __device__ soft_thresholding(x, T)
if abs(x)>=T
y = (abs(x)  T) * (x / abs(x))
else
y = 0
endif
endfunction
reduction x : scalar > abs(x) = x where x >= 0
Now, we can make a specialization of this function to a specific type:
soft_thresholding_real = $specialize(soft_thresholding, type(x,"scalar") && type(T, "scalar")) But also for a fixed threshold:
soft_thresholding_T = $specialize(soft_thresholding,T==10)We can even go one step further and specify that ‘x>0‘:
soft_thresholding_P = $specialize(soft_thresholding,x>0)Everything combined, we get:
soft_thresholding_E = $specialize(soft_thresholding, type(x,"scalar") && type(T,"scalar") && T==10 && x>0)Based on this knowledge (and the above reduction), the compiler will then generate the following function:
function y = __device__ soft_thresholding_E(x : scalar, T : scalar)
if x >= 10
y = x  10
else
y = 0
endif
endfunction
It can be noted that the function has now significantly been simplified.
Explicit specialization can also be used to change the types of function parameters at compiletime. This is legal as long as the parameter types are always
narrowed by this operation. This is useful for example to address the GPU hardware texturing units (see
Section 9.3↓) in a more general way. Below, the implemenuninittation of a 1D spatial filter with variable directionality is given.
function [] = __kernel__ filter_kernel(x : mat, y : mat’unchecked, a : vec, dir : ivec2, pos : ivec2)
offset = int(numel(a)/2)
total = 0.
for k=0..numel(a)1
total += x[pos + (k  offset).* dir] * a[k]
endfor
y[pos] = total
endfunction
% Default implementation  simply pass all parameters to the kernel function
parallel_do(size(im),im,im_out,a,[0,1],filter_kernel)
% Implementation II  use the GPU hardware hardware texturing units (HTUs)
parallel_do(size(im),im,im_out,a,[0,1],$specialize(filter_kernel,type(x,mat’hwtex_nearest)))
% Implementation III  perform constant substitution + use the HTUs
parallel_do(size(im),im,im_out,$specialize(filter_kernel,type(x,mat’hwtex_nearest) && a==[1,2,3,2,1]/9 && dir==[0,1]))
On an NVidia Geforce 435M GPU, the third implementation is about two times faster than the first implementation and 10% faster than the second implementation.
6.6 Implicit specialization
When generic device functions are called from kernel/device functions, the generic device function will be implicitly specialized (i.e., specialized automatically by the compiler).
function y = __device__ soft_thresholding(x, T)
if abs(x)>=T
y = (abs(x)  T) * (x / abs(x))
else
y = 0
endif
endfunction
function [] = __kernel__ denoising(x : mat, y : mat)
assert(x[pos]>0)
y[pos] = soft_thresholding(x[pos], 10)
endfunction
For execution on a computation device, all types must be statically determined at compiletime. Therefore, the implicit specialization allows defining generic functions without bothering about the type to which the function will be applied. Note that during the implicit specialization phase, the compiler may generate an error when a type mismatch is encountered.
6.7 Generic sizeparametrized arrays
In some cases, it is useful to integrate knowledge on the size of arrays into their types. This can be achieved using generic sizeparametrized arrays. Consider the following example of a color transform on an image with N channels (where N is a generic parameter):
function img_out = colortransform[N](C : mat(N,N), img : cube(:,:,N))
img_out = zeros(size(img,0),size(img,1),N)
{!parallel for}
for m=0..size(img,0)1
for n=0..size(img,1)1
img_out[m,n,:] = (C * squeeze(img[m,n,:]))
endfor
endfor
endfunction
The parameter
N constraints the arguments
C and
img to match in dimensions in the following way: the third dimension of img should match the first and second dimensions of
C (if not, a compiler error or (in some rare cases) a runtime error will be generated). Due to the implicit specialization (
Section 6.6↑), calling the function with a
mat3x3 color transform matrix will ensure that the parameter
N=3 is substituted during compiletime, leading to efficient calculation of the vectormatrix product
C * squeeze(img[m,n,:]). The
squeeze function is required here, because
img[m,n,:] returns a vector of dimensions
1x1xN which is multiplied with matrix
C. The
squeeze function removes the singleton dimensions and converts the vector to
Nx1x1 (or simply
N, since singleton dimensions at the end do not matter in Quasar). For the assignment, this conversion is done automatically. In practice the
squeeze function will have zerooverhead; it is only required here to keep the compiler happy.
Because with this technique, the dimensions of
img_out[m,n,:] are known at compiletime, the implementation of the above function will not use dynamic kernel memory (
Section 8.3↓), which is a big performance benefit.
The type parametrization allows for linear arithmetic. Therefore, it is possible to declare types like vec(2*N) and mat(2+N,2*N). For example, defining a matrix of type mat(2,N):
A = mat(2,N)
B = cube(N,2*N,3)
Y = A[0,:]
C = B[:,:,1]
The compiler can now infer that Y has type vec(N) and that C has type mat(N,2*N). After specialization with N=2, C will have type mat(2,4) and the resulting variables enjoy the benefits of fixed sized data types.
6.8 Generic dimensionparametrized arrays
Often, it is useful to define functions that can handle arrays with arbitrary dimensions. For this purpose, the type
cube{:} can be used (see
2.2.5↑). This type specifies a cube with an arbitrary (or to speak, infinite) dimension. Because currently Quasar does not support multidimensional forloops directly, the easiest way is to linearize the index and use two conversion functions:

ind2pos(sz, index): converts a linear (scalar) index to an ndimensional position vector

pos2ind(sz, pos): converts an ndimensional position vector to a linear (scalar) index
The following example illustrates a subsampling operation on an ndimensional array:
function y:’unchecked = subsample(x : cube{:}, D : int)
y = uninit(int(size(x)/D))
{!parallel for}
for v = 0..numel(y)1
p = ind2pos(size(y), v)
y[p] = x[p*D]
endfor
endfunction
Then, it might be desirable to know the dimension of the array at the moment that the function is executed. The dimension of an array can be obtained using the ndims() function, however, in case the function has two arrays as input, it may be desired to express that both arrays have the same dimensionality. For this purpose, dimensionparametrized array types is available.
An example of such a function is the Kronecker product, for which the implementation is different for each value of the dimensionality. Nevertheless, the function can be implemented in a generic fashion, as follows:
function z = kron[N](x : cube{N}, y : cube{N})
function [] = __kernel__ kernel(x : vec, y : vec, z : vec,
x_dim : ivec(N), y_dim : ivec(N), pos : int, blkdim : int, blkcnt : int)
stepsize = blkdim * blkcnt
total = prod(x_dim)
rep = prod(y_dim)
z_dim = int(x_dim .* y_dim)
for i=pos..stepsize..total1
x_pos = ind2pos(x_dim, i)
val = x[i]
for j=0..rep1
y_pos = ind2pos(int(y_dim), j)
z_pos = x_pos + x_dim .* y_pos
z[pos2ind(z_dim, z_pos)] = val * y[pos2ind(y_dim, y_pos)]
endfor
endfor
endfunction
z = uninit(size(x,0..N1).*size(y,0..N1))
nextpow2 = n > int(2^ceil(log2(n)))
parallel_do(min(nextpow2(numel(x)), 16384), x, y, z, size(x,0..N1), size(y,0..N1), kernel)
endfunction
When called, the compiler will specialize the generic function kron for each value of N that occurs in the calling code. When the dimensionalities of x and y do not match, the maximum of their dimensionalities will be used. Note that the above implementation combines sizeparametrized and dimensionparametrized arrays in a natural way: the length of the dimension vectors x_dim and y_dim matches the dimensionality of the input data.
6.9 Example of generic programming: linear filtering
A linear filter computes a weighted average of a local neighborhood of pixel intensities, and the weights are determined by the socalled filter mask.
In essence, 2D linear filtering formula can be implemented in Quasar using a 6 line __kernel__ function:
function [] = __kernel__ filter(x : cube, y : cube, mask : mat, ctr : ivec3, pos : ivec3)
sum = 0.0
for m=0..size(mask,0)1
for n=0..size(mask,1)1
sum += x[pos+[m,n,0]ctr] * mask[m,n]
endfor
endfor
y[pos] = sum
endfunction
However, this may not be the fastest implementation, for two reasons:

The above kernel function performs several read accesses to x (e.g. for 3x3 masks it requires 9 read accesses per pixel!). As outlined in the Quick optimization guide, the implementation should use shared memory as much as possible.

In case the filter kernel is separable (i.e. mask = transpose(mask_y) * mask_x), a faster implementation can be obtained by performing the filtering in two passes: a horizontal pass and a vertical pass. However, a naive implementation of this approach may have a bad data locality and depending on the size of the filter mask, it may even do more worse than good.
The best approach is therefore to combine the above techniques (i.e. shared memory + separable filtering). For illustrational purposes, we will consider only the mean filter (with mask=ones(3,3)/9) in the following.
function [] = __kernel__ filter3x3kernelseparable(
x:cube,y:cube,pos:ivec3, blkpos:ivec3,blkdim:ivec3)
vals = shared(blkdim+[2,0,0])
sum = 0.
for i=pos[1]1..pos[1]+1
sum += x[pos[0],i,blkpos[2]]
endfor
vals[blkpos] = sum
if blkpos[0]<2
sum = 0.
for i=pos[1]1..pos[1]+1
sum += x[pos[0]+blkdim[0],i,blkpos[2]]
endfor
vals[blkpos+[blkdim[0],0,0]] = sum
endif
syncthreads
sum = 0.
for i=blkpos[0]..blkpos[0]+2
sum += vals[i,blkpos[1],blkpos[2]]
endfor
y[pos] = sum*(1.0/9)
endfunction
x = imread("image.png")
y = zeros(size(x))
parallel_do(size(y),x,y,filter3x3kernelseparable)
imshow(y)
Remark that the above implementation is rather complicated, especially the block boundary handling code is excessive. Through generic programming, it is possible to extend this code fragment, in order to be used in a wider context. Quasar has two programming techniques:

Function variables and closure variables\begin_inset Separator latexpar\end_inset
Suppose that we express a filtering operation in a general way:
type f : [__device__ (cube, ivec2) > vec3]
This is a type declaration of a function that takes a cube and a 2D position as input, and computes a 3D color value.
Then, a linear filter can be constructed simply as follows:
mask = ones(3,3)/9
ctr = [1,1]
function y : vec3 = __device__ linearfilter(x : cube, pos : ivec2)
y = [0.0,0.0,0.0]
for m=0..size(mask,0)1
for n=0..size(mask,1)1
y += x[pos+[m,n,0]ctr] * mask[m,n]
endfor
endfor
endfunction
Note that the body of this function is essentially the body of the kernel function at the top of this page.
Next, we can define a kernel function that performs filtering for any filtering operation of type f:
function [] = __kernel__ genericfilterkernelnonseparable(
x:cube,y:cube, masksz:op:f,ivec2,pos:ivec3, blkpos:ivec3,blkdim:ivec3)
vals = shared(blkdim+[masksz[0]1,masksz[1]1,0])
vals[blkpos] = x[pos[1,1,0]]
if blkpos[0]<masksz[0]1
vals[blkpos+[blkdim[0]1,1,0]] = x[pos+[blkdim[0]1,1,0]]
endif
if blkpos[1]<masksz[0]1
vals[blkpos+[blkdim[1]1,1,0]] = x[pos+[blkdim[1]1,1,0]]
endif
syncthreads
y[pos] = op(vals, blkpos)
endfunction
x = imread("image.png")
y = zeros(size(x))
parallel_do(size(y),x,y,size(mask,0..1),linearfilter,genericfilterkernelnonseparable)
imshow(y)
Here, masksz = size(mask,0..1) (the size of the filter mask). Now we have written a generic kernel function, that can take any filtering operation and compute the result in an efficient way. For example, the filtering operation can also be used for mathematical morphology or for computing local maxima:
function y : vec3 = __device__ maxfilter(x : cube, pos : ivec2)
y = [0.0,0.0,0.0]
for m=0..size(mask,0)1
for n=0..size(mask,1)1
y = max(y, x[pos+[m,n,0]ctr])
endfor
endfor
endfunction
The magic here, is to implicit use of closure variables: the function linear_filter and max_filter hold references to nonlocal variables (i.e. variables that are declared outside this function). Here these variables are mask and ctr. This way, the function signature is still [__device__ (cube, ivec2) > vec3].

Explicit/implicit specialization\begin_inset Separator latexpar\end_inset
Previous point (1) is demonstrates a simple generic programming approach through function pointers. Some people believe that generic programming leads to a loss in efficiency. One of their arguments is that by the dynamic function call y[pos] = op(vals, blkpos), where op is actually a function pointer, efficiency is lost: the compiler is for example not able to inline op and has to emit very general code to deal with this case.
In Quasar, this is not necessarily true  being a true domainspecific language, the compiler has a lot of information. In fact, the optimization of the generic function generic_filter_kernel_nonseparable can be made explicit, using the $specialize meta function:
linearfilterkernel = $specialize(genericfilterkernelnonseparable, op==maxfilter)
x = imread("image.png")
y = zeros(size(x))
paralleldo(size(y),x,y,size(mask,0..1),linearfilterkernel)
imshow(y)
The function $specialize is evaluated at compiletime and will substitute op with respectively linear_filter and max_filter. Correspondingly these two functions can be inlined and the resulting code is equivalent to the linear_filter_kernel function being completely written by hand. Now, in Quasar, function pointers are avoided by default (through the compilation setting “enable function pointers in generated code”). This is achieved exactly using this technique.

Datatypeindependent implementation\begin_inset Separator latexpar\end_inset
We can also go one step further and generalize the data types of the above kernel function:
mask = ones(3,3)/9
ctr = [1,1]
function y : vec3 = __device__ linearfilter[T](x : cube[T], pos : ivec2)
y = [0.0,0.0,0.0]
for m=0..size(mask,0)1
for n=0..size(mask,1)1
y += x[pos+[m,n,0]ctr] * mask[m,n]
endfor
endfor
endfor
function [] = __kernel__ genericfilterkernelnonseparable[T](
x:cube[T],y:cube[T], masksz,op:[__device__ (cube[T], ivec2) > vec3],ivec2,pos:ivec3, blkpos:ivec3,blkdim:ivec3)
vals = shared[T](blkdim+[masksz[0]1,masksz[1]1,0])
vals[blkpos] = x[pos[1,1,0]]
if blkpos[0]<masksz[0]1
vals[blkpos+[blkdim[0]1,1,0]] = x[pos+[blkdim[0]1,1,0]]
endif
if blkpos[1]<masksz[0]1
vals[blkpos+[blkdim[1]1,1,0]] = x[pos+[blkdim[1]1,1,0]]
endif
syncthreads
y[pos] = op(vals, blkpos)
endfunction
x = imread("image.png")
y = zeros(size(x))
parallel_do(size(y),x,y,size(mask,0..1),linearfilter,genericfilterkernelnonseparable)
imshow(y)
Here, the compiler will specialize the function
genericfilterkernelnonseparable, as follows:
$specialize(genericfilterkernelnonseparable,op==linearfilter,T==scalar)
Functions with closure variables are building blocks for larger algorithms. Functions can have arguments that are functions themselves. Function specialization is a compiler operation that can be used to generate explicit code for fixed argument values. In the future, function specialization may be done automatically in some circumstances.