
2 Getting started

In this section, we give a short tutorial that conveys an idea of the Quasar programming language. First, a number of high-level concepts are listed; these concepts are mainly intended to ease programming in Quasar. The most interesting part is the discussion of writing parallel code (Section 2.4↓).

2.1 Quasar high-level programming concepts

  1. Variables: Quasar variables are (by default) weakly-typed, although some mechanisms exist to enforce strong-typing. Variable names and function names are case sensitive.
  2. Data types: some of the built-in data types are listed here:
    • scalar: specifies a floating-point number. The precision used depends on the settings of the computation engine.
    • cscalar: specifies a complex-valued scalar number
    • vec: a dense vector (1D array) of arbitrary length (the size is limited by the system resources)
    • mat: a dense matrix (2D array) of arbitrary size (the size is limited by the system resources)
    • cube: a dense cube (3D array) of arbitrary size (the size is limited by the system resources)
    • cvec: a complex valued dense vector
    • cmat: a complex valued dense matrix
    • ccube: a complex valued dense cube
    • string: a string expression
    • cell: a cell matrix object
    • kernel_function: represents a reference to a kernel or device function (see further).
    • function: a reference to a Quasar (user) function or lambda expression
    • object: a user-defined object (see function object())
    Note that specific functions (see further) need to be used to create variables of a given type. There are also some special built-in data types: vecx and cvecx, with x=1,...,32, which specify a vector of length x (e.g. vec3, cvec2).
  3. Scalar numbers: scalar numbers can be entered in decimal notation (5.678) as well as in scientific notation (-1.9e-4). Imaginary numbers are defined by adding the suffix j (or i), hence 1+1j or 1-1j represent complex numbers. Non-decimal numbers are also supported: for example, binary numbers 1011011b (suffix b or B), octal numbers 123456o (suffix o or O) and hexadecimal numbers 1Fh or 0ECD3Fh (suffix h or H).
  4. Integer numbers: Quasar supports integer numbers (type int). The bit length of the int type depends on the computation engine, but is typically 32-bit. Integer types with a specified bit length also exist: int8, uint8, int16, uint16, uint32.
  5. Comments: comments are preceded by the percent (%) sign, for example:
    % Comments are cool
    However, note that multi-line comments are currently not (yet) supported.
  6. Assignment expressions:
    cool = 1
    quasar = cool
    Separating lines with “;” is optional; it is only mandatory when multiple statements are placed on the same line. One can assign to multiple variables at once, similar to C/C++:
    a = b = 1
    Also, the result of an assignment is a value (in this case, 1). Multiple variable assignment is also possible (i.e. assigning multiple values to multiple variables at once). For example:
    [a, b] = [1, 2]
    will assign 1 to a and 2 to b. It is equivalent to:
    a=1; b=2
    Multiple variable assignment is mostly useful for 1) assigning multiple return values from functions and 2) interchanging values:
    [a, b] = [b, a]
    will swap the values of a and b. In some cases, it may be useful to ignore a certain return value. This can be done using the placeholder _:
    [a, _] = [1,2]
    [_, b] = [1,2]
  7. Arrays (vectors, matrices, cubes etc.): the data type used may depend on the settings of the computation engine (currently, only 32-bit floating point is allowed, for efficiency). The following program illustrates how to create vectors and perform operations:
    a = [0, 1, 2, 3] + 4
    b = [3, 3, 3, 3]
    print "a = ", a, "a .* b = ", a .* b, "sum(a.*b) = ", sum(a .* b)
    print "a^2 = ", a.^2
    String expressions are delimited by double quotes (“”); the function “print” prints several comma-separated values to the console. The “sum” function computes the sum of all components of the vector. The first line evaluates to
    a = [4, 5, 6, 7]
    i.e., 4 is added to every component of the vector. [4, 5, 6, 7] then represents a row vector. A matrix can be defined as follows:
    a = [[1,2],[2,1]]
    Statements and expressions can be split across multiple code lines. The following is also valid:
    a = [[1,2],
    [2,1]]
    However, for readability, it is advised to put an underscore at the line break:
    a = [[1,2], _
    [2,1]]
    Similarly, a 3D matrix can be defined by:
    a = [[[1,2],[3,4]],[[5,6],[7,8]]]
    An alternative way of defining matrices is using the function zeros(.) or ones(.), which will initialize the values of the matrix to 0 and 1, respectively:
    a = zeros(5)
    b = zeros(6, 4)
    c = zeros(8, 6, 4)
    d = ones(5)
    e = ones(6, 4)
    f = ones(8, 6, 4)
    Alternatively, a vector with the dimensions can be used:
    dims = [4, 5, 6]
    a = ones(dims)
    The function size(.) returns the size of a vector/matrix/cube:
    dims = size(a)
    dim_y = size(a,0)
    dim_x = size(a,1)
    dim_z = size(a,2)
    [dim_y,dim_x] = size(a,0..1)
    [dim_y,dim_x,dim_z] = size(a)
    Note that the dimensions are zero-based. By convention, y is the first dimension (corresponding to index 0), x is the second dimension and z is the third dimension. Internally, matrices are stored in row-major order (in contrast to MATLAB, which uses column-major order, i.e. FORTRAN order). An n × n identity matrix can be created by using the function eye(.):
    Q = eye(n)
    Another example:
    a = [[1,2],[2,1]]
    b = [[3],[4]]
    a[0,0] = 2
    print a * b, ",", eye(3)
    print a[0,0]
    print "size(a)=", size(a), "size(a,1)=", size(a,1)
    The function numel(.) gives the number of elements of a vector/matrix/cube. Practically: numel(X) = prod(size(X)).
  8. Operators: see Table 2.1↓.
    Notes:
    • There are no bit-wise integer operators. Instead, use the functions and (bitwise conjunction), or (bitwise disjunction), xor (exclusive or), not (bitwise negation), shl (bit-wise left shift), shr (bitwise right shift).
    • Within kernel or device functions, the operators +=, -= have a special meaning: they specify atomic operations (i.e. these operations are free of data races). There are currently 13 atomic operators (see table below).
      Table 2.1 Some operators
      =     assignment
      +     add
      -     subtract / negation
      *     matrix multiplication (or multiplication of scalar values)
      /     division of scalar values
      .*    point-wise multiplication (vec, mat, cube data types)
      ./    point-wise division (vec, mat, cube data types)
      ^     exponentiation (scalar values currently)
      .^    point-wise exponentiation
      <     smaller than
      <=    smaller than or equal
      >     greater than
      >=    greater than or equal
      ==    equality
      !=    inequality
      ..    defines a sequence (see further)
      !     inversion (of Boolean values)
      &&    Boolean AND
      ||    Boolean OR
      ? :   conditional expression (similar to C/C++)
      +=    a += b is a shorthand for a = a + b
      -=    a -= b is a shorthand for a = a - b
      *=    a *= b is a shorthand for a = a * b
      /=    a /= b is a shorthand for a = a / b
      ^=    a ^= b is a shorthand for a = a ^ b
      .*=   a .*= b is a shorthand for a = a .* b
      ./=   a ./= b is a shorthand for a = a ./ b
      .^=   a .^= b is a shorthand for a = a .^ b
      ^^=   atomic maximum
      __=   atomic minimum
      ~=    atomic bitwise exclusive or (XOR)
      |=    atomic bitwise OR
      &=    atomic bitwise AND
  9. Sequences: a sequence defines a row vector:
    a=0..9
    b=0..2..6
    The middle argument defines the step size. Generally, the sequence includes the specified lower and upper bounds (except when the step size is too large, such as 0..2..3 = [0, 2] or 0..100..10 = [0]). Hence, the above statements are equivalent to:
    a=[0,1,2,3,4,5,6,7,8,9]
    b=[0,2,4,6]
    The sequences can subsequently be used for matrix indexing:
    A=randn(64,64)
    A_sub = A[a,b]
    Example:
    a = 1..2..10
    b = sum(a)
    c = linspace(1, 2, 5)
    print "a = ", a, "c = ", c
    print sum = ", [b, sum(c)]
    The linspace function creates a uniformly spaced row vector of 5 values between 1 and 2, hence c = [1,1.25,1.5,1.75,2]. Implicit sequences (:) can be used to quickly index matrices:
    print A[:,0], A[0,:]
    This statement prints the first column of A, followed by the first row of A. Note that there is no Quasar equivalent for the MATLAB keyword “end”; however, it is still possible to use A[0..size(A,0)-1,0].
  10. Control structures: Quasar supports several control structures:
    for a=0..2..4
      break
      continue
    endfor
    if a==2
    endif
    if a==2
      ...
    elseif a==3
      ...
    else
      ...
    endif
    while expr
      break
      continue
    endwhile
    repeat
      break
      continue
    until expr
    Note that “if” is ended with “endif”. Also, “if” and “endif” statements must be spread over several lines of code; this is to improve the readability of the code. The following is NOT allowed:
    if a==2; do_something(); endif % Not allowed!
    An example of a for-loop:
    for i=1..2..100
      j=i+1
      print i, " ", j
      if i==1
        print "i is one"
      elseif i==3
        print "i is three"
      endif
    endfor
    Non-uniform ranges can be specified as follows:
    for powerOfTwo=[1,2,4,8,16,32,64,128]
      print powerOfTwo
    endfor
    or more conveniently as:
    for powerOfTwo=2.^(1..7)
      print powerOfTwo
    endfor
  11. Switches are also possible; the syntax is a little different, for example:
    match a with
    | 1 ->
      print "a=1"
    | 2 ->
      print "a=2"
    | (3, 4) ->
      print "a=3 or a=4"
    | "String" ->
      print "a=String"
    | _ ->
      print "a is something else"
    endmatch
    Note that different data types (i.e. strings and scalar numbers) can be mixed. Multiple case values can be specified (grouped by parentheses).
  12. Ternary operators: an inline if is also available, just like in C/C++:
    y = condition ? true_value : false_value
    y = (x > T) ? x - T : 0
    When the condition is true, only the true-part is evaluated; conversely, when the condition is false, only the false-part is evaluated.
  13. Lambda expressions: simply speaking, lambda expressions define inline functions, for example:
    v = (x,y) -> 2*x+y
    u = x -> 2*x
    w = x -> y -> x + y
    z = w(10)
    print v(1,2), " ", z(5)
    a = [[1,2],[2,1]]
    print v(a,a)
    print w(4)(5)
    Note that here, w is a lambda expression that returns another lambda expression (y -> x + y) when evaluated. As such, partial evaluation is possible, e.g., z=w(10) (see further in Section 4.10↓). Lambda expressions can contain several sub-expressions and can be spread over several lines, as follows:
    print_sum = (a, b) -> (sum=a+b;
    print(sum); sum)
    The different expressions are separated using semicolons (';'). The return value of the lambda expression is always the last expression (in the above example, sum). Using ternary operators, it is fairly simple to define recursive lambda expressions:
    factorial = x -> x > 0 ? x * factorial(x - 1) : 1
  14. Functions: the syntax for functions is different from the syntax for lambda expressions:
    function [outarg1, ..., outargM] = name (inarg1, ..., inargN)
    Here there are M output arguments (outarg1, ..., outargM) and N input arguments (inarg1, ..., inargN); “name” is the name of the function. Note that all output arguments must be assigned, otherwise the function call fails. An example:
    function y = do_something(x)
      y = x * 2
    endfunction
    a = [[1,2],[2,1]]
    b = do_something(a)
    print b
    Calling a function with multiple output arguments requires multiple variable assignment:
    function [x, y] = compute(a, b)
      x = a + b
      y = a * b
    endfunction
    [u, v] = compute(2,3)
    print u, " ", v
    Functions can contain inner functions (nested to arbitrary depth). The inner functions (direct children, not siblings) can then only be accessed from the outer function. For example:
    function y = colortransform (x : vec3, cname)
      function a = hsv2rgb (c)
        h = floor(c / 60)
        f = frac(c / 60)
        v = 255 * c
        p = v * (1 - c)
        q = v * (1 - f * c)
        t = v * (1 - (1 - f) * c)
        match h with
        | 0 -> a = [v, t, p]
        | 1 -> a = [q, v, p]
        | 2 -> a = [p, v, t]
        | 3 -> a = [p, q, v]
        | 4 -> a = [t, p, v]
        | _ -> a = [v, p, q]
        endmatch
      endfunction
      if cname=="hsv2rgb"
        y = hsv2rgb(x)
      else
        error "the specified color transform ", cname, " is not supported!"
      endif
    endfunction
    Argument types can optionally be specified, as shown above in x : vec3. Quasar checks at compile-time (and run-time) whether the arguments are of the correct type; otherwise an error is raised. The presence or absence of argument types has no further influence on the execution and end result of the program (except when types do not match and an error is generated). However, specifying argument types can help the Quasar optimizer to generate more efficient code.
    Function handles can also be used, for example:
    my_func = colortransform
    print my_func([0.2, 0.2, 0.3], "hsv2rgb")
  15. Optional function arguments: functions can have optional arguments. In case an argument is missing, the default value is used. For example:
    function [y, k] = my_func(b : mat, a : scalar = 4)
    print a + b
    y = k = 0
    endfunction
    my_func(2)
    Since my_func is called with one argument, the default value for the second argument will be used (4 in this case).
    Hence, functions can have multiple outputs and optional arguments, whereas lambda expressions cannot. Note that the optional function arguments can, in turn, be expressions and even function calls:
    function [y, k] = my_func(b : mat, a : mat = eye(4))
    function [y, k] = my_func(b : mat, a : mat = A .* B)
    function [y, k] = my_func(b : mat, a : mat = 2 * b)
    Note that by default, variable references (if the name does not correspond to another input argument) refer to the outer context in which the function is defined; they capture the value at the time the function is defined. The arguments are evaluated in the order in which they appear. The following would lead to an error:
    function [y, k] = my_func(a : mat = 2 * b, b : mat)
    Here, b is not defined at the time a = 2 * b is evaluated.
  16. Cell matrices: vectors, matrices and cubes can be grouped in a cell structure. Cell matrices are created either using the function cell or using the designated braces {} (see below). copy(.) performs a deep copy of a cell matrix (i.e. the function recursively applies copy(.) to all its elements). Some examples are given below:
    A = cell(2,2)
    G = cell(3)
    A[0,0] = eye(4)
    A[1,1] = 3
    A[0,1] = cell(1,4)
    A[0,1][1] = ones(3,3)
    A[0,1][1][0,0] = 2
    print A[0,1][1]*3
    print size(A)*2
    B = copy(A)
    C = B-A
    print C[0,0]
    D = {A,B,C}
    D_names = {"A","B","C"}
    print D_names[1]
    print size({})
    One important special feature is that operations on cell matrices are supported when the different operands have the same structure. It is possible to compute the sum of two cell matrices using:
    C = B+A
    Alternatively, we can multiply all elements of a cell matrix by a constant:
    C = B*4
    Or, we can use cell matrices in function calls (note that this is only allowed with built-in functions):
    C = max(B,4)
    Cell matrices are convenient structures, especially for rapid prototyping. However, because cell matrices can store any data, the type inference required for efficient parallelization may fail on cell matrices. To remedy this, it is useful to use fully typed cell matrices (see Section 3.5↓).
  17. Dynamic evaluation: string expressions can be parsed and evaluated at runtime using the eval(.) function:
    val = eval("(x) -> 3*eye(x)")(8)
    Here, the eval function parses the string expression "(x) -> 3*eye(x)" and returns a corresponding lambda expression. This lambda expression can then be evaluated at the same speed as “regular” lambda expressions. This can be useful for simulations (e.g. passing functions through the command line).
  18. Reading an input image:
    img_in = imread("lena_big.tif")
    Grayscale images return a two-dimensional matrix, color images return a three-dimensional cube, in which the length of the third dimension is either 3 (RGB) or 4 (RGBA - RGB with an alpha channel).
  19. The spread operator: the spread operator “...” allows unpacking vectors to arbitrary indices or function parameters. Using the spread operator, the following lines of code can be simplified:
    pos = [0,1,2]
    y = im[pos[0],pos[1],pos[2],0] % Before
    y = im[...pos, 0] % After
    luminance = (R,G,B) -> 0.2126 * R + 0.7152 * G + 0.0722 * B
    c = [128, 42, 96]
    lum = luminance(c[0],c[1],c[2]) % Before
    lum = luminance(...c) % After
    The spread operator is particularly useful in combination with variadic functions (see Section 4.8↓).
Importing .q files: .q files can contain multiple variable and function definitions which can be accessed from other .q programs. To do so, the import keyword can be used. The import keyword should be used only at the global scope (hence not within functions or control structures) and at the beginning of the Quasar module. For example:
import "system.q"
import "imfilter.q"
% all definitions from system.q and imfilter.q are now available
im = imfilter(imread("img.tif"),ones(7,7))
There is one exception: “main” functions are completely skipped and hence not imported (see Section 13.1↓). Also, .q files are only imported once (multiple imports have no effect and are ignored by the compiler), and the import statements must be placed at the beginning of the program.
Syntax notice: the control structure keywords are as follows: if → endif, for → endfor, type → endtype, while → endwhile, match → endmatch, function → endfunction, try → endtry. In older versions of Quasar, the keywords were: if → endif, for → end, type → end, while → end, match → end, function → end, try → end. We have found experimentally that matching the control structure endings with the beginnings not only enhances the readability of the code but also prevents certain types of bugs (especially with nested control structures). For legacy code, the Quasar compiler still accepts the old endings; this is done on a per-file basis. When both kinds of control structure endings are mixed, the compiler generates an error. Users are therefore encouraged to use the new control structure endings!


2.2 A brief introduction of the type system

Note: an in-depth explanation of the Quasar user-defined types will be given in Chapter 3↓; here we only give a brief introduction.
Quasar has an array-based type system that facilitates working with multi-dimensional data, including, for example, conversions between vectors and matrices. In general, variable types are implicit (hence they do not, in general, need to be specified by the user). In contrast to MATLAB/Octave, the Quasar compiler obtains the types of the variables through type inference. The type inference is not strict: if the compiler is not able to figure out the type of a variable, this variable will be considered to be of an unknown type (denoted by the type '??'). The main primitive types of Quasar are summarized in Table 2.2↓. The types vec, mat, cube, cvec, cmat, ccube, ... are actually shorthands for their corresponding generic versions with explicit type parameters (see further in Chapter 6↓); these shorthands are also listed in the table. Additional primitive types are given in Table 2.3↓.
Table 2.2 Quasar main primitive types. Note: to use the types marked with an asterisk (*), it is required to import the module “inttypes.q”.
type                        0-dim     1-dim          2-dim          3-dim           n-dim
integer number              int       ivec(*)        imat(*)        icube(*)        icube{n}(*)
   (shorthand for)                    vec[int]       mat[int]       cube[int]       cube{n}[int]
scalar number               scalar    vec            mat            cube            cube{n}
   (shorthand for)                    vec[scalar]    mat[scalar]    cube[scalar]    cube{n}[scalar]
complex scalar number       cscalar   cvec           cmat           ccube           ccube{n}
   (shorthand for)                    vec[cscalar]   mat[cscalar]   cube[cscalar]   cube{n}[cscalar]
The dimensionality can be specified via the brace syntax: e.g., cube{4} denotes a four-dimensional array. Quasar defines “infinite” dimensional data types: cube{:} and ccube{:}, although in practice, the current implementation only supports up to 16-dimensional data structures.
The relation between the different “dimensional” types is defined as follows:
vec ⊂ mat ⊂ cube ⊂ cube{:}
ivec ⊂ imat ⊂ icube ⊂ icube{:}
cvec ⊂ cmat ⊂ ccube ⊂ ccube{:}
Hence, every vector can be passed to a function requiring a matrix, and every matrix can be passed to a function requiring a cube. Whether a value A is a vector, matrix, or cube depends on the number of dimensions of A:
A has type vec if ndims(A)==1, mat if ndims(A)==2, cube if ndims(A)==3, and cube{n} if ndims(A)==n,
where ndims returns the total number of dimensions. Note that scalar numbers are not part of this relationship (hence scalar ⊄ vec). This is mainly for implementation efficiency.
The consequence is that functions defined for arguments of type cube can also accept arguments of type vec and mat. For example, for digital images, cube can represent both color images (with dimensions M × N × 3) and grayscale images (with dimensions M × N × 1). In some cases, it is useful to indicate size information in the type. This can be done by adding size parameters to the type: e.g., cube(:,:,3) denotes a 3D array for which the size in the last dimension is always 3. Size parameters give invaluable information to the compiler, allowing specialized code to be generated.
Explicitly annotating the types of variables can bring performance benefits, although for code simplicity it is advised to only specify the type when necessary: in many cases the type is obtained and propagated using type inference. For example, when using the function im=imread("image.png", "rgb"), the compiler will infer that the type of im is cube(:,:,3).
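For instance, the following sketch (the helper name is illustrative; the coefficients are those of the luminance example in Section 2.1↑) combines a size-parameterized argument type with type inference:
luminance = (im : cube(:,:,3)) -> 0.2126*im[:,:,0] + 0.7152*im[:,:,1] + 0.0722*im[:,:,2]
im = imread("image.png", "rgb") % inferred type: cube(:,:,3)
L = luminance(im) % OK; passing a grayscale image would violate the cube(:,:,3) constraint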
It is possible to check at run-time whether a variable (or intermediate result) has a certain type, using a type annotation (A : typename) or the function type(A, "typename"). Moreover, the checks can be combined with assertions, for example:
print(variable:cube[int])
print(1:cube) % Error: Type check failed: `int'const` is not `cube`
assert(type(1,"scalar"))
assert(type(1i,"cscalar"))
assert(type(zeros(2,2),"mat"))
assert(type("Quasar","string"))
In case one of the above type checks fails, a compiler error is generated. Additionally, the file system.q defines a number of lambda expressions for checking types:
isreal      = x -> type(x, "scalar")  || type(x,"vec")  || type(x,"mat")  
                || type(x, "cube")
iscomplex   = x -> type(x, "cscalar") || type(x,"cvec") || type(x,"cmat") 
                || type(x, "ccube")
isscalar    = x -> type(x, "scalar")  || type(x, "cscalar")
isvector    = x -> type(x, "vec")     || type(x, "cvec")
ismatrix    = x -> type(x, "mat")     || type(x, "cmat")  || isvector(x)
iscube      = x -> type(x, "cube")    || type(x, "ccube") || ismatrix(x)
Under some circumstances, the Quasar compiler is not able to figure out the types of the variables through inference. One example is the load function, which reads data from a file (through a process called deserialization) and stores it in variables:
[A, B] = load("myfile.dat")
The file load operation is only performed at runtime; the result depends on the content of the file being loaded, and correspondingly the compiler cannot predict the types of the variables. It then makes sense to give the compiler some type information, such that it can perform smart optimizations when needed:
assert(type(A,"ccube"))
assert(type(B,"vec"))
The assert function then has a twofold purpose: 1) it gives the compiler information about the types of A and B and 2) it performs a runtime check to validate the data read from “myfile.dat”.
An alternative (and perhaps cleaner) way to check the type of the variable is by using type annotations. The above example then becomes:
[A : ccube, B : vec] = load("myfile.dat")
In case the types do not match, the runtime system will generate an error message. Type annotations need to be declared only the first time the variable is used.
Finally, type conversion is generally not needed in Quasar (avoided for computational performance reasons), although a conversion table is given in Table 2.4↓. Only for generic programming purposes (see Chapter 6↓), an overridable type conversion function cast(x, new_type) is available.
Table 2.3 Additional primitive types (“first-class citizens”)
Type Purpose
string Sequences of characters
lambda_expr Lambda expressions
function Function handles
kernel_function Kernel functions
object Objects
?? Unspecified type (i.e. determined at run-time)
Table 2.4 Type conversion table
From/To int/ivec/imat/icube scalar/vec/mat/cube cscalar/cvec/cmat/ccube
int/ivec/imat/icube - float(.) complex(.) / complex(re,im)
scalar/vec/mat/cube int(.) - complex(.) / complex(re,im)
cscalar/cvec/cmat/ccube int(real(.)) / int(imag(.)) real(.) / imag(.) -
Lambda expressions can also be explicitly typed. Quasar follows typing conventions similar to the Haskell and ML programming languages. For example (a sketch; the bracketed function type notation is the one also used for device and kernel function types in Subsection 2.4.2↓):
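f : [(scalar, scalar) -> scalar] = (x, y) -> x + y % explicitly typed lambda expression
g : [(scalar) -> vec3] = t -> [t, t^2, t^3] % the type maps a scalar to a vector of length 3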
Lambda expressions (especially those that use closures, see Section 4.2↓) are very powerful in Quasar. To reduce the overhead associated with calling lambda expressions on computation devices, the Quasar compiler attempts to inline lambda expressions and functions whenever possible or beneficial.

2.2.1 Floating point representation

In Quasar, the internal representation of scalar numbers (“scalar”) and complex scalar numbers (“cscalar”) is usually not specified at the code level. This allows the floating point representation to be changed at a global level. By default, Quasar uses single precision floating point numbers (see Table 2.5↓). However, it is possible to compile and run programs using double precision as well, by passing the -double command line option to Quasar, e.g.:
./Quasar.exe -debug -double script.q
Table 2.5 Overview of floating point representations
                       IEEE 32-bit (single precision)    IEEE 16-bit (half precision)   IEEE 64-bit (double precision)
Quasar type            scalar'single                     scalar'half                    scalar'double
Significand            23 bits                           10 bits                        52 bits
Exponent               8 bits                            5 bits                         11 bits
Minimum pos. value     1.17549435 × 10^-38               5.9605 × 10^-8                 2.225073858507201 × 10^-308
Maximum pos. value     3.40282347 × 10^38                65504.0                        1.797693134862316 × 10^308
Exact integer repr.    -2^24+1 to 2^24-1 (16,777,215)    -2^11+1 to 2^11-1 (2,048)      -2^53+1 to 2^53-1
When numerical precision is not of utmost importance, it is recommended to use single precision. Note that some older GPUs have limited double precision FP support; CUDA devices before compute capability 1.3 do not have double precision FP support at all. Moreover, using double precision FP numbers doubles the memory bandwidth requirements. Consequently, programs using double precision may run up to 2x slower than programs using single precision FP. Additionally, several consumer GPUs (e.g., the GeForce series) have a double precision throughput that is much lower than their single precision throughput. However, there are some reasons to enable double precision in Quasar programs:
Often it is useful to check whether the program is not suffering from floating point inaccuracies. This can simply be done by running the program once in double precision mode.
Note: NVIDIA GTX 260, 275, 280, 285 and 295 chips (with compute capability 1.3) have a low performance in double precision computations (about 1/8 of the single precision performance). Devices of the NVIDIA Fermi architecture (compute capability 2.0+) achieve 1/2 of the single precision performance. Performance is greatly improved with either NVIDIA Tesla cards or the NVIDIA Titan (which is based on the Kepler architecture). For full details, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions.
Finally, recall that FP math is not associative, i.e. the sum (A + B) + C is not guaranteed to be equal to A + (B + C). When parallelizing computations, the order of the operations is often changed (and may even be unspecified), leading to results that may differ between runs, even with the same input data. This limitation is not inherent to Quasar, but applies to any approach that performs parallel computations using floating point numbers. The example “Accurate sum” gives more information on this issue (see Section 10.7↓); in Section 13.6↓ we explain how parallel code can be written that does not depend on this “non-deterministic” behavior.
The global constant “eps” is available for determining the machine precision (similar to FLT_EPSILON/DBL_EPSILON in C or eps in Octave/Matlab). The functions maxvalue(scalar) and minvalue(scalar) can be used to determine the maximum and minimum values representable in floating point format.
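For instance, a quick sketch of querying these limits (the printed values depend on the globally selected precision):
print "machine epsilon: ", eps
print "range: ", minvalue(scalar), " to ", maxvalue(scalar)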

2.2.2 Mixed precision floating point computations

It is recommended to use the default scalar data type (with globally defined precision) as much as possible. However, in some cases, it is desirable to perform certain parts of the computation in a higher precision (e.g., when numerical accuracy is important) or even in a lower precision (e.g., when memory and/or computation time counts). Therefore Quasar allows specifying the required precision as part of the scalar type:
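For example (a sketch; the precision-annotated scalar types scalar'half, scalar'single and scalar'double are those listed in Table 2.5↑):
x : scalar'half = 1.5 % half precision (FP16)
y : scalar'single = 0.25 % single precision (FP32)
z : scalar'double = x * y % result accumulated in double precision (FP64)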
The half precision floating point representation (FP16) cannot be set globally (due to its inherent precision and range restrictions); however, it is possible to write functions that mix scalar types of different precision. Mixing FP16 and FP32 is common in machine learning (e.g., convolutional neural networks). Pascal, Volta and Turing GPUs also support FP16 computations natively. Volta’s tensor cores provide mixed matrix multiply-accumulate operations, in which the matrices are stored in FP16 and the result is accumulated in FP32.
Remark: Quasar currently does not support bfloat16, the half-precision floating point format supported by Google’s Tensor Processing Units (TPUs).

2.2.3 Integer types

Next to floating point numbers, Quasar has also (limited) support for integer types. The default integer type is “int” (signed integer). Its bit length depends on the computation engine, but is guaranteed to be at least 32-bit.
There are also integer types with a pre-defined bit length. These are mainly provided 1) to enable more efficient input/output handling (e.g. reading/writing of images in integer format), or 2) to write certain algorithms in which memory usage/memory bandwidth should be as low as possible. Generally, the use of integers with a pre-defined bit length should be avoided. For completeness, these types were already listed in Section 2.1↑: int8, uint8, int16, uint16, uint32.
A matrix containing 8-bit integers can be obtained as follows:
A = mat[int8](rows,cols)
Note that, by default, arithmetic operations for integer matrices are disabled (e.g. summing, subtracting, conversion to floating point etc.). These operations can be enabled by importing the inttypes library (import "inttypes.q").
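A minimal sketch (variable names illustrative; assumes inttypes.q is available on the import path):
import "inttypes.q"
A = mat[int8](4,4)
B = mat[int8](4,4)
C = A + B % integer matrix arithmetic, enabled by inttypes.q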
Integer types can have special modifiers (a modifier can be added by writing an apostrophe ' directly after the type name). These modifiers indicate how the conversion from a floating point number (or an integer number with a larger bit depth) to the considered integer type takes place.
The following function, which sums two 8-bit unsigned integer matrices, illustrates the usage of integer modifiers:
function y : mat[uint8'sat] = add(a : mat[uint8], b : mat[uint8])
  y = mat[uint8'sat](size(a,0),size(a,1)) % allocate the output matrix (added so that y is assigned)
  for m=0..size(a,0)-1
    for n=0..size(a,1)-1
      y[m,n] = a[m,n] + b[m,n]
    endfor
  endfor
endfunction
Here, integer saturation is used in case the sum of a[m,n] and b[m,n] does not fall in the range 0..255.

Important note

Variables are implicitly integer when defined by a constant with no decimal sign. This may have some unexpected consequences as in the following example:
a = 0 % a is an integer
for n=0..N-1
  a += x[n] % x[n] is implicitly converted to integer
endfor
Here, x[n] is automatically converted to an integer. To warn the programmer, a type conversion warning message is shown. If it is desired to declare a as a scalar value, the first line needs to be replaced by a = 0.0.

2.2.4 Fixed sized datatypes

As already mentioned in Section 2.2↑, vectors and matrices can optionally have their size specified in any dimension. For example, a : cube(:,4,:) always has size(a,1) == 4. The size acts as a constraint on the specialized type; the constraint is checked by both the compiler and the runtime. ':' indicates that the length is unspecified (not known at compile-time).
Using this technique, it is easy to define single instruction multiple data (SIMD) operations. For x86/64 CPU targets in Quasar, vectors of length 4 (vec(4)) may be mapped onto SSE datatypes, while vectors of length 8 (vec(8)) may be mapped onto AVX datatypes (depending on the configuration settings and on whether the specific operation is accelerated by the processor). See Chapter 12↓ for more information.
In generic functions, it is possible to use type parameters for this purpose (for example cube(:,:,P)). For the following function:
function [] = color_transform[P : int](im_in : cube(:,:,P), im_out : cube(:,:,P))
...
endfunction
array size errors can be caught early in the development process: when the function color_transform is called and the compiler cannot guarantee that im_in and im_out have the same size in the third dimension, a compiler error will result.
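For instance, a sketch of a mismatched call (whether the error is reported at compile-time or run-time depends on what the compiler can infer):
im_in = imread("image.png", "rgb") % inferred type: cube(:,:,3)
im_out = zeros(size(im_in,0), size(im_in,1), 4) % third dimension is 4
color_transform(im_in, im_out) % error: the sizes in the third dimension (3 vs. 4) cannot be proven equal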

2.2.5 Higher dimensional matrices

Higher-dimensional matrices (with dimension > 3) need to be specified using an explicit dimension parameter. For example cube{4} denotes a 4-dimensional array. The array with unspecified (infinite) dimension in Quasar is cube{:}. This is useful for generic specialization purposes (see further in Chapter 6↓). To express higher-dimensional loops for which the dimensionality is not known in advance, the functions ind2pos and pos2ind convert between position coordinates and linear indices, as illustrated by the following example:
function y:'unchecked = multidim_subsampling(x : cube{:}, factor : int)
  y = uninit(int(size(x)/factor))
  {!parallel for}
  for i = 0..numel(y)-1
    p = ind2pos(size(y), i)
    y[p] = x[p*factor]
  endfor
endfunction
Here, a higher dimensional cube is subsampled by factor along all dimensions of the cube. This technique is particularly useful for implementing operations with unknown dimensionality of the input parameters (as e.g., in Kronecker products).
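A usage sketch (assuming ones(.) also accepts a four-element size vector, cf. Section 2.1↑):
x = ones([8, 8, 8, 8]) % a cube{4} filled with ones
y = multidim_subsampling(x, 2)
print size(y) % expected output: [4,4,4,4]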

2.2.6 User-defined types, type definitions and pointers

Quasar supports user-defined types (UDTs) and pointers: the user-defined types are defined as classes, as illustrated below:
type point : class
x : scalar
y : scalar
endtype
The type keyword is always followed by a type definition. The class point can be instantiated using its default constructor:
p = point()
or:
p = point(x:=4, y:=5)
Remark that the arguments of the constructor are named. The order of the arguments can then also be changed:
p = point(y:=5, x:=4)
By default, classes in Quasar are immutable. This means that, once initialized, the value of the class cannot be changed (or a compiler error will be generated)! Classes can also be made mutable, as follows:
type point : mutable class
x : scalar
y : scalar
endtype
Immutable classes allow some optimizations to be applied: for example, they can be stored in constant device memory, some memory transfers are eliminated, and the Quasar runtime does not need to check whether the value of the class has been changed in device memory. For these reasons, it is recommended to use immutable classes whenever possible.
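A short sketch (the dot syntax for field access is assumed here for illustration):
p = point(x:=4, y:=5)
p.x = 10 % allowed with the mutable definition above; a compiler error with the immutable one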
Additionally, a UDT can contain other UDTs:
type rectangle : class
p1 : point
p2 : point
endtype
Remark that the fields of the rectangle (p1, p2) are stored in-place. This means that the internal storage size of the UDT is the sum of the storage sizes of its fields. In this case, using single precision FP, elements of the point class take 8 bytes and consequently elements of the rectangle class take 16 bytes.
Like in other programming languages (e.g. C/C++, Pascal), it is also possible to define a rectangle that stores references to the point class. Therefore, Quasar supports Pascal-type pointers:
type pt_rectangle : class
p1 : ^point
p2 : ^point
endtype
Remark that in many programming languages, pointers can be a source of programming errors (e.g. dangling pointers, uninitialized pointers etc.). For this reason, the pointers in Quasar have special properties that make them safe to use.
Internal detail: the pointers in Quasar rely on a customized form of reference counting to help track the allocation of memory, including a technique to solve memory leaks caused by potential circular references. Moreover, the pointers abstract away the particular device: the object can reside in CPU memory, GPU memory, or both.
It is also possible to define (multi-dimensional) arrays of UDTs, using parametric types:
type point_vec : vec[point]
type point_mat : mat[point]
type point_cube : cube[point]
type rectangle_vec : vec[rectangle]
type pt_rectangle_vec : vec[^pt_rectangle]
Using UDT arrays is often more efficient than storing the individual elements of the UDT in separate matrices. This is because 1) the indexing often only needs to be performed once and 2) memory coalescing and caching behavior is better. The UDT arrays can be initialized with zeros (or with nullptr’s), in the following way:
a = point_vec(10)
b = point_mat(4, 5)
c = point_cube([1, 2, 3])
Note that a type definition (type x : y) is required for this construction. The following is (currently) not supported:
a = vec[point](10)
Moreover, the multi-dimensional arrays and UDTs may contain variables with unspecified types:
type point : class
x : ??
y : ??
endtype
type cell_vec : vec[??]
type cell_mat : mat[??]
type cell_cube : cube[??]
One caveat is that variables with unspecified types do not support automatic parallelization (see further in Section 2.3↓) and cannot be passed to kernel functions (see Subsection 2.4.1↓).
UDTs can also contain vectors/matrices:
type wavelet_bands : mutable class
LL : ^wavelet_bands
HL : mat
LH : mat
HH : mat
endtype
The consequence is that this class does not have a default constructor (wavelet_bands()): there are no default values for matrices, and nullptr’s are not used as implicit defaults. Hence, it is necessary to explicitly specify the field values of wavelet_bands:
bands = wavelet_bands(LL:=nullptr,
HL:=ones(64,64),
LH:=ones(64,64),
HH:=ones(64,64))

2.3 Automatic parallelization

The Quasar compiler automatically attempts to parallelize for-loops, depending on the matrix indexing scheme, input/output variables, constants and data dependencies. For example, consider the following sequential code fragment, which performs spatial filtering using a box filter (mask):
im = imread("image_big.tif")
im_out = zeros(size(im))
N = 5
mask = ones(2*N+1,2*N+1)/(2*N+1)^2
for m=0..size(im,0)-1
  for n=0..size(im,1)-1
    a = [0.,0.,0.]
    for k=-N..N
      for l=-N..N
        a += mask[N+k,N+l] * im[m+k,n+l,0..2]
      endfor
    endfor
    im_out[m,n,0..2] = a
  endfor
endfor
automatically expands to the following equivalent parallel program:
im = imread("image_big.tif")
im_out = zeros(size(im))
N = 5
mask = ones(2*N+1,2*N+1)/(2*N+1)^2
function []=__kernel__ parallel_func(im:cube,im_out:cube,mask:mat,N:int,pos:ivec2)
  a = [0.,0.,0.]
  for k=-N..N
    for l=-N..N
      a += mask[N+k,N+l] * im[pos[0]+k,pos[1]+l,0..2]
    endfor
  endfor
  im_out[pos[0],pos[1],0..2] = a
endfunction
parallel_do(size(im,0..1),im,im_out,mask,N,parallel_func)
In this program, first a kernel function is defined. Next, the parallel_do function launches the kernel function in parallel for every pixel in the image im. The kernel function processes exactly one pixel, and is called repeatedly by the function parallel_do. When compiling the Quasar program, the kernel functions and automatically parallelized loops are compiled, depending on the computation engine being used, to CUDA or to native C++ code (using OpenMP). This ensures optimal usage of the computational resources.
The Quasar optimizer may fail to extract a parallel program, for example because the type of certain variables is not known, or because certain dependencies between variables have been detected (the latter causes the loop to be executed serially, i.e. “serialized”). For mapping algorithms onto hardware, variable types need to be well defined. As explained in Section 2.2↑, when the variable type is not specified, Quasar uses type inference to derive the exact type from the context. When this fails, warning messages are displayed on the console during compilation; these can help to make the program parallel, e.g., by explicitly declaring the type (B : vec = ...). Quite often, it is the intention of the programmer to have a parallel loop. In this case, it is possible to interrupt the program compilation when the loop parallelization fails (thereby generating a compiler error), by putting {!parallel for} directly before the for-loop to be parallelized:
{!parallel for}
for m=0..size(im,0)-1
  for n=0..size(im,1)-1
  endfor
endfor
In case the compiler detects some dependencies between the variables, a warning message will be displayed reporting these dependencies, and the loop will be parallelized despite the warnings (possibly causing data races).
There are some scenarios that currently cannot be handled by the auto-parallelizer (for example, polymorphic variables that change type during the loop). Additionally, certain features, such as shared memory functions and thread synchronization, cannot be used from inside loops. Therefore, and also for full flexibility, it is also possible to perform the parallelization completely manually. This is described in the following section.

2.4 Writing parallel code using kernel functions

In this section, we describe two ways of writing parallel code: 1) the basic approach, using kernel functions (Subsection 2.4.1↓), and 2) the advanced approach, which additionally uses shared memory and synchronization (Subsection 2.4.4↓).
Most algorithms can be efficiently implemented using the “basic” approach. The advanced usage consists of synchronization, dealing with data races, sharing memory across multiprocessors and other GPU programming techniques, which may lead to increased performance by taking more advantage of the available features.
For beginning users, it is advisable to first get acquainted with the basic usage techniques before considering the advanced usage. Additionally, kernels written using the “basic usage” approach are often further optimized by the Quasar compiler to use some of the more advanced features (see Chapter 17↓).

2.4.1 Basic usage: kernel functions

A kernel function is a Quasar function with the special attribute __kernel__ that can be parallelized. Kernel functions are launched in parallel over every element of a certain matrix, using the “parallel_do” built-in function. The __kernel__ attribute specifies that the function should be natively compiled for the targeted computation engine (e.g. CUDA, CPU). Consequently, __kernel__ functions are considerably faster in execution than host functions, thanks to their parallelization and native code generation. As an example, consider the following algorithm:
function [] = __kernel__ color_temperature(x : cube, y : cube, temp,
    cold : vec3, hot : vec3, pos : ivec2)
  input = x[pos[0],pos[1],0..2]
  if temp<0
    output = lerp(input,cold,(-0.25)*temp)
  else
    output = lerp(input,hot,0.25*temp)
  endif
  y[pos[0],pos[1],0..2] = output
endfunction
hot = [1.0,0.2,0.0]*255
cold = [0.3,0.4,1]*255
img_in = imread("image_big.tif") % input image (definition added so the snippet is self-contained)
temp = 2.0 % color temperature parameter (illustrative value)
img_out = zeros(size(img_in))
parallel_do(size(img_out,0..1),img_in,img_out,temp,cold,hot,color_temperature)
The kernel function is launched on a grid of dimensions “size(img_out,0..1)” using the parallel_do construct. This means that every pixel in img_out will be addressed individually by parallel_do, and correspondingly the function color_temperature will be called for every pixel position.
For a kernel function, all parameters need to be explicitly typed. Recall that untyped parameters in Quasar are denoted by ??. Code using untyped parameters cannot be mapped onto an efficient implementation; therefore, the compiler will first try to deduce the type from the context. The kernel function is then treated as a generic function (see Chapter 6↓). If the type deduction fails, a compiler error results.
To write efficient kernel functions it is recommended to use vector and matrix types with size constraints. For example:
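The following is a sketch (the kernel name and the 3 × 3 color transform matrix are illustrative):
function [] = __kernel__ transform(x : cube(:,:,3), y : cube(:,:,3), M : mat(3,3), pos : ivec2)
  y[pos[0],pos[1],0..2] = x[pos[0],pos[1],0..2] * M % a vector of length 3 times a mat(3,3) yields a vector of length 3
endfunction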
The explicit size of the vector and matrix parameters allows calculations involving the variables to be mapped onto an efficient implementation. Additionally, it also helps the type inference: for example, the product of a vector of length 4 (vec(4)) and a 4 × 4 matrix (mat(4,4)) results in a vector of length 4.
However, some datatypes cannot be passed as arguments to kernel functions: cell matrices containing unknown types (??) and strings. To pass cell matrices, use parametrized matrix types (vec[cube], mat[cube], mat[vec], etc., see Chapter 3↓). Strings need to be converted to vectors (using the functions fromascii, fromunicode). Device functions (__device__), possibly containing closure variables (see Section 4.2↓), can also be passed.
For vectors of length ≤ 64, it is most efficient to add the length explicitly in the type, as above. Vectors with a length known at compile-time are treated in a special way: the components of the vector are grouped together, requiring fewer memory read/write requests, and may be implemented using SIMD instructions if the underlying back-end compiler supports them. At the very least, these vectors are allocated on the stack (or in registers), which is significantly faster than kernel dynamic memory (see Section 8.3↓). Matrices with size constraints (such as matXxY, cubeXxYxZ) are also treated as fixed-length vectors internally.
Remark: the current Quasar implementation places a limit on the maximum length of the constraint, or the product of the dimensions (e.g., X × Y for matXxY, X × Y × Z for cubeXxYxZ); this limit is 64. When the vector length is longer, the specification of the value has no effect (apart from type inference purposes).
Inside __kernel__ and __device__ functions, it is recommended to use integers instead of scalars (when possible). This may yield a speed-up of about 30% for CUDA targets and even more for CPU targets. When a scalar constant contains a decimal point (e.g., 1.2), the compiler will consider this constant to be a floating point number, otherwise it will be treated as an integer.
The syntax of the parallel_do function is as follows:
parallel_do(dimensions, inarg1, ..., inargN, kernel_function)
where dimensions is a vector. Note that normally kernel functions cannot have output arguments (there is a special advanced feature that allows kernel functions to return values of certain types, see Section 4.7↓, but this feature is only for specific use-cases). Instead, in most use cases the return values should be written to input arguments that are passed by reference, i.e. arguments of types vec, mat, cube, cvec, cmat, ccube.
There are some special arguments that can be declared in the kernel function definition but that do not need to be passed to parallel_do, such as pos, blkpos, blkidx and blkdim (illustrated in the pseudo-code below).
The parallel_do function basically executes the following sequential program in parallel:
blkdim = choose_optimal_block_size(kernel_function) % done automatically
for m=0..dimensions[0]-1
  for n=0..dimensions[1]-1
    for p=0..dimensions[2]-1
      pos = [m,n,p]
      blkpos = mod(pos, blkdim)
      blkidx = floor(pos/blkdim)
      kernel_function(inarg1, ..., inargN, [pos], [blkpos], [blkidx], [blkdim])
    endfor
  endfor
endfor
Here, first the optimal block dimensions (blkdim) for the given kernel function are selected. Then, kernel_function is run inside the three loops, prod(dimensions) = dimensions[0] × dimensions[1] × dimensions[2] times.
Special modifiers are available for kernel function arguments. The modifiers are specified using the apostrophe-symbol:
function [] = __kernel__ imfilter_kernel_nonsep_mirror_ext(y : cube’unchecked,
x : cube’unchecked, mask : mat’unchecked’const, center : ivec2, pos : ivec3)
These modifiers specify how vector/matrix/cube elements are accessed and, in particular, enable efficient boundary handling in image processing.
The default access modes are currently ’safe (inside kernel/device functions) and ’checked outside of kernel/device functions (for performance reasons). In case the program behavior depends on the access mode, it is best to explicitly indicate the access mode.
Finally, there are some rules with respect to the calling conventions for kernel functions; for example, as noted above, results are returned by writing to arguments that are passed by reference.
There are also some special functions that can only be used within kernel functions, such as shared, shared_zeros and syncthreads (see Subsection 2.4.4↓).

2.4.2 Device functions

In the example in the previous section, the linear interpolation function lerp is defined as:
lerp = __device__ (a : scalar, b : scalar, d : scalar) -> a + (b - a) * d
Device functions are the only functions (next to kernel functions) that can be called from a kernel/device function. The __device__ attribute specifies that the function should be natively compiled for the targeted computation engine (e.g. CUDA, CPU); however, in contrast to kernel functions, device functions cannot be used as the argument of a parallel_do call. Device functions are hence useful building blocks for writing kernel functions. For example, if one often needs a 2D vector that is orthogonal to a given 2D vector, one can define:
orth = __device__ (x : vec2) -> [-x[1], x[0]]
The function orth can then be used from other functions (also outside kernel/device functions).
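For instance, a quick usage sketch from host code:
print orth([1.0, 2.0]) % prints [-2,1]
print lerp(0.0, 10.0, 0.5) % prints 5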
Table 2.6↓ lists whether functions of different types can call each other. Note that a number of combinations (marked “No” in the table) are not supported:
Table 2.6 Quasar: which function types can call ...?
From\To        host   __device__   __kernel__
host           Yes    Yes          parallel_do/serial_do only
__device__     No     Yes
__kernel__     No     Yes
Remark: kernel and device functions have dedicated types, containing respectively __kernel__ and __device__ type modifiers. For the above definitions:
imfilter_kernel_nonsep_mirror_ext : _
[__kernel__(cube,cube,mat,ivec2,ivec3) -> ()]
lerp : [__device__(scalar,scalar,scalar) -> scalar]
orth : [__device__(vec2) -> vec2]
These types can be used for defining more general functions that use device/kernel functions as input argument. For example:
add = __device__ (x : scalar, y : scalar) -> x + y
sub = __device__ (x : scalar, y : scalar) -> x - y
mul = __device__ (x : scalar, y : scalar) -> x * y
orth = __device__ (x : vec2) -> [-x[1], x[0]]
ident = __device__ (x : scalar) -> sub(add(x, 2*x), 2*x)
function [] = __kernel__ my_kernel (X : mat, Y : mat, Z : mat, pos : ivec2)
Z[pos] = add(X[pos], Y[pos])
v = orth([X[pos], Y[pos]])
endfunction
X = ones(4,4)
Y = eye(4)
Z = zeros(size(X))
parallel_do(size(Z),X,Y,Z,my_kernel)
One special feature of device functions, is that they can be used as function pointers and passed to kernel functions. This can be used to reduce the number of kernel functions, or as an alternative to dynamic code generation:
% Definition of a __device__ function type
type binary_function : [__device__ (scalar, scalar) -> scalar]
add = __device__ (x : scalar, y : scalar) -> x + y
sub = __device__ (x : scalar, y : scalar) -> x - y
mul = __device__ (x : scalar, y : scalar) -> x * y
function [] = __kernel__ arithmetic_op(Y : cube, _
A : cube, B : cube, fn : binary_function, pos : ivec3)
Y[pos] = fn(A[pos], B[pos])
endfunction
A = ones(50,50,3)
B = rand(size(A))
Y = zeros(size(A))
parallel_do(size(Y),Y,A,B,add,arithmetic_op)
Unfortunately, there is a performance penalty associated with function pointer calls: the extra indirection prevents the compiler from inlining the function. For this reason, when possible, the Quasar compiler will attempt to avoid function pointer calls (by substituting the exact function).

2.4.3 Memory usage inside kernel or device functions

There are four types of memory that can be used inside kernel or device functions:
  1. local memory: this is memory that is local to the function; each parallel run of the kernel function (called a 'thread') has a private copy of this memory. Below are a few examples of the creation of local memory:
    A = [0, 1, 2, 3] % Generates a variable of type ’ivec4’
    B = [0., 1., 2., 3.] % Generates a variable of type ’vec4’
    C = [1 + 1j, 2 - 2j] % Generates a complex-valued variable of type ’cvec2’
    D = ones(6) % Generate a vector of length 6, filled with 1.
    E = zeros(8) % Generates a vector of length 8, filled with 0.
    F = complex(zeros(4)) % Generates a complex-valued vector of length 4
    For the GPU computation engine there are, however, a few limitations: local memory is internally stored in device registers, and is hence very fast, but also scarce. When the maximum number of device registers is exceeded, global memory is used instead (which has a much larger memory access latency).
  2. shared memory: this type of memory is shared across threads, and allocated using the functions shared and shared_zeros. Its usage is discussed in Subsection 2.4.4↓.
  3. global memory: this type of memory is used for storing vectors and matrices with either large dimensions or dimensions that cannot be determined at compile-time. Global memory is also used for dynamically allocated objects (see Section 8.3↓). For example:
    function [] = __kernel__ my_kernel (X : mat, pos : ivec2)
    % X[pos] is stored in global memory
    endfunction
    X = ones(4096,4096)
    parallel_do(size(X), X, my_kernel)
    Here, X is allocated outside a kernel function. The values of X, in total 4 × 4096 × 4096 bytes (in the case of 32-bit floating point), are automatically stored in global memory in a linear way. The following formula is used for translating a 3D position to a linear index:
    index(dim1, dim2, dim3) = (dim1 × N2 + dim2) × N3 + dim3,
    where N2 and N3 denote the sizes of the second and third dimensions. The pos2ind(size, pos) function performs exactly this calculation (ind2pos(size, index) performs the inverse). Global memory can reside in CPU memory, GPU memory or both. When calling a kernel function using parallel_do in the GPU computation engine, the global memory will automatically be transferred to the GPU. Because the maximum amount of local memory and shared memory that can be used is limited by the hardware (e.g., not more than 48K), global memory is the only way to pass large amounts of data to a kernel function. The only restriction is that a kernel/device function cannot allocate global memory; the memory should be allocated in advance and passed to the function.
    In some cases, a Quasar program may run out of global GPU memory. In that case, Quasar will automatically transfer a non-frequently used memory buffer back to the CPU. This memory buffer can be later transferred back to the GPU. By this technique, Quasar programs can use all the available memory in the system (both CPU and GPU).
  4. texture memory: texture memory is read-only global memory that is internally optimized for spatial access patterns (whereas global memory is more optimal for linear accesses). In particular, the data layout is optimized for texture sampling using nearest neighbor or linear interpolation. CUDA uses space filling curves (see http://en.wikipedia.org/wiki/Space-filling_curve) for optimizing the data layout. See Section 9.3↓ for more information.
Finally, it is important to mention that local memory should be used sparingly (or at least with care), because on the GPU the local memory is mapped directly onto the device registers. In CUDA, the total size of the device registers is 32K for compute capability 2.0 and 64K for compute capability 3.0 and higher. However, the device registers are shared across all computing threads: hence, when invoking 512 threads in parallel, the total amount of local memory available to a kernel/device function is respectively 64 bytes and 128 bytes! If more registers are used, the kernel will still execute, but global memory is used instead (this is called register spilling). To avoid the performance impact, the Quasar runtime decreases the number of threads being spawned. The maximum number of threads that a given kernel function uses can be determined using the expression prod(max_block_size(my_kernel)). See also Subsection 2.4.4↓ for more information.

2.4.4 Advanced usage: shared memory and synchronization

For this section, a basic familiarity with the GPU architecture is required. Internally, chunks of data are processed in blocks, as follows:
pos = [m,n,p]
blkpos = mod(pos, blkdim)
blkidx = floor(pos/blkdim)
blkcnt = ceil(size(y)./blkdim)
Within one block, a kernel function can access data from other threads running in parallel on that block. This is very useful for implementing some special parallel algorithms, such as parallel sum, parallel sort, and spatially recursive filters. However, read/write operations can interfere (data races), so special care is needed.
Advanced usage consists of 1) using thread synchronization, 2) using shared memory, 3) dealing with data races.
Thread Synchronization
The number of threads that run in parallel over one block can be calculated using prod(blkdim), i.e., the product of the block dimensions. Sometimes it is necessary that each thread waits until a given operation has completed, by means of a thread barrier. All threads (within one block!) then wait until completion of the operation. In Quasar, this is done using the syncthreads keyword:
function [] = __kernel__ my_kernel (y : mat, z : mat, pos : ivec2)
y[pos] = 10*pos[0]
syncthreads % all threads wait here until the above write has been completed
z[pos] = y[pos]*2
endfunction
It is important to mention that thread synchronization is performed on the block level, rather than on the full grid. In the above example, when syncthreads is first encountered, only the values y[pos] with pos ∈ [0..blkdim[0]-1] × [0..blkdim[1]-1] will have been computed, and not the complete matrix y!
Finally, correct usage of syncthreads requires that all threads of a block effectively reach the barrier. It is for example not allowed to put a synchronization barrier inside a conditional if...else... clause, unless it is certain that each thread encounters the same number of barriers while running the kernel function (see the sketch below). syncthreads may also take a parameter indicating the synchronization granularity, for example syncthreads(block), syncthreads(warp), syncthreads(grid), syncthreads(host). For more details, see Subsection 9.8.1↓.
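For example, the following kernel is invalid, because only part of the threads of a block reach the barrier (a minimal sketch):
function [] = __kernel__ wrong_kernel (y : mat, pos : ivec2)
if pos[0] < 8
y[pos] = 0.0
syncthreads % NOT allowed: threads with pos[0] >= 8 never reach this barrier
endif
endfunction
Moving syncthreads outside the if-block makes the kernel valid, because every thread then encounters the barrier.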
Remark: note that syncthreads does not ensure that global memory writes performed by the block are completed and can be seen by other threads. If this is required, consider using memfence (see Subsection 9.8.3↓).
Shared Memory
The GPU has several memory types (global memory, texture memory, shared memory, registers etc.). To achieve the best performance, the right memory type should therefore be used for each task. Global memory is used by default and registers are used for local calculations within kernel/device functions. Shared memory is visible to all threads of a kernel function within one block and can be allocated using the functions shared(.) or shared_zeros(.) from kernel functions. Its usage is as follows:
var1 = shared(dim); % vector
var2 = shared(dim1,dim2); % matrix
var3 = shared(dim1,dim2,dim3); % cube
var4 = shared_zeros(dim); % vector initialized with 0’s
var5 = shared_zeros(dim1,dim2); % matrix initialized with 0’s
var6 = shared_zeros(dim1,dim2,dim3); % cube initialized with 0’s
var7 = shared[uint8](dims) % generic memory allocation
var8 = shared_zeros[uint64](dims) % generic memory allocation initialized with 0’s
syncthreads % REQUIRED in case of shared_zeros!!!
Shared memory is visible and shared within one block. That means that, when going to another block (e.g. when blkidx changes), the content of the shared memory cannot be relied on. Use shared_zeros only when you want to initialize the memory with zeros; the memory allocated with shared is, like in C/C++, not initialized, which is often faster.
Important: one or multiple shared_zeros calls always need to be followed by a syncthreads statement (as shown in the example below). This is because the memory initialization by shared_zeros is performed in parallel by all threads of the block. Hence, before any thread starts using the allocated memory, it is necessary to wait until the zero initialization has fully completed. In fact, the (internal) implementation of shared_zeros is as follows:
function [] = __kernel__ shared_mem_example(blkpos : ivec3, blkdim : ivec3)
A = shared(100) % One vector of 100 elements
% Compute the index of the current thread
threadId = (blkpos[0] * blkdim[1] + blkpos[1]) * blkdim[2] + blkpos[2]
nThreads = prod(blkdim) % Number of threads within one block
for i=threadId..nThreads..numel(A)-1 % Parallel initialization
A[i] = 0.0
endfor
syncthreads % Make sure all threads have finished before continuing!

% Is equivalent to
B = shared_zeros(100)
syncthreads % Make sure all threads have finished before continuing!
endfunction
There are however two caveats when using shared memory:
  1. For the CUDA computation engine, the amount of shared memory per block is typically 32 KB. On CUDA architectures, shared memory is on-chip and much faster than off-chip memory; consequently, the amount of shared memory is limited. Taking into account that a (single precision) floating point value takes 4 bytes, the maximum dimensions of a square block of shared memory that you can allocate are 64 × 64. The more shared memory a kernel uses, the fewer blocks can be executed in parallel on the GPU.
  2. To obtain maximal performance benefits when using shared memory, it is important to make sure that the compiler can statically determine the amount of memory that will be used by the kernel function. If not, the compiler will assume that the kernel function takes all of the available shared memory on the GPU, which prevents the hardware from processing multiple blocks in parallel. For example, if you request:
    x = shared(20,3,6)
    the compiler will reserve 20 × 3 × 6 × 4 bytes = 1440 bytes for the kernel function. However, often the arguments of the function shared are non-constant. In this case you can use assertions (see further in Chapter 5↓):
    assert(M<8 && N<20 && K<4)
    x = shared(M,N,K)
    With this assertion, the compiler is able to infer the amount of required shared memory: in this case, 8 × 20 × 4 × 4 bytes = 2560 bytes. The compiler then gives the following message:
    Information: Calculated an upper bound for the amount of shared memory: 2560 bytes
Due to these restrictions, shared memory should be used in a “smart” way and with care.
Dealing with data races
To solve data races, one can use atomic operations (e.g., +=, -=, /=, *=, ^=, ...). Atomic operations are serialized, so the end result of the computation will always be correct. Atomic operations are often used in combination with synchronization barriers (see above). For example:
function [] = __kernel__ my_kernel(x : mat, y : vec, blkpos : ivec2, blkdim : ivec2)
bins = shared(blkdim) % allocates shared memory (every element is written below)
nblocks = (size(x)+blkdim-1)./blkdim % number of blocks covering x

% step 1 - do some computations
val = 0.0
for m=0..nblocks[0]-1
for n=0..nblocks[1]-1
val += x[blkpos + [m,n] .* blkdim]
endfor
endfor
bins[blkpos] = val
% step 2 - synchronize all threads using this barrier
syncthreads
% Now it is safe to read from the variable bins
endfunction
Hint: only use atomic operations when necessary. If there is no possibility of a data race, it is more efficient to use the non-atomic counterparts (e.g. y[pos] = y[pos] + 1). On GPUs, atomic operations in shared memory are also more efficient than atomic operations in global memory. Often, atomic operations can be avoided altogether using the parallel reduction algorithm (see Section 8.4↓).
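As an illustration, the following sketch computes a histogram using an atomic += in global memory (a hypothetical example; the input is assumed to contain integer values in the range 0..255):
function [] = __kernel__ hist_kernel (x : mat, hist : vec, pos : ivec2)
hist[int(x[pos])] += 1 % atomic add: concurrent updates of the same bin are serialized
endfunction
im = floor(256*rand(128,128)) % test data with values in 0..255
hist = zeros(256)
parallel_do(size(im), im, hist, hist_kernel)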
Warp size
The warp size is the number of threads in a warp, a subdivision used in the GPU hardware implementation for memory coalescing and instruction dispatch. The warp size is important for branching: branch divergence occurs when not all threads within a warp follow the same execution path; this should be avoided as much as possible. For recent GPUs, the warp size is typically 32. The warp size is also important when accessing constant memory (see Section 9.1↓): constant memory works most efficiently when all threads within one warp access the same memory location at the same time.
In Quasar, the warpsize can be requested using the special kernel function parameter warpsize.
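For example (a minimal sketch; the kernel simply stores the warp size so that the host code can print it):
function [] = __kernel__ warpsize_kernel (y : vec, pos : int, warpsize : int)
y[pos] = warpsize % typically 32 on recent GPUs
endfunction
y = zeros(1)
parallel_do(size(y), y, warpsize_kernel)
print y[0]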
Specifying the GPU block size
By default, the GPU block size is determined automatically by the runtime system. For situations in which explicit control of the block size is required, it is also possible to manually specify the block size (blkdim) using the function parallel_do. For example:
sz = max_block_size(my_kernel, my_block_size)
parallel_do([dims,sz],...,my_kernel)
where dims and sz are both vectors of equal length. my_block_size typically depends on the amount of shared memory you want to use within the kernel. The built-in function max_block_size computes the maximally allowed block size for the given kernel function, taking hardware limits (such as the maximum number of threads per block and the register usage of the kernel) into account. Because these limits are handled transparently by max_block_size(.), one should always call max_block_size to determine the maximal block size for a given kernel function.
Warning: the specification of the block size (especially without using max_block_size) should be done with care, because when the block size is too small, the performance of the kernel function may be severely impacted. Additionally, the code may fail or perform suboptimally on future computation devices.
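Putting this together, a kernel that (for example) allocates a 16 × 16 block of shared memory could be launched as follows (a sketch; im is a hypothetical input matrix):
sz = max_block_size(my_kernel, [16,16]) % clamp the requested block size to the hardware limits
parallel_do([size(im), sz], im, my_kernel)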
Block size not specified: what happens?
In case the block size is not specified, it can be accessed from the kernel function through blkdim (this parameter should then be added to the argument list). The Quasar runtime system computes the block size that is estimated to be the most optimal for the given kernel, according to some heuristics. Quite often, this will be 16 × 32. Note that the block size is always a divisor of the dimensions dims. When necessary, the block size for a given kernel function can be retrieved programmatically using opt_block_size:
sz = opt_block_size(my_kernel)
Note that opt_block_size uses an internal optimization method for determining the best possible block size for the given data dimensions, taking into account the resources used by the kernel function (e.g., registers, shared memory, ...). Furthermore, it always returns a block size that the hardware can handle.
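For example, this block size can also be passed explicitly to parallel_do (a sketch; this mimics what the runtime does automatically when no block size is given):
sz = opt_block_size(my_kernel)
parallel_do([size(y), sz], y, my_kernel)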
Large vector/matrix dimensions that are not a power/multiple of 2
It is best to specify dimensions to parallel_do that are a multiple of the maximal block size (e.g. 16 × 32 or 32 × 16). GPU computation engines work best with input data dimensions that are a multiple of (a power of) two. In the following example, this is not the case:
function [] = __kernel__ my_kernel(y : vec'unchecked, pos : int)
y[pos] = 1.0
endfunction
y = zeros(65535)
parallel_do(size(y),y,my_kernel) % without padding, this would give errorInvalidValue
To ensure proper functioning of the program, the runtime system internally pads the input dimensions to a multiple of the block size, as follows:
function [] = __kernel__ my_kernel(y : vec'unchecked, pos : int)
if pos >= 0 && pos < numel(y)
y[pos] = 1.0
endif
endfunction
y = zeros(65535)
pad = x -> ceil(x / BLOCK_SIZE) * BLOCK_SIZE % Block size is determined automatically
parallel_do(pad(numel(y)),y,my_kernel) % Success!
Note that this is performed completely transparently to the user, but it comes at a slight performance cost: 1) the position check if pos >= 0 && pos < numel(y) is performed by all threads, and 2) the threads for which the if-test fails remain “inactive”.

