2 Getting started
This section gives a short tutorial that introduces the Quasar programming language. First, a number of high-level concepts are listed; these concepts are included mainly to ease programming in Quasar. The most interesting parts are discussed in the section “Writing parallel code” (Section 2.4↓).
2.1 Quasar high-level programming concepts
-
Variables: Quasar variables are (by default) weakly-typed, although some mechanisms exist to enforce strong-typing. Variable names and function names are case sensitive.
-
Data types: some of the built-in data types are listed here:
-
scalar: specifies a floating-point number. The used precision depends on the settings of the computation engine.
-
cscalar: specifies a complex-valued scalar number
-
vec: a dense vector (1D array) of arbitrary length (the size is limited by the system resources)
-
mat: a dense matrix (2D array) of arbitrary size (the size is limited by the system resources)
-
cube: a dense cube (3D array) of arbitrary size (the size is limited by the system resources)
-
cvec: a complex valued dense vector
-
cmat: a complex valued dense matrix
-
ccube: a complex valued dense cube
-
string: a string expression
-
cell: a cell matrix object
-
kernel_function: represents a reference to a kernel or device function (see further).
-
function: a reference to a Quasar (user) function or lambda expression
-
object: a user-defined object (see function object())
Note that specific functions (see further) need to be used to create variables of a given type. There are also some special built-in data types: vecx and cvecx, with x=1,...,32, which specify a (real- or complex-valued) vector of length x.
-
Scalar numbers: scalar numbers can be entered in decimal notation (5.678) as well as in scientific notation (-1.9e-4). Imaginary numbers are defined by adding the suffix j (or i), hence 1+1j or 1-1j represent complex numbers. Non-decimal numbers are also supported: for example, binary numbers 1011011b (suffix b or B), octal numbers 123456o (suffix o or O) and hexadecimal numbers 1Fh or 0ECD3Fh (suffix h or H).
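For example, the following literals illustrate these notations (the values in the comments simply follow from the respective bases):
a = 5.678      % decimal notation
b = -1.9e-4    % scientific notation
c = 1 + 1j     % complex number
d = 1011011b   % binary, equals 91
e = 123456o    % octal, equals 42798
f = 0ECD3Fh    % hexadecimal, equals 970047
print a, " ", b, " ", c, " ", d, " ", e, " ", f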
-
Integer numbers: Quasar supports integer numbers (type int). The bit length of the int type depends on the computation engine, but is typically 32-bit. Also integer types with specified bit length exist: int8, uint8, int16, uint16, uint32.
-
Comments: single-line comments start with the percent sign (%):
% Comments are cool
However, note that multi-line comments are currently not (yet) supported.
-
Assignment expressions:
cool = 1
quasar = cool
The separation of lines using “;” is optional, and only mandatory when multiple statements are placed on the same line. One can assign to multiple variables at once, similar to C/C++:
a = b = 1
Here, the result of an assignment is a value (in this case, 1). Multiple variable assignment (i.e. assigning multiple values to multiple variables at once) is also possible. For example:
[a, b] = [1, 2]
will assign 1 to a and 2 to b. It is equivalent to:
a = 1; b = 2
Multiple variable assignment is mostly useful for 1) assigning multiple return values from functions, and 2) interchanging values:
[a, b] = [b, a]
will swap the values of a and b. In some cases, it may be useful to ignore a certain return value. This can be done using the placeholder _:
[a, _] = [1,2]
[_, b] = [1,2]
-
Arrays (vectors, matrices, cubes, etc.): the element data type may depend on the settings of the computation engine (currently, only 32-bit floating point is allowed, for efficiency reasons). The following program illustrates how to create vectors and perform operations on them:
a = [0, 1, 2, 3] + 4
b = [3, 3, 3, 3]
print "a = ", a, "a .* b = ", a .* b, "sum(a.*b) = ", sum(a .* b)
print "a^2 = ", a.^2
String expressions are defined by double quotes (“”); the function “print” prints several comma-separated values to the console. The “sum” function computes the sum of all components of a vector. The first line evaluates to
a = [4, 5, 6, 7]
i.e., 4 is added to every component of the vector. [4, 5, 6, 7] then represents a row vector. A matrix can be defined as follows:
a = [[1,2],[2,1]]
Statements and expressions can be split across multiple code lines. The following is also valid:
a = [[1,2],
[2,1]]
However, for readability, it is advised to put an underscore at the line break:
a = [[1,2], _
[2,1]]
Similarly, a 3D matrix can be defined by:
a = [[[1,2],[3,4]],[[5,6],[7,8]]]
An alternative way of defining matrices is using the functions zeros(.) or ones(.), which initialize the values of the matrix to 0 and 1, respectively:
a = zeros(5)
b = zeros(6, 4)
c = zeros(8, 6, 4)
d = ones(5)
e = ones(6, 4)
f = ones(8, 6, 4)
Alternatively, a vector with the dimensions can be used:
dims = [4, 5, 6]
a = ones(dims)
The function size(.) returns the size of a vector/matrix/cube:
dims = size(a)
dim_y = size(a,0)
dim_x = size(a,1)
dim_z = size(a,2)
[dim_y,dim_x] = size(a,0..1)
[dim_y,dim_x,dim_z] = size(a)
Note that the dimensions are zero-based. By convention, y is the first dimension (corresponding to index 0), x is the second dimension and z is the third dimension. Internally, matrices are stored in row-major order. An n × n identity matrix can be created by using the function eye(.):
Q = eye(n)
Another example:
a = [[1,2],[2,1]]
b = [[3],[4]]
a[0,0] = 2
print a * b, ",", eye(3)
print a[0,0]
print "size(a)=", size(a), "size(a,1)=", size(a,1)
The function numel(.) gives the number of elements of a vector/matrix/cube. Practically: numel(X) = prod(size(X)).
-
Operators: see Table 2.1↓.
Notes:
-
There are no bit-wise integer operators. Instead, use the functions and (bitwise conjunction), or (bitwise disjunction), xor (exclusive or), not (bitwise negation), shl (bit-wise left shift), shr (bitwise right shift).
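For example, using the functions listed above (the results follow from the usual bitwise semantics on non-negative integers):
print and(12, 10)  % 8  (1100b and 1010b)
print or(12, 10)   % 14
print xor(12, 10)  % 6
print shl(1, 4)    % 16 (shift left by 4 bits)
print shr(256, 3)  % 32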
-
Within kernel or device functions, the operators +=, -= have a special meaning: they specify atomic operations (i.e. these operations are free of data races). There are currently 13 atomic operators (see table below).
-
Sequences: a sequence defines a row vector:
a=0..9
b=0..2..6
The middle argument defines the step size. Generally, the sequence includes the specified lower and upper bounds. Hence, the above statements are equivalent to:
a=[0,1,2,3,4,5,6,7,8,9]
b=[0,2,4,6]
The sequences can subsequently be used for matrix indexing:
A=randn(64,64)
A_sub = A[a,b]
Example:
a = 1..2..10
b = sum(a)
c = linspace(1, 2, 5)
print "a = ", a, "c = ", c
print sum = ", [b, sum(c)]
The linspace function creates a uniformly spaced row vector of 5 values between 1 and 2, hence c = [1,1.25,1.5,1.75,2]. Implicit sequences (:) can be used to quickly index matrices:
print A[:,0], A[0,:]
This statement prints the first column of A, followed by the first row of A. Note that there is no Quasar equivalent for the MATLAB keyword “end”; however, it is still possible to use A[0..size(A,0)-1,0].
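For example, the “end” workaround in a runnable form:
A = randn(64, 64)
print A[:,0]                % first column of A
print A[0..size(A,0)-1, 0]  % same as A[:,0], written without an "end" keyword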
-
Control structures: Quasar supports several control structures:
for a=0..2..4
break
continue
endfor
if a==2
endif
if a==2
...
elseif a==3
...
else
...
endif
while expr
break
continue
endwhile
repeat
break
continue
until expr
Note that “if” is ended with “endif”. Also, “if” ... “endif” statements must be spread across several lines of code; this improves the readability of the code. The following is NOT allowed:
if a==2; do_something(); endif % Not allowed!
An example of a for-loop:
for i=1..2..100
j=i+1
print i, " ", j
if i==1
print "i is one"
elseif i==3
print "i is three"
endif
endfor
Non-uniform ranges can be specified as follows:
for powerOfTwo=[1,2,4,8,16,32,64,128]
print powerOfTwo
endfor
or more conveniently as:
for powerOfTwo=2.^(1..7)
print powerOfTwo
endfor
-
Switches are also possible; the syntax is a little different. For example:
match a with
| 1 ->
print "a=1"
| 2 ->
print "a=2"
| (3, 4) ->
print "a=3 or a=4"
| "String" ->
print "a=String"
| _ -> print
"a is something else"
endmatch
Note that different data types (i.e. strings and scalar numbers) can be mixed. Multiple case values can be specified (grouped by parentheses).
-
Ternary operators: an inline if is also available, just like in C/C++:
y = condition ? true_value : false_value
y = (x > T) ? x - T : 0
When the condition is true, only the true-part is evaluated; conversely, when the condition is false, only the false-part is evaluated.
-
Lambda expressions: simply speaking, lambda expressions define inline functions, for example:
v = (x,y) -> 2*x+y
u = x -> 2*x
w = x -> y -> x + y
z = w(10)
print v(1,2), " ", z(5)
a = [[1,2],[2,1]]
print v(a,a)
print w(4)(5)
Note that here, w is a lambda expression that returns another lambda expression (y -> x + y) when evaluated. As such, partial evaluation is possible, e.g., z=w(10) (see further in Section 4.10↓). Lambda expressions can contain several sub-expressions and can be spread over several lines, as follows:
print_sum = (a, b) -> (sum=a+b;
print(sum); sum)
The different expressions are separated using semicolons (’;’). The return value of the lambda expression is always the last expression (in the above example, sum). Using ternary operators, it is fairly simple to define recursive lambda expressions:
factorial = x -> x > 0 ? x * factorial(x - 1) : 1
-
Functions: the syntax for functions is different from the syntax for lambda expressions:
function [outarg1, ..., outargM] = name(inarg1, ..., inargN)
Here, there are M output arguments (outarg1, ..., outargM) and N input arguments (inarg1, ..., inargN); “name” is the name of the function. Note that all output arguments must be assigned, otherwise the function call fails. An example:
function y = do_something(x)
y = x * 2
endfunction
a = [[1,2],[2,1]]
b = do_something(a)
print b
Calling a function with multiple output arguments requires multiple variable assignment:
function [x, y] = compute(a, b)
x = a + b
y = a * b
endfunction
[u, v] = compute(2,3)
print u, " ", v
Functions can contain inner functions (up to an arbitrary nesting depth). The inner functions (direct children, not siblings) can only be accessed from the outer function. For example:
function y = colortransform (x : vec3, cname)
function a = hsv2rgb (c)
h = floor(c / 60)
f = frac(c / 60)
v = 255 * c
p = v * (1 - c)
q = v * (1 - f * c)
t = v * (1 - (1 - f) * c)
match h with
| 0 -> a = [v, t, p]
| 1 -> a = [q, v, p]
| 2 -> a = [p, v, t]
| 3 -> a = [p, q, v]
| 4 -> a = [t, p, v]
| _ -> a = [v, p, q]
endmatch
endfunction
if cname=="hsv2rgb"
y = hsv2rgb(x)
else
error "the specified color transform ",cname, " is not supported!"
endif
endfunction
Argument types can optionally be specified, as shown above in x : vec3. Quasar will check at compile-time (and run-time) if the arguments are of the correct type, otherwise an error will be raised. The presence or absence of argument types has no further influence on the execution and end result of the program (except when types do not match and an error is generated). However, specifying argument types can help the Quasar optimizer to generate more efficient code.
Function handles can also be used, for example:
my_func = colortransform
print my_func([0.2, 0.2, 0.3])
-
Optional function arguments: functions can have optional arguments. In case an argument is missing, the default value is used. For example:
function [y, k] = my_func(b : mat, a : scalar = 4)
print a + b
y = k = 0
endfunction
my_func(2)
Since my_func is called with one argument, the default value for the second argument will be used (4 in this case).
Hence, functions can have multiple outputs and optional arguments, whereas lambda expressions cannot. Note that the default values of optional function arguments can, in turn, be expressions and even function calls:
function [y, k] = my_func(b : mat, a : mat = eye(4))
function [y, k] = my_func(b : mat, a : mat = A .* B)
function [y, k] = my_func(b : mat, a : mat = 2 * b)
Note that, by default, variable references in default values (if the name does not correspond to another input argument) refer to the outer context in which the function is defined; they capture the value at the time the function is defined. The argument variables are defined in the order in which they appear in the argument list. The following would therefore lead to an error:
function [y, k] = my_func(a : mat = 2 * b, b : mat)
Here, b is not yet defined at the time a = 2 * b is evaluated.
-
Cell matrices: vectors, matrices and cubes can be grouped in a cell-structure. Cell matrices are either created using the function cell or using the special designated quotes ‘’. copy(.) performs a deep copy of a cell matrix (i.e. the function recursively applies copy(.) to all its elements). Some examples are given below:
A = cell(2,2)
G = cell(3)
A[0,0] = eye(4)
A[1,1] = 3
A[0,1] = cell(1,4)
A[0,1][1] = ones(3,3)
A[0,1][1][0,0] = 2
print A[0,1][1]*3
print size(A)*2
B = copy(A)
C = B-A
print C[0,0]
D = {A,B,C}
D_names = {"A","B","C"}
print D_names[1]
print size(’’)
One important special feature is that operations on cell matrices are supported when the different operands have the same structure. It is possible to compute the sum of two cell matrices using:
C = B+A
Alternatively, we can multiply all elements of a cell matrix by a constant:
C = B*4
Or, we can use cell matrices in function calls (note that this is only allowed with built-in functions):
C = max(B,4)
Cell matrices are convenient structures, especially for rapid prototyping. However, because cell matrices can store any data, the type inference that is required for efficient parallelization may fail on cell matrices. For this reason, it is useful to use fully typed cell matrices (see Section 3.5↓).
-
Dynamic evaluation: string expressions can be parsed and evaluated at runtime using the eval(.) function:
val = eval("(x) -> 3*eye(x)")(8)
Here, the eval function parses the string expression “(x) -> 3*eye(x)” and returns a corresponding lambda expression. This lambda expression can then be evaluated at the same speed as “regular” lambda expressions. This can be useful for simulations (e.g. passing functions through the command line).
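Another small example along the same lines (a sketch; any valid lambda expression string can be used):
f = eval("(x, y) -> x + 3*y")
print f(2, 1)  % prints 5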
-
Reading an input image:
img_in = imread("lena_big.tif")
Grayscale images return a two-dimensional matrix; color images return a three-dimensional cube, in which the length of the third dimension is either 3 (RGB) or 4 (RGBA, i.e. RGB with an alpha channel).
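For example, the dimensions of the loaded image can then be inspected with size(.) (the reported size depends, of course, on the image file):
img_in = imread("lena_big.tif")
print "image dimensions: ", size(img_in)  % e.g. [512, 512, 3] for an RGB image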
-
The spread operator: the spread operator “...” allows unpacking vectors into indices or function parameters. Using the spread operator, the following lines of code can be simplified:
pos = [0,1,2]
y = im[pos[0],pos[1],pos[2],0] % Before
y = im[...pos, 0] % After
luminance = (R,G,B) -> 0.2126 * R + 0.7152 * G + 0.0722 * B
c = [128, 42, 96]
lum = luminance(c[0],c[1],c[2]) % Before
lum = luminance(...c) % After
The spread operator is particularly useful in combination with variadic functions (see Section 4.8↓).
-
Importing .q files: .q files can contain multiple variable and function definitions which can be accessed from other .q programs. To access them, the import keyword can be used. The import keyword should only be used at the global scope (hence not within functions or control structures) and at the beginning of the Quasar module. For example:
import "system.q"
import "imfilter.q"
% all definitions from system.q and imfilter.q are now available.
im = imfilter(imread("img.tif"),ones(7,7))
There is one exception: “main” functions are completely skipped and hence not imported (see Section 13.1↓). Also, .q files are only imported once (multiple imports of the same file have no effect and are ignored by the compiler), and the import statements must be placed at the beginning of the program.
Syntax notice: the control structure keywords are as follows: if → endif, for → endfor, type → endtype, while → endwhile, match → endmatch, function → endfunction, try → endtry. In older versions of Quasar, the keywords were: if → endif, for → end, type → end, while → end, match → end, function → end, try → end. We have found experimentally that matching the control structure endings with the beginnings not only enhances the readability of the code but also prevents certain types of bugs (especially with nested control structures). For legacy code, the Quasar compiler still accepts the old endings; this is done on a per-file basis. When both kinds of control structure endings are mixed, the compiler generates an error. Users are therefore encouraged to use the new control structure endings!
2.2 A brief introduction of the type system
Note: an in-depth explanation of the Quasar user-defined types will be given in Chapter 3↓. Here, we only give a brief introduction.
Quasar has an array-based type system that facilitates working with multi-dimensional data, including, for example, conversions between vectors and matrices. In general, variable types are implicit (hence they do not need to be specified by the user). In contrast to MATLAB/Octave, the Quasar compiler obtains the types of variables through type inference. The type inference is not strict: if the compiler is not able to figure out the type of a variable, this variable is considered to be of an unknown type (denoted by the type ’??’). The main primitive types of Quasar are summarized in Table 2.2↓. The types vec, mat, cube, cvec, cmat, ccube, ... are actually shorthands for their corresponding generic versions with explicit type parameters (see further in Chapter 6↓); these shorthands are also listed in the table. Additional primitive types are given in Table 2.3↓.
Table 2.2 Quasar main primitive types. Note: to use the types marked with an asterisk (*), it is required to import the module “inttypes.q”.

type                  | 0-dim   | 1-dim        | 2-dim        | 3-dim         | n-dim
integer number        | int     | ivec(*)      | imat(*)      | icube(*)      | icube{n}(*)
(shorthand for)       |         | vec[int]     | mat[int]     | cube[int]     | cube{n}[int]
scalar number         | scalar  | vec          | mat          | cube          | cube{n}
(shorthand for)       |         | vec[scalar]  | mat[scalar]  | cube[scalar]  | cube{n}[scalar]
complex scalar number | cscalar | cvec         | cmat         | ccube         | ccube{n}
(shorthand for)       |         | vec[cscalar] | mat[cscalar] | cube[cscalar] | cube{n}[cscalar]
The dimensionality can be specified via the brace syntax: e.g.,
cube{4} denotes a four-dimensional array. Quasar defines “infinite” dimensional data types:
cube{:} and
ccube{:}, although in practice, the current implementation only supports up to 16-dimensional data structures.
The relation between the different “dimensional” types is defined as follows:
vec ⊂ mat ⊂ cube ⊂ cube{:}
ivec ⊂ imat ⊂ icube ⊂ icube{:}
cvec ⊂ cmat ⊂ ccube ⊂ ccube{:}
Hence, every vector can be passed to a function requiring a matrix, and every matrix can be passed to a function requiring a cube. Whether a value A is a vector, a matrix, or a cube depends on the number of dimensions of A:
A has type:
    vec      if ndims(A) == 1
    mat      if ndims(A) == 2
    cube     if ndims(A) == 3
    cube{n}  if ndims(A) == n
where ndims returns the total number of dimensions. Note that scalar numbers are not part of the relationship (hence scalar⊈vec). This is mainly for implementation efficiency.
The consequence is that, functions defined for arguments of type cube can also accept arguments of type vec and mat. For example, for digital images, cube can both represent color images (with dimensions M × N × 3) and grayscale images (with dimensions M × N × 1). In some cases, it is useful to indicate size information in the type. This can be done by adding size parameters to the type: e.g., cube(:,:,3) denotes a 3D array for which the size in the last dimension is always 3. Size parameters give invaluable information to the compiler allowing specialized code to be generated.
Explicitly annotating the types of variables can bring performance benefits, although for code simplicity it is advised to only specify the type when necessary: in many cases the type is obtained and propagated using type inference. For example, when using the function im=imread("image.png", "rgb"), the compiler will infer that the type of im is cube(:,:,3).
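A small sketch (the function name print_color_info is hypothetical): a function can be restricted to 3-channel color images using such a size parameter.
function [] = print_color_info(im : cube(:,:,3))
print "color image of ", size(im,0), " x ", size(im,1), " pixels"
endfunction
im = imread("image.png", "rgb")  % inferred type: cube(:,:,3)
print_color_info(im)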
It is possible to check at run-time whether a variable (or intermediate result) has a certain type, using the function type(A, "typename"). Moreover, the check can be performed using type annotations and/or assertions, for example:
print(variable:cube[int])
print(1:cube) % Error: Type check failed: ‘int’const’ is not ‘cube’
assert(type(1,"scalar"))
assert(type(1i,"cscalar"))
assert(type(zeros(2,2),"mat"))
assert(type("Quasar","string"))
In case one of the above type checks fails, a compiler error will be generated. Additionally, the file system.q defines a number of lambda expressions for checking types:
isreal = x -> type(x, "scalar") || type(x,"vec") || type(x,"mat")
|| type(x, "cube")
iscomplex = x -> type(x, "cscalar") || type(x,"cvec") || type(x,"cmat")
|| type(x, "ccube")
isscalar = x -> type(x, "scalar") || type(x, "cscalar")
isvector = x -> type(x, "vec") || type(x, "cvec")
ismatrix = x -> type(x, "mat") || type(x, "cmat") || isvector(x)
iscube = x -> type(x, "cube") || type(x, "ccube") || ismatrix(x)
Under some circumstances, the Quasar compiler is not able to figure out the types of the variables through inference. One example is the load function, which reads data from a file (through a process called deserialization) and stores them into variables.
[A, B] = load("myfile.dat")
The file load operation is only performed at runtime; the result depends on the content of the file being loaded and, correspondingly, the compiler cannot predict the types of the variables. It then makes sense to give the compiler some type information, such that it can perform smart optimizations when needed:
assert(type(A,"ccube"))
assert(type(B,"vec"))
The assert function then has a two-fold purpose: 1) it gives the compiler information about the types of A and B and 2) it performs a runtime check to validate the data read from “myfile.dat”.
An alternative (and perhaps cleaner) way to check the type of the variable is by using type annotations. The above example then becomes:
[A : ccube, B : vec] = load("myfile.dat")
In case the types do not match, the runtime system will generate an error message. Type annotations need to be declared only the first time the variable is used.
Finally, type conversion is generally not needed in Quasar (avoided for computational performance reasons), although a conversion table is given in
Table 2.4↓. Only for generic programming purposes (see
Chapter 6↓), an overridable type conversion function
cast(x, new_type) is available.
Table 2.3 Additional primitive types (“first-class citizens”)

Type            | Purpose
string          | Sequences of characters
lambda_expr     | Lambda expressions
function        | Function handles
kernel_function | Kernel functions
object          | Objects
??              | Unspecified type (i.e. determined at run-time)
Table 2.4 Type conversion table
Lambda expressions can also be explicitly typed. Quasar follows typing conventions similar to the Haskell and ML programming languages. For example:
-
[int -> int]: a function that takes "int" as input and gives “int” as output
-
[(mat, scalar) -> (mat, mat)]: a function that takes two input arguments (of type mat and scalar) and that has two output arguments (both of type mat)
-
[int -> int -> int]: a function that maps an integer input onto a lambda expression of type [int -> int].
-
[...scalar -> scalar]: a variadic function that takes an arbitrary number of scalar values as input and that returns a scalar number.
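For example, a lambda expression can be annotated with such a function type (a sketch combining the variable type annotation syntax shown earlier with the notation above):
add : [(scalar, scalar) -> scalar] = (x, y) -> x + y
twice : [int -> int] = n -> 2*n
print add(1.5, 2.5), " ", twice(4)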
Lambda expressions (especially those that use closures, see
Section 4.2↓) are very powerful in Quasar. To reduce the overhead associated with calling lambda expressions on computation devices, the Quasar compiler attempts to inline lambda expressions and functions whenever possible or beneficial.
2.2.1 Floating point representation
In Quasar, the internal representation of scalar numbers “
scalar” (or complex scalar numbers “
cscalar”), is
usually not specified at the code level. This allows the floating point representation to be changed on a global level. By default, Quasar will use single precision floating point numbers (see
Table 2.5↓). However, it is possible to compile and run the programs using double precision as well, by passing the
-double command line option to Quasar, e.g.:
./Quasar.exe -debug -double script.q
Table 2.5 Overview of floating point representations

                    | IEEE 32-bit (single precision)     | IEEE 16-bit (half precision)  | IEEE 64-bit (double precision)
Quasar type         | scalar’single                      | scalar’half                   | scalar’double
Significand         | 23 bits                            | 10 bits                       | 52 bits
Exponent            | 8 bits                             | 5 bits                        | 11 bits
Minimum pos. value  | 1.17549435 × 10^-38                | 5.9605 × 10^-8                | 2.225073858507201 × 10^-308
Maximum pos. value  | 3.40282347 × 10^38                 | 65504.0                       | 1.797693134862316 × 10^308
Exact integer repr. | -2^24 + 1 to 2^24 - 1 (16,777,215) | -2^11 + 1 to 2^11 - 1 (2,048) | -2^53 + 1 to 2^53 - 1
When numerical precision is not of utmost importance, it is recommended to use single precision. Note that some older GPUs have limited double precision FP support; CUDA devices before compute capability 1.3 do not have double precision FP support at all. Moreover, using double precision FP numbers doubles the required memory bandwidth. Consequently, programs using double precision may run up to 2x slower than programs using single precision FP. Additionally, several consumer GPUs (e.g., GeForce series) have a double precision throughput that is much lower than their single precision throughput. However, there are some reasons to enable double precision in Quasar programs:
-
When numerical accuracy is an issue: note that results obtained using double-precision arithmetic may differ from those obtained using single-precision arithmetic, due to the greater precision and due to rounding errors. Therefore, it is important to compare and express the results within a certain tolerance rather than expecting them to be exact. Moreover, GPU devices typically flush numbers smaller than the minimum representable value (in absolute sense) to zero. Correspondingly, using double precision FP numbers may reduce some of the error introduced by underflow, as the minimum representable value is of the order of 10^-308 for double precision, versus 10^-38 for single precision (see Table 2.5↑).
-
For comparing the results of the algorithms to MATLAB/C++ implementations using double precision FP numbers.
Often it is useful to check whether the program is not suffering from floating point inaccuracies. This can simply be done by running the program once in double precision mode.
Note: NVidia GTX 260, 275, 280, 285 and 295 GPUs (with compute capability 1.3) have a low performance in double precision computations (about 1/8 of the single precision performance). Devices of the NVidia Fermi architecture (compute capability 2.0+) achieve 1/2 of the single precision performance. Performance is greatly improved with either NVidia Tesla cards or the NVidia Titan (which is based on the Kepler architecture). For full details, see
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions.
Finally, recall that FP math is not associative, i.e. the sum (A + B) + C is not guaranteed to be equal to A + (B + C). When parallelizing computations, the order of the operations is often changed (and may even be unspecified), leading to results which may differ each time the program runs, even with the same input data. This limitation is not inherent to Quasar, but applies in general to the approach of performing parallel computations using floating point numbers. The example “Accurate sum” gives more information on this issue (see Section 10.7↓); also, in Section 13.6↓ we explain how parallel code can be written that does not depend on this “non-deterministic” behavior.
The global constant “eps” is available for determining the machine precision (similar to FLT_EPSILON/DBL_EPSILON in C or eps in Octave/Matlab). The functions maxvalue(scalar) and minvalue(scalar) can be used to determine the maximum and minimum values representable in floating point format.
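For example:
print "machine precision (eps) = ", eps
print "maximum representable scalar = ", maxvalue(scalar)
print "minimum representable scalar = ", minvalue(scalar)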
2.2.2 Mixed precision floating point computations
It is recommended to use the default scalar data type (with globally defined precision) as much as possible. However, in some cases, it is desirable to perform certain parts of the computation in a higher precision (e.g., when numerical accuracy is important) or even in a lower precision (e.g., when memory and/or computation time counts). Therefore Quasar allows specifying the required precision as part of the scalar type:
-
scalar’single: represents a 32-bit IEEE single precision floating point type (FP32)
-
scalar’double: represents a 64-bit IEEE double precision floating point type (FP64)
-
scalar’half: represents a 16-bit IEEE half precision floating point type (FP16)
The half precision floating point representation (FP16) cannot be set globally (due to its inherent precision and range restrictions); however, it is possible to write functions that mix scalar types of different precision. Mixing FP16 and FP32 is common in machine learning (e.g., convolutional neural networks). Pascal, Volta and Turing GPUs also support FP16 computations natively. Volta’s tensor cores provide mixed-precision matrix multiply-accumulate operations, in which the matrices are stored in FP16 and the result is accumulated in FP32.
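A minimal sketch of mixing precisions within one function (assuming assignments implicitly convert between scalar precisions): the input data stays in the default scalar precision, while the accumulator is explicitly kept in double precision.
function y : scalar'double = sum_fp64(x : vec)
y = 0.0
for i=0..numel(x)-1
y += x[i]
endfor
endfunction
print sum_fp64(ones(1000))  % prints 1000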
Remark: Quasar currently does not support bfloat16, the half-precision floating point format supported by Google’s Tensor Processing Units (TPUs).
2.2.3 Integer types
Next to floating point numbers, Quasar has also (limited) support for integer types. The default integer type is “int” (signed integer). Its bit length depends on the computation engine, but is guaranteed to be at least 32-bit.
There are also integer types with a pre-defined bit length; these are mainly provided 1) to enable more efficient input/output handling (e.g. reading/writing of images in integer format), or 2) to write certain algorithms in which memory usage/memory bandwidth should be as low as possible. Generally, the use of integers with a pre-defined bit length should be avoided. For completeness, these types are listed below:
-
int8: a signed 8-bit integer (with range -128..127)
-
int16: a signed 16-bit integer (with range -32768..32767)
-
int32: a signed 32-bit integer (with range -2^31..2^31 - 1)
-
int64: (not fully implemented yet)
-
uint8: an unsigned 8-bit integer (with range 0..255)
-
uint16: an unsigned 16-bit integer (with range 0..65535)
-
uint32: an unsigned 32-bit integer (with range 0..2^32 - 1)
-
uint64: an unsigned 64-bit integer (with range 0..2^64 - 1)
A matrix containing 8-bit integers can be obtained as follows:
A = mat[int8](rows,cols)
Note that, by default, arithmetic operations on integer matrices are disabled (e.g. summing, subtracting, conversion to floating point, etc.). These operations can be enabled by importing the inttypes library (import "inttypes.q").
Integer types can have special modifiers (a modifier is added by writing an apostrophe ’ directly after the type name). These modifiers indicate how the conversion from a floating point number or an integer number with a larger bit depth to the considered integer type takes place.
-
int’checked (default): generates an error when the integer can not be represented using the current type (note: not implemented yet)
-
int’sat: in case of overflow, the integer is saturated (clipped) to the highest (or lowest) possible value that can be represented.
-
int’unchecked: performs no integer overflow checking. This may often be the fastest.
The following function, which sums two 8-bit unsigned integer matrices, illustrates the usage of integer modifiers:
function y : mat[uint8’sat] = add(a : mat[uint8], b : mat[uint8])
for m=0..size(a,0)-1
for n=0..size(a,1)-1
y[m,n] = a[m,n] + b[m,n]
endfor
endfor
endfunction
Here, integer saturation is used in case the sum of a[m,n] and b[m,n] does not fall in the range 0..255.
Important note
Variables are implicitly integer when defined by a constant with no decimal sign. This may have some unexpected consequences as in the following example:
a = 0 % a is integer
for n=0..N-1
a += x[n] % x[n] is implicitly converted to integer
endfor
Here,
x[n] is automatically converted to integer. To warn the programmer, a type conversion warning message will be shown. If it is desired to declare
a as a scalar value, the first line needs to be replaced by
a = 0.0.
2.2.4 Fixed sized datatypes
As already mentioned in
Section 2.2↑, vectors and matrices can optionally have their size specified in any dimension. For example,
a : cube(:,4,:) always has
size(a,1) == 4. The size acts as a constraint on the specialized type; the constraint is checked by both the compiler and the runtime. ’
:’ indicates that the length is unspecified (not known at compile-time).
Using this technique, it is easy to define single instruction multiple data (SIMD) operations. For x86/64 CPU targets in Quasar, vectors of length 4 (
vec(4)) may be mapped onto SSE datatypes while vectors of length 8 (
vec(8)) may be mapped onto AVX datatypes. See
Chapter 12↓ for more information.
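For example (a sketch; whether SSE/AVX instructions are actually generated depends on the target and the back-end compiler):
function y : vec(4) = blend4(a : vec(4), b : vec(4), w : scalar)
y = a + w*(b - a)
endfunction
print blend4([0., 0., 0., 0.], [1., 2., 3., 4.], 0.5)  % prints [0.5, 1, 1.5, 2]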
In generic functions, it is possible to use type parameters for this purpose (for example cube(:,:,P)). For the following function:
function [] = color_transform[P : int](im_in : cube(:,:,P), im_out : cube(:,:,P))
...
endfunction
array size errors can be caught early in the development process: when the function
color_transform is called and the compiler cannot guarantee that
im_in and
im_out have the same size in the third dimension, a compiler error will result.
2.2.5 Higher dimensional matrices
Higher-dimensional matrices (with dimension > 3) need to be specified using an explicit dimension parameter. For example
cube{4} denotes a 4-dimensional array. The array with unspecified (infinite) dimension in Quasar is
cube{:}. This is useful for generic specialization purposes (see further in
Chapter 6↓). To express higher-dimensional loops for which the dimensionality is not known in advance, the functions
ind2pos and
pos2ind convert between position coordinates and linear indices, as illustrated by the following example:
function y:’unchecked = multidim_subsampling(x : cube{:}, factor : int)
y = uninit(int(size(x)/factor))
{!parallel for}
for i = 0..numel(y)-1
p = ind2pos(size(y), i)
y[p] = x[p*factor]
endfor
endfunction
Here, a higher dimensional cube is subsampled by factor along all dimensions of the cube. This technique is particularly useful for implementing operations with unknown dimensionality of the input parameters (as e.g., in Kronecker products).
2.2.6 User-defined types, type definitions and pointers
Quasar supports user-defined types (UDTs) and pointers: the user-defined types are defined as classes, as illustrated below:
type point : class
x : scalar
y : scalar
endtype
The type keyword is always followed by a type definition. The class point can be instantiated using its default constructor:
p = point()
or:
p = point(x:=4, y:=5)
Remark that the arguments of the constructor are named. The order of the arguments can then also be changed:
p = point(y:=5, x:=4)
By default, classes in Quasar are
immutable. This means that, once initialized, the value of the class cannot be changed (or a compiler error will be generated)! Classes can also be made mutable, as follows:
type point : mutable class
x : scalar
y : scalar
endtype
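For example (a sketch, assuming the usual dot syntax for accessing the fields of a class instance):
p = point(x:=1.0, y:=2.0)
p.x = 3.0  % allowed for a mutable class; for an immutable class this generates a compiler error
print p.x, " ", p.y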
Immutable classes allow several optimizations to be applied: for example, they can be stored in constant device memory, some memory transfers can be eliminated, and the Quasar runtime does not need to check whether the value of the class has been changed in device memory. For these reasons, it is recommended to use immutable classes whenever possible.
Additionally, a UDT can contain other UDTs:
type rectangle : class
p1 : point
p2 : point
endtype
Remark that the fields of the rectangle (p1, p2) are stored in-place. This means that the internal storage size of the UDT is the sum of the storage sizes of its fields. In this case, using single precision FP, elements of the point class will take 8 bytes and consequently elements of the rectangle class will take 16 bytes.
Like in other programming languages (e.g. C/C++, Pascal), it is also possible to define a rectangle that stores references to the point class. Therefore, Quasar supports Pascal-type pointers:
type pt_rectangle : class
p1 : ^point
p2 : ^point
endtype
Remark that in many programming languages, pointers can be a source of programming errors (e.g. dangling pointers, uninitialized pointers, etc.). For this reason, pointers in Quasar have special properties that make them safe to use:
-
Multiple indirections (^^point) are not allowed.
-
All pointer values must be initialized, either using a constructor of the class or using a null pointer (nullptr). For example, the above class can be initialized using:
r = pt_rectangle(p1:=nullptr, p2:=point(1,2))
-
Pointers are only allowed to be used for UDTs, not for scalars (scalar, cscalar, ...) or matrices (vec, mat, cube, ...).
-
All pointer values are typed. It is for example not allowed to declare a pointer to an unknown type (^??).
-
Pointer arithmetic is also not allowed.
Internal detail: the pointers in Quasar rely on a customized form of reference counting to track memory allocations, including a technique to avoid memory leaks caused by potential circular references. Moreover, the pointers abstract away the particular device: the object can reside either in CPU memory, GPU memory, or both.
It is also possible to define (multi-dimensional) arrays of UDTs, using parametric types:
type point_vec : vec[point]
type point_mat : mat[point]
type point_cube : cube[point]
type rectangle_vec : vec[rectangle]
type pt_rectangle_vec : vec[^pt_rectangle]
Using UDT arrays is often more efficient than storing the individual elements of the UDT in separate matrices. This is because 1) the indexing often only needs to be performed once and 2) of better memory coalescing and caching. UDT arrays can be initialized with zeros (or with nullptr’s, for pointer types), in the following way:
a = point_vec(10)
b = point_mat(4, 5)
c = point_cube([1, 2, 3])
Note that a type definition (
type x : y) is required for this construction. The following is (currently) not supported:
a = vec[point](10)
Moreover, the multi-dimensional arrays and UDTs may contain variables with
unspecified types:
type point : class
x : ??
y : ??
endtype
type cell_vec : vec[??]
type cell_mat : mat[??]
type cell_cube : cube[??]
One caveat is: variables with unspecified types do not support automatic parallelization (see further in
Section 2.3↓) and can not be passed to kernel functions (see
Subsection 2.4.1↓).
UDTs can also contain vectors/matrices:
type wavelet_bands : mutable class
LL : ^wavelet_bands
HL : mat
LH : mat
HH : mat
endtype
A consequence is that this class does not have a default constructor (wavelet_bands()), because there are no default values for matrices; nullptr is also not allowed for the matrix fields. Hence, it is necessary to explicitly specify all field values of wavelet_bands:
bands = wavelet_bands(LL:=nullptr,
HL:=ones(64,64),
LH:=ones(64,64),
HH:=ones(64,64))
2.3 Automatic parallelization
The Quasar compiler automatically attempts to parallelize for-loops, depending on the matrix indexing scheme, input/output variables, constants and data dependencies. For example, the following sequential code fragment, which performs spatial filtering using a box filter (mask):
im = imread("image_big.tif")
im_out = zeros(size(im))
N = 5
mask = ones(2*N+1,2*N+1)/(2*N+1)^2
for m=0..size(im,0)-1
for n=0..size(im,1)-1
a = [0.,0.,0.]
for k=-N..N
for l=-N..N
a += mask[N+k,N+l] * im[m+k,n+l,0..2]
endfor
endfor
im_out[m,n,0..2] = a
endfor
endfor
automatically expands to the following equivalent parallel program:
im = imread("image_big.tif")
im_out = zeros(size(im))
N = 5
mask = ones(2*N+1,2*N+1)/(2*N+1)^2
function []=__kernel__ parallel_func(im:cube,im_out:cube,mask:mat,N:int,pos:ivec2)
a = [0.,0.,0.]
for k=-N..N
for l=-N..N
a += mask[N+k,N+l] * im[pos[0]+k,pos[1]+l,0..2]
endfor
endfor
im_out[pos[0],pos[1],0..2] = a
endfunction
parallel_do(size(im,0..1),im,im_out,mask,N,parallel_func)
In this program, first a kernel function is defined. Next, the parallel_do function launches the kernel function in parallel for every pixel in the image im. The kernel function processes exactly one pixel position, and is called repeatedly by the function parallel_do. When compiling the Quasar program, the kernel functions and automatically parallelized loops are compiled, depending on the computation engine being used, to CUDA or native C++ code (using OpenMP). This ensures optimal usage of the computational resources.
The Quasar optimizer may fail to extract a parallel program, for example because the type of certain variables is not known or because certain dependencies between variables have been detected (the latter causing the loop to be executed serially - “serialized”). For mapping algorithms onto hardware, variable types need to be well defined. As explained in
Section 2.2↑, when the variable type is not specified, Quasar uses type inference to derive the exact type from the context. When this fails, warning messages are displayed on the console during compilation that can help to make the program parallel, e.g., by explicitly declaring the type (
B : vec = ...). Quite often, it may be the intention of the programmer to have a parallel loop. In this case, it is possible to interrupt the program compilation when the loop parallelization fails (thereby generating a compiler error). This is possible by putting
{!parallel for} directly before the for-loop to be parallelized:
{!parallel for}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
endfor
endfor
In case the compiler then detects some dependencies between the variables, a warning message will be displayed reporting these dependencies and the loop will be parallelized despite the warnings (possibly causing data races).
There are some scenarios that currently cannot be handled by the auto-parallelizer (for example polymorphic variables which change type during the loop). Additionally, certain features cannot be used from inside loops, such as shared memory functions, thread synchronization... Therefore, and also for full flexibility, it is also possible to perform the parallelization completely manually. This is described in the following section.
2.4 Writing parallel code using kernel functions
In this section, we describe two ways of writing parallel code: the basic usage of kernel functions (Subsection 2.4.1↓) and the more advanced usage. Most algorithms can be efficiently implemented using the “basic” approach. The advanced usage consists of synchronization, dealing with data races, sharing memory across multiprocessors and other GPU programming techniques, which may lead to increased performance by taking more advantage of the available features.
For beginning users, it is advised to get acquainted first with the basic usage techniques, before considering the advanced usage. Additionally, kernels written using the “basic usage” approach are often further optimized by the Quasar compiler to use some more advanced features (see
Chapter 17↓).
2.4.1 Basic usage: kernel functions
A kernel function is a Quasar function with a special attribute
__kernel__, that can be parallelized. Kernel functions are launched in parallel on every element of a certain matrix, using the “
parallel_do” built-in function. The
__kernel__ attribute specifies that the function should be
natively compiled for the targeted computation engine (e.g. CUDA, CPU). Consequently,
__kernel__ functions are
considerably faster in execution than host functions thanks to their parallelization and native code generation. As an example, consider the following algorithm:
function [] = __kernel__ color_temperature(x : cube, y : cube, temp,
cold : vec3, hot : vec3, pos : ivec2)
input = x[pos[0],pos[1],0..2]
if temp<0
output = lerp(input,cold,(-0.25)*temp)
else
output = lerp(input,hot,0.25*temp)
endif
y[pos[0],pos[1],0..2] = output
endfunction
hot = [1.0,0.2,0.0]*255
cold = [0.3,0.4,1]*255
img_out = zeros(size(img_in))
parallel_do(size(img_out,0..1),img_in,img_out,temp,cold,hot,color_temperature)
The kernel function is launched on a grid of dimensions “
size(img_out,0..1)” using the
parallel_do construct. This means that every pixel in
img_out will be addressed individually by
parallel_do, and correspondingly the function
color_temperature will be called for every pixel position.
For a kernel function, all parameters need to be
explicitly typed. Recall that untyped parameters in Quasar are denoted by
??. Code using untyped parameters cannot be mapped onto an efficient implementation; therefore, the compiler will first try to deduce the types from the context. The kernel function is then treated as a generic function (see
Chapter 6↓). If the type deduction fails, a compiler error will result.
To write efficient kernel functions it is recommended to use vector and matrix types with size constraints. For example:
-
vec(X) (or shortcut vecX): corresponds to a vector of length X.
-
ivec(X) (or shortcut ivecX): corresponds to an integer vector of length X.
-
cvec(X) (or shortcut cvecX): corresponds to a complex-valued vector of length X.
-
mat(X,Y) (or shortcut matXxY): a matrix of size X × Y.
-
cube(X,Y,Z)(or shortcut cubeXxYxZ): a cube of size X × Y × Z.
-
int: integer data type
The explicit size of the vector and matrix parameters allows calculations involving the variables to be mapped onto an efficient implementation. Additionally, it also helps the type inference: for example, the product of a vector of length 4 (vec(4)) and a 4 × 4 matrix (mat(4,4)) results in a vector of length 4.
However, some datatypes can not be passed as arguments to kernel functions: cell matrices containing unknown types (
??) and strings. To pass cell matrices, use parametrized matrix types (
vec[cube],
mat[cube],
mat[vec], etc., see
Chapter 3↓). Strings need to be converted to vectors (using the functions
fromascii,
fromunicode). Device functions (
__device__) possibly containing closure variables (see
Section 4.2↓), can also be passed.
For vectors of length
≤ 64, it is most efficient to add the length explicitly in the type as above. Vectors with length known at compile-time are treated in a special way: the components of the vector are grouped together requiring less memory read/write requests and may be implemented using SIMD instructions if the underlying back-end compiler supports them. At the very least, these vectors are allocated on the stack (or registers), which is significantly faster than in the kernel dynamic memory (see
Section 8.3↓). Matrices with size constraints (such as
matXxY,
cubeXxYxZ) are also treated as fixed-length vectors internally.
Remark: the current Quasar implementation places a limit on the maximum length of the constraint, i.e. on the product of the dimensions (e.g., X × Y for matXxY, X × Y × Z for cubeXxYxZ); this limit is 64. When the vector length is larger, specifying the value has no effect (apart from type inference purposes).
Inside __kernel__ and __device__ functions, it is recommended to use integers instead of scalars (when possible). This may yield a speed-up of about 30% for CUDA targets and even more for CPU targets. When a scalar constant contains a decimal point (e.g., 1.2), the compiler will consider this constant to be a floating point number, otherwise it will be treated as an integer.
The syntax of the parallel_do function is as follows:
parallel_do(dimensions, inarg1, ..., inargN, kernel_function)
where
dimensions is a vector. Note that
normally a kernel function cannot have output arguments (there is a special advanced feature that allows kernel functions to return values of certain types, see
Section 4.7↓, but this feature is only for specific use-cases). Instead, in most use cases the return values should be written to the input arguments passed by reference, i.e. arguments of types
vec,
mat,
cube,
cvec,
cmat,
ccube.
There are some special arguments that can be defined in the kernel function declaration, but that do not need to be passed to parallel_do:
-
pos (of type int, (i)vecX): the current position of the work item being processed. Note that a “work item” can be either an individual pixel, or a “group of pixels”, depending on how you specify the “dimensions” argument.
-
blkpos (of type int, (i)vecX): the current position within the block (for advanced users, see Subsection 2.4.4↓)
-
blkidx (of type int, (i)vecX): the block index (for advanced users, see Subsection 2.4.4↓)
-
blkdim (of type int, (i)vecX): the current dimensions of a block (for advanced users, see Subsection 2.4.4↓). Internally, the data is processed on a block-by-block basis. The dimensions of a block depend on the computation engine in use. For example, for the CUDA computation engine (with CUDA compute capability 2.0), the block dimensions can be as large as 16 × 32 or 32 × 16. For the CPU computation engine, the block dimensions will rather be 1 × #num_processors.
-
blkcnt (of type int, (i)vecX): the number of blocks in each dimension (for advanced users, see Subsection 2.4.4↓)
-
warpsize (of type int): the warp size of the device (for advanced users, see Subsection 2.4.4↓)
The
parallel_do function basically executes the following sequential program in parallel:
blkdim = choose_optimal_block_size(kernel_function) % done automatically
for m=0..dimensions[0]-1
for n=0..dimensions[1]-1
for p=0..dimensions[2]-1
pos = [m,n,p]
blkpos = mod(pos, blkdim)
blkidx = floor(pos/blkdim)
kernel_function(inarg1, ..., inargN, [pos], [blkpos], [blkidx], [blkdim])
endfor
endfor
endfor
Here, first, optimal block dimensions (blkdim) for the given kernel function are selected. Then, kernel_function is run inside the three loops, i.e. prod(dimensions) = dimensions[0] × dimensions[1] × dimensions[2] times.
Special
modifiers are available for kernel function arguments. The modifiers are specified using the apostrophe-symbol:
function [] = __kernel__ imfilter_kernel_nonsep_mirror_ext(y : cube’unchecked,
x : cube’unchecked, mask : mat’unchecked’const, center : ivec2, pos : ivec3)
These modifiers specify how vector/matrix/cube elements are accessed, and in particular enable efficient boundary handling in image processing:
-
’safe: disregards writes outside the data boundaries, reads outside the data boundaries evaluate to zero.
-
’circular: performs circular boundary handling
-
’mirror: mirrors when accessing outside the data boundaries.
-
’clamped: clamps (saturates) to the data boundaries (y[0] = y[-1] = y[-2] = ... and y[N-1] = y[N] = y[N+1] = ...)
-
’unchecked (warning: dangerous usage - your program may crash if not used properly): specifies no bounds checking on the input/output data. In case of access outside the data boundaries, a runtime error may or may not be generated. Specify this modifier in case you are sure your kernel/device function is 100% correct, and when you want to enjoy a modest extra code speedup.
-
’checked: the opposite of ’unchecked: generates an error when indices are out of the data boundaries. Quasar will give information on which matrices are the prime suspect.
-
’const: indicates that the matrix variable is constant and will not be changed inside the kernel function.
The default access modes are currently ’safe (inside kernel/device functions) and ’checked outside of kernel/device functions (for performance reasons). In case the program behavior depends on the access mode, it is best to explicitly indicate the access mode.
Finally, there are some rules with respect to the calling conventions for kernel functions:
-
Kernel functions can not have optional arguments.
-
A kernel function can not call a “host” function.
-
A kernel function can call a “device” function (see Subsection 2.4.2↓)
-
A kernel function can call a lambda expression declared with the __device__ attribute (see Subsection 2.4.2↓).
-
A kernel function can call a lambda expression declared without the __device__ attribute, on the premise that the lambda expression can be inlined.
-
A kernel function can call other kernel functions, through parallel_do (see further in Section 4.4↓). A kernel function cannot directly call another kernel function using a standard function call.
There are some special functions that can be used within kernel functions:
-
periodize(x, N): periodizes the input coordinate, i.e. k + a⋅N, with 0 ≤ k < N becomes k. This function is used automatically when the modifier ’circular is specified.
-
mirror_ext(x, N): mirrors the input coordinate between [0, N − 1]. This function is used automatically when the modifier ’mirror is specified.
-
clamp(x, N): clamps the input coordinates to [0, N]. This function is used automatically when the modifier ’clamped is specified.
-
int(x): converts the input argument to integer (using type casting)
-
float(x): converts the input argument to floating-point (using type casting)
-
shared(dims), shared_zeros(dims), shared[T](dims): the function has a special meaning - allocation of shared memory (see Subsection 2.4.4↓).
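For example, the atomic meaning of += (see Section 2.1↑) makes it possible to compute a histogram without explicit synchronization. The following is a sketch: the function and variable names are hypothetical and the input values are assumed to lie in [0,1):
function [] = __kernel__ hist_kernel(im : mat, hist : vec, nbins : int, pos : ivec2)
bin = int(im[pos]*(nbins-1))
hist[bin] += 1  % atomic update: safe against data races
endfunction
im = rand(256, 256)
hist = zeros(64)
parallel_do(size(im), im, hist, 64, hist_kernel)
print sum(hist)  % equals numel(im)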
2.4.2 Device functions
In the example in the previous section, the linear interpolation function lerp is defined as:
lerp = __device__ (a : scalar, b : scalar, d : scalar) -> a + (b - a) * d
Device functions are the only functions (next to kernel functions) that can be called from a kernel/device function. The __device__ attribute specifies that the function should be natively compiled for the targeted computation engine (e.g. CUDA, CPU); however, in contrast to kernel functions, device functions cannot be used as argument to a call of the parallel_do function. Device functions are hence useful to aid the writing of kernel functions. For example, if one often needs a 2D vector that is orthogonal to a given 2D vector, one can define:
orth = __device__ (x : vec2) -> [-x[1], x[0]]
The function orth can then be used from other functions (also outside kernel/device functions).
Table 2.6↓ lists whether functions of different types can call each other. Note that there are a number of combinations that are not supported:
-
A device function cannot call a host function. This is simply because “default” functions are, by default, not natively compiled. However, in many cases, it is possible to convert the host function to a device function, by adding the __device__ modifier.
-
A device function cannot call a kernel function directly, nor can a kernel function call another kernel function (unless parallel_do is used, see further in Section 4.4↓). This is because kernel functions have special facilities for parallelization (e.g. they can use OpenMP etc).
-
However, a host function can call a device function. This is useful for declaring functions that can be used both from host code and from kernel code. An example is the sinc function:
sinc = __device__ (x:scalar) -> x == 0 ? 1.0 : sin(x)/x
print sinc(0) % call the device function
Table 2.6 Quasar: which function types can call ...?

From/To    | “host” | __device__ | __kernel__
“host”     | Yes    | Yes        | parallel_do only
__device__ | No     | Yes        | No
__kernel__ | No     | Yes        | parallel_do only
Remark: kernel and device functions have dedicated types, containing respectively
__kernel__ and
__device__ type modifiers. For the above definitions:
imfilter_kernel_nonsep_mirror_ext : _
[__kernel__(cube,cube,mat,ivec2,ivec3) -> ()]
lerp : [__device__(scalar,scalar,scalar) -> scalar]
orth : [__device__(vec2) -> vec2]
These types can be used for defining more general functions that use device/kernel functions as input argument. For example:
add = __device__ (x : scalar, y : scalar) -> x + y
sub = __device__ (x : scalar, y : scalar) -> x - y
mul = __device__ (x : scalar, y : scalar) -> x * y
orth = __device__ (x : vec2) -> [-x[1], x[0]]
ident = __device__ (x : scalar) -> sub(add(x, 2*x), 2*x)
function [] = __kernel__ my_kernel (X : mat, Y : mat, Z : mat, pos : ivec2)
Z[pos] = add(X[pos], Y[pos])
v = orth([X[pos], Y[pos]])
endfunction
X = ones(4,4)
Y = eye(4)
Z = zeros(size(X))
parallel_do(size(Z),X,Y,Z,my_kernel)
One special feature of device functions, is that they can be used as function pointers and passed to kernel functions. This can be used to reduce the number of kernel functions, or as an alternative to dynamic code generation:
% Definition of a __device__ function type
type binary_function : [__device__ (scalar, scalar) -> scalar]
add = __device__ (x : scalar, y : scalar) -> x + y
sub = __device__ (x : scalar, y : scalar) -> x - y
mul = __device__ (x : scalar, y : scalar) -> x * y
function [] = __kernel__ arithmetic_op(Y : cube, _
A : cube, B : cube, fn : binary_function, pos : ivec3)
Y[pos] = fn(A[pos], B[pos])
endfunction
A = ones(50,50,3)
B = rand(size(A))
Y = zeros(size(A))
parallel_do(size(Y),Y,A,B,add,arithmetic_op)
Unfortunately, there is a performance penalty associated with function pointer calls: the extra indirection prevents the compiler from inlining the function. For this reason, when possible, the Quasar compiler will attempt to avoid function pointer calls (by substituting the exact function).
2.4.3 Memory usage inside kernel or device functions
There are four types of memory that can be used inside kernel or device functions:
-
local memory: this is memory that is local to the function; each parallel run of the kernel function (called a ‘thread’) has a private copy of this memory. Below are a few examples of the creation of local memory:
A = [0, 1, 2, 3] % Generates a variable of type ’ivec4’
B = [0., 1., 2., 3.] % Generates a variable of type ’vec4’
C = [1 + 1j, 2 - 2j] % Generates a complex-valued variable of type ’cvec2’
D = ones(6) % Generates a vector of length 6, filled with 1.
E = zeros(8) % Generates a vector of length 8, filled with 0.
F = complex(zeros(4)) % Generates a complex-valued vector of length 4
For the GPU computation engine, there are however a few limitations: local memory is internally stored in device registers, which makes it very fast but also scarce. When the maximum number of device registers is exceeded, global memory is used instead (which has a much larger memory access latency).
-
shared memory: this type of memory is shared across threads, and allocated using the functions shared and shared_zeros. Its usage is discussed in Subsection 2.4.4↓.
-
global memory: this type of memory is used for storing vectors and matrices with either large dimensions or dimensions that cannot be determined at compile-time. Global memory is also used for dynamically allocated objects (see Section 8.3↓). For example:
function [] = __kernel__ my_kernel (X : mat, pos : ivec2)
% X[pos] is stored in global memory
endfunction
X = ones(4096,4096)
parallel_do(size(X), my_kernel)
Here, X is allocated outside a kernel function. The values of X, in total 4 × 4096 × 4096 bytes (in case of 32-bit floating point), are automatically stored in global memory in a linear way. The following formula is used to translate a 3D position (dim1, dim2, dim3) to a linear index:
index(dim1, dim2, dim3) = (dim1 · N2 + dim2) · N3 + dim3, where N2 and N3 denote the sizes of the second and third dimensions.
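As a small worked example: for a cube of size 2 × 3 × 4 (so N2 = 3 and N3 = 4), the element at position (1, 2, 3) maps to linear index (1 · 3 + 2) · 4 + 3 = 23, i.e. the last of the 24 elements.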
The built-in function ind2pos(size, index) performs the inverse calculation, mapping a linear index back to a position. Global memory can reside in CPU memory, GPU memory, or both. When calling a kernel function using parallel_do in the GPU computation engine, the global memory is automatically transferred to the GPU. Because the maximum amount of local and shared memory is limited by the hardware (e.g., not more than 48K), global memory is the only way to pass large amounts of data to a kernel function. The only restriction is that a kernel/device function cannot allocate global memory itself; the memory should be allocated in advance and passed to the function.
In some cases, a Quasar program may run out of global GPU memory. In that case, Quasar will automatically transfer a non-frequently used memory buffer back to the CPU; this buffer can later be transferred back to the GPU. Using this technique, Quasar programs can use all the available memory in the system (both CPU and GPU).
-
texture memory: texture memory is read-only global memory that is internally optimized for spatial access patterns (whereas global memory is better suited to linear accesses). In particular, the data layout is optimized for texture sampling using nearest neighbor or linear interpolation. CUDA uses space-filling curves to optimize the data layout. See Section 9.3↓ for more information.
Finally, it is important to mention that local memory should be used sparingly (or at least with care), because on the GPU, local memory is mapped directly onto the device registers. In CUDA, the total size of the device registers is 32K for compute capability 2.0 and 64K for compute capability 3.0. However, the device registers are shared across all computing threads: hence, when invoking 512 threads in parallel, the total amount of local memory available to a kernel/device function is respectively 64 bytes and 128 bytes per thread! If more registers are used, the kernel will still execute, but global memory is used instead (this is called register spilling). To avoid the performance impact, the Quasar runtime decreases the number of threads being spawned. The maximum number of threads that a given kernel function uses can be determined using prod(max_block_size(my_kernel)). Also see Subsection 2.4.4↓ for more information.
2.4.4 Advanced usage: shared memory and synchronization
For this section, a basic familiarity with the GPU architecture is required. Internally, chunks of data are processed in blocks, as follows:
pos = [m,n,p]                  % position of the current element
blkpos = mod(pos, blkdim)      % position within the current block
blkidx = floor(pos/blkdim)     % index of the current block
blkcnt = ceil(size(y)./blkdim) % total number of blocks
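As a small worked example (with values chosen here for illustration only): for pos = [35, 70, 0], blkdim = [16, 32, 1] and size(y) = [128, 128, 1], these relations give blkpos = [3, 6, 0], blkidx = [2, 2, 0] and blkcnt = [8, 4, 1].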
Within one block, a thread can access data produced by the other threads running in parallel on that block. This is very useful for implementing special parallel algorithms, such as the parallel sum, parallel sorting, spatially recursive filters, etc. However, read/write operations of different threads can interfere (data races), so special care is needed.
Advanced usage consists of 1) using thread synchronization, 2) using shared memory, 3) dealing with data races.
The number of threads that run in parallel over one block can be calculated as prod(blkdim), i.e., the product of the block dimensions. Sometimes, it is necessary that each thread waits until a given operation has completed, by means of a thread barrier. All threads (within one block!) then wait until the operation completes. In Quasar, this is done using the syncthreads keyword:
function [] = __kernel__ my_kernel (y : mat, z : mat, pos : ivec2, idx : ivec2)
y[pos] = 10*idx
syncthreads % all threads wait here until the above operation has been completed.
z[pos] = y[pos]*2
endfunction
It is important to mention that thread synchronization is performed at the block level, rather than over the full grid. In the above example, when syncthreads is first encountered, only the values y[pos] with pos ∈ [0..blkdim[0]-1] × [0..blkdim[1]-1] will have been computed, and not the complete matrix y!
Finally, correct usage of syncthreads requires that all threads actually reach the barrier. For example, it is not allowed to put a synchronization barrier inside a conditional if...else... clause, unless it is certain that every thread encounters the same number of barriers while running the kernel function. syncthreads may also take a parameter indicating the synchronization granularity, for example syncthreads(block), syncthreads(warp), syncthreads(grid) or syncthreads(host). For more details, see Subsection 9.8.1↓.
Remark: note that syncthreads does not ensure that global memory writes performed by the block are completed and can be seen by other threads. If this is required, consider using memfence (see Subsection 9.8.3↓).
The GPU has several memory types (global memory, texture memory, shared memory, registers, etc.). Therefore, to achieve the best performance it is important to use the right memory type for each task. Global memory is used by default and registers are used for local calculations within kernel/device functions. Shared memory is visible to all threads of a kernel function within one block and can be allocated using the functions shared(.) or shared_zeros(.) from kernel functions. Its usage is as follows:
var1 = shared(dim); % vector
var2 = shared(dim1,dim2); % matrix
var3 = shared(dim1,dim2,dim3); % cube
var4 = shared_zeros(dim); % vector initialized with 0’s
var5 = shared_zeros(dim1,dim2); % matrix initialized with 0’s
var6 = shared_zeros(dim1,dim2,dim3); % cube initialized with 0’s
var7 = shared[uint8](dims) % generic memory allocation
var8 = shared_zeros[uint64](dims)
% generic memory allocation initialized with 0’s
syncthreads % REQUIRED in case of shared_zeros!!!
Shared memory is visible to and shared among the threads of one block. That means that, when moving to another block (i.e. when blkidx changes), the content of the shared memory cannot be relied upon. Use shared_zeros only when you need the memory to be initialized with zeros; the shared memory allocated with shared is left uninitialized (as in C/C++), which is often faster.
Important: one or multiple shared_zeros calls always need to be followed by a syncthreads statement (as shown in the example below). This is because the memory initialization by shared_zeros is performed in parallel: since all threads may subsequently start using the allocated memory, it is necessary to wait until the zero initialization has fully completed. In fact, the (internal) implementation of shared_zeros is as follows:
function [] = __kernel__ shared_mem_example(blkpos : ivec3, blkdim : ivec3)
A = shared(100) % One vector of 100 elements
% Compute the index of the current thread
threadId = (blkpos[0] * blkdim[1] + blkpos[1]) * blkdim[2] + blkpos[2]
nThreads = prod(blkdim) % Number of threads within one block
for i=threadId..nThreads..numel(A)-1 % Parallel initialization
A[i] = 0.0
endfor
syncthreads % Make sure all threads have finished before continuing!
% Is equivalent to
B = shared_zeros(100)
syncthreads % Make sure all threads have finished before continuing!
endfunction
There are however two caveats when using shared memory:
-
For the CUDA computation engine, the amount of shared memory per block is typically 32 KB. On CUDA architectures, shared memory is on-chip and much faster than off-chip memory; consequently, the amount of shared memory is limited. Taking into account that a (single precision) floating point value takes 4 bytes, the maximum dimensions of a square block of shared memory that you can allocate are 64 × 64. The more shared memory a kernel uses, the fewer blocks can be executed in parallel on the GPU.
-
To obtain maximal performance benefits when using shared memory, it is important to make sure that the compiler can statically determine the amount of memory that will be used by the kernel function. If it cannot, the compiler will assume that the kernel function takes all of the available shared memory on the GPU, which prevents the hardware from processing multiple blocks in parallel. For example, if you request
x = shared(20,3,6)
the compiler will reserve 20 × 3 × 6 × 4 bytes = 1440 bytes for the kernel function. However, the arguments of the function shared are often non-constant. In this case you can use assertions (see further in Chapter 5↓):
assert(M<8 && N<20 && K<4)
x = shared(M,N,K)
With this assertion, the compiler is able to infer the amount of required shared memory, in this case 8 × 20 × 4 × 4 bytes = 2560 bytes. The compiler then gives the following message:
Information: Calculated an upper bound for the amount of shared memory: 2560 bytes
Due to these restrictions, shared memory should be used in a “smart” way and with care.
To solve data races, one can use atomic operations (e.g. +=, -=, *=, /=, ^=, ...). Atomic operations are serialized, so the end result of the computation will always be correct. Atomic operations are often used in combination with synchronization barriers (see above). For example:
function [] = __kernel__ my_kernel(x : mat, y : vec, blkpos : ivec2, blkdim : ivec2)
bins = shared(blkdim) % allocates shared memory (one accumulator per thread of the block)
nblocks = (size(x)+blkdim-1)./blkdim
% step 1 - do some computations
val = 0.0
for m=0..nblocks[0]-1
for n=0..nblocks[1]-1
val += x[blkpos + [m,n] .* blkdim]
endfor
endfor
bins[blkpos] = val
% step 2 - synchronize all threads using this barrier
syncthreads
% Now it is safe to read from the variable bins
endfunction
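The kernel above uses a barrier but no atomic updates. As a more direct illustration of atomic operations, the following hedged sketch (hypothetical kernel name hist_kernel, assuming x holds integer values in 0..255) builds a histogram in global memory:
function [] = __kernel__ hist_kernel(x : mat, hist : vec, pos : ivec2)
    hist[x[pos]] += 1   % += is an atomic operation, so concurrent updates of the same bin are safe
endfunction
x = floor(256*rand(512,512))   % random integer values in 0..255
hist = zeros(256)
parallel_do(size(x), x, hist, hist_kernel)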
Hint: only use atomic operations when necessary. If there is no possibility of a data race, it is more efficient to use the non-atomic counterparts (e.g. y[pos] = y[pos] + 1). For GPUs, atomic operations in shared memory are also more efficient than atomic operations in global memory. Often, atomic operations can be avoided altogether using the parallel reduction algorithm (see Section 8.4↓).
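For instance, instead of atomically incrementing a single global accumulator, each block can first accumulate in shared memory and write only one partial result. The hedged sketch below (hypothetical kernel name block_sum; sizes chosen for illustration, and deliberately simpler than the full parallel reduction algorithm of Section 8.4↓) lets thread (0,0) of each block sum a shared tile:
function [] = __kernel__ block_sum(x : mat, y : mat, pos : ivec2, blkpos : ivec2, blkdim : ivec2, blkidx : ivec2)
    assert(blkdim[0] <= 16 && blkdim[1] <= 32) % upper bound so the compiler can size the shared memory
    s = shared(blkdim[0], blkdim[1])
    s[blkpos] = x[pos]
    syncthreads                                % make sure the whole tile is written before reading it
    if blkpos[0] == 0 && blkpos[1] == 0
        total = 0.0
        for m = 0..blkdim[0]-1
            for n = 0..blkdim[1]-1
                total += s[m,n]
            endfor
        endfor
        y[blkidx] = total                      % one partial sum per block, no global atomics needed
    endif
endfunction
x = rand(256,256)
sz = max_block_size(block_sum, [16,32])
y = zeros(ceil(size(x)./sz))
parallel_do([size(x),sz], x, y, block_sum)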
The warp size is the number of threads in a warp, a subdivision used by the GPU hardware for memory coalescing and instruction dispatch. The warp size is important for branching: branch divergence occurs when not all threads within a warp follow the same execution path; this should be avoided as much as possible. For recent GPUs, the warp size is typically 32. The warp size is also important to know when accessing constant memory (see Section 9.1↓): constant memory works most efficiently when all threads within one warp access the same memory location at the same time.
In Quasar, the warp size can be requested using the special kernel function parameter warpsize.
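As a minimal hedged sketch (hypothetical kernel name warp_info; the int type used for warpsize is an assumption), the parameter is simply added to the kernel signature:
function [] = __kernel__ warp_info(y : vec, pos : int, warpsize : int)
    y[pos] = warpsize   % typically 32 on recent GPUs
endfunction
y = zeros(64)
parallel_do(size(y), y, warp_info)
print y[0]              % e.g. 32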
Specifying the GPU block size
By default, the GPU block size is determined automatically by the runtime system. For situations in which explicit control of the block size is required, it is also possible to specify the block size (blkdim) manually via parallel_do. For example:
sz = max_block_size(my_kernel, my_block_size)
parallel_do([dims,sz],...,my_kernel)
where dims and sz are both vectors of equal length. The requested block size my_block_size typically depends on the amount of shared memory you want to use within the kernel. The built-in function max_block_size computes the maximally allowed block size for the given kernel function. Note that:
-
For the CUDA computation engine, the maximum block size is limited by the maximum number of threads per block. For CUDA compute capability 2.0, the total number of elements in a block must be ≤ 1024. In practice, this number is often even lower, due to the resources used by the kernel (registers, shared memory, ...).
-
For optimal performance, the number of elements in each block should be a multiple of the warp size (typically 32). To optimize the memory access pattern, it may even be desirable to make the block width itself a multiple of 32.
-
For the CPU computation engine, the maximum block size is unlimited, unless synchronization (syncthreads) is used. In this case, the block size is limited depending on the number of multi-processors in the system.
-
Performance is not necessarily proportional to the number of threads per block. In some cases, the optimal launch configuration has a block size that is as small as 64 or 128.
These considerations are handled transparently by the function max_block_size(.). Hence one should always call max_block_size to determine the maximal block size for a given kernel function.
Warning: specifying the block size (especially without using max_block_size) should be done with care, because when the block size is too small, the performance of the kernel function may be severely impacted. Additionally, the code may fail or work less optimally on future computation devices.
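As a usage sketch (hypothetical kernel my_kernel operating on a matrix X; the requested 16 × 16 block is an arbitrary choice, e.g. to match a 16 × 16 shared memory tile):
sz = max_block_size(my_kernel, [16,16])   % clamp the requested block size to what the kernel allows
parallel_do([size(X),sz], X, my_kernel)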
Block size not specified: what happens?
In case the block size is not specified, it can still be accessed from within the kernel function through blkdim (this parameter should then be added to the argument list). The Quasar runtime system computes the block size that it estimates to be optimal for the given kernel, according to some heuristics; quite often, this will be 16 × 32. Note that the block size is always a divisor of the dimensions dims. When necessary, the block size for a given kernel function can be retrieved programmatically using opt_block_size:
sz = opt_block_size(my_kernel)
Note that opt_block_size uses an internal optimization method to determine the best possible block size for the given data dimensions, taking into account the resources used by the kernel function (e.g., registers, shared memory, ...). Furthermore, it always returns a block size that the hardware can handle.
Large vector/matrix dimensions that are not a power/multiple of 2.
It is best to specify dimensions to parallel_do that are a multiple of the maximal block size (e.g. 16 × 32 or 32 × 16). GPU computation engines work best with input data dimensions that are a multiple of (a power of) two. In the following example, this is not the case:
function [] = __kernel__ my_kernel(y : vec’unchecked, pos : int)
y[pos] = 1.0
endfunction
y = zeros(65535)
parallel_do(size(y),y,my_kernel) % errorInvalidValue
To ensure proper functioning of the program, the runtime system internally pads the input dimensions to a multiple of the block size, as follows:
function [] = __kernel__ my_kernel(y : vec’unchecked, pos : int)
if pos >= 0 && pos < numel(y)
y[pos] = 1.0
endif
endfunction
y = zeros(65535)
pad = x -> ceil(x / BLOCK_SIZE) * BLOCK_SIZE % Block size is determined automatically
parallel_do(pad(numel(y)),y,my_kernel) % Success!
Note that this is performed completely transparently to the user, but it comes at a slight performance cost: 1) the position check if pos >= 0 && pos < numel(y) is performed by all threads, and 2) some threads (the ones for which the if-test fails) remain “inactive” as a result.