8 Special programming patterns
8.1 Matrix/vector expressions
Operations on large matrices are grouped and automatically converted into a kernel function (sometimes called
broadcasting). For example:
x = randn(512,512,64)
y = 0.1 + (0.8 * 255 * sin(x/255)) + 10 * w
will automatically be translated to:
function [_out:cube]=opt__auto_optimize1(x:cube,w:cube)
function [] = __kernel__ opt__auto_optimize1_kernel _
(_out:cube'unchecked,x:cube'unchecked,w:cube'unchecked,pos:ivec3)
    _out[pos]=((0.1+(204*sin((x[pos]/255))))+(10*w[pos]))
endfunction
_out = uninit(size(x))
parallel_do(size(x),_out,x,w,opt__auto_optimize1_kernel)
endfunction
reduction (x:cube, w:cube) -> ((0.1+(204*sin((x/255))))+(10*w)) = _
opt__auto_optimize1(x,w)
x = randn(512,512,64)
y = opt__auto_optimize1(x, w)
which is faster, because intermediate results are computed directly in local memory, without accessing the global memory (see Subsection 2.4.3↑). Note that this procedure depends on the success of the type inference. In some cases, it may be necessary to give the compiler a hint about the types of certain variables, through assert(type(var,"typename")) (see Section 2.2↑). Also, the expression optimizer generates a reduction to deal with expressions of the form ((0.1+(204*sin((x/255))))+(10*w)). When the same expression appears several times in the code, even in a slightly modified form (e.g. sin(sinc(x)/255) instead of sin(x/255)), the generated __kernel__ function will be re-used.
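For instance, when the type of a variable cannot be inferred automatically (in the sketch below, w is assumed to come from a hypothetical function load_coefficients with an unknown return type), a hint can be provided as follows:
x = randn(512,512,64)
w = load_coefficients("weights.dat")   % hypothetical function with an unknown return type
assert(type(w,"cube"))                 % type hint: enables the expression optimization
y = 0.1 + (0.8 * 255 * sin(x/255)) + 10 * w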
The expression optimization can be configured using the following pragma:
#pragma expression_optimizer (on|off)
8.2 Loop parallelization/serialization
In Quasar, loop parallelization consists of 1) the detection of parallelizable (or serializable) loops and 2) the automatic generation of kernel functions for these loops. The automatic loop parallelizer (ALP) attempts to parallelize for-loops, starting with the outermost loops first. The ALP automatically recognizes one, two and three dimensional for-loops, and also maximizes the dimensionality of the loops subject to parallelization. The ALP extracts the loops and converts them to kernel functions (see
Subsection 2.4.1↑), which are subsequently optimized using target-specific code transformations (see
Chapter 17↓). However, not every for-loop can be parallelized and executed on a GPU; a number of restrictions exist:
-
All variables that are being used inside the loop must have a static type (i.e. explicitly typed as vec, mat, cube, see Section 2.2↑) or a static type can be inferred from the context (type inference or through explicit/implicit specialization, see Chapter 6↑). Practically speaking: only types that can be used inside __kernel__ or __device__ functions are allowed.
-
For the best performance, the bounds a and b in slicing operations such as A[a..b,2] must be constant and known at compile time (either specified explicitly, or obtained through constant propagation). When these constants are not known, a nested kernel function will be generated. Depending on the context, this may or may not involve dynamic kernel memory (see Section 8.3↓).
-
The for-loop nest must be perfect (no code in between subsequent for statements, no dependencies between the loop boundaries, no break, no continue), for example:
for m=0..size(x,0)-1
for n=5..size(y,0)-1
endfor
endfor
This is an example of an imperfect loop:
for m=0..size(x,0)-1
for n=m..size(y,0)-m
endfor
endfor
(Note: certain types of imperfect loops can also be handled using the imperfect loop function transform; this requires placing {!function_transform enable="imperfectloop"} inside the for-loop. A sketch is given after this list.)
-
Only types that can be used inside kernel or device functions are allowed (e.g., types based on class and mutable class, but not dynamic class).
-
When host (i.e., non-kernel/device) functions are called from a loop, the loop is not eligible for parallelization/serialization. In particular, only a limited number of built-in functions are supported. Functions that interact with or take variables of unspecified types (such as print, load, save, ...) are not supported.
-
Data dependencies/conflicts between different iterations are detected and not allowed. In case a dependency is detected, the loop can be serialized. In this case, the for-loop will be natively compiled (in C++) and executed on the CPU in single-threaded mode.
-
Advanced kernel function features such as shared memory and thread synchronization (see Subsection 2.4.4↑) are not supported, because these functions often require low-level access to the block position (blkpos) and dimensions (blkdim).
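The following is a minimal sketch of the imperfect loop transform applied to a triangular loop (array names are illustrative; the code attribute is placed inside the outer loop, as described above):
x = randn(256,256)
y = zeros(size(x))
for m=0..size(x,0)-1
    {!function_transform enable="imperfectloop"}
    for n=m..size(x,1)-1
        y[m,n] = x[m,n] + x[n,m]
    endfor
endfor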
When one of these conditions is violated, an error message is generated and the code is not parallelized. By default, the code is then interpreted by the Quasar interpreter (often resulting in a significantly slower execution). It is therefore recommended to resolve the issues raised by the automatic for-loop parallelizer.
The ALP can be configured using the pragma (see
Section 17.3↓):
#pragma loop_parallelizer (on|off) and the global configuration setting
COMPILER_AUTO_FORLOOP_PARALLELIZATION (see
Table 17.3↓).
Individual for-loop parallelization can also be controlled by placing a code attribute (see
Section 4.11↑) in front of the for-loop definition. The following code attributes are available:
-
{!parallel for}: forces the loop to be parallelized, despite dependencies (i.e., potential data races) detected between the variables; in that case, a warning message is shown.
-
{!parallel for; dim=n}: forces the following n loops to be parallelized jointly. Note that this requires the for-loop openings to be placed directly after each other, without any intermediate statements (apart from comments). Explicitly specifying the dimension allows the outer n loops to be parallelized even when there are additional inner loops. When the dim parameter is omitted, the compiler attempts to select a maximal value for dim depending on the inner for-loops. A sketch using dim=2 is given below, after the example.
-
{!parallel for; multi_device=true}: causes the for-loop to be parallelized over multiple devices (at least, for runtime engines that support multiple devices, see Chapter 11↓).
-
{!serial for}: serializes the for-loop, for execution on devices that support serial execution (e.g., the CPU).
-
{!interpreted for}: forces the loop to be interpreted. This comes at a computational cost, but can still be useful for, e.g., debugging purposes.
An example of ALP code attributes is given below:
im = imread("image.tif","rgb")
{!parallel for}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
im[m,n,0..2] = 255-im[m,n,0..2]
endfor
endfor
Most of the time, {!parallel for} is not required because the compiler is able to detect the parallelism automatically. However, {!parallel for} ensures that the subsequent loop will be parallelized, generating a compiler error when this fails.
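The dim parameter can be illustrated with the following sketch, in which the two outer loops are parallelized jointly while the inner loop over the color channels remains sequential (array names are illustrative):
im = imread("image.tif")
out = zeros(size(im,0),size(im,1))
{!parallel for; dim=2}
for m=0..size(im,0)-1
    for n=0..size(im,1)-1
        total = 0.0
        for k=0..size(im,2)-1
            total += im[m,n,k]    % inner loop: not parallelized
        endfor
        out[m,n] = total / size(im,2)
    endfor
endfor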
8.2.1 While-loop serialization
The ALP is also able to serialize while loops, as in the following example:
{!serial for}
while !finished
x = a[0,index]
y = a[1,index]
index += 1
if x.^2 + y.^2 < 1
finished = true
endif
endwhile
8.2.2 Example: gamma correction
The following example illustrates a gamma correction operation:
im = imread("image.tif")
im_out = zeros(size(im))
gamma = 1.1
tic()
{!parallel for}
for i=0..size(im,0)-1
for j=0..size(im,1)-1
for k=0..size(im,2)-1
im_out[i,j,k] = im[i,j,k]^gamma
endfor
endfor
endfor
toc()
fig1 = imshow(im)
fig2 = imshow(im_out)
fig1.connect(fig2)
When no code attribute ({!parallel for}, {!serial for}) is used, the compiler will analyze the code, inspect the variable dependencies and decide whether the loop can be parallelized or serialized. It is therefore not strictly necessary to specify these code attributes; however, when they are specified, the compiler will generate an error in case the parallelization/serialization fails (e.g., due to a data dependency). Resolving these errors may allow the code to be parallelized.
Warning: when {!parallel for} is placed in front of a loop, the loop will be parallelized irrespective of the dependencies. Consequently, when {!parallel for} is used carelessly, the program behavior is affected and wrong results may be obtained!
8.3 Dynamic kernel memory
Very often, it is desirable to construct (non-fixed length) vector or matrix expressions within a for-loop (or a kernel function). Before Jan. 2014, this resulted in a compilation error “function cannot be used within the context of a kernel function” or “loop parallelization not possible because of function XX”. The transparent handling of vector or matrix expressions within kernel functions requires some special (and sophisticated) handling by the Quasar compiler and runtime. In particular, what is needed is dynamic kernel memory: memory that is allocated on the GPU (or CPU) during the operation of the kernel. The dynamic memory is disposed (freed) either when the kernel function terminates or at a later point.
There are a few use cases for dynamic kernel memory:
-
When the algorithm requires processing several medium-sized (e.g. 16x16) to large-sized (e.g. 128x128) matrices. For example, a kernel function that performs matrix operations for every pixel in the image. The size of the matrices may or may not be known in advance.
-
Efficient handling of multivariate functions that are applied to (non-overlapping or overlapping) image blocks.
-
When the algorithm works with dynamic data structures such as linked lists, trees, it is also often necessary to allocate “nodes” on the fly.
-
To use some sort of “scratch” memory that does not fit into the GPU shared memory (note: the GPU shared memory is 32K, but this needs to be shared between all threads; for 1024 threads this leaves 32 bytes of private memory per thread). Dynamic memory does not have such a stringent limitation. Moreover, dynamic memory is not shared and is disposed either 1) immediately when the memory is not needed anymore or 2) when a GPU/CPU thread exits. Correspondingly, when 1024 threads each use 32K, this requires less than 32MB, because the threads are logically, but not physically, in parallel.
In all these cases, dynamic memory can be used, simply by calling the zeros, ones, eye or uninit functions. One may also use slicing operators (A[0..255, 2]) in order to extract a sub-matrix. The slicing operations then take the current boundary access mode (e.g. mirroring, circular) into account. Dynamic kernel memory has the following advantages:
-
Memory can be allocated and released on the fly (for example, scratch pads within a kernel that exceed the shared memory size)
-
Functions can be written in a generic way so that they can handle any vector/matrix dimensions
However, there are also some drawbacks:
-
Despite the internal parallel memory allocator being quite efficient, code that uses dynamic kernel memory is generally 10x-25x slower than code that does not rely on dynamic kernel memory. This is due to 1) the execution cost of allocation and disposal operations (which are performed behind the scenes) and 2) the storage of the data in the global memory which is slower than, e.g., register memory.
-
The restrictions of the dynamic memory subsystem may cause runtime errors to be generated, for example when insufficient dynamic kernel memory is available. To alleviate this problem, it is possible to increase the amount of dynamic kernel memory by a configuration setting (in Redshift: Program Settings/runtime/kernel dynamic memory reserve), although this memory is then not available for other purposes.
Because of these drawbacks, an optimization warning is generated by the compiler. It is recommended to use dynamic kernel memory sparingly and only when there is no direct alternative. Several approaches exist to avoid using dynamic kernel memory:
-
Use of array size constraints (see Section 3.3↑). The resulting variables have the advantage that the memory is stored in the registers or on the stack (which is highly efficient).
-
Use of size-parametrized generic types (see Section 6.7↑).
-
Memory preallocation: allocate one large matrix (using the zeros or uninit functions) outside the kernel function and pass the matrix to the kernel function, as sketched below.
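A minimal sketch of the memory preallocation approach (kernel name, array names and sizes are illustrative; a one-dimensional kernel with pos : int is assumed): a scratch matrix is allocated once outside the kernel and indexed by the thread position inside it, so no dynamic kernel memory is needed.
function [] = __kernel__ filter_rows (x : mat, y : mat, scratch : mat, pos : int)
    % copy the row into the preallocated scratch buffer
    for n=0..size(x,1)-1
        scratch[pos,n] = x[pos,n]^2
    endfor
    % use the scratch buffer without any allocation inside the kernel
    for n=1..size(x,1)-2
        y[pos,n] = (scratch[pos,n-1] + scratch[pos,n] + scratch[pos,n+1]) / 3
    endfor
endfunction

x = randn(512,512)
y = zeros(size(x))
scratch = zeros(size(x))       % preallocated outside the kernel
parallel_do(size(x,0), x, y, scratch, filter_rows)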
The following program transposes 16x16 blocks of an image, creating a cool tiling effect. First, a kernel function version is given and then a loop version. Both versions are equivalent: in fact, the second version is internally converted to the first.
Kernel version
function [] = __kernel__ kernel (x : mat, y : mat, B : int, pos : ivec2)
r1 = pos[0]*B..pos[0]*B+B-1 % creates a dynamically allocated vector
r2 = pos[1]*B..pos[1]*B+B-1 % creates a dynamically allocated vector
y[r1, r2] = transpose(x[r1, r2]) % matrix transpose; x[r1, r2] creates a dynamically allocated matrix
endfunction
x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size
parallel_do(size(x,0..1) / B,x,y,B,kernel)
Loop version
x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size
{!parallel for}
for m = 0..B..size(x,0)-1
for n = 0..B..size(x,1)-1
A = x[m..m+B-1,n..n+B-1] % creates a dynamically allocated matrix
y[m..m+B-1,n..n+B-1] = transpose(A) % matrix transpose
endfor
endfor
8.3.2 Memory models
To accommodate the widest range of algorithms, two memory models are currently provided (more may be added in the future).
-
Concurrent memory model (default)
In the concurrent memory model, the computation device (e.g. GPU) autonomously manages a separate memory heap that is reserved for dynamic objects. The size of the heap can be configured in Quasar and is typically 32MB.
The concurrent memory model is extremely efficient when all threads (e.g. ≥ 512) request dynamic memory at the same time. The memory allocation is done by a specialized parallel allocation algorithm that significantly differs from traditional sequential allocators.
For efficiency, there are some internal limitations on the size of the allocated blocks:
-
The total amount of kernel dynamic memory is by default 128 MB, but can be configured (see Redshift/program settings/Runtime/Memory management)
-
The maximum size is by default 64 KiB (65536 bytes), but can be configured (see Redshift/program settings/Runtime/Memory management).
-
The minimum size is by default 64 bytes and scales linearly with the setting of maximum size. When the allocation request size is smaller than this minimum size, the request size is rounded up to the minimum size.
-
All Quasar modules need to use the same kernel dynamic memory settings: this is necessary so that dynamic memory blocks can be accessed correctly (allocated, deallocated) in between different modules
Also note that the maximum size also incorporates the management overhead (reference counting, block sizes, etc.), so if you intend to allocate a 128 × 128 matrix in 32-bit floating point precision (64 KiB of data), it is required to select a maximum size of 128 KiB.
The concurrent memory model is primarily designed to enable a large number of concurrent allocations of relatively small memory blocks (for example, to perform a local matrix calculation). It is generally not memory-efficient to allocate large memory blocks from a kernel function in the concurrent memory model. This is because often thousands of threads may be launched on the GPU, and correspondingly, thousands of dynamically allocated memory blocks may be active for a certain local calculation. However, the temporary memory blocks may easily consume several megabytes of GPU memory, and this may restrict the amount of GPU memory available for other purposes. For example, consider 2800 concurrent threads each using 128KB of data: this corresponds to 350MB of GPU memory!
Therefore, for large dynamic memory allocations, consider using the cooperative memory model.
-
Cooperative memory model
In the cooperative memory model, the kernel function requests memory directly from the Quasar allocator. This way, there are no limitations on the size of the allocated memory. Also, the allocated memory is automatically garbage collected. To enable the cooperative memory model from a kernel function, place {!kernel memorymodel="cooperative"} inside the kernel function (a sketch is given at the end of this list).
Because the GPU cannot launch callbacks to the CPU, this memory model requires the kernel function to be executed on the CPU.
Advantages:
-
The maximum block size and the total amount of allocated memory only depend on the available system resources.
Limitations:
-
Can only be used from CPU kernel functions.
-
The Quasar memory allocator uses locking (to a limited extent), so simultaneous memory allocations on all processor cores may be expensive.
-
Device functions can also use dynamic memory. The functions may even return objects that are dynamically allocated.
-
The following built-in functions are supported and can be used from within kernel and device functions: zeros, czeros, ones, uninit, eye, copy, reshape, repmat, shuffledims, seq, linspace, real, imag, complex, the mathematical functions, matrix/matrix multiplication and matrix/vector multiplication.
Note: when the above functions are applied to arrays with size constraints (see
Section 3.3↑), no dynamic kernel memory will be used.
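The following is a minimal sketch of the cooperative memory model (kernel name, array names and sizes are illustrative; a one-dimensional kernel with pos : int is assumed, and the kernel is executed on the CPU as required by this memory model):
function [] = __kernel__ large_scratch_kernel (x : mat, y : vec, pos : int)
    {!kernel memorymodel="cooperative"}
    % large per-thread buffer, served directly by the Quasar allocator
    scratch = zeros(size(x,1))
    for n=0..size(x,1)-1
        scratch[n] = x[pos,n]^2
    endfor
    total = 0.0
    for n=0..size(x,1)-1
        total += scratch[n]
    endfor
    y[pos] = sqrt(total)
endfunction

x = randn(128, 262144)    % rows too large for the concurrent memory model defaults
y = zeros(size(x,0))
parallel_do(size(x,0), x, y, large_scratch_kernel)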
8.3.4 Performance considerations
Dynamic kernel memory can greatly improve the expressiveness of Quasar programs; however, there are also a number of downsides that need to be taken into account.
-
Global memory access: code relying on dynamic memory may be slow (for linear filters on GPU: 4x-8x slower), not because of the allocation algorithms, but because of the global memory accesses. However, it all depends on what you want to do: for example, for non-overlapping block-based processing (e.g., blocks of a fixed size), the dynamic kernel memory is an excellent choice.
-
Static vs. dynamic allocation: when the size of the matrices is known in advance, static allocation (e.g., outside the kernel function) may be used as well. The dynamic allocation approach relieves the programmer from writing code to pre-allocate memory and from calculating the size as a function of the data dimensions. The cost of calling the functions uninit and zeros is negligible compared to the global memory access times (one memory allocation is comparable to 4-8 memory accesses on average; 16-32 bytes is still small compared to the typical sizes of allocated memory blocks). Because dynamic memory is disposed whenever possible when a particular thread exits, the maximum amount of dynamic memory in use at any time is much smaller than the amount of memory required for pre-allocation.
-
Use vecX, matXxY, cubeXxYxZ, ... types (with size constraint) whenever your algorithm allows it (see Section 3.3↑). This completely avoids using global memory, by using the registers instead. Once a vector of length 17 is created, the vector is allocated as dynamic kernel memory.
-
Avoid writing code that leads to thread divergence: in CUDA, instructions execute in warps of (typically) 32 threads. Within a warp, instructions of different threads are executed at the same time (called implicit SIMD). Control flow instructions (if, match, repeat, while) can negatively affect the performance by causing threads of the same warp to diverge, that is, to follow different execution paths. The different execution paths must then be serialized, because all of the threads of a warp share a program counter. Consequently, the total number of instructions executed for this warp is increased. When all the different execution paths have completed, the threads converge back to the same execution path.
-
To obtain best performance in cases where the control flow depends on the position (i.e., pos or blkpos), the controlling condition should be written so as to minimize the number of divergent warps.
8.4 Map and Reduce pattern
A popular parallel programming model is the map and reduce model. In this model, a Map() function is applied to the data and subsequently, the data is aggregated using a Reduce() function. In its simplest form, a sequential implementation would be as follows:
total = 0.0
for m=0..511
for n=0..511
total = total + map(im[m,n])
endfor
endfor
If this loop were parallelized directly, this would lead to data races, because the summation (total = total + ...) is not atomic. Luckily, the Quasar compiler atomizes this summation operation behind the scenes, leading to the following parallel loop:
total = 0.0
{!parallel for}
for m=0..511
for n=0..511
total += map(im[m,n])
endfor
endfor
For an overview how the compiler atomizes operations, see
Table 8.1↓. Note that this programming pattern may occur with multiple result values: the result can be a scalar value, a vector or even a matrix:
A = zeros(2,2)
{!parallel for}
for i=0..255
A[0,0] += x[i,0]*y[i,0]
A[0,1] += x[i,0]*y[i,1]
A[1,0] += x[i,1]*y[i,0]
A[1,1] += x[i,1]*y[i,1]
endfor
Direct implementation of this loop in OpenCL and CUDA would give a poor performance on GPU devices, due to all adds being serialized in the hardware (all threads need to write to the same location in memory, so there is a spin-lock that basically serializes all the memory write accesses). The performance is often much worse than performing all operations sequentially on a CPU!
The obvious solution is the use of shared memory and thread synchronization in combination with a parallel reduction (see
Section 10.6↓). In general, it is quite hard to write these kinds of algorithms while taking all side effects into consideration, such as register pressure and shared memory pressure. Therefore, the Quasar compiler now detects the above pattern and converts it into an efficient parallel reduction algorithm. There are some considerations:
-
The following operators are supported: addition (+=), subtraction (-=), multiplication (*=), minimum (__=), maximum (^^=), bitwise AND (&=), bitwise OR (|=) and bitwise XOR (~=).
-
The compiler allows maximal flexibility in defining the body of the loop: i.e., it is possible to use control structures (if, loop, while, ...). Also, other intermediate values can be calculated, hence it is not a prerequisite that the loop only consists of summation statements such as in the above example.
-
The reduction may also perform a dimension reduction (for example, summing a matrix along the rows), as sketched after this list. Currently, the Quasar compiler supports dimension reductions for which the number of dimensions is reduced by 1 at a time (dimension reduction in multiple dimensions at a time may be supported in future versions).
-
There is an upper limit on the number of accumulators (due to the size limit of the shared memory). For 32-bit floating point, up to 32 accumulators and for 64-bit floating point, up to 32 accumulators are supported. When the upper limit is exceeded, the generated code will still work, but the block size will silently be reduced (see also Section 9.9↓). This, together with the impact on the occupancy (due to high number of registers being used) might lead to a performance degradation.
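The following sketch illustrates such a dimension reduction by computing the sum of each row of a matrix (array names are illustrative):
x = randn(256, 512)
row_sum = zeros(size(x,0))
{!parallel for}
for m=0..size(x,0)-1
    for n=0..size(x,1)-1
        row_sum[m] += x[m,n]    % accumulation along the columns only
    endfor
endfor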
Table 8.1 Atomization table of operators

Non-atomized    | Atomized | Description
a = a + b       | a += b   | Addition
a = a - b       | a -= b   | Subtraction
a = a * b       | a *= b   | Multiplication
a = max(a, b)   | a ^^= b  | Maximum
a = min(a, b)   | a __= b  | Minimum
a = and(a, b)   | a &= b   | Bitwise AND
a = or(a, b)    | a |= b   | Bitwise OR
a = xor(a, b)   | a ~= b   | Bitwise XOR
8.5 Cumulative maps (prefix sum)
A related, frequently occurring programming pattern is the cumulative map, for example:
{!parallel for}
for m=0..511
for n=0..511
im[m,n] = im[m-1,n] + map(im[m,n])
endfor
endfor
Here, we assume that the 'safe access modifier is used (see
Subsection 2.4.1↑), so that im[m-1,n] for m=0 is equal to 0. Some applications of the cumulative map pattern are cumulative sums (e.g., integral images), infinite impulse response (IIR) filters, etc. The cumulative map pattern is closely related to the map and reduce pattern, with the difference that the intermediate calculated values also need to be stored in memory.
The parallel algorithm for efficiently calculating a cumulative map is the prefix sum. By static analysis, the Quasar compiler is able to recognize the above pattern and convert it into a parallel prefix sum algorithm. The conditions are mostly the same as in case of the parallel reduction pattern:
-
The following operators are supported: addition (+=), subtraction (-=), multiplication (*=), minimum (__=), maximum (^^=), bitwise AND (&=), bitwise OR (|=) and bitwise XOR (~=).
-
Again, the compiler allows maximal flexibility in defining the body of the loop: i.e., it is possible to use control structures (if, loop, while, ...).
-
One restriction is that the dependency should be im[m,n] ← im[m-k,n] where k is always 1. Higher-order cumulative patterns (e.g., as in higher-order IIR filters) are not supported yet. In this case, it is recommended to decompose the filter into a cascade of first-order IIR filters; each filter of the cascade can then be individually parallelized using the cumulative map pattern (see the sketch below).
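As an illustration, a first-order IIR filter along the columns of an image can be written as a cumulative map (a minimal sketch; the filter coefficient is illustrative and the 'safe access modifier is assumed, as above):
im = imread("image.tif")[:,:,0]
a = 0.5                          % illustrative filter coefficient
{!parallel for}
for m=0..size(im,0)-1
    for n=0..size(im,1)-1
        % with the 'safe access modifier, im[-1,n] evaluates to 0
        im[m,n] = a * im[m-1,n] + im[m,n]
    endfor
endfor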
8.6 Meta functions
Note: this section gives more advanced info about how internal routines of the compiler can be accessed from user code. Normally these functions do not need to be used directly, however this information can still be useful for certain operations.
Quasar has a special set of built-in functions that are aimed at manipulating expressions at compile time (although in the future the implementation may also allow them to be used at run time). These functions are special because they do not follow the regular evaluation order (i.e. they can be evaluated from the outside to the inside of the expression, depending on the context). To clearly distinguish them from the host functions, these functions start with the prefix $.
For example, x is an expression, as well as x+2 or (x+y)*(3*x)^99. A string can be converted (at runtime) to an expression using the function eval. This is useful for runtime processing of expressions, for example expressions entered by the user. However, the opposite is also possible:
print $str((x+y)*(3*x)^99) % Prints "(x+y)*(3*x)^99"
This is similar to the string-izer macro symbol in C:
#define str(x) #x
However, there are a lot of other things that can be done using meta functions. For example, an expression can be evaluated at compile-time using the function $eval (which differs from eval)
print $eval(log(pi/2)) % Prints 0.45158273311699, but the result is computed at compile-time.
The $eval function also works when constant variables are referred to (i.e. variables whose values are known at compile time). Although this seems quite trivial, this technique opens new doors for compile-time manipulation of expressions, in a way that is completely different from C/C++ but somewhat similar to Maple or LISP macros.
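For instance (a minimal sketch; the constant c is illustrative):
c = 4                          % value known at compile time
print $eval(c^2 + 1)           % prints 17, computed at compile time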
Below is a small overview of the meta functions in Quasar:
-
$eval(.): compile-time evaluation of expressions
-
$str(.): conversion of an expression to string
-
$subs(a=b,.): substitution of a variable by another variable or expression
-
$check(.): checks the satisfiability of a given condition (the result is either valid, satisfiable or unsatisfiable), based on the information that the compiler has at this point.
-
$assump(.): returns an expression with the assertions of a given variable
-
$simplify(.): simplifies boolean expressions (based on the information of the compiler, for example constant values etc.)
-
$args[in](.): returns an expression with the input arguments of a given function.
-
$args[out](.): returns an expression with the output arguments of a given function.
-
$nops(.): returns the number of operands in the expression
-
$op(.,n): returns the n-th operand of the expression
-
$ubound(.): calculates an upper bound for the given expression
-
$specialize(func,.): performs function specialization (see Section 6.5↑)
-
$inline(lambda)(...): performs inlining of lambda expressions/functions (a short example is given at the end of this section)
-
$ftype(x) with x="__host__"/"__device__"/"__kernel__": determines whether we are inside a host, device or kernel function.
-
$typerecon(x,y) : reconstructs the type of the specified function with a given set of (specialized) input parameters. For example, given a function f = x -> x, $typerecon(f,type(x,scalar)) will return [scalar->scalar]. This meta function is mainly used internally in the compiler.
-
$target(.): indicates whether we are compiling for the given target platform (see Subsection 17.2.6↓).
Notes: a reduction with a where-clause such as
reduction (x : scalar) -> log(x) = x - 1 where $ftype("__device__") && abs(x - 1) < 1e-1
means that the reduction for log(x) may only be applied inside __device__ functions, when the condition abs(x - 1) < 1e-1 is met. Here, this is simply a linear approximation of the logarithm around x==1.
Example: copying the type and assumptions from one variable to another
It is possible to write statements such as "assume the same about variable 'b' as what is assumed about 'a'". This includes the type of the variable (as in Quasar, the type specification is nothing more than a predicate).
a : int
assert(0 <= a && a < 1)
b : ??
assert($subs(a=b,$assump(a)))
print $str($assump(b)) % Prints "type(b,"int") && 0 <= b && b < 1"
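Finally, a short sketch of $inline applied to a lambda expression (the function f is illustrative):
f = x -> 2*x + 1
y = $inline(f)(3)      % the call to f is expanded at compile time; y == 7
print y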