Chapter 8: Special programming patterns Up Main page Chapter 10: Parallel programming examples 

9 GPU hardware features

The GPU was originally designed for computer graphics and there are a lot of other facilities available to speed up GPU applications. In this Section, we describe a number of advanced GPU techniques from which Quasar programs can also potentially benefit. In this section, we describe several GPU specific optimization techniques that can easily be used from Quasar programs.

9.1 Constant memory and texture memory

The GPU hardware provides several caches or memory types that are designed for dealing with (partially) constant data:
For practical purposes, the size of the constant memory is rather small, so it is mostly useful for storing filter/weight coefficients that do not change while the kernel is executed. On the other hand, the texture memory is quite large, has its own cache, and can be used for storing constant input signals/images.
In Quasar, constant/texture memory can be utilized by adding modifiers to the kernel function parameter types. The following modifiers are available (see Table 9.1↓):
Note that because of the different access mechanism, these modifiers cannot be combined.
Table 9.1 Overview of access modifiers controlling the GPU hardware texturing unit
Modifier Purpose
’hwconst The data needs to be stored in constant memory (max 64KB). Implies ’const
’hwtex_nearest Store the vector/matrix/cube in the GPU texture memory, use nearest neighbor lookup
’hwtex_linear Store the vector/matrix/cube in the GPU texture memory, use linear interpolation
’hwtex_const Does not store the data in the GPU texture memory, but instead uses
the non-coherent L2-cache of the GPU when accessing the data
For Fermi and later devices, global memory accesses (i.e., without ’hw* modifiers) are cached in the L2-cache of the GPU. For Kepler GPU devices, using ’hwtex_const the texture cache is utilized directly, bypassing the L2 cache. The texture cache is a separate cache with a separate memory pipeline and relaxed memory coalescing rules, which may bring advantages to bandwidth-limited kernels. [L]  [L] For more information, see Kepler tuning guide.
Starting with Maxwell GPU devices, the L1 cache and the texture caches are unified. The unified L1/texture cache coalesces the memory accesses, gathering up the data requested by the threads in a warp, before delivering the data to the warp. [M]  [M] For more information, see Maxwell tuning guide.
For using constant memory, we give the following guidelines:
However, the best is to investigate whether the modifier improves the performance (e.g. using the Quasar profiler).
Consider the following convolution program:
Default version with no constant memory being used:
function [] = __kernel__ kernel(x : vec, y : vec, f : vec, pos : int)
sum = 0.0
for i=0..numel(f)-1
sum += x[pos+i] * f[i]
y[pos] = sum
Version with constant memory:
function [] = __kernel__ kernel_hwconst(x : vec, y : vec, f : vec’hwconst, pos : int)
sum = 0.0
for i=0..numel(f)-1
sum += x[pos+i] * f[i]
y[pos] = sum
Version with constant texture memory for f:
function [] = __kernel__ kernel_hwtex_const(x : vec, y : vec, f : vec’hwtex_const, pos : int)
sum = 0.0
for i=0..numel(f)-1
sum += x[pos+i] * f[i]
y[pos] = sum
Version with constant texture memory for x and f:
function [] = __kernel__ kernel_hwtex_const2(x : vec’hwtex_const, y : vec, f : vec’hwtex_const, pos : int)
sum = 0.0
for i=0..numel(f)-1
sum += x[pos+i] * f[i]
y[pos] = sum
Version with HW textures (see Section 9.3↓):
function [] = __kernel__ kernel_tex(x : vec, y : vec, f : vec’hwtex_nearest, pos : int)
sum = 0.0
for i=0..numel(f)-1
sum += x[pos+i] * f[i]
y[pos] = sum
For 100 runs on vectors of size 20482, with 32 filter coefficients, we obtain the following results for the NVidia Geforce 980 (Maxwell architecture):
Default: 513.0294 ms
f: ’hwconst: 132.0075 ms
f: ’hwtex_const: 128.0074 ms
x,f: ’hwtex_const: 95.005 ms
f: ’hwtex_nearest: 169.0096 ms
It can be seen that using constant memory (’hwconst) alone yields a speed-up of almost a factor 5 in this case. The best performance is obtained with hwtex_const. Moreover, using shared memory (see Section 10.5↓), the performance can even further be improved to 85 ms.
Performance tip: use ’hwconst when all threads in a kernel access the same memory location simultaneously (for example, filter coefficients in a linear filter). When to use ’hwtex_nearest versus ’hwtex_const, depends on the way the data is reused (whether the matrix is readonly or needs to be copied often between the global memory and the texture memory): when the matrix is readonly and the access pattern is localized (e.g., a local window operation) or in combination with boundary access methods as ’safe, ’mirror, ’circular, ’clamped, it may be beneficial to use ’hwtex_nearest. When the matrix access pattern is random, a better choice might be ’hwtex_const. But as advice gives a worse performance in the above example, it is best to profile in order to be sure.

9.2 Shared memory designators

Shared memory designators provide a convenient approach to fetch blocks of data from global memory into shared memory of the GPU as well as a mechanism to write back the data.
As mentioned before in Subsection 2.4.4↑, shared memory is best used whenever possible to relieve stress on the global memory and the most common use is in a memory mapping technique (similar to DMA): a portion of the data is mapped onto the shared memory. Local updates are later to be written back to the global memory. Different from a cache is that the layout of the shared memory is fully controlled: each array element is mapped onto a specified array element in the global memory. To facilitate the use of shared memory (in particular, the error-prone copying of data from and to shared memory), shared memory designators have been added to Quasar.
Shared memory designators thus provide a novel complementary technique to the shared() function to allocate and update shared memory. A shared memory designator is specified using the
access modifier as follows:
function [] = __kernel__ kernel(A : mat, a : scalar, b : scalar, pos : ivec3)
B : ’shared = transpose(a*A[0..9, 0..9]+b) % fetch
% ... calculations using B (e.g., directly with the indices)
A[0..9, 0..9] = transpose(B) % store
The designator
tells the compiler that this variable is intended to be stored in shared memory of the GPU. However, rather than one thread calculating
, the compiler will generate code such that the calculations are distributed over the threads within the block. For this purpose, the compiler detects “thread-invariant” code related to the designated shared memory variables and modifies the code such that it is distributed over the threads, followed by the necessary thread synchronization.
For the above example, this will generate the following code:
function [] = __kernel__ kernel(A:mat,a:scalar,b:scalar,pos:ivec3,this_thread_block:thread_block)
for i=this_thread_block.thread_idx..this_thread_block.size..99
for $i=this_thread_block.thread_idx..this_thread_block.size..99
Note that the code using shared memory designators is significantly easier to understand, compared to the above code with low-level threading and synchronization primitives. It becomes then straightforward to speed up existing algorithms.
The shared memory designator technique relies on:
  1. vector/matrix size inference: the compiler can determine that
    size(transpose(a*A[0..9, 0..9]+b))==[10,10]
    , so that the appropriate amount of shared memory can be allocated
  2. cooperative groups (see 9.8↓):
    allows access to the low-level block functionality (size, position, thread index etc.)
functions exist as alternative, as a low-level interface to the shared memory. Summarizing, the differences are:
: ’shared
Type “Low” level “High” level
Thread distribution manual automatic
Use in kernel function Yes Yes
Use in device function No No
Use in for-loop No Yes
See the matrix example below for an example of how to use the shared memory designators from a for-loop.

9.2.1 How to use

When a variable is declared with the shared designator
: ’shared
, the compiler will scan for several patterns related to the variable
  1. Initialization
    : the standard way to initialize shared variables is
    S : ’shared = uninit(M,N,K)
    The full type information of
    ) is omitted here since the compiler can obtained it via type inference. The above is the equivalent of
    S = shared(M,N,K)
    , however
    S : ’shared  = uninit(M,N,K)
    allows to compiler to manage the shared memory accesses.
    As is the case with
    it is best that the parameters
    are either constant (declared using
    : int’const
    , or a type parameter of a generic function), or that upper bounds on
    are given via an assertion (e.g.,)
    ). This way, the compiler can calculate the amount of shared memory that is required for the kernel function execution.
  2. Fetch and broadbast:
    Instead of initializing with
    it is possible to initialize directly with an expression, for example:
    S1 : ’shared = transpose(a*A[0..9, 0..5]+b)
    S2 : ’shared = img[p[0]+(-c..16+c),p[1]+(-c..16+c),:]
    S3 : ’shared = sum(reshape(img,[M,numel(img)/M]),1)
    Instead of every thread calculating duplicate results (as would have been the case without using
    ), the calculations are distributed over the threads within the thread block. In other words, the compiler will do the heavy work and generate code using the block parameters
    (see before). After the operation, a thread synchronization (
    ) will implicitly be performed.
    This initialization-by-expression can also be seen as a fetch and broadcast: first the memory is copied from global memory to shared memory (with possibly some intermediate calculations), next once in shared memory, the data is available to all threads (after
    Through type inference, the compiler can determine the dimensions of
    . For example, in the first case, the compiler will determine
    S1 : mat’shared(6,10)
  3. Gather and store: process and copy back to global memory
    Using the same technique as with fetch and broadcast, the data stored in shared memory can be written back to the global memory:
    A[0..9, 0..9] = transpose(S1)
    B[[0]-1, ([1]-1)] = S2
Again, the calculations are distributed over the individual threads.
  1. Shared calculation
    This pattern incurs a loop over the shared variable, as in the following example:
    Sb : ’shared = uninit(M)
    for L=0..M-1
    D = diagonal(cholesky(Sa[L,0..3,0..3]))
    Sb[L] = log_PP + 2*sum(log(D))
    Again, instead of every thread performing the entire loop, the loop will be distributed over the thread block. This allows for some calculation of temporary variables for which the results are shared over the entire thread block. The approach is similar to fetch and broadcast with the difference that the loop to initialize the shared variable is explicitly written out.
  2. Parallel reduction
    This pattern is currently experimental, but the idea is to expand aggregation operations into a parallel reduction algorithm, as in the following example:
    {!parallel for; blkdim=M}
    for n=0..N-1
    B : ’shared = uninit(M)
    B[n] = ...
    total = sum(B)
    Notice the calculation of total, via the
    function. Rather than every thread computing
    independently, the calculation can again be distributed over the thread block. This technique provides a simple parallel reduction primitive.
    Currently, only the functions
    with one parameter are supported.

9.2.2 Virtual blocks and overriding the dependency analysis

To distinguish calculations that are position-dependent from calculations that can be shared within the thread block, the compiler performs a dependency analysis: it starts from the
kernel function parameter and then determines the variables that are dependent of
. Once the calculation of a variable depends on
, the calculation can not be distributed any more over the thread block - the result of the calculation needs to be different for each thread.
In some cases, it is desirable to override the dependency analysis. For example, the virtual block technique defines tiles with size that is independent of the GPU architecture. The mapping onto GPU blocks is then implicitly handled by the compiler.
N : int’const = 16 % virtual block size
{!parallel for; blkdim=[N,N]}
for m=0..size(A,0)-1
for n=0..size(A,1)-1
mp = int(m/N)
np = int(n/N)
Ap = A[mp,Np]
Here, the values
are constant within each thread block, however by the specific way that the variables
are calculated via modulo operation on the position indices
, the compiler cannot (yet) determine the constantness within the thread block. The following code attribute (to be placed inside the inner for loop or kernel function) indicates that, irrespective of the indexing, the variable
are constant over the thread block:
{!shared_mem_promotion assume_blk_constant={Ap}}
In the above example, the virtual block size and the GPU block size are fixed via
{!parallel for; blkdim=[N,N]}
, but this does not necessarily have to be this way. It is up to the programmer to indicate that variables are constant within the thread block.
In the future, “virtual blocks” which size differs from the GPU blocks may be implemented using collaborative thread groups.

9.2.3 Examples

In this section, we discuss three example use cases of shared memory designators: image histograms, image separable filtering and parallel reduction. Histogram

The calculation of a histogram is a good use case of the shared memory designators:
function [] = __kernel__ hist_kernel[Bins](im : vec, hist : vec(Bins), pos : int)
hist_sh : ’shared = zeros(Bins)
for m=pos..16384..numel(im)-1
hist_sh[int(im[m])] += 1
hist[:] += hist_sh % add the result
First, the histogram “scratchpad” is allocated in shared memory and initialized with zeros. Here the size of the histogram is a generic parameter, which guarantees that the compiler always has the exact size of the histogram. In the second step, the image is traversed using a so called “grid-strided loop”. This is to ensure that the histogram can be updated several times by the same thread before the results are written to the histogram in global memory, thereby reducing the number of shared to global memory copies. As a final step, the local histogram is added to the global histogram.
A more simple way to implement
would have been to not use shared memory at all,
as in the following kernel:
function [] = __kernel__ hist_kernel_global(im : vec, hist : vec, pos : int)
hist[int(im[pos])] += 1
A previous implementation technique in Quasar, via the
kernel transform, is now deprecated in favor of designated shared memory, which yields not only more easily readable code but is also more flexible in its use.
function y : vec’unchecked = hist_sharedmemcaching(im)
y = zeros(256)
{!parallel for}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
for k=0..size(im,2)-1
{!kernel_transform enable="sharedmemcaching"}
{!kernel_arg name=y; type="inout"; access="shared"; op="+="; cache_slices=y[:]; numel=256}
v = im[m,n,k]
y[v] += 1
On a Geforce GTX 980 GPU, the results are as follows:
Kernel function Execution time for 100 runs on a 512x512 image
hist_kernel 12.765 ms
hist_kernel_global 73.7107 ms
hist_sharedmemcaching 28.5985 ms
The shared memory technique gives a speed-up of a factor 5.6x! It is even outperforms the shared mem caching transform by a factor 2x!. Separable linear filtering

In this example, an implementation of separable linear filtering is given. Each block of the input image, including some border that depends on the filter length, is transferred to the shared memory.
function img_out = separable_linear_filtering[c : int](img_in : cube(:,:,3), fc : vec’hwconst(2*c+1))
function [] = __kernel__ kernel(img_out : cube(:,:,3), pos : ivec2)
p = pos - mod(pos, 16)
s1 : ’shared = img_in[p[0]+(-c..16+c),p[1]+(-c..16+c),:]
s2 : ’shared = uninit(16+2*c+1,16,3)
% work shared along the threads
for m=0..16+2*c
for n=0..15
total = zeros(3)
for k=0..numel(fc)-1
total += fc[k] * s1[m,n+k,:]
s2[m,n,:] = total
total1 = zeros(3)
[m1,n1] = pos-p
for k=0..numel(fc)-1
total1 += fc[k] * s2[m1+k,n1,:]
img_out[pos[0],pos[1],:] = total1
img_out = uninit(size(img_in))
parallel_do([size(img_in,0..1),[16,16]],img_out, kernel)
For the first, horizontal filtering stage, there are more output pixels to be computed than there are threads in each block (in particular, for filter length
and block size
, there are
output pixels). Using the shared memory designators, these calculations are also distributed over the thread block. For example, for
, this will yield
, corresponding to 12.5 warps. The result of the first filtering stage is stored in shared memory.
The second, vertical filtering stage takes inputs from the previous stage stored in shared memory. The result is written back to the global memory.
Note that the function is parametrized on the half-filter length
. This way, the number of loop iterations for
, as well as
can be computed at compile-time. The compiler also manages to calculate the amount of shared memory required to execute this kernel (6348 bytes for
and 4416 bytes for
in single precision floating point mode). Parallel reduction (sum of NxN matrices)

Below is an example of a parallel reduction algorithm rewritten to take advantage of shared memory designators. Although the advantages of this particular implementation are limited compared to the low-level
function, the implementation is given for completeness.
function [y] = red(im : cube)
M : int’const = 512 % block size
N : int’const = 8*M % number of blocks times block size
y = 0.0
{!parallel for; blkdim=M} % Explicitly set the block size
for n=0..N-1
B : ’shared = uninit(M) % Shared entry for each thread
% Calculate partial sum and store in B
total = 0.0
for k=n..N..numel(im)-1
total += im[ind2pos(size(im),k)]
B[mod(k,M)] = total
% Sum over the shared memory - parallel reduction
total = sum(B)
if mod(n,M)==0
y += total % Sum intermediate results

9.3 Speeding up spatial data access using Hardware Texturing Units

The hardware texturing units are a part of the graphics-accelerating heritage of the GPU. Originally, texture mapping was designed to enable realistically looking objects by letting the applications “paint” onto the geometry. From the rendered triangles, texture coordinates were interpolated along the X, Y and Z coordinates, such that for every output pixel, a texture value could be fetched (e.g. using nearest-neighbor or linear/trilinear interpolation). Later, programmable graphics and non-color like texture data (e.g. bump maps, shadow maps) were introduced and also the graphics hardware became more sophisticated. The hardware performance was improved by using dedicated hardware for transforming texture coordinates into hardware addresses, by adding texture caches and by using memory layouts optimized for spatial locality.
There is also hardware support for some of the type modifiers explained in Section 2.4↑, in particular “safe”, “circular”, “mirror” and “clamped”.
More generally, in Quasar, there are two main use cases for textures:
The hardware texture units can only be used in combination with texture memory. Texture memory is a read-only part of the global memory (see Subsection 2.4.3↑), that is cached on-chip (e.g. 6-8 KB per multi-processor) and ordered using a space-filling curve optimized for spatial locality.
In Table 9.2↓ there are a number of limitations listed for texture memory.
Limitation CUDA 2.x value
Maximum length for 1D texture 134217728
Maximum size for 2D texture 65536 × 65536
Maximum size for 3D texture 2048 × 2048 × 2048
Allowed element types scalar, int, int8, int16, int32
uint8, uint16, uint32
Access type locally read-only, changes visible in next
kernel function call
Access modifiers safe, circular, mirror and clamped
(no checked/unchecked)
Maximum number of textures/Quasar module 128 (or 256)
Table 9.2 Texture memory limitations
Using the hardware texture units in Quasar is quite simple: it suffices to add the following special modifiers to the types of arguments of kernel functions:
Note that these modifiers are only permitted to vec, mat or cube types. Complex-valued data or higher dimensional matrices are currently not yet supported.
The following image scaling example illustrates the use of hardware textures:
% Kernel function, not using hardware textures
function [] = __kernel__ interpolate_nonhwtex (y:mat, x:mat, scale:scalar, pos:ivec2)
scaled_pos = scale * pos
f = frac(scaled_pos)
i = int(floor(scaled_pos))

y[pos] = (1 - f[0]) * (1 - f[1]) * x[i[0], i[1]] +
f[0] * (1 - f[1]) * x[i[0]+1, i[1]] +
(1 - f[0]) * f[1] * x[i[0], i[1]+1] +
f[0] * f[1] * x[i[0]+1, i[1]+1]
% Kernel function, using hardware textures
function [] = __kernel__ interpolate_hwtex (y:mat, x:mat’hwtex_linear,
scale:scalar, pos:ivec2)
y[pos] = x[scale * pos]
Note that the use of the hardware textures (and in particular the linear interpolation) is quite simple. However, it is important to stress that the hwtex modifiers can only be used for kernel function arguments. It is for example not possible to declare variables using these modifiers (if you try so, the modifiers will not have any effect).
The hardware textures enable some performance benefit. For example, on a Geforce 435M, for the above program the following results were obtained:
2D nearest neighbor interpolation without hardware texturing: 109.2002 msec
2D nearest neighbor interpolation with hardware texturing: 93.6002 msec
3D nearest neighbor interpolation without hardware texturing: 421.2007 msec
3D nearest neighbor interpolation with hardware texturing: 312.0006 msec
2D Linear interpolation without hardware texturing: 156.0003 msec
2D Linear interpolation with hardware texturing: 109.2002 msec
3D Linear interpolation without hardware texturing: 873.6015 msec
3D Linear interpolation with hardware texturing: 312.0006 msec
Especially, in 3D with linear interpolation, the performance is almost 3x higher than the regular approach. Textures have also a number of limitations:
Summarizing, hardware textures have the following advantages:
  1. Texture memory is cached, this is helpful when global memory is the main bottleneck.
  2. Texture memory is efficient also for less regular access patterns
  3. Supports linear/bilinear and trilinear interpolation in hardware
  4. Supports boundary accessing modes (mirror, circular, clamped and safe) in hardware.

9.4 16-bit (half-precision) floating point textures

To reduce the bandwidth in computation heavy applications (e.g. real-time video processing), it is possible to specify that the GPU texturing unit should use 16-bit floating point formats. This can be configured on a global level in Redshift / Program Settings / Runtime / Use CUDA 16-bit floating point textures. Obviously, this will reduce the memory bandwidth by a factor of 2 in 32-bit float precision mode, and by a factor of 4 in 64-bit float precision mode. The option is also particularly useful when visualizing multiple large images.
Note that 16-bit floating point numbers have some limitations. The minimal positive non-zero value is 5.96046448e-08. The maximal value is 65504. Integers between -2047 and 2047 can be exactly represented. The machine precision (eps) value is 0.00097656. For these reasons, 16-bit floating point textures should not be used for accuracy sensitive parts of the algorithm. They are useful for rendering and visualization purposes (e.g., real-time video processing).
Currently, 16-bit precision is only available for textures (and not for non-texture arrays), however, the support for 16-bit floating point arrays will be added in a future version of Quasar.

9.5 Multi-component Hardware Textures

Very often, kernel functions access RGB color data using slicing operations, such as:
x[m,n,0..2]When the accesses m and/or n are irregular compared to the kernel function position variable pos, it may be useful to consider the use of multi-component hardware textures. These textures allow fetches of 2, 3 or 4 color components in one single operation, which is very efficient. A multi-component hardware texture can be declared by adding ’hwtex_nearest(4) to the access modifier of the cube type. The modifier is only permitted to mat, cube or cube{4} types. Complex-valued data or higher dimensional matrices are currently not yet supported. An example of a Gaussian filter employing multi-component textures is given below:
function y = gaussian_filter_hor(x, fc, n)
function [] = __kernel__ kernel(x : cube’hwtex_nearest(4), y : cube’unchecked, fc : vec’unchecked, n : int, pos : vec2)
sum = [0.,0.,0.]
for i=0..numel(fc)-1
sum = sum + x[pos[0],pos[1]+i-n,0..2] * fc[i]
y[pos[0],pos[1],0..2] = sum

y = uninit(size(x))
parallel_do (size(y,0..1), x, y, fc, n, kernel)
In parentheses, the number of components is indicated. Note that the hardware only supports 1, 2 or 4 components. In this mode, the Quasar compiler will support the texture fetching operation x[pos[0],pos[1]+i-n,0..2] and will translate the slice indexer into a 4-component texture fetch.
In combination with 16-bit floating point formats, the texture fetch even only requires a transfer of 64 bits (8 bytes) from the texture memory. On average, this will reduce the memory bandwidth by a factor 2 and at the same time reduces the stress on the global memory.
Finally, it is best to not use the same matrix value in ’hwtex_nearest(4) mode and later in ’hwtex_nearest mode (or vice versa) in another kernel function, because a mode change requires the texture memory to be reallocated and recopied (which affects the performance).

9.6 Texture/surface writes

For CUDA devices with compute capability 2.0 or higher, it is possible to write to the texture memory from a kernel function. In CUDA terminology, this is called a surface write. In Quasar, it suffices to declare the kernel function parameter using the modifier ’hwtex_nearest (or ’hwtex_nearest(n)) and to write to the corresponding matrix.
One caveat is that the texture write is only visible starting from the next kernel function call. Consider the following example:
function [] = __kernel__ kernel (y: mat’hwtex_nearest, pos : ivec2)
y[pos] = y[pos] + 1
y[pos] = y[pos] + 1 % unseen change
y = zeros(64,64)
print mean(y) % Result is 2 (instead of 4) because the surface writes
% are not visible until the next call
This may be counterintuitive, but this allows the texture cache to work properly.
An example with 4 component surface writes is given below (one stage of a wavelet transform in the vertical direction):
function [] = __kernel__ dwt_dim0_hwtex4(x : cube’hwtex_nearest(4), y : cube’hwtex_nearest(4), wc : mat’hwconst, ctd : int, n : int, pos : ivec2)
K = 16*n + ctd
a = [0.0,0.0,0.0,0.0]
b = [0.0,0.0,0.0,0.0]
tilepos = int((2*pos[0])/n)
j0 = tilepos*n
for k=0..15
j = j0+mod(2*pos[0]+k+K,n)
u = x[j,pos[1],0..3]
a = a + wc[0,k] * u
b = b + wc[1,k] * u
y[j0+mod(pos[0],int(n/2)), pos[1],0..3]=a
im = imread("lena_big.tif")
im_out = uninit(size(im))
parallel_do([size(im_out,0)/2,size(im_out,1)],im2,im_out,sym8,4,size(im_out,0), dwt_dim0_hwtex4)
On a Geforce GTX 780M, the computation times for 1000 runs are as follows:
without ’hwtex_nearest(4): 513 ms
with ’hwtex_nearest(4): 176 ms
Here this optimization resulted in a speedup of approx. a factor 3 (!)

9.7 Maximizing occupancy through shared memory assertions

Kernel functions that explicitly use shared memory can be optimized by specifying the amount of memory that a kernel function will actually use.
The maximum amount of shared memory that Quasar kernel functions can currently use is 32K (32768 bytes). Actually, the maximum amount of shared memory of the device is 48K (16K is reserved for internal purposes). The GPU may process several blocks at the same time, however there is one important restriction:
“The total number of blocks that can be processed at the same time also depends on the amount of shared memory that is used by each block.”
For example, if one block uses 32K, then it is not possible to launch a second block at the same time, because 2 x 32K>48K. In practice, your kernel function may only use e.g. 4K instead of 32K. This would then allow 48K/4K = 12 blocks to be processed at the same time.
When the amount of shared memory used by a kernel can not be deduced from the code, the Quasar runtime will assume that the kernel uses 32K shared memory per block. However, because N*32K < 48K requires N=1, only one block can be launched simultaneously. This significantly degrades the performance. Therefore it is recommended that you give some hints to the compiler about the amount of shared memory that the kernel function will use. This can be done as follows:
  1. Explicitly giving the dimensions of the shared memory arrays\begin_inset Separator latexpar\end_inset
    When you request:
    x = shared(20,3,6)
    the compiler will reserve 20 x 3 x 6 x 4 bytes = 1440 bytes for the kernel function.
  2. Using assertions\begin_inset Separator latexpar\end_inset
    Often the arguments of the function shared are non-constant. In this case you can use assertions. The Quasar compiler can then infer the total amount of shared memory that is being used through the logic system (see Chapter 5↑).
    assert(M<8 && N<20 && K<4)
    x = shared(M,N,K)
    Due to the above assertion, the compiler is able to infer the amount of required shared memory. In this case: 8 x 20 x 4 x 4 bytes = 2560 bytes. The compiler then gives the following message:
    Information: sharedmemtest.q - line 17: Calculated an upper bound for the amount of shared memory: 2560 bytes
Assertions have also an additional benefit: they allow the runtime system to check whether not too much shared memory will be allocated. In case N would exceed 20, the runtime system will give an error message.

9.8 Cooperative groups and warp shuffling functions

The following special kernel function parameters give fine grain control over GPU threads.
Parameter Type Description
coalesced_threads thread_block a thread block of coalesced threads
this_thread_block thread_block describes the current thread block
this_grid thread_block describes the current grid
this_multi_grid thread_block describes the current multi-GPU grid
The thread_block class has the following properties:
Property Description
thread_idx Gives the index of the current thread within a thread block
size Indicates the size (number of threads) of the thread block
active_mask Gives the mask of the threads that are currently active
The thread_block class has the following methods.
Method Description
sync() Synchronizes all threads within the thread block
partition(size : int) Allows partitioning a thread block into smaller blocks
shfl(var, src_thread_idx : int)
Direct copy from another thread
shfl_up(var, delta : int)
Direct copy from another thread, with index specified relatively
shfl_down(var, delta : int)
Direct copy from another thread, with index specified relatively
shfl_xor(var, mask : int)
Direct copy from another thread, with index specified by a XOR relative to the current thread index
Returns true if the predicate for all threads within the thread block evaluates to non-zero
Returns true if the predicate for any thread within the thread block evaluates to non-zero
Evaluates the predicate for all threads within the thread block and returns a mask where every bit corresponds to one predicate from one thread
Returns a mask of all threads that have the same value
Returns a mask only if all threads that share the same value, otherwise returns 0.
In principle, the above functions allow threads to communicate with each other, without relying on, e.g., shared memory. The warp shuffle operations allow taking values from other active threads (active means not disabled due to thread divergence). all, any, ballot, match_any and match_all allow to determine whether the threads have reached a given state.
The warp shuffle operations require a Kepler GPU (or higher) and allow the use of shared memory to be avoided (register access is faster than shared memory). This may bring again performance benefits for computationally intensive kernels such as convolutions and parallel reductions (sum, min, max, prod etc.).
Using this functionality will require the CUDA target to be specified explicitly (i.e., the functionality cannot be easily simulated by the CPU). This may be obtained by placing the following code attribute (see Section 4.11↑) inside the kernel: {!kernel target="nvidia_cuda"}. For CPU execution a separate kernel needs to be written.
Below an example is given for a parallel reduction algorithm based on warp shuffling functions. This is an alternative to the parallel reduction algorithm using shared memory (Section 10.6↓) and has the advantage that no shared memory is used.
function y : scalar = __kernel__ reduce_sum(coalesced_threads : thread_block, x : vec, pos : int)
{!kernel target="nvidia_cuda"}
val = x[pos]
lane = coalesced_threads.thread_idx
i = int(coalesced_threads.size / 2)
val = x[pos]

% accumulate within the warp
while i > 0
val += coalesced_threads.shfl_down(val, i)
i /= 2

if lane==0
y += val
In case the warp size is known to be 32, this can be even written as follows:
function y : scalar = __kernel__ reduce_sum(coalesced_threads : thread_block, x : vec, pos : int)
{!kernel target="nvidia_cuda"}
val = x[pos]
lane = coalesced_threads.thread_idx

% accumulate within the warp (of fixed size 32)
val += coalesced_threads.shfl_down(val, 16)
val += coalesced_threads.shfl_down(val, 8)
val += coalesced_threads.shfl_down(val, 4)
val += coalesced_threads.shfl_down(val, 2)
val += coalesced_threads.shfl_down(val, 1)

if lane==0
y += val

9.8.1 Fine synchronization granularity

As an extension of the cooperative groups, the keyword syncthreads accepts a parameter that indicates which threads are being synchronized. This allows more fine grain control on the synchronization.
Keyword Description
syncthreads(warp) performs synchronization across the current (possibly diverged) warp (32 threads)
syncthreads(block) performs synchronization across the current block
syncthreads(grid) performs synchronization across the entire grid
syncthreads(multi_grid) performs synchronization across the entire multi-grid (multi-GPU)
syncthreads(host) synchronizes all host (CPU and GPU threads)
The first statement syncthreads(warp) allows divergent threads to synchronize at any time (it is also useful in the context of Volta’s independent scheduling). syncthreads(block) is equivalent to syncthreads. The grid synchronization primitive syncthreads(grid) is allows to place barriers inside kernel function that synchronize all blocks. The following function:
function y = gaussian_filter_separable(x, fc, n)
function [] = __kernel__ gaussian_filter_hor(x : cube, y : cube, fc : vec, n : int, pos : vec3)
sum = 0.
for i=0..numel(fc)-1
sum = sum + x[pos + [0,i-n,0]] * fc[i]
y[pos] = sum
function [] = __kernel__ gaussian_filter_ver(x : cube, y : cube, fc : vec, n : int, pos : vec3)
sum = 0.
for i=0..numel(fc)-1
sum = sum + x[pos + [i-n,0,0]] * fc[i]
y[pos] = sum
z = uninit(size(x))
y = uninit(size(x))
parallel_do (size(y), x, z, fc, n, gaussian_filter_hor)
parallel_do (size(y), z, y, fc, n, gaussian_filter_ver)
Can now be simplified to:
function y = gaussian_filter_separable(x, fc, n)
function [] = __kernel__ gaussian_filter_separable(x : cube, y : cube, z : cube, fc : vec, n : int, pos : vec3)
sum = 0.
for i=0..numel(fc)-1
sum = sum + x[pos + [0,i-n,0]] * fc[i]
z[pos] = sum
sum = 0.
for i=0..numel(fc)-1
sum = sum + z[pos + [i-n,0,0]] * fc[i]
y[pos] = sum
z = uninit(size(x))
y = uninit(size(x))
parallel_do (size(y), x, y, z, fc, n, gaussian_filter_separable)
The advantage is not only in the improved readability of the code, but the number of kernel function calls can be reduced which further increases the performance. Note: this feature requires at least an NVIDIA Pascal GPU.
There are a few caveats with grid synchronization:
The Quasar compiler enforces these conditions automatically. However, in case manual control of the block size and/or block count is desirable, the function opt_block_cnt can be used (see following Section).
Important note: using cooperative group features such as grid and multi-grid synchronization requires cooperative kernel launches, which is only available for Pascal, Volta (or newer) GPUs and requires either Windows or Linux with the GPU device operating in TCC driver mode. Also see As an alternative, when cooperative kernel launching is not available, Quasar will emulate grid synchronization. This is achieved by disabling the configuration setting CUDA_BACKEND_COOPERATIVEGROUPS_TCCDRIVER in Quasar.config.xml.

9.8.2 Optimizing block count for grid synchronization

Whereas opt_block_size calculates the best suited block size for a kernel function, opt_block_cnt can be used to calculate the block count, in order to ensure that all blocks can be active at the same time. This condition is required when performing grid synchronization, but may also occur in a few kernel tiling schemes.
block_size = opt_block_size(kernel,data_dims)
grid_size = opt_block_cnt(kernel,block_size,data_dims)
parallel_do([grid_size.*block_size, block_size], kernel)

9.8.3 Memory fences

CUDA follows a weak consistence memory model, which means that shared/global memory writes are not necessarily performed in order. This may result in unexpected results in case the programming code assumes a fixed order of memory operations.
The keyword memfence can be used to place memory barriers in the code; this is useful when threads need to wait for a global memory operation to be completed. Currently memfence only has effect inside kernel/device functions. Note that these instructions in addition prevent the compiler from performing optimizations across the fence (e.g., caching a value in the registers).
Keyword Description
memfence(block) Suspends the current thread until its global/shared memory
writes are visible by all threads in the current block
memfence(grid) Suspends the current thread until its global memory
writes are visible by all threads in the grid
memfence(system) Suspends the current thread until its global memory writes
are visible by all threads in the system (CPU, GPU)

9.9 Kernel launch bounds

By default, the runtime system decides on how many threads use when launching a given kernel. The back-end compiler (e.g., NVCC) however, must ensure that the kernel is compiled for a flexible number of threads. For this purpose, an upper limit (called launch bounds) for the possible number of threads is selected. The launch bounds are in turn used to calculate the number of registers used. The smaller the number of registers that are being used, the more thread blocks can be scheduled in parallel and the higher the efficiency. On the other hand, for large kernels with a large number of registers, the back-end compiler places less-used variables in the global memory to avoid to many registers to be used (called register spilling). When variables are stored in global memory, the cost can be tremendous.
However, often an approximate idea of the number of threads per block is known at compile-time. For example, the technique from Section 9.7↑ may assume that the number of threads per block is fixed to a given number. For practical purposes, let us assume that this number is 128, where the maximum number of threads the kernel permits is 1024. For 128 threads, the amount of shared memory used by the kernel (i.e., 512 bytes in single precision mode and 1024 bytes in double precision mode) is much lower than in case the kernel would be executed with 1024 threads per block. This allows the GPU to schedule multiple threads in parallel, which increases the computational performance of the kernel. Now, since we know that only 128 threads are used, the compiler has actually more registers available than in the case of 1024 threads, potentially avoiding the register spilling. Hence the compiler can tune the number of registers with this extra piece of information.
For this purpose, CUDA provides a __launch_bounds__ directive, and similarly this directive is also available in Quasar, For examples:
{!cuda_launch_bounds max_threads_per_block=128}where the maximal number of threads per block obviously needs to be smaller or equal than the maximal supported number of threads by the GPU (typically, 1024). It suffices to place this code attribute directly inside a kernel function (for example at the top). It is also possible to specify the minimal number of blocks per streaming processor (SM) that needs to be executed:
{!cuda_launch_bounds max_threads_per_block=128; min_blocks_per_sm = 4}
The number of registers used by a kernel function can be inspected in the Quasar Redshift IDE by hovering with the mouse over the left margin in the source code editor, at the position of a kernel function. We have seen performance improvements of 20-30% or more by tuning the number of registers and number of threads per blocks.
The kernel launch attribute is also generated by several automatic code transformations present in the Quasar compiler.

9.10 Memory management

There are some problems operating on large images that do not fit into the GPU memory. The solution is to provide a FAULT-TOLERANT mode, in which the operations are completely performed on the CPU (we assume that the CPU has more memory than the GPU). Of course, running on the CPU comes at a performance hit. Therefore I will add some new configurable settings in this enhancement.
Please note that GPU memory problems can only occur when the total amount of memory used by one single kernel function > (max GPU memory - reserved mem) * (1 - fragmented mem%). For a GPU with 1 GB, this might be around 600 MB. Quasar automatically transfers memory buffers back to the CPU memory when it is running out of GPU space. Nevertheless, this may not be sufficient, as some very large images can take all the space of the GPU memory (for example 3D datasets).
Therefore, three configurable settings are added to the runtime system (see Quasar.Redshift.config.xml):
  1. RUNTIME_GPU_MEMORYMODEL with possible values:\begin_inset Separator latexpar\end_inset
    • SmallFootPrint - A small memory footprint - opts for conservative memory allocation leaving a lot of GPU memory available for other programs in the system
    • MediumFootprint (default) - A medium memory footprint - the default mode
    • LargeFootprint - chooses aggressive memory allocation, consuming a lot of available GPU memory quickly. This option is recommended for GPU memory intensive applications.
  2. RUNTIME_GPU_SCHEDULINGMODE with possible values:\begin_inset Separator latexpar\end_inset
    • MaximizePerformance - Attempts to perform as many operations as possible on the GPU (potentially leading to memory failure if there is not sufficient memory available. Recommended for systems with a lot of GPU memory).
    • MaximizeStability (default) - Performs operations on the CPU if there is not GPU memory available. For example, processing 512 MB images when the GPU only has 1 GB memory available. The resulting program may be slower. (FAULT-TOLERANT mode)
  3. RUNTIME_GPU_RESERVEDMEM\begin_inset Separator latexpar\end_inset
    • The amount of GPU memory reserved for the system (in MB). The Quasar runtime system will not use the reserved memory (so that other desktop programs can still run correctly). Default value = 160 MB. This value can be decreased at the user’s risk to obtain more GPU memory for processing (desktop applications such as Firefox may complain\SpecialChar ldots)
Please note that the “imshow” function also makes use of the reserved system GPU memory (the CUDA data is copied to an OpenGL texture).
 Chapter 8: Special programming patterns Up Main page Chapter 10: Parallel programming examples