Chapter 9: GPU hardware features Up Main page Chapter 11: Multi-GPU programming

# 10 Parallel programming examples

This section contains a number of useful parallel programming examples together with an explanation.

## 10.1 Gamma correction

As a first example, we demonstrate how a gamma correction can be programmed in Quasar.
y = copy(x)
gamma = 0.22
parallel_do(size(y),y,gamma,__kernel__ (y:cube’unchecked, gamma:scalar, pos:ivec3) -> _
y[pos] = 255*(y[pos]*(1.0/255))^gamma)
imshow(y)
The above approach makes use of __kernel__ lambda expressions, which allows to define __kernel__ functions in just one line of code. Note that it is possible to put multiple statements inside a lambda expression, this is done as follows:
kernel_lambda = __kernel__ (y:cube’unchecked) -> (statement1; statement2; ...)
Sometimes, it is useful to share functionality between different kernel functions. This can be achieved using a __device__ function:
gamma_correction = __device__ (x:scalar,gamma:scalar) -> _
255*(y*(1.0/255))^gamma
gamma_correction_kernel = __kernel__ (y:cube’unchecked, gamma:scalar, pos:ivec3) -> _
y[pos] = gamma_correction(x[pos], gamma)
Device functions are defined in the same way as kernel functions, but they can not be directly executed using the parallel_do function.

## 10.2 Fractals

As a second example, we consider the calculation of the Mandelbrot fractal. In Quasar, this can be obtained using quite simple code, by using complex arithmetic.
% Mandelbrot fractal with Normalized Iteration Count algorithm
function [] = __kernel__ mandelbrot_fractal(im : mat’unchecked, s : scalar, _
t : cscalar, num_it : int, pos : ivec2)
p = (float(pos) ./ size(im,0..1))-0.5
c = t+s*complex(p,p)
z = 0i
N = 2.0
for n=1..num_it
if abs(z)>N
break
endif
z = z*z + c
endfor
im[pos] = n-log2(log(abs(z))/log(N))
endfunction
x = zeros(768,768)
parallel_do(size(x),x,10,complex(-1.42),512,mandelbrot_fractal)
imshow(x,[])

## 10.3 Image rotation, translation and scaling [basic]

The example below uses a __device__ function to perform linear interpolation. The main kernel function then performs an affine transform on its position argument, pos. Boundary checking in the function linear_interpolate is only performed once, using the test min(i) >= 0 && max(i-size(img_in,0..1)) < -1. Alternatively, the modifier ’unchecked in img_in:cube’unchecked can be omitted, which would give the same result, but this would result in 4 boundary checks (one for each img_in[...] access) instead of 1.
function [] = rotatescaletranslate(img_in, img_out, theta, s, tx, ty)
% Device function for performing linear interpolation
function [q:vec3] = __device__ linear_interpolate(img_in:cube’unchecked, p:vec2)
i = floor(p)
f = frac(p)
if min(i) >= 0 && max(i-size(img_in,0..1)) < -1
q = img_in[i,i,0..2] * (1 - f) * (1 - f) + _
img_in[i,i+1,0..2] * (1 - f) * f + _
img_in[i+1,i,0..2] * f * (1 - f) + _
img_in[i+1,i+1,0..2] * f * f
else
q = [0.,0.,0.]
endif
endfunction
function [] = __kernel__ tf_kernel(img_out : cube’unchecked, _
img_in:cube’unchecked, A:mat’unchecked’const, t:vec2, pos:ivec2)
center = size(img_in,0..1)/2
p = pos - center
p = [A[0,0]*p + A[0,1]*p, A[1,0]*p + A[1,1]*p] + center + t
img_out[pos,pos,0..2] = linear_interpolate(img_in, p)
endfunction
A = [[cos(theta), -sin(theta)],
[sin(theta), cos(theta)]] * 2^s

parallel_do(size(img_out,0..1),img_out,img_in,A,-[ty,tx],tf_kernel)
endfunction

## 10.4 2D Haar inplace wavelet transform using lifting

The following code demonstrates an inplace Haar wavelet transform, implemented using the lifting scheme (but without normalization). The forward and backward transform respectively use the 2 × 2 transform matrices:
A⃗ = ( 1 ⁄ 2 1 ⁄ 2 1  − 1 )  and A⃗ − 1 = ( 1 1 ⁄ 2 1  − 1 ⁄ 2 ).
The main advantages of the Haar wavelet transform in the context of Quasar programs, is that the transform is very fast (takes less than 2 ms to compute for a 512 × 512 × 3 input image on a NVidia Geforce 435M using the CUDA computation engine). Moreover, for integer input data within the range [0, 255], this unnormalized transform does not suffer from floating point rounding errors, hence the reconstruction (backward transform applied after the forward transform) is exact.
Forward transform:
function [] = haar_fw(x, num_scales)
function [] = __kernel__ hor_haar_fw_kernel(x : cube’unchecked, _
y : cube’unchecked, j : int, pos : ivec3)
n = size(x,1)/2^(j+1)
if mod(pos,2)==0
[a, b] = [x[pos], x[pos+[0,1,0]]]
y[pos,pos/2,pos]=0.5*(a+b)
y[pos,pos/2+n,pos]=a-b
endif
endfunction
function [] = __kernel__ ver_haar_fw_kernel(x : cube’unchecked, _
y : cube’unchecked, j : int, pos : ivec3)
m = size(x,0)/2^(j+1)
if mod(pos,2)==0
[a, b] = [x[pos], x[pos+[1,0,0]]]
y[pos/2,pos,pos]=0.5*(a+b)
y[pos/2+m,pos,pos]=a-b
endif
endfunction
tmp = zeros(size(x))
for j=0..num_scales-1
sz = [size(x,0)/2^j,size(x,1)/2^j,size(x,2)]
parallel_do(sz,x,tmp,j,hor_haar_fw_kernel)
parallel_do(sz,tmp,x,j,ver_haar_fw_kernel)
endfor
endfunction
Backward transform:
function [] = haar_bw(x, num_scales)
function [] = __kernel__ hor_haar_bw_kernel(x : cube’unchecked, _
y : cube’unchecked, j : int, pos : ivec3)
n = size(x,1)/2^(j+1)
if mod(pos,2)==0
a = x[pos,pos/2,pos]
b = x[pos,pos/2+n,pos]
y[pos]=a+0.5*b
y[pos+[0,1,0]]=a-0.5*b
endif
endfunction
function [] = __kernel__ ver_haar_bw_kernel(x : cube’unchecked, _
y : cube’unchecked, j : int, pos : ivec3)
m = size(x,0)/2^(j+1)
if mod(pos,2)==0
a = x[pos/2,pos,pos]
b = x[pos/2+m,pos,pos]
y[pos]=a+0.5*b
y[pos+[1,0,0]]=a-0.5*b
endif
endfunction
tmp = zeros(size(x))
for j=num_scales-1..-1..0
sz = [size(x,0)/2^j,size(x,1)/2^j,size(x,2)]
parallel_do(sz,x,tmp,j,hor_haar_bw_kernel)
parallel_do(sz,tmp,x,j,ver_haar_bw_kernel)
endfor
endfunction

## 10.5Convolution

As a fifth example, we will illustrate how a 3 × 3 local means filter can be implemented. There are different possibilities: 1) using a non-separable filtering, 2) using separable filtering (but requiring extra memory to store the intermediate values), or 3) using shared memory (see Subsection 2.4.4↑).
1. Non-separable implementation
y = zeros(size(x))
parallel_do(size(y),x,y,__kernel__ (x:cube,y:cube,pos:ivec3) -> _
y[pos] = (x[pos+[-1,-1,0]]+x[pos+[-1,0,0]]+x[pos+[-1,1,0]] + _
x[pos+[ 0,-1,0]]+x[pos ]+x[pos+[0,1,0]] + _
x[pos+[ 1,-1,0]]+x[pos+[ 1,0,0]]+x[pos+[1,1,0]])*(1.0/9))
imshow(y)
2. Separable implementation:
y = zeros(size(x))
tmp = zeros(size(x))
parallel_do(size(y),x,tmp,__kernel__ (x:cube,y:cube,pos:ivec3) -> _
y[pos] = x[pos+[-1,0,0]]+x[pos]+x[pos+[1,0,0]])
parallel_do(size(x),tmp,y,__kernel__ (x:cube,y:cube,pos:ivec3) -> _
y[pos] = x[pos+[0,-1,0]]+x[pos]+x[pos+[0,1,0]]*(1.0/9))
imshow(x)
3. Separable implementation, using shared memory:
function [] = __kernel__ filter3x3_kernel_separable(x:cube,y:cube,pos:ivec3,
blkpos:ivec3,blkdim:ivec3)
[M,N,P] = blkdim+[2,0,0]
assert(M<=10 && N<=16 && P<=3) % specify upper bounds for the amount of shared memory
vals = shared(M, N, P) % shared memory

sum = 0.
for i=pos-1..pos+1 % step 1 - horizontal filter
sum += x[pos,i,blkpos]
endfor
vals[blkpos] = sum % store the result
if blkpos<2 % filter two extra rows (needed for vertical filtering)
sum = 0.
for i=pos-1..pos+1
sum += x[pos+blkdim,i,blkpos]
endfor
vals[blkpos+[blkdim,0,0]] = sum
endif
sum = 0.
for i=blkpos..blkpos+2 % step 2 - vertical filter
sum += vals[i,blkpos,blkpos]
endfor
y[pos] = sum*(1.0/9)
endfunction
y = zeros(size(x))
parallel_do(size(y),x,y,filter3x3_kernel_separable)
imshow(y)
Comparison of the computation times:\begin_inset Separator latexpar\end_inset
 Implementation Time/run (NVidia Geforce 435M) Non-separable 3.70 msec Separable 4.24 msec Separable, w. shared memory 3.51 msec
It can be noted that a separable implementation for a 3 × 3 filter kernel, only brings a benefit when shared memory is used.
Remarks:
• The out of bounds checking compilation (see Section 17.3↓) option needs to be turned off in order to have this benefit.
• Also important is that the upper bounds for using shared memory are specified. This can be done using the assertion system. The compiler is then able to compute the maximal amount of shared memory that will be needed by the kernel function (see Section 9.7↑).

## 10.6Parallel reduction sum

Note that the Quasar compiler will generate automatically code that performs a parallel sum (see Section 8.4↑). This section is mainly for educational purposes, for understanding the shared memory and thread synchronization.
A parallel sum can be implemented in Quasar using a logarithmic algoritm of complexity log2N. This consists of first computing “partial” sums of groups of elements, stored in shared memory, followed by recursively adding of the shared memory partial sums. A lot of information on this kind of algorithm can be found in literature. In Quasar, the implementation for vectors is as follows:
function [y : scalar] = __kernel__ my_sum(x : vec’unchecked,
blkpos : int, blkdim : int)
bins = shared(blkdim) % Note - we assume that blkdim is a power of two!
nblocks = (numel(x)+blkdim-1)/blkdim
% step 1 - parallel sum
val = 0.0
for m=0..nblocks-1
if blkpos + m*blkdim < numel(x)
val += x[blkpos + m*blkdim]
endif
endfor
bins[blkpos] = val
% step 2 - reduction
bit = 1
while bit < blkdim
index = 2*bit*blkpos
if blkpos+index<blkdim
bins[index] = bins[index] + bins[index + bit]
endif
bit *= 2
endwhile
% write output
if blkpos == 0
y += bins
endif
endfunction
In step 1, the input is split in a number of blocks, where each block has size “blockdim”. Then all blocks are summed in parallel, the results are stored in “bins” (has one entry per block element). In step 2, all elements of bins are added together, using an FFT-like butterfly. When blkdim = 16, the algorithm is as follows:
% iteration 1 (subsequent steps are performed in parallel)
bins += bins
bins += bins
bins += bins
bins += bins
bins += bins
bins += bins
bins += bins
bins += bins
% iteration 2 (subsequent steps are performed in parallel)
bins += bins
bins += bins
bins += bins
bins += bins
% iteration 3
bins += bins
bins += bins
% iteration 4
bins += bins
Finally, the end result (bins) is accumulated in the kernel output argument y (see Section 4.7↑).
Remark: due to the atomic operation +=, the result is not deterministic: floating point rounding errors depend on the order of the operations. For an atomic add, the order of operations is not specified. This can be solved by storing the intermediate results in a vector, and summing this vector independently.
The above example can be used to write a more generic parallel reduction, that can be used for multiplication, maximization, minimization:
type accumulator : [__device__ (scalar, scalar) -> scalar]
function y : scalar = __kernel__ parallel_reduction(x : vec’unchecked,
acc : accumulator, val : scalar, blkpos : int, blkdim : int)
bins = shared(blkdim) % Note - we assume that blkdim is a power of two!
nblocks = (numel(x)+blkdim-1)/blkdim)
for m=0..nblocks-1 % step 1 - parallel sum
if blkpos + m*blkdim < numel(x)
val = acc(val, x[blkpos + m*blkdim])
endif
endfor
bins[blkpos] = val
syncthreads % step 2 - reduction
bit = 1
while bit < blkdim
index = 2*bit*blkpos
if index + bit < blkdim
bins[index] = acc(bins[index], bins[index + bit])
endif
bit *= 2
endwhile
y += bins % write output
endfunction
device_sum = __device__ (x : scalar, y : scalar) -> x + y
device_prod = __device__ (x : scalar, y : scalar) -> x * y
reduction (x : cube) -> sum(x) = parallel_do(512, x, device_sum,0,parallel_reduction)
reduction (x : cube) -> prod(x) = parallel_do(512, x, device_prod,0,parallel_reduction)
Here, we define the accumulation functions (device_sum and device_prod), and we pass the functions dynamically to the parallel_reduction function.
Note that the Quasar compiler is also able to recognize for-loops that could benefit from the parallel reduction algorithm. In this case, the for-loop is automatically transformed to the above algorithm (see Section 8.4↑).

## 10.7A more accurate parallel sum

As mentioned in Subsection 2.2.1↑, floating point math is not associative, and the order of the summations may depend on the GPU architecture (the used block dimensions, etc.). The code below illustrates a more accurate parallel summation algorithm than in the previous section, combining Kahan’s algorithm, with the parallel sum reduction reduction. The main idea of Kahan’s algorithm, is to accumulate small errors in a separate variable. Because the operations do not require any extra global or shared memory, all operations are performed in local memory (see Subsection 2.4.3↑), yielding minimal overhead compared to the direct algorithm.
% Sum of all elements in the specified cube.
function y : scalar = r_sum(x : cube) concealed
function [y : scalar] = __kernel__ r_sum_kernel(x : vec,
nblocks : int, blkdim : int, blkpos : int)
s = shared(blkdim)
% step 1 - parallel sum
sum = 0.0
c = 0.0
for n=0..nblocks-1
if blkpos + n * blkdim < numel(x)
% Kahan’s sum reduction
u = x[blkpos + n * blkdim] - c
t = sum + u
c = (t - sum) - u
sum = t
endif
endfor
s[blkpos] = sum
% step 2 - reduction
% now sort all bins from large to small magnitudes
bit = 1
% use regular summing
while bit<blkdim
index=2*bit*blkpos
if index+bit<blkdim
s[index] = s[index] + s[index+bit]
endif
bit *= 2
endwhile
if blkpos==0
y += s
endif
endfunction
y = r_aggregator(x, r_sum_kernel)
endfunction
% Aggregator helper function (deals with the computation
% of the block sizes)
function z = r_aggregator(x, kernel) concealed
N = numel(x)
BLOCK_SIZE = prod(max_block_size(kernel, N))
nblocks = int((N + BLOCK_SIZE-1) / BLOCK_SIZE)
z = parallel_do([[1,BLOCK_SIZE],[1,BLOCK_SIZE]],x,nblocks,kernel)
endfunction
% Define a reduction to replace the summing function by our
% "improved" implementation.
reduction (x : cube) -> sum(x) = r_sum(x)

## 10.8 Parallel sort

To implement a parallel sorting algorithm, several algorithms exist. For the bitonic sort algorithm, the Quasar implementation is as follows:
function [] = sort(x)
function [] = __kernel__ bitsort(x : mat, n : int, blkdim : ivec2,
blkpos : ivec2, pos : ivec2)
k = 2
% copy the row to the shared memory...
s = shared(blkdim, n)
for l = 0..blkdim..n-1
tid = blkpos + l
if tid < size(x,1)
s[blkpos, tid] = x[pos, tid]
else
s[blkpos, tid] = 1e37 % maximum floating point value
endif
endfor
% parallel bitonic sort
while k <= n
% bitonic merge
j = int(k / 2)
while j > 0
for l = 0..blkdim..n-1
tid = blkpos + l % thread id
ixj = xor(tid, j)
if tid < ixj
if and(tid, k) == 0
v = [blkpos, tid]
w = [blkpos, ixj]
else
v = [blkpos, ixj]
w = [blkpos, tid]
endif
if s[v] > s[w]
[s[v], s[w]] = [s[w], s[v]]
endif
endif
endfor
j /= 2
endwhile
k *= 2
endwhile
% Copy back the results
for l = 0..blkdim..n-1
tid = blkpos + l
if tid < size(x,1)
x[pos, tid] = s[blkpos, tid]
endif
endfor
endfunction
nextpow2 = x -> 2^ceil(log2(x))
n = nextpow2(size(x,1))
sz = max_block_size(bitsort, [size(x,0),min(n,256),1])
parallel_do([[size(x,0),sz,1],sz],x,n,bitsort)
endfunction
A complete explanation of the bitonic sort algorithm can be found on http://en.wikipedia.org/wiki/Bitonic_sorter. Here, bitonic sorting is applied along the rows of the matrix.
The function handles input sizes that are not a multiple of two.

## 10.9 Matrix multiplication

Matrix multiplication in CUDA is so much fun that some people write books on this topic (see http://www.shodor.org/media/content//petascale/materials/UPModules/matrixMultiplication/moduleDocument.pdf). The following is the block-based solution proposed by NVidia. The solution exploits shared memory to reduce the number of accesses to global memory.
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
// Block row and column
int blockRow = blockIdx.y, blockCol = blockIdx.x;
// Each thread block computes one sub-matrix Csub of C
Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
// Each thread computes 1 element of Csub accumulating results into Cvalue
float Cvalue = 0.0;
// Thread row and column within Csub
// Loop over all the sub-matrices of A and B required to compute Csub
for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)
{
// Get sub-matrices Asub of A and Bsub of B
Matrix Asub = GetSubMatrix(A, blockRow, m);
Matrix Bsub = GetSubMatrix(B, m, blockCol);
// Shared memory used to store Asub and Bsub respectively
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load Asub and Bsub from device memory to shared memory
As[row][col] = GetElement(Asub, row, col);
Bs[row][col] = GetElement(Bsub, row, col);
// Multiply Asub and Bsub together
for (int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += As[row][e] * Bs[e][col];
}
// Each thread writes one element of Csub to memory
SetElement(Csub, row, col, Cvalue);
}
(Note: some functions are omitted for clarity)
However, this implementation is only efficient when the number of rows of matrix A is about the same as the number of cols of A. In other cases, performance is not optimal. Second, there is the issue that this version expects that the matrix dimensions are a multiple of BLOCK_SIZE. Why use a 3x3 matrix if we can have a 16x16?
In fact, there are 3 cases that need to be considered (let n < N):
1. (n × N)  ×  (N × n): The resulting matrix is small: in this case, it is best to use the parallel sum algorithm.
2. (N × N)  ×  (N × N): The number of rows/cols of A are more or less equal: use the above block-based algorithm.
3. (N × n)  ×  (n × N): The resulting matrix is large: it is not beneficial to use shared memory.
The following example illustrates this approach in Quasar:
% Dense matrix multiplication
function C = dense_multiply(A : mat, B : mat)
% Algorithm 1 - is well suited for calculating products of
% large matrices that have a small matrix as end result.
function [] = __kernel__ kernel1(a : mat’unchecked, b : mat’unchecked,
c : mat’unchecked, blkdim : ivec3, blkpos : ivec3)
n = size(a,1)
bins = shared(blkdim)
nblocks = int(ceil(n/blkdim))
% step 1 - parallel sum
val = 0.0
for m=0..nblocks-1
if blkpos + m*blkdim < n % Note - omitting  gives error
d = blkpos + m*blkdim
val += a[blkpos,d] * b[d,blkpos]
endif
endfor
bins[blkpos] = val
% step 2 - reduction
bit = 1
while bit < blkdim
if mod(blkpos,bit*2) == 0
bins[blkpos] += bins[blkpos + [bit,0,0]]
endif
bit *= 2
endwhile
% write output
if blkpos == 0
c[blkpos,blkpos] = bins[0,blkpos,blkpos]
endif
endfunction

% Algorithm 2 - the block-based algorithm, as described in the CUDA manual
function [] = __kernel__ kernel2(A : mat’unchecked, B : mat’unchecked,
C : mat’unchecked, BLOCK_SIZE : int, pos : ivec2, blkpos : ivec2, blkdim : ivec2)
% A[pos,m] * B[m,pos]
sA = shared(blkdim,BLOCK_SIZE)
sB = shared(BLOCK_SIZE,blkdim)
sum = 0.0
for m = 0..BLOCK_SIZE..size(A,1)-1
% Copy submatrix
for n = blkpos..blkdim..BLOCK_SIZE-1
sA[blkpos,n] = pos < size(A,0) && m+n < size(A,1) ? A[pos,m+n] : 0.0
endfor
for n = blkpos..blkdim..BLOCK_SIZE-1
sB[n,blkpos] = m+n < size(B,0) && pos < size(B,1) ? B[m+n,pos] : 0.0
endfor
% Compute the product of the two submatrices
for n = 0..BLOCK_SIZE-1
sum += sA[blkpos,n] * sB[n,blkpos]
endfor
endfor
if pos < size(C,0) && pos < size(C,1)
C[pos] = sum % Write the result
endif
endfunction

% Algorithm 3 - the most straightforward algorithm
function [] = __kernel__ kernel3(A : mat’unchecked, B : mat’unchecked,
C : mat’unchecked, pos : ivec2)
sum = 0.0
for m=0..size(A,1)-1
sum += A[pos,m]*B[m,pos]
endfor
C[pos] = sum
endfunction

[M,N] = [size(A,0),size(B,1)]
C = zeros(M,N)
if M <= 4
P = prevpow2(max_block_size(kernel1,[size(A,1),M*N]))
parallel_do([P,M,N],A,B,C,kernel1)
elseif size(A,1)>=8 && M >= 8
P = min(32, prevpow2(size(A,1)))
blk_size = max_block_size(kernel2,[32,32])
sz = ceil(size(C,0..1) ./ blk_size) .* blk_size
parallel_do([sz,blk_size],A,B,C,P,kernel2)
else
parallel_do(size(C),A,B,C,kernel3)
endif
endfunction

Chapter 9: GPU hardware features Up Main page Chapter 11: Multi-GPU programming