Title: Tutorial 6: Advanced GPU programming
In this tutorial, we show how the histogram of an image can be implemented efficiently in Quasar, using some more advanced GPU programming features. In particular, we consider three possible implementations of the histogram calculation: a naive implementation based on atomic operations, a hand-written kernel that uses shared memory, and an implementation that relies on the automatic shared memory caching transform.
The straightforward implementation is as follows:
function y = hist_naive(im : cube)
    y = zeros(256)
    for m=0..size(im,0)-1
        for n=0..size(im,1)-1
            for k=0..size(im,2)-1
                v = im[m,n,k]
                y[v] += 1
            endfor
        endfor
    endfor
endfunction
Here, we calculate the number of occurrences of each pixel intensity between 0 and 255. Because the loops over the image (with iterators m, n, k) are executed in parallel, an atomic operation is necessary to avoid data races. In Quasar, the in-place operators (+=, -=, ...) are automatically atomic.
Although the above implementation suffices to compute a histogram in Quasar, the atomic operation += is likely to encounter many concurrent updates: there are only 256 possible bins and, in most images, the intensities are far from uniformly distributed. As a result, the additions are effectively executed in serial.
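To make the contention problem concrete, the following plain CUDA sketch (not Quasar output; the kernel name, the 8-bit input format and the launch configuration are illustrative assumptions) shows what the naive approach boils down to on the GPU: every thread issues an atomicAdd on the same 256-bin array in global memory, so threads that hit popular bins contend with each other.

// Hypothetical CUDA sketch of the naive histogram: one thread per pixel,
// all threads update a single histogram in global memory.
__global__ void hist_naive_cuda(const unsigned char *im, unsigned int *hist, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&hist[im[i]], 1u);   // many threads may target the same bin
}

// Example launch (hist zero-initialized beforehand, e.g. with cudaMemset):
// hist_naive_cuda<<<(n + 255) / 256, 256>>>(d_im, d_hist, n);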
A trick to speed up the histogram implementation on a GPU can be found in the CUDA programming manuals: perform the histogram update in shared memory, and expand the 1D histogram to a matrix with multiple columns that allow concurrent updates. In Quasar, this can be achieved by writing the kernel function manually. Inside a kernel function, the special arguments blkpos and blkdim (indicating respectively the position within a block and the block dimensions) become available.
function y = hist_shared(im : cube)
    function [] = __kernel__ kernel (im : cube, y : vec, pos : ivec3, blkpos : ivec3, blkdim : ivec3)
        nthreads = 4
        y_sh : mat = shared(256,4)

        % collaboratively initialize the shared histogram to zero
        cnt = prod(blkdim)
        ind = pos2ind(blkdim, blkpos)
        for i = ind..cnt..numel(y_sh)-1
            y_sh[ind2pos(size(y_sh), i)] = 0.0
        endfor
        syncthreads

        % update the sub-histogram assigned to this thread
        v = im[pos]
        threadid = mod(ind,nthreads)
        y_sh[v, threadid] += 1
        syncthreads

        % aggregate the sub-histograms into the global histogram
        for i = ind..cnt..256-1
            y[i] += sum(y_sh[i,0..3])
        endfor
    endfunction

    y = zeros(256)
    parallel_do(size(im,0..2),im,y,kernel)
endfunction
The kernel function works as follows: first, a local histogram y_sh is initialized with zeros. Because the size of the histogram is not necessarily equal to the number of pixels in one block, we use the pos2ind and ind2pos functions to convert between position vectors and linear indices. With the loop

for i = ind..cnt..numel(y_sh)-1
    y_sh[ind2pos(size(y_sh), i)] = 0.0
endfor
syncthreads

all threads within one block collaboratively initialize the local histogram to zero: each thread starts at its own linear index ind and strides through y_sh in steps of cnt, the number of threads in the block. A thread synchronization step (syncthreads) is then required to ensure that all threads see the fully initialized local histogram before updating it.
In the next part, the local histogram is updated using atomic operations. To avoid these operations being serialized, the local histogram is expanded by a factor nthreads.

v = im[pos]
threadid = mod(ind,nthreads)
y_sh[v, threadid] += 1

Each column of y_sh then contains a single sub-histogram that is accessed concurrently by only a fraction of the threads (those for which mod(ind,nthreads) equals the column index). At the end of the kernel function, all contributions to the local histograms are aggregated:
syncthreads
for i = ind..cnt..256-1
    y[i] += sum(y_sh[i,0..3])
endfor
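For readers who want to relate this to plain CUDA, a rough hand-written counterpart of the same trick could look as follows (a sketch only; the kernel name, NBANKS, the 8-bit input format and the launch configuration are assumptions, not part of the Quasar tutorial): each block keeps NBANKS sub-histograms in shared memory, every thread updates the sub-histogram selected by its thread index, and the partial results are merged into the global histogram at the end of the block.

#define NBANKS 4

// Hypothetical CUDA sketch of the shared-memory histogram trick.
__global__ void hist_shared_cuda(const unsigned char *im, unsigned int *hist, int n)
{
    __shared__ unsigned int h_sh[256 * NBANKS];

    // collaboratively zero the shared sub-histograms
    for (int i = threadIdx.x; i < 256 * NBANKS; i += blockDim.x)
        h_sh[i] = 0;
    __syncthreads();

    // each thread updates the sub-histogram selected by its index
    int bank = threadIdx.x % NBANKS;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&h_sh[im[i] * NBANKS + bank], 1u);
    __syncthreads();

    // merge the per-block sub-histograms into the global histogram
    for (int b = threadIdx.x; b < 256; b += blockDim.x) {
        unsigned int total = 0;
        for (int k = 0; k < NBANKS; ++k)
            total += h_sh[b * NBANKS + k];
        atomicAdd(&hist[b], total);
    }
}

The shared-memory atomics contend far less than the global ones, because each block only sees its own pixels and the updates are spread over NBANKS sub-histograms.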
The above boilerplate code can be avoided by using the shared memory caching transform. To enable this transform, a few parameters need to be set:

!kernel_transform enable="sharedmemcaching"
!kernel_arg name=y; type="inout"; access="shared"; op="+="; cache_slices=y[:]; numel=256

where the parameters are as follows:

name=y: the name of the variable to be cached
type="inout": the way the parameter is accessed (in = read-only, out = write-only, inout = read+write)
access="shared": specifies that the variable is to be cached in shared memory
op: indicates the accumulation operation (= : write, += : accumulate by adding)
cache_slices: indicates the portion of the matrix that needs to be cached
numel: an upper bound for the number of elements of the matrix to be cached (this needs to be fixed to enable the shared memory calculations and to increase the efficiency of the transform)

Using the shared memory caching transform, our histogram method can be implemented using a parallel loop:
function y = hist_auto(im)
    y : vec'unchecked = zeros(256)
    !parallel for
    for m=0..size(im,0)-1
        for n=0..size(im,1)-1
            for k=0..size(im,2)-1
                !kernel_transform enable="sharedmemcaching"
                !kernel_arg name=y; type="inout"; access="shared"; op="+="; cache_slices=y[:]; numel=256
                !kernel_tiling dims=[128,256,1]; mode="global"; target="gpu"
                v = im[m,n,k]
                y[v] += 1
            endfor
        endfor
    endfor
endfunction
Additionally, the kernel transform parameters can be separated from the implementation, e.g., so that they can be reused by other kernels, or so that target-specific optimizations can easily be specified. To do so, the following macro can be defined:
macro [] = hist_profile()
    if $target("gpu")
        !kernel_arg name=y; type="inout"; access="shared"; op="+="; cache_slices=y[:]; numel=256
        !kernel_transform enable="sharedmemcaching"
        !kernel_tiling dims=[128,256,1]; mode="global"; target="gpu"
    endif
endmacro
Whenever the compiler encounters !hist_profile in the code, it replaces the code attribute by the macro expansion. The function $target("gpu") selects the target at compile time; possible targets are cpu, gpu and nvidia_cuda. This way, our implementation becomes as simple as the straightforward implementation:
function y = hist_auto_targetspecific(im)
    y : vec'unchecked = zeros(256)
    !parallel for
    for m=0..size(im,0)-1
        for n=0..size(im,1)-1
            for k=0..size(im,2)-1
                !hist_profile
                v = im[m,n,k]
                y[v] += 1
            endfor
        endfor
    endfor
endfunction
Using a test program, we can compare the execution performance of the different methods.
function [] = main()
    im = imread("image.tif")
    num_runs = 1000

    tic()
    for run=0..num_runs-1
        y = hist_naive(im)
    endfor
    toc("Naive algorithm")
    print sum(y)/numel(im)   % sanity check: every pixel contributes one count, so this prints 1

    tic()
    for run=0..num_runs-1
        y = hist_auto(im)
    endfor
    toc("Auto algorithm")
    print sum(y)/numel(im)

    tic()
    for run=0..num_runs-1
        y = hist_shared(im)
    endfor
    toc("Hand-written algorithm")
endfunction
On an NVIDIA GeForce 980M GPU, the test results are as follows:
Naive algorithm: 565.4426 ms
Auto algorithm: 226.6738 ms
Hand-written algorithm: 326.774 ms
As we can see, the automatically generated algorithm performs best. The reason is that it includes additional optimizations beyond those discussed here.