Title: Tutorial 6: Advanced GPU programming

Advanced GPU programming

In this tutorial, we show how the histogram of an image can be implemented efficiently in Quasar, using some of its more advanced GPU programming features. In particular, we consider three possible implementations of the histogram calculation:

  1. A naive, straightforward implementation using atomic operations
  2. A target-specific implementation using shared memory
  3. The use of the shared memory caching and kernel tiling code transforms, which allow the Quasar compiler to generate an efficient implementation automatically.

The straightforward implementation is as follows:

function y = hist_naive(im : cube)
    y = zeros(256)
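    % the nested loops over m, n and k are executed in parallel; the += below is an atomic update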
    for m=0..size(im,0)-1
        for n=0..size(im,1)-1
            for k=0..size(im,2)-1
                v = im[m,n,k]
                y[v] += 1
            endfor
        endfor
    endfor
endfunction

Here, we calculate the number of occurrences of each pixel intensity between 0 and 255. Because the loops over the image (with iterators m, n and k) are executed in parallel, an atomic operation is necessary to avoid data races. In Quasar, the in-place operators (+=, -=, …) are automatically atomic.
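
For illustration, such a parallel loop corresponds conceptually to an explicit kernel function in which every thread processes one pixel and atomically increments the corresponding histogram bin. The following is only a sketch of that equivalence (the name hist_naive_kernel is introduced here purely for illustration); it uses the kernel syntax explained in more detail in the next section:

function y = hist_naive_kernel(im : cube)
    function [] = __kernel__ kernel (im : cube, y : vec, pos : ivec3)
        v = im[pos]     % pixel intensity at position pos
        y[v] += 1       % atomic increment of the corresponding bin
    endfunction

    y = zeros(256)
    parallel_do(size(im,0..2),im,y,kernel)
endfunction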

Although hist_naive is sufficient to compute a histogram in Quasar, there are only 256 possible bins and, in most images, the intensities are far from uniformly distributed. The atomic operation += is therefore likely to encounter many conflicting updates, eventually causing the additions to be executed serially.

Kernel function implementation

A trick to speed up the histogram computation on a GPU can be found in the CUDA programming manuals: perform the histogram update in shared memory, and expand the 1D histogram to a matrix with multiple columns that allow concurrent updates. In Quasar, this can be achieved by writing the kernel function manually. Within a kernel function, the special arguments blkpos and blkdim (indicating, respectively, the position within a block and the block dimensions) become available.

function y = hist_shared(im : cube)
    function [] = __kernel__ kernel (im : cube, y : vec, pos : ivec3, blkpos : ivec3, blkdim : ivec3)
        nthreads = 4
        y_sh : mat = shared(256,4)
        % initialize the local histogram y_sh in shared memory to zero
        cnt = prod(blkdim)
        ind = pos2ind(blkdim, blkpos)
        for i = ind..cnt..numel(y_sh)-1
            y_sh[ind2pos(size(y_sh), i)] = 0.0
        endfor
        syncthreads

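        % each thread updates one column of the expanded local histogram, selected by mod(ind, nthreads)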
        v = im[pos]
        threadid = mod(ind,nthreads)
        y_sh[v, threadid] += 1

        syncthreads
        for i = ind..cnt..256-1
            y[i] += sum(y_sh[i,0..3])
        endfor
    endfunction        

    y = zeros(256)
    parallel_do(size(im,0..2),im,y,kernel)

endfunction

The kernel function works as follows: first, a local histogram y_sh is initialized with zeros. Because the size of the histogram is not necessarily equal to the number of pixels in one block, we use the pos2ind and ind2pos functions to convert between position vectors and linear indices. With the loop,

for i = ind..cnt..numel(y_sh)-1
    y_sh[ind2pos(size(y_sh), i)] = 0.0
endfor
syncthreads

all threads within one block collaboratively initialize the local histogram to zero. For example, with a block size of 16 x 16 x 1, there are cnt = 256 threads and each thread initializes 4 of the 1024 entries of y_sh. A thread synchronization step (syncthreads) is then required to ensure that the local histogram is fully initialized before any thread starts updating it.
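
As a small illustration of these helper functions (the values below are arbitrary and only for demonstration), pos2ind maps a position to a linear index for an array of a given size, and ind2pos performs the inverse mapping:

sz = [256, 4]           % the dimensions of the shared array y_sh
p = [3, 1]              % an arbitrary example position within that array
i = pos2ind(sz, p)      % linear index corresponding to position p
q = ind2pos(sz, i)      % converts the index back to the position [3, 1]
print i
print q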

In the next part, the local histogram is updated using atomic operations. To avoid these updates being serialized, the local histogram is expanded by a factor nthreads.

v = im[pos]
threadid = mod(ind,nthreads)
y_sh[v, threadid] += 1

Each column of y_sh then contains a partial histogram that is updated by only a subset of the threads (the threads for which mod(ind, nthreads) equals the column index), so that roughly one quarter of the threads compete for each column when nthreads = 4. At the end of the kernel function, all contributions to the local histograms are aggregated:

syncthreads
for i = ind..cnt..256-1
    y[i] += sum(y_sh[i,0..3])
endfor

Automatic histogram caching

The above boilerplate code can be avoided by using the shared memory caching transform. To enable this transform, a number of parameters must be specified:

{!kernel_transform enable="sharedmemcaching"}
{!kernel_arg name=y; type="inout"; access="shared"; op="+="; cache_slices=y[:]; numel=256}

where name identifies the kernel argument to cache (here, the histogram y), type="inout" indicates that the argument is both read and written, access="shared" requests caching in shared memory, op="+=" specifies the operator used to accumulate the cached copies back into y, cache_slices selects the slices of y to cache, and numel gives the number of cached elements (here, the 256 histogram bins).

Using this transform, our histogram method can be implemented as a simple parallel loop:

function y = hist_auto(im)
    y : vec'unchecked = zeros(256)
    {!parallel for}
    for m=0..size(im,0)-1
        for n=0..size(im,1)-1
            for k=0..size(im,2)-1
                {!kernel_transform enable="sharedmemcaching"}
                {!kernel_arg name=y; type="inout"; access="shared"; op="+="; cache_slices=y[:]; numel=256}
                {!kernel_tiling dims=[128,256,1]; mode="global"; target="gpu"}
                v = im[m,n,k]
                y[v] += 1
            endfor
        endfor
    endfor

endfunction

Additionally, the kernel transform parameters can be separated from the implementation, e.g., so that they can be reused by other kernels, or so that target-specific optimizations can easily be specified. To do so, the following macro can be defined:

macro [] = hist_profile()
   if $target("gpu")
       {!kernel_arg name=y; type="inout"; access="shared"; op="+="; cache_slices=y[:]; numel=256}
       {!kernel_transform enable="sharedmemcaching"}
       {!kernel_tiling dims=[128,256,1]; mode="global"; target="gpu"}
   endif
endmacro
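
Because $target is evaluated at compile time, the same macro can also emit different attributes for different targets. The variant below is only a hedged sketch: the name hist_profile_ext and the decision to emit no attributes on non-GPU targets are illustrative assumptions, not a tuned profile.

macro [] = hist_profile_ext()
   if $target("gpu")
       {!kernel_arg name=y; type="inout"; access="shared"; op="+="; cache_slices=y[:]; numel=256}
       {!kernel_transform enable="sharedmemcaching"}
       {!kernel_tiling dims=[128,256,1]; mode="global"; target="gpu"}
   else
       % on other targets (e.g., cpu), no attributes are emitted and the
       % loop is compiled without shared memory caching
   endif
endmacro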

Whenever the compiler encounters {!hist_profile} in the code, it replaces the code attribute by the macro expansion. The function $target("gpu") selects the target at compile time. Possible targets are cpu, gpu and nvidia_cuda. This way, our implementation becomes as simple as the straightforward one:

function y = hist_auto_targetspecific(im)
    y : vec'unchecked = zeros(256)

    {!parallel for}
    for m=0..size(im,0)-1
        for n=0..size(im,1)-1
            for k=0..size(im,2)-1
                {!hist_profile}
                v = im[m,n,k]
                y[v] += 1
            endfor
        endfor
    endfor

endfunction    

Using a test program, we can compare the execution times of the different methods.

function [] = main()

    im = imread("image.tif")
    num_runs = 1000

    tic()
    for run=0..num_runs-1
        y = hist_naive(im)
    endfor
    toc("Naive algorithm")
    print sum(y)/numel(im)

    tic()
    for run=0..num_runs-1
        y = hist_auto(im)
    endfor
    toc("Auto algorithm")
    print sum(y)/numel(im)

    tic()
    for run=0..num_runs-1
        y = hist_shared(im)
    endfor
    toc("Hand-written algorithm")


endfunction

On an NVIDIA GeForce 980M GPU, the test results are as follows:

Naive algorithm: 565.4426 ms
Auto algorithm: 226.6738 ms
Hand-written algorithm: 326.774 ms

As we can see, the automatically generated algorithm performs best. The reason is that it includes additional optimizations beyond those discussed here, such as the kernel tiling transform specified above.