CUDA - Exploiting Constant Memory

The fact that data is constant can be exploited to yield improved kernel function performance. The GPU hardware provides several caches/memory types designed for this purpose:

- Constant memory: a small (typically 64 KB) read-only memory region with its own per-multiprocessor cache, optimized for the case where all threads in a warp read the same address.
- Texture memory: read-only access to data through a dedicated texture cache, originally intended for graphics texturing.

For practical purposes, constant memory is rather small, so it is mostly useful for storing filter coefficients and weights that do not change during kernel execution. Texture memory, on the other hand, is quite large, has its own cache, and can be used for storing constant input signals/images.

The drawback is that texture memory is separate from global memory and cannot be written by kernels. Consequently, data must be copied from global memory to texture memory before the kernel is started, which sometimes means two separate copies of the same data reside in GPU memory. Luckily, for Kepler GPUs there is a solution (see 'hwtex_const below).

In Quasar, constant/texture memory can be utilized by adding modifiers to the kernel function parameter types. The following modifiers are available:

- 'hwconst: the parameter is placed in constant memory.
- 'hwtex_const: the parameter is read through the texture cache (the read-only data cache on Kepler and newer GPUs), without requiring a separate texture copy.
- 'hwtex_nearest: the parameter is accessed through the hardware texturing units, with nearest-neighbor interpolation.

Note that regular global memory accesses (i.e., without 'hw* modifiers) are cached in L2. On Kepler GPU devices, 'hwtex_const routes reads through the texture (read-only data) cache instead: a separate cache with a separate memory pipeline and relaxed memory coalescing rules, which may bring advantages to bandwidth-limited kernels. Source: Kepler tuning guide.

Starting with Maxwell GPU devices, the L1 cache and the texture caches are unified. The unified L1/texture cache coalesces the memory accesses, gathering up the data requested by the threads in a warp, before delivering the data to the warp. Source: Maxwell tuning guide.

Correspondingly, some optimization tips are:

- Use 'hwconst for small read-only arrays (e.g., filter coefficients) that are accessed uniformly by the threads in a warp.
- Use 'hwtex_const for larger read-only inputs (signals/images), especially in bandwidth-limited kernels on Kepler or newer GPUs.

In the future, the Quasar compiler may be able to add 'hwconst and 'hwtex_const automatically. For now, let's start testing to understand when these modifiers are beneficial!

Example

As an example, consider the following convolution program:

Default version with no constant memory being used:

function [] = __kernel__ kernel(x : vec, y : vec, f : vec, pos : int)
    sum = 0.0
    for i=0..numel(f)-1
        sum += x[pos+i] * f[i]
    endfor
    y[pos] = sum
endfunction
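Each kernel instance computes one output sample of a sliding-window correlation: y[pos] = sum over i of x[pos+i]*f[i]. As a reference, the same computation can be written sequentially in C (a minimal sketch; the function name is illustrative, not part of Quasar):

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference for the kernel: y[pos] = sum_i x[pos+i] * f[i].
   Computes all output positions for which the window fits in x. */
static void convolve_ref(const float *x, float *y, const float *f,
                         size_t nx, size_t nf)
{
    for (size_t pos = 0; pos + nf <= nx; pos++) {
        float sum = 0.0f;
        for (size_t i = 0; i < nf; i++)
            sum += x[pos + i] * f[i];
        y[pos] = sum;
    }
}
```

Note that every thread reads the entire filter f, which is exactly the access pattern that constant memory handles well.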

Version with constant memory:

function [] = __kernel__ kernel_hwconst(x : vec, y : vec, f : vec'hwconst, pos : int)
    sum = 0.0
    for i=0..numel(f)-1
        sum += x[pos+i] * f[i]
    endfor
    y[pos] = sum
endfunction

Version with constant texture memory for f:

function [] = __kernel__ kernel_hwtex_const(x : vec, y : vec, f : vec'hwtex_const, pos : int)
    sum = 0.0
    for i=0..numel(f)-1
        sum += x[pos+i] * f[i]
    endfor
    y[pos] = sum
endfunction

Version with constant texture memory for x and f:

function [] = __kernel__ kernel_hwtex_const2(x : vec'hwtex_const, y : vec, f : vec'hwtex_const, pos : int)
    sum = 0.0
    for i=0..numel(f)-1
        sum += x[pos+i] * f[i]
    endfor
    y[pos] = sum
endfunction

Version with HW textures:

function [] = __kernel__ kernel_tex(x : vec, y : vec, f : vec'hwtex_nearest, pos : int)
    sum = 0.0
    for i=0..numel(f)-1
        sum += x[pos+i] * f[i]
    endfor
    y[pos] = sum
endfunction

Timing code:

x = zeros(2048^2)
y = zeros(size(x))
f = 0..31  % dummy filter coefficients

tic()
for k=0..99
    parallel_do(size(y),x,y,f,kernel)
endfor
toc("Default")

tic()
for k=0..99
    parallel_do(size(y),x,y,f,kernel_hwconst)
endfor
toc("'hwconst")

tic()
for k=0..99
    parallel_do(size(y),x,y,f,kernel_hwtex_const)
endfor
toc("'hwtex_const")

tic()
for k=0..99
    parallel_do(size(y),x,y,f,kernel_hwtex_const2)
endfor
toc("'hwtex_const2")

tic()
for k=0..99
    parallel_do(size(y),x,y,f,kernel_tex)
endfor
toc("'hwtex_nearest")
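The tic()/toc() pattern above simply measures 100 launches of each kernel. The same harness can be sketched in C (names are illustrative; clock() measures CPU time, which suffices for a CPU-side sketch):

```c
#include <time.h>

/* Minimal tic/toc-style helper, mirroring the Quasar timing code:
   run a workload 100 times and return the elapsed time in ms. */
static double time_100_runs_ms(void (*run)(void))
{
    clock_t t0 = clock();
    for (int k = 0; k < 100; k++)
        run();
    return (double)(clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
}

/* Example workload standing in for parallel_do(...). */
static volatile double sink;
static void dummy_workload(void)
{
    double s = 0.0;
    for (int i = 0; i < 100000; i++)
        s += i * 0.5;
    sink = s;
}
```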

Results for the NVIDIA GeForce GTX 980 (Maxwell architecture):

Default: 513.0294 ms
'hwconst: 132.0075 ms
'hwtex_const: 128.0074 ms
'hwtex_const2: 95.005 ms
'hwtex_nearest: 169.0096 ms

It can be seen that using constant memory alone ('hwconst) already yields a speed-up of nearly a factor of 4. The best performance is obtained with 'hwtex_const2, where both x and f are read through the texture cache; using 'hwtex_const for f alone gives performance close to that of 'hwconst.

Note that a further improvement can be obtained using shared memory, with which we reach 85 ms in this case.
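The idea behind the shared-memory variant is tiling: each thread block cooperatively loads its tile of x, plus a halo of numel(f)-1 samples, into on-chip shared memory once, and all threads then read their filter windows from that fast buffer instead of global memory. A CPU sketch of this strategy (the block size, local buffer, and function name are illustrative stand-ins, not the actual GPU implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 256  /* illustrative thread-block size */

/* Tiled variant: each "block" copies its tile of x plus a halo of
   (nf - 1) samples into a local buffer (standing in for GPU shared
   memory), then computes all outputs of the tile from that buffer. */
static void convolve_tiled(const float *x, float *y, const float *f,
                           size_t nx, size_t nf)
{
    float tile[BLOCK + 64];  /* BLOCK outputs + halo; assumes nf <= 65 */
    for (size_t base = 0; base + nf <= nx; base += BLOCK) {
        size_t nout = nx - nf + 1 - base;      /* outputs remaining */
        if (nout > BLOCK) nout = BLOCK;
        /* cooperative load: tile covers x[base .. base+nout+nf-2] */
        memcpy(tile, x + base, (nout + nf - 1) * sizeof(float));
        for (size_t t = 0; t < nout; t++) {    /* one "thread" per t */
            float sum = 0.0f;
            for (size_t i = 0; i < nf; i++)
                sum += tile[t + i] * f[i];
            y[base + t] = sum;
        }
    }
}
```

Because adjacent output positions overlap by numel(f)-1 samples, each sample of x is fetched from global memory only once per block instead of up to numel(f) times, which explains the additional gain over the cache-based variants.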