
12 SIMD processing on CPU and GPU

In SIMD processing, multiple processing elements process data simultaneously, driven by a single instruction stream. SIMD is supported by the x86/x64 architectures as well as by ARM, and refers to special 'multimedia' instructions that were originally added to allow real-time video encoding/decoding and 3D games to run on the CPU.
CUDA supports SIMD processing via its warp-based execution model, also known as single instruction multiple threads (SIMT). In addition, CUDA supports a limited set of SIMD intrinsics on half precision floating point formats and 8-bit/16-bit integer types that can be used within a single thread, essentially allowing the vector length to be extended beyond the warp size (for example, a vector length of 128 for 8-bit integers).
In essence, when performing calculations using vectors of an appropriate length (see below), the operations are automatically mapped onto a SIMD implementation whenever possible. For example
im[m,n,0..3]
might (depending on the settings and the machine architecture) be loaded into an SSE register (four 32-bit floating point numbers) in x86/x64 code, while
im[m,n,0..7]
might lead to an AVX load (eight 32-bit floating point numbers).
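To picture what the back-end compiler does with such vector operations, the following plain C sketch (illustrative only, not Quasar-generated code) shows the kind of fixed-length elementwise loop that an auto-vectorizer maps onto SSE/AVX instructions:

```c
#include <stddef.h>

/* Elementwise addition over fixed-length float vectors.
   For length 4, a vectorizing compiler can emit a single SSE addps;
   for length 8, a single AVX vaddps. */
void add8(const float a[8], const float b[8], float out[8])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = a[i] + b[i];   /* independent lanes: ideal SIMD candidate */
}
```

Because the eight lane computations are independent, the loop maps directly onto a single wide register operation.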
Some type aliases are defined in inttypes.q and floattypes.q to simplify SIMD processing:
Type      Defined in     Meaning                                                    Alias for
--------  -------------  ---------------------------------------------------------  -------------------
i8vec4    inttypes.q     8-bit signed integer vector of length 4                    vec[int8](4)
u8vec4    inttypes.q     8-bit unsigned integer vector of length 4                  vec[uint8](4)
i16vec4   inttypes.q     16-bit signed integer vector of length 4                   vec[int16](4)
u16vec4   inttypes.q     16-bit unsigned integer vector of length 4                 vec[uint16](4)
hvec2     floattypes.q   16-bit (half precision) floating point vector of length 2  vec[scalar'half](2)
By using vector types such as the above, the back-end compiler (e.g., MSVC, GCC, ...) can choose the appropriate SIMD instructions. Because not all compilers perform equally well at this kind of code generation (as a side note, the Intel C/C++ compiler currently offers the best results in our experiments), several low-level operations have been manually vectorized in the header file quasar_simd.h.
The Quasar compiler supports automatic vectorization of kernel functions. This relieves the programmer from unrolling loops and writing manually vectorized expressions. A kernel can automatically be enabled for SIMD processing by the following code attribute:
{!kernel_transform enable="simdprocessing"; target="cpu"}
where target specifies cpu (for x86/x64 SIMD) or cuda (for CUDA SIMD). When the target is omitted, SIMD is applied to any target architecture (when possible). With Quasar SIMD processing, the SIMD acceleration depends much less on the back-end compiler that is being used.
Because SIMD processing is not guaranteed to improve performance (especially when the vector types or operations used are not natively supported), it is not enabled by default. The following subsections list the operations and vector types that are accelerated.

12.1 Storage versus computation types

It is important to distinguish between types used for storage and types used for calculation. Storage types represent the data in a more compact format, allowing the memory bandwidth to be reduced whenever the precision requirements permit it. For example, an image may be stored in uint8 format, whereas the processing is performed in int32 (or even scalar) format. This combines the storage benefits with the calculation benefits associated with the computation units of the device.
Going one step further, the computations may also be performed in a lower precision, allowing, e.g., four uint8 integers to be processed simultaneously in CUDA. This is useful to reduce the pressure on the computation units of the GPU multiprocessor. However, one should take into account that not all operations are natively supported in lower precision, and that lower precision types are more susceptible to integer overflow.
Therefore, by default, calculations are performed in a suitable type for which all operations are fully supported by the device (typically, 32-bit integer or 32-bit floating point operations). Conversion operations between storage types and computation types (e.g., from u8vec4 to ivec4 and back) can be implemented using SIMD instructions, depending on the device.
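The storage/computation split can be sketched in plain C (a hypothetical helper, not Quasar's implementation): values are widened from uint8 to int32 on load, accumulated at full precision, and narrowed again on store:

```c
#include <stdint.h>
#include <stddef.h>

/* Storage in uint8, computation in int32: widen on load,
   accumulate at full precision, narrow on store. */
void box4(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i++) {
        int32_t sum = 0;                  /* int32 computation type */
        for (size_t k = 0; k < 4; k++)
            sum += in[i + k];             /* uint8 -> int32: no overflow */
        out[i] = (uint8_t)(sum / 4);      /* mean of four uint8 values fits in uint8 */
    }
}
```

Had the accumulation been performed directly in uint8, the intermediate sum could overflow; the widened computation type avoids this at no storage cost.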
Storage modifiers:
Integer types have storage modifiers. These determine how integer overflows are handled when the results of a computation are written back to memory. When using SIMD vector types, the overflow detection can also be accelerated using SIMD instructions. The following table gives an overview of which integer overflow checks are currently accelerated.
Modifier    Meaning                                                               x86/x64 SIMD  CUDA SIMD
----------  --------------------------------------------------------------------  ------------  ---------
'unchecked  No overflow checking is performed                                     Yes           Yes
'checked    Overflow checking is performed; an overflow leads to a runtime error  No            No
'sat        Values are clipped (saturated) to the value range of the target type  Yes           No
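The semantics of the 'unchecked and 'sat modifiers when narrowing an int32 computation result back to uint8 storage can be sketched in C (illustrative helpers, not Quasar's implementation):

```c
#include <stdint.h>

/* 'unchecked: the int32 result is narrowed modulo 256 (wraparound) */
uint8_t store_unchecked(int32_t v)
{
    return (uint8_t)(v & 0xFF);
}

/* 'sat: the int32 result is clipped to the uint8 range [0, 255];
   x86 offers this directly via saturating pack instructions */
uint8_t store_sat(int32_t v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}
```

The 'checked modifier has no such branch-free formulation: it must signal a runtime error on overflow, which is why it is not SIMD accelerated in the table above.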

12.2 x86/x64 SIMD accelerated operations

Supported vector lengths: 4 and 8 (see below).
x86 and x64 processors support various vector processing extensions: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2. It is important that the selected CPU target architecture matches the machine, otherwise an exception will be generated. The target architecture can be configured in the Program Settings dialog box in Redshift (or alternatively in the Quasar.config.xml configuration file):
Target architecture specifies the architecture to tune for. Vector extensions sets the highest level of vector processing instructions that the processor supports.
The following operations are guaranteed to result in SIMD instructions, independent of the back-end compiler.
Floating point operations:
Operation                                                        Type     Requirements (32-bit float)  Requirements (64-bit float)
---------------------------------------------------------------  -------  ---------------------------  ---------------------------
Operators +, -, .*, ./                                           scalar4  SSE                          SSE2 / AVX
Operators +, -, .*, ./                                           scalar8  AVX                          AVX
Comparison operators ==, !=, >, <, >=, <=                        scalar4  SSE                          SSE2 / AVX
Functions min, max, floor, ceil, sqrt, rsqrt, round, frac, sign  scalar4  SSE2                         SSE2 / AVX
Functions sum, prod, dotprod                                     scalar4  SSE2                         SSE2 / AVX
Integer operations:
Operation                                      Type  Requirements
---------------------------------------------  ----  ------------
Operators +, -                                 int4  SSE2
Operators .*                                   int4  SSE4.1
Comparison operators ==, !=, >, <, >=, <=      int4  SSE2
Functions abs, min, max, sign, sum, prod, any  int4  SSE2
Functions and, or, xor                         int4  SSE2
Type conversions:
Operation                                             Requirements (32-bit float)  Requirements (64-bit float)
----------------------------------------------------  ---------------------------  ---------------------------
Conversion vec4 -> ivec4                              SSE2                         SSE2 / AVX
Conversion ivec4 -> vec4                              SSE2                         SSE2 / AVX
Conversion ivec4 -> i8vec4, i16vec4, u8vec4, u16vec4  SSE2                         -
Conversion i8vec4, i16vec4, u8vec4, u16vec4 -> ivec4  SSE2                         -

12.2.1 Example: AVX image filtering on CPU

The following code demonstrates how to use SIMD processing on the CPU.
{!parallel for}
for m=0..size(im,0)-1
    for n=0..size(im,1)-1
        {!kernel_transform enable="simdprocessing"; target="cpu"; numel=8}
        r = 0.0
        for x=0..7
            r += im[m,n+x]
        endfor
        im_out[m,n] = r/8
    endfor
endfor
The simdprocessing kernel transform causes the loop to be automatically parallelized and vectorized. The parameter numel specifies the desired vector length (8 for AVX, 4 for SSE). If numel is omitted, the vector length is determined by the available SIMD extensions of the CPU (e.g., AVX, SSE).
Whenever suitable, the vectorizer converts branches into branch-free expressions that can be vectorized. In addition, cooldown code is generated for the case in which the image dimensions (in the above example, size(im,1)) are not a multiple of the SIMD width.

12.3 CUDA SIMD accelerated operations

Supported vector lengths: 2 and 4
CUDA vector instructions always interact with 32-bit words. Depending on the precision (8-bit integer, 16-bit integer or 16-bit floating point), the vector length is either 2 or 4.
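The packing into 32-bit words can be illustrated with a SIMD-within-a-register sketch in portable C; CUDA's byte-wise SIMD instructions perform essentially this per-lane addition in hardware (the function name vadd4 here is purely illustrative):

```c
#include <stdint.h>

/* Four uint8 lanes packed into one 32-bit word, added per lane with
   no carry leaking between lanes (SIMD-within-a-register). */
uint32_t vadd4(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* add low 7 bits of each lane */
    return low ^ ((a ^ b) & 0x80808080u);                 /* fix up the top bit per lane */
}
```

Masking off the top bit of each lane before adding prevents a carry out of one byte from corrupting its neighbor, which is exactly what distinguishes a packed 4x8-bit addition from an ordinary 32-bit addition.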
Floating point operations:
The half precision floating point format (FP16) is useful to reduce the device memory bandwidth for algorithms that are not sensitive to the reduced precision. For half, only integers between -2048 and 2048 can be represented exactly. Values larger than 65520 or smaller than -65520 are rounded toward positive and negative infinity, respectively.
Next to the reduced bandwidth, starting with the Pascal architecture, the GPU also offers hardware support for computations in this type. To gain maximal performance benefits, it is best to use the 32-bit, length-2 SIMD half type (hvec2). Use of hvec2 typically results in two numbers being calculated in parallel, leading to a performance similar to one single precision floating point (FP32) operation. However, a Volta or Turing GPU is required to obtain performance benefits from calculating in half precision format. For Kepler, Maxwell and Pascal GPUs, hardware support for half precision operations is notably slow; on these GPUs it is best to use the half type only for storage purposes (i.e., calculations are performed in the single precision floating point type).
The following table shows the operations that have been accelerated using half types. For unsupported operations, the computation is performed in FP32 format, leading to extra conversions between FP32 and FP16.
Operation                                                                   Type
--------------------------------------------------------------------------  -----
Operators +, -, .*                                                          hvec2
Comparison operators ==, !=, >, <, >=, <=                                   hvec2
Functions ceil, cos, exp, floor, log, log2, log10, sin, rsqrt, sqrt, round  hvec2
Integer operations:
CUDA supports SIMD operations for 8-bit and 16-bit integer vectors that fit into a 32-bit word. The table below lists the operations that are accelerated. For unsupported operations, the computation is performed in 32-bit integer format, leading to extra conversions between the 32-bit and 8/16-bit integer formats.
Operation                                  Type                              Requirements
-----------------------------------------  --------------------------------  ----------------
Operators +, -                             i8vec4, i16vec2, u8vec4, u16vec2  Any CUDA version
Comparison operators ==, !=, >, <, >=, <=  i8vec4, i16vec2, u8vec4, u16vec2  Any CUDA version
Function abs                               i8vec4, i16vec2                   Any CUDA version
Functions min, max                         i8vec4, i16vec2, u8vec4, u16vec2  Any CUDA version
When reading lower precision values such as vec[uint8](4) or vec[int16](2) from an array, the Quasar compiler upcasts the value to the best suited machine precision type (usually vec[int]). This avoids unintended integer overflows in the computations, but may also prevent some operations from being SIMD accelerated.
As a workaround, Quasar provides the {!simd_noupcast} code attribute. The following box filtering example illustrates its use:
function [] = boxfilter8(im8 : mat[uint8], im_out : mat[uint8])
    K = 8  % filter length (example value)
    {!parallel for}
    for m=0..size(im8,0)-1
        for n=0..16..size(im8,1)-1
            {!simd_noupcast enable=im8}
            r = vec[uint8](16)
            for x=0..K-1
                r += im8[m,n+x+(0..15)]
            endfor
            im_out[m,n+(0..15)] = int(r/(2*K))
        endfor
    endfor
endfunction
This technique ensures that the accumulation in r is performed in uint8 precision. Be aware that there are no integer overflow checks in this code.

12.3.1 Example: 8-bit image filtering

The following example illustrates how to make use of CUDA SIMD integer operations. The calculations are performed in 32-bit integer format, while 8-bit unsigned integers are used for storage.
import "inttypes.q"
K = 4  % filter length (example value)
im8 = uint8(imread("image.jpg"))
im_out = mat[uint8'sat](size(im8))
{!parallel for}
for m=0..size(im8,0)-1
    for n=0..size(im8,1)-1
        {!kernel_transform enable="simdprocessing"; target="cuda"; numel=4}
        r = 0
        for x=0..K-1
            r += im8[m,n+x]
        endfor
        im_out[m,n] = int(r/(2*K)) % integer division requires the int() function
    endfor
endfor
Compared to the previous example, the vectorization is performed automatically.

12.3.2 Example: 16-bit half float image filtering

import "floattypes.q"
K = 4  % filter length (example value)
im_half = scalar_to_half(imread("image.jpg"))
im_out2 = zeros(size(im_half))  % output matrix (FP32 storage)
{!parallel for}
for m=0..size(im_half,0)-1
    for n=0..size(im_half,1)-1
        {!kernel_transform enable="simdprocessing"; target="cuda"; numel=2}
        r = 0.0
        for x=0..K-1
            r += im_half[m,n+2*x]
        endfor
        im_out2[m,n] = r/(2*K+1)
    endfor
endfor

12.4 ARM Neon accelerated operations

ARM Neon SIMD operations are currently not supported, but may be added in the future. However, by using the vector types, the back-end compiler may still generate code relying on Neon SIMD instructions.

12.5 Automatic alignment

To benefit maximally from SIMD operations, it is essential that the memory load operations are aligned (meaning that the memory address modulo a given constant is zero; this constant typically equals the SIMD register width, e.g., 16 bytes for SSE).
The Quasar runtime ensures that all vectors and matrices have aligned base addresses. However, this is not sufficient to ensure that all memory loads are aligned; for example, the memory load
A[n*4+(0..3)+1]
can never be aligned. In this case, an unaligned read is generated, which may reduce the performance.
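The alignment criterion can be made concrete with a small C sketch, assuming a float array whose base address is 16-byte aligned (as the Quasar runtime guarantees):

```c
#include <stddef.h>

/* Given a float array whose base address is 16-byte aligned, a 4-float
   SSE load starting at element offset idx is aligned iff the byte
   offset idx * sizeof(float) is a multiple of 16. */
int is_aligned_load(size_t idx)
{
    return (idx * sizeof(float)) % 16 == 0;
}
```

For the load A[n*4+(0..3)+1], the first element offset is 4*n+1, giving a byte offset of 16*n+4, which is never a multiple of 16; hence the unaligned read.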

12.6 Automatic SIMD code generation

There are already several situations in which the Quasar compiler automatically generates SIMD code:
  1. when fixed-length vectors are used, where the length is suitably chosen (see the supported vector lengths for CPU and CUDA above);
  2. during expression optimization (see Section 8.1);
  3. in code using shared memory designators (see Section 9.2).
In the future, the compiler may rely more and more on SIMD code generation where appropriate. For now, SIMD processing can be activated by enabling the simdprocessing transform.