12 SIMD processing on CPU and GPU
In SIMD, multiple processing elements process data simultaneously, driven by a single instruction stream. SIMD is supported by the x86/x64 architectures as well as by ARM, and refers to special 'multimedia' instructions that were originally added to allow real-time video encoding/decoding and 3D games to run on the CPU.
CUDA supports SIMD processing via its warp-based execution model, also known as single instruction multiple threads (SIMT). In addition, CUDA offers a limited set of SIMD intrinsics on half precision floating point formats and on 8-bit/16-bit integer types that can be used within a single thread, essentially allowing the vector length to be extended beyond the warp size (for example, an effective vector length of 128 for 8-bit integers).
In essence, when performing calculations on vectors of an appropriate length (see below), the operations are automatically mapped onto a SIMD implementation whenever possible. For example, loading a fixed-length vector of four 32-bit floating point numbers might (depending on the settings and machine architecture) translate into an SSE register load in x86/x64 code, while a vector of eight 32-bit floating point numbers might lead to an AVX load.
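As a minimal sketch (the variable names are made up for illustration), the element-wise operations below work on fixed-length vectors, so the back-end compiler is free to map them onto single SSE (length 4) or AVX (length 8) instructions:
a = [1.0, 2.0, 3.0, 4.0]    % length-4 single precision vector
b = [5.0, 6.0, 7.0, 8.0]
c = a + b                   % element-wise sum: a candidate for a single SSE addition
d = 2.0*c                   % scaling by a constant is likewise a SIMD candidate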
Some type aliases are defined in inttypes.q and floattypes.q to simplify SIMD processing:
Type | Defined in | Meaning | Alias for
i8vec4 | inttypes.q | 8-bit signed integer vector of length 4 | vec[int8](4)
u8vec4 | inttypes.q | 8-bit unsigned integer vector of length 4 | vec[uint8](4)
i16vec4 | inttypes.q | 16-bit signed integer vector of length 4 | vec[int16](4)
u16vec4 | inttypes.q | 16-bit unsigned integer vector of length 4 | vec[uint16](4)
hvec2 | floattypes.q | 16-bit (half precision) floating point vector of length 2 | vec[scalar'half](2)
By using vector types such as the above, the back-end compiler (e.g., MSVC, GCC, ...) can choose the appropriate SIMD instructions. Because not all compilers have the best code generation in this respect (as a side note, the Intel C/C++ compiler currently gives the best results in our experiments), several low-level operations have been vectorized manually in a dedicated header file.
The Quasar compiler supports automatic vectorization of kernel functions. This relieves the programmer from unrolling loops and writing manually vectorized expressions. A kernel can be enabled for SIMD processing automatically using the following code attribute:
{!kernel_transform enable="simdprocessing"; target="cpu"}
where target specifies cpu (for x86/x64 SIMD) or cuda (for CUDA SIMD). When the target is omitted, SIMD is applied to any target architecture (when possible). With Quasar SIMD processing, the SIMD acceleration depends much less on the back-end compiler being used.
Because it is not guaranteed that SIMD processing improves the performance (especially when the vector types or operations used are not natively supported), SIMD processing is not enabled by default. Therefore, the following subsections list the operations and vector types that are accelerated.
12.1 Storage versus computation types
It is important to distinguish between types used for storage and types used for calculation. Storage types represent the data in a more compressed format, allowing the memory bandwidth to be reduced whenever the precision requirements allow it. For example, an image may be stored in 8-bit unsigned integer (uint8) format, whereas the processing is performed in 32-bit integer (or even floating point) format. This combines the storage benefits with the calculation benefits associated with the computation units of the device.
Going one step further, the computations may also be performed in a lower precision, allowing, e.g., four 8-bit integers to be processed simultaneously in CUDA. This is useful to reduce the pressure on the computation units of the GPU multiprocessor. However, one should take into account that:
- intermediate calculations may suffer from integer overflows or a lack of precision;
- not all operations are supported by the hardware (for example, transcendental mathematical functions applied to integer data), and using such operations will reduce the performance, because the compiler needs to convert the data to a supported (typically floating point) format, generate code to call the function, and convert back to the initial integer format.
Therefore, by default, calculations are performed in a suitable type for which all operations are fully supported by the device (typically, 32-bit integer or 32-bit floating point operations). Conversion operations between storage types and computation types (e.g., from uint8 to int and back) can be implemented using SIMD instructions, depending on the device.
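The following sketch illustrates the distinction (the variable names and the 3/4 scaling are made up for illustration): the data is stored in uint8, while the per-pixel arithmetic is carried out in 32-bit integers, following the conventions of the examples later in this chapter.
import "inttypes.q"
x = uint8(imread("image.jpg"))      % storage in 8-bit unsigned integer format
y = mat[uint8](size(x))             % output, also stored as uint8
{!parallel for}
for m=0..size(x,0)-1
    for n=0..size(x,1)-1
        v = int(x[m,n])             % computation in 32-bit integer precision (Quasar upcasts automatically)
        y[m,n] = int((3*v)/4)       % result converted back to the uint8 storage type on write-back
    endfor
endfor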
Storage modifiers:
Integer types have storage modifiers. These determine how integer overflows are handled when the results of a computation are written back to memory. When using SIMD vector types, the overflow detection can also be accelerated using SIMD instructions. The following table gives an overview of which integer overflow checks are currently accelerated.
Modifier | Meaning | x86/x64 SIMD | CUDA SIMD
unchecked | No overflow checking is performed | Yes | Yes
checked | Overflow checking is performed; an overflow leads to a runtime error | No | No
sat | Values are clipped (saturated) to the value range of the target type | Yes | No
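A small usage sketch of the saturating (sat) modifier, reusing the constructs from the example in Section 12.3.1; values written outside the 0..255 range are clipped rather than wrapped:
import "inttypes.q"
im8 = uint8(imread("image.jpg"))       % as in the example of Section 12.3.1
im_out = mat[uint8'sat](size(im8))     % saturating 8-bit storage
im_out[0,0] = 300                      % out of range: clipped (saturated) to 255
im_out[0,1] = -20                      % out of range: clipped to 0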
12.2 x86/x64 SIMD accelerated operations
Supported vector lengths: 4 and 8 (see below).
x86 and x64 processors support various vector processing extensions: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX and AVX2. It is important that the selected CPU target architecture matches the machine, otherwise an exception will be generated. The target architecture can be configured in the Program Settings dialog box in Redshift (or alternatively in the configuration file): Target architecture specifies the architecture to tune for, while Vector extensions sets the highest level of vector processing instructions that the processor supports.
The following operations are guaranteed to result in SIMD instructions, independent of the back-end compiler.
Floating point operations:
Operation | Type | Requirements (32-bit float) | Requirements (64-bit float)
Arithmetic operators |  | SSE | SSE2 / AVX
Operators |  | AVX | AVX
Comparison operators |  | SSE | SSE2 / AVX
Functions |  | SSE2 | SSE2 / AVX
Functions |  | SSE2 | SSE2 / AVX
Integer operations:
Operation | Type | Requirements
 |  | SSE2
 |  | SSE4.1
 |  | SSE2
 |  | SSE2
 |  | SSE2
Type conversions:
Operation | Type | Requirements (32-bit float) | Requirements (64-bit float)
Conversion |  | SSE2 | SSE2 / AVX
Conversion |  | SSE2 | SSE2 / AVX
Conversion |  | SSE2 | 
Conversion |  | SSE2 | 
12.2.1 Example: AVX image filtering on CPU
The following code demonstrates how to use SIMD processing on the CPU.
{!parallel for}
for m=0..size(im,0)-1
    for n=0..size(im,1)-1
        {!kernel_transform enable="simdprocessing"; target="cpu"; numel=8}
        r = 0.0
        for x=0..7
            r += im[m,n+x]   % accumulate an 8-pixel horizontal window
        end
        im_out[m,n] = r/8    % normalize by the window size
    endfor
endfor
The simdprocessing kernel transform causes the loop to be automatically parallelized and vectorized. The parameter numel specifies the desired vector length (8 for AVX, 4 for SSE). If numel is omitted, the vector length is determined by the available SIMD extensions of the CPU (e.g., AVX, SSE).
Whenever suited, the vectorizer converts branches into branch-free expressions that can be vectorized. In addition, cool-down code is generated for the case that the image dimensions (in the above example size(im,1)) are not a multiple of the SIMD width.
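As an illustrative sketch (a hypothetical thresholding example, not taken from the manual), the branch below is a candidate for conversion into a branch-free select, so the loop can still be vectorized:
{!parallel for}
for m=0..size(im,0)-1
    for n=0..size(im,1)-1
        {!kernel_transform enable="simdprocessing"; target="cpu"}
        if im[m,n] > 128          % the if/else can be turned into a predicated (branch-free) expression
            im_out[m,n] = 255
        else
            im_out[m,n] = 0
        endif
    endfor
endfor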
12.3 CUDA SIMD accelerated operations
The following SIMD vector types are supported:
Type name | Alias for | Meaning
u8vec4 | vec[uint8](4) | Four unsigned 8-bit integers (0..255)
i8vec4 | vec[int8](4) | Four signed 8-bit integers (-128..127)
u16vec2 | vec[uint16](2) | Two unsigned 16-bit integers (0..65535)
i16vec2 | vec[int16](2) | Two signed 16-bit integers (-32768..32767)
hvec2 | vec[scalar'half](2) | Two half precision floating point numbers
CUDA vector instructions always operate on 32-bit words. Depending on the precision (8-bit integer, 16-bit integer or 16-bit floating point), the vector length is either 2 or 4.
Floating point operations:
The half precision floating point format (FP16) is useful to reduce the device memory bandwidth for algorithms that are not sensitive to the reduced precision. In FP16, only integers between -2048 and 2048 can be represented exactly (above 2048, only every second integer is representable). Integers larger than 65520 or smaller than -65520 are rounded to positive and negative infinity, respectively.
Next to the reduced bandwidth, starting with the Pascal architecture, the GPU also offers hardware support for computations in this type. To gain maximal performance benefits, it is best to use the 32-bit, length-2 SIMD half type (hvec2). Use of hvec2 typically results in two numbers being calculated in parallel, leading to a performance similar to one single precision floating point (FP32) operation. However, a Volta or Turing GPU is required to obtain performance benefits from calculating in half precision. For Kepler, Maxwell and Pascal GPUs, hardware support for half precision operations is notably slow, therefore it is best to use the half precision type only for storage purposes (i.e., calculations are performed in the single precision floating point type).
The following table shows the operations that have been accelerated for the half precision types. For unsupported operations, the computation is performed in FP32 format, leading to extra conversions between FP32 and FP16.
Operation | Type
Operators | 
Comparison operators | 
Functions | 
Integer operations:
CUDA supports SIMD operations for 8-bit and 16-bit integer vectors that fit into a 32-bit word. Below is a table of the operations that are accelerated. For unsupported operations, the computation is performed in 32-bit integer format, leading to extra conversions between 32-bit and 8/16-bit integer formats.
Operation | Type | Requirements
Operators | u8vec4, i8vec4, u16vec2, i16vec2 | Any CUDA version
Comparison operators | u8vec4, i8vec4, u16vec2, i16vec2 | Any CUDA version
 |  | Any CUDA version
 | u8vec4, i8vec4, u16vec2, i16vec2 | Any CUDA version
When reading lower precision values such as vec[uint8](4) or vec[int16](2) elements from an array, the Quasar compiler upcasts the value to the best suited machine precision type (usually vec[int]). This avoids unintended integer overflows in the computations, but may also prevent some operations from being SIMD accelerated.
As a workaround, Quasar provides the {!simd_noupcast} code attribute. The following box filtering example illustrates the use of this code attribute:
function [] = boxfilter8(im8 : mat[uint8], im_out : mat[uint8])
    K = 8                                % filter length (example value)
    {!parallel for}
    for m=0..size(im8,0)-1
        for n=0..16..size(im8,1)-1
            {!simd_noupcast enable=im8}
            r = vec[uint8](16)           % 16-element accumulator kept in uint8 precision
            for x=0..K-1
                r += im8[m,n+x+(0..15)]
            endfor
            im_out[m,n+(0..15)] = int(r/(2*K))
        endfor
    endfor
endfunction
This technique ensures that the accumulation in r is performed in uint8 precision. Be aware that there are no integer overflow checks in this code.
12.3.1 Example: 8-bit image filtering
The following example illustrates how to make use of CUDA SIMD integer operations. The calculations are performed in 32-bit integer format, while 8-bit unsigned integers are used for storage.
import "inttypes.q"
im8 = uint8(imread("image.jpg"))
im_out = mat[uint8’sat](size(im8))
{!parallel for}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
{!kernel_transform enable="simdprocessing"; target="cuda"; numel=4}
r = 0
for x=0..K-1
r += im8[m,n+x]
endfor
im_out[m,n] = int(r/(2*K)) % integer division requires int() function
endfor
endfor
Compared to the previous example, in which the vectorization was written out explicitly using length-16 vectors, the vectorization is here performed automatically.
12.3.2 Example: 16-bit half float image filtering
import "floattypes.q"
im_half = scalar_to_half(imread("image.jpg"))
for m=0..size(im_half,0)-1
for n=0..size(im_half,1)-1
{!kernel_transform enable="simdprocessing"; target="cuda"; numel=2}
r = 0.0
for x=0..K-1
r += im_half[m,n+2*x]
endfor
im_out2[m,n] = r/(2*K+1)
endfor
endfor
12.4 ARM Neon accelerated operations
ARM Neon SIMD operations are currently not supported, but may be added in the future. However, by using the vector types, the back-end compiler may still generate code relying on SIMD instructions.
12.5 Automatic alignment
To benefit maximally from SIMD operations, it is essential that memory load operations are aligned, meaning that the memory address modulo a given constant is zero; this constant typically equals the SIMD register width (e.g., 16 bytes for SSE, 32 bytes for AVX).
The Quasar runtime ensures that all vectors and matrices have aligned base addresses. However, this is not sufficient to ensure that all memory loads are aligned; for example, a load at an offset that is not a multiple of the SIMD vector length (such as im[m,n+x] with arbitrary x in the examples above) cannot always be aligned. In this case, an unaligned read is generated, which may reduce the performance.
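A small sketch of the difference, assuming a hypothetical image im whose rows start at 32-byte aligned addresses (the runtime only guarantees this for the base address):
a = im[m,n+(0..7)]      % aligned AVX load when n is a multiple of 8
b = im[m,n+1+(0..7)]    % shifted by one element: the compiler must emit an unaligned read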
12.6 Automatic SIMD code generation
There are already several situations in which the Quasar compiler automatically generates SIMD code:
- when fixed-length vectors are used, where the length is suitably chosen (see the supported vector lengths for CPU and CUDA above);
- during expression optimization (see Section 8.1);
- in code using shared memory designators (see Section 9.2).
In the future, the compiler may rely more and more on SIMD code generation whenever appropriate. For now, SIMD processing can be activated by enabling the simdprocessing transform.