12 SIMD processing on CPU and GPU
In SIMD, multiple processing elements process data simultaneously, driven by a single instruction stream. SIMD is supported by x86/x64 as well as ARM architectures and refers to special ‘multimedia’ instructions that were originally added to allow real-time video encoding/decoding operations as well as 3D games to run efficiently on the CPU.
CUDA supports SIMD processing via its warp-based execution model, also known as single instruction, multiple threads (SIMT). In addition, CUDA supports a limited set of SIMD intrinsics on half-precision floating point formats and 8-bit/16-bit integer types that can be used within a single thread, essentially allowing the vector length to be extended above the warp size (for example, vector length 128 for 8-bit integers, i.e., 32 threads times 4 elements per thread).
In essence, when performing calculations using vectors of an appropriate length (see below), the operations are automatically mapped onto a SIMD implementation whenever possible. For example,
might (depending on the settings and machine architecture) result in an SSE load (four 32-bit floating point numbers) in x86/x64 code, while
might lead to an AVX load (eight 32-bit floating point numbers).
Some type aliases are defined in
and
to simplify the SIMD processing:
| Type | Defined in | Meaning | Alias for |
|---|---|---|---|
| | | 8-bit signed integer vector of length 4 | |
| | | 8-bit unsigned integer vector of length 4 | |
| | | 16-bit signed integer vector of length 4 | |
| | | 16-bit unsigned integer vector of length 4 | |
| | | 16-bit (half-precision) floating point vector of length 2 | |
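As a brief illustration, a fixed-length vector can be created and operated on elementwise, after which the backend compiler can map the operation onto SIMD instructions. This is a sketch using the generic vec[...] notation from the examples later in this chapter; the alias names from the table above can be substituted where available:

```quasar
% Two 8-bit unsigned integer vectors of length 4 (a storage type)
x = vec[uint8](4)
y = vec[uint8](4)
x[0..3] = [100, 120, 140, 160]
y[0..3] = [1, 2, 3, 4]
% Elementwise addition: with a suitable vector length, this can be
% compiled to a single SIMD instruction.
z = x + y
```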

By using vector types such as the ones above, the backend compiler (e.g., MSVC, GCC, ...) can choose the appropriate SIMD instructions. Because not all compilers have the best code generation in this respect (as a side note, the Intel C/C++ compiler currently offers the best results in our experiments), several low-level operations have been manually vectorized in the header file
.
The Quasar compiler supports automatic vectorization of kernel functions. This relieves the programmer from unfolding loops and writing manually vectorized expressions. A kernel can automatically be enabled for SIMD processing by the following code attribute:
{!kernel_transform enable="simdprocessing"; target="cpu"}
where target specifies "cpu" (in case of x86/x64 SIMD) or "cuda" (in case of CUDA SIMD). When the target is omitted, SIMD is applied to any target architecture (when possible). With Quasar SIMD processing, the SIMD acceleration depends much less on the backend compiler being used.
Because SIMD processing is not guaranteed to improve performance (especially when the vector types or operations used are not natively supported), it is not enabled by default. Therefore, the following subsections list the operations and vector types that are accelerated.
12.1 Storage versus computation types
It is important to distinguish between types used for storage and types used for calculation. Storage types represent the data in a more compressed format, allowing the memory bandwidth to be reduced whenever the precision requirements permit. For example, an image may be stored in
format, whereas the processing is performed in
(or even
) format. This combines the storage benefits with the calculation benefits associated with the computation units of the device.
Going one step further, the computations may also be performed in a lower precision, allowing, e.g., in CUDA 4
integers to be processed simultaneously. This is useful to reduce the pressure on the computation units of the GPU multiprocessor. However, one should take into account that:
- intermediate calculations may suffer from integer overflows or a lack of precision
- not all operations are supported by the hardware (for example, mathematical functions on integer data), and using these operations will reduce performance (because the compiler needs to convert to a supported integer or floating-point format, generate code to call the function, and convert back to the initial integer format)
Therefore, by default, calculations are performed in a suitable type for which all operations are fully supported by the device (typically, 32-bit integer or 32-bit floating point operations). Conversion operations between storage types and computation types (e.g., from
to
and back) can be implemented using SIMD instructions depending on the device.
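As an illustrative sketch of this separation (assuming the uint8 storage type and the imread function used later in this chapter; the exact conversion behavior is device dependent):

```quasar
im8 = uint8(imread("image.jpg"))     % storage type: 8-bit unsigned integer
{!parallel for}
for m=0..size(im8,0)-1
    for n=0..size(im8,1)-1
        p = int(im8[m,n])            % convert to the 32-bit computation type
        q = min(2*p, 255)            % computation with full 32-bit precision
        im8[m,n] = uint8(q)          % convert back to the storage type
    endfor
endfor
```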
Storage modifiers:
Integer types have storage modifiers, which determine how integer overflows are handled when the results of a computation are written back to memory. When using SIMD vector types, the overflow detection can also be accelerated using SIMD instructions. The following table gives an overview of the integer overflow checks that are currently accelerated.
| Modifier | Meaning | x86/x64 SIMD | CUDA SIMD |
|---|---|---|---|
| | No overflow checking is performed | Yes | Yes |
| | Overflow checking is performed, an overflow leads to a runtime error | No | No |
| | Values are clipped (saturated) to the value range of the target type | Yes | No |
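For example, with the saturation modifier (written uint8'sat, as in the 8-bit image filtering example later in this chapter), out-of-range results are clipped on write-back rather than wrapping around. A sketch, assuming im8 is an existing uint8 matrix:

```quasar
im_out = mat[uint8'sat](size(im8))
im_out[0,0] = 300      % clipped to 255, the maximum of uint8
im_out[0,1] = -20      % clipped to 0, the minimum of uint8
```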

12.2 x86/x64 SIMD accelerated operations
Supported vector lengths: 4 and 8 (see below).
x86 and x64 processors support various vector processing extensions: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX and AVX2. It is important that the selected CPU target architecture matches the machine; otherwise an exception will be generated. The target architecture can be configured in the Program Settings dialog box in Redshift (or alternatively in the
configuration file):
Target architecture specifies the architecture to tune for. Vector extensions sets the highest level of vector processing instructions that the processor supports.
The following operations are guaranteed to result in SIMD instructions, independent of the backend compiler.
Floating point operations:
| Operation | Type | Requirements (32-bit float) | Requirements (64-bit float) |
|---|---|---|---|
| Operators | | SSE | SSE2 / AVX |
| Operators | | AVX | AVX |
| Comparison operators | | SSE | SSE2 / AVX |
| Functions | | SSE2 | SSE2 / AVX |
| Functions | | SSE2 | SSE2 / AVX |
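For instance, elementwise arithmetic on length-4 single-precision vectors falls in the first category and requires only SSE. A sketch (assuming literal vectors use the 32-bit float scalar type):

```quasar
a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.5, 0.5, 0.5]
c = a + b        % elementwise addition of two length-4 vectors
d = a .* b       % elementwise multiplication
```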


Integer operations:
| Operation | Type | Requirements |
|---|---|---|
| | | SSE2 |
| | | SSE4.1 |
| | | SSE2 |
| | | SSE2 |
| | | SSE2 |
| | | |

Type conversions:
| Operation | Type | Requirements (32-bit float) | Requirements (64-bit float) |
|---|---|---|---|
| Conversion | | SSE2 | SSE2 / AVX |
| Conversion | | SSE2 | SSE2 / AVX |
| Conversion | | SSE2 | |
| Conversion | | SSE2 | |


12.2.1 Example: AVX image filtering on CPU
The following code demonstrates how to use SIMD processing on the CPU.
{!parallel for}
for m=0..size(im,0)-1
    for n=0..size(im,1)-1
        {!kernel_transform enable="simdprocessing"; target="cpu"; numel=8}
        r = 0.0
        for x=0..7
            r += im[m,n+x]
        endfor
        im_out[m,n] = r/8   % average over the 8-pixel window
    endfor
endfor
The simdprocessing kernel transform causes the loop to be automatically parallelized and vectorized. The parameter numel specifies the desired vector length (8 for AVX, 4 for SSE). If numel is omitted, the vector length is determined by the available SIMD extensions of the CPU (e.g., AVX, SSE).
Whenever suited, the vectorizer will convert branches into branch-free expressions that can be vectorized. In addition, cool-down code is generated for the case that the image dimensions (in the above example size(im,1)) are not a multiple of the SIMD width.
12.3 CUDA SIMD accelerated operations
Supported vector lengths: 2 and 4
CUDA vector instructions always operate on 32-bit words. Depending on the precision (8-bit integer, 16-bit integer or 16-bit floating point), the vector length is either 2 or 4.
Floating point operations:
The half-precision floating point format (FP16) is useful to reduce the device memory bandwidth for algorithms that are not sensitive to the reduced precision. For
, only integers between -2048 and 2048 can be represented exactly. Integers larger than 65520 or smaller than -65520 are rounded toward positive and negative infinity, respectively.
Next to the reduced bandwidth, starting with the Pascal architecture, the GPU also offers hardware support for computations using this type. To gain maximal performance benefits, it is best to use the 32-bit length-2 SIMD half type (
). Use of
typically results in two numbers being calculated in parallel, leading to a performance similar to one single-precision floating point (FP32) operation. However, a Volta or Turing GPU is required to obtain performance benefits from calculating in
precision format. For Kepler, Maxwell and Pascal GPUs, hardware support for operations in
precision is notably slow; therefore it is best to use the
type only for storage purposes (i.e., calculations are performed in the single-precision floating point type).
The following table shows the operations that have been accelerated using
types. For unsupported operations, the computation will be performed in FP32 format, leading to extra conversions between FP32 and FP16.
| Operation | Type |
|---|---|
| Operators | |
| Comparison operators | |
| Functions | |

Integer operations:
CUDA supports SIMD operations for 8-bit and 16-bit integer vectors that fit into a 32-bit word. Below is a table of the operations that are accelerated. For unsupported operations, the computation is performed in 32-bit integer format, leading to extra conversions between the 32-bit and 8/16-bit integer formats.
| Operation | Type | Requirements |
|---|---|---|
| Operators | | Any CUDA version |
| Comparison operators | | Any CUDA version |
| | | Any CUDA version |
| | | Any CUDA version |

When reading lower precision values such as uint8(4), int16(2), ... from an array, the Quasar compiler will upcast the value to the best suited machine precision value (usually, vec[int]). This avoids unintended integer overflows in the computations, but may also prevent some operations from being SIMD accelerated.
As a workaround, Quasar provides the {!simd_noupcast} code attribute. The following box filtering example illustrates the use of this code attribute:
function [] = boxfilter8(im8 : mat[uint8], im_out : mat[uint8], K : int)
    {!parallel for}
    for m=0..size(im8,0)-1
        for n=0..16..size(im8,1)-1
            {!simd_noupcast enable=im8}
            r = vec[uint8](16)
            for x=0..K-1
                r += im8[m,n+x+(0..15)]
            endfor
            im_out[m,n+(0..15)] = int(r/(2*K))
        endfor
    endfor
endfunction
This technique ensures that the accumulation in r is performed in uint8 precision. Be aware that there are no integer overflow checks in this code.
12.3.1 Example: 8bit image filtering
The following example illustrates how to make use of CUDA SIMD integer operations. The calculations are performed in 32bit integer format, while 8bit unsigned integers are used for storage.
import "inttypes.q"
im8 = uint8(imread("image.jpg"))
im_out = mat[uint8'sat](size(im8))
K = 4    % filter length
{!parallel for}
for m=0..size(im8,0)-1
    for n=0..size(im8,1)-1
        {!kernel_transform enable="simdprocessing"; target="cuda"; numel=4}
        r = 0
        for x=0..K-1
            r += im8[m,n+x]
        endfor
        im_out[m,n] = int(r/(2*K)) % integer division requires the int() function
    endfor
endfor
Compared to the previous example, the vectorization is performed automatically.
12.3.2 Example: 16bit half float image filtering
import "floattypes.q"
im_half = scalar_to_half(imread("image.jpg"))
K = 4    % filter length
{!parallel for}
for m=0..size(im_half,0)-1
    for n=0..size(im_half,1)-1
        {!kernel_transform enable="simdprocessing"; target="cuda"; numel=2}
        r = 0.0
        for x=0..K-1
            r += im_half[m,n+2*x]
        endfor
        im_out2[m,n] = r/(2*K+1)
    endfor
endfor
12.4 ARM Neon accelerated operations
ARM Neon SIMD operations are currently not supported but may be added in the future. However, by using the vector types, the backend compiler may still generate code relying on SIMD instructions.
12.5 Automatic alignment
To benefit maximally from SIMD operations, it is essential that the memory load operations are aligned, meaning that the memory address modulo a given constant is zero; this constant is typically the SIMD register width (e.g., 16 bytes for SSE).
The Quasar runtime ensures that all vectors and matrices have aligned memory addresses. However, this is not sufficient to ensure that all memory loads are aligned; for example, the memory load
can never be aligned. In this case, an unaligned read will be generated, which may reduce the performance.
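A sketch of the difference between aligned and unaligned accesses inside a vectorized loop (using the simdprocessing transform introduced earlier; im is an ordinary matrix):

```quasar
im = imread("image.jpg")
im_out = zeros(size(im))
{!parallel for}
for m=0..size(im,0)-1
    for n=0..size(im,1)-2
        {!kernel_transform enable="simdprocessing"; target="cpu"}
        % im[m,n] can use aligned loads when n is a multiple of the SIMD width;
        % im[m,n+1] is shifted by one element and yields an unaligned read.
        im_out[m,n] = 0.5*(im[m,n] + im[m,n+1])
    endfor
endfor
```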
12.6 Automatic SIMD code generation
There are already several situations in which the Quasar compiler automatically generates SIMD code:
- when fixed-length vectors are used, where the length is suitably chosen (see the supported vector lengths for CPU and CUDA above);
- during expression optimization (see Section 8.1);
- in code using shared memory designators (see Section 9.2).
In the future, the compiler may rely more and more on SIMD code generation where appropriate. For now, SIMD processing can be activated by enabling the simdprocessing transform.