
12 SIMD processing on CPU and GPU

In SIMD processing, multiple processing elements process data simultaneously, driven by a single instruction stream. SIMD is supported by the x86/x64 architectures as well as by ARM, and refers to special 'multimedia' instructions that were originally added to allow real-time video encoding/decoding and 3D games to run on the CPU.
CUDA supports SIMD processing via its warp-based execution model, also known as single instruction multiple threads (SIMT). In addition, CUDA supports a limited set of SIMD intrinsics on half precision floating point formats and 8-bit/16-bit integer types that can be used within a single thread, essentially allowing the vector length to be extended beyond the warp size (for example, a vector length of 128 for 8-bit integers).
In essence, when performing calculations using vectors of an appropriate length (see below), the operations are automatically mapped onto a SIMD implementation whenever possible. For example
im[m,n,0..3]
might (depending on the settings and the machine architecture) be loaded into an SSE register (four 32-bit floating point numbers) in x86/x64 code, while
im[m,n,0..7]
might lead to an AVX load (eight 32-bit floating point numbers).
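To picture what the back-end compiler does with such vector operations, the following plain C sketch (illustrative only, not Quasar-generated code) shows the kind of fixed-length elementwise loop that an auto-vectorizer maps onto SSE/AVX instructions:

```c
#include <stddef.h>

/* Elementwise addition over fixed-length float vectors.
   For length 4, a vectorizing compiler can emit a single SSE addps;
   for length 8, a single AVX vaddps. */
void add8(const float a[8], const float b[8], float out[8])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = a[i] + b[i];   /* independent lanes: ideal SIMD candidate */
}
```

Because the eight lane computations are independent, the loop maps directly onto a single wide register operation.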
Some type aliases are defined in inttypes.q and floattypes.q to simplify SIMD processing:
Type      Defined in     Meaning                                                    Alias for
--------  -------------  ---------------------------------------------------------  -------------------
i8vec4    inttypes.q     8-bit signed integer vector of length 4                    vec[int8](4)
u8vec4    inttypes.q     8-bit unsigned integer vector of length 4                  vec[uint8](4)
i16vec4   inttypes.q     16-bit signed integer vector of length 4                   vec[int16](4)
u16vec4   inttypes.q     16-bit unsigned integer vector of length 4                 vec[uint16](4)
hvec2     floattypes.q   16-bit (half precision) floating point vector of length 2  vec[scalar'half](2)
By using vector types such as the above, the back-end compiler (e.g., MSVC, GCC, ...) can choose the appropriate SIMD instructions. Because not all compilers perform equally well at this kind of code generation (as a side note, the Intel C/C++ compiler currently offers the best results in our experiments), several low-level operations have been manually vectorized in the header file quasar_simd.h.
The Quasar compiler supports automatic vectorization of kernel functions. This relieves the programmer from unrolling loops and writing manually vectorized expressions. A kernel can automatically be enabled for SIMD processing by the following code attribute:
{!kernel_transform enable="simdprocessing"; target="cpu"}
where target specifies cpu (for x86/x64 SIMD) or cuda (for CUDA SIMD). When the target is omitted, SIMD is applied to any target architecture (when possible). With Quasar SIMD processing, the SIMD acceleration depends much less on the back-end compiler that is being used.
Because SIMD processing is not guaranteed to improve performance (especially when the vector types or operations used are not natively supported), it is not enabled by default. The following subsections list the operations and vector types that are accelerated.

12.1 Storage versus computation types

It is important to distinguish between types used for storage and types used for calculation. Storage types represent the data in a more compact format, allowing the memory bandwidth to be reduced whenever the precision requirements permit it. For example, an image may be stored in uint8 format, whereas the processing is performed in int32 (or even scalar) format. This combines the storage benefits with the calculation benefits associated with the computation units of the device.
Going one step further, the computations may also be performed in a lower precision, allowing, e.g., four uint8 integers to be processed simultaneously in CUDA. This is useful to reduce the pressure on the computation units of the GPU multiprocessor. However, one should take into account that not all operations are natively supported in lower precision, and that lower precision types are more susceptible to integer overflow.
Therefore, by default, calculations are performed in a suitable type for which all operations are fully supported by the device (typically, 32-bit integer or 32-bit floating point operations). Conversion operations between storage types and computation types (e.g., from u8vec4 to ivec4 and back) can be implemented using SIMD instructions, depending on the device.
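The storage/computation split can be sketched in plain C (a hypothetical helper, not Quasar's implementation): values are widened from uint8 to int32 on load, accumulated at full precision, and narrowed again on store:

```c
#include <stdint.h>
#include <stddef.h>

/* Storage in uint8, computation in int32: widen on load,
   accumulate at full precision, narrow on store. */
void box4(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i++) {
        int32_t sum = 0;                  /* int32 computation type */
        for (size_t k = 0; k < 4; k++)
            sum += in[i + k];             /* uint8 -> int32: no overflow */
        out[i] = (uint8_t)(sum / 4);      /* mean of four uint8 values fits in uint8 */
    }
}
```

Had the accumulation been performed directly in uint8, the intermediate sum could overflow; the widened computation type avoids this at no storage cost.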
Storage modifiers:
Integer types have storage modifiers. These determine how integer overflows are handled when the results of a computation are written back to memory. When using SIMD vector types, the overflow detection can also be accelerated using SIMD instructions. The following table gives an overview of which integer overflow checks are currently accelerated.
Modifier    Meaning                                                               x86/x64 SIMD  CUDA SIMD
----------  --------------------------------------------------------------------  ------------  ---------
'unchecked  No overflow checking is performed                                     Yes           Yes
'checked    Overflow checking is performed; an overflow leads to a runtime error  No            No
'sat        Values are clipped (saturated) to the value range of the target type  Yes           No
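The semantics of the 'unchecked and 'sat modifiers when narrowing an int32 computation result back to uint8 storage can be sketched in C (illustrative helpers, not Quasar's implementation):

```c
#include <stdint.h>

/* 'unchecked: the int32 result is narrowed modulo 256 (wraparound) */
uint8_t store_unchecked(int32_t v)
{
    return (uint8_t)(v & 0xFF);
}

/* 'sat: the int32 result is clipped to the uint8 range [0, 255];
   x86 offers this directly via saturating pack instructions */
uint8_t store_sat(int32_t v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}
```

The 'checked modifier has no such branch-free formulation: it must signal a runtime error on overflow, which is why it is not SIMD accelerated in the table above.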

12.2 x86/x64 SIMD accelerated operations

Supported vector lengths: 4 and 8 (see below).
x86 and x64 processors support various vector processing extensions: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2. It is important that the selected CPU target architecture matches the machine, otherwise an exception will be generated. The target architecture can be configured in the Program Settings dialog box in Redshift (or alternatively in the Quasar.config.xml configuration file):
Target architecture specifies the architecture to tune for. Vector extensions sets the highest level of vector processing instructions that the processor supports.
The following operations are guaranteed to result in SIMD instructions, independent of the back-end compiler.
Floating point operations:
Operation                                                        Type     Requirements (32-bit float)  Requirements (64-bit float)
---------------------------------------------------------------  -------  ---------------------------  ---------------------------
Operators +, -, .*, ./                                           scalar4  SSE                          SSE2 / AVX
Operators +, -, .*, ./                                           scalar8  AVX                          AVX
Comparison operators ==, !=, >, <, >=, <=                        scalar4  SSE                          SSE2 / AVX
Functions min, max, floor, ceil, sqrt, rsqrt, round, frac, sign  scalar4  SSE2                         SSE2 / AVX
Functions sum, prod, dotprod                                     scalar4  SSE2                         SSE2 / AVX
Integer operations:
Operation                                      Type  Requirements
---------------------------------------------  ----  ------------
Operators +, -                                 int4  SSE2
Operators .*                                   int4  SSE4.1
Comparison operators ==, !=, >, <, >=, <=      int4  SSE2
Functions abs, min, max, sign, sum, prod, any  int4  SSE2
Functions and, or, xor                         int4  SSE2
Type conversions:
Operation                                             Requirements (32-bit float)  Requirements (64-bit float)
----------------------------------------------------  ---------------------------  ---------------------------
Conversion vec4 -> ivec4                              SSE2                         SSE2 / AVX
Conversion ivec4 -> vec4                              SSE2                         SSE2 / AVX
Conversion ivec4 -> i8vec4, i16vec4, u8vec4, u16vec4  SSE2                         -
Conversion i8vec4, i16vec4, u8vec4, u16vec4 -> ivec4  SSE2                         -

12.2.1 Example: AVX image filtering on CPU

The following code demonstrates how to use SIMD processing on the CPU.
{!parallel for}
for m=0..size(im,0)-1
    for n=0..size(im,1)-1
        {!kernel_transform enable="simdprocessing"; target="cpu"; numel=8}
        r = 0.0
        for x=0..7
            r += im[m,n+x]
        endfor
        im_out[m,n] = r/8
    endfor
endfor
The simdprocessing kernel transform causes the loop to be automatically parallelized and vectorized. The parameter numel specifies the desired vector length (8 for AVX, 4 for SSE). If numel is omitted, the vector length is determined by the available SIMD extensions of the CPU (e.g., AVX, SSE).
Whenever suitable, the vectorizer converts branches into branch-free expressions that can be vectorized. In addition, cooldown code is generated for the case in which the image dimensions (in the above example, size(im,1)) are not a multiple of the SIMD width.

12.3 CUDA SIMD accelerated operations

Supported vector lengths: 2 and 4
CUDA vector instructions always interact with 32-bit words. Depending on the precision (8-bit integer, 16-bit integer or 16-bit floating point), the vector length is either 2 or 4.
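The packing into 32-bit words can be illustrated with a SIMD-within-a-register sketch in portable C; CUDA's byte-wise SIMD instructions perform essentially this per-lane addition in hardware (the function name vadd4 here is purely illustrative):

```c
#include <stdint.h>

/* Four uint8 lanes packed into one 32-bit word, added per lane with
   no carry leaking between lanes (SIMD-within-a-register). */
uint32_t vadd4(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* add low 7 bits of each lane */
    return low ^ ((a ^ b) & 0x80808080u);                 /* fix up the top bit per lane */
}
```

Masking off the top bit of each lane before adding prevents a carry out of one byte from corrupting its neighbor, which is exactly what distinguishes a packed 4x8-bit addition from an ordinary 32-bit addition.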
Floating point operations:
The half precision floating point format (FP16) is useful to reduce the device memory bandwidth for algorithms that are not sensitive to the reduced precision. For half, only integers between -2048 and 2048 can be represented exactly. Values larger than 65520 or smaller than -65520 are rounded toward positive and negative infinity, respectively.
Next to the reduced bandwidth, starting with the Pascal architecture, the GPU also offers hardware support for computations in this type. To gain maximal performance benefits, it is best to use the 32-bit, length-2 SIMD half type (hvec2). Use of hvec2 typically results in two numbers being calculated in parallel, leading to a performance similar to one single precision floating point (FP32) operation. However, a Volta or Turing GPU is required to obtain performance benefits from calculating in half precision format. For Kepler, Maxwell and Pascal GPUs, hardware support for half precision operations is notably slow; on these GPUs it is best to use the half type only for storage purposes (i.e., calculations are performed in the single precision floating point type).
The following table shows the operations that have been accelerated using half types. For unsupported operations, the computation is performed in FP32 format, leading to extra conversions between FP32 and FP16.
Operation                                                                   Type
--------------------------------------------------------------------------  -----
Operators +, -, .*                                                          hvec2
Comparison operators ==, !=, >, <, >=, <=                                   hvec2
Functions ceil, cos, exp, floor, log, log2, log10, sin, rsqrt, sqrt, round  hvec2
Integer operations:
CUDA supports SIMD operations for 8-bit and 16-bit integer vectors that fit into a 32-bit word. The table below lists the operations that are accelerated. For unsupported operations, the computation is performed in 32-bit integer format, leading to extra conversions between the 32-bit and 8/16-bit integer formats.
Operation                                  Type                              Requirements
-----------------------------------------  --------------------------------  ----------------
Operators +, -                             i8vec4, i16vec2, u8vec4, u16vec2  Any CUDA version
Comparison operators ==, !=, >, <, >=, <=  i8vec4, i16vec2, u8vec4, u16vec2  Any CUDA version
Function abs                               i8vec4, i16vec2                   Any CUDA version
Functions min, max                         i8vec4, i16vec2, u8vec4, u16vec2  Any CUDA version
When reading lower precision values such as vec[uint8](4) or vec[int16](2) from an array, the Quasar compiler upcasts the value to the best suited machine precision type (usually vec[int]). This avoids unintended integer overflows in the computations, but may also prevent some operations from being SIMD accelerated.
As a workaround, Quasar provides the {!simd_noupcast} code attribute. The following box filtering example illustrates its use:
function [] = boxfilter8(im8 : mat[uint8], im_out : mat[uint8])
    K = 8  % filter length (example value)
    {!parallel for}
    for m=0..size(im8,0)-1
        for n=0..16..size(im8,1)-1
            {!simd_noupcast enable=im8}
            r = vec[uint8](16)
            for x=0..K-1
                r += im8[m,n+x+(0..15)]
            endfor
            im_out[m,n+(0..15)] = int(r/(2*K))
        endfor
    endfor
endfunction
This technique ensures that the accumulation in r is performed in uint8 precision. Be aware that there are no integer overflow checks in this code.

12.3.1 Example: 8-bit image filtering

The following example illustrates how to make use of CUDA SIMD integer operations. The calculations are performed in 32-bit integer format, while 8-bit unsigned integers are used for storage.
import "inttypes.q"
K = 4  % filter length (example value)
im8 = uint8(imread("image.jpg"))
im_out = mat[uint8'sat](size(im8))
{!parallel for}
for m=0..size(im8,0)-1
    for n=0..size(im8,1)-1
        {!kernel_transform enable="simdprocessing"; target="cuda"; numel=4}
        r = 0
        for x=0..K-1
            r += im8[m,n+x]
        endfor
        im_out[m,n] = int(r/(2*K)) % integer division requires the int() function
    endfor
endfor
Compared to the previous example, the vectorization is performed automatically.

12.3.2 Example: 16-bit half float image filtering

import "floattypes.q"
K = 4  % filter length (example value)
im_half = scalar_to_half(imread("image.jpg"))
im_out2 = zeros(size(im_half))  % output matrix (FP32 storage)
{!parallel for}
for m=0..size(im_half,0)-1
    for n=0..size(im_half,1)-1
        {!kernel_transform enable="simdprocessing"; target="cuda"; numel=2}
        r = 0.0
        for x=0..K-1
            r += im_half[m,n+2*x]
        endfor
        im_out2[m,n] = r/(2*K+1)
    endfor
endfor

12.4 ARM Neon accelerated operations

ARM Neon SIMD operations are currently not supported, but may be added in the future. However, by using the vector types, the back-end compiler may still generate code relying on Neon SIMD instructions.

12.5 Automatic alignment

To benefit maximally from SIMD operations, it is essential that the memory load operations are aligned (meaning that the memory address modulo a given constant is zero; this constant typically equals the SIMD register width, e.g., 16 bytes for SSE).
The Quasar runtime ensures that all vectors and matrices have aligned base addresses. However, this is not sufficient to ensure that all memory loads are aligned; for example, the memory load
A[n*4+(0..3)+1]
can never be aligned. In this case, an unaligned read is generated, which may reduce the performance.
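The alignment criterion can be made concrete with a small C sketch, assuming a float array whose base address is 16-byte aligned (as the Quasar runtime guarantees):

```c
#include <stddef.h>

/* Given a float array whose base address is 16-byte aligned, a 4-float
   SSE load starting at element offset idx is aligned iff the byte
   offset idx * sizeof(float) is a multiple of 16. */
int is_aligned_load(size_t idx)
{
    return (idx * sizeof(float)) % 16 == 0;
}
```

For the load A[n*4+(0..3)+1], the first element offset is 4*n+1, giving a byte offset of 16*n+4, which is never a multiple of 16; hence the unaligned read.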

12.6 Automatic SIMD code generation

There are already several situations in which the Quasar compiler automatically generates SIMD code:
  1. when fixed-length vectors are used, where the length is suitably chosen (see the supported vector lengths for CPU and CUDA above);
  2. during expression optimization (see Section 8.1);
  3. in code using shared memory designators (see Section 9.2).
In the future, the compiler may rely more and more on SIMD code generation where appropriate. For now, SIMD processing can be activated by enabling the simdprocessing transform.