Title: Tutorial 3: Basic optimizations

Basic optimizations

1. Reductions

In this tutorial, we will define a number of reductions to reduce the execution time of a deliberately over-complicated piece of code.

The code to optimize is given below.

function [] = main()

    % Load an image
    im = imread("lena_big.tif")
    tic()
    for cnt=0..99
        % the .^2/sqrt, -10/+10 and fft2/ifft2 pairs cancel each other out
        im = sqrt(ifft2(fft2(im.^2-10))+10)

        % low-pass filter: zero out the high-frequency coefficients
        im = fft2(im)
        im[250..511,250..511,:] = 0
        im = real(ifft2(im))
    end
    toc()/100   % average time per iteration

    imshow(im)
end

Note that many of these calculations are the inverse of other parts of the code. Consequently, skipping these calculations would produce the same output at a significantly lower computational cost. In Quasar we can define such code patterns and have them replaced by other code. This approach is called a code reduction: it lets us optimize at compile time without rewriting the code itself; we only have to define the reduction patterns:

reduction (x) -> real(ifft2(x)) = irealfft2(x)
reduction (x) -> sqrt(x.^2) = abs(x)
reduction (x,y) -> x-y+y = x
reduction (x) -> ifft2(fft2(x)) = x

These reductions are self-explanatory: whenever the compiler encounters the left-hand side of the assignment, the expression is replaced by the right-hand side. The parameter list (between the parentheses) declares the free variables of the pattern.
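
For illustration, applying these reductions to the body of the loop in main() conceptually rewrites the statements as follows (a sketch of the compile-time rewriting, not literal compiler output):

% First statement of the loop body:
im = sqrt(ifft2(fft2(im.^2-10))+10)
%   ifft2(fft2(x)) = x    =>  im = sqrt(im.^2-10+10)
%   x-y+y = x             =>  im = sqrt(im.^2)
%   sqrt(x.^2) = abs(x)   =>  im = abs(im)

% Last statement of the loop body:
im = real(ifft2(im))
%   real(ifft2(x)) = irealfft2(x)   =>  im = irealfft2(im)

The fft2 call and the zeroing of the high-frequency coefficients remain, since they actually change the image; only the redundant inverse operations are eliminated.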

Applying the above reductions results in code that runs 2.6 times as fast (on an NVIDIA GeForce 1070) compared to the original code!

2. Type inference

Have a look at the following code:

function im3=switchrgb2(im)
    im2=rand(size(im))       % random image with the same dimensions
    test=sqrt(im)+im2.^2
    im3=uninit(size(im))     % uninitialized output image
    for x=2..size(im,0)-3
        for y=2..size(im,1)-3
            % write the difference of two channel-2 values into channel 1
            im3[x,y,1]=im2[x,y,2]-test[x-1,y+1,2]
        end
    end
end

function [] = inference_test2()

    im=imread("lena_big.tif")
    tic()
    im_out=switchrgb2(im)
    toc("naive")
    hold("off"); imshow(im_out)
end

When compiling the above code, the following warning message is shown:

tutorial3.q - Line 6: Attempt to parallelize the 2-dim FOR loop.
Line 8: parallelization not possible due to the type of the variables/expressions:
* im2[x,y,2]:??
* test[(x-1),(y+1),2]:??

Need some help for determining the type of the variables marked with type '??'.

Tip: pay attention to the feedback you get from the compiler (displayed as tooltips in the Redshift code editor).

Because the compiler does not know the type of the parameter im of the function switchrgb2 (which allows the function to be used with other data types), it cannot automatically generate a kernel function for the for-loop. Type inference tells us that im should have the same type as the output of imread, i.e. cube. By declaring this type explicitly, the compiler can once again generate a kernel function for the for-loop:

function im3=switchrgb2(im:cube)   % <==========
    im2=rand(size(im))
    test=sqrt(im)+im2.^2
    im3=uninit(size(im))
    for x=2..size(im,0)-3                               
        for y=2..size(im,1)-3  
            im3[x,y,1]=im2[x,y,2]-test[x-1,y+1,2]
        end
    end
end

In this case, the loop parallelizes as expected and a kernel function is generated:

tutorial3.q - Line 6: Automatic parallelization of the 2-dim FOR loop.
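
For reference, the automatically generated kernel function is roughly equivalent to the hand-written kernel below (an illustrative sketch with hypothetical names; the code the compiler actually emits will differ):

% Hypothetical hand-written equivalent of the generated kernel function
function [] = __kernel__ switchrgb2_kernel(im2 : cube, test : cube, im3 : cube, pos : ivec2)
    x = pos[0]+2    % pos ranges over the interior pixels visited by the loop
    y = pos[1]+2
    im3[x,y,1] = im2[x,y,2] - test[x-1,y+1,2]
end

% Launched over the same iteration domain as the original double loop
parallel_do([size(im,0)-4,size(im,1)-4], im2, test, im3, switchrgb2_kernel)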

Another way of allowing the compiler to perform type inference for the loop is to indicate explicitly that the for-loop should be parallelized:

function im3=switchrgb2(im)
    im2=rand(size(im))
    test=sqrt(im)+im2.^2
    im3=uninit(size(im))
    {!parallel for}   % <==========
    for x=2..size(im,0)-3                               
        for y=2..size(im,1)-3  
            im3[x,y,1]=im2[x,y,2]-test[x-1,y+1,2]
        end
    end
end

When the loop is marked for parallelization, the compiler specializes the function switchrgb2 using type information from the call site, in a way similar to the example above: the im:cube data type is effectively added automatically. This function specialization is similar to template function specialization in, e.g., C++, although specifying type parameters is optional. In Quasar, all functions for which the types are not known at compile time are considered generic functions, suitable for specialization.
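
As an illustration of this mechanism, consider the following generic function (a hypothetical sketch, not part of the tutorial code); the compiler generates a separate specialization for every distinct argument type it is called with:

% 'scale' is generic: the type of x is not declared
function y = scale(x)
    y = 2*x
end

function [] = specialization_demo()
    a = scale(imread("lena_big.tif"))   % specialized with x:cube
    b = scale(0.5)                      % specialized with x:scalar
end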