The Quasar Computation System: Quick Reference Manual
Bart Goossens
Table of Contents
Chapter 1: Introduction
Section 1.1: Computation Engines
Section 1.2: How to use Quasar
Subsection: Architecture: 32-bit/64-bit CPU or GPU
Subsection: Supported libraries
Subsection: Distributing Quasar programs
Section 1.3: Quasar Programming Language
Section 1.4: Integration with foreign programming languages
Chapter 2: Getting started
Section 2.1: Quasar high-level programming concepts
Section 2.2: A brief introduction to the type system
Subsection 2.2.1: Floating-point representation
Subsection 2.2.2: Mixed-precision floating-point computations
Subsection 2.2.3: Integer types
Subsubsection: Important note
Subsection 2.2.4: Fixed-size data types
Subsection 2.2.5: Higher dimensional matrices
Subsection 2.2.6: User-defined types, type definitions and pointers
Section 2.3: Automatic parallelization
Section 2.4: Writing parallel code using kernel functions
Subsection 2.4.1: Basic usage: kernel functions
Subsection 2.4.2: Device functions
Subsection 2.4.3: Memory usage inside kernel or device functions
Subsection 2.4.4: Advanced usage: shared memory and synchronization
Chapter 3: Type system
Section 3.1: Type definitions
Section 3.2: Variable construction
Section 3.3: Size constraints
Section 3.4: Dimension constraints
Section 3.5: Cell array types
Section 3.6: Type constructors and the typename function
Section 3.7: Type classes
Section 3.8: Class / user defined type (UDT) definitions
Section 3.9: Function types
Section 3.10: Enumerations
Section 3.11: Passed by reference / Passed by value
Section 3.12: Constants
Chapter 4: Programming concepts
Section 4.1: Polymorphic variables
Section 4.2: Closures
Section 4.3: Device functions, kernel functions, host functions
Section 4.4: Nested parallelism
Section 4.5: Function overloading
Subsection 4.5.1: Device function overloading
Subsection 4.5.2: Optional function parameters
Section 4.6: Functions versus lambda expressions
Subsection 4.6.1: Explicitly typed lambda expressions
Section 4.7: Kernel function output arguments
Section 4.8: Variadic functions
Subsection 4.8.1: Variadic device functions
Subsection 4.8.2: Variadic function types
Subsection 4.8.3: The spread operator
Subsection 4.8.4: Variadic output parameters
Section 4.9: Reductions
Subsection 4.9.1: Symbolic variables and reductions
Subsection 4.9.2: Reduction resolution
Subsection 4.9.3: Ensuring safe reductions
Subsection 4.9.4: Reduction where clauses
Subsection 4.9.5: Variadic reductions
Section 4.10: Partial evaluation
Section 4.11: Code attributes
Section 4.12: Macros
Section 4.13: Exception handling
Section 4.14: Documentation conventions
Chapter 5: The logic system
Section 5.1: Kernel function assertions
Section 5.2: Built-in compiler functions
Section 5.3: Assertion types recognized by the compiler
Subsection 5.3.1: Equalities
Subsection 5.3.2: Inequalities
Subsection 5.3.3: Type assertions
Section 5.4: User-defined properties
Section 5.5: Unassert
Section 5.6: The role of assertions
Chapter 6: Generic programming
Section 6.1: Parametrized functions
Section 6.2: Parametrized reductions
Section 6.3: Parametrized types
Section 6.4: Generic memory allocation functions and casting
Section 6.5: Explicit specialization through meta-functions
Section 6.6: Implicit specialization
Section 6.7: Generic size-parametrized arrays
Section 6.8: Generic dimension-parametrized arrays
Section 6.9: Example of generic programming: linear filtering
Chapter 7: Object-oriented programming
Section 7.1: Mutable/non-mutable classes
Section 7.2: Constructors
Section 7.3: Destructors
Subsection 7.3.1: Methods
Subsection 7.3.2: Properties
Subsection 7.3.3: Operators
Section 7.4: Dynamic classes
Section 7.5: Parametric types
Section 7.6: Inheritance
Section 7.7: Virtual functions, interfaces, abstract classes
Chapter 8: Special programming patterns
Section 8.1: Matrix/vector expressions
Section 8.2: Loop parallelization/serialization
Subsection 8.2.1: While-loop serialization
Subsection 8.2.2: Example: gamma correction
Section 8.3: Dynamic kernel memory
Subsection 8.3.1: Examples
Subsubsection: Kernel version
Subsubsection: Loop version
Subsection 8.3.2: Memory models
Subsection 8.3.3: Features
Subsection 8.3.4: Performance considerations
Section 8.4: Map and Reduce pattern
Section 8.5: Cumulative maps (prefix sum)
Section 8.6: Meta-functions
Subsubsection: Example: copying the type and assumptions from one variable to another
Chapter 9: GPU hardware features
Section 9.1: Constant memory and texture memory
Section 9.2: Shared memory designators
Subsection 9.2.1: How to use
Subsection 9.2.2: Virtual blocks and overriding the dependency analysis
Subsection 9.2.3: Examples
Subsubsection 9.2.3.1: Histogram
Subsubsection 9.2.3.2: Separable linear filtering
Subsubsection 9.2.3.3: Parallel reduction (sum of NxN matrices)
Section 9.3: Speeding up spatial data access using Hardware Texturing Units
Section 9.4: 16-bit (half-precision) floating-point textures
Section 9.5: Multi-component Hardware Textures
Section 9.6: Texture/surface writes
Section 9.7: Maximizing occupancy through shared memory assertions
Section 9.8: Cooperative groups and warp shuffling functions
Subsection 9.8.1: Fine synchronization granularity
Subsection 9.8.2: Optimizing block count for grid synchronization
Subsection 9.8.3: Memory fences
Section 9.9: Kernel launch bounds
Section 9.10: Memory management
Section 9.11: Querying GPU hardware features
Chapter 10: Parallel programming examples
Section 10.1: Gamma correction
Section 10.2: Fractals
Section 10.3: Image rotation, translation and scaling [basic]
Section 10.4: 2D Haar in-place wavelet transform using lifting
Section 10.5: Convolution
Section 10.6: Parallel reduction sum
Section 10.7: A more accurate parallel sum
Section 10.8: Parallel sort
Section 10.9: Matrix multiplication
Chapter 11: Multi-GPU programming
Section 11.1: A quick glance
Section 11.2: Setting up the device configuration
Section 11.3: Three levels of concurrency
Section 11.4: Manual vs. automatic multi-GPU scheduling
Section 11.5: Host synchronization
Section 11.6: Key principles for efficient multi-GPU processing
Section 11.7: Supported libraries
Section 11.8: Profiling techniques
Section 11.9: Automatic GPU scheduling
Section 11.10: Developing multi-GPU applications
Chapter 12: SIMD processing on CPU and GPU
Section 12.1: Storage versus computation types
Section 12.2: x86/x64 SIMD-accelerated operations
Subsection 12.2.1: Example: AVX image filtering on CPU
Section 12.3: CUDA SIMD-accelerated operations
Subsection 12.3.1: Example: 8-bit image filtering
Subsection 12.3.2: Example: 16-bit half-float image filtering
Section 12.4: ARM Neon-accelerated operations
Section 12.5: Automatic alignment
Section 12.6: Automatic SIMD code generation
Chapter 13: Best practices
Section 13.1: Use main functions
Section 13.2: Shared memory usage
Section 13.3: Loop parallelization
Section 13.4: Output arguments
Section 13.5: Writing numerically stable programs
Section 13.6: Writing deterministic kernels
Chapter 14: Built-in function quick reference
Chapter 15: Functional image processing in Quasar
Section 15.1: Example: translation and filtering
Chapter 16: The Quasar runtime system
Section 16.1: Program interpretation and execution
Section 16.2: Abstraction layer for computation devices
Section 16.3: Object management
Section 16.4: Memory management
Section 16.5: Load balancing and runtime scheduling
Section 16.6: Optimizing memory transfers with const and nocopy
Section 16.7: Controlling the runtime system programmatically
Chapter 17: The Quasar compiler/optimizer
Section 17.1: Function transforms
Subsection 17.1.1: Automatic For-Loop Parallelizer (ALP)
Subsection 17.1.2: Automatic Kernel Generator
Subsection 17.1.3: Automatic Function Instantiation
Subsection 17.1.4: High-Level Inference
Subsection 17.1.5: Function inlining
Subsection 17.1.6: Kernel fusion
Section 17.2: Kernel transforms
Subsection 17.2.1: Parallel Reduction Transform
Subsection 17.2.2: Local Windowing Transform
Subsection 17.2.3: Kernel Tiling Transform
Subsection 17.2.4: Kernel Boundary Checks
Subsection 17.2.5: Target-specific programming and manually invoking the runtime scheduler
Subsection 17.2.6: Compile-time specialization through the $target() meta-function
Section 17.3: Common compilation settings
Section 17.4: CUDA target architecture
Chapter 18: Development tools
Section 18.1: Redshift - integrated development environment
Section 18.2: Spectroscope - command-line debugger
Section 18.3: Redshift Profiler
Subsection 18.3.1: Security settings
Subsection 18.3.2: Peer-to-peer transfers
Subsection 18.3.3: GPU event view
Subsection 18.3.4: Timeline view
Subsection 18.3.5: Kernel line information
Subsection 18.3.6: Kernel metric reports