
18 Development tools

Quasar comes with two main development tools: Redshift (the main IDE) and Spectroscope (a command-line debugging tool).

18.1 Redshift - integrated development environment

Redshift is the main IDE for Quasar. It is built on top of GTK and runs on Windows, Linux and macOS. Redshift has the following main features:
figure Figures/fig_Redshift_screenshot.png
Figure 18.1 Screenshot of the Redshift IDE for Quasar.
In the IDE, the GPU devices on which the program needs to be executed can also be selected in the main toolbar. Additionally, the precision mode (e.g., 32-bit or 64-bit floating point) can be selected. The flame icon toggles concurrent kernel execution, a technique in which CUDA/OpenCL kernels are launched asynchronously, which in many circumstances speeds up the execution.

18.2 Spectroscope - command line debugger

Spectroscope is a command line debugger for Quasar. Its functionality is integrated in Redshift (Section 18.1↑), but the debugging tools remain useful from a terminal when a graphical environment is not available (e.g., over SSH on a remote server). Spectroscope provides an interactive environment in which Quasar statements can be entered and interpreted on the fly. In addition, different commands are available:
figure Figures/fig_Spectroscope_screenshot.png
Figure 18.2 Screenshot of Quasar Spectroscope.

18.3 Redshift Profiler

To analyze the performance of Quasar programs, a profiler has been integrated in Redshift. The profiler uses the NVIDIA CUDA Profiling Tools Interface (CUPTI) to obtain accurate time measurements. Using the Redshift Profiler, it is not necessary to use the tools nvprof, NVIDIA Visual Profiler and NVIDIA Nsight separately. The Redshift Profiler offers the following features:
To take advantage of CUPTI, it is necessary to enable “Use NVIDIA CUDA profiling tools” in the program settings (done by default). Without this option set, the Quasar Profiler reverts to CUDA events, which are less accurate and degrade performance during profiling.
In Windows, cupti.dll is bundled with the Quasar installation. In Linux, it may be necessary to adjust LD_LIBRARY_PATH to include libcupti.so, depending on the installed version of CUDA. This can be done by modifying the .bashrc file (for example, for CUDA 10.2):
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/extras/CUPTI/lib64/:$LD_LIBRARY_PATH
The profiling menu in Redshift has been updated with the new features, as can be seen in the following screenshot:
figure Figures/RedshiftProfiler-ProfilingMenu.png
The following features are available:
As a result of CUPTI, the profiler may now list kernel functions that are not visible to Quasar but that are launched by library calls (e.g., cuDNN, cuBLAS, cuFFT, …). An example is given below:
figure Figures/RedshiftProfiler-KernelSummary.png

18.3.1 Security settings

In CUDA 10.1 or newer, using CUPTI profiling requires an additional setting. In Windows, open the NVIDIA Control Panel, click the Desktop menu and enable the developer settings. Then, select “Manage GPU Performance Counters” and click on “Allow access to the GPU performance counters to all users” (see Figure 18.3↓). On Linux desktop systems, create a file in /etc/modprobe.d (for example, /etc/modprobe.d/nvidia-profiling.conf) with the following content:
options nvidia "NVreg_RestrictProfilingToAdminUsers=0"
where on some Ubuntu systems, nvidia may need to be replaced by nvidia-xxx, where xxx is the version of the display driver (use lsmod | grep nvidia to find the number). In addition, it may be necessary to rebuild the initial RAM file system (initramfs):
update-initramfs -u
We recommend creating a backup of the initrd image first (see the /boot directory).
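The steps above can be combined as follows. This is a sketch, to be run with root privileges; the configuration file name nvidia-profiling.conf is only an example, and the initrd image name depends on the distribution:

```shell
# Back up the current initrd image before regenerating it
# (the image name pattern below is an assumption; check /boot first).
sudo cp /boot/initrd.img-$(uname -r) /boot/initrd.img-$(uname -r).bak

# Allow non-admin users to read the GPU performance counters.
# The file name "nvidia-profiling.conf" is just an example.
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | \
    sudo tee /etc/modprobe.d/nvidia-profiling.conf

# Rebuild the initial RAM file system; reboot afterwards so the
# module option takes effect.
sudo update-initramfs -u
```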
For NVIDIA Jetson development boards, the modprobe approach is not available. The only way to get CUPTI profiling to work is to execute Quasar or Redshift with admin rights, for example using sudo.
For more information, see https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters#SolnAdminTag.
figure Figures/fig_ProfilingSecuritySettings.png
Figure 18.3 NVIDIA control panel: allow access to the GPU performance counters to all users

18.3.2 Peer-to-peer transfers

In multi-GPU configurations, the profiler now displays memory statistics for peer-to-peer memory copies (i.e., transfers between two GPUs). See the multi-GPU programming documentation for an explanation of the performance implications of these peer-to-peer transfers.
figure Figures/RedshiftProfiler-DataTransferOverview.png
When the profiler indicates memory transfer performance bottlenecks, it is possible to investigate every bottleneck individually, via the memory transfer summary.
figure Figures/RedshiftProfiler-MemoryTransferOverview.png
Per line of code, the memory transfers are listed, including the following information:
In systems in which multiple GPUs are not connected to the same PCIe switch, a peer-to-peer copy between GPUs generally passes through CPU memory. Correspondingly, these peer-to-peer copies cause two transfers: 1) from the source GPU to the CPU host and 2) from the CPU host to the target GPU. In the memory transfer view, peer-to-peer copies are listed as one operation, for clarity. In the timeline view, however, such peer-to-peer copies are displayed as dual operations (in green in the screenshot below):
figure Figures/RedshiftProfiler-Peer2PeerCopyTimeline.png
In the screenshot, it can be seen that the peer-to-peer copy blocks all operations on both GPUs, which degrades the runtime performance.
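To see why a routed peer-to-peer copy is costly, the following sketch models the two-hop transfer with assumed PCIe bandwidths (the bandwidth figures and function names are illustrative assumptions, not measured values):

```python
# Illustrative model of a peer-to-peer copy that is routed through host
# memory: the copy becomes two sequential PCIe transfers instead of one.
# All bandwidth figures below are assumptions for illustration only.

def routed_p2p_time(bytes_copied, bw_gpu_to_host, bw_host_to_gpu):
    """Time for a two-hop copy: source GPU -> host -> target GPU."""
    return bytes_copied / bw_gpu_to_host + bytes_copied / bw_host_to_gpu

def direct_p2p_time(bytes_copied, bw_p2p):
    """Time for a direct peer-to-peer copy over PCIe."""
    return bytes_copied / bw_p2p

GB = 1e9
size = 1 * GB
pcie_bw = 12 * GB   # assumed effective PCIe 3.0 x16 bandwidth (bytes/s)

routed = routed_p2p_time(size, pcie_bw, pcie_bw)
direct = direct_p2p_time(size, pcie_bw)
print(routed / direct)   # 2.0: the routed copy takes twice as long
```

With equal bandwidth on both hops, the routed copy is exactly twice as slow, which matches the dual operations shown in the timeline view.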

18.3.3 GPU event view

The GPU event view shows each operation performed on the GPU(s). This is useful to analyze whether, for example, memory copies and recomputations can be avoided.
figure Figures/RedshiftProfiler-GPUEventsView.png
The following types of operations are listed:
Notes:

18.3.4 Timeline view

The timeline view now accurately depicts the kernel start times and durations. In the following screenshot, it can be seen that both GPUs are (almost) fully utilized:
figure Figures/RedshiftProfiler-MultiGPUTimeLine.png
The mouse tooltips now also show a table containing the memory access information of the kernel function.
This can be used to track down memory transfers, check the access mode (ReadOnly, WriteOnly) etc. By double-clicking on the object references, the operations to an individual object can be visualized in the GPU events view.
For sync_event(global) (which triggers a run of the global scheduling algorithm), the object reference that triggered the scheduling operation can be inspected:
figure Figures/RedshiftProfiler-GlobalSchedTooltip.png
Note that a global scheduling operation can occur:

18.3.5 Kernel line information

When profiling a kernel with “collecting line information” enabled, execution information is displayed in the code editor:
figure Figures/RedshiftProfiler-KernelLineInformation.png
Shown is the total execution time of running each line of the specified kernel on the GPU, as well as:
The kernel line information now makes it possible to accurately identify bottlenecks within kernel functions, based on PTX-to-Quasar source code correlation. Note that, due to compiler optimizations, the mapping from PTX to Quasar is not one-to-one. Therefore, the line information may not always correspond to the exact operation that was executed. To improve the correlation, the CUDA optimizations in the Program Settings dialog box can be disabled.

18.3.6 Kernel metric reports

Occupancy report
The occupancy report lists several parameters of the kernel function (block dimensions, data dimensions, amount of shared memory) and displays the calculated occupancy. Occupancy is a metric for the degree of “utilization” of the multiprocessors of the GPU.
Notes:
figure Figures/RedshiftProfiler-KernelReport1.png
The report displays the kernel execution time and subsequent analysis per launch configuration. A launch configuration is a set of parameter values, such as the grid dimensions, the block dimensions and the amount of shared memory being used. Because the performance of the kernel depends on the launch configuration, it is necessary to separate the measured profiling information according to the launch configuration.
At the bottom, the (theoretical) occupancy as a function of the number of threads per block, the number of registers and the amount of shared memory, respectively, is displayed. This indicates how the occupancy can be improved:
However, the occupancy value mostly indicates whether the launch configuration is “efficient” for the particular GPU. Unless the block dimensions are manually specified (e.g., via parallel_do, {!cuda_launch_bounds max_threads_per_block=X}, or {!parallel_for blkdim=X}), the runtime system uses an internal optimization algorithm to select the launch configuration that maximizes the occupancy.
In practice, an occupancy value of 50% (or even in many cases 25%) is sufficient to obtain maximal performance for the selected launch configuration. To gain more insights about the performance of a particular kernel, additional analysis is required.
Max. theoretical threads: calculated as the number of assigned warps × warp size × max active blocks, this is the number of threads that is active on the GPU after the warm-up phase of the kernel. If the occupancy value is smaller than 100%, check whether the product of the data dimensions is smaller than the max. theoretical threads. If this is the case, GPU multiprocessors are idle because the data dimensions of the kernel are too small.
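This check can be sketched as follows (warp size 32; the warp and block counts used in the example are assumptions for illustration, not values from a real report):

```python
# Sketch of the "max. theoretical threads" check described above.
# The device limits (warp size 32, example warp/block counts) are
# assumptions for illustration; real values come from the CUPTI report.

WARP_SIZE = 32

def max_theoretical_threads(assigned_warps, max_active_blocks):
    """Assigned warps x warp size x max active blocks."""
    return assigned_warps * WARP_SIZE * max_active_blocks

def multiprocessors_idle(data_dims, assigned_warps, max_active_blocks):
    """True when the product of the data dimensions is smaller than the
    max. theoretical threads, i.e. the kernel is too small for the GPU."""
    total_threads = 1
    for d in data_dims:
        total_threads *= d
    return total_threads < max_theoretical_threads(assigned_warps,
                                                   max_active_blocks)

# A 128x128 kernel with 8 assigned warps and 16 active blocks:
print(max_theoretical_threads(8, 16))           # 4096 threads
print(multiprocessors_idle((128, 128), 8, 16))  # False: 16384 >= 4096
```

A 32x32 kernel under the same configuration would report True, signalling that multiprocessors sit idle because the data dimensions are too small.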
Overview compute vs. memory and function unit utilization
The compute vs. memory metrics are obtained by reading the hardware counters of the GPU. The following metrics are given:
In addition, the utilization of the individual function units on a scale from 0 to 10 is displayed (the higher, the better).
figure Figures/RedshiftProfiler-KernelReport2.png
figure Figures/RedshiftProfiler-KernelReport2b.png
Instruction execution count
The instruction execution count report shows the total number of instructions per type that were executed. Due to the presence of different function and computation units, maximal utilization can be achieved by balancing the operations. For example, when the number of floating point operations is much higher than the number of integer operations, it is useful to investigate whether some parts of the calculations can be done in integer precision.
Note: miscellaneous instructions are warp voting and shuffling operations.
figure Figures/RedshiftProfiler-KernelReport3.png
Floating point operations
The floating point operations can further be categorized into:
figure Figures/RedshiftProfiler-KernelReport4.png
Stall reasons
Issue stall reasons indicate why an active warp is not eligible to issue its next instruction.
figure Figures/RedshiftProfiler-KernelReport5.png
The following possibilities occur:
Memory throughput analysis
The memory throughput analysis indicates the load/store and total throughputs (in bytes/second) achieved for the different memory units (shared memory, unified L1 cache, device memory and system memory), as well as the number of L2 cache operations and the hit rates of the L1 cache.
figure Figures/RedshiftProfiler-KernelReport7.png
Also shown is the average number of transactions per memory request. When the number of transactions per request is high, it may be beneficial to group the transactions (e.g., using vector data types such as vec(4)).
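To see why vectorized accesses reduce the transaction count, the sketch below models a warp in which each thread owns 4 consecutive floats, loaded either as 4 scalar requests or as one vec(4)-style request (the 32-byte transaction granularity and the address layout are assumptions for illustration):

```python
# Count the 32-byte memory transactions needed to serve warp-wide
# requests. Transaction size and data layout are assumed values.

TRANSACTION_BYTES = 32
WARP_SIZE = 32

def transactions_for_request(addresses):
    """Number of distinct 32-byte segments touched by one warp request."""
    return len({addr // TRANSACTION_BYTES for addr in addresses})

# Each thread t owns 4 consecutive 4-byte floats starting at byte 16*t.
# Scalar loads: 4 separate requests (one per component), strided by 16 B.
scalar = sum(
    transactions_for_request([16 * t + 4 * c for t in range(WARP_SIZE)])
    for c in range(4))

# vec(4)-style load: one request in which each thread reads its 16
# contiguous bytes, so the warp touches one contiguous 512-byte span.
vectorized = transactions_for_request(
    [16 * t + 4 * c for t in range(WARP_SIZE) for c in range(4)])

print(scalar, vectorized)   # 64 16: 4x fewer transactions when grouped
```

The strided scalar requests each touch 16 segments while using only a quarter of each segment, whereas the grouped load touches the same 16 segments once, which is why a high transactions-per-request figure suggests switching to vector data types.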
The best kernel performance is generally achieved by a good balance between operations using the different memory units. For many kernel functions, this practically means: