Abstract
In 2008, task-based parallelism was added to OpenMP as the major feature of version 3.0. Tasks provide an easy way to express dynamic parallelism in OpenMP applications. However, achieving good performance with OpenMP task-parallel programs is challenging. OpenMP runtime systems are free to schedule, interrupt and resume tasks in many different ways, which makes it hard for the programmer to predict the program behavior. Hence, it is important that performance tools support the programmer in understanding the performance characteristics of an application. Different performance tools follow different approaches to collect this information and to present it to the programmer. Important differences are the amount of information that is gathered and stored and the amount of overhead that is introduced. We identify typical usage patterns of OpenMP tasks in application codes and then compare the usability of several performance tools for task-parallel applications. We concentrate our investigation on two topics: the amount and usefulness of the measured data and the overhead introduced by the performance tool.
3.1 Introduction
In recent years, task-based parallel programming paradigms have become an alternative to classical thread-based techniques for shared-memory parallel programming. Examples of task-based paradigms are Cilk, Intel Threading Building Blocks and OmpSs. In OpenMP, the de-facto standard for thread-parallel applications in HPC, tasking was added in version 3.0 of the specification as an alternative way to express parallelism. A task is a code region which can be executed independently of other tasks. Data-sharing attributes specify which data is shared, private or firstprivate for the task. The OpenMP runtime is free to execute a task immediately or to defer its execution, and a deferred task may be scheduled later on the current or a different thread (a minimal sketch of the construct is given at the end of this section). This freedom in task scheduling makes it harder to understand the performance behavior of an application, in particular because different OpenMP runtimes deliver highly different performance for application codes (as was shown in [13]).

To better understand the performance of a program, a variety of performance analysis tools exist, all using different techniques to analyze and present the behavior of an application. Shortly after tasks were introduced in OpenMP in 2008, several ideas were presented to support OpenMP tasks in performance analysis tools. Fürlinger and Skinner showed support for task profiling in the OpenMP profiling tool ompP [2] and Lin and Mazurov introduced prototypical support for tasks in the Sun Studio Performance Analyzer [8]. However, not all of these ideas have been adopted in the released product versions of these tools. Some tools still do not support tasks at all, and even the tools that do support tasks differ considerably in the data they measure and in how they display it. This work analyzes the ability of performance tools to handle task-parallel programs and investigates the usefulness of the presented data for performance optimization. To this end, we examine task-parallel programs to identify the most commonly used patterns of task creation and the performance problems that come along with these patterns. Afterwards, we analyze representative example applications with performance analysis tools to find out whether these tools expose the mentioned performance problems.

The rest of this work is structured as follows: first we briefly introduce the investigated tools in Sect. 3.2. Then, in Sect. 3.3, we inspect common patterns of task usage in parallel programs to identify representative example codes, which we further investigate with all selected performance tools. Here we highlight strengths and weaknesses of all tools for the different applications before we draw our conclusions in Sect. 3.4.
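Before turning to the tools, the following minimal sketch (our own illustration, not taken from any of the codes examined later) recaps the task construct itself: tasks with explicit data-sharing clauses are created inside a parallel region, and the runtime may execute each task immediately or defer it to any thread of the team.

    #include <stdio.h>

    int main(void) {
        int shared_result = 0;                /* visible to all tasks (shared) */

        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < 4; i++) {
                /* i is captured by value (firstprivate), shared_result is shared;
                   the runtime decides when and on which thread the task runs */
                #pragma omp task firstprivate(i) shared(shared_result)
                {
                    #pragma omp atomic
                    shared_result += i;
                }
            }
            #pragma omp taskwait              /* wait for all child tasks created above */
            printf("result = %d\n", shared_result);
        }
        return 0;
    }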
3.2 Investigated Performance Tools
In the following sections, production versions of performance analysis tools installed on our cluster are investigated with respect to their applicability to task-parallel OpenMP programs. The tools are:
3.2.1 The Intel VTune Amplifier XE
The Intel VTune Amplifier XE [5] is a sampling-based performance measurement tool. It supports C/C++, Fortran, Java, C# and assembly, as well as a variety of parallel programming paradigms for single-node performance, such as OpenMP, Pthreads or Intel Threading Building Blocks. MPI or hybrid programs can be analyzed, but no MPI-specific data such as messages sent is collected; instead, the tool delivers a single-node profile for every MPI process. The Amplifier XE also supports measurements of hardware performance counters on Intel CPUs.
3.2.2 The Oracle Solaris Studio Performance Analyzer
The Oracle Solaris Studio Performance Analyzer [11] is also a sampling-based tool to investigate the performance of serial, OpenMP- or Pthreads-parallel, or hybrid MPI/OpenMP applications. The tool can be used on Intel, AMD and SPARC CPUs and also supports hardware performance counters. The supported programming languages are C/C++, Fortran and Java.
3.2.3 The Score-P Measurement Infrastructure
In contrast to the previously mentioned analysis tools, the Score-P measurement infrastructure [7] uses an event-based technique instead of sampling to gather performance data of an application. At certain events, like the entry and exit of a function, data is measured and either directly stored in an event trace or accumulated and finally stored in a profile, depending on the usage mode of Score-P. Score-P also allows online analysis of the data in combination with the Periscope tool [4]. Events can be instrumented in different ways; most compilers, for example, can automatically instrument function entry and exit events. For OpenMP pragmas, the source-to-source instrumentation tool Opari [9] is used. Profile data can be visualized with the Cube GUI, and trace data can be visualized with Vampir [10], as we do in the following sections. Alternatively, the gathered data can be analyzed with the TAU tool [12], or the Scalasca tool [3] can be used to automatically detect certain performance problems.
3.3 Investigating Task-Parallel Programs
To identify typical task creation patterns in applications, the Barcelona OpenMP Tasks Suite [1] (BOTS) and a set of task-parallel applications used at RWTH Aachen University are examined.
Table 3.1 shows whether the codes use recursive functions to generate tasks or create tasks iteratively in a loop, and whether tasks are created nested inside other tasks or not. Creating tasks recursively is the most common pattern, used in 9 of 14 codes, followed by the iterative creation of tasks. Next we look at three example applications and analyze them with the performance tools: a Sudoku solver, a Conjugate Gradient Method implementation, and KegelSpan, a code developed at the Laboratory for Machine Tools and Production Engineering at RWTH Aachen University.
3.3.1 Sudoku
First, a very simple application, namely a task-parallel Sudoku solver, is inspected. For a given Sudoku board the solver determines all possible solutions of the puzzle in a brute-force manner. Figure 3.1 shows the initial configuration of the board used in this experiment on the left-hand side and the algorithm in pseudo-code on the right-hand side. For every empty field the solver tries to insert every possible number. Only if the number is not yet used in the same row, column or block does it create a task which inserts the number and then checks the rest of the board; otherwise no task is created. In both cases the algorithm continues with the next number for the current field or with the next empty field. Every task that finds a valid number for the last empty field stores the current board as a valid solution of the puzzle. After all tasks have finished, all valid solutions have been found. Note that even though the algorithm is fairly simple, it is hard to fully understand its runtime behavior. For example, the number of tasks created depends strongly on the initial board, and even for a known board, like the one in Fig. 3.1, it is difficult to determine this number a priori. Since tasks are very useful for such dynamic algorithms, this is a representative problem for task-parallel programs.
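To make the creation pattern concrete, the following condensed sketch shows how such a recursive, task-parallel search can look in C++ with OpenMP; the board representation and the helper routines is_valid and store_solution are simplified placeholders of our own, not the original solver code.

    // Condensed sketch of the recursive task creation described above.
    struct Board { int cells[81]; };                     // 0 = empty field

    bool is_valid(const Board &b, int field, int num);   // row/column/block check (not shown)
    void store_solution(const Board &b);                 // records a solution (assumed thread-safe)

    void solve_parallel(Board board, int field) {        // board passed by value (copied per task)
        if (field == 81) {                               // past the last field: board is complete
            store_solution(board);
            return;
        }
        if (board.cells[field] != 0) {                   // pre-filled field: just move on
            solve_parallel(board, field + 1);
            return;
        }
        for (int num = 1; num <= 9; num++) {
            if (is_valid(board, field, num)) {
                // one child task per valid candidate; the copied board becomes firstprivate
                #pragma omp task firstprivate(board, field, num)
                {
                    board.cells[field] = num;
                    solve_parallel(board, field + 1);
                }
            }
        }
        #pragma omp taskwait                             // wait for the subtree below this field
    }

    // Typical invocation: one thread starts the root call, the whole team executes the tasks.
    // #pragma omp parallel
    // #pragma omp single
    //     solve_parallel(initial_board, 0);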
Figure 3.2 shows the runtime (dark-blue bars) and speedup (red curve) of the Sudoku solver for the board shown in Fig. 3.1 on a two-socket server equipped with Intel Sandy Bridge processors clocked at 2 GHz. The application achieves a best speedup of less than 5 with 32 threads.
Obviously this is far from optimal. Next, we have a look at the performance information gathered by the investigated performance tools. With all tools we focus on execution time for our analysis. All tools allow measuring additional metrics with hardware performance counters, but we are more interested in how these metrics are assigned to OpenMP tasks; for this purpose the time metric is sufficient and by far the most important one.
3.3.1.1 VTune
We compiled the executable with the Intel Compiler (version 13.1.1) and did several performance measurements with the Intel VTune Amplifier XE 2013 update 10. The most relevant information for the analysis of the tasking performance is shown in Fig. 3.3. First, the overview (Fig. 3.3a) presents the OpenMP task in line 106 as the most time-consuming region in the code, the so-called hotspot. Second, the callstack view (Fig. 3.3b) gives further details on this hotspot and shows that the task regions are nested recursively one inside the other. It also presents the time spent on different levels of the call stack, and we can observe that this time shrinks noticeably for lower levels. This shows that computation is done inside the tasks on all levels and not only in the leaf tasks at the lowest level. Third, the finest granularity is displayed at the function or source-code level (Fig. 3.3c, d), where the average runtime spent on every source line can be observed, as well as a metric called "Overhead and Spin Time" which indicates time spent in the OpenMP runtime waiting or scheduling threads or tasks. For the task region, 1.2 s out of 9 s are overhead, which indicates a potential performance problem.
3.3.1.2 Oracle Solaris Studio Performance Analyzer
The measurements were done with the Oracle Solaris Studio Performance Analyzer version 7.9 for Linux. The analyzed executable was compiled with the Oracle Studio 12.3 compiler for better interoperability between the OpenMP runtime and the performance tool. The tool delivers results similar to those of Intel VTune. The solve_parallel function, the recursive function spawning all the tasks, is identified as the hotspot (Fig. 3.4a). A callstack is also given which indicates the recursive invocation of this function, but in contrast to VTune the Analyzer does not show tasks in the callstack (Fig. 3.4b), so the information that tasks are created on all these levels is not explicitly given, and the time spent at each level is not presented. In addition to the call stack, the Analyzer presents an "OpenMP task" view (Fig. 3.4c) where the overhead, wait and work time for each task construct in the code is shown. Here we can see that the overall overhead for the task region is roughly 80 s, whereas the work time is only 12 s. This ratio is worse than the previous result obtained with VTune; the reason is that a different OpenMP runtime (provided with the Oracle compiler) was used, which obviously incurs more overhead. Whereas the overhead of 15 % for the Intel runtime might be seen as acceptable, the overhead of 666 % here is clearly a performance problem. The tool also gives information at the source-line level (Fig. 3.4d), but because the function is very small this view did not provide any additional value.
3.3.1.3 Score-P/Profiling
The Score-P measurement infrastructure (version 1.2) was used with the executable compiled by the Intel Compiler. The profiling mode was used and only OpenMP constructs were instrumented, to avoid overhead from function instrumentation. This tool also identifies the task region in line 106 as the hotspot of the application. Figure 3.5 shows information related to task instances which was not presented by the other tools. The direct instrumentation makes it possible, for example, to determine the number of visits for the task region, i.e., the number of created tasks, which is about ten million in this case. Given also the overall time of 46 s spent in task execution, we can calculate an average task-instance duration of 4.6 μs. This detailed information cannot be measured by the sampling-based tools, since sampling might miss task instances depending on the sampling rate. However, Score-P is not able to show information at a finer granularity than an instrumented region, so we cannot see source-line information. Since Score-P also cannot obtain information from the OpenMP runtime, we can only see how much time was spent in a barrier or taskwait construct, but we cannot determine which share is overhead and which is waiting time.
3.3.1.4 Score-P/Tracing
In tracing mode Score-P measures and stores the most detailed information of all tools. Besides all the information of the profiling mode, additional information on individual region instances is stored. Figure 3.6 shows the Vampir timeline view, where we can see the time spent in user code (red) and in taskwait regions (turquoise) on all threads. Looking into more detail, the callstack view presents individual task instances on different nesting levels. In Fig. 3.6b it can be observed that a task on the second nesting level has a duration of 0.16 s, whereas a deeper nested task has a duration of only 2.2 μs.
All tools show some overhead related to task execution in the task, taskwait or barrier constructs. The event-based tools furthermore showed that the execution time of the tasks is very low, and in the tracing tool we could also observe that tasks on the lower levels of the callstack are much smaller than those higher up. Therefore, we implemented a cut-off strategy for the Sudoku solver that stops creating tasks after the first two rows of the Sudoku board have been processed. The resulting performance is also shown in Fig. 3.2. The performance problem is solved and the optimized version performs much better, reaching a speedup of about 16 with 32 threads.
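Under the same assumptions as the solver sketch above, the cut-off could take roughly the following form: the task-creating loop checks the current field index and, beyond the first two rows, simply recurses inside the current task instead of spawning very short-lived tasks.

    // Sketch of the depth-based cut-off added to the solver sketch above
    // (constant name is hypothetical).
    const int CUTOFF_FIELD = 2 * 9;                      // fields 0..17 cover the first two rows

    for (int num = 1; num <= 9; num++) {
        if (is_valid(board, field, num)) {
            if (field < CUTOFF_FIELD) {                  // shallow level: still worth a task
                #pragma omp task firstprivate(board, field, num)
                {
                    board.cells[field] = num;
                    solve_parallel(board, field + 1);
                }
            } else {                                     // deep level: execute directly
                Board copy = board;
                copy.cells[field] = num;
                solve_parallel(copy, field + 1);
            }
        }
    }
    if (field < CUTOFF_FIELD) {
        #pragma omp taskwait                             // only needed where tasks were created
    }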
3.3.1.5 Overhead
The different measurement techniques (sampling or event-based) and the different levels of detail of the stored data (profile or trace) result in different runtime overheads and amounts of stored data. Table 3.2 shows the overhead and the amount of stored data for the different tools for a test run with 16 threads. It can be seen that the overhead for the event-based tools is significant, at about 150 %. The generated data should therefore be interpreted with care, but as shown above, the information still provides useful hints on performance problems. Even if the true average task duration were only 2 μs instead of the measured 4.6 μs, the identified problem would only be more pronounced. The amount of data generated for the trace might also become a problem for larger programs, so our recommendation is to generate detailed traces only for the critical parts of an application; other, lighter measurements can help to identify these parts in advance.
3.3.2 Conjugate Gradient Method
Next we investigated a conjugate gradient solver (CG) in which the sparse matrix-vector multiplication was parallelized with OpenMP tasks. The tasks are spawned in a loop and each task processes cs rows of the matrix. The parameter cs can be adjusted to spawn more and smaller tasks, which is better for load balancing if the sparsity pattern is irregular, or to spawn only a few large tasks, which introduces less overhead. We used all performance tools for different values of cs. Again, all tools identified the task region as the hotspot. The sampling-based tools also showed information on the source-code level, but since the task region is only four lines long, this did not give any additional information. The event-based tools again delivered the number of tasks and the execution time of individual instances, but this information was also of limited use: since the number of matrix rows and the parameter cs are known, the programmer can easily compute it. For extremely large tasks, all tools identified load imbalance in the CG kernel. Overall, finding the best value for cs comes down to increasing cs until the overhead spent in the OpenMP runtime is negligible. Since the sampling-based tools deliver this information directly, due to their closer cooperation with the vendor OpenMP runtimes, they were preferable here.
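A hedged sketch of this iterative creation pattern is given below; it assumes CRS matrix storage, and all names (spmv_tasks, cs as the chunk size) are illustrative rather than the original CG code.

    // Sketch of iterative task creation in a sparse matrix-vector product (CRS format assumed).
    void spmv_tasks(int n, const int *row_ptr, const int *col_idx,
                    const double *val, const double *x, double *y, int cs) {
        #pragma omp parallel
        #pragma omp single
        {
            for (int first = 0; first < n; first += cs) {
                int last = (first + cs < n) ? first + cs : n;
                // one task per chunk of cs rows: larger cs means fewer, coarser tasks
                // (less overhead), smaller cs means better load balancing
                #pragma omp task firstprivate(first, last)
                {
                    for (int row = first; row < last; row++) {
                        double sum = 0.0;
                        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++)
                            sum += val[k] * x[col_idx[k]];
                        y[row] = sum;
                    }
                }
            }
        }   // implicit barrier of the parallel region: all tasks have finished here
    }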
3.3.3 KegelSpan
The last investigated application is KegelSpan, a code simulating gearwheel cutting processes. KegelSpan is developed at the Laboratory for Machine Tools and Production Engineering at RWTH Aachen University. We investigated an experimental version of the code which has been parallelized with OpenMP tasks [6] and which differs from the production version. In the analyzed version a BSP tree is used to better handle the geometry, which is adaptively refined at the cutting edge during the simulation. In particular, the routine that recursively sets up the BSP tree was investigated here. The problem of too small tasks had been observed for this code before, and a cut-off mechanism had been implemented which stops the creation of tasks at a certain configurable depth. We investigated this application with all mentioned performance tools and found two performance weaknesses. First, with Score-P (tracing) we observed that empty tasks are created even on the upper levels. The reason is that the BSP tree splits a cell into two equal parts on every level. In areas where only a few points reside, this can quickly lead to empty volumes, whereas in the refined area around the cutting edge many points fall into an equally sized volume. We therefore changed the cut-off mechanism to stop creating tasks if the number of points in a cell is smaller than a certain threshold (version opt1). Second, the sampling-based tools located the hotspot on the source-code level at lines 4,717–4,744, where points are resorted in an array. This step had never been regarded as a hotspot and a simple sorting strategy was used; we improved this sorting to further optimize the routine (version opt2). The event-based tools highlighted the task region, which is several hundred lines long, as the hotspot, so the more detailed sampling-based analysis was useful to pinpoint the hotspot here. Figure 3.7 shows the runtime of the original code version for different cut-off depths and of both optimized versions. The optimized versions save about 5 % (opt1) and an additional 10 % (opt2) of execution time. The overhead of all investigated tools for this application was below 4 %, so even for the event-based tools the observed runtime was nearly undisturbed.
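For illustration, the modified cut-off (opt1) could look roughly as follows; the Node type, the threshold MIN_POINTS_PER_TASK and the splitting routine are assumptions of ours rather than the actual KegelSpan code.

    // Illustrative sketch of a point-count based cut-off in recursive BSP tree construction.
    struct Node {
        Node *left = nullptr, *right = nullptr;
        int   num_points = 0;
        // ... geometry data ...
    };

    const int MIN_POINTS_PER_TASK = 64;                  // hypothetical threshold

    void split_cell(Node *node);                         // distributes points to two children (not shown)

    void build_bsp(Node *node) {
        if (node->num_points == 0) return;               // nothing to refine in an empty volume
        split_cell(node);

        // spawn tasks only for cells that contain enough points; (nearly) empty
        // volumes are processed sequentially and no longer produce empty tasks
        bool spawn = node->num_points >= MIN_POINTS_PER_TASK;

        Node *children[2] = { node->left, node->right };
        for (Node *child : children) {
            if (spawn) {
                #pragma omp task firstprivate(child)
                build_bsp(child);
            } else {
                build_bsp(child);                        // small subtree: recurse in the current task
            }
        }
        if (spawn) {
            #pragma omp taskwait
        }
    }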
3.4 Conclusion
In this work we looked at several codes employing OpenMP tasks to express parallelism and used different performance analysis tools to investigate the code performance and to search for possible performance problems. The codes use recursive or iterative creation of tasks. Both sampling-based tools presented roughly the same kind of information at the same level of detail. The strength of these tools is detailed statistical information at fine granularity, down to the source-code level. This information proved useful to identify a hotspot within a task in the KegelSpan code. Both tools also showed the amount of runtime overhead, which helped to identify inefficient task constructs. The downside is that the task construct is investigated, not the individual task instance. When tasks created by the same construct behaved differently, which particularly happens with recursive task creation, this phenomenon was not observed by these tools. The event-based tools provided more detailed information on the task-instance level, which helped to identify the average task duration in profiling mode or even the duration of individual task instances on different call-stack levels in tracing mode. This information also provided valuable insights into the execution behavior of the KegelSpan application. The overhead of the event-based tools was sometimes significantly higher than that of the sampling-based tools, but the information was still very useful in many cases.
References
Duran, A., Teruel, X., Ferrer, R., Martorell, X., Ayguade, E.: Barcelona OpenMP tasks suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP. In: Parallel Processing, 2009 (ICPP ’09), Vienna, pp. 124–131 (Sept 2009)
Fürlinger, K., Skinner, D.: Performance profiling for OpenMP tasks. In: Müller, M.S., Supinski, B.R., Chapman, B.M. (eds.) Evolving OpenMP in an Age of Extreme Parallelism. Lecture Notes in Computer Science, vol. 5568, pp. 132–139. Springer, Berlin/Heidelberg (2009). http://dx.doi.org/10.1007/978-3-642-02303-3_11
Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The SCALASCA performance toolset architecture. In: Proceedings of the International Workshop on Scalable Tools for High-End Computing (STHEC), Kos, pp. 51–65 (June 2008)
Gerndt, M., Ott, M.: Automatic performance analysis with periscope. Concurr. Comput.: Pract. Exp. 22(6), 736–748 (2010)
Intel: Intel VTune Amplifier XE (Sept 2013). http://software.intel.com/en-us/intel-vtune-amplifier-xe
Kapinos, P., an Mey, D.: Productivity and performance portability of the OpenMP 3.0 tasking concept when applied to an engineering code written in Fortran 95. Int. J. Parallel Program. 38(5–6), 379–395 (2010). http://dx.doi.org/10.1007/s10766-010-0138-1
Knüpfer, A., Rössel, C., an Mey, D., Biersdorff, S., Diethelm, K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Malony, A.D., Nagel, W.E., Oleynik, Y., Philippen, P., Saviankou, P., Schmidl, D., Shende, S.S., Tschüter, R., Wagner, M., Wesarg, B., Wolf, F.: Score-P – a joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In: Proceedings of 5th Parallel Tools Workshop, Dresden, (Sept 2011)
Lin, Y., Mazurov, O.: Providing observability for OpenMP 3.0 applications. In: Müller, M.S., Supinski, B.R., Chapman, B.M. (eds.) Evolving OpenMP in an Age of Extreme Parallelism. Lecture Notes in Computer Science, vol. 5568, pp. 104–117. Springer, Berlin/Heidelberg (2009). http://dx.doi.org/10.1007/978-3-642-02303-3_9
Mohr, B., Malony, A.D., Shende, S., Wolf, F.: Design and prototype of a performance tool interface for OpenMP. J. Supercomput. 23(1), 105–128 (2002)
Nagel, W., Weber, M., Hoppe, H.C., Solchenbach, K.: VAMPIR: visualization and analysis of MPI resources. Supercomputer 12(1), 69–80 (1996)
Oracle: Oracle Solaris Studio 12.2: Performance Analyzer (Sept 2013). http://docs.oracle.com/cd/E18659_01/html/821-1379/
Shende, S., Malony, A.D.: The TAU parallel performance system, SAGE publications. Int. J. High Perform. Comput. Appl. 20(2), 287–331 (2006)
Terboven, C., Schmidl, D., Cramer, T., an Mey, D.: Assessing OpenMP tasking implementations on NUMA architectures. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) OpenMP in a Heterogeneous World. Lecture Notes in Computer Science, vol. 7312, pp. 182–195. Springer, Berlin/Heidelberg (2012). http://dx.doi.org/10.1007/978-3-642-30961-8_14
Acknowledgements
Parts of this work were funded by the German Federal Ministry of Research and Education (BMBF) under Grant No. 01IH11006 (LMAC).