By Ryan Pepper and James C. Womack
This is a speed blog written up as part of the Bath Numerical Debugging Workshop activities.
During the Bath Numerical Debugging Workshop, we participated in a bug hunting session where people brought along real-world bugs which we attempted to tackle. James Womack brought along an issue he had identified while working on the ONETEP density functional theory software package. In this blog post we will discuss the bug, approaches taken to solving it, and the eventual resolution.
Title: “Unexpected changes to storage of local variables when compiling Fortran with OpenMP”
Language: Fortran 2003/2008
Code: ONETEP (http://www.onetep.org)
Submission: Attempted minimal working example (MWE) with example ONETEP output files
In Fortran, a variable with the ‘SAVE’ attribute retains its value for the lifetime of the program. For local variables, this is functionally equivalent to ‘static’ in C family languages. According to the Fortran 2003 standard (J3/04-007, section 5.1, p. 74 of the working draft - the actual standard is behind a paywall!), the attribute is set implicitly when a variable is explicitly initialised in its declaration. It was found that when using the Intel Fortran compiler with OpenMP enabled, a particular boolean variable which should have been implicitly ‘SAVEd’ in the program was no longer being SAVEd. This variable was set to true when a potential numerical issue was detected and a warning message about it had been output. Setting this SAVEd variable to true should have suppressed the warning for the remainder of program execution. With Intel Fortran and OpenMP, this was not occurring and the warning message was being output repeatedly throughout execution. With GFortran, however, the program behaved as expected.
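As a minimal sketch of the implicit SAVE rule (this is an illustrative program, not code from ONETEP, and the names implicit_save_demo, count_calls and ncalls are hypothetical), consider a local counter that is initialised in its declaration:

```fortran
program implicit_save_demo
  implicit none
  integer :: i

  do i = 1, 3
    call count_calls()
  end do

contains

  subroutine count_calls()
    ! Initialisation in the declaration implies the SAVE attribute,
    ! so ncalls keeps its value between calls and is set to 0 only
    ! once, not on every entry to the subroutine.
    integer :: ncalls = 0

    ncalls = ncalls + 1
    print '(a,i0)', 'call number ', ncalls
  end subroutine count_calls

end program implicit_save_demo
```

With a standard-conforming compiler the counter should reach 1, 2 and 3 over the three calls. If the implicit SAVE were not honoured, the value of ncalls on later calls would not be reliable.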
James had already confirmed that this was not due to a race condition, because the problem remained even when the code was run with a single OpenMP thread. Additionally, when the variable was explicitly set to ‘SAVE’ the bug disappeared, suggesting a solution but not a satisfactory explanation.
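The workaround of making the SAVE attribute explicit might look something like the sketch below. This is not the ONETEP routine: the subroutine check_value, the flag warned and the "negative value" condition are all stand-ins invented for illustration.

```fortran
program warn_once_demo
  implicit none
  integer :: i

  ! Trigger the warning condition several times in a row.
  do i = 1, 5
    call check_value(-1.0)
  end do

contains

  subroutine check_value(x)
    real, intent(in) :: x
    ! Explicit SAVE: the flag is declared persistent directly, rather
    ! than relying on the SAVE implied by the initialisation alone.
    logical, save :: warned = .false.

    if (x < 0.0 .and. .not. warned) then
      print '(a)', 'WARNING: negative value encountered'
      warned = .true.   ! suppress the warning on all later calls
    end if
  end subroutine check_value

end program warn_once_demo
```

The intended behaviour is that the warning is printed on the first call only. Under the standard, the explicit SAVE here is redundant with the initialisation, which is why the fact that adding it changed the behaviour pointed towards the compiler rather than the code.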
James had previously tried to construct a minimal working example (MWE), but at the start of the session he had not yet managed to reproduce the bug. This suggested that the explanation was not simple (i.e., Intel Fortran did not always fail to set an implicit SAVE attribute when compiling with OpenMP) and that any of a number of factors could have been the source of the error.
Our goal for the bug hunt session was to attempt to reproduce the bug in a MWE and investigate the reasons for the unexpected behaviour observed with the Intel Fortran compiler.
Details from the bug hunt session
During the course of the session, we took a stripped-down reproduction of the original ONETEP routine in which the bug occurred and produced a MWE from this. This new MWE code reproduced the bug and attempts could therefore be made at debugging. This took up the bulk of the session because ‘fake’ data which would trigger the condition that caused the code path of interest to run needed to be constructed, and this was non-trivial when isolated from the larger simulation package. We tested this with the GNU Fortran compiler and PGI compiler and in neither case was the bug reproduced. Only with Intel Fortran (version 17.0.2) was the bug evident.
We set out to debug the code using a debugger (the Intel compiler includes gdb-ia, an Intel version of the GNU debugger which resolves instructions specific to Intel hardware), but immediately found that compiling with the ‘-g’ flag to include debugging symbols removed the bug. This puzzled us at first, but we found that, in the absence of an explicit optimisation setting, the debugging flag automatically sets the optimisation level to ‘-O0’. We experimented with the optimisation levels and found that at ‘-O0’ the code behaved as expected (i.e., the variable was SAVEd), but at ‘-O1’ the bug reappeared. This was consistent with the original observation, because ONETEP’s compilation options set the optimisation level to ‘-O2’ by default.
We started looking into the Intel Fortran compiler documentation for a list of the optimisations turned on at each level, with the intention of applying these individually to find which one caused the issue. However, we did not have time to explore this fully.
When trying to reproduce the bug on Ryan’s machine, which had a newer version of the Intel Compiler (18.0.0 20170811) than James was running (17.0.2 20170213), the issue magically disappeared. This suggests the issue is a subtle compiler bug, associated with the interaction of compiler optimisations and OpenMP. It goes to show, however, the importance of testing numerical codes against different compiler stacks. A takeaway from the discussion sessions on debugging numerical software later in the workshop was just how common compiler bugs can be, and how they can manifest themselves in numerical codes in particular.
During the bug hunt session we:
successfully created a MWE which reproduced the original bug in ONETEP;
confirmed that the issue was most likely a compiler bug in Intel Fortran 17.0.2; and
found that the bug appears to be caused by an interaction of compiler optimisations with OpenMP.
Addendum: Variable allocation and OpenMP
During the investigation of the bug we also discovered some useful general information about the behaviour of Fortran compilers when compiling with OpenMP. It appears that both the GFortran and Intel Fortran compilers silently apply flags which change how variables are allocated when compiling with OpenMP:
GFortran: -fopenmp implies -frecursive, which forces all local arrays to be allocated on the stack.
Intel Fortran: -qopenmp implies -automatic, which causes all local, non-SAVEd variables to be allocated on the run-time stack.
This information is potentially useful when debugging issues associated with OpenMP. If the behaviour of your program changes upon adding/removing OpenMP flags, you may have an issue related to a change in how local variables are stored. For example, your program may be “accidentally” relying on the persistence of local variables between subroutine calls.
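To make the "accidental" reliance concrete, here is a hedged sketch of a routine that caches a value between calls. The module scale_mod, the function scaled and the variables factor and initialised are hypothetical names chosen for illustration, and the constant 2.0 stands in for some expensive one-off computation.

```fortran
module scale_mod
  implicit none

contains

  function scaled(x) result(y)
    real, intent(in) :: x
    real :: y
    ! The flag is initialised in its declaration, so it would be
    ! SAVEd implicitly; SAVE is written out here for clarity.
    logical, save :: initialised = .false.
    ! factor has no initialiser. Without an explicit SAVE it would be
    ! an ordinary local variable: its value might happen to survive
    ! between calls when locals are stored statically, but not once
    ! they are placed on the stack (e.g. when OpenMP is enabled).
    real, save :: factor

    if (.not. initialised) then
      factor = 2.0   ! stand-in for an expensive one-off computation
      initialised = .true.
    end if

    y = factor * x
  end function scaled

end module scale_mod

program persistence_demo
  use scale_mod, only: scaled
  implicit none

  print '(f4.1)', scaled(1.0)
  print '(f4.1)', scaled(3.0)
end program persistence_demo
```

As written, both variables are explicitly SAVEd, so the second call reuses the cached factor. Dropping the save from factor gives the fragile version: the flag persists but the cached value is no longer guaranteed to, which is the kind of dependence that a change in local variable storage can expose. Note also that SAVEd variables are shared between OpenMP threads, so a cache like this would need protecting if the routine were called from a parallel region.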