From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Sun breaks another SPEC benchmark
Date: Fri, 21 Mar 2003 22:32:02 +0000 (UTC)
Message-ID: <b5g3t2$ima$1@ausnews.austin.ibm.com>

In the ongoing conflict between static benchmarks and the need for
marketing collateral, another SPEC benchmark seems to have just lost
a lot of its traditional usefulness....

SPECfp_rate2000 results for the Sun V1280 show that the addition of a
new flag (-Atile=skewp) boosts performance on the "171.swim"
benchmark by almost a factor of five. Although not quite as dramatic
as the now infamous "179.art" hack, this optimization will make it
even more difficult to interpret SPECfp_rate2000 results.

Obviously, I don't know exactly what Sun's compiler is doing with
this benchmark, but I think that I probably understand the
performance characteristics of this set of equations about as well as
anyone in the world, and their range of options is quite limited.

I know of two families of optimization techniques that can provide
large boosts to the performance of 171.swim. One of those families
seems to be limited to small constant-factor speedups, and I can't
see how to get much past 2x. The other class of optimizations could
completely eliminate the memory bottleneck and turn 171.swim into a
completely cpu-bound program. Although the technique goes under many
names, it is called "time-skewing" in a Rutgers University technical
report by David Wonnacott and John McCalpin. A convenient citation is
http://citeseer.nj.nec.com/mccalpin99time.html

Having been thinking about this class of optimizations since the
mid-1990's, I think that I am in a fair position to judge the general
implementation of such techniques to be very difficult. Implementing
the technique for the equations solved by 171.swim is hard --
implementing the technique for the specific source code
representation of 171.swim is heroic (or masochistic?).
Many special-case analyses are needed to apply time-skewing to the
particulars of 171.swim, and I am not sure if I should be in awe of
the dedication of the Sun compiler writers or in awe of their
desperation.

Although the time-skewing optimizations are valid, potential
customers should ask very hard questions about the robustness of such
optimizations for different source code. If experience is a guide,
these optimizations will be very brittle, and it will be virtually
impossible to tell whether a particular piece of source code will
meet the criteria for the compiler to perform this complex set of
optimizations.

So now when reviewing Sun SPECfp_rate2000 results, you have to factor
in the 179.art hack (which gave a 7.1x boost on the early 900 MHz Sun
systems) and this optimization, which gives a 4.6x boost in the
171.swim rate on the 900 MHz V1280 system. The combined boost in the
SPECfp_rate2000 score is the fourteenth root of the product of these
two speedups, which is about 28% above the results before these two
optimizations were added. This is no longer a little glitch that
customers can afford to ignore.
--
John D. McCalpin, Ph.D.          mccalpin@austin.ibm.com
Senior Technical Staff Member    IBM POWER Microprocessor Development
"I am willing to make mistakes as long
 as someone else is willing to learn from them."
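Since SPECfp_rate2000 is the geometric mean of its 14 component
benchmark ratios, the arithmetic behind that 28% figure is easy to
check (a quick Python sanity check, using the 7.1x and 4.6x factors
quoted above):

```python
# SPECfp_rate2000 is a geometric mean over 14 benchmarks, so a speedup
# on one component raises the overall score by its fourteenth root.
art_boost = 7.1    # 179.art speedup quoted for the early 900 MHz systems
swim_boost = 4.6   # 171.swim speedup quoted for the 900 MHz V1280

combined = (art_boost * swim_boost) ** (1.0 / 14.0)
print("combined boost: %.3f" % combined)   # about 1.28, i.e. ~28%
```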

From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Mon, 24 Mar 2003 00:45:32 +0000 (UTC)
Message-ID: <b5lkfc$ef8$1@ausnews.austin.ibm.com>

In article <qnPea.196812$qi4.89715@rwcrnsc54>,
Glen Herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
>
>"McCalpin" <mccalpin@gmp246.austin.ibm.com> wrote in message
>news:b5g3t2$ima$1@ausnews.austin.ibm.com...
>> In the ongoing conflict between static benchmarks and the need
>> for marketing collateral, another SPEC benchmark seems to just
>> lost a lot of its traditional usefulness....
>>
>> SPECfp_rate2000 results for the Sun V1280 show the addition of a
>> new flag (-Atile=skewp) boosts performance on the "171.swim"
>> benchmark by almost a factor of five.
>(snip)
>
>[...] For us who
>don't know the 171.swim algorithm could you explain a little why
>you think it is hard to apply time-skewing optimization?

This would be quite difficult to explain concisely, but I will give
it a try....

Consider a collection of two-dimensional arrays, and an iterative
process that updates those arrays based on "spatial" averages and
"spatial" differences. In each step, the "new" value at location
"i,j" (that would be [j][i] for you C folks out there) is calculated
based on the "old" values at points that might be offset as much as
one index value in either direction for each of "i" and "j".

Although the code does not use a separate index for "time" (I will
call it "k"), you can write this as

    Q(i,j,k+1) = Function(Q(%,@,k),Q(%,@,k-1))

where "%" means i, i+1, or i-1, and "@" means j, j+1, or j-1.

In the shallow water equations, the "Function" above involves
performing only a few arithmetic operations for each data element
read from memory or written to memory.
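The dependence pattern just described can be sketched in a few lines.
This is illustrative Python rather than the benchmark's Fortran; the
grid size, the coefficients, and the leapfrog-style combination of the
two old time levels are made up -- only the neighborhood structure
matches the description above.

```python
# A toy version of the Q(i,j,k+1) = F(Q(..,k), Q(..,k-1)) pattern (NOT
# the actual swim kernels): the new time level depends on a neighborhood
# at time k and on the point value at time k-1.
N = 16

def step(q_old, q):
    """Return time level k+1 given levels k-1 (q_old) and k (q)."""
    q_new = [row[:] for row in q_old]     # copies carry boundary values
    for j in range(1, N - 1):             # interior points only; the
        for i in range(1, N - 1):         # real code wraps the edges
            q_new[j][i] = q_old[j][i] + 0.25 * (
                q[j][i-1] + q[j][i+1] + q[j-1][i] + q[j+1][i]
                - 4.0 * q[j][i])
    return q_new
```

Each interior point here costs roughly seven floating-point operations
against roughly seven array references -- the same few-flops-per-access
ratio that makes the real code memory-bandwidth bound.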
The exact number of arithmetic operations per point depends a bit on
the common sub-expression elimination of the compiler, but in
subroutine "CALC1" of swim, there are 24 floating-point operations
associated with accesses to a total of 7 arrays -- so we are looking
at something like 3 floating-point operations per load or store from
main memory. The main memory accesses are all set up to be
contiguous, but this is still about an order of magnitude more
bandwidth per floating-point operation than most modern systems can
provide, so the code is strongly memory-bandwidth dominated on most
systems. (This class of codes is very happy on high-bandwidth vector
systems.)

"Time-skewing" is basically a transformation of variables that
changes Q(i,j,k) to P(i,j+k,j-k). The loops can then be transformed
to these new variables. The advantage of this technique is that it
allows you to effectively calculate many time steps while reading the
arrays from memory (and writing them back to memory) only once.

The actual implementation is horrendously more complex than I have
described because the update of the variables is split across three
subroutines, with complex updates of the elements at the edges at the
end of the first and second of those routines. There is also a global
reduction (sums of absolute values of the elements of three of the
arrays) that takes place between the second and third subroutines.

The general time-skewing algorithm can operate in multiple spatial
dimensions, but you probably only need to transform one spatial
dimension to get the required speedup here. In multiple dimensions,
time-skewing could be enough to allow you to run the shallow water
equations from disk (for huge arrays) with negligible slowdown.
--
John D. McCalpin, Ph.D.          mccalpin@austin.ibm.com
Senior Technical Staff Member    IBM POWER Microprocessor Development
"I am willing to make mistakes as long
 as someone else is willing to learn from them."
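A minimal one-dimensional sketch of the skew-and-tile idea may help.
The assumptions here are mine: Python rather than Fortran, a 3-point
average with fixed boundaries standing in for the real kernels, one
space dimension, and arbitrary sizes N, T, BT, BJ. Substituting
j = i + t plays the role of the Q(i,j,k) -> P(i,j+k,j-k) change of
variables described above.

```python
# Time-skewing in 1D: skew space by time (j = i + t), then tile (t, j).
import math

N, T = 64, 32          # grid points, time steps
BT, BJ = 8, 16         # tile sizes in time and in the skewed index

def init():
    a = [math.sin(0.1 * i) for i in range(N)]
    return [a[:], a[:]]        # two time planes; boundaries in both

# Reference: stream the whole array once per time step.
ref = init()
for t in range(T):
    for i in range(1, N - 1):
        ref[(t+1) & 1][i] = (ref[t & 1][i-1] + ref[t & 1][i]
                             + ref[t & 1][i+1]) / 3.0

# Skewed and tiled: point (t+1, j) depends only on (t, j-2..j), so all
# dependences point backward in t and backward-or-sideways in j, and
# lexicographic tile order (t-band outer, j-tile inner) is legal.
skw = init()
for tb in range(0, T, BT):
    for jb in range(1, N - 1 + T, BJ):
        for t in range(tb, min(tb + BT, T)):
            for j in range(max(jb, 1 + t), min(jb + BJ, N - 1 + t)):
                i = j - t
                skw[(t+1) & 1][i] = (skw[t & 1][i-1] + skw[t & 1][i]
                                     + skw[t & 1][i+1]) / 3.0

max_err = max(abs(a - b) for a, b in zip(ref[T & 1], skw[T & 1]))
```

The two versions perform identical arithmetic, but the reference
version streams the array through memory T times, while the skewed
version touches each tile's data once per BT time steps -- roughly a
BT-fold reduction in main-memory traffic once a tile fits in cache.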

From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Mon, 24 Mar 2003 00:58:49 +0000 (UTC)
Message-ID: <b5ll89$bvo$1@ausnews.austin.ibm.com>

In article <3e7d0d11$1@news.meer.net>,
Greg Lindahl <lindahl@pbm.com> wrote:
>
>You're making a big assumption here -- while the weather community
>doesn't use shallow water anything these days, it was my impression
>that people studying things like the Chesapeake Bay still use shallow
>water models. Hopefully John will speak up if I'm wrong, but the
>general point is that a code that's obsolete in one field might be
>current in another.

Some shallow water modelling is certainly still done for nearshore
environments and lakes. Although the theoretical foundations of
time-skewing are based on the hyperbolic nature of the equations (and
are therefore applicable to any implementation), I think it quite
unlikely that any automatic system could "do the right thing" for
shallow water equations on irregular or unstructured meshes (which
would be more representative of current applications).

When I was working on my Ph.D. in the middle 1980's, my research
group drew a fair amount of criticism for continuing to use shallow
water models. So this level of sophistication was considered antique
almost twenty years ago. It is still applicable to some specific
theoretical problems (and I used a simple shallow water model in a
paper in 1995 with no objections), but I would certainly not contend
that these applications actually drive the purchase of very many
computers.

The 171.swim benchmark in SPEC is really a toy: in the complexity of
the geometry (square), the complexity of the grid (square), and the
complexity of the physics (none). Enhancements to the usefulness of
the model in any of those directions would almost certainly prevent
automatic application of the time-skewing transformations.
--
John D. McCalpin, Ph.D.          mccalpin@austin.ibm.com
Senior Technical Staff Member    IBM POWER Microprocessor Development
"I am willing to make mistakes as long
 as someone else is willing to learn from them."

From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Mon, 24 Mar 2003 13:07:39 +0000 (UTC)
Message-ID: <b5mvur$bb4$1@ausnews.austin.ibm.com>

In article <Lvwfa.196539$L1.33670@sccrnsc02>,
Glen Herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
>
>It sounds sort of like a partial differential equation solver, though
>seven arrays is more than I usually see.

The shallow water equations are a set of hyperbolic equations. There
are three equations and the resulting system is hyperbolic (which
basically means that information propagates at finite speeds, or in
this case, a finite number of grid points per time step).

>I have worked on Poisson solvers before, and thought some how the
>access patterns affect the timing.

Poisson's equation is elliptic (which means that information travels
at infinite speed through the solution). Many classic iterative
solvers for Poisson's equation look "hyperbolic" (in the sense that
information propagates through the solution at a finite number of
grid points per iteration), and the time-skewing algorithm is
perfectly applicable to this class of algorithm. The TOMCATV
benchmark from SPECfp95 was an elliptic equation solver for a
technique called automatic grid generation.

>I can almost imagine a compiler time skewing a two dimensional
>Poisson solver, but seven arrays would make it significantly harder.

There are nine primary arrays for the three solution variables at
each of the three time levels required. Then there are four
additional arrays used to carry temporary values between the first
and second computational subroutines. A 14th array is used only by
the initialization routine, though there is no reason why additional
storage needed to be allocated for this purpose. All of the arrays
sit in contiguous global storage (Fortran "blank common").
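The point that classic iterative elliptic solvers "look hyperbolic"
is easy to demonstrate directly. The sketch below (grid size, h^2,
and the point-source right-hand side are all arbitrary choices of
mine) runs Jacobi sweeps for a 1D Poisson problem and watches the
nonzero part of the solution spread one grid point per sweep:

```python
# Jacobi sweeps for -u'' = f on a 1D grid with a point source: after k
# sweeps the nonzero part of u extends only k-1 points from the source,
# i.e. information moves through this elliptic solve at a finite speed.
N, H2 = 33, 1.0          # interior grid size and h^2 (arbitrary)
src = N // 2
f = [0.0] * N
f[src] = 1.0

u = [0.0] * N
radius_after = []
for sweep in range(1, 6):
    # new list is built from the old u before the rebinding happens
    u = ([0.0]
         + [0.5 * (u[i-1] + u[i+1] + H2 * f[i]) for i in range(1, N-1)]
         + [0.0])
    radius_after.append(max(abs(i - src)
                            for i in range(N) if u[i] != 0.0))

print(radius_after)   # the nonzero radius grows by one per sweep
```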
>If it had to do it across function calls, it would seem it would
>have to inline the function first.

Yes, but the really ugly stuff is figuring out how to account for the
manipulation of the edge elements at the end of subroutines CALC1 and
CALC2. The code runs around the edges and sets the first value in
each row/column equal to the last value (or vice versa, depending on
which array it is). It is also ugly figuring out how to handle the
global reduction that occurs between CALC2 and CALC3.

Finally, it is ugly to do the transformations to even find this
stuff, since the whole iteration loop is based on a GOTO (rather than
an explicit loop), with the IF test for exit happening between CALC2
and CALC3, and with an alternate version of CALC3 called the first
time through.

I can easily *imagine* implementing all of this stuff, but I cannot
imagine implementing coverage of all the possible combinations of the
ways in which the shallow water equations could have been
implemented. Sort of like a binary tree with several levels. A
vendor may not be required to implement *all* paths through the tree
in order to have a "general-purpose" optimization, but how many is
enough to avoid the label "benchmark special"?
--
John D. McCalpin, Ph.D.          mccalpin@austin.ibm.com
Senior Technical Staff Member    IBM POWER Microprocessor Development
"I am willing to make mistakes as long
 as someone else is willing to learn from them."

From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Tue, 25 Mar 2003 03:09:38 +0000 (UTC)
Message-ID: <b5oh9i$jo2$1@ausnews.austin.ibm.com>

In article <hxMfa.204992$L1.35404@sccrnsc02>,
Glen Herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
>
>So the arrays are one after another in storage, and the corresponding
>element of each, and neighbors of that element, needs to be accessed
>in evaluating each point. I don't know if it would help in this
>case, but there was a discussion some time ago, possibly C related,
>about the ability of a compiler to rearrange such storage so that
>related elements are stored together. (In C, an array of structures
>instead of structure of arrays.)

That would not help here, though it can be a useful optimization for
some other coupled solvers.

>Well, the boundary is O(N), and the interior O(N**2). If you can
>optimize the interior, you win (for sufficiently large N). I believe
>one of the other posts said it is a rectangular problem. That makes
>it much easier.

The trouble is that at each time step, the stuff that was "boundary"
at the previous time step now propagates its information further into
the interior.

In two dimensions, imagine blocking the 2D array into smaller chunks.
Then at each time step, the sub-block that you work on is one element
smaller in each dimension. Eventually you run out of elements to
work on, and you have updated all the elements in a sort of "pyramid"
in the three-dimensional space composed of the two spatial dimensions
plus time. Then you have to figure out an ordering for completing
the updates between the pyramids. Upside-down pyramids are easy to
calculate, but right-side-up pyramids plus upside-down pyramids do
not fully tile the space -- there are still bits and pieces in
between that need to be updated.

There are variants based on rectangular (rather than square)
sub-domains. If these span the original grid, then the book-keeping
is simpler, but the scheme may require a lot more cache memory.
--
John D. McCalpin, Ph.D.          mccalpin@austin.ibm.com
Senior Technical Staff Member    IBM POWER Microprocessor Development
"I am willing to make mistakes as long
 as someone else is willing to learn from them."
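In one space dimension the "pyramids" are triangles, and the
shrinking-block effect is easy to see in a sketch. The assumptions
are mine: a 3-point average stands in for the real update, and the
block position and sizes are illustrative. A block of the grid is
stepped repeatedly without touching its neighbors, and only an
interior that loses one point from each end per step comes out right:

```python
import math

N, T = 48, 6
lo, hi = 16, 32            # the block held in cache: [lo, hi)

a = [math.sin(0.1 * i) for i in range(N)]

# Reference: full-grid 3-point averaging for T steps (two time planes).
full = [a[:], a[:]]
for t in range(T):
    for i in range(1, N - 1):
        full[(t+1) & 1][i] = (full[t & 1][i-1] + full[t & 1][i]
                              + full[t & 1][i+1]) / 3.0

# "Pyramid": step only within the block.  The update range loses one
# point from each end per step, tracing a triangle in space-time, so
# no stale value outside the block is ever read.
blk = [a[:], a[:]]
for t in range(T):
    for i in range(lo + t + 1, hi - t - 1):
        blk[(t+1) & 1][i] = (blk[t & 1][i-1] + blk[t & 1][i]
                             + blk[t & 1][i+1]) / 3.0

# Which block points carry the correct T-step values?
correct = [i for i in range(lo, hi) if blk[T & 1][i] == full[T & 1][i]]
print(correct)   # only the interior [lo+T, hi-T) of the block
```

The points outside that shrinking interior are exactly the "bits and
pieces in between" that a second pass (the upside-down pyramids, plus
the leftovers) still has to update.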

From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Mon, 24 Mar 2003 19:49:53 +0000 (UTC)
Message-ID: <b5nnh1$6vi$1@ausnews.austin.ibm.com>

In article <b5nloq$fmp$2@vkhdsu24.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
>McCalpin wrote:
>> Although the technique goes under many names, it is
>> called "time-skewing" in a Rutgers University technical report
>> by David Wonnacott and John McCalpin. A convenient citation is
>> http://citeseer.nj.nec.com/mccalpin99time.html
>
>It took me a little while, but I believe I understand it:
>
>The optimization consists of blocking a set of array operations
>across time, and not just array space, right?

Sort of. There are a variety of ways that you can order the
execution of the operations in the transformed space. In one case
you are doing blocking in both space and time, but your domain gets
one grid point smaller on each end at each step. Then when you are
finished with this "pyramid", you still need to go and calculate all
the pieces that you missed. An alternate ordering is to actually do
the calculations along "diagonal" vectors in the (implied) space-time
grid. The two approaches have different advantages and disadvantages
depending on the dimensionality of the problem and whether or not you
are attempting to solve the problem in parallel.

>This is sort of what I call speed of light: Starting with some
>initial situation, what is the minimum number of operations
>theoretically possible to get to the desired destination?

There is not a lot of opportunity to reduce the amount of arithmetic,
but there is nothing in the data dependence that says that you have
to stream all 13 arrays to and from main memory at each step. If you
can figure out how to do a larger number of steps at a single go, you
can read the arrays once at the beginning and write them once at the
end of the "many-step" process.

The code was widely distributed before SPEC began using it, and I
probably have a copy from my professor days. I will see if there is
still a copy on my Linux box at home that I can post....
--
John D. McCalpin, Ph.D.          mccalpin@austin.ibm.com
Senior Technical Staff Member    IBM POWER Microprocessor Development
"I am willing to make mistakes as long
 as someone else is willing to learn from them."

From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Tue, 25 Mar 2003 13:03:46 +0000 (UTC)
Message-ID: <b5pk3i$9na$1@ausnews.austin.ibm.com>

In article <b5p1e5$ai0$1@vkhdsu24.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
>McCalpin wrote:
>> In one
>> case you are doing blocking in both space and time, but your
>> domain gets one grid point smaller on each end at each step.
>> Then when you are finished with this "pyramid", you still
>> need to go and calculate all the pieces that you missed.
>
>Anyway, for the algorithms where I've tried it, the pyramid surface
>quickly becomes so large as to negate any speedups for the internal
>calculations.
>
>This is even if I just try to disregard the edge conditions, i.e.
>doing all internal nodes over a number of iterations.

In the original performance experiments that Dave Wonnacott and I
did, the extra bookkeeping did slow down the base code by a small
constant factor. So the experiment that Dave did was to scale up the
TOMCATV benchmark to overflow the available physical memory. With
the last level of the memory hierarchy now being painfully slow, the
bandwidth reduction from the time-skewing was able to give something
like a 30x performance boost over the original code running to/from
the disk.

As I said before, there are a number of orderings that can be applied
to the transformed set of equations. "Pyramids" are the best (if you
can ignore edge effects), but other orderings can still give a
substantial speedup.
--
John D. McCalpin, Ph.D.          mccalpin@austin.ibm.com
Senior Technical Staff Member    IBM POWER Microprocessor Development
"I am willing to make mistakes as long
 as someone else is willing to learn from them."

Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
From: lindahl@pbm.com (Greg Lindahl)
Message-ID: <3e8110e7$1@news.meer.net>
Date: Wed, 26 Mar 2003 02:33:28 GMT

In article <b5ll89$bvo$1@ausnews.austin.ibm.com>,
McCalpin <mccalpin@gmp246.austin.ibm.com> wrote:
>When I was working on my Ph.D. in the middle 1980's, my research
>group drew a fair amount of criticism for continuing to use shallow
>water models.

OK, so I'll give an example from my area, numerical hydrodynamics,
specifically compressible, supersonic flows.

At the risk of overgeneralization, time-explicit schemes represent
the data as piecewise constant, piecewise linear, or piecewise
parabolic quantities. As you get a more complicated representation,
the computation time per cell rises, but the errors are smaller, so
you can use fewer cells. So in 3 dimensions, piecewise parabolic is
always the cheapest in the end. One such algorithm is PPM, which is
one of the ASCI showcase codes. So for a huge community, piecewise
linear is obsolete.

But there's still a sub-community that uses it. One good reason is
that sometimes you have additional physics that's less accurate than
the hydro. In that case, it's not worth it to have much more
accurate hydro than other stuff, so it's pointless to spend the extra
computation needed for piecewise parabolic representations. This was
the case a decade ago for MHD (magnetohydrodynamics). I don't know
if it's still the case.

And I'm pleased as punch that I managed to not mention "order" in
this entire posting :-)

greg

From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Wed, 26 Mar 2003 14:17:49 +0000 (UTC)
Message-ID: <b5scqd$i2o$1@ausnews.austin.ibm.com>

In article <rfaga.4998$Q57.312734840@newssvr21.news.prodigy.com>,
Andy Glew <andy_glew_public@yahoo.com> wrote:
>> Some shallow water modelling is certainly still done for nearshore
>> environments and lakes. Although the theoretical foundations of
>> time-skewing are based on the hyperbolic nature of the equations
>> (and are therefore applicable to any implementation), I think it
>> quite unlikely that any automatic system could "do the right thing"
>> for shallow water equations on irregular or unstructured meshes
>> (which would be more representative of current applications).
>
>Hmm... It seems to me that the time skewing optimization could be
>fairly easily applied to "move forward" the computation for any
>subset of the mesh. The border keeps growing.

The trouble is that on unstructured meshes, the neighbors are only
known at run-time. At compile time there is no way to know if
elements with "near" index values are "near" in memory, and
vice-versa.

For irregular meshes (typical of all current production ocean models)
the neighbors are known at compile time, but whether a cell is "wet"
or "dry" is only known at runtime. If you are willing to do the
calculations at all of the points and then throw away the "dry" cells
after you have completed the update, then this would work. The bad
news is that for typical global cases, fully 1/2 of the data points
would be thrown away after each step.

A more serious problem in ocean models is that there is typically
separate time-stepping for the vertically averaged flow (either with
short time steps or with long time steps using an implicit solver ---
typically conjugate gradient) and the vertically varying flow (with
longer time steps).

Time-skewing could be applied to the vertically averaged flow if
using short (explicit) time steps, or it could be applied to certain
classes of iterative solvers (but definitely not conjugate gradient,
or any other non-stationary Krylov scheme). Time-skewing could also
be applied to the vertically varying flow. I suppose in theory I can
see how to apply time-skewing to the combined problem (as long as you
avoid non-stationary solvers), but it makes my head hurt to think
about the complexity of the resulting code.

>Hmm... clustering of the nodes of an irregular mesh in all dimensions
>might go a long way towards improving cache locality; the cluster
>boundaries would be related to the subset that you want to time skew.

That part is already part of standard software for sparse problems.
The incentive was originally to reduce the bandwidth of the sparse
matrix to reduce fill-in for direct sparse matrix solvers, but it
clearly improves cache utilization as well.

I can't see any way to do time-skewing in any useful sense for
unstructured meshes, but maybe I have not thought about it enough.
Shame on me....
--
John D. McCalpin, Ph.D.          mccalpin@austin.ibm.com
Senior Technical Staff Member    IBM POWER Microprocessor Development
"I am willing to make mistakes as long
 as someone else is willing to learn from them."
