adamsc@shark.WV.TEK.COM (Chuck Adams) (08/26/89)
About two weeks ago I tried to contact Claus Gittinger with the following
concerns about the XSTONES calculations in xbench.  Because I am concerned
that others may already be using this program, I am posting this mail while
awaiting an answer from Claus.

---
chuck adams
adamsc@orca.wv.tek.com
{decvax ucbvax hplabs}!tektronix!orca!adamsc
Interactive Technologies Division/Visual Systems Group
Tektronix, Inc.  P.O. Box 1000, M/S 61-049
Wilsonville, OR 97070  (503) 685-2589

------------------------------------------------------------------------------

To: sinix!claus@unido.UUCP
Cc: adamsc
Subject: Problems with XSTONES calculations in xbench

I have been using the xbench program and have a fairly major problem with
the algorithm used to compute xstones.  I believe the file xstones.awk is
incorrect in the following way.

The current algorithm uses the following mathematics (k is allWeight, the
total weight; the sums run over the individual tests; the successive
ratio_w lines follow the rescaling steps in xstones.awk):

    ratio_n = measured_value / sun_value

    ratio_w = sum of (weight x sun_value / measured_value)

              k x sum of (weight x sun_value / measured_value)
    ratio_w = ------------------------------------------------
                            sum of (weight)

                                          k
    ratio_w = -----------------------------------------------------------------
              k x sum of (weight x sun_value / measured_value) / sum of (weight)

                        k x sum of (weight)
    xstones = --------------------------------------------
              sum of (weight x sun_value / measured_value)

The algorithm should be changed to use the following mathematics:

    ratio_n = measured_value / sun_value

    ratio_w = sum of (weight x measured_value / sun_value)

              k x sum of (weight x measured_value / sun_value)
    xstones = ------------------------------------------------
                            sum of (weight)

In theory, the more heavily something is weighted, the more it should
affect the computed xstone.  I believe the following test cases indicate
the nature of the problem:

Test case a:

    measured_value[0] = 100     weight[0] = 300     sun_value[0] = 100
    measured_value[1] =  10     weight[1] = 600     sun_value[1] =  10

    xstone by old algorithm = 10000
    xstone should be        = 10000

Test case b:

    measured_value[0] =  50     weight[0] = 300     sun_value[0] = 100
    measured_value[1] =  20     weight[1] = 600     sun_value[1] =  10

    xstone by old algorithm = 10000
    xstone should be        = 15000

Test case c:

    measured_value[0] = 100     weight[0] = 300     sun_value[0] = 100
    measured_value[1] =  20     weight[1] = 600     sun_value[1] =  10

    xstone by old algorithm = 15000
    xstone should be        = 16666

A diagram of the correct results for these three test cases would be:

              case a      case b      case c
    xstone    10000       15000       16666

                                       __
                           __         |  |
                          |  |        |  |
                          |  |        |  |
               __         |  |        |  |
              |  |        |  |        |  |
    weight    |  |        |  |        |  |
      600     |  |        |  |        |  |
              |--|        |  |        |--|
    weight    |  |        |--|        |  |
      300     |__|        |__|        |__|

The context diffs at the end of this message should fix the problem.  If
you have any questions, please contact me at your earliest convenience.
Thanks for your help.

----
chuck adams

*** xstones.awk	Fri Aug 11 13:59:23 1989
--- xstones.awk.orig	Fri Aug 11 12:38:41 1989
***************
*** 128,134
  /rate =/ {
      if ( x != "dummy" ) {
          ratio = $3 / sunValue[x];
!         runtime["all"] = runtime["all"] + w*ratio;
          countedWeight["all"] = countedWeight["all"] + w;
          runtime[g] = runtime[g] + w*ratio;
--- 128,134 -----
  /rate =/ {
      if ( x != "dummy" ) {
          ratio = $3 / sunValue[x];
!         runtime["all"] = runtime["all"] + w/ratio;
          countedWeight["all"] = countedWeight["all"] + w;
          runtime[g] = runtime[g] + w/ratio;
***************
*** 131,137
          runtime["all"] = runtime["all"] + w*ratio;
          countedWeight["all"] = countedWeight["all"] + w;
!         runtime[g] = runtime[g] + w*ratio;
          countedWeight[g] = countedWeight[g] + w;
          x = "dummy"; w = 1
      }
--- 131,137 -----
          runtime["all"] = runtime["all"] + w/ratio;
          countedWeight["all"] = countedWeight["all"] + w;
!         runtime[g] = runtime[g] + w/ratio;
          countedWeight[g] = countedWeight[g] + w;
          x = "dummy"; w = 1
      }
***************
*** 154,159
      if (cw == 0) {
          print "TOTAL ? lineStones"
      } else {
          if (mw > 0) {
              text = "expected ";
          } else {
--- 154,160 -----
      if (cw == 0) {
          print "TOTAL ? lineStones"
      } else {
+         rt = (rt*allWeight)/cw;
          if (mw > 0) {
              text = "expected ";
          } else {
***************
*** 160,166
              text = "";
          }
!         ratio = rt / cw;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f lineStones",text,stones);
          print t;
--- 161,167 -----
              text = "";
          }
!         ratio = allWeight/rt;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f lineStones",text,stones);
          print t;
***************
*** 173,178
      if (cw == 0) {
          print "TOTAL ? fillStones"
      } else {
          if (mw > 0) {
              text = "expected ";
          } else {
--- 174,180 -----
      if (cw == 0) {
          print "TOTAL ? fillStones"
      } else {
+         rt = (rt*allWeight)/cw;
          if (mw > 0) {
              text = "expected ";
          } else {
***************
*** 179,185
              text = "";
          }
!         ratio = rt / cw;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f fillStones",text,stones);
          print t;
--- 181,187 -----
              text = "";
          }
!         ratio = allWeight/rt;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f fillStones",text,stones);
          print t;
***************
*** 192,197
      if (cw == 0) {
          print "TOTAL ? blitStones"
      } else {
          if (mw > 0) {
              text = "expected ";
          } else {
--- 194,200 -----
      if (cw == 0) {
          print "TOTAL ? blitStones"
      } else {
+         rt = (rt*allWeight)/cw;
          if (mw > 0) {
              text = "expected ";
          } else {
***************
*** 198,204
              text = "";
          }
!         ratio = rt / cw;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f blitStones",text,stones);
          print t;
--- 201,207 -----
              text = "";
          }
!         ratio = allWeight/rt;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f blitStones",text,stones);
          print t;
***************
*** 211,216
      if (cw == 0) {
          print "TOTAL ? arcStones"
      } else {
          if (mw > 0) {
              text = "expected ";
          } else {
--- 214,220 -----
      if (cw == 0) {
          print "TOTAL ? arcStones"
      } else {
+         rt = (rt*allWeight)/cw;
          if (mw > 0) {
              text = "expected ";
          } else {
***************
*** 217,223
              text = "";
          }
!         ratio = rt / cw;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f arcStones",text,stones);
          print t;
--- 221,227 -----
              text = "";
          }
!         ratio = allWeight/rt;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f arcStones",text,stones);
          print t;
***************
*** 230,235
      if (cw == 0) {
          print "TOTAL ? textStones"
      } else {
          if (mw > 0) {
              text = "expected ";
          } else {
--- 234,240 -----
      if (cw == 0) {
          print "TOTAL ? textStones"
      } else {
+         rt = (rt*allWeight)/cw;
          if (mw > 0) {
              text = "expected ";
          } else {
***************
*** 236,242
              text = "";
          }
!         ratio = rt / cw;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f textStones",text,stones);
          print t;
--- 241,247 -----
              text = "";
          }
!         ratio = allWeight/rt;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f textStones",text,stones);
          print t;
***************
*** 249,254
      if (cw == 0) {
          print "TOTAL ? complexStones"
      } else {
          if (mw > 0) {
              text = "expected ";
          } else {
--- 254,260 -----
      if (cw == 0) {
          print "TOTAL ? complexStones"
      } else {
+         rt = (rt*allWeight)/cw;
          if (mw > 0) {
              text = "expected ";
          } else {
***************
*** 255,261
              text = "";
          }
!         ratio = rt / cw;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f complexStones",text,stones);
          print t;
--- 261,267 -----
              text = "";
          }
!         ratio = allWeight/rt;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f complexStones",text,stones);
          print t;
***************
*** 268,273
      if (cw == 0) {
          print "TOTAL ? xStones"
      } else {
          if (mw > 0) {
              text = "expected ";
          } else {
--- 274,280 -----
      if (cw == 0) {
          print "TOTAL ? xStones"
      } else {
+         rt = (rt*allWeight)/cw;
          if (mw > 0) {
              text = "expected ";
          } else {
***************
*** 274,280
              text = "";
          }
!         ratio = rt / cw;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f xStones",text,stones);
          print t;
--- 281,287 -----
              text = "";
          }
!         ratio = allWeight/rt;
          stones = int(allWeight * ratio);
          t = sprintf("TOTAL %s %8.0f xStones",text,stones);
          print t;
-------
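[For anyone who wants to check the arithmetic above, the two weighting
schemes are easy to reproduce outside of xbench.  The awk sketch below is
not part of xbench; it plugs the test case b numbers into both formulas,
with 10000 standing in for the Sun-3/50 base rating.]

    # check.awk -- sketch, not part of xbench: rate test case b under
    # both the old (shipped) and new (patched) xstone weightings.
    BEGIN {
        m[0] = 50;  w[0] = 300;  s[0] = 100;    # test case b, element 0
        m[1] = 20;  w[1] = 600;  s[1] = 10;     # test case b, element 1
        base = 10000;                           # Sun-3/50 reference rating

        for (i = 0; i < 2; i++) {
            cw     += w[i];                 # countedWeight
            oldsum += w[i] * s[i] / m[i];   # old: sum of weight x sun/measured
            newsum += w[i] * m[i] / s[i];   # new: sum of weight x measured/sun
        }
        printf "old algorithm: %d xstones\n", base * cw / oldsum;   # 10000
        printf "new algorithm: %d xstones\n", base * newsum / cw;   # 15000
    }

[Running it with "awk -f check.awk" (a BEGIN-only program reads no input)
prints 10000 and 15000, matching test case b; substituting the case a or
case c values reproduces the other two results quoted above.]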
montnaro@sprite.crd.ge.com (Skip Montanaro) (08/26/89)
I also have some problems with the Xstones data.  The author uses results
from a Sun-3/50 as the base (10,000 *stones) for each category.  I tried
xbench on a Sun-3/50 with a 68881 and 8 megs (diskless) and only got about
a 7500 Xstone rating out of it.  It's no big deal for me, since I use the
relative magnitudes, not the absolute numbers.  The machine wasn't paging
at all, at least as far as our Excelan LANalyzer was concerned (essentially
no network traffic at all from the machine during the test).

--
Skip Montanaro (montanaro@sprite.crd.ge.com)
david@ms.uky.edu (David Herron -- One of the vertebrae) (08/26/89)
Funny, I was about to post a couple of questions on xbench & xstones ...

We've got an evaluation copy of an NCD-16, and I'm evaluating it against a
VaxStation 2000, which is the average-to-low-end workstation here.  I just
happen to have one in our office, see.  My perception -- and I spent two
full days using both and switching back and forth -- is that they are
equal in speed, and the NCD is faster at some things.  But maybe my eyes
aren't measuring the same sort of things the benchmarks are.  I'm looking
at things like iconizing/deiconizing windows, window refreshes, and so
forth.  Then I run xbench and get (old is the original calculations, new
is with the patches):

    Xstones:             old       new
    Vs2000 (unix:0)    27907    131994
    Vs2000 (ether)     17301    101474
    NCD-16              6657     12390

Anybody else measured these terminals?  Get similar numbers?  Have
comments on xbench itself?  Have I possibly made any mistakes?  (I did
follow the directions in the README ...)

BTW, the textStone numbers are very close (15863 for the Vs2000 over
ether, 14549 for the NCD-16), which is probably the basis for my opinion
that they're fairly equal.  I also realize that the on-board processors
are very different -- the NCD only has a 68000 -- so I'm not *completely*
surprised at the differences.

Don't get me wrong.  Performance comparable to a diskless Sun-3/50 at much
less network load (no swapping over the ether!) for at least half the
price is a good deal in my book.

Is anybody collecting Xstones numbers?
--
<- David Herron; an MMDF guy                          <david@ms.uky.edu>
<- ska: David le casse\*'  {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<-
<- "So raise your right hand if you thought that was a Russian water tentacle."
rgs@jeff.megatek.uucp (Rusty Sanders) (08/31/89)
From article <4344@orca.WV.TEK.COM>, by adamsc@shark.WV.TEK.COM (Chuck Adams):
> About two weeks ago I tried to contact Claus Gittinger with
> the following concerns I have about the XSTONES calculations
> in xbench.  Because I have concerns that someone may be
> using this program I will post this mail awaiting
> an answer from Claus.

  [description of problem and patches to fix it deleted]

I noticed when the benchmark first came in that the way it actually
calculated the synthetic stones numbers and the way it described that
calculation didn't quite agree.  Without looking too deeply at your patch,
it does appear to fix this discrepancy.

However, I'm not sure that the actual problem is in the code.  My
impression is that the algorithm as implemented is what it should be, and
the text description should be changed to reflect the code, not the other
way around.

Remember that the Xstones number is synthetic, and doesn't need to
represent any real comparison.  As long as different Xstones numbers can
be compared in some predictable fashion, all is well.

With the current algorithm, servers are rewarded for consistent
performance across all tested areas; bad performance in any one area can
seriously affect the Xstones number.  With the modified code, servers are
rewarded if any significant area performs very well, even if others
perform absolutely abysmally.  For a general benchmark, I suspect the
first behavior is better than the second.

It doesn't help that the benchmark base system (Sun 3/50, R3, no fpu) runs
arcs very slowly.  This allows any decent server to get arcStones in the
hundreds of thousands, if not the millions.  Even though arcStones makes
up a small percentage of the final Xstone, a small percentage of a HUGE
number is still a large number.  This seriously skews the Xstones values
for such machines.

These problems could be mitigated by using a better benchmark base.  But I
really feel that the current algorithm gives a better basis for comparison
than the algorithm described in the text and implemented in your patch.

Of course, none of this is to imply that xbench is really a great
benchmark.  Its biggest asset is that it comes up with one final number,
which can be used as a quick "general estimation" of a server's speed.  A
server could have a quite low Xstones rating and still be the best
price/performance solution for a particular application.  Likewise, a
server with a high Xstones rating could be a real dog for some
applications.  But, for a quick reference, I believe the Xstones number is
usable.  It at least sets a scale as a basis for further benchmarking
efforts.

----
Rusty Sanders, Megatek Corp. --> rgs@megatek or...
   ...ucsd!     ...hplabs!hp-sdd!     ...ames!scubed!     ...uunet!
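[The skew described above is easy to see numerically.  A sketch in the
same standalone awk style as before; the 600/100 weights and the 50x arc
ratio are invented for illustration, not xbench's real values.]

    # skew.awk -- sketch: one lightly weighted, very fast area (think
    # arcs) under the current (harmonic) and patched (arithmetic)
    # weightings.  Weights and ratios are invented for illustration.
    BEGIN {
        r[0] = 1;   w[0] = 600;     # heavily weighted test, exactly Sun speed
        r[1] = 50;  w[1] = 100;     # lightly weighted test, 50 x Sun speed
        base = 10000;

        for (i = 0; i < 2; i++) {
            cw    += w[i];
            harm  += w[i] / r[i];   # current scheme accumulates weight/ratio
            arith += w[i] * r[i];   # patched scheme accumulates weight*ratio
        }
        printf "current algorithm: %d xstones\n", base * cw / harm;    # ~11600
        printf "patched algorithm: %d xstones\n", base * arith / cw;   # 80000
    }

[Under the current scheme the fast, lightly weighted area barely moves the
total; under the patched scheme it multiplies it by eight, even though the
server is merely Sun-speed at everything that was weighted heavily.]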
rws@EXPO.LCS.MIT.EDU (Bob Scheifler) (08/31/89)
    Its biggest asset is that it comes up with one final number, which
    can be used as a quick "general estimation" of a server's speed.

Single-number benchmarks are absurd.
adamsc@shark.WV.TEK.COM (Chuck Adams) (09/01/89)
> However, I'm not sure that the actual problem is in the code.  My
> impression is that the algorithm as implemented is what it should be, and
> the text description should be changed to reflect the code, not the other
> way around.
>
> Remember that the Xstones number is synthetic, and doesn't need to
> represent any real comparison.  As long as different Xstones numbers can
> be compared in some predictable fashion, all is well.

I realize xstones are synthetic, but as such they are supposed to convey a
warm and fuzzy feeling.  In my case they do not, because the results are
exactly the opposite of what the documentation states.  Claus explicitly
stated "the weights are based on our experience ...".  Claus seems to have
spent a lot of effort working out the proposed weights, because much of
the documentation goes over them.  I think Claus intended the weights to
reward performance in certain areas and overlook minor deficiencies in
others.

> With the current algorithm, servers are rewarded for consistent
> performance across all tested areas; bad performance in any one area can
> seriously affect the Xstones number.

This is not the case.  The current algorithm rewards exceptional
performance in areas that are weighted less than others.  Take, for
instance,

Test case d:

    measured_value[0] =  50     weight[0] = 300     sun_value[0] = 100
    measured_value[1] =   5     weight[1] = 600     sun_value[1] =  10

    xstone by old algorithm = 15000
    xstone should be        = 10000

              old algorithm    should be
                 15000           10000
                  __
                 |  |
                 |  |
                 |  |             __
    weight       |  |            |  |
      600        |  |            |  |
                 |--|            |--|
    weight       |  |            |  |
      300        |__|            |  |
                                 |  |
                                 |__|

> With the modified code, servers are rewarded if any significant area
> performs very well, even if others perform absolutely abysmally.  For a
> general benchmark, I suspect the first behavior is better than the
> second.

If you really want the latter, then you will have to use a third algorithm
to compute xstones.

> It doesn't help that the benchmark base system (Sun 3/50, R3, no fpu)
> runs arcs very slowly.  This allows any decent server to get arcStones
> in the hundreds of thousands, if not the millions.  Even though
> arcStones makes up a small percentage of the final Xstone, a small
> percentage of a HUGE number is still a large number.  This seriously
> skews the Xstones values for such machines.

The point is that xstones are seriously skewed by either algorithm.  The
old algorithm rewards elements of the test that have low weights.  The new
algorithm rewards elements of the test that have high weights.  But the
latter is what the documentation states is intended.

> These problems could be mitigated by using a better benchmark base.  But
> I really feel that the current algorithm gives a better basis for
> comparison than the algorithm described in the text and implemented in
> your patch.

Again, the algorithm is biased to favor machines that perform better at
lesser-weighted elements of the test.  This is contrary to what the
documentation states.

> Of course, none of this is to imply that xbench is really a great
> benchmark.  Its biggest asset is that it comes up with one final number,
> which can be used as a quick "general estimation" of a server's speed.
> A server could have a quite low Xstones rating and still be the best
> price/performance solution for a particular application.  Likewise, a
> server with a high Xstones rating could be a real dog for some
> applications.  But, for a quick reference, I believe the Xstones number
> is usable.  It at least sets a scale as a basis for further benchmarking
> efforts.
For all you golfers out there: Xstones is about as useful as standing on
the tee box and throwing grass up in the air to tell how to play the hole.
I rather doubt that it sets any kind of scale for benchmarking X.

----
chuck adams
adamsc@orca.wv.tek.com
{decvax ucbvax hplabs}!tektronix!orca!adamsc
Interactive Technologies Division/Visual Systems Group
Tektronix, Inc.  P.O. Box 1000, M/S 61-049
Wilsonville, OR 97070  (503) 685-2589
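[For readers who want to rerun these comparisons on their own profiles,
both formulas fit in one small awk filter.  A sketch, hypothetical and not
part of xbench: feed it one test per line as "measured sun_value weight"
and it rates the profile both ways against a 10000-stone base.]

    # rate.awk -- sketch, not part of xbench: rate an arbitrary profile
    # under both xstone weightings.  Input lines: measured sun_value weight.
    {
        ratio  = $1 / $2;           # ratio_n = measured_value / sun_value
        cw    += $3;                # counted weight
        harm  += $3 / ratio;        # old scheme's accumulation
        arith += $3 * ratio;        # new scheme's accumulation
    }
    END {
        base = 10000;               # Sun-3/50 reference rating
        if (cw > 0) {
            printf "old algorithm: %d xstones\n", base * cw / harm;
            printf "new algorithm: %d xstones\n", base * arith / cw;
        }
    }

[For example, "printf '100 100 300\n20 10 600\n' | awk -f rate.awk"
reproduces test case c: 15000 under the old algorithm, 16666 under the
new.]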
jim@EXPO.LCS.MIT.EDU (Jim Fulton) (09/01/89)
    Claus explicitly stated "the weights are based on our experience ...".
    Claus seems to have spent a lot of effort working out the proposed
    weights, because much of the documentation goes over them.  I think
    Claus intended the weights to reward performance in certain areas and
    overlook minor deficiencies in others.

I'll even go further and say that this is why any single number is useless
without knowing the context in which it was generated.

I find it easiest to think of the rating as the cross-product of the
various request timings (including things like clipping, whether or not
software cursors are used, number of subwindows, etc.) and the weighted
profile of the application to be modeled (i.e. the relative importance of
each element in the set of server timings).  By plugging in different
application profiles, you'll get radically different ratings for a single
server.

In other words, a server that is acceptable for software development may
be completely unusable for CAD, imaging, wysiwyg, etc.
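[A sketch of that cross-product idea, with invented numbers: one fixed set
of server ratios rated under two hypothetical application profiles.
Neither profile is xbench's real weighting, and the new (arithmetic)
scheme is used for both.]

    # profiles.awk -- sketch: one server, two hypothetical application
    # profiles.  All ratios and weights are invented for illustration.
    BEGIN {
        # server speed relative to the reference machine, per request class
        r["text"] = 2.0;  r["line"] = 1.0;  r["arc"] = 0.2;

        # two application profiles: relative importance of each class
        we["text"] = 700;  we["line"] = 250;  we["arc"] =  50;   # editor-ish
        wc["text"] = 100;  wc["line"] = 400;  wc["arc"] = 500;   # CAD-ish

        base = 10000;
        for (x in r) {
            ecw += we[x];  esum += we[x] * r[x];
            ccw += wc[x];  csum += wc[x] * r[x];
        }
        printf "editor profile: %d stones\n", base * esum / ecw;   # 16600
        printf "CAD profile:    %d stones\n", base * csum / ccw;   #  7000
    }

[Same server, same timings, and the rating more than doubles depending on
which application profile is plugged in.]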
adamsc@shark.WV.TEK.COM (Chuck Adams) (09/01/89)
In article <8909011324.AA03731@expo.lcs.mit.edu>, jim@EXPO.LCS.MIT.EDU (Jim Fulton) writes:
> I'll even go further and say that this is why any single number is
> useless without knowing the context in which it was generated.

Right on.  I totally agree.

> I find it easiest to think of the rating as the cross-product of the
> various request timings (including things like clipping, whether or not
> software cursors are used, number of subwindows, etc.) and the weighted
> profile of the application to be modeled (i.e. the relative importance
> of each element in the set of server timings).  By plugging in different
> application profiles, you'll get radically different ratings for a
> single server.

But as of this point we only have one profile.  Even if it is poorly
documented, it is documented, and it seems like someone out there is using
it with many misconceptions.  As I stated before, it is not an unbiased
weighting scheme, and as such the bias should match the documentation.

> In other words, a server that is acceptable for software development may
> be completely unusable for CAD, imaging, wysiwyg, etc.

Exactly.  Thanks for the clarity.