Introduction to Dynamic Grading

By the WCF Ranking Review Committee:

Ian Burridge, Chris Dent, Jonathan Kirby, David Maugham, Louis Nel and Chris Williams

Contents

Grading Systems
Simple Grading Systems
Measurement of Grade Difference Accuracy
Performance Deviation Trends (PDT)
Measurement of Wild Performance Handling
Wild Starting Performance
Measurement of Volatility
Class Factors
Variable Modulators
Dynamic Grading
Summary
Glossary of Abbreviations
Q&A

 

Recently the new option Dynamic Grading (DG) appeared among the systems on the Croquet Records website. This article introduces the DG system and explains various new insights obtained en route to its creation.

We gradually came to the realisation that the behaviour of every system under consideration can be explained in terms of simple systems. A simple system (for this purpose) is one in which all postgame grade adjustments are multiples of a fixed modulator M; for example, when M = 20 it is the familiar system I_20. Even the Bayesian system, despite its different origin, produces after each game approximately the effect of some I_M - an M which varies from game to game and from player to player, but in a controlled way. So the early part of this article is devoted to a study of how the behaviour of a simple system I_M changes when M changes. We focus mainly on grade difference accuracy and handling of outdated grades (like those of rapid improvers). Description of behaviour in plain English is possible but not nearly as effective as quantified description. So we introduce appropriate statistics.

In the development of Dynamic Grading we retained the idea (seen implicitly in Bayesian Grading) of a variable modulator, i.e. a modulator M that varies from game to game with the individual player. However, the classification of players for this purpose in the Bayesian approach (based on periods of inactivity only) is inefficient. It is replaced in Dynamic Grading by a refined process which derives the size of M from the recent performance history of the player - more precisely, from the extent to which observed performance deviated from expected performance. The underlying idea of Dynamic Grading is that small performance deviation calls for small grade adjustments while large performance deviation calls for large adjustments - to bring outdated grades back in line more quickly.

Grading Systems

By grading system we mean a procedure that assigns and maintains for each player A a number G(A), the grade of A, in such a way that

(GS1) G(A) represents the most recently known performance level of player A,

(GS2) the grade difference G(A) - G(B) reflects, via the Classical Win Probability formula WP(A,B) = 1 / (1 + 10^((G(B) - G(A)) / 500)), the probability that A would beat B if they were to play a game.

Any ranking system with property (GS1) allows players to be ranked in the order determined by their grades. However, not every ranking system is a grading system. For example, by putting G(A) = win percentage of A over the last 9 games, we get a system with property (GS1) which does not allow meaningful calculation of win probabilities as in (GS2).

No ranking system can state with certainty that A is better than B. A grading system can at least indicate a probability that A would beat B if they were to play a game, namely the probability reflected by the grade difference. So a grade difference of 30 compared to 90 will suggest a win probability of 0.53 compared to 0.60 (assuming the grade differences to be accurate). The following table provides examples of how grade differences (GDif) correspond to win probabilities (WP) for the higher-graded player.

GDif    0     30    60    90    120   150   180   210   240   270   300
WP      0.50  0.53  0.57  0.60  0.63  0.67  0.70  0.72  0.75  0.78  0.80

GDif    330   360   390   420   450   480   510   540   570   600
WP      0.82  0.84  0.86  0.87  0.89  0.90  0.91  0.92  0.93  0.94
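The correspondence above follows directly from the Classical Win Probability formula in (GS2). A minimal Python sketch (function and variable names are ours, purely for illustration) reproduces the table:

```python
def classical_win_probability(grade_a, grade_b):
    """WP(A,B) = 1 / (1 + 10^((G(B) - G(A)) / 500))."""
    return 1.0 / (1.0 + 10 ** ((grade_b - grade_a) / 500.0))

# Win probability of the higher graded player for each grade difference.
for gdif in range(0, 601, 30):
    wp = classical_win_probability(gdif, 0)   # the higher graded player is gdif points above
    print(f"GDif {gdif:3d}  WP {wp:.2f}")
```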

This kind of comparison is the most that the public can expect from a system, but it should not expect anything less. It is worth noting that a difference in rank position conveys less information than a grade difference. On a recent ranking list there was a grade difference of 151 (win probability 0.67) between rank positions 13 and 43 but a grade difference of only 73 (win probability 0.58) between positions 51 and 81. So a difference of 30 rank positions near the top means something significantly different from 30 lower down. In view of all this we devote considerable attention below to development of statistical testing of grade difference accuracy.

In view of (GS1) a grading system implicitly predicts a winner for the next game of every player. The sooner that next game takes place the more accurate the prediction. While it cannot reliably predict performance several months into the future, a grading system can aid guesses about such performance.

Simple Grading Systems

It is tacitly assumed, for every grading system, that a starting grade is assigned to each player by the ranking officer. After that, grades become adjusted in accordance with game results. These adjustments differ from one grading system to the next. Some systems do them game by game, others event by event. In fact, a grading system is defined by how it does its grade adjustments. We will denote the winner and loser of a game respectively by W and L.

The system I_20, whose grades will be denoted IG (with the 20 understood), is defined by declaring that grade adjustments are done game by game as follows. Recall that WP(L,W) denotes the loser's win probability.

new IG(W) = old IG(W) + 20 * WP(L,W)

new IG(L) = old IG(L) - 20 * WP(L,W)

To illustrate, let us suppose the pregame data IG(W) = 2400 and IG(L) = 2200, thus IG(W) - IG(L) = 200. By applying CWP we get WP(L,W) = 1 / (1 + 10^(200 / 500)) = 1 / (1 + 2.511886) = 0.28. This gives the postgame data IG(W) = 2400 + 20 * 0.28 = 2405.6 and IG(L) = 2200 - 20 * 0.28 = 2194.4.

The number 20 that appears as multiplier in postgame updates (in the term 20 * WP(L,W)) is the same for all players and all games in all events. We can put any positive number M in the role that 20 had in the system I_20 and obtain a grading system I_M that way. Its postgame adjustments will be like those of I_20 except that the term M * WP(L,W) will be substituted for 20 * WP(L,W). The number M, called the modulator of the system, is constant, i.e. has the same value for all players in all games of all events. We call such I_M a simple grading system.
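As a concrete sketch (Python; names are ours), the I_M postgame update can be written as follows. With M = 20 it reproduces the worked example above (the text rounds WP(L,W) to 0.28 before multiplying, so its figures differ slightly in the last decimal):

```python
def postgame_update(ig_winner, ig_loser, m):
    """Postgame adjustment of the simple system I_M: both grades move by
    M * WP(L,W), the loser's pregame win probability times the modulator."""
    wp_loser = 1.0 / (1.0 + 10 ** ((ig_winner - ig_loser) / 500.0))
    return ig_winner + m * wp_loser, ig_loser - m * wp_loser

new_w, new_l = postgame_update(2400, 2200, m=20)
print(round(new_w, 1), round(new_l, 1))   # about 2405.7 and 2194.3
```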

In I_M the postgame update algorithm is quite clear as regards how the calculations are done. This clarity does not mean the system is easy to understand. Having introduced any grading system it is reasonable to ask: will it work? In particular, will I_20 work? Will I_50 work? These are not easy questions to answer in a definite way by mere inspection of the algorithm. It is not even clear what we would mean by “work”. It certainly cannot mean that all grades have to be accurate all the time. For suppose there is a magic moment when they actually are. Then after the very next game the winner’s grade has to go up by some points, say 11 points, and the loser’s grade then has to go down by 11 points. The two players are now suddenly 22 points further apart than before the game - when they were supposed to be in perfect relation to each other. And a winner who was overrated by 100 points before the game would be overrated by 111 points after the game. This latter situation shows that a player’s grade does not necessarily become more accurately estimated when more games are played. These considerations reveal that even if all players remain at a constant performance level it is hard to be confident about the success of a grading system on the basis of its update algorithm as such. In the real world players definitely do not remain at a constant performance level. That complicates the situation. In fact, the unpredictable fluctuation of player performance level is the greatest challenge that a grading system has to face.

Measurement of Grade Difference Accuracy

On a ranking list in October 2010 the system I_20 had Chris Clarke 78 grade points above Robert Fulford while I_50 had him 148 points above. Which system gives the more credible grade difference? The situation confronting us points to the need for an objective method to assess grade difference accuracy.

We will use as test games all the 160324 World Ranking games from January 2000 until mid-October 2010. For each test game g, the CGS database gives a winner and a loser; let us denote them Winner(g) and Loser(g). The grading system whose grade difference accuracy we are assessing provides the following additional information:

HGP(g) = Higher Graded Player in game g

HWP(g) = Win Probability of HGP(g) as determined by the grading system.

We are now going to describe the chi2 (i.e. chi-squared) statistic - a standard statistical tool - that will enable us to assess the accuracy of these numbers HWP(g). The basic idea is to see how well the observed wins agree with the expected wins. The GDif vs WP correspondence (see above table) is in fact a one-to-one order-preserving correspondence. So an assessment of HWP(g) accuracy is at the same time an assessment of Grade Difference accuracy.

Let us divide the probability interval 0.5 to 1.0 into 10 subintervals I(1), I(2), …I(10) of equal length 0.5 / 10. So I(1) consists of values between 0.50 and 0.55, I(2) consists of values between 0.55 and 0.60, and so on. With the number of subintervals understood, each I(k) is identified by its lower boundary point. By using these subintervals I(k), we partition the set TG of test games into pairwise disjoint subsets B(k), called buckets, by defining B(k) to consist of all g in TG such that HWP(g) lies in the subinterval I(k). For example, the bucket B(1) consists of all games in which the win probability of the higher ranked player was between 0.5 and 0.55. We then assemble the following data for each B(k).

G(k) = total number of games in B(k)

OW(k) = total of Observed Wins by HGP(g) for g in B(k)

EW(k) = Sum_g HWP(g) (g in B(k))

V(k) = Sum_g HWP(g) * (1 - HWP(g)) (g in B(k))

Z(k) = (OW(k) - EW(k)) / sqrt(V(k))

Here Sum_g HWP(g) = HWP(g_1) + HWP(g_2) + .. + HWP(g_G(k)), where g_1, g_2, .. g_G(k) is a listing of the G(k) games that comprise the bucket B(k). Other expressions in terms of the notation Sum_g should similarly be interpreted as a sum of the indicated terms. The tables to follow list for each k the data for B(k), first for the system I_20, then for I_50.
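The following Python sketch (data layout and names are ours) shows how the quantities G(k), OW(k), EW(k), V(k) and Z(k) can be assembled from a list of test games, each recorded as the pair (HWP(g), 1 or 0 according to whether HGP(g) won):

```python
import math

def bucket_statistics(games, num_buckets=10):
    """Return (G, OW, EW, V, Z) for each bucket B(1)..B(num_buckets).

    `games` is a list of (hwp, hgp_won) pairs: the higher graded player's
    win probability as given by the grading system, and 1 or 0 according
    to whether that player actually won."""
    width = 0.5 / num_buckets
    stats = []
    for k in range(num_buckets):
        lo = 0.5 + k * width
        hi = lo + width
        bucket = [(hwp, won) for hwp, won in games if lo <= hwp < hi]
        g = len(bucket)
        ow = sum(won for _, won in bucket)                  # observed wins
        ew = sum(hwp for hwp, _ in bucket)                  # expected wins
        v = sum(hwp * (1 - hwp) for hwp, _ in bucket)
        z = (ow - ew) / math.sqrt(v) if v > 0 else 0.0
        # The Z values are later combined into chi2 (sum of squares) and
        # GDev (root mean square), as described below.
        stats.append((g, ow, ew, v, z))
    return stats
```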

Tabulation of Bucket Data

I_20
  k    I       G      OW        EW         V       Z
  1   0.50   26616   13955   13964.3    6632.4   -0.11
  2   0.55   24331   14018   13980.4    5942.4    0.49
  3   0.60   22087   13780   13788.8    5175.9   -0.12
  4   0.65   19173   13009   12931.5    4205.6    1.20
  5   0.70   16832   12301   12195.1    3356.0    1.83
  6   0.75   14478   11164   11211.2    2526.7   -0.94
  7   0.80   12579   10425   10371.9    1817.3    1.25
  8   0.85   10711    9394    9365.6    1174.2    0.83
  9   0.90    8679    8057    8018.6     608.4    1.56
 10   0.95    4838    4721    4683.3     149.1    3.09
 total      160324  110824  110511

I_50
  k    I       G      OW        EW         V       Z
  1   0.50   21359   11079   11204.7    5322.5   -1.72
  2   0.55   20952   11785   12045.5    5116.0   -3.64
  3   0.60   19760   11950   12345.2    4628.3   -5.81
  4   0.65   18264   11900   12324.6    4004.1   -6.71
  5   0.70   17084   11960   12380.7    3404.9   -7.21
  6   0.75   15519   11406   12021.4    2706.1  -11.83
  7   0.80   14240   11278   11743.1    2056.1  -10.26
  8   0.85   13074   11178   11438.0    1428.5   -6.88
  9   0.90   11477   10322   10611.8     797.6  -10.26
 10   0.95    8595    8310    8348.6     237.9   -2.51
 total      160324  111168  114464

The difference OW(k) - EW(k) should be close to zero for a system with accurate HWP(g), or equivalently, with accurate grade differences. A direct comparison of the difference OW(k) - EW(k) produced by I_20 with the corresponding difference produced by I_50 is unsatisfactory because the number of games in the set B(k) of I_20 will be different from that of I_50. It is for the sake of more comparable numbers that the quantity Z(k) is introduced. The chart that follows, which plots these Z-statistics for the two systems, allows easy visual comparison.

[Graph: Z-statistic per bucket for I_20 (blue) vs I_50 (red)]

It is clear that while the blue plot of I_20 remains close to zero, the red plot of I_50 wanders off well below it. The fact that all Z-values of I_50 are negative arises from the fact that OW - EW is always negative, i.e. the win probabilities of I_50 are always higher than they ought to be.

To express the closeness to zero in terms of a single number, the chi2 statistic is introduced. It is the sum of the squares of the Z-values:

chi2 = Z(1)^2 + Z(2)^2 + ... + Z(10)^2.

It takes into account the entire set TG of test games and provides a general measure of win probability accuracy, thus of grade difference accuracy: the smaller the chi2 the better the accuracy. Computation of this statistic for the above two systems from the above data yields

chi2(I_20) = 20.13 and chi2(I_50) = 551.04

Thus, in comparison with I_50, the system I_20 gives more accurate win probabilities which means more accurate grade differences.

The number of buckets is chosen by the user. More buckets give more accurate comparisons provided there are enough test games to supply them all appropriately. In the present situation 100 buckets seems a near optimal choice, so that is what we are going to use for comparison of systems. The two choices of bucket count, applied to the two systems above, compare as follows:

100 buckets: chi2(I_20) = 118.09 and chi2(I_50) = 634.01

 10 buckets: chi2(I_20) = 20.13 and chi2(I_50) = 551.04

The chi2 obtained with 10 buckets could flatter the system because the sum OW - EW could produce a lot of cancellation in the case of a large subinterval. This is particularly true in the case of I_20, where the OW - EW values have different signs from time to time. In the case of I_50, where win probabilities are systematically overrated, there is less cancellation.

The chi2 statistic can be transformed as follows. We define the Grade Deviation of a system to be the square root of the average of the squares of the Z-values. So

GDev = sqrt((1 / m) * chi2) and chi2 = m * GDev^2,

where m is the number of buckets used. Clearly GDev is equivalent to chi2 in the sense that each can be derived from the other via a one-to-one order-preserving transformation. This is how GDev compares:

100 buckets: GDev(I_20) = 1.09 and GDev(I_50) = 2.52

 10 buckets: GDev(I_20) = 1.42 and GDev(I_50) = 7.42

It can be seen that GDev has magnitude similar to that of the Z-values which it is supposed to represent in a single number, while chi2 gives a larger value. The chi2 statistic is the natural choice for testing of an hypothesis (e.g. for a decision that something is good or bad), but for our purposes, where a more continuous scale of comparison is desirable, the GDev statistic is more suitable. So we systematically use GDev with 100 buckets from now on to measure grade difference accuracy. Applying it, we arrive at the following comparison of simple systems of interest to us.

system   GDev
I_16     1.21
I_20     1.09
I_24     0.99
I_35     1.53
I_50     2.52

I_20 has long appeared on the Croquet Records website, so it is familiar to readers. I_16, I_24 and I_35 are of interest for the role they are destined to play in Dynamic Grading, as will later be seen. I_50 is of interest because of its close connection to the current official system CGS.

If all players kept performing at a near constant level, the smaller modulator of I_16 ought to have given it an advantage over the others because once the grades had settled down its postgame grade adjustments would be less disruptive. If rapidly changing performance levels were the order of the day, I_35 would have the advantage because the larger postgame adjustments would enable its grades to catch up more quickly, while not disrupting the grades as much as I_50 does. It can be seen that a grading system needs to find a compromise between conflicting demands, and that practical testing is essential for comparing systems. Let us now proceed towards quantification of performance fluctuation - the most difficult obstacle for any ranking system to overcome.

Performance Deviation Trends (PDT)

To detect a rapid improver is not simple. An upset win could be a sign of rapid improvement or it could just be a fluke. Even when A is clearly much better than B, with a 90% win probability, B is still expected to win 10% of games against A. Close to 30% of all croquet results are upsets, so it is not helpful to draw conclusions from just half a dozen or so games. On the other hand, to express rapid improvement in terms of the preceding 40 games would be more valuable than in terms of the preceding 80. Experiments have pointed to the preceding 30 games of a player as a useful starting point for studying performance fluctuation.

Given any grading system and any list of consecutive games, we can assemble the data displayed in the following table as an example. It starts with game 41 of the real player whose data we are using. This 41 is an arbitrary choice for the purpose of illustration. We list only the data relevant to our purpose. An entry 1 appears in column W if the player won the game, 0 appears otherwise. Column WP lists the player’s win probability in every game. The V column entries v(g) (for game g) are computed from the WP entries as follows: v(g)= WP(g) * (1 - WP(g)).

Game   W     WP       V      RPD    PDT
  41   1   0.4167   0.2431
  42   1   0.1952   0.1571
  43   1   0.5536   0.2471
  44   0   0.0492   0.0468
  45   1   0.3075   0.2129
  46   1   0.6244   0.2345
  47   0   0.3457   0.2262
  48   0   0.456    0.2481
  49   0   0.4166   0.243
  50   1   0.3509   0.2278
  51   1   0.3938   0.2387
  52   0   0.2498   0.1874
  53   0   0.2374   0.181
  54   1   0.0232   0.0227
  55   1   0.1372   0.1184
  56   1   0.0648   0.0606
  57   0   0.1954   0.1572
  58   0   0.1648   0.1377
  59   0   0.0446   0.0426
  60   1   0.3904   0.238
  61   1   0.11     0.0979
  62   0   0.1219   0.107
  63   1   0.1137   0.1008
  64   1   0.1329   0.1152
  65   1   0.1565   0.132
  66   0   0.2129   0.1675
  67   1   0.2036   0.1621
  68   0   0.2403   0.1825
  69   0   0.3102   0.214
  70   1   0.1193   0.1051   4.38
  71   0   0.2091   0.1654   4.06
  72   1   0.7173   0.2028   3.80
  73   1   0.1104   0.0982   4.07
  74   1   0.5563   0.2468   4.20
  75   0   0.4663   0.2489   3.67
  76   1   0.1537   0.1301   3.92
  77   0   0.6735   0.2199   3.77   3.98

Consider the set of test games

Upto70 consisting of the 30 games 41, 42, .. 70.

There are further data of interest defined for the set Upto70, as follows:

OW70 = Sum_g W(g) (g in Upto70)

EW70 = Sum_g WP(g) (g in Upto70)

V70= Sum_g v(g), where v(g)= WP(g) * (1 - WP(g)) (g in Upto70)

RPD70 = (OW70 - EW70) / sqrt(V70)

Thus OW70, EW70, V70 are respectively the sums over games 41 to 70 of the entries in columns W, WP, V. Since OW70 - EW70 expresses the difference between the totals of observed wins and expected wins, it can be seen that RPD70 (Recent Performance Deviation of the player at game 70) expresses the extent to which the observed performance of the player deviated from expected performance over the set of 30 games that ended at game 70. Note that RPD70 does not express inaccuracy of the grade at game 70. The effect produced by V70 is to give more weight to high disparity underdog wins than to lower disparity ones. The table on the left of the two below shows for various values of WP(g) the corresponding term v(g), and the table on the right indicates how 1 / sqrt(V) corresponds to V.

WP(g)   v(g)        V       1 / sqrt(V)
0.30    0.210       7.00    0.378
0.35    0.228       8.00    0.354
0.40    0.240       9.00    0.333
0.45    0.248      10.00    0.316
0.50    0.250      11.00    0.302

Generally, the smaller the WP(g) the smaller the v(g). So many underdog wins would make V70 smaller, thus 1 / sqrt(V70) larger, and therefore also RPD70 larger.

What we did above at game 70 can be done at any game from game 30 onwards. In particular, we can form the game sets

Upto71 = {g_42, g_43, .. g_71}, with attendant OW71, EW71, V71 and RPD71,

Upto72 = {g_43, g_44, .. g_72}, with attendant OW72, EW72, V72 and RPD72,

etc.

Upto77 = {g_48, g_49, .. g_77}, with attendant OW77, EW77, V77 and RPD77.

The values RPD71, RPD72, .. RPD77 respectively, are shown as entries in the RPD column.

All this brings us to the Performance Deviation Trend at game 77, denoted PDT77. It is defined to be the average of the eight RPD-values RPD70, RPD71, .. RPD77. More generally, at every game g

PDTg = average of the eight most recent RPD values, RPD(g-7) up to RPD(g).
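A minimal Python sketch of RPD and PDT as defined above (array handling and names are ours):

```python
import math

def rpd(wins, wps, g, window=30):
    """Recent Performance Deviation at game g (1-based), over the `window`
    games ending at g.  wins[i] is 1 or 0 and wps[i] is the player's win
    probability in game i+1."""
    block = range(g - window, g)                  # 0-based indices of the 30 games
    ow = sum(wins[i] for i in block)              # observed wins
    ew = sum(wps[i] for i in block)               # expected wins
    v = sum(wps[i] * (1 - wps[i]) for i in block)
    return (ow - ew) / math.sqrt(v)

def pdt(wins, wps, g, window=30, smoothing=8):
    """Performance Deviation Trend at game g: the average of the eight most
    recent RPD values, RPD(g-7) .. RPD(g)."""
    recent = [rpd(wins, wps, h, window) for h in range(g - smoothing + 1, g + 1)]
    return sum(recent) / smoothing
```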

Thus PDT is a smoothed version of RPD. The chart to follow illustrates this. It plots the game-by-game RPD and PDT values of an unsteady player.

[Graph: RPD vs PDT, game by game, for an unsteady player]

RPD does not perform as well as PDT in the role of the latter, even though RPD appears to be a more up-to-date assessment. We caution again that a PDT value does not quantify grade accuracy of the player at that time. It expresses the extent to which observed performance deviated from expected performance over the recent past. The chart to follow illustrates how the PDT unfolds for a steady player compared to an unsteady player. Both (real) players arrived at game 30 with a PDT close to zero - indicative of a pretty accurate starting grade.

[Graph: PDT of a steady player vs an unsteady player]

While PDT quantifies how much a player has been playing above or below expectation, the quantification could be roughly translated in terms of grade differences. As a rule of thumb, pdt = 92 * PDT gives approximately the corresponding grade difference. So a player with PDT = 2.29 has performed (over the preceding 37 games) at a level approximately 92 * 2.29 = 210 points higher than his grades over that period would suggest. On ranking lists the pdt value will be listed rather than PDT (= pdt / 92) since readers are more familiar with the grade point scale.

The player statistic PDT will now be used to quantify how well a system handles grading of rapid improvers and other unsteady performers.

Measurement of Wild Performance Handling

By wild performance will be meant the performance of a player whose PDT satisfies abs(PDT) > 2.2; this effectively means that the observed performance over the preceding 37 games has been about 200 grade points above or below expectation. The value 2.2 is chosen for definiteness to serve as benchmark. Rapid improvers and rapid sliders are the most obvious sources of wild performance. Somebody who returns to competitive play after being inactive for a few years is another wild performance prospect. A grading system is not to blame for existence of wild performances, but it is responsible for handling them well.

The presence of wild performance among players who have played hundreds of games in the system shows that more games played does not imply a more accurate grade.

By considering a game by game comparison of PDT values under I_30 and I_20 of a rapid improver like Rutger Beijderwellen we get an illustration of how these systems cope. Under I_30 his PDT rose to the value 2.2 at game 37 and stayed above the benchmark of 2.2 until game 130. Under I_20 his performance became wild at game 38 and stayed that way until game 138. Thus I_20 allowed a wild streak that lasted 7 games longer than what I_30 allowed. At the end of this streak his rank position under I_30 was 80th compared to 111th under I_20. (A higher I_30 grade does not necessarily translate into a better I_30 rank position because the system may grade rival players higher too). When a player’s performance is wild, not only will that player be graded wrongly, but the postgame adjustments of all his/her opponents will be wrong, so their grades will become less accurate, and the effect propagates through the system. Since every player ought to be treated as fairly as possible, it is clearly desirable for a grading system to minimise the occurrence of wild performance. We need a separate statistic to gauge this ability of a system. To this end we introduce the Percentage of Wild Performance Games (PWPG) by putting

PWPG = 100 * (WPGtotal) / (GamesTotal),

where WPGtotal means the number of test games in which at least one of the two players had wild performance, and GamesTotal is the total number of games considered. The games considered for this purpose consist of all games in the set TG (games since 2000) in which at least one of the two players has at least 30 games in the system (so that PDT can be calculated). The smaller the PWPG the better the handling.
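As a sketch (Python, names ours), PWPG is the percentage of eligible test games in which either player's pregame PDT exceeds the benchmark in absolute value:

```python
def pwpg(test_games, benchmark=2.2):
    """Percentage of Wild Performance Games.  `test_games` holds one
    (pdt_a, pdt_b) pair per game, the pregame PDT of each player, with
    None for a player who has fewer than 30 games and so no PDT yet."""
    eligible = [(a, b) for a, b in test_games if a is not None or b is not None]
    wild = sum(1 for a, b in eligible
               if (a is not None and abs(a) > benchmark)
               or (b is not None and abs(b) > benchmark))
    return 100.0 * wild / len(eligible)
```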

We can estimate wild performance prevalence by counting the number of players with wild performance on a typical ranking list. The following table gives this count for the January ranking list of I_24 in each of the years 2000 to 2009 (RLcnt = number on list, Wcnt = number of players with Wild Performance at the time of the listing, W% = 100 * Wcnt / RLcnt).

Year      RLcnt   Wcnt    W%
2000       645     30    4.70%
2001       686     27    3.90%
2002       743     23    3.10%
2003       741     20    2.70%
2004       855     38    4.40%
2005       833     28    3.40%
2006       785     33    4.20%
2007       775     34    4.40%
2008       778     32    4.10%
2009       826     30    3.60%
Overall   7667    295    3.80%

This survey suggests that, at any moment, close to 4% of active players will have wild performance. The table to follow compares GDev and PWPG for a selection of simple systems.

system   GDev   PWPG
I_16     1.21   10.25
I_20     1.09    8.35
I_24     0.99    6.91
I_35     1.53    4.35
I_50     2.52    2.60

This table sheds additional light on the effect produced by selection of a larger or smaller modulator M. Generally, the larger the M the better the PWPG, but for M > 24 the better PWPG comes at the cost of a worsening GDev. So with larger M the relatively small sector of the player population with wild performance becomes better handled at the expense of worse handling of the larger population without it.

Wild Starting Performance

The ranking officer may have very little information about a new player who enters the database, so a player may start out with wild performance. The player statistic PDT at game 30 revealed the presence of nearly 150 players with wild starting performance, some going back 20 or more years. After a review of the starting grades of these players and retroactive correction, the GDev of the system improved dramatically. Wild performances appear to have a ripple effect arising from less accurate postgame adjustments to the wrongly graded players as well as their opponents.

Wild starting performance will in future be detected when the 30th game is reached and the starting grades will then be retroactively corrected. The statistic PDT is already worth maintaining for this reason alone.

Measurement of Volatility

One measure of the volatility of a system is the rapidity with which rank positions change from one monthly ranking list to the next. A system with good grade difference accuracy and good wild performance management has some claim that its rank positions change only as much as is necessary to maintain efficient grading. So the statistic about to be introduced may be considered redundant. We introduce it nevertheless to facilitate comparison of systems with regard to their volatility.

Let Rk denote the rank position of a player on a ranking list and PrevRk the position of that player on the ranking list of the preceding month. Then abs(Rk - PrevRk) reflects the change in rank position for that player for that month. For example, if Rk = 7 and PrevRk = 12 then abs(Rk - PrevRk) = abs(7-12) = 5. A player who did not appear on both of these lists is ignored for that month. We define the Average Rank Variation (ARV) of a system to be the average value of all terms abs(Rk - PrevRk) that arise over the entire test period (January 2000 until October 2010.) The table to follow compares ARV with other system statistics in case of selected simple systems.

system   GDev   PWPG    ARV
I_16     1.21   10.25    7.79
I_20     1.09    8.35    8.51
I_24     0.99    6.91    9.20
I_35     1.53    4.35   10.92
I_50     2.52    2.60   13.03

It is clear that as the modulator M becomes larger the volatility of the system keeps increasing. To optimise GDev the modulator M needs to be large enough to reflect rapidly changing performance levels but not so large that it disrupts the grades of steady players too much.
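For concreteness, the ARV calculation described in this section can be sketched as follows (Python, names ours; each monthly list is assumed to be a mapping from player to rank position):

```python
def average_rank_variation(monthly_lists):
    """Average of abs(Rk - PrevRk) over all consecutive pairs of monthly
    ranking lists, counting only players who appear on both lists."""
    diffs = []
    for prev, curr in zip(monthly_lists, monthly_lists[1:]):
        for player, rank in curr.items():
            if player in prev:
                diffs.append(abs(rank - prev[player]))
    return sum(diffs) / len(diffs)
```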

Class Factors

Every system I_M can be modified into a new system Icf_M through introduction of Class Factors, as follows. For a Class 1 event (very prestigious) the modulator M becomes multiplied by the Class Factor 1.2, in a Class 2 event it remains unchanged (= multiplied by 1.0), and in a Class 3 event (typically a consolation event) it becomes multiplied by 0.8. (The ranking officer determines the class of each event.) Class factors need consideration because the current official system CGS is based on Icf_50. So let us examine what introduction of Class Factors does to the statistics of simple systems. Since the game sample of over 160,000 games contains only about 14,000 games in each of Class 1 and Class 3, one should not expect a great influence.

The system Icf_50 is of special interest because the current official system CGS is based on it in that the CGS grade, here denoted CG, is equal to CI (the Icf_50 grade) at the start and thereafter updated after every game as follows:

newCG = s * oldCG + (1 - s) * newCI

where s = 0.9 when oldCG < 2000, and s = 0.80 + (oldCG - 1000) / 10000 otherwise, subject to a maximum value of 0.97. Thus the CGS postgame grade adjustment algorithm is recursive. It looks elegant but hides the unpleasant fact that every new grade can be expressed explicitly as a weighted average of all preceding CI-values. This is the root cause of the notorious lag-effect of the CGS grade. The system statistics for the systems with class factors are as follows.

system   GDev   PWPG    ARV
Icf_16   1.19   10.34    7.75
Icf_20   1.07    8.39    8.47
Icf_24   0.96    6.95    9.15
Icf_35   1.40    4.45   10.86
Icf_50   2.60    2.68   12.95
CG       2.64   10.13    8.62
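For reference, the CGS grade update described above can be sketched as follows (Python, names ours); newCI here is the Icf_50 index after the game, computed with the simple-system update and a modulator of 50 times the event's class factor:

```python
def cgs_smoothing(old_cg):
    """Weight s given to the old CGS grade: 0.9 below 2000, otherwise
    0.80 + (oldCG - 1000) / 10000, capped at 0.97."""
    if old_cg < 2000:
        return 0.9
    return min(0.80 + (old_cg - 1000) / 10000.0, 0.97)

def cgs_grade_update(old_cg, new_ci):
    """newCG = s * oldCG + (1 - s) * newCI.  Because the rule is recursive,
    every grade is implicitly a weighted average of all preceding CI values,
    which is the source of the lag effect described above."""
    s = cgs_smoothing(old_cg)
    return s * old_cg + (1 - s) * new_ci
```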

Variable Modulators

Measurement of grade difference accuracy marked a turning point in our analysis of systems. Previously we used only PCP (Percentage of Correct Predictions - a measurement of rank position accuracy). It did not use the available information to the full extent that GDev does, because it effectively differentiated between systems only on the basis of low disparity games - when rank positions are precarious. The PCP of Bayesian Grading (BG) made it look good, but it turned out to have poor grade difference accuracy. However, the Bayesian approach served as a precursor of Dynamic Grading. It operates effectively like an I_M whose modulator M changes from game to game and from player to player. Its Standard Deviation (SD) produces this effect. When the SD gets incremented at the start of each event, the size of the increment increases with the period of inactivity of the player. This is consistent with the idea that players are often rusty after an absence, which causes their grade to become outdated. The increase in this virtual modulator causes the outdated grade to catch up more rapidly.

These observations led to several experiments. Some were enhancements of BG, e.g. incorporating increases to the SD for reasons other than inactivity. Others created simulations of BG which did not involve an SD but employed a variable modulator M instead, i.e. a factor in terms of which grade adjustments can be expressed in the form M * WP(L,W), similar to what is done in I_M. These experiments paved the way for the development of Dynamic Grading. The details are not worth reporting, being quite complicated and not leading to something we wish to pursue further. As regards BG itself, the experiments and their contemplation did cause us to realise that BG was hampered by the fact that it recognised only one source of unsteadiness, namely periods of inactivity. The desire to address this weakness was a major push towards development of Dynamic Grading.

There is one post-BG experiment whose details are quite simple and instructive. The system GG (Grade-driven Grading) is defined by the following postgame updates.

newGG(W) = oldGG(W) + M * WP(L,W)

newGG(L) = oldGG(L) - M * WP(L,W),

where the modulator M for each player separately is determined by oldGG as follows:

if GG < 2000 then M = 30; if GG > 2500 then M = 15

if GG lies between 2000 and 2500 then M is between 30 and 15 via linear interpolation.

The underlying assumption is that players with grade above 2500 are generally steady and therefore should get small adjustments; players with grades below 2000 are not yet steady and should accordingly get larger adjustments. These ideas were suggested by a system long used in chess. However, the grade differences of GG were not nearly as good as those of I_24, presumably because the underlying assumptions were not sufficiently realistic.
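A sketch of the GG modulator rule (Python, names ours):

```python
def gg_modulator(grade):
    """Grade-driven modulator: 30 below grade 2000, 15 above 2500, and
    linear interpolation between those two points in between."""
    if grade <= 2000:
        return 30.0
    if grade >= 2500:
        return 15.0
    return 30.0 - 15.0 * (grade - 2000) / 500.0
```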

Dynamic Grading

In the development of Dynamic Grading we retained the idea (seen implicitly in Bayesian grading) of a variable modulator. However, the classification of players for this purpose (based in BG on periods of inactivity only) is replaced in Dynamic Grading by a refined process which derives the size of M from the recent performance history of the player. It uses the extent to which observed performance deviated from expected performance, as reflected by the PDT. The underlying idea of Dynamic Grading is that small performance deviation calls for small grade adjustments while large performance deviation calls for large adjustments - to bring the outdated grades back in line more quickly.

The Dynamic Grading system is defined to have postgame updates that coincide with those of I_24 for the first 30 games of the player. From then on the updates are done much as for systems I_M, i.e.

newDG(W) = oldDG(W) + M_W * WP(L,W)

newDG(L) = oldDG(L) - M_L * WP(L,W),

where the modulators M_W and M_L are dynamically determined for each game by PDT_W and PDT_L (the PDT of winner and loser respectively) as follows:

M_W = f(PDT_W) and M_L = f(PDT_L), where f(x) = 16 + 19.2 * x^2 / (1 + x^2).

This means that the change of grade for the winner and loser will be different, and therefore (unlike a simple grading system) DG is not zero sum in grade adjustments.

The function f has a minimum value of 16 when x = 0. Since x^2 = (-x)^2, it is symmetric about the point x = 0 and increases smoothly as x^2 increases. Since x^2 / (1 + x^2) < 1 for all x, it follows that every modulator M always satisfies

16 <= M < 16 + 19.2 = 35.2.

For a new player the first PDT value to be calculated is PDT30: the average of RPD23 through RPD30. Those early RPD are calculated in terms of the available game results. It is not ideal but better than nothing. From game 37 on the system is in full stride.

A player who plays steadily according to grade, with PDT = 0.2 (say), will have M = 16.74, so such a player will get relatively small postgame adjustments, approximately like those of I_16. On the other hand, a rapid improver will have PDT > 2.2 and thus an M > f(2.2) = 31.9. This player will get larger postgame adjustments. A rapid slider with PDT < -2.2 will also have M > 31.9. However, M never exceeds 35.2. It is tempting to let M be determined by the RPD-value instead of the PDT-value. That simplifies the algorithm, but causes prohibitive loss of grade difference accuracy. Every streak of 30 games in a row can be expected to include about 10 upset results, so the fluctuation in the RPD-values is not surprising. That fluctuation creates the need for a smoothed version of the RPD-values, namely the PDT. It is also tempting to define M recursively. That too simplifies the algorithm, but with prohibitive loss of accuracy.
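Putting the pieces together, a sketch of the DG modulator and postgame update (Python, names ours); the values quoted above, f(0.2) = 16.74 and f(2.2) = 31.9, can be checked directly:

```python
def dg_modulator(pdt_value):
    """f(x) = 16 + 19.2 * x^2 / (1 + x^2), so 16 <= M < 35.2."""
    return 16.0 + 19.2 * pdt_value ** 2 / (1.0 + pdt_value ** 2)

def dg_update(dg_winner, dg_loser, pdt_winner, pdt_loser):
    """Postgame Dynamic Grading update.  Winner and loser get their own
    modulators, so the adjustment is not zero sum."""
    wp_loser = 1.0 / (1.0 + 10 ** ((dg_winner - dg_loser) / 500.0))
    new_w = dg_winner + dg_modulator(pdt_winner) * wp_loser
    new_l = dg_loser - dg_modulator(pdt_loser) * wp_loser
    return new_w, new_l

print(round(dg_modulator(0.2), 2), round(dg_modulator(2.2), 1))   # 16.74 31.9
```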

Here, for purposes of illustration, follows a Top 25 Dynamic Grading ranking list at 1 October 2010. It shows the pdt, PDT and M at that time. Since these statistics change gradually rather than abruptly they give an idea of what to expect for the next game.

Rank  Player             DG    pdt   GIP   WIP    PDT      M
  1   Chris Clarke      2717    40    25    24    0.43   18.98
  2   Reg Bamford       2657   -19    40    30   -0.21   16.80
  3   Robert Fulford    2632    40    88    63    0.43   19.04
  4   Paddy Chapman     2616    62   142   115    0.67   21.89
  5   Ed Duckworth      2535    89    35    24    0.97   25.29
  6   Aaron Westerby    2505    26   100    68    0.28   17.40
  7   Robert Fletcher   2502    -5   180   134   -0.05   16.05
  8   David Maugham     2499     4   143    99    0.04   16.04
  9   James Death       2499     1    74    50    0.01   16.00
 10   Ian Lines         2497   -20   119    88   -0.22   16.86
 11   Ben Rothman       2485   127   155   117    1.38   28.59
 12   Stephen Mulliner  2467  -106   173   121   -1.15   26.94
 13   Danny Huneycutt   2460    59    98    66    0.64   21.58
 14   Jamie Burch       2444    60    59    43    0.65   21.65
 15   Paul Skinley      2439    42   122    87    0.46   19.32
 16   R Beijderwellen   2437    54    65    42    0.59   20.93
 17   Samir Patel       2434   -32    84    53   -0.35   18.08
 18   Greg Bryant       2414    78    59    39    0.85   24.06
 19   Ian Dumergue      2402   -10    62    40   -0.11   16.23
 20   Bruce Fleming     2394    24    54    38    0.26   17.20
 21   Tony Le Moignan   2392    20    57    41    0.22   16.90
 22   Mark Avery        2379   143    97    67    1.55   29.57
 23   Jeff Dawson       2376   -35    87    58   -0.38   18.44
 24   Stephen Forster   2375   -26    84    51   -0.28   17.42
 25   Ian Burridge      2351     9    55    34    0.10   16.21

The system statistics of Dynamic Grading compare as follows with a selection of other systems, including a version DGcf of Dynamic Grading which employs class factors.

system   GDev    PWPG    ARV
DG       0.895    6.78   8.96
Icf_24   0.959    6.95   9.15
DGcf     0.964    6.85   8.91
I_24     0.986    6.91   9.20
CGS      2.642   10.13   8.62

Summary

The underlying idea in creating a good grading system is to keep the grades as accurate as possible all the time by ensuring that the grade adjustments of the players after each game are as appropriate as possible. The success of DG can be attributed in large measure to its direct testing of individual players to detect grades that are out of line and its effective remedial action in the light of this feedback. In this context, class factors often hamper the effectiveness of DG rather than help it, so overall they make the system worse. The systems BG and GG make indirect assumptions about the classes of players requiring smaller or larger adjustments. They are conspicuously less successful than DG. While it may generally be true that higher graded players have lower than average PDT, giving them all a small modulator turned out to be detrimental to good grading, as the GG experiment showed.

Glossary of Abbreviations

ARV = Average Rank Variation (see under Measurement of Volatility)

BG = Bayesian Grade or Bayesian Grading system

CGS = current official grading system

CG = Grade of CGS

CI = Index of CGS

DG = Dynamic Grade or Dynamic Grading system

GG = Grade Driven Grading (see under Variable Modulators)

GDev = Grade Deviation (see under Measurement of Grade Difference Accuracy)

I_M = simple grading system with modulator M

IG = Grade of a simple grading system I_M

Icf_M = system I_M with class factors

PDT = Performance Deviation Trend

pdt = 92 * PDT

PWPG = Percentage of Wild Performance Games (see under Measurement of Wild Performance)

RPD = Recent Performance Deviation (see under Performance Deviation Trends)

WP(A,B) = win probability of A over B


Q&A

Bob Kroeger writes:

I'm mathematically challenged but am interested in trying to understand the new system. Could a lay description of what you’re discussing be posted from time to time? If not, no worries.

Louis Nel responds:

We are introducing a measure of Grade Deviation (the statistic GDev) which, roughly speaking, measures the efficiency of a system, while ignoring the detailed description of how precisely this number GDev is arrived at. If readers in that position look at a column of numbers where the GDev of various systems is compared, they will be able to see that one system has a better GDev than another. In this way I was hoping they would get enough of the gist of what is going on.

Jonathan Kirby responds:

I'll try to give a quick overview of what Dynamic Grading (DG) does.

If you look at the DG ranking list, you will see two important numbers for each player: the grade and the pdt (performance deviation trend).

The grade is simple - it is the system's best guess at how good the player is, in terms of their recent (and predicted future) form. When you win a game, your grade goes up and when you lose it goes down.

The rest of the system is all about how much it goes up or down. Just like the existing systems, if you beat someone with a high grade then you get more points than if you beat someone with a lower grade. So if you beat Robert Fulford, expect a bumper reward, but if you beat a novice who has just started playing ranking games and is at the bottom of the rankings, don't expect it to affect your grade much. On the other hand, if you lose to your local novice then expect to see a bigger effect.

So far the system is simpler than the existing one, where it is possible to win a game and for your grade to go down instead of up. So what is pdt? Roughly it measures how well you have played to your grade over the last 37 games (or how much you have deviated from your grade). So pdt = +100 means you have played about 100 points ahead of your grade, and pdt = -50 means 50 points under. Note, this is how well you played compared to your grade at the time of each game, not how well you played compared with your current grade. The current grade is still the system's best guess at how good you are now. The other thing to bear in mind is that pdt only very approximately measures how many points ahead or behind your grade you have played. That is not exactly how it is calculated, but just a rule of thumb.

A player with a very high pdt looks like they may be a rapid improver, so the system adjusts by making their rewards (or losses) for winning or losing games bigger. That way, the system catches up with rapid improvers faster. Also, rapid sliders drop down the rankings faster. But players who are very steady at their grade (pdt close to 0) have smaller rewards and smaller losses for their games. The DG system works better than the existing ranking system for the various tests we have thrown at it.

George Cochran asks:

In the Dynamic Grade algorithm, the Recent Performance Deviation of a player is calculated over the previous 30 games, and thus the pdt uses results of the past 37 games. Thus the algorithm depends on the choice of the number 30; we could call the system "DG_30".

The excellent and well-written documentation doesn't provide any justification for the choice of 30 instead of, say, 15 or 20 or 40. Was there any evaluation of systems DG_k for values of k other than 30?

Also, I'm curious about the reason for the .2 added to the 19.2 in the formula for calculating the M of a player. It would be natural to make M range from 16 to 35, but why 35.2? Was this because of an observed improved performance of the system?

Which raises the question: If M ranges between j and k, instead of 16 to 35.2, does the overall performance go up or down as j deviates from 16 and k deviates from 35.2?

Excellent work, and a fascinating read.

Louis Nel responds:

George, all your questions have the same answer. Namely, the values of all these parameters were experimentally determined to optimise the value of GDev for the resulting system. In particular, we chose 30 (rather than 29) because it gave better performance. And we chose 16 (rather than 15.9) as lowest modulator value and 19.2 rather than 19 because these choices gave a better GDev.

Having said that, it cannot be guaranteed that cultural changes in future will not require changes to these parameters in years to come.

If j > 16 the overall performance goes down and likewise if k < 35.2 it also goes down. (This was meant to be implied by my previous response when I indicated that these bounds for M were chosen so as to optimise the performance in terms of the Grade Deviation (GDev)).

Bob Kroeger asks:

What would be very helpful is looking at players who have played the minimum required games to be included in the system and see how the new system might better (more accurately) reflect their ranking. Take me for example currently ranked 147. Let's say it's widely held that David Maloof (159) is clearly a better player than I am (which is true IMO). Would the new system correct this?

147)  Bob Kroeger    2108   14   8   2128   57.1   0
159)  David Maloof   2086   16   9   2143   56.2   4

Louis Nel responds:

Your question about your relative position to David Maloof is quite instructive. The CGS data compared to the DG data looks like this:

 

             CGS              Dynamic
           Grade    Idx       DG     pdt
Kroeger     2108   2128     2048      29
Maloof      2086   2143     2031     202

By the way, to get the Dynamic Grading data you just have to select that option on the Croquet Records site.

The significant difference is that the Dynamic data reveals that Maloof has been playing 202 points better than his grade over the preceding 37 games while you have been playing only 29 points better than your grade. So that is a formal indication that your surmise that he is the better player has some justification. He will likely catch up with you soon if he plays enough games, because his pdt is significantly larger than yours. As Jonathan already explained, this means his grade adjustments are going to be larger than yours.

In the Croquet Grading System (CGS) you would have received the same Index adjustments. Also, the statistic PWPG in the article shows that rapid improvers like Maloof are better handled by Dynamic Grading than by the CGS.

George Cochran comments:

The original goal of the grading system was that the difference in grades between two players should predict the probability that either player would win if a game were to actually be played between them, according to a specific formula. Thus if the grade difference is 100, then the higher-graded player is supposed to have a 61.3% probability of winning a game against the lower-graded player.

The Ranking Review committee developed a direct measurement of how well the difference in grades in the system corresponded to how often the stronger-graded player won.   They took all the games actually played between Jan 2000 and Oct 2010 in the Croquet Grading System and divided them into bins according to the grade difference.  There were tens of thousands of actual games in each bin.  You can then count how many of those games were won by the stronger player, and see whether the actual win frequency is close to the theoretical win probability.

What they report is that the current system systematically over-estimates the win probability in every bin.  That is, the actual win frequency from played games was significantly less than the probability that the difference in grades is supposed to predict.  They also determined that this over-prediction of win probability was mostly caused by the math formula being used to calculate the amount by which the index goes up and down when you win or lose a game.  Roughly speaking, for most players the index changes too much, and a much better system is obtained by cutting the amount of change in roughly half.

However, just cutting the size of the change in index creates larger grade errors for players whose actual skills and performance are rapidly changing, such as rapid improvers. Those errors then propagate through the system through their opponents' grades, resulting in a degradation of the overall system. By pegging the size of index change to how much a player has over-performed or under-performed the grade difference over the past 37 games, the committee found an algorithm that performs much better than any system in which the index-change is calculated solely from the difference in grades.

By "performs better", I mean that in the proposed new system, the difference in grades more accurately predicts the probability of winning a game.

John Riches asks:

I am wondering whether the system is tailored to suit the top few players, who may play well over 100 games against other top players in a year, or also caters well for the "also rans". In particular, how well suited is it for use in a country like Australia where a top player may get to play no more than 20 singles games in a year at international level?

If the ranking is worked out over his last 30 games (or like the current system, games played 20 games ago have the greatest influence on the ranking), then it seems that the ranking could reflect the ability of  a player 18 months to 2 years ago, rather than his current ability.

For purposes such as selection and seeding of the national championships, do you think there would be an argument in favour of Australia using its own system, based on more recent form?

Jonathan Kirby responds:

There was discussion on the committee about whether to optimise the system for the top 100 or so players, or whether to optimise it for all the players in the system. It was decided to optimise it for all the players taken together, not just the top players. (The reasons put forward for caring more about the top players were that it is mainly the top players for whom the rankings are used, e.g. for selections and world championship seedings.)

Dynamic Grading is a big improvement over the current Croquet Grading System for players who play few games per year, because the grade is always right up to date, with the most recent games having the biggest influence. You correctly identify one of the main problems with the current CGS.

I believe Australia is the only country which uses the world rankings directly for selection purposes, rather than using a selection committee. However, it is understandable given the size of the country if suitable people for a selection committee cannot be found. Selections made using Dynamic Grading should be better (based on more recent form) than selections made using the CGS.

Author: WCF Ranking Review Committee
All rights reserved © 2011-2017


Updated 28.i.16