## Data Science and HR Annual Performance Reviews

- 01/08/2016

**Published In**

- Big Data
- Analytics
- Artificial Intelligence

I was in a team performance review meeting the other day. The leadership group was trying to assign a rating to each member of our teams, based on their percentile rank amongst their colleagues, on two dimensions (roughly what they have delivered and how they have gone about it).

We had considerable difficulty in doing this.

It is difficult to “see” percentiles: is X on the 60th percentile (i.e. better than 6 in 10 of her colleagues) or the 50th? It’s hard to say. It’s even quite difficult to see quartiles. This is because the comparable set (how all your colleagues rank) is itself the result of the same comparison exercise. So the reasoning is circular, and must be approximated by an iterative process. If the team is large, this takes a long time. And few of the reviewers had worked with more than 20% of the team under review, so it was difficult for them to rank colleagues they knew against those they didn’t. Reviewers tended to over-rank people they knew, resulting in 75% of the cohort being ranked better than the “median”.

In contrast, the group found it easy to do a pairwise comparison between members of the team: “X is better than Y at task A”; “I would rather have X in my team than Y for a technical role” etc. So in theory one could use pairwise comparison to build up a ranking over the whole team.

There are a number of constraints to this. Firstly, a full set of pairwise comparisons is a lot of pairs. For a team of 100, that’s 4,950 distinct pairs (100 × 99 / 2). Even if you did have a full set, there would surely be inconsistencies that would prevent a direct mapping between sets of pairs and a ranking. In particular there would be cycles (A is better than B, who is better than C, who is better than A), which make a strict ranking impossible (try it!). Cycles occur because it is hard to stay consistent over such a large exercise, and also because those voting on A vs B, on B vs C and on A vs C might be different people, given that different reviewers might have worked with different subsets of the team.
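To make the cycle problem concrete, here is a minimal Python sketch. The function names `count_pairs` and `consistent_rankings` are illustrative only. It counts each pair once, regardless of order, and then brute-forces every ordering of three people to find those that agree with all the recorded pairwise results.

```python
from itertools import combinations, permutations


def count_pairs(n):
    """Number of distinct unordered pairs in a team of n people."""
    return sum(1 for _ in combinations(range(n), 2))


def consistent_rankings(people, outcomes):
    """Return every ordering (best first) that agrees with all outcomes.

    outcomes is a list of (better, worse) tuples.
    """
    valid = []
    for order in permutations(people):
        position = {person: i for i, person in enumerate(order)}
        if all(position[better] < position[worse] for better, worse in outcomes):
            valid.append(order)
    return valid


people = ["A", "B", "C"]
acyclic = [("A", "B"), ("B", "C"), ("A", "C")]
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]

print(count_pairs(100))                      # 4950
print(consistent_rankings(people, acyclic))  # [('A', 'B', 'C')]
print(consistent_rankings(people, cyclic))   # []
```

The cyclic set leaves no ordering that satisfies every comparison, which is why a summary statistic is needed rather than a direct sort.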

We need a statistic that gets rid of the need to compare all combinations, and summarises the data even in the presence of cycles and variation.

As it turns out, precisely the same problem was encountered when ranking chess players. In a large population of chess players, there will be many who have never played each other, but they will probably have played the same opponents: A may not have played B before, but they may both have played C. Or perhaps they have played C and D respectively, and C and D have both played E. Moreover the outcomes of chess games are not deterministic: cycles are expected. The challenge was to develop a ranking system for the entire population based on this partial evidence.

Arpad Elo, a Hungarian-American physicist and chess master, came up with the famous Elo rating system (https://en.wikipedia.org/wiki/Elo_rating_system), which Mark Glickman adapted to include consistency of performance (https://en.wikipedia.org/wiki/Glicko_rating_system) and which Microsoft further adapted to rate teams with changing members (https://en.wikipedia.org/wiki/TrueSkill).

One of the nice features of these ranking systems is that they can produce a probabilistic result: the probability that A will beat B, or in this context, the probability that A will be better than B in a given task. At Betfair we used Glicko-2 to rate tennis players (and to predict the outcome of games) and it worked well.
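As a rough illustration of that probabilistic output, the sketch below uses the standard Elo expected-score formula (plain Elo, not the Glicko-2 variant mentioned above). The function name `expected_score` and the example ratings are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Elo's estimate of the probability that A is preferred over B.

    Uses the standard logistic form 1 / (1 + 10 ** ((Rb - Ra) / 400)).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Equal ratings give an even chance; a 100-point lead gives roughly 64%.
print(expected_score(1500, 1500))
print(round(expected_score(1600, 1500), 2))
```

The two probabilities for a pair always sum to one, so a gap in scores can be read directly as "how likely is it that a reviewer would pick A over B".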

Here’s how it could work for ranking team members:

- Take the full list of members to assess (best to segment into job roles or hierarchical levels so you are making a fair comparison).
- Create a random list of pairs (A, B).
- Give each reviewer a different list.
- For each pair, record the result as A, B or Null (where the assessor does not know them both well enough to make a decision).
- Plug the results into Elo (the equations are on the wiki). This will give an Elo score to each person. Where Elo scores are close, the skills are similar.
- The team can then be ranked by Elo score and put into percentiles, quartiles or grades.
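The steps above could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a tested HR tool: the starting rating of 1500, the K-factor of 32, the repeated shuffled passes over the results, and all function names are choices made for this example. Plain Elo is order-dependent, so the repeated passes are one simple way to reduce that effect.

```python
import random
from itertools import combinations

START = 1500.0  # assumed starting score for everyone
K = 32.0        # assumed K-factor (size of each adjustment)


def expected_score(rating_a, rating_b):
    """Standard Elo probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def assign_pairs(members, reviewers, pairs_per_reviewer, rng):
    """Give each reviewer their own random sample of pairs."""
    all_pairs = list(combinations(members, 2))
    return {r: rng.sample(all_pairs, pairs_per_reviewer) for r in reviewers}


def rate(members, results, rng, passes=20):
    """Turn (a, b, preferred) results into Elo scores.

    preferred is a, b, or None when the reviewer could not decide;
    None results are skipped.
    """
    ratings = {m: START for m in members}
    decided = [(a, b, w) for a, b, w in results if w is not None]
    for _ in range(passes):
        rng.shuffle(decided)
        for a, b, winner in decided:
            loser = b if winner == a else a
            delta = K * (1.0 - expected_score(ratings[winner], ratings[loser]))
            ratings[winner] += delta
            ratings[loser] -= delta
    return ratings


def quartiles(ratings):
    """Map each member to a quartile, 1 being the top quarter."""
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    return {m: 1 + (i * 4) // len(ranked) for i, m in enumerate(ranked)}


rng = random.Random(0)
members = ["A", "B", "C", "D"]
lists = assign_pairs(members, ["r1", "r2"], pairs_per_reviewer=3, rng=rng)

# Hypothetical recorded results; the last one is a "Null" and is ignored.
results = [
    ("A", "B", "A"), ("B", "C", "B"), ("C", "D", "C"),
    ("A", "C", "A"), ("B", "D", "B"), ("A", "D", "A"),
    ("C", "D", None),
]
ratings = rate(members, results, rng)
grades = quartiles(ratings)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
print(grades)
```

In practice the results would come from the reviewers' completed lists rather than a hard-coded table, and a variant such as Glicko could replace the update step if a measure of uncertainty per person is wanted.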

This would be easy to automate, pretty quick and a more accurate way to understand the relative skills of the team.

#### Harry Powell

Barclays, London at Head of Advanced Data Analytics

Opinions expressed by Grroups members are their own.
