Collective Choice: Competitive Ranking Systems

by Christopher Allen & Shannon Appelcline

[This is the third in a series of articles on collective choice, co-written by my colleague Shannon Appelcline. It will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]

In our first article on collective choice we outlined a number of different types of choice systems, among them voting, polling, rating, and ranking. Since then we've been spending some time expanding upon the systems, with the goal being to create both a lexicon of and a dialogue about systems for collective choice.

This time we're going to dig more into comparison ranking systems, by focusing on competitive rankings and looking in more depth at the ELO Chess Ranking System and the other systems that we briefly mentioned previously. Our goal is to explicate these systems, to better address their flaws, to begin detailing the purposes of ranking systems, and to show how those purposes are critical in the design of ranking systems.

Subjective vs. Objective Rankings

In our original article we discussed rating systems as being largely subjective and ranking systems as being objective, but the situation isn't nearly as simple as that. In truth, there's a clear spectrum of ratings and rankings with varying amounts of subjectivity and objectivity in each collective choice system.

The Bowl Championship Series (BCS) for college football is a good example of a ranking system that explicitly allows a subjective component. It involves a complex mathematical formula that includes things like win/loss ratios, but also sportswriters' and coaches' ratings.

However, public opinion continues to show that people don't necessarily like seeing true ranking systems having subjective components, because they expect them to be "fair". The BCS formula has come under attack several times in the last few years precisely due to its subjective basis. Cal Berkeley was one of several teams denied a bowl position in 2004 when many felt that they were worthy.

The ATP tennis rankings and the official world golf rankings also have a subjective component, but it is much more subtle. Each tournament is worth a certain number of points, and the allocation of those points is relatively arbitrary, based upon the "prestige" of each tournament and the quality of players who have traditionally played in it. The subjectivism isn't quite as near to the surface as that of the college bowls, but it's still something that can have a notable, and perhaps unwarranted, effect upon the final results.

Algorithmic Rankings

This brings us back to the ELO system, a ranking system originally designed for chess which is fairly well-known and well-understood. As we said in our overview article, "[ELO] builds a simple distribution of player ratings around a norm (typically 1500 points), then awards or deducts points based upon wins and losses, with the total sum of all points in the system staying constant. Players are then ranked according to their comparative scores."

The big difference between this and the previously discussed systems is that it's almost entirely objective; in fact it uses a statistical basis to create an underlying mathematical model for rankings, rather than allowing human subjectivity to get in the way.

The simplest formulation for an ELO rating looks like this:

R' = R + K * (S - E)

R' is the new rating
R is the old rating
K is a maximum value for increase or decrease of rating (16 or 32 for ELO)
S is the score for a game
E is the expected score for a game

Much of the trick is in figuring out what the (E)xpected score of a game is. ELO uses the following formulas for players A and B:

E(A) = 1 / [ 1 + 10 ^ ( [R(B) - R(A)] / 400 ) ]
E(B) = 1 / [ 1 + 10 ^ ( [R(A) - R(B)] / 400 ) ]

It's a good model because the two formulas together mean that a great player gains little from beating an average player, but an average player gains a lot from beating a great player. Take the following example:

R(A) = 1900
R(B) = 1500
E(A) = 1 / [ 1 + 10 ^ ( [1500 - 1900] / 400 ) ]
     = 1 / [ 1 + 10 ^ ( -400 / 400 ) ]
     = 1 / [ 1 + 10 ^ -1 ]
     = 1 / [ 1 + .1 ]
     = 1 / 1.1
     = .91
     = 91%

E(B) = 1 / [ 1 + 10 ^ ( [1900 - 1500] / 400) ]
     = 1 / [ 1 + 10 ^ ( 400 / 400 ) ]
     = 1 / [ 1 + 10 ^ 1 ]
     = 1 / 11
     = .09
     = 9%

Player A is expected to score .91 in an average game, which is to say he should win 91% of the time, and will be punished accordingly if he loses to player B:

R' = 1900 + 32 * (0 - .91)
R' = 1900 - 29.12
R' = 1871

Conversely a win nets him very little:

R' = 1900 + 32 * (1 - .91)
R' = 1900 + 32 * .09
R' = 1900 + 2.88
R' = 1903
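
The arithmetic above is easy to check in a few lines of code. What follows is only a minimal sketch of the two formulas as we've written them; the function names are our own, and the draw score of 0.5 is the usual chess convention rather than something we've discussed here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score E(A) for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating: float, expected: float, score: float, k: float = 32) -> float:
    """Apply R' = R + K * (S - E); score is 1 for a win, 0 for a loss, 0.5 for a draw."""
    return rating + k * (score - expected)


e_a = expected_score(1900, 1500)
e_b = expected_score(1500, 1900)
print(round(e_a, 2), round(e_b, 2))           # 0.91 0.09
print(round(updated_rating(1900, e_a, 0)))    # 1871 (A loses)
print(round(updated_rating(1900, e_a, 1)))    # 1903 (A wins)
```

The code keeps the unrounded expected score rather than .91, so intermediate values differ slightly from the hand calculation, but the rounded ratings agree.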

ELO is almost entirely mathematical. Players can gain or lose different amounts of points based upon playing different players, but this is all part of the formula. The only slightly subjective element is the definition of K -- how much a player can win or lose from a particular game. The most widely used ELO systems for Chess break K down into two values: 16 for masters and 32 for everyone else. So there is a subjective decision that masters' scores should swing less from game to game than those of other players.

That's a very minor element in an otherwise objective system, but as we'll see, more recent systems by Days of Wonder and Microsoft first reduce, then eliminate even this subjectivity.

Variations on a Theme: Days of Wonder

ELO is probably the most used ranking system in the world. You can find it in use for Go, Tantrix, and many other games. Days of Wonder, producers of Gang of Four, Ticket to Ride, and many other games, use a variant of the system which they describe on their website.

They identify three core problems with ELO:

  1. New players can take a long time to ascend or descend to their correct levels.
  2. Highly ranked players can be hesitant to play with provisional players whose ranking might be much more uncertain.
  3. There are no allowances for games with more than two players.

Days of Wonder resolved the first problem by creating a new formula for provisional players, allowing them to rise and fall in the rankings much more quickly.

Conversely, when playing against provisional players, regular players can only lose a maximum of K*n/20 points, where n is the number of games that the provisional player has played--rather than the normal maximum loss of K. For example, playing someone who has played just one game can only result in a loss of 1/20th of the regular K value, and so it really doesn't matter if the provisional player's ranking is wildly out of whack.

Both of these new formulas are set up to converge toward a normal ELO formula as a provisional player's number of games approaches 20 (making them a normal player at Days of Wonder).
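
To make the K*n/20 cap concrete, here is a minimal sketch. We only know the cap itself, not the exact formula Days of Wonder uses, so this assumes the simplest possible reading: compute an ordinary ELO loss and clamp it. The function and parameter names are ours.

```python
def expected_score(rating_a, rating_b):
    """Standard ELO expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def capped_loss(rating, opp_rating, opp_games, k=32, provisional_games=20):
    """Points a regular player loses when beaten by an opponent.

    Assumption: the ordinary ELO loss K * E is clamped to K * n / 20
    while the opponent has played n < 20 games.
    """
    normal_loss = k * expected_score(rating, opp_rating)
    n = min(opp_games, provisional_games)
    max_loss = k * n / provisional_games
    return min(normal_loss, max_loss)


# A 1900 player loses to a 1500-rated opponent with 1, 10, and 20 games played.
for games in (1, 10, 20):
    print(games, round(capped_loss(1900, 1500, games), 2))
```

At 20 games the cap equals K, so the function falls back to the normal ELO loss, which is the convergence described above.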

(It should be pointed out that using the number "20" to define a provisional player, and making a player less provisional in clean 5% steps, inevitably introduces yet another small, subjective element into this mathematical formula; as we'll see momentarily Microsoft has more recently incorporated the idea of provisional uncertainty into their core mathematical model, much as the whole ELO system originally turned subjective win and loss statistics into tighter mathematics.)

Finally, to resolve the situation of multiple players, Days of Wonder considers each game to be a set of duels, as described here:

There are 4 players in a Gang of Four game. Let's name A the winning player, B the second one, C the third one and D the last one. We consider that there were 6 duels: A won against B, C and D. B won against C and D. C won against D. We compute independently the new scores for each duel, and then we average the values for each player.

It's a fairly elegant answer that not only rewards or penalizes all players separately, but also encourages playing for second place, or even third, if first isn't possible.
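
The duel idea can be sketched as follows. This is our own illustration of the quoted description, using the plain ELO formulas from earlier with K = 32; Days of Wonder's real implementation may differ in details we don't know (provisional players, for instance, are ignored here), and it needs at least two players.

```python
from itertools import combinations


def expected_score(rating_a, rating_b):
    """Standard ELO expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate_multiplayer(finish_order, ratings, k=32):
    """finish_order lists player names from first place to last.

    Every pair is scored as a two-player ELO duel won by the
    better-placed player; each player's new rating is the average
    of the ratings produced by their duels.
    """
    results = {name: [] for name in finish_order}
    for winner, loser in combinations(finish_order, 2):
        e_w = expected_score(ratings[winner], ratings[loser])
        results[winner].append(ratings[winner] + k * (1 - e_w))
        results[loser].append(ratings[loser] - k * (1 - e_w))
    return {name: sum(vals) / len(vals) for name, vals in results.items()}


# Four hypothetical players who all start at 1500 and finish A, B, C, D.
start = {"A": 1500, "B": 1500, "C": 1500, "D": 1500}
for name, rating in rate_multiplayer(["A", "B", "C", "D"], start).items():
    print(name, round(rating, 1))
```

With equal starting ratings, each step down the finishing order costs the same number of points, which is why second or third place is still worth fighting for.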

There have been continued discussions of the Days of Wonder ELO variant in their forums, and the questions raised there are common to many different ranking systems. Some players wanted unranked games, while others thought that having unranked games would discourage people from playing good competitors except in unranked games. There has also been a lot of discussion regarding Ticket to Ride, a strategy game that supports 2-5 people, and whether the ELO variant system discourages multiperson play.

The various lessons learned at Days of Wonder underline two basic ideas about rankings. First, even with a well-studied system like ELO, there's still a lot to understand, and, second, any ranking system needs to reflect the specifics of what it's ranking -- and what its purpose is.

Variations on a Theme: XBox 360 Live

An even more recent large-scale ranking system is the TrueSkill system developed by Microsoft for use with the XBox 360. It appears to be an expanded variant of the Glicko ranking system used by the Free Internet Chess Server.

Many of the problems identified by Microsoft were the same as those already noted by Days of Wonder and others, including: the uncertainty of provisional ratings and the need to rank players in multiplayer games. However, the TrueSkill system notably expands both issues. Ranking uncertainty is now defined as a mathematical concept and the rankings now support not just multiple players, but also multiple teams.

TrueSkill explicitly includes two values in any ranking: a skill level and an uncertainty level. The first, like the more common ELO ranking, tells how good a player is. The second states how sure that ranking is. The uncertainty rating is effectively a margin of error, similar to those we saw in polling systems. If a first-time player has a skill rating of 25 with an uncertainty rating of 8.3 that means that his skill is probably somewhere in the range of 16.7 to 33.3, a pretty wide range, but then this is a totally untested player. According to benchmarks that Microsoft produced, 99.99% of actual skill levels were within 3x of the uncertainty rating, and 100% were within 4x.
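
The range in that example is simply the skill rating plus or minus a multiple of the uncertainty. The tiny sketch below shows only that arithmetic; it is not TrueSkill's update algorithm, and the function name is ours.

```python
def skill_range(skill, uncertainty, widths=1.0):
    """Return the (low, high) band skill ± widths * uncertainty."""
    return (skill - widths * uncertainty, skill + widths * uncertainty)


low, high = skill_range(25, 8.3)
print(f"{low:.1f} to {high:.1f}")      # 16.7 to 33.3

low3, high3 = skill_range(25, 8.3, widths=3)
print(f"{low3:.1f} to {high3:.1f}")    # 0.1 to 49.9
```

At three times the uncertainty, a brand-new player's band covers nearly everything from 0 to 50, which is another way of saying the system knows almost nothing about him yet.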

The rest of TrueSkill's innovations are built around this model of uncertainty. All players win or lose skill points, based upon how many players they beat or lose to, and they also decrease their uncertainty rating as they play more games. However, uncertainty is decreased more for players toward the middle of a pack within a game than those around the edges (because on the edges the players could actually be much better or much worse than it is possible to see from a specific game). In addition, TrueSkill is only a zero-sum ranking system for players at the exact same level of uncertainty. The more uncertainty that an opponent possesses, the smaller the weighting of any gain or loss (much like the simpler system that Days of Wonder uses, which bases weightings of games against provisional players as n/20).

Overall TrueSkill is a somewhat complex system that is described more fully at Microsoft's web site. Some of their expansions had already been considered by others, but still their system is notably innovative in two ways:

  • Expanding a competitive ranking system to include concepts of teams.

  • Incorporating the uncertainty of ratings further into the core mathematical model, rather than using a somewhat more subjective model such as that described by Days of Wonder for provisional players.

The TrueSkill calculations are a bit complex. In general, that's not a problem for a computer-based ranking model because you can have a computer doing all the computations, and players only need to understand the results. However the two-part ranking system used by TrueSkill, which notes both skill level and uncertainty, does offer a potential problem on this latter point. Can players understand it? In general, the concept of uncertainty will not be understood by people other than statisticians, thus raising a real user-interface question with the TrueSkill system -- and the exact sort of thing that designers of new ranking systems will need to consider.

Variations on a Theme: A Tale in the Desert

The online game, A Tale in the Desert, identified a different problem with the ELO system: cheating. This is a uniquely Internet-based problem, because there users can create fake accounts, then defeat those accounts to win points. This can also be done more subtly, by having multiple additional accounts build up the rating of that fake account before the fake account is defeated. So a totally new ranking system, called the eGenesis Ranking System, was created.

Each player is ranked through a 256-bit vector, half of which is initially set to 0 and half of which is set to 1 (therefore creating an average ranking of 128). Whenever a match occurs between players a hash function based on the players' names mathematically selects 32 of those bits, 8 of which are then randomly selected. Among those bits, any 1s in the loser's vector which correspond to 0s in the winner's vector are "transferred".

This simple design corresponds in some ways to ELO's more complex formula. A good player will have more 1s and thus more to lose, and he will lose correspondingly more to a poor player who has more 0s in his vector.

However, the system also prevents the collusion earlier noted. Statistically, a single player will only ever gain 8 ranking points from another new player, since out of the 32 bit hash only eight of those will, on average, be in the correct 0-1 configuration. Expanding a group of players expands the number of points that can potentially be gained, but within real limits.
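
Here is a rough sketch of how such a bit-vector ranking could work. It is our own reading of the description above, not eGenesis code: the SHA-256 hash, the random initial placement of the 1s, and the way 8 of the 32 hashed positions are sampled for each match are all assumptions made for illustration.

```python
import hashlib
import random

VECTOR_BITS = 256


def new_vector(rng):
    """A fresh ranking vector: half the bits are 1, half are 0 (rank 128)."""
    bits = [1] * (VECTOR_BITS // 2) + [0] * (VECTOR_BITS // 2)
    rng.shuffle(bits)  # assumption: initial placement of the 1s is random
    return bits


def candidate_bits(name_a, name_b, count=32):
    """Pick `count` bit positions determined only by the two players' names."""
    key = "|".join(sorted([name_a, name_b])).encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    return random.Random(seed).sample(range(VECTOR_BITS), count)


def record_match(winner, loser, winner_name, loser_name, rng, sample=8):
    """Move 1s from loser to winner on a random sample of the hashed positions."""
    positions = rng.sample(candidate_bits(winner_name, loser_name), sample)
    moved = 0
    for i in positions:
        if loser[i] == 1 and winner[i] == 0:
            loser[i], winner[i] = 0, 1
            moved += 1
    return moved


def rank(vector):
    """A player's rank is the number of 1 bits in the vector."""
    return sum(vector)


rng = random.Random(2006)  # fixed seed so the sketch is repeatable
alice, bob = new_vector(rng), new_vector(rng)
for _ in range(50):
    record_match(alice, bob, "alice", "bob", rng)
print(rank(alice), rank(bob))  # the two ranks always sum to 256
```

Because the 32 candidate positions depend only on the pair of names, beating the same account over and over can never move more than 32 bits in this sketch, and for two fresh accounts only about a quarter of those positions (8 on average) will hold a 1 for the loser and a 0 for the winner.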

In fact, the eGenesis system prevents cheating by measuring the size of social networks, then limiting the number of ranking points that can be earned within a social network. It's not necessarily the only way to measure social network size, but its methodology points toward social software as an interesting area for additional study of ranking systems.

As with XBox's TrueSkill, the eGenesis algorithms are overall fairly sophisticated and confusing, perhaps more so than TrueSkill itself. However, unlike TrueSkill the output is very simple: a skill number between 0 and 255. The intricacies are hidden by the system.

Competitive Ranking Goals

Ultimately, as we mentioned when discussing Days of Wonder, any ranking system has to be measured by what it's trying to do and how well it does that. ELO and similar numerical, long-term ranking systems are most likely trying to achieve one of three goals:

Hierarchy: Players are divided into hierarchies of success, giving players goals to constantly strive for and ways to measure their success (or failure).

Matching: Players can play with other players at their same skill level, rather than having to play beginners or experts who are much better than they are. This generally increases everyone's enjoyment. For computer games, the complexity of a matching system can be largely moderated by the computer, thus ensuring better competition.

Handicapping: If players do play against others of different skill levels, the better players can be handicapped in automatic, appropriate ways for the game in question, again increasing the fairness of games and everyone's enjoyment. For instance, someone ranked 3-kyu in Go playing a less experienced 7-kyu player would give him a starting 4 stone advantage to make for better competition.

The ELO system may be a good matching system, which allows players to easily find other players of their same skill level and play against them. However it doesn't provide any way to handicap players, nor would the ELO method necessarily be a good one to analyze handicaps (and conversely a golf handicap might not do a good job of finding like players nor measuring players' ability in a hierarchy).

More recently the XBox system has stated that it's explicitly for matchmaking, with the goal being to always try and match up players at nearly the same skill level. It's also used for hierarchy (or "leaderboards" as it's described in the TrueSkill docs), but that's clearly a subsidiary purpose.

All of these systems would be ineffective for measuring a winner in a live event, which is a very different goal:

Tourney: A single player is listed as an absolute winner, the "King of the Hill". Often, second, third, and fourth place winners are measured too.

And, the systems we've discussed thus far may not be useful for measuring privileges, yet another goal:

Threshold: The best ranks of players can be given special privileges, including the ability to create games and form tournaments. Alternatively, they can be given privileges totally outside the game, again giving them something extra to strive for.

For each of these additional goals we may need to consider very different ranking systems, not just variations of ELO.

Different Themes: Tourneys

There are a number of well-known tournament types which can be used to create a "King of the Hill" ranking.

The simplest is the single-elimination tournament, where the winner of each competition moves on to compete with other winners, until there is only one. However, this style of tournament is quite cut-throat and is not suited very well to events where the competition may result in a draw, or where chance is a notable factor in the competition. It also has a very subjective factor in the initial seeding of the rounds. The single-elimination tournament also does not rank the losers. However, by having the losers compete with each other in a Swiss-style tournament, the relative strengths of the players can be ranked.
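
A single-elimination bracket is simple enough to sketch in a few lines. This is a generic illustration with made-up players and ratings, where the stronger player always wins; real events also have to handle byes, draws, and upsets.

```python
def single_elimination(entrants, beats):
    """Run a knockout bracket and return the champion.

    entrants: players in seeded bracket order; the count must be a power of two.
    beats(a, b): returns whichever of a and b wins their match.
    """
    if len(entrants) < 1 or len(entrants) & (len(entrants) - 1):
        raise ValueError("number of entrants must be a power of two")
    round_players = list(entrants)
    while len(round_players) > 1:
        round_players = [
            beats(round_players[i], round_players[i + 1])
            for i in range(0, len(round_players), 2)
        ]
    return round_players[0]


ratings = {"A": 1900, "B": 1850, "C": 1500, "D": 1400}  # hypothetical players


def stronger(a, b):
    """Toy match result: the higher-rated player always wins."""
    return a if ratings[a] >= ratings[b] else b


print(single_elimination(["A", "B", "C", "D"], stronger))  # A
```

With this seeding the second-best player, B, meets A in the first round and goes out immediately while the much weaker C reaches the final, which shows both how much the seeding matters and why the bracket alone says little about how the losers rank.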

An improvement is the double-elimination tournament which is now one of the best known tournament systems in sports. Players compete in a series of two-player matches, and a player has to lose twice before he's eliminated. This is done through a system of winner and loser brackets, wherein people drop from the winners' brackets to the losers' brackets when they lose once, and drop out altogether when they lose twice.

One problem with standard double-elimination is that there are unusual situations where a significantly inferior player can still make it to the final round, or the last player to remain undefeated can lose only once and still be eliminated. These can be addressed through variants such as face-off (requiring the last two remaining competitors to compete again if the undefeated team is defeated for the first time in the finals) or by reconfiguring the losers' brackets.

Round-robin tournaments, such as official Scrabble Tournaments, involve every player playing a set number of games (24 in the 2005 World Scrabble Championship), facing opponents with similar win-lose records. They then ultimately rank players by their win-lose ratios.

The advantage of these sorts of tournament over an ELO-style ranking is that they're easily understandable and seem fair. In addition, they measure ranking in a much more topical manner: how well someone is playing during a singular instant, rather than over a longer career. As a result they work much better for a live tournament.

Different Themes: Thresholds

As we discussed in our original article on Collective Choice, thresholds are ranking barriers above which members get a special ability--or alternatively levels below which members lose a special ability. They can also act as another goal for a ranking system.

In the game of Go there are both amateur and professional players. Although they aren't technically in the same hierarchy of rankings, the highest Go amateur ranking (7 dan) is approximately equal to the lowest Go professional ranking (1 dan), forming a de facto threshold.

Likewise the United States Chess Federation uses their ELO rankings to denote Chess Masters. Anyone who achieves a USCF rating of 2200 is given a National Master threshold ranking and anyone who maintains it for 300 games is given a Life Master threshold ranking.

The American Contract Bridge League uses a threshold system where you have to win a certain number of tournaments and thus earn masterpoints in order to achieve official rankings such as "Sectional Master". Furthermore, players may earn different "colors" of masterpoints depending on the difficulty of the tournament, and some ranks require that you earn at least some specific colored masterpoints in order to meet the requirements for the next threshold.

These thresholds are fairly explicitly based on other hierarchical ranking systems, but this doesn't need to be the case. Since determining the purpose of a ranking system is often the first step in designing it, as we delve further into the area of thresholds we may well find that systems specifically dedicated toward measuring thresholds are more likely to do so well.

In our next article we'll consider, among other things, the Advogato reputation system, which manages thresholds in such a way as to prevent cheating.

Conclusion

There's actually a lot of variety in ranking systems, and even though we'd like them to be totally objective, various subjective elements often creep into these systems. In addition, there's a lot of variety in what ranking systems can do. For competitive systems, hierarchy, privilege, matching, and handicapping are some of the top purposes of ranking. Determining what a ranking system is going to do is a necessary first step in designing the system, as different systems will accomplish various goals to a better or worse degree.

ELO, in several variants, is the best studied and most used competitive ranking system. It works particularly well as a matching system. However, even ELO has flaws in it, among them: issues with new player rankings; its core two-player basis; its lack of provisions for teams; a few minor subjective elements; and problems with cheaters. New systems continue to be rolled out on the Internet to resolve these issues, and overall, it's an area of interesting new study.

Tournament systems and threshold systems offer a few good examples of competitive ranking systems with very different purposes, underlining the need to understand what you're doing before you do it.

Ranking systems also lie very near yet another type of Collective Choice: reputation systems. We briefly addressed reputation systems when talking about threshold systems and will return to this in our next article.


Related articles from this blog:

  • 2005-12: Systems for Collective Choice
  • 2005-12: Collective Choice: Rating Systems
  • 2006-08: Using 5-Star Rating Systems
  • 2007-01: Experimenting with Ratings

Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:

  • #192: Managing User Creativity, Part One
  • #193: Managing User Creativity, Part Two
  • #196: Collective Choice: Ratings, Who Do You Trust?
  • #198: Collective Choice: More Thoughts About Ratings

Posted on January 3, 2006 at 11:37 PM in Politics, Social Software, User Interface, Web/Tech

    Collective Choice: Rating Systems

    by Christopher Allen & Shannon Appelcline

    [This is the second of a series of articles on collective choice, co-written by my colleague Shannon Appelcline. It will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]

    In our previous article we talked about the many systems available for collective choice. There are selection systems, which are primarily centered on voting and deliberation, opinion systems, which represent how voting could occur, and finally comparison systems, which rank or rate different people or things in a simple, comparative manner.

    One purpose of our previous article was to create a dictionary of terms for talking about these related, but clearly different, systems. Another was to start offering analyses of these systems, many of which had not been well studied before their introduction onto the Internet.

    However at best our previous article provided an overview of what should be further investigated in each system. This article provides more in-depth coverage of one of the systems we previously outlined: rating systems.

    As we wrote in our previous article, in comparison rating systems "the value of individual items (most frequently goods) rise or fall based upon the largely subjective judgment of individual users." Ratings systems should be clearly differentiated from the closely related ranking systems. Ratings systems have a more subjective component, while ranking systems are largely objective. Amazon, Netflix, BoardGameGeek, and even the Stock Market were offered up as examples of ratings systems. Another example of a comparison rating system, and one of the earliest that appeared on the modern Internet, is eBay. The techniques they use are now beginning to show their age.

     

    eBay: A Failed Rating Experiment

    Most rating systems center around rating content, often user-contributed content, and they frequently help apply community values and acclaim to that content. However, the idea of ratings can go far beyond that narrow niche (though that will doubtless be its greatest use as the Internet continues to expand). eBay, an early Internet site, was one of the first to widely use user-submitted ratings, and it used them for a different purpose: to determine the good traders on their auction site.

    Unfortunately, as one of the first in this field, eBay made many mistakes which now leave their ratings system only slightly helpful. However, its failures can also provide us with insights in creating new rating systems on the Internet.

    eBay allows you to leave positive, negative, or (more recently) neutral feedback for each transaction you conduct in their society. These are aggregated into two numbers. "Feedback Score" is calculated as unique positive feedback received minus unique negative feedback received, and results in a whole number like "32" or "10,302". "Positive Feedback" is calculated as positive feedback received divided by all feedback received, and results in a percentage like "100%" or "99.8%".
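
    The two aggregates are simple to compute. The sketch below follows our description above; the function names and the sample counts are hypothetical, and counting neutral feedback in the denominator is our reading of "all feedback received".

```python
def feedback_score(unique_positive, unique_negative):
    """Feedback Score: unique positives minus unique negatives."""
    return unique_positive - unique_negative


def positive_feedback_pct(positive, negative, neutral=0):
    """Positive Feedback: positives as a percentage of all feedback received."""
    total = positive + negative + neutral
    if total == 0:
        return None  # no feedback yet, so no percentage to report
    return 100.0 * positive / total


# Hypothetical trader: 1,765 positive and 3 negative ratings, no neutrals.
print(feedback_score(1765, 3))                   # 1762
print(round(positive_feedback_pct(1765, 3), 1))  # 99.8
```

    Note how little the three negative ratings move either number, which is exactly the problem with reading these aggregates.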

    Unfortunately, for reasons discussed below, almost all feedback is positive, and thus the Feedback Score acts almost entirely as a track record of how many trades someone has made. The Feedback Score could be largely replaced by that single number. You can look at a score of "27", and say, "That's an amateur trader, or someone just getting started", at a score of "3", and say, "That person may or may not know what they're doing", at a score of "10,302", and say, "That person has done a lot of trades." But you still don't know how good the trader is.

    Theoretically, the Positive Feedback percentage should give a more meaningful number, but people so infrequently give bad ratings that, even when they do appear, they look like noise. Does a percentage of "99.8%" on a user with a score of "1,762" mean that the seller has a genuine problem or not? Do those 3 unhappy customers really represent another 30 who were unwilling to actually click the negative feedback? And, did those people have slightly bad experiences or really bad experiences? It's pretty hard to say.

    Overall, eBay has a few major problems with their rating system:

    • It's non-granular, with only two options (positive/negative), or more recently three (positive/negative/neutral).

    • It's non-distinct, with no useful guidelines on what behaviors should result in each rating.

    • It's non-statistical, and thus ends up showing only a gross number of sales, not a real subjective measure.

    • It's bilateral, with buyers and sellers rating each other simultaneously, and thus people are afraid to give bad ratings lest they get them in return.

    • It's meaningless, because there are no good tools to control who bids on an auction based on Feedback numbers. (Technically it may be legitimate to ban low feedback bidders from an auction, then cancel their bids if they enter the auction, but this is neither obvious, automatic, nor simple.)

    We're going to address each of these issues in turn, to offer insight into creating new comparison rating systems. The first three topics--granularity, distinction, and a statistical basis--are the most important elements of a good comparison rating system. Bilateral & meaningfulness issues will only be relevant on certain sites.

    (As a final caveat: in some ways eBay falls closer in ultimate result to a reputation system, a topic which we'll be covering more in a few articles down the road, but its lessons learned are still entirely accurate for rating systems of all sorts.)

     

    Granular Ratings

    In general, people want to be nice. There are exceptions to that rule, perhaps even great numbers of them, but the average, well-adjusted person would prefer to make other people happy, not sad.

    This has a notable effect on any comparison rating system, because it means that people are less likely to use the bottom half of any rating scale. If you did a statistical run on eBay, you'd certainly find that more than 99 out of every 100 ratings are positive. This is largely influenced by concerns of bilateral revenge, as discussed below, and the fact that eBay suggests other means of dispute resolution when you try to leave negative feedback. However, RPGnet, a roleplaying site which reviews games, comics, books, movies, and more, shows a similar trend despite the lack of bilaterality.

    RPGnet uses two 5-point scales for reviews, resulting in a total rating of 2-10. Of all the ratings at RPGnet, 6,983 reviews have a total that's above average, a total rating of 6 or more, and 795 have a total that's below average, a total rating of 5 or less. Perhaps there are more people who sit down to write a review because they really like a game than those who do so because they really hated it, but the result of ~90% of reviews being above average is still stunning.

    The following table shows all the ratings for each of the two categories that RPGnet uses, "Style" and "Substance":


    Rating   Style   Substance   % of all ratings
    1           73         210               1.8%
    2          687         590               8.2%
    3         2127        1583              23.8%
    4         3337        3242              42.2%
    5         1554        2153              23.8%

    This evidence confirms what we'd already suspected. Only 10% of raters use the bottom two ratings in a 5-point scale, and only 2% use the bottom rating. The median of the 5-point scale is actually the fourth point, with a neat bell curve arranged around it.
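
    These summary figures can be recomputed directly from the counts in the table. The short script below is just a check of that arithmetic; it pools the "Style" and "Substance" ratings together, which is how the percentage column appears to have been calculated.

```python
style = {1: 73, 2: 687, 3: 2127, 4: 3337, 5: 1554}
substance = {1: 210, 2: 590, 3: 1583, 4: 3242, 5: 2153}

# Pool both categories and work out each rating's share of the total.
combined = {r: style[r] + substance[r] for r in style}
total = sum(combined.values())
share = {r: 100.0 * n / total for r, n in combined.items()}

# Median: the first rating at which the running total reaches half of all ratings.
running = 0
for rating in sorted(combined):
    running += combined[rating]
    if running * 2 >= total:
        median = rating
        break

for rating in sorted(share):
    print(rating, f"{share[rating]:.1f}%")
print("bottom two:", f"{share[1] + share[2]:.1f}%")  # bottom two: 10.0%
print("median:", median)                             # median: 4
```

    The script rounds where the table seems to truncate, so it reports 42.3% rather than 42.2% for a rating of 4; the conclusions are unaffected.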

    Because users are innately unwilling to give bad ratings, as evidenced here, useful comparison ratings truly come about only through fractional differences between good ratings. In this case, the difference between "3", "4", and "5" is meaningful, and becomes more meaningful as more ratings are entered. Eventually you can look at a ranked list of ratings and see that "4.2" is a good rating while "3.5" is not.

    In order to do this, however, you need enough levels of good ratings to be able to distinguish between them. eBay, only offering one positive rating, does not provide enough differentiation. RPGnet, with its three positive ratings, might. However, sites that offer a 10-point scale are the ones that really seem to be able to produce meaningful statistics. On those sites we can expect that 90% of users will choose between six different numbers, from "5" to "10", and as the number of ratings builds up, this will produce enough differentiation to be meaningful. If you have already adopted a 5-point scale, consider allowing users to select the half-points, giving users a greater ability to differentiate their ratings.

     

    Distinct Ratings

    No two users are ever going to rate the same; different rating numbers will mean different things to each person. This can introduce minor discrepancies into ratings, if a single individual rates particularly low or high. However, because most ratings are eventually used for comparisons, if that low- or high-rater rates many different things, the ratings equalize. "Item A" is rated low by this person, but so is "Item B", and so they end up in the correct positions in relation to each other.

    A bigger problem occurs when an individual is inconsistent in his ratings over time. If an individual rates everything low for a while, then rates everything high, then he has a greater chance of biasing the overall rating pool. Worse, his individual ratings aren't meaningful, because you can't look at two items, see that one is a "6" and another is an "8", and truly believe that he likes the "8" a fair amount more than the "6". This reduces the usability of an individual recommendation system or a friends system where one user might look at what other users thought about products, because their unaggregated numbers are not accurate.

    You thus want to help individuals to stay consistent, and the best way to do that is to make the criteria for your ratings distinct. BoardGameGeek, a board game web site that supports a 10-point rating system for games, does a good job of offering distinction in its ratings.

    • 10 - Outstanding. Always want to play and expect this will never change.
    • 9 - Excellent game. Always want to play it.
    • 8 - Very good game. I like to play. Probably I'll suggest it and will never turn down a game.
    • 7 - Good game, usually willing to play.
    • 6 - Ok game, some fun or challenge at least, will play sporadically if in the right mood.
    • 5 - Average game, slightly boring, take it or leave it.
    • 4 - Not so good, it doesn't get me but could be talked into it on occasion.
    • 3 - Likely won't play this again although could be convinced. Bad.
    • 2 - Extremely annoying game, won't play this ever again.
    • 1 - Defies description of a game. You won't catch me dead playing this. Clearly broken.

    If you offer a distinct rating listing like this, some users will still come up with their own rating ideas, but if they do, they're more likely to remember what each number means to them. For everyone else, a very clear rating system is the most likely to produce meaningful and consistent results. As long as users aren't puzzled by the distinction, they'll be consistent in picking the same numbers for the same rating every time.

     

    Statistical Ratings

    The last big topic that you have to think about in creating most comparison rating systems is whether they're statistically sound.

    The best way to make your ratings statistically sound is with volume. If you can manage thousands or tens of thousands of ratings for each item, any anomalies are going to become noise. However, the fewer ratings you have, the more likely it is that your ratings are inaccurate in relationship to your database of ratings as a whole. (And thus one of the failures for eBay is that it tries to claim meaningfulness for users with very few ratings, where there's clearly no statistical basis.)

    Ideally what you want to do is give items with fewer ratings among your collection less weight, and those with more ratings higher weight. One simple way to do this is to apply a bayesian average. Variants of this are used by the aforementioned BoardGameGeek and by IMDB. RPGnet is using it for some unreleased software as well.

    The idea behind a bayesian average is that you normalize ratings by pushing them toward the average rating for your site, and you do that more for items with fewer ratings than those with more ratings. The basic formula looks like this:

    b(r) = [ W(a) * a + W(r) * r ] / [ W(a) + W(r) ]

    r = average rating for an item
    W(r) = weight of that rating, which is the number of ratings
    a = average rating for your collection
    W(a) = weight of that average, which is an arbitrary number, but should be higher if you generally expect to have more ratings for your items; 100 is used here, for a database which expects many ratings per item
    b(r) = new bayesian rating

    Say three "shill" users had come onto your site and rated a brand new indie film a "10" because the producer asked them to. However, you use a bayesian average with a weight of 100, and thus 3 ratings won't move the movie very far from the average site rating of 6.50:

    b(r) = [100 * 6.50 + 3 * 10] / (100 + 3)
    b(r) = 680 / 103
    b(r) = 6.60
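
    As a minimal sketch, the formula above can be written as a short Python function. The function and parameter names are our own for illustration; the default weight of 100 is simply the W(a) used in the example, not a recommended value.

```python
def bayesian_average(item_mean, item_count, site_mean, prior_weight=100):
    """Pull an item's mean rating toward the site-wide mean.

    item_mean    -- r, the plain average rating for the item
    item_count   -- W(r), the number of ratings the item has received
    site_mean    -- a, the average rating across the whole collection
    prior_weight -- W(a), the arbitrary weight given to the site average
    """
    return (prior_weight * site_mean + item_count * item_mean) / (prior_weight + item_count)


# The shill example: three "10" ratings on a site whose average is 6.50.
shilled = bayesian_average(item_mean=10.0, item_count=3, site_mean=6.50)
print(round(shilled, 2))  # 6.6

# The same perfect score backed by 3,000 ratings is barely pulled down.
popular = bayesian_average(item_mean=10.0, item_count=3000, site_mean=6.50)
print(round(popular, 2))  # 9.89
```

    The second call shows the other half of the idea: once an item has far more ratings than the prior weight, its own average dominates.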

    WowWebDesigns uses a similar model and even offers a good explanation of their methods on their web site.

    With everything that's been described thus far, including granularity and distinction, a bayesian average (or some other similar method) will probably be enough to give your ratings a good, sound statistical basis. However, sites with a low volume of ratings may still be concerned with "shills" or "crappers" who come to your site just to put "10"s on their favorite items and "1"s on their least favorite. RPGnet's reviews are an example of a system that could experience this issue, because only a few people are ever going to write reviews for an individual item, and this small number of reviews could compromise any comparisons generated by the rating system.

    In short, the following additional methods may help with this issue:

    • Rate the Raters: Reviews are low volume, but presumably readers of those reviews are high volume, and you can take advantage of that to then have your readers rate the reviews. Amazon and Netflix are two examples of sites which use this method by asking "how many readers found this helpful".

    • Altruistic Punishment: An alternative method for rating raters is to use altruistic punishment. Herein users can punish someone who does not contribute to the community, but at a cost to themselves. So, a reader could flag a poor rating or a poor review at some minor cost to their own rating. Though this method may seem somewhat paradoxical, game theory suggests that it is a generally effective technique for improving the commons.

    • Adjust Ratings Based on Ratings: Ratings can be self-adjusted based upon the rater's own behavior. The simplest method here is to map a rater's average rating to the average rating for a site. For example, if the average rating of a site is 6.50 and a shill's average rating is 10.0, then those 10s should be treated as 6.50s. This has the possibility for some intensive calculations, however, and may lead to additional bias in your rating pool if shills figure out the methods you use to adjust ratings.

    • Allow Editorial Fiat: Another method is to allow editorial fiat, where editors are expected to come in and remove bad ratings (or proactively not release them). This clearly results in time issues, but they may not be major since only sites with small numbers of ratings/item will have to do this type of adjusting. Further, automated systems could flag "suspicious" rating patterns which are outside the norms for average, speed of rating, etc. (RPGnet supports editorial fiat by requiring editorial release of all reviews.)
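
    To make the "Adjust Ratings Based on Ratings" idea concrete, here is one possible sketch of the simplest version: shift each rater's ratings so that their personal average lands on the site average. The function name, the dictionary layout, and the 1-to-10 clamp are all assumptions for illustration, and a real site might prefer a gentler adjustment.

```python
from statistics import mean


def adjust_ratings(rater_ratings, site_mean, low=1.0, high=10.0):
    """Shift one rater's ratings so their personal average matches the site average.

    rater_ratings -- dict mapping item name to that rater's raw rating
    Returns a new dict with shifted ratings clamped to the [low, high] scale.
    """
    offset = site_mean - mean(list(rater_ratings.values()))
    return {
        item: min(high, max(low, rating + offset))
        for item, rating in rater_ratings.items()
    }


# A shill who gives everything a 10 ends up contributing 6.50s.
shill = adjust_ratings({"Film A": 10, "Film B": 10}, site_mean=6.50)
print(shill)  # {'Film A': 6.5, 'Film B': 6.5}

# A harsh but discriminating rater keeps the gap between their ratings.
harsh = adjust_ratings({"Film A": 2, "Film B": 4}, site_mean=6.50)
print(harsh)  # {'Film A': 5.5, 'Film B': 7.5}
```

    Note that the relative order of one rater's items is preserved, which is what matters for comparisons.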

    The idea of adjusting ratings based on ratings bears a bit of additional discussion because it's somewhat similar to another well-known rating system: Slashdot. Herein you have both ratings and meta-ratings. People can rate threads and articles, then other people can agree or disagree with those ratings, which in turn makes it more or less likely that the original rater will be allowed to rate in the future (depending on if people agree or disagree with his ratings). Under a more general classification, this is probably a meta-rating system based on a reputation system, so it's something we'll look at further a couple of articles down the road.

     

    Other Issues: Bilateralism & Usefulness

    90% of the rating issues that sites will face are covered by the above. However, eBay in particular raised two other issues -- bilateralism and usefulness -- that aren't as generally relevant but do deserve some consideration.

    Bilateralism: One of the reasons that eBay's ratings fall apart is that they're bilateral. Buyers and sellers rate each other simultaneously and thus there's the fear of revenge if you rate someone badly. It's a sufficient issue that eBay has a FAQ on the topic, though they don't offer any good answers.

    The following solution would address some issues of bilateral revenge:

    • Put a time limit on bilateral ratings

    • Release bilateral ratings simultaneously at the end of the time limit

    • Don't allow additional ratings after the time period

    This would work well on an eBay, where you're unlikely to conclude an additional deal with someone you rated badly, and thus there's no possibility ever for rating revenge. On a game site, however, where people are arbitrarily put into games with each other, and thus you could end up in a game with someone you rated poorly, there might be room for later revenge, down the road. This would have to be addressed to truly feel comfortable with bilateral ratings.
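
    The three rules of that solution can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name, the 30-day window, and the +1/-1 rating values are hypothetical, and nothing here reflects how eBay actually stores feedback.

```python
from datetime import datetime, timedelta


class SealedFeedback:
    """Hold both parties' ratings for one transaction until a deadline passes."""

    def __init__(self, closed_at, window=timedelta(days=30)):
        self.deadline = closed_at + window   # rule 1: a time limit (30 days is assumed)
        self._ratings = {}                   # party name -> rating

    def submit(self, party, rating, now):
        # Rule 3: nothing is accepted once the window has closed.
        if now >= self.deadline:
            raise ValueError("rating window has closed")
        self._ratings[party] = rating

    def visible(self, now):
        # Rule 2: both ratings are released together, only at the deadline.
        if now < self.deadline:
            return {}
        return dict(self._ratings)


sale = datetime(2005, 12, 1)
feedback = SealedFeedback(closed_at=sale)
feedback.submit("buyer", -1, now=sale + timedelta(days=2))
feedback.submit("seller", +1, now=sale + timedelta(days=5))

print(feedback.visible(now=sale + timedelta(days=10)))  # {} (still sealed)
print(feedback.visible(now=sale + timedelta(days=30)))  # {'buyer': -1, 'seller': 1}
```

    Because neither party can see the other's rating before the window closes, and cannot add one afterward, a negative rating cannot be answered with a retaliatory one within the same transaction.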

    Additional investigation might reveal more variations of this method, or offer good answers for alternatives, like anonymous ratings.

    In addition, good privacy restrictions are really needed to make bilateral ratings work, as well as Terms of Service that protect users from lawsuits for ratings. There have already been cases of physical threats based upon eBay ratings. eBay has also produced cases where people threatened slander or libel lawsuits for bad ratings, and this even further chills the possibility of true ratings appearing on the eBay server.

    Usefulness: Finally, you want to make sure your ratings are useful at your sites. Rankings are a good way to achieve this. You can see the "best games ever" ranked, or you can see the most interesting user content rise to the top of a long listing, and the least interesting sink to the bottom.

    eBay offers a counter example of frustration with the usefulness of ratings. As already mentioned, you can theoretically ban "bad" users from bidding on your items, and then cancel bids from these users if they appear. However, there are multiple issues with this approach. First, how do you define "bad" users on eBay? Insufficient feedback? Too much negative feedback? Too high a percentage of negative feedback? Second, there is no automated method for doing this, so you must remain ever vigilant on your auctions to make sure that "bad" users aren't involved. Third, there's no way to keep a bad bidder from returning after you've cancelled his bid. Fourth, these bad bids and cancellations have the possibility of corrupting your auction, as you could lose other bidders who came in, saw the higher bid when the bad bidder was involved, then left before the bid was reduced by his removal. Finally, greed is a powerful motivator on eBay, which might lead to the retention of bad users.

    You also need to be careful with your user interface for ratings. Here is an example of a poor UI:

    [Image: example of a poor rating UI]

     

    Conclusion

    Comparison ratings are going to be an increasingly important force as the Internet continues to mature. To produce meaningful comparison ratings for your site, you need to concentrate on four important factors: granularity, specificity, sound statistics, and usefulness. And, if you offer bilateral ratings, make sure you understand the subtleties of that as well.


    Related articles from this blog:

  • 2005-12: Systems for Collective Choice
  • 2006-01: Collective Choice: Competitive Ranking Systems
  • 2006-08: Using 5-Star Rating Systems
  • 2007-01: Experimenting with Ratings

    Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:

  • #192: Managing User Creativity, Part One
  • #193: Managing User Creativity, Part Two
  • #196: Collective Choice: Ratings, Who Do You Trust?
  • #198: Collective Choice: More Thoughts About Ratings

    Posted on December 12, 2005 at 05:58 PM in Politics, Social Software, User Interface, Web/Tech

    Systems for Collective Choice

    by Christopher Allen & Shannon Appelcline

    [Shannon Appelcline is a friend and colleague of mine at Skotos, an online game company. Over the last few years we've had many discussions about how decisions are made, and how our society collectively makes choices. The origins of these discussions have varied from "what makes this board game work?", to "how can we give our players more control of our online games?", to "how do we make decisions in our company?", and of course "how did we collectively make such a mess of decision making in America?". This article, and some followup articles, summarize our thoughts on these topics, and will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]

    Collective choice systems have been around for a long time. Since at least the birth of democracy in ancient Greece people have made joint decisions about important issues, and since at least the knightly tournaments of the late Middle Ages people have competed to be ranked against their peers. Today Western culture especially values diversity of input when implementing any type of choice, believing that wide input from a variety of people provides the fairest result.

    The Internet expands this long history of collective choice. However, as we bring collective choice systems onto the Internet, quantifying and programming them, we discover the need to be more analytical and more methodical in the techniques used. Thus we're beginning to learn that we don't know nearly as much about these collective choice systems as we should. There is a need to analyze and study them further, to understand their strengths and weaknesses, and to evaluate their social impact. Fortunately, the social software and online games on the Internet provide the perfect petri dish for doing so.

    Before any analysis can occur, however, there is a need for a categorization of systems and a definition of terms. That is the purpose of this article: to lay out at least some of the ways in which collective choices can be made, to organize them, to define them, and to briefly consider them.

    Broadly, there seem to be three methods of collective choice, divided by the intended result: selection, opinion, or comparison.

     

    Selection Systems

    Selection systems allow for the purposeful choice between multiple items. There are many types of selection systems, but three in particular are worth noting: representative systems, deliberative systems, and consensus systems.

    Representative Systems: In a representative system, individuals cast a ballot for someone who will represent their interests. They're by definition voting systems and the heart of any republican system of government. When you're voting for a president, prime minister, senator, congressman, director, or board member, that's representative voting.

    In most representative voting a winner is selected by plurality, meaning the winner had more votes than any other candidate. This works well in a simple two-member election, but begins to fall apart if there are multiple candidates, because similar candidates can steal votes from each other, and thus allow a candidate with less popular ideas to be elected.

    The simplest solution is to require a majority victory, meaning that the winner must have more than 50% of the votes. Some places in the United States use this system for their representative elections, holding a first election, eliminating all but the two biggest vote-getters, then holding a new election between these two.
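
    A small, hypothetical election shows the difference between the two rules. In this sketch each ballot is a full preference ranking (an assumption we make so that the second election can be simulated), candidates "A" and "B" stand for two similar candidates who split their supporters, and the vote counts are invented.

```python
from collections import Counter


def plurality_winner(ballots):
    """Each ballot is a preference-ordered list; only the first choice counts."""
    return Counter(b[0] for b in ballots).most_common(1)[0][0]


def top_two_runoff_winner(ballots):
    """Hold a first round, keep the two biggest vote-getters, then vote again."""
    first_round = Counter(b[0] for b in ballots)
    leader, votes = first_round.most_common(1)[0]
    if votes * 2 > len(ballots):          # already more than 50%
        return leader
    finalists = {name for name, _ in first_round.most_common(2)}
    # In the second election each voter backs whichever finalist they rank higher.
    # (Assumes every ballot ranks at least one finalist.)
    second_round = Counter(
        next(c for c in b if c in finalists) for b in ballots
    )
    return second_round.most_common(1)[0][0]


# "A" and "B" are like-minded and split 60 votes; "C" holds the other 40.
ballots = (
    [["A", "B", "C"]] * 35
    + [["B", "A", "C"]] * 25
    + [["C", "A", "B"]] * 40
)

print(plurality_winner(ballots))       # C
print(top_two_runoff_winner(ballots))  # A
```

    Under plurality the vote split hands the election to "C"; with a runoff, the 25 "B" voters move to "A" in the second round and "A" wins 60 to 40.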

    Another solution is a primary-based election system, wherein all like-minded candidates compete against each other before participating in the real election. This requires buy-in from all like-minded candidates, however, and recent U.S. elections with third-party candidates like Ralph Nader and Ross Perot show the flaws in a voluntary primary system.

    Many other types of voting systems are possible, most of which allow voters to select multiple candidates at the same time. These systems then eliminate the lowest ranked candidates and give their votes to others based upon those voter selections.

    Instant Runoff Voting, or IRV, is a fairly commonly used multiple candidate system (though not necessarily the best one). It's technically a single transferable vote preferential voting system. Wikipedia describes the process like this:

    Each voter ranks at least one candidate in order of preference.
    ...
    First choices are tallied. If no candidate has the support of a majority of voters, the candidate with the least support is eliminated. A second round of counting takes place, with the votes of supporters of the eliminated candidate now counting for their second choice candidate. After a candidate is eliminated, he or she may not receive any more votes.

    This process of counting and eliminating is repeated until one candidate has over half the votes.
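
    The counting process described above can be sketched as a short loop. This is an illustrative implementation rather than any jurisdiction's official rules: the tie-breaking choice (eliminating the alphabetically first of the tied candidates) is an arbitrary assumption, and the ballots are invented.

```python
from collections import Counter


def instant_runoff(ballots):
    """Return the IRV winner for a list of preference-ordered ballots."""
    remaining = {c for b in ballots for c in b}
    while True:
        # Each ballot counts for its highest-ranked candidate still in the race.
        tally = Counter({c: 0 for c in remaining})
        for ballot in ballots:
            choice = next((c for c in ballot if c in remaining), None)
            if choice is not None:
                tally[choice] += 1
        active = sum(tally.values())
        leader, votes = tally.most_common(1)[0]
        if votes * 2 > active or len(remaining) == 1:
            return leader
        # Eliminate the candidate with the least support
        # (ties broken alphabetically; an arbitrary assumption).
        loser = min(remaining, key=lambda c: (tally[c], c))
        remaining.remove(loser)


ballots = (
    [["A", "B", "C"]] * 40
    + [["C", "B", "A"]] * 35
    + [["B", "A", "C"]] * 25
)
print(instant_runoff(ballots))  # A
```

    In the first round no one has a majority (40, 35, and 25 votes), so "B" is eliminated; the 25 "B" ballots then count for their second choice, "A", who wins 65 to 35.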

    It's simple to understand, but also flawed. That's because every voting system ultimately has some flaw in it, as is evidenced by the fact that given the same conditions the different systems will often declare different winners. Different systems may also allow voters to "game" the system in different ways. This is often called tactical voting or strategic voting. Similar to analyzing various "attacks" when studying a cryptosystem, looking at which tactical voting approaches each voting system is vulnerable to helps you evaluate the voting systems.

    For instance, one type of tactical voting that can be used against an Instant Runoff Vote is the push-over strategy, which Wikipedia describes like this:

    Push-over is a type of strategic voting in which a voter ranks a perceived weak alternative higher, but not in the hopes of getting it elected. This primarily occurs in runoff voting when a voter already believes that his favored candidate will make it to the next round - the voter then ranks an unpreferred, but easily beatable, candidate higher so that his preferred candidate can win later. A United States analogy would be voters of one party crossing over to vote in the other party's primary to nominate a candidate who will be easy for their favorite to beat.

    For example, in an IRV election between Bush, Gore, and Nader, Democrats might rank Nader over Bush. The hope would be that this would give Nader enough votes to keep him from being eliminated, thus knocking Bush out instead in the first round. Afterward the Republican Bush votes transfer to the less progressive Gore, rather than the fringe Nader, allowing Gore to beat the "pushover" Nader when he might not have beaten Bush in a straight-up fight.

    There are several different tactical voting "attacks" against various representative voting systems. One of the technically better multiple-candidate voting methods is the Condorcet method. It's immune to most tactical voting strategies and more people would consider its result "correct", but unfortunately it is much harder on the voters, who have to rank every single candidate. Maybe if we can create a better Internet user interface for Condorcet voting we can make this more sophisticated representative voting system more broadly available.
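
    For readers unfamiliar with it, the core of a Condorcet count is a set of head-to-head comparisons: the winner is the candidate who would beat every other candidate in a two-way race. The sketch below only finds such a winner when one exists; real Condorcet methods also need a rule for resolving cycles, which we leave out. The function name and ballots are our own invention, and every ballot is assumed to rank every candidate.

```python
from itertools import combinations


def condorcet_winner(ballots):
    """Return the candidate who beats every rival head-to-head, or None."""
    candidates = sorted({c for b in ballots for c in b})
    wins = {c: 0 for c in candidates}
    for x, y in combinations(candidates, 2):
        # Count ballots ranking x above y; this sketch assumes complete rankings.
        x_over_y = sum(1 for b in ballots if b.index(x) < b.index(y))
        y_over_x = len(ballots) - x_over_y
        if x_over_y > y_over_x:
            wins[x] += 1
        elif y_over_x > x_over_y:
            wins[y] += 1
    for c in candidates:
        if wins[c] == len(candidates) - 1:
            return c
    return None  # a cycle or tie: no Condorcet winner exists


ballots = (
    [["A", "B", "C"]] * 40
    + [["C", "B", "A"]] * 35
    + [["B", "A", "C"]] * 25
)
print(condorcet_winner(ballots))  # B

# A three-way cycle has no Condorcet winner.
print(condorcet_winner([["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]))  # None
```

    Here "B" has the fewest first-place votes, yet beats "A" 60 to 40 and "C" 65 to 35 in pairwise contests, which is why many people would call "B" the "correct" result.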

    Deliberative Systems: In a deliberative system, individuals directly make a decision, rather than selecting a representative to do so. Deliberative systems do not have to include voting, and the subcategory of consensus systems described below technically doesn't; however, most modern deliberative systems do. A deliberative system is the heart of true democracy. Traditionally it's been relatively unfeasible because voters were not expected to be educated enough to make governmental decisions and because they didn't have the time or capability to regularly decide on issues. The spread of the Internet alleviates at least the latter problem, since millions of people can now simultaneously decide on any issue if they so desire.

    In the United States the best known deliberative system is the initiative system found in some states, including California. It allows for issues to be put directly before the voters through the submission of sufficient signatures, and then allows the voters to pass or fail those issues, based on either plurality (most votes), majority (more than 50% of votes), or else super majority (some larger percentage of votes, specified in advance). In California, for example, 66% approval is required for new tax initiatives.

    The United States Constitution defines a large and very complex deliberative system. It creates three bodies of government to support deliberation and voting, and uses a system of checks and balances in order to allow different branches to have different effects upon a vote. The main voting is done by the legislature, which requires two majorities from two different groups of people to pass a vote. Then the executive branch has a singular opportunity to veto legislation, which then requires a super majority (here, 66%) to override that veto. Once a law is established, the judicial branch may by majority vote declare that legislation unconstitutional, but that may be overcome by an even greater super majority (typically, 66% of each house of the legislature plus 75% of the state governments) who want to amend the constitution.

    The constitution also shows how deliberation can span beyond simple voting, because it includes specific rules for how to debate, when debate can be closed, etc. In today's very fractured Congress, however, it's unclear if individuals ever are actually swayed by deliberations on the floor of the legislature, or if they've already decided to follow their party lines or their specific interests, long before they entered the Capitol buildings.

    A smaller example of a true deliberative system, based on guiding discussions as much as holding votes, is found in Robert's Rules of Order, a guide for conducting meetings. These rules detail explicit methods not just for voting, but also for the deliberation and discussion surrounding the voting. Various majority and minority votes can be taken to allow for certain actions.

    Because deciding directly upon ideas rather than just voting for representatives can have a greater effect upon a community, the deliberative systems may need to be more complex to avoid abuse, as evidenced by the complexities of the U.S. Government and Robert's Rules of Order. However, these very complexities can make these systems more prone to purposeful gaming. The benefits and deficits of more complex deliberative systems have not yet been fully studied, nor has there been as much analysis of "attacks" against them.

    Consensus Systems: In consensus systems people jointly come to a consensus as a group through group interactions. This sort of decision making theoretically avoids the "tyranny of the majority" and likewise can produce more informed decision making. It's a variant of the broader deliberative systems, but one with more group and less individual power.

    One example of consensual selection is cabinet government as laid out under the Westminster System. Wikipedia describes it as follows:

    Members of the Cabinet are collectively seen as responsible for government policy. All Cabinet decisions are made by consensus, a vote is never taken in a Cabinet meeting. All ministers, whether senior and in the Cabinet, or junior ministers, must support the policy of the government publicly regardless of any private reservations. If a minister does not agree with a decision he, or she, can resign from the government; as did several British ministers over the 2003 Invasion of Iraq. This means that in the Westminster system of government the cabinet always collectively decides all decisions and all ministers are responsible for arguing in favour of any decision made by the cabinet.

    Quaker-based consensus offers a similar example. Herein a facilitator helps to identify disagreements and agreements to move a discussion forward until an end result is embraced by all individuals.

    As a final note, it's important to differentiate consensus from coercion. The end result of unanimity isn't the sole definition of a consensus system, nor is it entirely required. What is required is a more open and thoughtful selection process.

     

    Opinion Systems

    Opinion systems are a clear subsidiary category to selection systems. An opinion system's main use is as a decision indicator, to show how people will decide or did decide in a representative system, a deliberative system, or both. Current opinion systems tend to be oriented toward actual votes, as opposed to more freeform selection systems (though the delphic polling system shows a more freeform version of the category itself). Opinion systems tend to be push-based (meaning people are asked for their opinions rather than actively offering them), but this isn't required.

    All opinion systems tend to have the same general problem, which is figuring out how to use scientific means to determine the actual results of a decision. This means massaging respondent numbers to offset categories of people more or less likely to vote to try and generate the actual results. For example, one 1998 poll showed that 62% of Republicans were absolutely certain they were going to vote, while only 51% of Independents could say the same. This means that every Republican voter a poll contacted in that year might have been weighted about 1.2x over every Independent contacted. Of course the actual calculations are much more complex than that, since they tend to depend upon traditional voter turnout and lots of analysis, but the core idea is sound, which is that every polled individual should not be considered equal.
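
    The weighting idea can be sketched as follows. Only the 62% and 51% turnout figures come from the poll mentioned above; the respondent counts and support levels are invented for illustration, and real pollsters use far more elaborate models.

```python
def turnout_weighted_share(responses, turnout_likelihood):
    """Estimate a candidate's share after weighting each group by likely turnout.

    responses          -- {group: (respondents, fraction_supporting_candidate)}
    turnout_likelihood -- {group: fraction of that group expected to vote}
    """
    weighted_support = 0.0
    weighted_total = 0.0
    for group, (count, support) in responses.items():
        weight = turnout_likelihood[group]
        weighted_support += count * weight * support
        weighted_total += count * weight
    return weighted_support / weighted_total


likelihood = {"Republican": 0.62, "Independent": 0.51}

# Relative weight of a Republican respondent versus an Independent one.
print(round(likelihood["Republican"] / likelihood["Independent"], 2))  # 1.22

# Hypothetical raw poll: 500 respondents in each group.
responses = {"Republican": (500, 0.80), "Independent": (500, 0.40)}
estimate = turnout_weighted_share(responses, likelihood)
print(round(estimate, 3))  # 0.619
```

    Unweighted, the candidate in this made-up poll would show 60% support; weighting by turnout likelihood nudges the estimate to about 62%, because the more supportive group is also the one more likely to vote.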

    All opinion system results tend to be rated with margins of error. The margin of error is a percent spread which the poll is expected to be within, 90-99% of the time (depending on how conservative of a confidence rating is given). If a poll shows that a politician is expected to take 48% of the vote, for example, and the margin of error is 4%, that means he is expected to take 44-52% of the vote with 90-99% surety. Margins of error are typically given much greater importance in the modern media than they should be, as they're calculated solely based upon the total number of respondents to a poll.
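
    To show why the margin of error depends only on the number of respondents, here is the standard textbook formula for a simple random sample, using the worst-case assumption that opinion is split 50/50. Whether a given pollster uses exactly this calculation is an assumption on our part; the function name and the sample sizes are ours.

```python
from math import sqrt

# z-scores for common two-sided confidence levels (standard normal quantiles)
Z_SCORES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}


def margin_of_error(sample_size, confidence=0.95, proportion=0.5):
    """Margin of error for a simple random sample, as a fraction (0.04 means 4%)."""
    z = Z_SCORES[confidence]
    return z * sqrt(proportion * (1 - proportion) / sample_size)


print(round(margin_of_error(600) * 100, 1))   # 4.0
print(round(margin_of_error(2400) * 100, 1))  # 2.0
```

    Roughly 600 respondents give the 4% margin used in the example above, and it takes four times as many respondents to halve it. Nothing in the formula accounts for question wording, ordering, or who declined to answer.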

    There are two general categories of opinion systems: pre-voting (subjective) polling systems and post-voting (objective) polling systems. A different type of opinion system, delphic polling, which could apply to either pre- or post-voting systems is also covered. Polling systems not directly related to selection systems are covered later, as subjective rating systems, since they tend to have issues very different from other polling systems, as their goal isn't to try and match the "true" number of an actual vote.

    Pre-voting Polling Systems: These are polls made before a vote is cast. They're often called "opinion polls" and tend to be conducted via phone. They try to isolate "likely voters" and determine how they will vote. This question of voter likelihood is one of the first issues with a pre-voting system, because there's no guarantee that the polled people will actually later vote. Likewise, pre-voting systems have to accommodate "undecided voters" and the fact that no voter has ever truly made up their mind until they cast their final ballot. Unlike post-voting polling systems, pre-voting systems also have considerably more possibility for bias (which is not accounted for by margins of error), based upon how questions are asked, in what order, and with what additional text.

    Post-voting Polling Systems: These are polls taken after a vote is cast. They're typically called "exit polls", as most are conducted as people are leaving a "polling" station (where they cast a vote). One would expect these to be much more reliable than pre-voting polls, but as the 2004 U.S. Presidential Election showed, exit polls can be wildly inaccurate.

    One of the problems with post-voting polling systems, shared with pre-voting systems, is that the results must be manipulated to make sure that respondents to the poll match the percentages of those constituencies in the overall populations. For example, in the 2004 exit polls it appears that women were initially overrepresented in exit polls, and because of increased black turnout it appears that blacks were underrepresented in the exit polls. It can easily be seen how either of these misrepresentations could cause notable changes in an exit poll result.

    When conducted & matched correctly, exit polls are supposed to be quite reliable.

    Delphic Polling Systems: An interesting polling method applicable to all sorts of opinion systems is the "delphi poll". This is a specific method of polling which is iterative and anonymous and which supports confidence ratings and feedback. The general idea is that people are polled on a question using not just binary responses, but a full confidence rating (e.g., you would state that you are 60% sure that Bush would be elected, rather than stating that you think Bush would be elected). After polls are collected, the anonymous results--or at least a summary of those results--are shared with the participants, who then poll again. This iterative process continues until a consistent answer is settled upon. By incorporating feedback into the polling process there's the possibility for greatly increased reliability.
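
    The iterate-and-share loop can be illustrated with a toy simulation. Everything about how participants react is an assumption here: we pretend that after seeing the anonymous summary (the median), each participant moves halfway toward it. Real participants revise their answers in far less predictable ways, so treat this only as a picture of the loop's structure.

```python
from statistics import median


def delphi_poll(initial_confidences, pull=0.5, tolerance=0.02, max_rounds=20):
    """Repeat anonymous polling until the answers agree within `tolerance`.

    Returns (summary_confidence, rounds_taken).
    """
    estimates = list(initial_confidences)
    for round_number in range(1, max_rounds + 1):
        summary = median(estimates)          # anonymous feedback shared with everyone
        if max(estimates) - min(estimates) <= tolerance:
            return summary, round_number
        # Assumed behaviour: each participant moves part-way toward the summary.
        estimates = [e + pull * (summary - e) for e in estimates]
    return median(estimates), max_rounds


# Four participants state how sure they are (0.0 to 1.0) that a candidate wins.
result, rounds = delphi_poll([0.90, 0.60, 0.55, 0.30])
print(round(result, 3), rounds)  # 0.575 6
```

    With these made-up starting points the spread between the most and least confident participant halves each round, and the group settles within two percentage points on the sixth poll.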

    In some ways delphic polling systems can be seen as an analogy to consensus systems, since both involve more iterative processes that eventually result in a more commonly-held decision.

     

    Comparison Systems

    Comparison systems allow individual items to be measured up against each other. There are three general categories: comparison ranking systems, which are largely objective and which typically rank people; comparison rating systems, which more often mix subjective and objective opinions, and which more frequently rate things; and reputation rating systems, which again tend to rank people, but also have a subjective and objective mix.

    Comparison Ranking Systems: In a ranking system, items in a hierarchy (most frequently people) rise or fall based upon specific, objective, and well-known rules. This is the heart of most multiplayer competitive systems.

    The ELO System is an example of a ranking system used for two-player games, and is used by the U.S. Chess Federation. Days of Wonder uses a multiplayer variant of the ELO system for their online games. Each system builds a simple distribution of player ratings around a norm (typically 1500 points), then awards or deducts points based upon wins and losses, with the total sum of all points in the system staying constant. Players are then ranked according to their comparative scores.
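
    As a rough sketch of how such a system awards and deducts points, here is the commonly published textbook form of the two-player calculation. The K-factor of 32 and the 400-point scale are conventional defaults that we assume for illustration; the parameters actually used by any particular federation or game site may differ.

```python
def expected_score(rating, opponent_rating):
    """Expected score (between 0 and 1) for a player under the Elo model."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def update(winner, loser, k=32):
    """Return the new (winner, loser) ratings after a decisive game."""
    gain = k * (1 - expected_score(winner, loser))
    # The winner gains exactly what the loser gives up, so the total stays constant.
    return winner + gain, loser - gain


# Two evenly matched players at the 1500-point norm.
print(update(1500, 1500))  # (1516.0, 1484.0)

# An upset: a 1400-rated player beats a 1600-rated player and gains more.
new_winner, new_loser = update(1400, 1600)
print(round(new_winner, 1), round(new_loser, 1))  # 1424.3 1575.7
```

    In both cases the two ratings still add up to the same total as before the game, which is the "constant sum" property described above, and an unexpected win moves more points than an expected one.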

    There are flaws in ranking systems like ELO. For example, two players could collude, with one purposefully throwing games so that his opponent could increase his ranking. Alternatively if a player gets a few lucky victories against good opponents, his rating might temporarily skyrocket above its normative value. However, these tend to be well-known and well-researched problems.

    There are numerous other ranking systems which are used for competitions, from double-elimination seeded tournaments (e.g., a tennis tournament) to ranked comparisons based upon win-loss ratios (e.g., baseball standings). Objective rankings are also (less commonly) used to rank items, such as a ranking of cars based upon safety ratings.

    Most ranking systems create a hierarchy of positive rankings (e.g., "best chess players ever"). However, a hierarchy of negative rankings may also be used, most commonly based on a negative criterion (e.g., "biggest Player Killers (PKers)"). In addition, either direction of ranking can use threshold systems to mark positive or negative rankings that meet a certain criterion. A positive threshold might be a "Grand Master" ranking threshold for anyone with a Chess rating of 2700, while a negative threshold might be a "Player Killer" ranking threshold, for anyone with sufficient "accidental" PKs.

    Ranking systems are somewhat removed from the other collective choice systems listed here, since there isn't a collaborative decision, only a collective result. However, their problems & results remain closely related to the more collective rating and reputation systems, hence their inclusion.

    Rating Systems: In a rating system, the value of individual items (most frequently goods) rises or falls based upon the largely subjective judgment of individual users.

    Amazon and Netflix are two examples of stores which provide subjective rating systems. Individual users rate items from 1 to 5 stars, then an average user rating is calculated. BoardGameGeek offers a slightly different example because it not only lets users rate individual items, but also ranks items against each other based upon those ratings.

    Flaws in these systems are similar to those in ranking systems: low numbers of ratings producing bad rankings, and individual users purposefully biasing ratings. Some mathematical methods may be used to smooth out these issues, among them Bayesian averages, which weight an item's ratings by the total number of ratings it has received.
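
    One common form of Bayesian average can be sketched in Python as follows. This is only an illustration: the prior mean (for example, the average rating across all items) and the prior weight (how many "phantom" ratings at that mean every item starts with) are assumed tuning values, and none of the sites mentioned above is known to use exactly this formula.

```python
def bayesian_average(ratings: list[float], prior_mean: float,
                     prior_weight: float) -> float:
    """Blend an item's own ratings with a prior mean.

    prior_weight acts like a number of imaginary ratings at prior_mean,
    so items with few real ratings stay close to the prior.
    """
    total = prior_weight * prior_mean + sum(ratings)
    return total / (prior_weight + len(ratings))


# Assumed prior: a 3-star mean carrying the weight of 5 ratings.
one_rave = bayesian_average([5.0], prior_mean=3.0, prior_weight=5)
well_rated = bayesian_average([4.5] * 100, prior_mean=3.0, prior_weight=5)
print(round(one_rave, 2), round(well_rated, 2))  # 3.33 4.43
```

    A single 5-star rating barely moves an item away from the prior, while an item with a hundred ratings averaging 4.5 ends up close to its own average. This is how the method dampens the low-sample problem described above.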

    The Stock Market offers an example of a different sort of rating system, because there's theoretically some objective basis to it. In a perfect Stock Market system, stock prices are based upon a solid cost analysis, such as a multiplier on yearly revenues or profits. However, as the Internet bubble of the late 1990s conclusively showed, there's also a high irrational component to stock purchases: thus subjective and objective views are combined in the rating (cost) of a stock.

    Reputation Systems: Finally, reputation systems are very similar to ranking systems: items in a hierarchy (most frequently people) rise or fall based upon specific and well-known rules. However, unlike true ranking systems, reputation systems instead base their rules for rise and fall upon other user feedback.

    The goal of a reputation system is ultimately to create a trust metric that often allows different users access to different powers. We'll be covering reputation systems a bit more thoroughly in a couple of weeks.

     

    Conclusion

    There are a variety of ways to measure the collective choices of a large group of people. We've outlined nine here: representative, deliberative, and consensus selection systems; ranking, rating, and reputation comparison systems; and three varieties of opinion systems. When developing social software it is important to understand the difference between these broad categories of systems and to use lessons already learned from the appropriate category in your own social software designs.


    Related articles from this blog:

  • 2005-12: Collective Choice: Rating Systems
  • 2006-01: Collective Choice: Competitive Ranking Systems
  • 2006-08: Using 5-Star Rating Systems
  • 2007-01: Experimenting with Ratings

    Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:

  • #192: Managing User Creativity, Part One
  • #193: Managing User Creativity, Part Two
  • #196: Collective Choice: Ratings, Who Do You Trust?
  • #198: Collective Choice: More Thoughts About Ratings

    Posted on December 1, 2005 at 04:03 PM in Politics, Social Software, Web/Tech | Permalink | Comments (9) | TrackBack

    Extrapolative Hostility in the Online Medium

    Extrapolate
    To infer an unknown from something that is known; conjecture.
    -- The Random House College Dictionary

    Mick LaSalle, an acerbic movie reviewer for the San Francisco Chronicle, writes a regular column "Ask Mick LaSalle" in the Sunday paper, where he sometimes allows others to vent their displeasure at his movie reviews. In this week's column he says something that I find very accurate to my experience with the online medium:

    As for why people get hostile when they hear a differing opinion, I go back to Spinoza's definition of love and hatred. He says that people love that which they think reinforces their survival and hate that which they think threatens their survival. I believe -- this is just my humble theory, now -- that when people hear an opinion that counters theirs, their minds extrapolate from that one opinion to imagine a whole philosophical system. And then they imagine how they would fare in a world run according to that imagined system. So they go from disagreeing to feeling threatened in a matter of seconds, and they lash out. Often they write letters that begin, "You are obviously," and that's where they identify, not you, but the phantom they feel threatened by.

    Over the years, I've been "obviously" liberal, conservative, gay, straight, humorless, frivolous, angry and deeply jealous of Tom Hanks. When I was 30, I remember getting accused of being a 45-year-old former hippie who drove a BMW, wore a Rolex and had done acid in the '60s. I'm not sure if I wrote back, but if I did, I would have said, "Wrong. Wrong. Wrong. Wrong. Wrong." But, of course, that kind of letter is your key to acquiring distance. It lets you know that the person's real quarrel is with some middle-aged former hippie -- probably known as Dad -- and that you're just the vehicle for that day's projection.

    I think that Mick LaSalle is exactly right -- I've seen this type of hostility based on extrapolation regularly in online mediums: in emails, newsgroups, wikis, blogs, etc. I've been guilty of it a few times myself, though usually for me the result is that I don't respond at all -- "Oh, he is just a flaming liberal", "She's an arch-conservative" or "He is just a technophobe." I can then feel comfortable in ignoring the rest of his or her point of view rather than trying to understand it.

    I doubt if explaining this theory to someone who writes a hostile message is useful -- they will take it as yet another attack, which will likely contribute to another cycle of flamage. But I do find Mick's theory useful as another way to read and understand hostile messages, and respond more appropriately.

    Understanding this lets me add another widget to my social software toolbox: when a group process results in a hostile message, try to determine if the author is actually reacting to what you said or if their hostility is based on extrapolating to "obvious" generalities. This may not allow you to directly address the hostility, but it may help you better understand it and thus not contribute to the cycle of flames.

    Posted on July 18, 2005 at 02:13 PM in Film, Politics, Social Software, Web/Tech, Weblogs, Wiki | Permalink | Comments (6) | TrackBack

    Politics with Respect

    This is not a political blog, but like most bloggers I've been inundated with various discussions about the Bush Administration, Michael Moore, Red vs. Blue, etc. So much so that on some professional email lists we've had to ask the participants to use the text "[POLITICS]" in their subject lines so that we can filter out the overt political content.

    One of the reasons I tend not to participate in political discussions is that they are most often polarizing. They tend to preach to the already converted, to harden the line, and are often too polemical. This was my problem with Fahrenheit 9/11 -- I thought there was some good stuff in the movie, but it was in a format that I didn't think would persuade the uncommitted. When the director first used innuendo to oversell his points, and showed a lack of respect for other points of view, I think he lost the uncommitted audience.

    Thus in general I've been reluctantly participating in the current political dialogue. I have contributed to various causes and campaigns, and have spoken to my friends about my thoughts on the subject, but I've not been bringing that activity online.

    However, I really do admire this statement on a t-shirt by Jerry Michalski: "Apparently half of you believe George W. Bush is an adequate president. I don't agree, and would love to talk about it respectfully." I like it because it clearly states the respect that is needed regarding the issues that the other half believe in. You may not agree with those issues, but you will be respectful to the person who believes in them. Jerry has made this t-shirt available at Cafe Press.

    Not quite as respectful, but done with conviction, is a song that I like by UK artist Jont called World Gone Blind (MP3 format), which is freely distributable under a Creative Commons Attribution-NonCommercial-NoDerivs License. I first heard him play at one of Jerry's Retreats. The song is very hip and elegant.

    Posted on August 16, 2004 at 01:59 PM in Politics | Permalink | Comments (0) | TrackBack

    Four Kinds of Privacy

    I've been thinking about the nature of privacy a lot lately.

    I've long been associated with issues of preserving privacy. I helped with anti-Clipper Chip activism in the early 90s and supported various efforts to free cryptography such as PGP and other tools built with RSAREF from export control. However, my efforts in these areas weren't really focused on privacy -- instead my focus was on issues of trust.

    I've always tried to be precise here. For instance, one of the uses of the SSL encryption software that I designed and sold at Consensus Development was to preserve privacy; however, I never sold it with privacy as a feature. Instead I clearly stated that SSL offered "message integrity", "confidentiality" and "authentication". Part of the reason that I never used the word privacy with SSL was that I felt that the concept of privacy was too overloaded -- or possibly orthogonal to issues of cryptographic security. Promising privacy was promising too much.

    More recently issues of privacy have been coming up in my study of social software. It got started with my post about privacy issues in Orkut and a general insecurity and discomfort with information made available in various social networking services. Later I wrote about handcrafting my FOAF which required me to re-think how and what information I wished to reveal about myself. More recently I was stung by Zero Degrees which still disturbs me greatly.

    All of this has stewed in my head until I arrived at the Computers, Freedom and Privacy Conference this week here in Berkeley, where I met many of my friends and colleagues in the cryptographic security business, as well as advocates on issues of privacy in organizations such as EFF and EPIC. My thoughts have now gelled sufficiently to make some observations about privacy.

    When people speak about privacy, they may actually be talking about very different forms of privacy: defensive privacy, human-rights privacy, personal privacy, and contextual privacy.

    Defensive privacy is the first form: it's about protecting information about myself that makes me vulnerable or makes me feel at risk. This type of information can include things like my social security number, my credit report, or non-financial things such as my medical records or my home address. For some of my female friends this includes things like their photographs and email addresses. All of this information can be misused by other individuals or organizations in one way or another to mess up my life -- and in fact defensive privacy is usually centered around protecting this critical information from those singular individuals or organizations, be they con men, stalkers, or the Mafia. Most of the current privacy issues on the Internet seem to fall into this category. This form of privacy has also not fared well in the US courts -- for instance, in 1974 the Supreme Court decided that your bank records belong to the bank, to do with as they see fit.

    Closely intersecting defensive privacy is the category of human-rights privacy. When you are speaking with a European about privacy, this often is the type of privacy they are speaking of. This comes from their history: the Netherlands in the 1930s had a very comprehensive administrative census and registration of their own population, and this information was captured by the Nazis within the first three days of occupation. Thus Dutch Jews had the highest death rate (73 percent) of Jews residing in any occupied western European country -- far higher than the death rate among the Jewish population of Belgium (40 percent) and France (25 percent). Even the death rate in Germany was less than in the Netherlands because the Jews there had avoided registration (source: The Dark Side of Numbers). Human-rights privacy differs from defensive privacy in that it is about how governments can abuse information, rather than individuals abusing information. I used to feel safe about human-rights privacy in the US, that there was no way that what happened in Europe could happen here, but now I have lost such confidence because of Bush and Ashcroft.

    The third kind of privacy, personal privacy, is more particular to the United States. It is what Louis Brandeis, later a Supreme Court Justice, called in 1890 "the right to be let alone". This form of privacy is often what the more Libertarian-oriented founders of the Internet mean when they talk about privacy. Personal privacy covers things like the "do not call registry", the various rights to do as we please in our own houses -- such as view pornography or play S&M games with our partners -- and the general right to not be interrupted or interfered with unnecessarily at home. This form of privacy has more basis in US law; the concept is based on an interpretation of the First, Fourth, and Fifth amendments of the US Constitution, but is not explicitly defined there. However, this form of privacy is guaranteed by the State of California Constitution, which assures residents that they may pursue and obtain safety, happiness, and privacy.

    Finally, contextual privacy is what Danah Boyd calls the ickiness factor in her blog, and also in her post at Many-To-Many:

    Ickiness is the guttural reaction that makes you cringe, scrunch your nose or gasp "ick" simply because there’s something slightly off, something disconcerting, something not socially right about an interaction.

    This category is very difficult to define, and is easily confused with other forms of privacy, but I believe it has more to do with an inappropriate level of intimacy. An example of this is when I discovered that my professional colleagues on Orkut could see that I was in a committed relationship, and in turn I could see that some of them were in open marriages. I don't think there is very much harm that can come from this information being revealed; however, it was "icky" because it was an inappropriate level of intimacy for a professional context.

    All four of these forms of privacy can intersect -- for instance, Orkut allows you to reveal your sexual orientation, which could be used secretly by an employer to discriminate against you (defensive privacy) or by a future Ashcroftian government to violate your civil rights (human-rights privacy), might lead to your being bothered at home by people who either agree or disagree with your orientation (personal privacy), and is often inappropriate for casual professional acquaintances to be told about (contextual privacy).

    I don't think any of this answers the question of how to solve problems of privacy, but I do believe that it can help when you are discussing privacy to be sure that you try to convey and understand each other's ideas of what privacy means.

    Posted on April 22, 2004 at 02:00 AM in Politics, Social Software, Web/Tech | Permalink | Comments (2) | TrackBack

    Political Action and Visual Media

    I'm fascinated by this new trend of various political organizations creating visual media, whether documentaries, flash animations, or complex web pages, to educate the public.

    Some exemplars:

    Ben Cohen, of Ben & Jerry's, explains the Federal Budget using Oreo cookies, sponsored by True Majority.

    MoveOn.Org's documentary Truth Uncovered.

    Posted on December 10, 2003 at 11:42 AM in Politics, Web/Tech | Permalink | Comments (0) | TrackBack