iPhoneDevCamp and Hack-a-Thon

I feel privileged and honored to have been part of the iPhoneDevCamp this last weekend. Over 380 iPhone developers came out to the Adobe Campus in San Francisco to help each other make the best possible web pages and webapps for the iPhone.

I was the keynote speaker on Saturday and Master of Ceremonies for the MacHack-style Hack-a-Thon Demo on Sunday.

At the Hack-a-Thon almost 50 iPhone web applications were demonstrated to an enthusiastic audience. Take a look at Tilt, a game that takes advantage of the iPhone's motion sensor, PickleView, which is a same-time live baseball game enhancer, and The Pool, an attractive social game of water droplets hitting a pool. What is remarkable about these applications is not just the quality, but that each of them was written over just the weekend by a small team of 3-4 people who hadn't met each other before Friday!

Hewitt Prizes were awarded after the Hack-a-Thon based on the spirit of openness, contribution, sharing, and participation. Prizes included 3 iPhones and some very expensive Adobe software. In particular Joe Hewitt, of Firebug fame, was honored for his positive contributions, generous spirit, and wonderful iPhone UI example code. During the demonstrations, more than one person praised Joe, saying that his assistance, his code, or his debugger made their apps possible. Personally, I think about one-third of the web apps presented used some of his code.

Building on my experience with the same-time collaboration tool SynchroEdit, and the Skotos web-based games, I worked remotely with Kalle from Sweden and Erwin from Kansas to present an AJAX chat application called iLace. I am particularly proud of how well this little web application performs and how well it works using the iPhone UI. In particular, I think its melding of text entry and chat message receipt and its response to changes between portrait and landscape modes are very good examples of what can be done for chat on the iPhone. Source code is available!

My keynote presentation slides are now available in .pdf and .mov. I'm told a live recording of the session and an .mp3 will be available soon.

Over the last few weeks an online developer community that I started at WWDC called iPhoneWebDev has grown to over 650 members. It's now the best place to get online support for building iPhone web pages and webapps. I'd like to keep the momentum from the iPhoneDevCamp going forward on this list, so if you are interested in developing for the iPhone, check out the example code and join the discussion today!

Posted on July 8, 2007 at 11:19 PM in Games, iPhone, User Interface, Web/Tech

Getting Ready for the iPhone

I've been excited about the web capabilities of the upcoming iPhone for some time. As a reluctant laptop user ("oh, my aching shoulders"), there is real appeal to me in a better portable web browser. I have tried most of the PDA and cellphone browsers to date, and none offer more than a poor cousin to the web that we experience on the desktop.

Instead, the iPhone offers a desktop-class browser. There is no transcoding, nor any subset of HTML such as WML. Full web pages are rendered in the small display, and when you "double-tap" with your finger the section you touch is expanded to a more readable size. The video available at the Apple website shows this capability in use.

Because of the iPhone's upcoming June 29th release, I decided to participate in this week's Apple WWDC conference for Macintosh developers. A number of announcements about the iPhone were made there, and a number of technical sessions on the iPhone and iPhone-related technologies were held. Together, the iPhone demonstrations at the public keynote and other demonstrations throughout WWDC offered some real promise for when the phone is released on June 29th.

The biggest announcement at the public keynote was that there will not be an SDK for building native iPhone apps; instead, the only way for third parties to get involved is to create web applications optimized for the iPhone. This came as a big disappointment to the majority of developers participating at WWDC. However, as someone who has been involved lately in creating AJAX/Web 2.0 apps, I was less unhappy.

The other significant announcement at the keynote was that a Safari 3.0 beta for both Mac and Windows was being released and that a third Safari platform would be released on June 29th, inside the iPhone. This means that web 2.0 applications created to work with Safari on the Mac will likely also work on the iPhone.

Since SynchroEdit, an open-source simultaneous web editor (in the style of SubEthaEdit) for Firefox that I produced last year, is one of the most sophisticated AJAX/Web 2.0 applications, I dug deeper at various WWDC sessions to see if it might be possible to make SynchroEdit work on the iPhone.

One of the biggest things that SynchroEdit needs in order to function is DOM Mutation Events. At a party for WebKit (the open source code underpinnings of Safari's web renderer) and in questions after a session at WWDC it was confirmed that these are available to Safari 3.0 and presumably the iPhone.

The other key ability that SynchroEdit requires is WYSIWYG editing. This was terribly broken in Safari 2.0, but I saw many demonstrations of it working in Safari 3.0, so I don't anticipate any problems with this.

SynchroEdit also requires AJAX and in particular the XMLHttpRequest object, and the keynote clearly said that this was available.

The final thing that SynchroEdit needs is the ability to keep the browser at readyState == 3, i.e. not "finish" sending the page, so that we can continue to interactively pass updates to users as they arrive, without creating a new connection for every message. It is not clear if this will be supported on the iPhone, but there are ways to work around it.

So, in principle, it appears that we should be able to make SynchroEdit work on the iPhone. I am not sure that many iPhone users need SynchroEdit, but as an example of a very sophisticated web technology that should work on that platform, it shows the potential for what might be possible.

Because of this technological capability I've decided to begin investigating what type of social software apps could be highly useful on the iPhone and that aren't being served by the existing web 2.0 community. I am also going to continue investigating the technical issues of developing web apps for the iPhone.

If you are interested as well, I invite you to participate in the new iPhoneWebDev community. It should be a great resource for everyone interested in getting in on the ground floor with this new web technology. I have also begun tagging relevant web pages in del.icio.us with the tag iphonewebdev—I hope that others will begin to use this tag as well.

I have quite a bit more I'd like to write about specific iPhone technology, but unfortunately I have to wait until the WWDC confidentiality expires on June 29th with the release of the iPhone, so keep an eye out here for more details.

Posted on June 15, 2007 at 08:06 PM in iPhone, Social Software, User Interface, Web/Tech

Collective Choice: Experimenting with Ratings

by Christopher Allen & Shannon Appelcline

[This is the fourth in a series of articles on collective choice, co-written by my colleague Shannon Appelcline. It will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]

Last year in Collective Choice: Rating Systems we took a careful look at eBay and other websites that collect ratings, and used those systems as examples to highlight a number of theories about how to make rating systems more useful.

We suggested three main methods for improving rating systems:

Granular Ratings: Based on the clumping of ratings to high values, we believed that ratings could be made more useful by increasing the size of a rating scale. Most rating scales are 5-point ranges, so we suggested a 10-point range instead.

Distinct Ratings: Raters can be somewhat arbitrary in how they rate items, varying both from each other and even from themselves (usually over multiple sessions). Thus we believed that providing explicit statements of what each number meant could improve ratings.

Statistical Ratings: Finally we stated that in low volumes ratings could be biased by various quirks of data entry, either malevolent or not, and that ratings could be improved with strong statistical methods being used to polish up data and automatically keep "bad" data in line with "good".

In the year since we wrote that article we've decided to practice what we preach and have rolled out an entirely new rating system called The RPGnet Gaming Index. We've applied all of the above theories and thus far it looks like they're not only working, but that they're actually providing better rating systems than previous ones we've used at the RPGnet site.

In this article we're going to step through the data we've collected from this experience and see how it applies to our theory: first by looking at our previous RPGnet rating system, then by looking at the new system, and finally by examining the data from these two systems and comparing their results. We've also run into some unexpected troubles along the way, and we'll talk about those too.

The RPGnet Reviews System

RPGnet is our gaming site for tabletop roleplaying (games like Dungeons & Dragons and Vampire: The Masquerade). We purchased it in 2001 from the original owners. One of the benefits of RPGnet was that it had a very large community. As of today it sports one of the top-100 forums on the Internet, with over 1000 simultaneous users regularly logging in. However, because of its maturity, we also inherited many existing systems.

One of these was the RPGnet Reviews System, which gave individual users the ability to review gaming products: mostly role-playing games, but also board games, books, DVDs, and a smattering of related products.

Most of these reviews are submitted by average readers who just want to talk about a product that they like (or don't), though a fair percentage are instead submitted by staff reviewers. (Overall at least 26% of our reviews are based on publisher "comp" copies, and thus may be considered largely professional, while the other 74% may or may not be.) The large community size of RPGnet applies to the Reviews System as well: currently it features 8,505 published reviews.

Looking at the RPGnet Reviews through our three filters we find the following:

Granularity. The ratings from our existing reviews aren't as granular as we'd like. We have a theoretical scale of 2-10, but that's based upon a Style rating of 1-5 and a Substance rating of 1-5.

Rating | Style | Substance | % of all ratings
1 | 81 | 225 | 1.8%
2 | 732 | 651 | 8.1%
3 | 2364 | 1777 | 24.3%
4 | 3618 | 3525 | 42.0%
5 | 1709 | 2326 | 23.7%

Approximately 90% of raters rate only with values of 3-5, and thus our scale is more limited than the 2-10 range would indicate. Further, 42.9% of reviews rate Style and Substance exactly the same, suggesting that not everyone sees a difference between these two elements. On the whole this scale isn't as bad as a singular 5-point scale, but it also isn't a real 10-point scale, and the two orthogonal types of comparison don't necessarily provide a coherent description of a product.
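The percentages in the table can be re-derived from the raw counts. The following is only a small Python sketch for checking the arithmetic: the counts are copied from the table, while the function name `combined_share` is made up for illustration and assumes the "%" column pools Style and Substance ratings together.

```python
# Rating counts copied from the table above (ratings 1 through 5).
STYLE = {1: 81, 2: 732, 3: 2364, 4: 3618, 5: 1709}
SUBSTANCE = {1: 225, 2: 651, 3: 1777, 4: 3525, 5: 2326}


def combined_share(low, high):
    """Fraction of all Style and Substance ratings that fall in [low, high].

    Assumption: the percentage column pools both rating types together.
    """
    total = sum(STYLE.values()) + sum(SUBSTANCE.values())
    hits = sum(STYLE[r] + SUBSTANCE[r] for r in range(low, high + 1))
    return hits / total


# Print the per-rating share, then the share of ratings in the 3-5 range.
for rating in range(1, 6):
    print(rating, f"{combined_share(rating, rating):.1%}")
print("3-5:", f"{combined_share(3, 5):.1%}")
```

Pooling the two columns reproduces the table's percentages and puts the 3-5 range at roughly nine ratings in ten.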

Distinctiveness. Conversely, the review ratings are fairly distinct because the Review System provides an explanation of what each rating number means. For example the five Substance ratings are: I Wasted My Money (1); Sparse (2); Average (3); Meaty (4); Excellent (5). The descriptions could be better, but hopefully they connect to some users in meaningful ways, and help them to rate consistently.

Statistics. Our review ratings have no statistical basis. These values are used entirely unfiltered.

On the whole, the existing RPGnet Reviews embodied slightly less than half of what we wanted to see in a rating system: some improvement over a simple 5-point scale; some effort put into making individual ratings distinct; and nothing statistical.

There is room for improvement, however, as we'll see when we analyze this system more fully.

The RPGnet Gaming Index

Our newer system is the RPGnet Gaming Index. It doesn't supersede our Reviews, but instead offers a complementary look at the roleplaying field. The Index is essentially an RPG industry database. It contains individual entries for many different gamebooks—currently 5248—and allows registered users to rate each of them. Those ratings are then turned into averages by various mathematical formulas on a nightly basis and the roleplaying games in our index are then ranked.

The large size of RPGnet has allowed us to very quickly turn our ideas of a Gaming Index into reality. Just six months after release we have:

  • 5248 well-written Index entries
  • 5908 different editions
  • 4240 authors
  • 4478 covers
  • 360 different game systems
  • 345 series
  • 10142 individual ratings

Most of the ratings are clumped around the best and worst games, with many less popular games unrated as of yet. Four different items have at least 80 ratings each (Call of Cthulhu, Exalted, Nobilis, and Unknown Armies). Our average rating is 6.79. Ratings above 7.82 are in the 99th percentile, ratings above 7.21 are in the 90th percentile, and ratings below 6.53 are beneath the 10th percentile.

(For more info on the creation of the RPG Index, and how to encourage user generated content, see Shannon's articles, "Managing User Creativity", Part One and Part Two.)

The RPGnet Index also handles some unusual situations, such as when a game book contains other game books as part of an anthology or compilation. For instance, the 8-book compilation In Search of Adventure has a composite rating of 6.57 which is partially based upon the individual adventures that make it up.

Granularity: The first thing we did was provide a 10-point scale for this new system.

Distinctiveness: We also made sure each point of the scale was clearly defined. Currently the points of our scale are: Worthless (1), Poor (2), Some Flaws (3), Almost Average (4), Average (5), Above Average (6), Good (7), Very Good (8), Outstanding (9), and One of the Best Ever (10).

We made some mistakes in our original release of our "distinctive" titles, and we discovered this had real effects on the user input, telling us that these title labels are meaningful to users.

First, we initially labeled 6 as "average", to mirror the rating system for our existing Reviews, rather than setting 5 to be average. But as we noted in our first article, people like to be nice, and thus they tend to rate on the good side of a scale. Changing the label for our definition of average from 6 to 5 has slowly started dropping the average of all ratings down as a result (providing more breadth, a topic we'll talk about more shortly).

Second, two of our original distinctive titles were at odds with the others. Our original "2" value said that the game had "a few useful elements" and our original "9" value said that it was the "best of the year". The 2 was much more specific than any of our other terms and the 9 created a comparative query that was very different from anything else. Overall our ratings conformed to a bell curve centered between 6 and 7, but we saw very clear dropouts in our curve at 2 and 9, telling us that we'd made mistakes in those terms, and that people were less willing to use them as a result. Since we've made the change to our current set of titles those two discontinuities have disappeared.

Statistics. Finally, we fully integrated statistics into our new Index by using two main methods: bayesian weights and trust.

We explained bayesian weights pretty fully in our previous article. Here's what we said then:

The idea behind a bayesian average is that you normalize ratings by pushing them toward the average rating for your site, and you do that more for items with fewer ratings than those with more ratings. The basic formula looks like this:

b(r) = [W(a) * a + W(r) * r] / (W(a) + W(r))

r = average rating for an item

W(r) = weight of that rating, which is the number of ratings

a = average rating for your collection

W(a) = weight of that average, which is an arbitrary number, but should be higher if you generally expect to have more ratings for your items; 100 is used here, for a database which expects many ratings per item

b(r) = new bayesian rating

Say three "shill" users had come onto your site and rated a brand new indie film a "10" because the producer asked them to. However, you use a bayesian average with a weight of 100, and thus 3 ratings won't move the movie very far from the average site rating of 6.50:

b(r) = [100 * 6.50 + 3 * 10] / (100 + 3)
b(r) = 680 / 103

b(r) = 6.60
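The same calculation can be written as a short Python function. This is a minimal sketch of the formula quoted above; the function and parameter names are chosen here for illustration and are not taken from the RPGnet code.

```python
def bayesian_average(item_avg, item_count, site_avg, site_weight):
    """b(r) = [W(a) * a + W(r) * r] / (W(a) + W(r)).

    item_avg    -- r, the average rating for the item
    item_count  -- W(r), the number of ratings the item has
    site_avg    -- a, the average rating for the whole collection
    site_weight -- W(a), the arbitrary weight given to the site average
    """
    return (site_weight * site_avg + item_count * item_avg) / (site_weight + item_count)


# Three "shill" ratings of 10 against a site average of 6.50 with a weight of 100.
print(round(bayesian_average(10, 3, 6.50, 100), 2))
```

With a smaller site weight the same three ratings pull the result further from the site average, which is the trade-off behind choosing the weight.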

We implemented bayesian weights exactly as we'd detailed, but with a lower weight of 25. Since then we've accrued over 10,000 ratings in the database, and we can probably start thinking about cranking that weight up, another topic we'll return to.

Our trust-based algorithms suggest that some ratings are better than others, and should thus be more trusted (and thus more weighted when we calculate the average rating of an item). Though bayesian weights have been used before, we're not aware of other systems that weight ratings based on trust.

The calculation of trust is very simple:

Weight = 0 if #ratings(user) <= 2
Otherwise Weight = #ratings(user) / 50 to a maximum of 2

Weight *= 2, to a maximum of 4, if the user included a comment

This was based on the idea that the average good rater would rate 25 different items and the average great rater would rate at least 50. Additionally, we believed that ratings with comments were more likely to be thoughtful than those without.
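The trust rules above translate fairly directly into code. The Python sketch below is one reading of them, not the production algorithm: it assumes the comment bonus applies per rating, and the name `trust_weight` is made up for illustration.

```python
def trust_weight(num_ratings, has_comment):
    """Weight applied to one rating, following the three rules above.

    num_ratings -- how many ratings the user has entered in total
    has_comment -- whether this particular rating came with a comment
                   (assumption: the bonus is applied per rating)
    """
    # Users with two or fewer ratings are not counted at all.
    if num_ratings <= 2:
        return 0.0
    # Otherwise the weight grows with volume, capped at 2.
    weight = min(num_ratings / 50, 2.0)
    # A comment doubles the weight, capped at 4.
    if has_comment:
        weight = min(weight * 2, 4.0)
    return weight


for count, commented in [(2, True), (25, False), (50, True), (500, True)]:
    print(count, commented, trust_weight(count, commented))
```

Under this reading a drive-by user with one or two ratings contributes nothing, however high the rating or long the comment.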

That, overall, is a quick picture of what we've done with the RPGnet Gaming Index. Some of these ideas were laid out from the start, and others have been tuned as we progressed.

So how did we do, particularly in comparison to our existing RPGnet Reviews System?

The Comparison

One of our goals in improving rating systems has been to widen the range of possible input. As we noted earlier we discovered that 90% of our RPGnet Reviews Ratings were in the 3-5 range, and only 10% in the 1-2 range.

Generally, we can measure the success of widening a range by seeing whether the average rating of a database moves toward the midpoint of the scale. For a 10-point scale from 1-10, that's a desired value of 5.5. That generally means we're looking for our average rating to decrease because people tend to rate high.

The following table compares the average results of Reviews ratings and Index ratings.

Database Average
Converted Reviews 7.25
Massaged Reviews 7.29
Unweighted Index 7.10
Weighted Index 6.78

Here's what the categories in the above chart represent:

Converted Reviews: The Style + Substance of the Reviews, converted from its 2-10 scale to a 1-10 scale:

$rating = average($style) + average($substance);
$rating = ($rating * 1.125) - 1.25;

Massaged Reviews: The Style + Substance of the Reviews, with Substance given double weight over Style because we think that more closely reflects the intentions of the reviewer, converted from its 2-10 scale to a 1-10 scale:

$rating = (average($style) + 2*average($substance))/1.5;
$rating = ($rating * 1.125) - 1.25;

Unweighted Index: Index ratings exactly as users have entered into our Gaming Index:

$rating = average($index-rating);

Weighted Index: Index ratings adjusted by the weight of each individual rating, which is based on user trust and inclusion of comments:

$rating = average($index-rating*$index-weight)/average($index-weight);
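For readers who want to run the four calculations, here is a hedged Python sketch of them. The function names are invented for illustration, and the inputs are plain lists of numbers rather than anything read from the RPGnet database.

```python
from statistics import mean


def rescale_2_10_to_1_10(value):
    """Map the 2-10 Style + Substance scale onto 1-10 (2 -> 1, 10 -> 10)."""
    return value * 1.125 - 1.25


def converted_review(styles, substances):
    """Style + Substance averages, rescaled to 1-10."""
    return rescale_2_10_to_1_10(mean(styles) + mean(substances))


def massaged_review(styles, substances):
    """As above, but with Substance counted double before rescaling."""
    return rescale_2_10_to_1_10((mean(styles) + 2 * mean(substances)) / 1.5)


def unweighted_index(ratings):
    """Plain average of Index ratings as entered."""
    return mean(ratings)


def weighted_index(ratings, weights):
    """Average of Index ratings, each scaled by its trust weight.

    Raises ZeroDivisionError if every weight is zero.
    """
    return sum(r * w for r, w in zip(ratings, weights)) / sum(weights)


print(converted_review([4, 5], [3, 4]))
print(massaged_review([4, 5], [3, 4]))
print(weighted_index([10, 6, 7], [0.0, 2.0, 1.0]))
```

Dividing the weighted sum by the sum of the weights is equivalent to the ratio of averages in the formula above, since the rating count cancels out.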

Our average rating, which is our criterion for success, decreased somewhat from the Reviews System to the Gaming Index, and it decreased much more dramatically when we introduced our trust systems.

The following chart shows a typical example of how review and index ratings differ, using the venerable Dungeons & Dragons Player's Handbook as an example:

[Charts: Dungeons & Dragons Player's Handbook ratings, RPGnet Reviews only and Index ratings only]

For this book the median rating from reviews-only is 8, and the median from index-only is 7. A one-to-two point drop in median rating from reviews to index was consistent in all of our most-rated games other than those which were rated a "10" in both places.

We believe that this initial success of our unweighted Gaming Index can be attributed to the slightly better granularity (a 10-point scale versus two 5-point scales) and our improved distinctiveness (based on better naming of the rating levels). Whether this holds will ultimately be played out as the Index grows.

However we have no doubt that our statistical approach to the index data, when we moved from our unweighted Index to our weighted Index, is providing even better results. We had theorized that users who input more and who include comments would provide "better" data, and by our criteria of the average of the ratings moving toward 5.5 that seems to be borne out. The following table looks at the information a bit more precisely, by comparing average ratings as total number of ratings increases over several ranges:

# of Ratings (per user) | Average w/Comment | Average w/o Comment
1-2 | 8.55 | 8.88
3-24 | 8.08 | 8.16
25-49 | 7.32 | 7.11
50-99 | 7.14 | 7.03
100+ | 6.17 | 6.99

This table fairly definitively shows our basic maxim: the breadth of the ratings, and thus their quality, increases the more ratings a user makes. The improved quality of ratings with comments is less definitive. Among the vast mass of users the two values are pretty close, and sometimes the reverse of what we expect, but for the best and the worst users, ratings with comments seem to be better than those without. This latter point is another one that we'll have to continue to monitor as the Index grows beyond its current total of 10,000 ratings.

The other major element of our statistical approach to the Index is our bayesian weight. The following chart shows a top-ten chart for roleplaying games calculated via four different methodologies: our Reviews; our Index with no weighting; our Index with a 25 bayesian weighting (as it currently stands); and our Index with a 50 bayesian weighting:

# | Reviews-Only | 0-weight Index | 25-weight Index | 50-weight Index
1 | Delta Green: Countdown | The Chronicles of Talislanta | Delta Green: Countdown | Delta Green
2 | Nobilis | Wildside | Spirit of the Century | Delta Green: Countdown
3 | Castle Falkenstein | Devil's Due | Delta Green | Unknown Armies
4 | Vimary Sourcebook | Lodges: The Faithful | Unknown Armies | Call of Cthulhu
5 | Liber Servitorum | Apocalypse | Call of Cthulhu | Nobilis
6 | Ork! | Earthdawn Gamemaster's Compendium | Nobilis | Spirit of the Century
7 | GURPS Russia | Into the Badlands | Pendragon | Over the Edge
8 | GURPS Reign of Steel | Earthdawn Player's Compendium | Over the Edge | Pendragon
9 | Cudgel's Compendium | Chronicle of the Black Labyrinth | Mutants & Masterminds | Mutants & Masterminds
10 | Corum | The Spell Book | Pulp Hero | Vimary Sourcebook

We actually did do a little bit of statistical analysis on the Reviews. On our first try at producing this chart we got a random clump of reviews rated 5/5 from a much larger pool, so we further ordered them by descending total count of reviews. As a result you're seeing a better selection of ranked reviews than a truly unstatistical sampling would allow. We did the same for the unweighted Index (which clumped a number of results at "10"), except that we further ordered items at the same weight by decreasing number of views (another statistical decision).

Clearly, deciding which of these lists is "right" is a much more subjective measure than the mathematical analysis we were able to apply to earlier problems. However, most roleplayers would tell you that the unweighted Reviews and Index lists are terrible. The top 5 items in the Reviews list actually aren't bad for a starting list of good games—but only because we did the aforementioned statistical ordering. Before that we just had a random listing of gaming items. Even with our attempts at quickie statistical analysis the unweighted Index is still quite bad, with only Talislanta regularly showing up on other "best" lists.

The problem is the ability of one person to come in and rate an item a "10" (or a "5"/"5"), thereby making that item more highly rated than any item which has an actual consensus of ratings. Of our unweighted top Reviews only the top three had more than 2 reviews and the rest had 2. Not surprisingly those top three were the best fits to a typical top-ten list. Of the unweighted Index only the top three had more than 1 rating, and the rest had 1. Our single good pick was in those top three.

Our 25-weight Index, which is what we currently use, has been generally accepted by the RPGnet community as a good marker of what's good and what's not. However, there have been two items on it which some percentage of people disagree with: Spirit of the Century and Pulp Hero. It's instructive to see that when we increase to a 50-weight Index, Spirit of the Century drops (even more notably than depicted here, because its rating goes from .01 behind first place to .16 behind first place) and Pulp Hero disappears entirely.

The questions of what to set your bayesian weight to, when to increase it, and what maximum value to set it to are all relatively unstudied, and thus we don't have good answers to them. As we pass 10,000 ratings we're considering upping the bayesian value to 50. We expect that 100 will be our ultimate value when the Index is fully mature; however, if we increase the weight too far, an older, less-rated game will never be able to get enough weight to get out of the doldrums.
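To make the effect of the bayesian weight concrete, the Python sketch below ranks three items under weights of 0, 25, and 50. The three items and their numbers are entirely hypothetical, invented for illustration; only the site average of 6.79 comes from the figures quoted earlier.

```python
SITE_AVG = 6.79  # average Index rating quoted earlier in the article

# Hypothetical items: (title, average rating, number of ratings).
ITEMS = [
    ("One-Vote Wonder", 10.0, 1),
    ("Small Cult Hit", 9.6, 25),
    ("Broad Favorite", 8.4, 90),
]


def bayesian_average(item_avg, item_count, site_avg, site_weight):
    """b(r) = [W(a) * a + W(r) * r] / (W(a) + W(r))."""
    return (site_weight * site_avg + item_count * item_avg) / (site_weight + item_count)


def ranked(items, site_weight):
    """Titles ordered from highest to lowest bayesian rating."""
    scored = [
        (bayesian_average(avg, count, SITE_AVG, site_weight), title)
        for title, avg, count in items
    ]
    return [title for _, title in sorted(scored, reverse=True)]


for weight in (0, 25, 50):
    print(weight, ranked(ITEMS, weight))
```

In this made-up data the single "10" tops the unweighted list, falls to the bottom once any weight is applied, and the moderately rated item with fewer ratings slips behind the widely rated one as the weight rises from 25 to 50.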

Conclusion

We're by no means done with this ratings experiment. Though we're pleased and impressed with the growth of the RPGnet Index thus far, by next year we hope that the Index will include the vast majority of all games in print (as opposed to somewhat less than half now) and that our 10,000 ratings will grow to 50,000 or more. This will allow us to offer even more definitive answers to our questions.

In the meantime we're still mucking with our statistics and facing new problems.  Some of the newest:

  • What to do about drive-by ratings: Our trust algorithm does a good job of making drive-by ratings, where a publisher points his audience to an item in our site, mostly irrelevant, but there's some concern that they could have more effect in the long run.
  • How to incorporate our review ratings in our index ratings: It seems a shame to waste the thousands of reviews that have been written (and indeed they're currently calculated into a composite rating we use in the Index), but we're realizing that people have very different purposes for writing reviews and inputting ratings, which may result in some of the upward skew we see on the review side of things. Ultimately we need to decide whether they're just too different or whether our statistical massaging is enough to incorporate those reviews into a composite Index rating.
  • How to pick some of our numbers: As we already noted, we don't have good formulas for when to choose which bayesian weights. Likewise we've been guessing at which values to use for the trust-based weighting of our raters. Originally we set our desired rating count to 100 for good raters and 200 for great raters, but we've since dropped those to 50 for good and 100 for great based upon the real numbers of ratings that users were making. Again, we'd prefer to derive an actual formula for this type of calculation.

Shannon has discussed some of these issues more in his recent article More Thoughts About Ratings.

Despite unanswered questions, we still feel good about the basic ideas we laid out in our article last year. We have no doubt that giving our ratings a statistical basis has dramatically improved them and evidence thus far suggests that both granularity and distinctiveness have been helpful as well.


Related articles from this blog:

  • 2005-12: Systems for Collective Choice
  • 2005-12: Collective Choice: Rating Systems
  • 2006-01: Collective Choice: Competitive Ranking Systems
  • 2006-08: Using 5-Star Rating Systems

Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:

  • #192: Managing User Creativity, Part One
  • #193: Managing User Creativity, Part Two
  • #196: Collective Choice: Ratings, Who Do You Trust?
  • #198: Collective Choice: More Thoughts About Ratings

Posted on January 1, 2007 at 10:38 PM in Social Software, User Interface, Web/Tech

    Speaking about SynchroEdit at WikiWednesday

    I will be speaking tonight at WikiWednesday on the topic of Same Time, Different Place Editing, and will be demonstrating SynchroEdit integration with MediaWiki and EditThisPagePHP.

    If you are interested, see you tonight (Wednesday) at 6-8pm, at Socialtext.

    Posted on December 6, 2006 at 02:34 PM in User Interface, Web/Tech

    Ratings: Who Do You Trust?

    My colleague, Shannon Appelcline, has been working on a game rating system for RPGnet. This has resulted in real-world application of the principles for designing rating systems which we've previously discussed in our Collective Choice articles. Shannon's newest article, Ratings, Who Do You Trust? offers a look at weighting ratings based on reliability.

    On the RPGnet Gaming Index we've put this all together to form a tree of weighted ratings that answer the question, who do you trust?

    Here's how we measured each type of trust, and what we did about it:

    • Volume of Ratings for an Item. Introduce a bayesian weight to offset the variability of items with low-volume ratings.

    • Volume of Ratings by a User. Give each user a weight based on his volume of contribution which is applied to his ratings.

    • Depth of Content by a User. Give each rating a weight based on the depth of thought implicit in the rating which is applied to that rating.

    These all get put together to create our final ratings for the Gaming Index, with each user's individual rating for an item getting multiplied by its user weight and its content weight, and then all of that averaged with the other user ratings and the bayesian weight too. The result is in no way intuitive, but users don't really need to understand the back end of a rating system. Conversely we hope it's accurate, or at least more accurate than would otherwise be true given the relatively low volume of ratings we've collected thus far.
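As a rough illustration of how these three pieces might fit together, here is a Python sketch. It is not the Gaming Index code: the function name is invented, and it assumes that the bayesian step treats the sum of the trust weights as the item's effective number of ratings, which the description above does not spell out.

```python
def combined_rating(ratings, site_avg, site_weight):
    """Blend trust-weighted user ratings with a bayesian pull to the site average.

    ratings     -- list of (value, user_weight, content_weight) tuples
    site_avg    -- average rating across the whole collection
    site_weight -- bayesian weight given to that average

    Assumption: the summed trust weights stand in for the rating count
    in the bayesian formula.
    """
    # Each rating's weight is its user weight times its content weight.
    weights = [user_w * content_w for _, user_w, content_w in ratings]
    weighted_sum = sum(value * w for (value, _, _), w in zip(ratings, weights))
    return (site_weight * site_avg + weighted_sum) / (site_weight + sum(weights))


# A zero-trust "10" is ignored; one trusted, commented "8" nudges the result up.
print(combined_rating([(10, 0.0, 1.0), (8, 1.0, 2.0)], 6.5, 25))
```

An item with no trusted ratings simply sits at the site average under this scheme.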

    Here are some of Shannon's earlier discussions about the design behind the new "user content" based RPGnet Gaming Index:


    Related articles from this blog:

  • 2005-12: Systems for Collective Choice
  • 2005-12: Collective Choice: Rating Systems
  • 2006-01: Collective Choice: Competitive Ranking Systems
  • 2006-08: Using 5-Star Rating Systems
  • 2007-01: Experimenting with Ratings

    Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:

  • #196: Collective Choice: Ratings, Who Do You Trust?
  • #198: Collective Choice: More Thoughts About Ratings

    Posted on September 14, 2006 at 04:28 PM in Games, Social Software, User Interface, Web/Tech, Weblogs

    Using 5-Star Rating Systems

    In Collective Choice: Rating Systems I discuss ratings scales of various sorts, from eBay's 3-point scale to RPGnet's double 5-point scale, and BoardGameGeek's 10-point scale.

    Of the various ratings scales, 5-point scales are probably the most common on the Internet. You can find them not just in my own RPGnet, but also on Amazon, Netflix, and iTunes, as well as many other sites and services. Unfortunately 5-point rating scales also face many challenges in their use, and different studies suggest different flaws with this particular methodology.

    First, one study using Amazon data has shown that many undetailed ratings (where the rater isn't required to add any additional information other than the rating they select) show a bimodal distribution. In other words, the distribution of ratings tends to cluster around two different numbers (e.g., 1 and 5) rather than offering a normal distribution where the ratings cluster around a single value (e.g., 3). Thus the median of these ratings is not an accurate reflection of product quality, but instead is a statement of conflicting opinions.

    Second, our own study using RPGnet data has shown that many detailed ratings (where the rater does add additional information, in this case a full review) offer a normal distribution, but one that is biased toward the high end of the scale. On RPGnet, for example, we discovered that 90% of the ratings on this 5-point scale were 3 or higher, with an average around 4.

    Randy Farmer of Yahoo suggests that this scale limitation is particularly troublesome for fan-based ratings, such as those found on episodic TV sites:

    Only the fans of a show evaluate the episodes, and being fans, will never rate an episode one or two stars, ever. I've seen this attempted over and over on the net with the same results every time: Each episode of a show is 4-stars +/- .5 stars. This goes all the way back to the Babylon-5 website, probably the first source for this kind of data.

    (And indeed, the TV episode TKO, from Babylon 5's first season, is considered an entirely atrocious episode by even the fans. Yet it has a 6.1 of 10 "Fair" rating on tv.com.)

    Thus even when a bimodal distribution is not a problem, on a 5-point scale the upward bias often results in only 2 or 3 meaningful data points. This is problematic because it minimizes differentiation. In many cases, a 5-star rating system where most of the ratings are either 3 or 4 is actually no better than just a thumbs-up/thumbs-down rating system.

    However, given that 5-point scales are probably here to stay, we are forced to make the best use of them we can.

    First, we need to provide raters with incentives, so that they provide meaningful ratings. We've already seen that this can be done by requesting detailed ratings: when a person takes the time to write text, and knows that his name will be attached to it, he generally does a better job in his rating. There are other possible incentive techniques as well, such as RPGnet's new XP System.

    Second, we need to provide means for a 5-point scale to become more meaningful by encouraging raters to use not just the top half of the scale, but the bottom half as well. One method to accomplish this is to make ratings distinct -- as I briefly mentioned in my previous article on this topic -- and encourage standards so that an "average" rating is 2 or 3, not 4.

    As an example of how to accomplish both of these goals with already existing 5-point rating scales, I've detailed my own experiences with using ratings on two popular services -- iTunes and Amazon. By providing myself with incentives and making my use of ratings very distinctive, I have created more meaningful and useful output for myself.

    Music Ratings - iTunes

    Apple's iTunes software offers you the ability to rate individual songs with a 0-5 Star rating. If you use iTunes with an iPod, you can change the rating of a song on your iPod and the change will be reflected in your iTunes database the next time you sync your iPod. The "Shuffle Songs" feature available on more modern iPods has an option to have songs with higher ratings be played more often. A very powerful feature, Smart Playlists, can dynamically create sophisticated playlists based on ratings. All of this makes rating music on iTunes very useful.

    After Shannon and I wrote our Rating Systems article, I examined the ratings in my iTunes catalog. Using Alastair's fabulous XSLT iTunes rating statistics tool, I discovered that the ratings I created in iTunes were clearly biased too high, matching the pattern we'd described. I had far too many songs rated with 4 Stars, and almost nothing rated 1 or 2. This made my ratings less useful.


    Here are some statistics from your iTunes Library: 4172 tracks, 412 (10%) rated

                                                   Cumulative % of Rated
                           Number  % of rated  Actual  Target  Shortfall
    Tracks rated 5 stars:     112          27      27       5        -22
    Tracks rated 4 stars:     183          44      72      15        -57
    Tracks rated 3 stars:      92          22      94      50        -44
    Tracks rated 2 stars:      22           5      99      90         -9
    Tracks rated 1 stars:       3           1     100
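    For readers who want to reproduce these numbers for their own library, here is a minimal Python sketch. The tool I used is an XSLT stylesheet, so this is not its code; the function name is hypothetical, and reading "Shortfall" as the target minus the actual cumulative percentage is an inference from the figures above.

```python
def rating_statistics(counts, targets):
    """Summarize star ratings from highest to lowest.

    counts:  {stars: number of tracks with that rating}
    targets: {stars: target cumulative percentage of rated tracks}
    Returns rows of (stars, number, percent, cumulative, target, shortfall).
    """
    total = sum(counts.values())
    rows = []
    running = 0
    for stars in sorted(counts, reverse=True):
        running += counts[stars]
        percent = round(100 * counts[stars] / total)
        cumulative = round(100 * running / total)
        target = targets.get(stars)
        # Assumed definition: shortfall = target - actual cumulative %.
        shortfall = None if target is None else target - cumulative
        rows.append((stars, counts[stars], percent, cumulative, target, shortfall))
    return rows


counts = {5: 112, 4: 183, 3: 92, 2: 22, 1: 3}
targets = {5: 5, 4: 15, 3: 50, 2: 90}
for row in rating_statistics(counts, targets):
    print(row)
```

    A large negative shortfall means far more tracks sit at or above that rating than the target distribution allows.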

    So over the last few months I've completely revamped my iTunes ratings. Since I can't change the user interface, I've changed my behavior. I'm also taking advantage of two other fields: "checked" which I use to give more distinctiveness to my ratings, and "play count" which shows whether or not I've listened to something through to the end.

    Here are the criteria I used:

    Rated 5 - Exemplars: Only my most favorite songs are rated 5. They have to meet the following criteria: they make me feel good or excite me no matter how often I listen to them, I can typically listen to them often without getting tired of them, and they are the best of their particular genre.

    Rated 4 - Great: There is only a small difference between a song that is rated 4 and one that is rated 5 in my ratings -- typically a 4 doesn't excite me or make me smile quite as much, or it isn't necessarily an exemplar of its genre. However, I can still typically listen to these songs often without getting tired of them. Items that are rated 4 and 5 are ones that I carry on my iPod Shuffle.

    Rated 4 - Great (Unchecked): There are a few songs that I do consider to be great, but that I only want to play when I'm in the mood for them, or I want to only play in a specific order, or they "don't play well" with other music. For instance I love the song "The Highwayman" by Loreena McKennitt; however, it is over 10 minutes long and I just don't want to hear that type of song unless I'm in the mood for it. Other examples are the 12 songs that make up Mussorgsky's "Pictures at an Exhibition" -- I want them played in order when I do play them, and I really don't want them played in the middle of my other songs. Unfortunately, iTunes does not let you select only unchecked items, so I don't have a Smart Playlist for these; instead I keep them in a regular playlist.

    Rated 3 - Good: These are songs I like. Typically I can play them regularly but not too often. Songs rated 3-5 go on my iPod Nano.

    Rated 3 - Good (Unchecked): There is a lot of music that I think is Good, but I don't want to play all the time. I have a large catalog of sound tracks from movies. All but a few of those tracks are in this category. Again, iTunes does not let you select only unchecked items in a Smart Playlist, so I have several regular playlists for these items.

    Rated 2 - Ok: I have very diverse musical tastes, starting with jazz, various ethnic and world music, and also including quite a bit of pop, rap, R&B, punk, and metal that I enjoy. I don't enjoy them all the time -- but I do like them to pop up every once in a while for variety. So I rate these 2 and leave them checked. I have an old 40GB iPod that I take on long trips, and it stores everything I have that is checked and rated 2-5.

    Rated 2 - Ok (Unchecked): Some songs are OK, but I really have to be in the mood specifically for that song. Listening to Jimmy Buffett's "Margaritaville" can be a guilty pleasure on a lazy summer day at the beach, but it isn't something I want to regularly listen to. I have a number of special playlists for songs rated like this.

    Rated 1 - Don't Like: These are the songs that I don't like. They're just not my style. Many are still quality music; they just don't work for me. I do keep most of these for completeness -- it might just be one or two songs on the album, and I want to keep the album complete. Or I keep it in case my tastes change. But in general, once something is rated 1 Star, I'll probably never listen to it again.

    Rated 1 - Trash (Unchecked): These are songs that not only do I not like, they just are not good music. I don't like most rap music, but I can tell that most are still quality. Some are junk -- these I rate 1 and uncheck, and they are candidates for deletion the next time I purge my collection.

    Unrated & Listened (play count > 0): If I've listened to something through to the end, but haven't rated it yet, it shows up in this Smart Playlist. Periodically I check this Smart Playlist, sort by play count, and try to rate everything that I've listened to more than once.

    Unrated & Unlistened (play count = 0): This is the default when a new song is added to my library. So any song that is unrated, checked, and has a play count of 0 shows up in my "Unrated & Unlistened" Smart Playlist. When I'm in the mood for variety, I go through this playlist and rate songs.
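    The criteria above amount to a lookup on three fields: rating, checked, and play count. Here is a minimal Python sketch of that lookup. iTunes does this through Smart Playlist rules rather than code, so the function and table names are hypothetical; combinations that my criteria don't mention (such as an unchecked 5) simply return None.

```python
# (rating, checked) -> category, as listed in the criteria above.
CATEGORIES = {
    (5, True): "Exemplars",
    (4, True): "Great",
    (4, False): "Great (Unchecked)",
    (3, True): "Good",
    (3, False): "Good (Unchecked)",
    (2, True): "Ok",
    (2, False): "Ok (Unchecked)",
    (1, True): "Don't Like",
    (1, False): "Trash (Unchecked)",
}


def categorize(rating, checked, play_count):
    """Return the category for a song, or None if the criteria don't cover it."""
    if rating == 0 and checked:
        # Unrated songs are split by whether they were ever played to the end.
        return "Unrated & Listened" if play_count > 0 else "Unrated & Unlistened"
    return CATEGORIES.get((rating, checked))


print(categorize(4, False, 12))  # Great (Unchecked)
print(categorize(0, True, 0))    # Unrated & Unlistened
```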

    Modifying my rating system in this way has caused my average rating for music to change from around 4 to somewhere between 2 and 3. It will probably, over time, become closer to 2 as I rate more of my collection. This gives me a lot of distinctiveness so that I can create Smart Playlists that work well for me.


    Here are some statistics from your iTunes Library: 6519 tracks, 726 (11%) rated

                                                   Cumulative % of Rated
                           Number  % of rated  Actual  Target  Shortfall
    Tracks rated 5 stars:      74          10      10       5         -5
    Tracks rated 4 stars:     144          20      30      15        -15
    Tracks rated 3 stars:     211          29      59      50         -9
    Tracks rated 2 stars:     270          37      96      90         -6
    Tracks rated 1 stars:      27           4     100

    Obviously rating a large music collection can become a chore -- you don't want to spend your limited music listening time always fine tuning your ratings. So I have some approaches that make it easier for me to rate my music with less effort:

    • First, I sorted my catalog by my old ratings, and moved everything down by 1, starting with everything rated 2 becoming 1, 3 becoming 2, etc. This gave me a good base to start with.

    • Next I created Smart Playlists for each rating, i.e. "Rating 5 - Exemplar" with "Match only checked songs" and "Live updating" checked. I then added "Play Count" as a column to my view, and sorted by it. This gave me the songs that I played the most and least, and I adjusted some songs up and down accordingly.

    • Then I created a new Smart Playlist that simply plays songs rated 3 to 5, limiting the list to the first 100 GB selected at random (i.e. everything, in random order), and saved this Smart Playlist as "Plays Well With Others". I play this on occasion in the background, and when I hear something that jars me I know something isn't rated right. Thus without a lot of effort I can change ratings for songs that no longer fit their rating, or uncheck items where the rating was appropriate but it "didn't play well with others".

    • I try to be aware when I'm using my iPod of what a song's rating is, and change it if it seems wrong. The next time I sync the iPod my ratings will be adjusted in my iTunes catalog.

    • I also try to be aware of Play Count -- this number only goes up if you play a song to the end. So even if I'm not able to take a look at the rating (for instance when I'm in a car), I can at least forward to the next song. Periodically I review the play counts for songs that I've rated and consider moving them up and down accordingly. Of course, this means that I have to be careful and not let the iPod keep running when I'm not listening.

    A tip for those of you that do put a lot of effort into your iTunes ratings: I've learned the hard way that unlike most song information, the rating is NOT stored in the song itself, so if your iTunes database gets corrupted, or you move your music to another server, you'll lose all your ratings. One way to avoid this is to periodically back up your ratings into a field that is stored in the song itself. I personally use the "Grouping" field, as it is rarely used: I select all songs with the same rating, click on "Get Info", and change the Grouping field to "My Rating: 5 Stars".

    I only have 11% of my collection rated so far, but using this system I'm finding it a lot easier to manage my ratings. I'm already getting many benefits from it -- I'm playing my music more often, my iPods typically have the music I want on them, and various music discovery services can use my ratings to help me identify new music I might enjoy. This provides the incentive to keep me entering meaningful ratings.

    Book Ratings - Amazon

    Amazon also uses a 5-Star rating system, and your ratings can be used by Amazon to help you find books that you might like. Though I like to support my local bookstores, it is this feature that brings me back to Amazon time and again. Whenever I browse through Amazon and see a book I've already read I try to take the time to update my rating.

    Amazon has a number of different tools to assist you in your ratings. If you are an Amazon customer, you can go to Improve Your Recommendations: Edit Items You Own and see all the books that you've purchased and quickly rate them with a nice AJAX interface. You can also review items that you've already rated, whether or not you own them, at Improve Your Recommendations: Edit Items You've Rated.

    Amazon has also recently added a very nice web service called Your Media Library that can be used to help manage your media library of books, music, and DVDs. I personally only have used it to manage my books and DVDs, as I find rating albums useless -- it is songs that I prefer to rate.

    After browsing through my ratings to date, I discovered the same flaws I found in iTunes -- my ratings typically were too high; most were a 4. This is particularly encouraged by the popup that appears when your cursor is over the Stars: "1 - I hate it, 2 - I don't like it, 3 - It's Ok, 4 - I like it, and 5 - I love it". I suspect that if I used the same trick that I use for iTunes, making a rating of 2 Stars mean "Ok", I could potentially cause the recommendation engine to be less effective (though it could possibly make it better, I don't know). So instead I am being much more brutal with my ratings and pushing many more down to 3, so that my ratings of 4 and 5 have more meaning.

    5 Stars: These have to be the exemplars -- the best books I've ever read, would be glad to read again, would be proud to show off on my best bookshelf, and will buy extra copies to give to friends.

    4 Stars: These have to be really good books -- most of them I'm willing to read again and I promote them by offering to loan them to my more discriminating friends. Although I may keep them on my bookshelf I'd rather give them to a friend than sell them at a used book store.

    3 Stars: These are decent books, and I do share them with my voracious reader friends. But I don't push them and I'm much more likely to sell them at a used bookstore than keep them on my shelf. This is the rating that I significantly underused previously, and I'm finding that the key discriminator for me so far is how much I feel like recommending a book to friends who are more discriminating readers.

    2 Stars: This rating is where the Amazon rating system fails the most -- these are supposed to be books that "I don't like"; however, most of the time I don't buy books that I probably wouldn't like, much less read them, so I have very few in this category. So I've decided this category is for books that are just not quite good enough, or are slightly disappointing. Not bad, or disliked, but just somewhat disappointing.

    1 Star: This is where I put the books that I don't like, or worse, I hate. Not many here, but I'm willing to risk more than many people are so I have some. Also books go here that just don't fit my interest, like romance novels that get recommended to me because I like some crossover fantasy-romance authors.

    Since I started more accurately rating my books at Amazon, I've found their suggestions for other books to read to be more accurate. Thus I am getting value from rating these books, and I have incentive to continue to make the effort.

    Conclusion

    Offering an incentive for people to rate is important for ratings of all sorts, with both individual gain and status recognition being powerful motivators.

    However the easiest technique for making a 5-point rating scale more useful is to make it "distinct". If a user has a more specific meaning for each rating, ratings will slowly settle toward a truer average, and thus more of each rating scale will be used. We've also tried this technique recently on RPGnet, with our new Gaming Index; and thus far our new 10-point scale -- which has distinct meanings for each number -- is averaging 7.27. That's still a fair amount above the real average of 5.5, but at least it's below the 8+ rating that our old double 5-point scale resulted in.

    Often you, as a consumer of rating systems, will be making use of rating scales designed by others, rather than those you're designing yourself. For those cases it often makes sense to design your own rules for what each number means, and to do so in such a way that your median is the average of the scale, rather than toward one of the extremes. When you do, even if you're using a tight 5-point scale you'll end up with enough differentiation for it to actually be more meaningful than a thumbs up or a thumbs down.


    Related articles from this blog:

  • 2005-12: Systems for Collective Choice
  • 2005-12: Collective Choice: Rating Systems
  • 2006-01: Collective Choice: Competitive Ranking Systems
  • 2007-01: Experimenting with Ratings
    Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:

  • #192: Managing User Creativity, Part One
  • #193: Managing User Creativity, Part Two
  • #196: Collective Choice: Ratings, Who Do You Trust?
  • #198: Collective Choice: More Thoughts About Ratings
    Posted on August 11, 2006 at 08:49 AM in Books, Music, Social Software, User Interface, Web/Tech

    Collective Choice: Competitive Ranking Systems

    by Christopher Allen & Shannon Appelcline

    [This is the third in a series of articles on collective choice, co-written by my colleague Shannon Appelcline. It will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]

    In our first article on collective choice we outlined a number of different types of choice systems, among them voting, polling, rating, and ranking. Since then we've been spending some time expanding upon the systems, with the goal being to create both a lexicon of and a dialogue about systems for collective choice.

    This time we're going to dig more into comparison ranking systems, by focusing on competitive rankings and looking more in depth at the ELO Chess Ranking System and the other systems that we briefly mentioned previously. Our goal is to explicate these systems, to better address their flaws, to begin detailing the purposes of ranking systems, and to show how those purposes are critical in the design of ranking systems.

    Subjective vs. Objective Rankings

    In our original article we discussed rating systems as being largely subjective and ranking systems as being objective, but the situation isn't nearly as simple as that. In truth, there's a clear spectrum of ratings and rankings with varying amounts of subjectivity and objectivity in each collective choice system.

    The Bowl Championship Series (BCS) for college football is a good example of a ranking system that explicitly allows a subjective component. It involves a complex mathematical formula that includes things like win/loss ratios, but also sportswriters' and coaches' ratings.

    However, public opinion continues to show that people don't necessarily like seeing true ranking systems having subjective components, because they expect them to be "fair". The BCS formula has come under attack several times in the last few years precisely due to its subjective basis. Cal Berkeley was one of several teams denied a bowl position in 2004 when many felt that they were worthy.

    The ATP tennis rankings and the official world golf rankings also have a subjective component, but it is much more subtle. Each tournament is worth a certain number of points, and the allocation of those points is relatively arbitrary, based upon the "prestige" of each tournament and the quality of players who have traditionally played in it. The subjectivism isn't quite as near to the surface as that of the college bowls, but it's still something that can have a notable, and perhaps unwarranted, effect upon the final results.

    Algorithmic Rankings

    This brings us back to the ELO system, a ranking system originally designed for chess which is fairly well-known and well-understood. As we said in our overview article, "[ELO] builds a simple distribution of player ratings around a norm (typically 1500 points), then awards or deducts points based upon wins and losses, with the total sum of all points in the system staying constant. Players are then ranked according to their comparative scores."

    The big difference between this and the previously discussed systems is that it's almost entirely objective; in fact it uses a statistical basis to create an underlying mathematical model for rankings, rather than allowing human subjectivity to get in the way.

    The simplest formulation for an ELO rating looks like this:

    R' = R + K * (S - E)

    R' is the new rating
    R is the old rating
    K is a maximum value for increase or decrease of rating (16 or 32 for ELO)
    S is the score for a game
    E is the expected score for a game

    Much of the trick is in figuring out what the (E)xpected score of a game is. ELO uses the following formulas for players A and B:

    E(A) = 1 / [ 1 + 10 ^ ( [R(B) - R(A)] / 400 ) ]
    E(B) = 1 / [ 1 + 10 ^ ( [R(A) - R(B)] / 400 ) ]

    It's a good model because, using the two formulas, it means that a great player gains little from beating an average player, but an average player gains a lot from beating a great player. Take the following example:

    R(A) = 1900
    R(B) = 1500
    E(A) = 1 / [ 1 + 10 ^ ( [1500 - 1900] / 400 ) ]
         = 1 / [ 1 + 10 ^ ( -400 / 400 ) ]
         = 1 / [ 1 + 10 ^ -1 ]
         = 1 / [ 1 + .1 ]
         = 1 / 1.1
         = .91
         = 91%

    E(B) = 1 / [ 1 + 10 ^ ( [1900 - 1500] / 400) ]
         = 1 / [ 1 + 10 ^ ( 400 / 400 ) ]
         = 1 / [ 1 + 10 ^ 1 ]
         = 1 / 11
         = .09
         = 9%

    Player A is expected to score .91 in an average game, which is to say he should win 91% of the time, and will be punished accordingly if he loses to player B:

    R' = 1900 + 32 * (0 - .91)
    R' = 1900 - 29.12
    R' = 1871

    Conversely a win nets him very little:

    R' = 1900 + 32 * (1 - .91)
    R' = 1900 + 32 * .09
    R' = 1900 + 2.88
    R' = 1903
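    The same arithmetic can be checked with a few lines of Python. This is a minimal sketch of the two formulas above, not any federation's official implementation; the function names are ours, and K defaults to 32 as in the example.

```python
def expected_score(rating, opponent_rating):
    """E = 1 / [ 1 + 10 ^ ( [R(opponent) - R(player)] / 400 ) ]"""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def updated_rating(rating, opponent_rating, score, k=32):
    """R' = R + K * (S - E), where score is 1 for a win and 0 for a loss."""
    return rating + k * (score - expected_score(rating, opponent_rating))


print(round(expected_score(1900, 1500), 2))   # 0.91
print(round(updated_rating(1900, 1500, 0)))   # 1871 after a loss
print(round(updated_rating(1900, 1500, 1)))   # 1903 after a win
```

    Because E(A) + E(B) = 1, whatever one player gains the other loses when both use the same K, which is what keeps the total number of points in the system constant.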

    ELO is almost entirely mathematical. Players can gain or lose different amounts of points based upon playing different players, but this is all part of the formula. The only slightly subjective element is the definition of K -- how much a player can win or lose from a particular game. The most widely used ELO systems for Chess break K down into two values: 16 for masters and 32 for everyone else. So there is a subjective decision that masters' scores should change less per game than those of other players.

    That's a very minor element in an otherwise objective system, but as we'll see, more recent systems by Days of Wonder and Microsoft first reduce, then eliminate even this subjectivity.

    Variations of a Theme: Days of Wonder

    ELO is probably the most used ranking system in the world. You can find it in use for Go, Tantrix, and many other games. Days of Wonder, producers of Gang of Four, Ticket to Ride, and many other games, use a variant of the system which they describe on their website.

    They identify three core problems with ELO:

    1. New players can take a long time to ascend or descend to their correct levels.
    2. Highly ranked players can be hesitant to play with provisional players whose ranking might be much more uncertain.
    3. There are no allowances for games with more than two players.

    Days of Wonder resolved the first problem by creating a new formula for provisional players, allowing them to rise and fall in the rankings much more quickly.

    Conversely, when playing against provisional players, regular players can only lose a maximum of K*n/20 points, where n is the number of games that the provisional player has played--rather than the normal maximum loss of K. For example, playing someone who has just played one game can only result in a loss of 1/20th of the regular K value, and so it really doesn't matter if the provisional player's ranking is wildly out of whack.

    Both of these new formulas are set up to converge toward a normal ELO formula as a provisional player's number of games approaches 20 (making them a normal player at Days of Wonder).
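    The cap on losses is simple enough to sketch in Python. This only illustrates the K*n/20 limit described above; the function name is hypothetical, and capping n at 20 (so that the limit converges to the normal K) is our reading of the description, not Days of Wonder's published code.

```python
PROVISIONAL_GAMES = 20  # games after which a player is no longer provisional


def max_loss(k, opponent_games):
    """Most points a regular player can lose to an opponent with n games played."""
    n = min(opponent_games, PROVISIONAL_GAMES)
    return k * n / PROVISIONAL_GAMES


for games in (1, 5, 10, 20):
    print(games, max_loss(32, games))
```

    With K = 32, a loss to a player with one game on record costs at most 1.6 points, rising in 5% steps to the full 32 at twenty games.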

    (It should be pointed out that using the number "20" to define a provisional player, and making a player less provisional in clean 5% steps, inevitably offers yet another small, subjective element into this mathematical formula; as we'll see momentarily Microsoft has more recently incorporated the idea of provisional uncertainty into their core mathematical model, much as the whole ELO system originally turned subjective win and loss statistics into tighter mathematics.)

    Finally, to resolve the situation of multiple players, Days of Wonder considers each game to be a set of duels, as described here:

    There are 4 players in a Gang of Four game. Let's name A the winning player, B the second one, C the third one and D the last one. We consider that there were 6 duels: A won against B, C and D. B won against C and D. C won against D. We compute independently the new scores for each duel, and then we average the values for each player.

    It's a fairly elegant answer that not only rewards or penalizes all players separately, but also encourages playing for second place, or even third, if first isn't possible.
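    Here is a minimal Python sketch of the duel approach, using the plain ELO formulas from earlier in this article. It is an illustration under stated assumptions rather than Days of Wonder's actual code: the function names are ours, every duel uses the same K, and each duel is scored from the players' ratings before the game.

```python
def expected_score(rating, opponent_rating):
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def multiplayer_update(ratings, k=32):
    """ratings: pre-game ratings listed in finishing order (winner first).

    Each player fights one duel against every other player. The new rating
    from each duel is computed independently, then averaged per player.
    """
    new_ratings = []
    for i, rating in enumerate(ratings):
        duel_results = []
        for j, opponent in enumerate(ratings):
            if i == j:
                continue
            score = 1.0 if i < j else 0.0  # finishing ahead counts as a win
            duel_results.append(rating + k * (score - expected_score(rating, opponent)))
        new_ratings.append(sum(duel_results) / len(duel_results))
    return new_ratings


print(multiplayer_update([1500, 1500, 1500, 1500]))
```

    With four equally rated players at 1500 and K = 32, the winner ends at 1516 and the last-place player at 1484, while second and third place land in between, which is why finishing second is still worth fighting for.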

    There have been continued discussions of the Days of Wonder ELO variant in their forums, and the questions raised there are common to many different ranking systems. Some players wanted unranked games, while others thought that having unranked games would discourage people from playing good competitors except in unranked games.  There has also been a lot of discussion regarding Ticket to Ride, a strategy game that supports 2-5 people, and whether the ELO variant system discourages multiperson play.

    The various lessons learned at Days of Wonder underline two basic ideas about rankings. First, even with a well-studied system like ELO, there's still a lot to understand, and, second, any ranking system needs to reflect the specifics of what it's ranking -- and what its purpose is.

    Variations on a Theme: XBox 360 Live

    An even more recent large-scale ranking system is the TrueSkill system developed by Microsoft for use with the XBox 360. It appears to be an expanded variant of the Glicko ranking system used by the Free Internet Chess Server.

    Many of the problems identified by Microsoft were the same as those already noted by Days of Wonder and others, including: the uncertainty of provisional ratings and the need to rank players in multiplayer games. However, the TrueSkill system notably expands both issues. Ranking uncertainty is now defined as a mathematical concept and the rankings now support not just multiple players, but also multiple teams.

    TrueSkill explicitly includes two values in any ranking: a skill level and an uncertainty level. The first, like the more common ELO ranking, tells how good a player is. The second states how sure that ranking is. The uncertainty rating is effectively a margin of error, similar to those we saw in polling systems. If a first-time player has a skill rating of 25 with an uncertainty rating of 8.3, that means that his skill is probably somewhere in the range of 16.7 to 33.3, a pretty wide range, but then this is a totally untested player. According to benchmarks that Microsoft produced, 99.99% of actual skill levels were within 3x of the uncertainty rating, and 100% were within 4x.
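    The range arithmetic can be written out in a couple of lines of Python. This sketch covers only the margin-of-error reading described above, not the TrueSkill update itself; the function name and the `multiple` parameter are hypothetical.

```python
def skill_range(skill, uncertainty, multiple=1):
    """Range of plausible skill: skill plus or minus multiple * uncertainty."""
    margin = multiple * uncertainty
    return skill - margin, skill + margin


low, high = skill_range(25, 8.3)
print(round(low, 1), round(high, 1))  # 16.7 33.3
```

    Passing multiple=3 or multiple=4 gives the wider bands that the Microsoft benchmarks refer to.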

    The rest of TrueSkill's innovations are built around this model of uncertainty. All players win or lose skill points, based upon how many players they beat or lose to, and they also decrease their uncertainty rating as they play more games. However, uncertainty is decreased more for players toward the middle of a pack within a game than those around the edges (because on the edges the players could actually be much better or much worse than it is possible to see from a specific game). In addition, TrueSkill is only a zero-sum ranking system for players at the exact same level of uncertainty. The more uncertainty that an opponent possesses, the smaller the weighting of any gain or loss (much like the simpler system that Days of Wonder uses, which bases weightings of games against provisional players as n/20).

    Overall TrueSkill is a somewhat complex system that is described more fully at Microsoft's web site. Some of their expansions had already been considered by others, but still their system is notably innovative in two ways:

    • Expanding a competitive ranking system to include concepts of teams.

    • Incorporating the uncertainty of ratings further into the core mathematical model, rather than using a somewhat more subjective model such as that described by Days of Wonder for provisional players.

    The TrueSkill calculations are a bit complex. In general, that's not a problem for a computer-based ranking model because you can have a computer doing all the computations, and players only need to understand the results. However the two-part ranking system used by TrueSkill, which notes both skill level and uncertainty, does offer a potential problem on this latter point. Can players understand it? In general, the concept of uncertainty will not be understood by people other than statisticians, thus raising a real user-interface question with the TrueSkill system -- and the exact sort of thing that designers of new ranking systems will need to consider.

    Variations on a Theme: A Tale in the Desert

    The online game A Tale in the Desert identified a different problem with the ELO system: cheating. This is a uniquely Internet-based problem, because on the Internet users can create fake accounts, then defeat those accounts to win points. This can also be done more subtly, by having multiple additional accounts build up the rating of that fake account before the fake account is defeated. So a totally new ranking system, called the eGenesis Ranking System, was created.

    Each player is ranked through a 256-bit vector, half of which is initially set to 0 and half of which is set to 1 (therefore creating an average ranking of 128). Whenever a match occurs between players a hash function based on the players' names mathematically selects 32 of those bits, 8 of which are then randomly selected. Among those bits, any 1s in the loser's vector which correspond to 0s in the winner's vector are "transferred".

    This simple design corresponds in some ways to ELO's more complex formula. A good player will have more 1s and thus more to lose, and he will lose correspondingly more to a poor player who has more 0s in his vector.

    However, the system also prevents the collusion noted earlier. Statistically, a single player will only gain about 8 ranking points from another new player, since of the 32 bits selected by the hash only eight will, on average, be in the correct 0-1 configuration. Expanding a group of players expands the number of points that can potentially be gained, but within real limits.
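
    The mechanics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not eGenesis's actual code: the helper names (new_vector, pair_positions, play_match, rating), the use of SHA-256 to pick the 32 positions, and the random placement of the initial 1s are all hypothetical choices made for the example.

```python
import hashlib
import random

VECTOR_BITS = 256     # size of each player's ranking vector
HASH_BITS = 32        # positions selected by hashing the two names
BITS_PER_MATCH = 8    # positions randomly examined in a single match


def new_vector(rng):
    """Start with half 1s and half 0s (placement is an assumption)."""
    bits = [1] * (VECTOR_BITS // 2) + [0] * (VECTOR_BITS // 2)
    rng.shuffle(bits)
    return bits


def pair_positions(name_a, name_b):
    """Pick the same 32 positions every time these two names meet."""
    key = "|".join(sorted([name_a, name_b])).encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    return random.Random(seed).sample(range(VECTOR_BITS), HASH_BITS)


def play_match(winner_name, winner, loser_name, loser, rng):
    """Transfer 1s from loser to winner on 8 of the pair's 32 positions."""
    chosen = rng.sample(pair_positions(winner_name, loser_name), BITS_PER_MATCH)
    moved = 0
    for i in chosen:
        if loser[i] == 1 and winner[i] == 0:
            loser[i], winner[i] = 0, 1
            moved += 1
    return moved


def rating(vector):
    """A player's rank is simply the number of 1s in the vector."""
    return sum(vector)
```

    Because the 32 candidate positions are fixed for a given pair of names, beating the same fake account over and over can never transfer more than 32 bits, and against a fresh account only about a quarter of those positions (roughly 8) are in the transferable configuration.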

    In fact, the eGenesis system prevents cheating by measuring the size of social networks, then limiting the number of ranking points that can be earned within a social network. It's not necessarily the only way to measure social network size, but its methodology points toward social software as an interesting area for additional study of ranking systems.

    As with XBox's TrueSkill, the eGenesis algorithms are overall fairly sophisticated and confusing, perhaps more so than TrueSkill itself. However, unlike TrueSkill the output is very simple: a skill number between 0 and 255. The intricacies are hidden by the system.

    Competitive Ranking Goals

    Ultimately, as we mentioned when discussing Days of Wonder, any ranking system has to be measured by what it's trying to do and how well it does that. ELO and similar numerical, long-term ranking systems are most likely trying to achieve one of three goals:

    Hierarchy: Players are divided into hierarchies of success, giving players goals to constantly strive for and ways to measure their success (or failure).

    Matching: Players can play with other players at their same skill level, rather than having to play beginners who are much weaker or experts who are much stronger than they are. This generally increases everyone's enjoyment. For computer games, the complexity of a matching system can be largely handled by the computer, thus ensuring better competition.

    Handicapping: If players do play against others of different skill levels, the better players can be handicapped in automatic, appropriate ways for the game in question, again increasing the fairness of games and everyone's enjoyment. For instance, someone ranked 3-kyu in Go playing a less experienced 7-kyu player would give him a starting 4 stone advantage to make for better competition.
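
    The Go example above is simple enough to write down directly. The following is a minimal sketch, assuming kyu ranks where a lower number means a stronger player and one handicap stone per rank of difference; the function name handicap_stones and the cap of nine stones (the traditional maximum) are choices made for this illustration.

```python
def handicap_stones(stronger_kyu, weaker_kyu, max_stones=9):
    """Stones given to the weaker player: one per kyu rank of difference.

    Lower kyu numbers are stronger, so a 3-kyu outranks a 7-kyu.
    max_stones is an assumed cap, not something stated in the article.
    """
    if stronger_kyu > weaker_kyu:
        raise ValueError("stronger_kyu must be the lower (better) kyu number")
    return min(weaker_kyu - stronger_kyu, max_stones)


# The article's example: a 3-kyu player gives a 7-kyu player 4 stones.
print(handicap_stones(3, 7))
```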

    The ELO system may be a good matching system, which allows players to easily find other players of their same skill level and play against them. However it doesn't provide any way to handicap players, nor would the ELO method necessarily be a good one to analyze handicaps (and conversely a golf handicap might not do a good job of finding like players nor measuring players' ability in a hierarchy).

    More recently the XBox system has stated that it's explicitly for matchmaking, with the goal being to always try to match up players at nearly the same skill level. It's also used for hierarchy (or "leaderboards" as it's described in the TrueSkill docs), but that's clearly a subsidiary purpose.

    All of these systems would be ineffective for measuring a winner in a live event, which is a very different goal:

    Tourney: A single player is listed as an absolute winner, the "King of the Hill". Often, second, third, and fourth place winners are measured too.

    And, the systems we've discussed thus far may not be useful for measuring privileges, yet another goal:

    Threshold: The best ranks of players can be given special privileges, including the ability to create games and form tournaments. Alternatively, they can be given privileges totally outside the game, again giving them something extra to strive for.

    For each of these additional goals we may need to consider very different ranking systems, not just variations of ELO.

    Different Themes: Tourneys

    There are a number of well-known tournament types which can be used to create a "King of the Hill" ranking.

    The simplest is the single-elimination tournament, where the winner of each competition moves on to compete with other winners, until there is only one. However, this style of tournament is quite cut-throat and is not suited very well to events where the competition may result in a draw, or where chance is a notable factor in the competition. It also has a very subjective factor in the initial seeding of the rounds. The single-elimination tournament also does not rank the losers. However, by having the losers compete with each other in a Swiss-style tournament, the relative strengths of the players can be ranked.
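
    A single-elimination bracket can be sketched as follows. This is an illustrative sketch only: the function name single_elimination, the requirement that the field be a power of two, and the deterministic "stronger player always wins" rule in the usage lines are assumptions made for the example, and the order of the input list stands in for the subjective seeding mentioned above.

```python
def single_elimination(players, play_match):
    """Run a single-elimination bracket.

    players: list in seeded order; its length must be a power of two.
    play_match: function taking two players and returning the winner.
    Returns the champion and the list of winners from each round.
    """
    n = len(players)
    if n < 1 or n & (n - 1):
        raise ValueError("number of players must be a power of two")
    rounds = []
    current = list(players)
    while len(current) > 1:
        # Adjacent entries meet; only the winners advance.
        current = [play_match(current[i], current[i + 1])
                   for i in range(0, len(current), 2)]
        rounds.append(current)
    return current[0], rounds


# Hypothetical field of eight with made-up strengths.
strength = {"Ann": 5, "Bo": 3, "Cy": 8, "Di": 1,
            "Ed": 7, "Flo": 2, "Gil": 6, "Hal": 4}
champion, rounds = single_elimination(
    list(strength), lambda a, b: a if strength[a] > strength[b] else b)
print(champion, rounds)
```

    Note that the result only names a champion: the four players knocked out in the first round are indistinguishable from one another, which is the missing ranking of losers described above.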

    An improvement is the double-elimination tournament, which is now one of the best known tournament systems in sports. Players compete in a series of two-player matches, and a player has to lose twice before he's eliminated. This is done through a system of winner and loser brackets, wherein people drop from the winners' brackets to the losers' brackets when they lose once, and drop out altogether when they lose twice.

    One problem with standard double-elimination is that there are unusual situations where a significantly inferior player can still make it to the final round, or where the last undefeated player can lose only once and still be eliminated. These can be addressed through variants such as face-off (requiring the last two remaining competitors to compete again if the undefeated competitor is defeated for the first time in the finals) or by reconfiguring the losers' brackets.

    Round-robin tournaments, such as official Scrabble tournaments, involve every player playing a set number of games (24 in the 2005 World Scrabble Championship), facing opponents with similar win-loss records. They then ultimately rank players by their win-loss ratios.

    The advantage of these sorts of tournaments over an ELO-style ranking is that they're easily understandable and seem fair. In addition, they measure ranking in a much more topical manner: how well someone is playing at a single event, rather than over a longer career. As a result they work much better for a live tournament.

    Different Themes: Thresholds

    As we discussed in our original article on Collective Choice, thresholds are ranking barriers above which members get a special ability--or alternatively levels below which members lose a special ability. They can also act as another goal for a ranking system.

    In the game of Go there are both amateur and professional players. Although they aren't technically in the same hierarchy of rankings, the highest Go amateur ranking (7 dan) is approximately equal to the lowest Go professional ranking (1 dan), forming a de facto threshold.

    Likewise the United States Chess Federation uses its ELO rankings to denote Chess Masters. Anyone who achieves a USCF rating of 2200 is given a National Master threshold ranking and anyone who maintains it for 300 games is given a Life Master threshold ranking.
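
    A threshold layered on top of a numerical rating is easy to express in code. The sketch below follows the description above and is not the federation's official procedure: the function name master_title is hypothetical, and reading "maintains it for 300 games" as 300 consecutive games finished at 2200 or higher is an assumption.

```python
NATIONAL_MASTER_RATING = 2200   # threshold named in the text
LIFE_MASTER_GAMES = 300         # games the rating must be maintained


def master_title(rating_history):
    """Return the threshold title earned from a list of post-game ratings.

    Assumes "maintained" means consecutive games at or above the threshold.
    """
    reached = False
    streak = best_streak = 0
    for rating in rating_history:
        if rating >= NATIONAL_MASTER_RATING:
            reached = True
            streak += 1
            best_streak = max(best_streak, streak)
        else:
            streak = 0
    if best_streak >= LIFE_MASTER_GAMES:
        return "Life Master"
    if reached:
        return "National Master"
    return None
```

    The point of the sketch is that the threshold itself adds almost no machinery; all of the difficulty lives in the underlying rating.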

    The American Contract Bridge League uses a threshold system where you have to win a certain number of tournaments and thus earn masterpoints in order to achieve official rankings such as "Sectional Master". Furthermore, players may earn different "colors" of masterpoints depending on the difficulty of the tournament, and some ranks require that you earn at least some specific colored masterpoints in order to meet the requirements for the next threshold.

    These thresholds are fairly explicitly based on other hierarchical ranking systems, but this doesn't need to be the case. Since determining the purpose of a ranking system is often the first step in designing it, as we delve further into the area of thresholds we may well find that systems specifically designed to measure thresholds do so better.

    In our next article we'll consider, among other things, the Advogato reputation system, which manages thresholds in such a way as to prevent cheating.

    Conclusion

    There's actually a lot of variety in ranking systems, and even though we'd like them to be totally objective, various subjective elements often creep into these systems. In addition, there's a lot of variety in what ranking systems can do. For competitive systems, hierarchy, privilege, matching, and handicapping are some of the top purposes of ranking. Determining what a ranking system is going to do is a necessary first step in designing the system, as different systems will accomplish various goals to a better or worse degree.

    ELO, in several variants, is the best studied and most used competitive ranking system. It works particularly well as a matching system. However, even ELO has flaws in it, among them: issues with new player rankings; its core two-player basis; its lack of provisions for teams; a few minor subjective elements; and problems with cheaters. New systems continue to be rolled out on the Internet to resolve these issues, and overall, it's an area of interesting new study.

    Tournament systems and threshold systems offer a few good examples of competitive ranking systems with very different purposes, underscoring the need to understand what you're doing before you do it.

    Ranking systems also lie very near yet another type of Collective Choice: reputation systems. We briefly addressed reputation systems when talking about threshold systems and will return to this in our next article.


    Related articles from this blog:

  • 2005-12: Systems for Collective Choice
  • 2005-12: Collective Choice: Rating Systems
  • 2006-08: Using 5-Star Rating Systems
  • 2007-01: Experimenting with Ratings

    Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:

  • #192: Managing User Creativity, Part One
  • #193: Managing User Creativity, Part Two
  • #196: Collective Choice: Ratings, Who Do You Trust?
  • #198: Collective Choice: More Thoughts About Ratings

    Posted on January 3, 2006 at 11:37 PM in Politics, Social Software, User Interface, Web/Tech

    Collective Choice: Rating Systems

    by Christopher Allen & Shannon Appelcline

    [This is the second of a series of articles on collective choice, co-written by my colleague Shannon Appelcline. It will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]

    In our previous article we talked about the many systems available for collective choice. There are selection systems, which are primarily centered on voting and deliberation, opinion systems, which represent how voting could occur, and finally comparison systems, which rank or rate different people or things in a simple, comparative manner.

    One purpose of our previous article was to create a dictionary of terms for talking about these related, but clearly different, systems. Another was to start offering analyses of these systems, many of which had not been well studied before their introduction onto the Internet.

    However, at best our previous article provided an overview of what should be further investigated in each system. This article provides more in-depth coverage of one of the systems we previously outlined: rating systems.

    As we wrote in our previous article, in comparison rating systems "the value of individual items (most frequently goods) rise or fall based upon the largely subjective judgment of individual users." Ratings systems should be clearly differentiated from the closely related ranking systems. Ratings systems have a more subjective component, while ranking systems are largely objective. Amazon, Netflix, BoardGameGeek, and even the Stock Market were offered up as examples of ratings systems. Another example of a comparison rating system, and one of the earliest that appeared on the modern Internet, is eBay. The techniques they use are now beginning to show their age.

     

    eBay: A Failed Rating Experiment

    Most rating systems center around rating content, often user-contributed content, and they frequently help apply community values and acclaim to that content. However, the idea of ratings can go far beyond that narrow niche (though that will doubtless be its greatest use as the Internet continues to expand). eBay, an early Internet site, was one of the first to widely use user-submitted ratings, and it used them for a different purpose: to identify the good traders on its auction site.

    Unfortunately, as one of the first in this field, eBay made many mistakes which now leave their ratings system only slightly helpful. However, its failures can also provide us with insights in creating new rating systems on the Internet.

    eBay allows you to leave positive, negative, or (more recently) neutral feedback for each transaction you conduct in their society. These are aggregated into two numbers. "Feedback Score" is calculated as unique positive feedback received minus unique negative feedback received, and results in a whole number like "32" or "10,302". "Positive Feedback" is calculated as positive feedback received divided by all feedback received, and results in a percentage like "100%" or "99.8%".

    Unfortunately, for reasons discussed below, almost all feedback is positive, and thus the Feedback Score acts almost entirely as a track record of how many trades someone has made. The Feedback Score could be largely replaced by that single number. You can look at a score of "27", and say, "That's an amateur trader, or someone just getting started", at a score of "3", and say, "That person may or may not know what they're doing", at a score of "10,302", and say, "That person has done a lot of trades." But you still don't know how good the trader is.

    Theoretically, the Positive Feedback percentage should give a more meaningful number, but people so infrequently give bad ratings that, even when they do appear, they look like noise. Does a percentage of "99.8%" on a user with a score of "1,762" mean that the seller has a genuine problem or not? Do those 3 unhappy customers really represent another 30 who were unwilling to actually click the negative feedback? And did those people have slightly bad experiences or really bad experiences? It's pretty hard to say.
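
    The two numbers described above are simple arithmetic, which a short sketch makes concrete. The function names are hypothetical, and the raw counts in the example (1,765 positive and 3 negative, with no neutral feedback) are assumed values chosen only because they reproduce a score of "1,762" and a percentage of "99.8%"; the sketch also takes "all feedback received" to include neutral feedback.

```python
def feedback_score(unique_positive, unique_negative):
    """Feedback Score: unique positives minus unique negatives."""
    return unique_positive - unique_negative


def positive_feedback_pct(positive, negative, neutral=0):
    """Positive Feedback: positives as a percentage of all feedback."""
    total = positive + negative + neutral
    if total == 0:
        return None  # no feedback yet, so no percentage to report
    return 100.0 * positive / total


# Assumed counts that match the example figures in the text.
score = feedback_score(1765, 3)
pct = positive_feedback_pct(1765, 3)
print(score, f"{pct:.1f}%")
```

    With numbers like these, the score is dominated by the count of positive trades and the percentage barely moves, which is why neither figure says much about how good a trader actually is.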

    Overall, eBay has a few major problems with their rating system:

    • It's non-granular, with only two options (positive/negative), or more recently three (positive/negative/neutral).

    • It's non-distinct, with no useful guidelines on what behaviors should result in each rating.

    • It's non-statistical, and thus ends up showing only a gross number of sales, not a real subjective measure.

    • It's bilateral, with buyers and sellers rating each other simultaneously, and thus people are afraid to give bad ratings lest they get them in return.

    • It's meaningless, because there are no good tools to control who bids on an auction based on Feedback numbers. (Technically it may be legitimate to ban low feedback bidders from an auction, then cancel their bids if they enter the auction, but this is neither obvious, automatic, nor simple.)

    We're going to address each of these issues in turn, to offer insight into creating new comparison rating systems. The first three topics--granularity, distinction, and a statistical basis--are the most important elements of a good comparison rating system. Bilateral & meaningfulness issues will only be relevant on certain sites.

    (As a final caveat: in some ways eBay falls closer in ultimate result to a reputation system, a topic which we'll be covering more in a few articles down the road, but its lessons learned are still entirely accurate for rating systems of all sorts.)

     

    Granular Ratings

    In general, people want to be nice. There are exceptions to that rule, perhaps even great numbers of them, but the average, well-adjusted person would prefer to make other people happy, not sad.

    This has a notable effect on any comparison rating system, because it means that people are less likely to use the bottom half of any rating scale. If you did a statistical run on eBay, you'd certainly find that more than 99 out of every 100 ratings are positive. This is largely influenced by concerns of bilateral revenge, as discussed below, and the fact that eBay suggests other means of dispute resolution when you try to leave negative feedback. However, RPGnet, a roleplaying site which reviews games, comics, books, movies, and more, shows a similar trend despite the lack of bilaterality.

    RPGnet uses two 5-point scales for reviews, resulting in a total rating of 2-10. Of all the ratings at RPGnet, 6,983 reviews have a total that's above average, a total rating of 6 or more, and 795 have a total that's below average, a total rating of 5 or less. Perhaps there are more people who sit down to write a review because they really like a game than those who do so because they really hated it, but the result of ~90% of reviews being above average is still stunning.

    The following table shows all the ratings for each of the two categories that RPGnet uses, "Style" and "Substance":

    Rating   Style   Substance   % of all ratings
    1           73         210    1.8%
    2          687         590    8.2%
    3         2127        1583   23.8%
    4         3337        3242   42.2%
    5         1554        2153   23.8%

    This evidence confirms what we'd already suspected. Only 10% of raters use the bottom two ratings in a 5-point scale, and only 2% use the bottom rating. The median of the 5-point scale is actually the fourth point, with a neat bell curve arranged around it.
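
    The summary figures above can be checked directly from the table. The short sketch below uses only the Style and Substance counts listed there; the helper names share and median_rating are introduced just for this illustration.

```python
# Counts per rating, copied from the table above.
style = {1: 73, 2: 687, 3: 2127, 4: 3337, 5: 1554}
substance = {1: 210, 2: 590, 3: 1583, 4: 3242, 5: 2153}

# Combine both categories, as the percentage column does.
combined = {r: style[r] + substance[r] for r in style}
total = sum(combined.values())


def share(ratings):
    """Percentage of all ratings that fall on the given scale points."""
    return 100.0 * sum(combined[r] for r in ratings) / total


def median_rating(counts):
    """Scale point at which the running total first reaches half of all ratings."""
    half = sum(counts.values()) / 2
    running = 0
    for r in sorted(counts):
        running += counts[r]
        if running >= half:
            return r


print(round(share([1, 2])))     # share of the bottom two ratings
print(round(share([1])))        # share of the bottom rating alone
print(median_rating(combined))  # median point on the 5-point scale
```

    Running the same calculation on any other site's rating counts is a quick way to see how much of the scale its users actually use.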

    Because users are innately unwilling to give bad ratings, as evidenced here, useful comparison ratings truly come about only through fractional differences between good ratings. In this case, the difference between "3", "4", and "5" is meaningful, and becomes more meaningful as more ratings are entered. Eventually you can look at a ranked list of ratings and see that "4.2" is a good rating while "3.5" is not.

    In order to do this, however, you need enough levels of good ratings to be able to distinguish between them. eBay, only offering one positive rating, does not provide enough differentiation. RPGnet, with its three positive ratings, might. However, sites that offer a 10-point scale are the ones that really seem to be able to produce meaningful statistics. On those sites we can expect that 90% of users will choose between six different numbers, from "5" to "10", and as the number of ratings builds up, this will produce enough differentiation to be meaningful. If you have already adopted a 5-point scale, consider allowing users to select the half-points, giving users a greater ability to differentiate their ratings.

     

    Distinct Ratings

    No two users are ever going to rate the same; different rating numbers will mean different things to each person. This can introduce minor discrepancies into ratings, if a single individual rates particularly low or high. However, because most ratings are eventually used for comparisons, if that lo