My colleague, Shannon Appelcline, has been working on a game rating system for RPGnet. This has resulted in real-world application of the principles for designing rating systems which we've previously discussed in our Collective Choice articles. Shannon's newest article, Ratings, Who Do You Trust? offers a look at weighting ratings based on reliability.
On the RPGnet Gaming Index we've put this all together to form a tree of weighted ratings that answer the question, who do you trust?
Here's how we measured each type of trust, and what we did about it:
Volume of Ratings for an Item. Introduce a Bayesian weight to offset the variability of items with low-volume ratings.
Volume of Ratings by a User. Give each user a weight, based on his volume of contribution, which is applied to his ratings.
Depth of Content by a User. Give each rating a weight based on the depth of thought implicit in that rating.
These all get put together to create our final ratings for the Gaming Index: each user's individual rating for an item is multiplied by that user's weight and the rating's content weight, and the results are averaged together with the other users' ratings and blended with the Bayesian weight. The result is in no way intuitive, but users don't really need to understand the back end of a rating system. What we do hope is that it's accurate, or at least more accurate than it would otherwise be given the relatively low volume of ratings we've collected thus far.
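To make the mechanics concrete, here is a minimal sketch of a weighted average of this sort -- not the actual RPGnet formula, just an illustration with made-up parameter names and values, where a Bayesian prior pulls low-volume items toward a global mean:

```python
def weighted_item_rating(ratings, prior_mean=5.0, prior_weight=10.0):
    """Hypothetical sketch: combine per-user ratings into one item score.

    ratings: list of (score, user_weight, content_weight) tuples.
    prior_mean / prior_weight: a Bayesian prior that damps items with
    few ratings toward a global average (both values are illustrative).
    """
    weighted_sum = prior_mean * prior_weight
    total_weight = prior_weight
    for score, user_weight, content_weight in ratings:
        w = user_weight * content_weight
        weighted_sum += score * w
        total_weight += w
    return weighted_sum / total_weight

# Example: three ratings on a 10-point scale from users of varying trust.
print(weighted_item_rating([(9, 1.0, 1.5), (8, 0.5, 1.0), (3, 0.2, 0.5)]))
```

An item with only a handful of ratings stays close to the global mean; as trusted, in-depth ratings accumulate, they come to dominate the prior.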
Here are some of Shannon's earlier discussions about the design behind the new "user content" based RPGnet Gaming Index:
Related articles from this blog:
2005-12: Systems for Collective Choice
2005-12: Collective Choice: Rating Systems
2006-01: Collective Choice: Competitive Ranking Systems
2006-08: Using 5-Star Rating Systems
2007-01: Experimenting with Ratings
Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:
If you'd like to follow along, here is a PDF copy of my presentation slides (10 MB).
The biggest addition to what I've written about before is some discussion of different kinds of social software and what size groups each seems to be appropriate for.
Some other posts about the Dunbar Number and group size issues:
In Collective Choice: Rating Systems I discuss rating scales of various sorts, from eBay's 3-point scale to RPGnet's double 5-point scale and BoardGameGeek's 10-point scale.
Of the various ratings scales, 5-point scales are probably the most common on the Internet. You can find them not just in my own RPGnet, but also on Amazon, Netflix, and iTunes, as well as many other sites and services. Unfortunately 5-point rating scales also face many challenges in their use, and different studies suggest different flaws with this particular methodology.
First, one study using Amazon data has shown that many undetailed ratings (where the rater isn't required to add any information other than the rating itself) show a bimodal distribution. In other words, the ratings tend to cluster around two different numbers (e.g., 1 and 5) rather than forming a normal distribution where they cluster around a single value (e.g., 3). Thus the median of these ratings is not an accurate reflection of product quality, but instead a statement of conflicting opinions.
Second, our own study using RPGnet data has shown that many detailed ratings (where the rater does add additional information, in this case a full review) offer normal distributions; however, they are biased toward the high end of the scale. On RPGnet, for example, we discovered that 90% of ratings on this 5-point scale were 3 or higher, with an average around 4.
Randy Farmer of Yahoo suggests that this scale limitation is particularly troublesome for fan-based ratings, such as those found on episodic TV sites:
Only the fans of a show evaluate the episodes, and being fans, will never rate an episode one or two stars, ever. I've seen this attempted over and over on the net with the same results every time: Each episode of a show is 4-stars +/- .5 stars. This goes all the way back to the Babylon-5 website, probably the first source for this kind of data.
Thus even when a bimodal distribution is not a problem, on a 5-point scale the upward bias often results in only 2 or 3 meaningful data points. This is problematic because it minimizes differentiation. In many cases, a 5-star rating system where most of the ratings are either 3 or 4 is actually no better than just a thumbs-up/thumbs-down rating system.
However, given that 5-point scales are probably here to stay, we are forced to make the best use of them we can.
First, we need to provide raters with incentives, so that they provide meaningful ratings. We've already seen that this can be done by requesting detailed ratings: when a person takes the time to write text, and knows that his name will be attached to it, he generally does a better job in his rating. There are other possible incentive techniques as well, such as RPGnet's new XP System.
Second, we need to provide means for a 5-point scale to become more meaningful by encouraging raters to use not just the top half of the scale, but the bottom half as well. One method to accomplish this is to make ratings distinct -- as I briefly mentioned in my previous article on this topic -- and encourage standards so that an "average" rating is 2 or 3, not 4.
As an example of how to accomplish both of these goals with already existing 5-point rating scales, I've detailed my own experiences with using ratings on two popular services -- iTunes and Amazon. By providing myself with incentives and making my use of ratings very distinctive, I have created more meaningful and useful output for myself.
Apple's iTunes software offers you the ability to rate individual songs with a 0-5 Star rating. If you use iTunes with an iPod, you can change the rating of a song on your iPod and the change will be reflected in your iTunes database the next time you sync your iPod. The "Shuffle Songs" feature available on more modern iPods has an option to have songs with higher ratings be played more often. A very powerful feature, Smart Playlists, can dynamically create sophisticated playlists based on ratings. All of this makes rating music on iTunes very useful.
After Shannon and I wrote our Rating Systems article, I examined the ratings in my iTunes catalog. Using Alastair's fabulous XSLT iTunes rating statistics tool, I discovered that the ratings I had created in iTunes were clearly biased too high, matching the pattern we'd described. I had far too many songs rated with 4 Stars, and almost nothing rated 1 or 2. This made my ratings less useful.
Here are some statistics from your iTunes Library: 4172 tracks, 412 (10%) rated

|  | Number | % of rated | Cumulative % (actual) | Cumulative % (target) | Shortfall |
|---|---|---|---|---|---|
| Tracks rated 5 stars | 112 | 27 | 27 | 5 | -22 |
| Tracks rated 4 stars | 183 | 44 | 72 | 15 | -57 |
| Tracks rated 3 stars | 92 | 22 | 94 | 50 | -44 |
| Tracks rated 2 stars | 22 | 5 | 99 | 90 | -9 |
| Tracks rated 1 star | 3 | 1 | 100 | | |
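For those curious how a distribution like this can be tallied, here is a rough sketch (not Alastair's actual tool) that reads an exported iTunes library XML with Python's plistlib. iTunes stores ratings on a 0-100 scale (20 points per star); the file path is just an assumption for illustration:

```python
import plistlib
from collections import Counter

# Assumed path; in iTunes, export via File > Library > Export Library.
LIBRARY_XML = "iTunes Music Library.xml"

with open(LIBRARY_XML, "rb") as f:
    library = plistlib.load(f)

tracks = library["Tracks"].values()
# "Rating" is stored 0-100; unrated tracks simply omit the key.
stars = Counter(track["Rating"] // 20 for track in tracks if track.get("Rating"))

rated = sum(stars.values())
print(f"{len(tracks)} tracks, {rated} rated")
cumulative = 0.0
for star in (5, 4, 3, 2, 1):
    pct = 100 * stars[star] / rated if rated else 0
    cumulative += pct
    print(f"{star} stars: {stars[star]:4d}  {pct:5.1f}%  (cumulative {cumulative:5.1f}%)")
```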
So over the last few months I've completely revamped my iTunes ratings. Since I can't change the user interface, I've changed my behavior. I'm also taking advantage of two other fields: "checked" which I use to give more distinctiveness to my ratings, and "play count" which shows whether or not I've listened to something through to the end.
Here are the criteria I used:
Rated 5 - Exemplars : Only my most favorite songs are rated 5. They have to meet the following criteria: they make me feel good or excite me no matter how often I listen to them, I can typically listen to them often without getting tired of them, and they are the best of their particular genre.
Rated 4 - Great : There is only a small difference between a song that is rated 4 and 5 in my ratings -- typically it doesn't excite me or make me smile quite as much, or it isn't necessarily an exemplar of its genre. However, I still can typically listen to them often without getting tired of them. Items that are rated 4 and 5 are ones that I carry on my iPod Shuffle.
Rated 4 - Great (Unchecked) : There are a few songs that I do consider to be great, but that I only want to play when I'm in the mood for them, or I want to only play in a specific order, or they "don't play well" with other music. For instance I love the song "The Highwayman" by Loreena McKennitt, however, it is over 10 minutes long and I just don't want to hear that type of song unless I'm in the mood for it. Other examples are the 12 songs that make up Mussorgsky's "Pictures at an Exhibition" -- I want them played in order when I do play them, and I really don't want them played in the middle of my other songs. Unfortunately, iTunes does not let you select only unchecked items, so I don't have a Smart Playlist for these; instead I keep them in a regular playlist.
Rated 3 - Good : These are songs I like. Typically I can play them regularly but not too often. Songs rated 3-5 go on my iPod Nano.
Rated 3 - Good (Unchecked) : There is a lot of music that I think is Good, but I don't want to play all the time. I have a large catalog of sound tracks from movies. All but a few of those tracks are in this category. Again, iTunes does not let you select only unchecked items in a Smart Playlist, so I have several regular playlists for these items.
Rated 2 - Ok : I have very diverse musical tastes, starting with jazz, various ethnic and world music, and also including quite a bit of pop, rap, R&B, punk, and metal that I enjoy. I don't enjoy them all the time -- but I do like them to pop up every once in a while for variety. So I rate these 2 and leave them checked. I have an old 40GB iPod that I take on long trips, and it stores everything I have that is checked and rated 2-5.
Rated 2 - Ok (Unchecked) : Some songs are OK, but I really have to be in the mood specifically for that song. Listening to Jimmy Buffet's "Margaritaville" can be a guilty pleasure on a lazy summer day at the beach, but it isn't something I want to regularly listen to. I have a number of special playlists for songs rated like this.
Rated 1 - Don't Like : These are the songs that I don't like. They're just not my style. Many are still quality music; they just don't work for me. I do keep most of these for completeness -- it might just be one or two songs on the album, and I want to keep the album complete. Or I keep it in case my tastes change. But in general, once something is rated 1 Star, I'll probably never listen to it again.
Rated 1 - Trash (Unchecked) : These are songs that not only do I not like, they just are not good music. I don't like most rap music, but I can tell that most of it is still quality; some of it is junk -- these I rate 1 and uncheck, and they are candidates for deletion the next time I purge my collection.
Unrated & Listened , play count > 0: If I've listened to something through to the end, but haven't rated it yet, it shows up in this Smart Playlist. Periodically I check this Smart Playlist, sort by play count, and try to rate everything that I've listened to more than once.
Unrated & Unlistened , play count=0: This is the default when a new song is added to my library. So any song that is unrated, checked, and has a play count of 0 shows up in my "Unrated & Unlistened" Smart Playlist. When I'm in the mood for variety, I go through this playlist and rate songs.
Modifying my rating system in this way has caused my average rating for music to change from around 4 to somewhere between 2 and 3. It will probably, over time, become closer to 2 as I rate more of my collection. This gives me a lot of distinctiveness so that I can create Smart Playlists that work well for me.
Here are some statistics from your iTunes Library: 6519 tracks, 726 (11%) rated

|  | Number | % of rated | Cumulative % (actual) | Cumulative % (target) | Shortfall |
|---|---|---|---|---|---|
| Tracks rated 5 stars | 74 | 10 | 10 | 5 | -5 |
| Tracks rated 4 stars | 144 | 20 | 30 | 15 | -15 |
| Tracks rated 3 stars | 211 | 29 | 59 | 50 | -9 |
| Tracks rated 2 stars | 270 | 37 | 96 | 90 | -6 |
| Tracks rated 1 star | 27 | 4 | 100 | | |
Obviously rating a large music collection can become a chore -- you don't want to spend your limited music listening time always fine tuning your ratings. So I have some approaches that make it easier for me to rate my music with less effort:
First, I sorted my catalog by my old ratings and shifted everything down by 1: everything rated 2 became 1, 3 became 2, and so on. This gave me a good base to start with.
Next I created Smart Playlists for each rating, i.e. "Rating 5 - Exemplar" with "Match only checked songs" and "Live updating" checked. I then added "Play Count" as a column to my view, and sorted by it. This gave me the songs that I played the most and least, and I adjusted some songs up and down accordingly.
Then I created a new Smart Playlist that simply plays songs rated 3 to 5, limiting the list to the first 100 GB selected at random (i.e., effectively everything, in random order), and saved this Smart Playlist as "Plays Well With Others". I play this on occasion in the background, and when I hear something that jars me I know something isn't rated right. Thus without a lot of effort I can change ratings for songs that no longer fit their rating, or uncheck items where the rating was appropriate but the song "didn't play well with others".
I try to be aware of what a song's rating is when I'm using my iPod, and change it if it seems wrong. The next time I sync the iPod my ratings will be adjusted in my iTunes catalog.
I also try to be aware of Play Count -- this number only goes up if you play a song to the end. So even if I'm not able to take a look at the rating (for instance when I'm in a car), I can at least forward to the next song. Periodically I review the play counts for songs that I've rated and consider moving them up and down accordingly. Of course, this means that I have to be careful and not let the iPod keep running when I'm not listening.
A tip for those of you who do put a lot of effort into your iTunes ratings: I've learned the hard way that, unlike most song information, the rating is NOT stored in the song itself, so if your iTunes database gets corrupted, or you move your music to another server, you'll lose all your ratings. One way to avoid this is to periodically back up your ratings into a field that is stored in the song itself. I personally use the "Grouping" field, as it is rarely used: I select all songs with the same rating, click on "Get Info", and change the Grouping field to something like "My Rating: 5 Stars".
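If you'd rather keep a backup outside of iTunes entirely, a complementary, read-only approach (not the Grouping-field trick above) is to dump ratings from the exported library XML to a CSV file. This is just a sketch under the same assumptions as before -- the "Rating" field is 0-100 and the export path is hypothetical:

```python
import csv
import plistlib

# Assumed export path; see File > Library > Export Library in iTunes.
with open("iTunes Music Library.xml", "rb") as f:
    tracks = plistlib.load(f)["Tracks"].values()

with open("ratings_backup.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Artist", "Name", "Stars"])
    for track in tracks:
        rating = track.get("Rating")
        if rating:  # skip unrated tracks
            writer.writerow([track.get("Artist", ""), track.get("Name", ""), rating // 20])
```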
I only have 11% of my collection rated so far, but using this system I'm finding it a lot easier to manage my ratings. I'm already getting many benefits from it -- I'm playing my music more often, my iPods typically have the music I want on them, and various music discovery services can use my ratings to help me identify new music I might enjoy. This provides the incentive to keep me entering meaningful ratings.
Amazon also uses a 5-Star rating system, and your ratings can be used by Amazon to help you find books that you might like. Though I like to support my local bookstores, it is this feature that brings me back to Amazon time and again. Whenever I browse through Amazon and see a book I've already read I try to take the time to update my rating.
Amazon has a number of different tools to assist you in your ratings. If you are an Amazon customer, you can go to Improve Your Recommendations: Edit Items You Own and see all the books that you've purchased and quickly rate them with a nice AJAX interface. You can also review items that you've already rated, whether or not you own them, at Improve Your Recommendations: Edit Items You've Rated.
Amazon has also recently added a very nice web service called Your Media Library that can be used to help manage your media library of books, music, and DVDs. I personally have only used it to manage my books and DVDs, as I find rating albums useless -- it is songs that I prefer to rate.
After browsing through my ratings to date, I discovered the same flaws I found in iTunes -- my ratings typically were too high; most were a 4. This is particularly encouraged by the popup when your cursor is over the Stars: "1 - I hate it, 2 - I don't like it, 3 - It's Ok, 4 - I like it, and 5 - I love it". I suspect that if I used the same trick I use for iTunes, making a rating of 2 Stars mean "Ok", I could potentially cause the recommendation engine to be less effective (though it could possibly make it better, I don't know). So I am being much more brutal with my ratings and pushing many more down to 3, so that my ratings of 4 and 5 have more meaning.
5 Stars : These have to be the exemplars -- the best books I've ever read, would be glad to read again, would be proud to show off on my best bookshelf, and will buy extra copies to give to friends.
4 Stars : These have to be really good books -- most of them I'm willing to read again, and I promote them by offering to loan them to my more discriminating friends. Although I may keep them on my bookshelf, I'd rather give them to a friend than sell them at a used book store.
3 Stars : These are decent books, and I do share them with my voracious reader friends. But I don't push them, and I'm much more likely to sell them at a used bookstore than keep them on my shelf. This is the rating that I significantly underused previously, and I'm finding that the key discriminator for me so far is how much I feel like recommending a book to friends who are more discriminating readers.
2 Stars : This rating is where the Amazon rating system fails the most -- these are supposed to be books that "I don't like"; however, most of the time I don't buy books that I probably wouldn't like, much less read them, so I have very few in this category. I've decided this category is for books that are just not quite good enough, or are slightly disappointing. Not bad, or disliked, but just somewhat disappointing.
1 Star : This is where I put the books that I don't like, or worse, hate. Not many here, but I'm willing to risk more than many people are, so I have some. Also, books go here that just don't fit my interests, like romance novels that get recommended to me because I like some crossover fantasy-romance authors.
Since I started rating my books at Amazon more accurately, I've found their suggestions for other books to read to be more accurate as well. Thus I am getting value from rating these books, and I have incentive to continue to make the effort.
Offering an incentive for people to rate is important for ratings of all sorts, with both individual gain and status recognition being powerful motivators.
However the easiest technique for making a 5-point rating scale more useful is to make it "distinct". If a user has a more specific meaning for each rating, ratings will slowly settle toward a truer average, and thus more of each rating scale will be used. We've also tried this technique recently on RPGnet, with our new Gaming Index; and thus far our new 10-point scale -- which has distinct meanings for each number -- is averaging 7.27. That's still a fair amount above the real average of 5.5, but at least it's below the 8+ rating that our old double 5-point scale resulted in.
Often you, as a consumer of rating systems, will be making use of rating scales designed by others, rather than those you're designing yourself. For those cases it often makes sense to design your own rules for what each number means, and to do so in such a way that your median is the average of the scale, rather than toward one of the extremes. When you do, even if you're using a tight 5-point scale you'll end up with enough differentiation for it to actually be more meaningful than a thumbs up or a thumbs down.
Related articles from this blog:
2005-12: Systems for Collective Choice
2005-12: Collective Choice: Rating Systems
2006-01: Collective Choice: Competitive Ranking Systems
2007-01: Experimenting with Ratings
Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:
I've been a moderator/host/forum leader for various bulletin boards and other online communities since the early 1980s; first on CompuServe, later on GEnie and AOL, and then professionally in the early days of Consensus Development. One of the behaviors that happens in online communities and that I rarely see elsewhere is flaming -- where one member writes an extremely inappropriate, typically passionately worded attack on another. Flaming behavior can hurt an online community.
It is commonly thought that flames occur because "there is very little proper policing done on the Internet" but I believe this to be false. Instead, I believe that it is the consequence of the medium primarily existing as text.
In fact, what you'll observe if you study individual flames is that they typically start as an escalation of emotion, only spiraling later into passionate and personal arguments. The only way to stop flames from destroying a community is to break this cycle.
So I've taught all my staffers over the years my ideas on what causes the cycle of flames, and how to avoid them. One particular piece of advice that I give is in regards to how emotions are amplified in the online text medium.
This happens for several different reasons:
Since text is lacking tonal and visual context, we have a tendency to over-interpret any emotional content that does exist (link to paper). In fact, we may have no better than a random chance of correctly interpreting the emotional tone of ironic vs sincere text in a message (link to Epley/Kruger paper).
In addition, we tend to respond to someone's emotional state by expressions of similar intensity (this phenomenon is known as Emotional Contagion). And the higher the level of intensity of our emotions, the less our ability to be empathetic (link to paper).
These tendencies lead into a vicious feedback cycle.
Thus I now find that there are certain words and phrases that I avoid using when responding to people online. I have to be very careful with irony and sarcasm, and when I use them I include symbols such as smilies to give the emotional context that is missing from the text. I find that even the slightest hint of blame will be over-interpreted. I avoid the words "should" and "didn't", never tell someone that they forgot something, etc.
My online community staffers have found understanding this cycle an important tool in moderating the communities they lead.
(This is an update/rewrite of what I originally wrote in Dave Winer's UserLand discussion group back in September of 2000.)
However, I don't consider myself a venture capitalist. Instead, I am what is known as an "angel investor".
This week has also seen a new topic enter the blog zeitgeist: the topic of reforming or reinventing venture capital. This topic was initially raised by Dave Winer, followed by Robert Scoble, Doc Searls, Jeff Nolan, Michael Arrington, Thatedeguy and many more.
All types of venture investment -- seed, angel, venture, and institutional alike -- carry with them great risks and great rewards. But before we can reinvent venture capital and related venture funding methods like angel capital, we need to understand how it works.
A venture capitalist is a partner or associate in a venture capital management firm, which manages money on behalf of large institutional investors.
Basically, a large institutional investor (such as a pension fund or an insurance company) can statistically afford to invest a small part of their portfolio -- perhaps from 1% to as much as 5% -- in high-risk, long-term investments. If they lose the money outright, their other more stable investments have a good chance of making up the loss. But if the high risk investment does well, they can substantially improve their IRR (internal rate of return). To a certain extent they can't lose if they are careful. So these institutional investors invest in a number of types of high-risk funds, including such investments as venture capital funds.
A venture capital management company will manage one or more of these funds, investing in private companies. These VC management firms operate off of a management fee, from 2% to 3% of the capital invested to date. Thus all of the salaries for the staff of a VC management firm are paid, even if the investments are a failure. In addition, if any of the investments are successful, the VC management company earns 20% off of the top of the gain (called a "carry"), which is distributed to all of the full partners in the VC management firm, and sometimes a little of it to the associates.
It is the VC associates that do the brunt of the work for a VC management firm. They make a good salary, but the real return is if they are able to do well in identifying, managing, and selling new startups; then they are invited to become a partner the next time the VC management firm raises a fund. Then if the fund that they are a partner in does well, they can make a true fortune, or even start their own VC management firm.
However, the odds are against the VC associate. It's common wisdom that an associate can't easily manage more than 7 firms at a time. Other common wisdom says that 1 in 5 investments will survive to break even and that 1 in 20 will "make the fund", i.e. pay for all the losses in the other 19 investments. Some newer firms say 1 in 10, but I'll go with the older, more conservative numbers. Thus associates are incentivized to try to manage more than 7 investments and to be smarter than their peers in the firm, so that at least one of their investments will be the 1 in 20 that makes the fund. This makes it easier for the associate to become a partner in the future, as at best 1 in 3 or 1 in 5 associates becomes a partner. Cutthroat competition between associates exists in some firms. This pressure often creates the perception that associates don't give enough attention to companies in their portfolio; they want their startups to do well, but the odds are that it is another company -- or a startup managed by a peer associate -- that will make it. So they divide their attention. This is not unrelated to the Dunbar Triage problem.
Another problem that VC management firms face is the number of investments they are able to effectively handle. If there are 5-6 associates and 2-4 partners, there is probably a maximum of 50 investments that they have time to manage. If they are managing a $500 million fund, that means they have to invest at least $10M in a company, though in fact that is more likely to be $25M over time. If 1 in 20 makes the fund, that $25M has to give a return of $250M. Thus when entrepreneurs complain that VCs will not invest in their company, it is often because the VCs can't figure out how to invest a minimum of $25M and turn out at the end with $250M. A related problem is that a startup with a business model that could grow into a profitable company with $50M in annual revenues will be encouraged to take a riskier route so that it can go public, which requires a minimum of $100-200M in annual revenues.
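To make that arithmetic concrete, here is a toy model using the illustrative figures from this post (not data from any actual fund):

```python
# Toy model of the fund economics sketched above (illustrative figures only).
fund_size = 500_000_000        # the fund under management
per_company = 25_000_000       # likely total invested per company over time
max_investments = 50           # roughly what 5-6 associates and 2-4 partners can manage

# Management fee: 2-3% of the capital invested pays the firm's salaries.
annual_fee_low, annual_fee_high = 0.02 * fund_size, 0.03 * fund_size

# "1 in 20 makes the fund": the rare winner has to cover the losers,
# so a $25M position needs to come back as roughly $250M (a 10x return).
required_return = 10 * per_company

# Carry: the partners keep ~20% of the gain on that win.
carry = 0.20 * (required_return - per_company)

print(f"Annual fee: ${annual_fee_low/1e6:.0f}M-${annual_fee_high/1e6:.0f}M")
print(f"Needed exit value from one winner: ${required_return/1e6:.0f}M")
print(f"Carry on that single win: ${carry/1e6:.0f}M")
```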
There is a lot of variety in VC management firms; some VCs have smaller funds under management, others give their associates more of a share, others have different management fees or carry percentages, and most specialize in some way: either vertically in a particular field, or horizontally in a particular stage of investment. For instance, there are some VC firms known as mezzanine firms that only invest in your company right before they think it can go public.
This is the way most VC management firms work. Periodically a new VC management firm will explore and push the limits of the above boundary conditions, but the more edges they attempt, the more likely they are to fail.
So what is an angel investor? I learned a lot of what I know from the 3 angel investors that invested in my software startup, Consensus Development.
Gifford Pinchot, with his wife Libba, was my first angel investor in Consensus Development. We met at a Maxis meeting where Gifford had been asked to facilitate the formation of a new startup to create simulation software. At the end of the meeting we left frustrated with the results of the meeting, but Gifford liked what he heard about my broader vision. Gifford flew me to San Diego, where we walked the beach and discussed my vision for collaborative software. He liked what he heard, and later in the month flew me to his home in Connecticut, where I stayed for a month in a barn guest house near his home while we worked on our first business plan.
Gifford only invested a low 5 figures, which got me started. However, it wasn't his money that was his most valuable contribution -- it was his time. Over the years he probably put 5-10% of his time into his role as Chairman of Consensus Development: working with me, talking to me, advising me, and coaching me. When our first software effort, InfoLog (a folksonomy tagging program like del.icio.us that was a decade too early), failed, he didn't walk away; instead he encouraged me to continue. I dug deeper into the problem, discovered that trust and security were a key obstacle, and created a profitable consulting business. But Gifford encouraged me when I said we were going to take the risk of dropping all of our profitable consulting and focusing on a product, SSL Plus. Later, when the company was being shopped around to various buyers, Gifford spent lots of time doing due diligence, and ultimately came on half-time as CEO so that I could concentrate on selling the business.
In the end, Gifford earned probably 7 figures on his initial 5 figure investment, close to a hundred-fold return on the dollars he invested. However, his real investment was the time he spent with me -- almost 10 years of never giving up.
I met Scott Loftesness when he was the executive vice president at Visa International. We learned of each other through CompuServe, where we both were sysops in the 80s. I did some consulting for him at Visa in the groupware area over a couple of years and we grew to respect and trust each other. I came to him when I branched out from groupware consulting and began to include consulting on cryptographic security. I'd seen an opportunity--I had a potential contract from RSA Data Security to be a distributor of RSAREF--but in order to take advantage of this opportunity I needed some seed capital.
Scott invested over twice what Gifford invested, but still 5 figures. However, like Gifford, what I gained from my association with Scott was a lot more than the seed capital. He had a respected name in the industry -- a friend at Visa USA told me "Scott is where all innovation at Visa flows from." He joined my board of directors, supported our risky choice to drop all groupware and cryptographic consulting to focus on our SSL project, helped tremendously in doing due diligence on potential buyers, and was pivotal to the negotiations to close our final sale of Consensus Development.
In the end, Scott Loftesness also did quite well in his investment in Consensus Development. His involvement on the day-to-day operation of Consensus Development was significantly less, but he was always around to support and advise us when we needed him.
Jim Bidzos was the CEO of RSA Data Security, whose firm had a critical patent on almost all meaningful cryptographic security. Over the years I did a lot of consulting for him to support various projects like RSAREF in standards, to create client tools for their Certificate Services Division, and to help with the founding of Verisign.
One day I told Jim that RSAREF would never be successful in his goal of promoting the RSA algorithm in security standards as long as it could only be sold through RSA salespeople. They preferred to sell RSA's premiere toolkit, BSAFE. I somewhat jokingly proposed that maybe Consensus Development should sell it instead. To my surprise, he agreed.
A couple of years later I leveraged the fact that Consensus Development had the only RSA toolkit available other than RSA's own to get the contract to develop the reference implementation of SSL 3.0 for Netscape. I took this Netscape contract back to Jim and said that I needed some investment to make it successful. He invested a middle six figures in Consensus Development in return for a percentage that was roughly equivalent to that of Gifford and Scott, but because of his involvement as CEO of RSA Data Security he could not be on our board of directors.
After this investment, Jim had very little to do with Consensus Development. In fact, he had spread his angel money so widely in the cryptographic security industry that he was also invested in a couple of our competitors. In the end his investment was worth roughly 10 times what he invested, but the cachet of being able to tell others that Jim Bidzos was an investor made Consensus Development much more "legitimate", which also added significant value to us.
After I left Certicom, the company that had purchased my firm, Consensus Development (see Bad Business of Fear for more info), I wondered what I should do next. I could theoretically retire if I abandoned the Bay Area, but I was not ready for that and I thought I had maybe enough capital to start one more business of my own instead. Under a non-compete from Certicom, I was not sure what type of non-cryptographic business I wanted to start. So I decided that one thing I could do was some angel investing. In part this was to make money, but a larger part of it was that I enjoyed working with entrepreneurs. I wanted to do for others what Gifford Pinchot had done for me.
I did some study of how venture economics works and how angels and venture capital firms invest, and became concerned. I saw that being an angel investor is in many ways much harder than being a venture capitalist.
One of the biggest challenges is that angels share all the problems of the institutional investor, of the VC management firm, and of the VC associate.
The first challenge is deciding how much to invest. The institutional investors only risk 1%-5% of their capital. If I limited myself to that amount I could maybe invest in a couple of companies. I decided I was still young and could risk investing more.
The second problem was no management fee -- unlike a VC firm, angels don't get a management fee to cover salaries, legal fees, and other expenses.
The third problem was my time. Most angels still work for a living -- being an angel investor is part-time, whereas a venture capitalist typically works full-time. If only 1 in 20 investments "make the fund", but I could at most manage 7 investments, that meant I had a two-thirds chance of losing my entire investment. I might be able to argue that for some kinds of businesses I would be more informed than the average VC, and thus might be able to make better choices, but not that much better.
The key, I decided, was to work with at least 2 other angel investors. That would theoretically allow us to invest in 21 companies, diversify our portfolios, and split the work. I approached my first angel investor, Gifford Pinchot, and he agreed to be one of the partners. The second was Harold Shattuck, who had done some due diligence and operations consulting for Consensus Development, and had been a VC once before, but enjoyed being closer to the actual building of a new company with some operating interaction. I was the managing partner for files and accounting, but we all brought our "deal flow" to the table, performed due diligence together, and worked closely with each other.
Alacrity Ventures is over 6 years old, and I have learned many lessons from it.
First, I feel that we did a good job selecting our investments, during a time in which being an angel investor was very difficult. I discovered that Gifford, Harold and I were really good at due diligence; our differing skills -- Gifford's in coaching and evaluating the management team, Harold's in operations and business models, and mine in technology -- truly complemented each other.
For a long time I could say that the good news was that out of 13 investments, all but 1 were still in business. However, we were never able to make the 21 investments that we planned, because we discovered a significant problem in angel investing: the VC.
The angel investor can only really afford to invest early on, as a seed investor or in an early investment round such as a series A. However, the firms we invested in needed more money along the way; in fact, almost all firms need money at more than one point. The venture climate at the time was such that the VCs required in their term sheets that previous investment rounds lose their liquidation preferences, and ultimately their investment.
Let me give a specific example -- we invested in the first round of an enterprise software company in 2000 that is still around today. In 2002 they needed more money, and because of the difficulty in getting VC investment, the lead VC insisted that the preferences from the previous rounds be removed, effectively converting us to common stock, unless we participated in this subsequent round. We reluctantly did invest some more, but because we don't have the funds that a VC has, we were only able to protect some of our preferred stock. A year and a half later, the software company needed more money, and the VC did it again. This time, all our stock was converted to common. Now it is 2006, and the company might be acquired this year; however the VCs, because of their liquidation preferences, will get the first $65 million (or more). As I doubt the firm is worth more than $50M, we will not get anything, nor will any of the other founders who are no longer involved with the firm.
This has repeated itself over and over again. We made a decent choice and did our due diligence well, but subsequent VC investors have pushed us out. A few of our ventures have failed outright. That is understandable given our original 1 in 20 expectations. But what we didn't expect was how difficult it was going to be to participate in the upside. Yes, we had preferences in our early rounds that should have protected us, but they didn't.
So of our 13 investments, only 2 remain that may "make the fund": a very innovative high-tech titanium powder manufacturer, ITT, and a high-tech manufacturer of ceramic devices, Vapore. But even as these two investments survive, they may still require additional investment that could force us out.
Of the rest: one of our early investments sold to VeriSign at a 50% premium, our investment in Salon.com will give us a small return, MG Taylor paid off its loan, and Skotos may someday pay back its original investment. The other 8 are being written off as a loss.
So in spite of the odds, you still want to become an angel investor? Here is some advice...
Collaborate with other angels: Going it alone is dangerous -- there are a number of angel investor networks, such as Gathering of Angels, Band of Angels, and others listed in the Directory of Angel-Investor Networks. Be careful, though; the enthusiasm of others can be contagious -- don't always go with the herd.
Do your own due diligence: I can't emphasize this enough. Talk to the entrepreneurs and meet their staff. Read their business plan and tear it apart. Find the hidden assumptions. Understand their business model. It needs to feel realistic. Try to get more eyes on the job: different people see different things. Don't follow others; they may have different investment criteria than your own.
Be an advisor first: If the entrepreneurs don't listen to your advice, don't invest. If you have to invest to become an advisor, invest only a small amount, or have part of the money be contingent on a meaningful goal.
Guard your upside: When negotiating terms, don't worry about the downside. It is the VCs that need items on the term sheet for when things go wrong -- what you need to guard is for when things go right. Watch for changes in the executive staff -- they may be incentivized differently than you are.
Consider a secured loan: Somewhat contrary to the "guard your upside" advice, rather than investing only in stock, consider investing via a secured loan as well. The security can be not only hard company assets, but also intellectual property such as copyrights, trademarks or patents. Your return will be lower on the loan, but if you can get all of your investment back early and get a small percentage of the company, it can be a good way to balance risk. Just remember to file the property documents to make sure that the assets are properly secured, and be prepared that someday you may own that asset.
Save $2 for every $1: Almost every company you invest in, even if successful, will need additional funding. Make sure that you keep on hand $2 for every $1 initially invested. This will also help keep you from being squeezed out by later VC investors.
Invest in acquisition targets: Let the VCs take companies public -- the companies that you should be interested in are the companies that will eventually be acquired. Creating an acquisition target requires the management to think differently -- coach them to do so.
Understand the founder's dilemma: There are many founders' dilemmas; however, one is particularly important to the angel investor. A founder may be incentivized to sell sooner than his early investors. Remember that most often, the only significant asset a founder has is his company. If the founder has an opportunity to sell early and buy a house, he might, even if it may not be enough return on investment for the risk that the angel took. Find ways to keep your interests aligned with those of the founders, which may even include buying some stock directly from the founder.
Consider alternative exits: There are lots of boutique opportunities that are too small for VCs. I know of a local Berkeley software company that was number one in their market, but too small to go public. They had $20M in annual revenues, and profits of almost $10M, but little opportunity for growth -- early investors could have gotten their money back in dividends rather than sale of the company.
Time the cycle: We didn't invest at the ideal time for the angel investor. We picked well considering the times, but had we waited for a few years it would have been easier. Not to say that timing is everything; we'd have lost our titanium powder opportunity if we'd waited for better market timing.
Respect people: Treat the people you invest in like a paying client. Respect their time and concerns.
Be prepared that the plan will change: I've never been involved with a business where the business plan doesn't significantly change. As an angel investor you need to help your businesses to plan for those changes.
So you want investment from an angel investor? Some advice...
Recognize the odds: The angel investor is taking a substantial risk investing in your company -- you need to be able to show a scenario where the investor might be able to make 10x or 20x their investment. So if you are looking for $100K, you need to show how the angel can ultimately have stock worth $1M to $2M.
Consider their advice: Angel investors may not always be right, but show them that you are listening. If you use angels as more than just a source of money, you'll get a lot more value from them.
Draft your business plan: An angel investor does not need as complete a business plan as a VC does, but they need to see how you think. You should clearly identify what the product or service is, who is going to buy it, what marketplace those buyers will find it in, what differentiates your product or service, and why your team is good enough to deliver. Angel investors know that your plan will change, probably drastically, but if they understand your thinking process they can be more confident that your company will survive change.
It takes time: Don't count on the money from an angel investor (or any investor) until you get the check. Investors are always selecting from a number of choices, often very competitive choices. No matter how optimistic you are, it will likely take 6 months or more to raise angel money.
Team with Many Hats: Angel investors don't recruit new team members for you. You don't necessarily have to have your whole team in place, but there at least needs to be someone who has experience managing, someone with development experience, someone with marketing experience, and someone with sales experience. Whatever team is there, they need to be able to juggle all of those hats. Financial, HR, and administrative positions can all be part-time or farmed out.
Value the angel investor: The angel investor serves a point in the marketplace that you are not able to serve. Rather than driving them out, find some way for them to continue to participate so that they can find other ventures for you.
Angels are not VCs: The angel investor can't afford to invest in later rounds -- their model is different than yours. It may make sense to force participation in subsequent rounds by other VCs, but carve out some room for angels.
Though I've enjoyed some aspects of being an angel investor, I enjoy working with creative people to innovate new products more. I expect to spend most of my time in the next few years continuing to explore social software and collaboration tools, and the new product opportunities that may evolve from them.
Thus I expect that any future angel investments I make will be more along the lines of Gifford's style of investment in Consensus Development: a small investment of money and a large investment of time. Harold and Gifford both feel the same way. Currently we plan to continue monitoring our existing investments, but don't plan any new investments unless we can take a more active role in the firm -- for instance Harold is a board member in Vapore.
Gifford is now dedicating his life to building a better world by transforming business education. He is a co-founder and President of the Bainbridge Graduate Institute, which provides an MBA program integrating sustainability, green economics, the internet, and open source within a traditional MBA program. As an open source school, he helps other schools to use BGI’s curriculum. Check out his blog entry on Angel Philanthropy.
If there's one thing we've learned from six years of angel investing, one thing that may be more valuable than all the nuts and bolts I describe here, it's that Gifford Pinchot's partner-style of angel investment is what suits our investing style, not Jim Bidzos' style of hands-off angel investing, and that's a lesson that we're going to carry forward with Alacrity Ventures.
by Christopher Allen & Shannon Appelcline
[This is the third in a series of articles on collective choice, co-written with my colleague Shannon Appelcline. It will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]
In our first article on collective choice we outlined a number of different types of choice systems, among them voting, polling, rating, and ranking. Since then we've been spending some time expanding upon the systems, with the goal being to create both a lexicon of and a dialogue about systems for collective choice.
This time we're going to dig more into comparison ranking systems, by focusing on competitive rankings and looking more in depth at ELO Chess Ranking System and the other systems that we briefly mentioned previously. Our goal is to explicate these systems, to better address their flaws, to begin detailing the purposes of ranking systems, and to show how those purposes are critical in the design of ranking systems.
In our original article we discussed rating systems as being largely subjective and ranking systems as being objective, but the situation isn't nearly as simple as that. In truth, there's a clear spectrum of ratings and rankings with varying amounts of subjectivity and objectivity in each collective choice system.
The Bowl Championship Series (BCS) for college football is a good example of a ranking system that explicitly allows a subjective component. It involves a complex mathematical formula that includes things like win/loss ratios, but also sportswriters' and coaches' ratings.
However, public opinion continues to show that people don't necessarily like seeing true ranking systems having subjective components, because they expect them to be "fair". The BCS formula has come under attack several times in the last few years precisely due to its subjective basis. Cal Berkeley was one of several teams denied a bowl position in 2004 when many felt that they were worthy.
The ATP tennis rankings and the official world golf rankings also have a subjective component, but it is much more subtle. Each tournament is worth a certain number of points, and the allocation of those points is relatively arbitrary, based upon the "prestige" of each tournament and the quality of players who have traditionally played in it. The subjectivity isn't quite as near to the surface as that of the college bowls, but it's still something that can have a notable, and perhaps unwarranted, effect upon the final results.
This brings us back to the ELO system, a ranking system originally designed for chess which is fairly well-known and well-understood. As we said in our overview article, "[ELO] builds a simple distribution of player ratings around a norm (typically 1500 points), then awards or deducts points based upon wins and losses, with the total sum of all points in the system staying constant. Players are then ranked according to their comparative scores."
The big difference between this and the previously discussed systems is that it's almost entirely objective; in fact it uses a statistical basis to create an underlying mathematical model for rankings, rather than allowing human subjectivity to get in the way.
The simplest formulation for an ELO rating looks like this:
R' = R + K * (S - E)
R' is the new rating
R is the old rating
K is a maximum value for increase or decrease of rating (16 or 32 for ELO)
S is the score for a game
E is the expected score for a game
Much of the trick is in figuring out what the (E)xpected score of a game is. ELO uses the following formulas for players A and B:
E(A) = 1 / [ 1 + 10 ^ ( [R(B) - R(A)] / 400 ) ]
E(B) = 1 / [ 1 + 10 ^ ( [R(A) - R(B)] / 400 ) ]
It's a good model because, using the two formulas, it means that a great player gains little from beating an average player, but an average player gains a lot from beating a great player. Take the following example:
R(A) = 1900
R(B) = 1500
E(A) = 1 / [ 1 + 10 ^ ( [1500 - 1900] / 400 ) ]
= 1 / [ 1 + 10 ^ ( -400 / 400 ) ]
= 1 / [ 1 + 10 ^ -1 ]
= 1 / [ 1 + .1 ]
= 1 / 1.1
≈ .91
E(B) = 1 / [ 1 + 10 ^ ( [1900 - 1500] / 400 ) ]
= 1 / [ 1 + 10 ^ ( 400 / 400 ) ]
= 1 / [ 1 + 10 ^ 1 ]
= 1 / 11
≈ .09
Player A is expected to score .91 in an average game, which is to say he should win 91% of the time, and will be punished accordingly if he loses to player B:
R' = 1900 + 32 * (0 - .91)
R' = 1900 - 29.12
R' = 1871
Conversely a win nets him very little:
R' = 1900 + 32 * (1 - .91)
R' = 1900 + 32 * .09
R' = 1900 + 2.88
R' = 1903
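Here is the same calculation as a minimal code sketch -- just the two-player update described above, with K fixed at 32:

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the ELO model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating, expected, score, k=32):
    """New rating after one game; score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k * (score - expected)

# The example above: a 1900-rated player against a 1500-rated player.
e_a = expected_score(1900, 1500)          # about 0.91
print(round(elo_update(1900, e_a, 0)))    # a loss drops him to about 1871
print(round(elo_update(1900, e_a, 1)))    # a win only lifts him to about 1903
```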
ELO is almost entirely mathematical. Players can gain or lose different amounts of points based upon playing different players, but this is all part of the formula. The only slightly subjective element is the definition of K -- how much a player can win or lose from a particular game. The most widely used ELO systems for Chess break K down into two values: 16 for masters and 32 for everyone else. So there is a subjective decision that masters should vary their score less frequently than other players.
That's a very minor element in an otherwise objective system, but as we'll see, more recent systems by Days of Wonder and Microsoft first reduce, then eliminate even this subjectivity.
ELO is probably the most used ranking system in the world. You can find it in use for Go, Tantrix, and many other games. Days of Wonder, producers of Gang of Four, Ticket to Ride, and many other games, use a variant of the system, which they describe on their website.
They identify three core problems with ELO:
Days of Wonder resolved the first problem by creating a new formula for provisional players, allowing them to rise and fall in the rankings much more quickly.
Conversely, when playing against provisional players, regular players can only lose a maximum of K*n/20 points, where n is the number of games that the provisional player has played -- rather than the normal maximum loss of K. For example, playing someone who has just played one game can only result in a loss of 1/20th of the regular K value, and so it really doesn't matter if the provisional player's ranking is wildly out of whack.
Both of these new formulas are set up to converge toward a normal ELO formula as a provisional player's number of games approaches 20 (making them a normal player at Days of Wonder).
(It should be pointed out that using the number "20" to define a provisional player, and making a player less provisional in clean 5% steps, inevitably introduces yet another small, subjective element into this mathematical formula; as we'll see momentarily, Microsoft has more recently incorporated the idea of provisional uncertainty into its core mathematical model, much as the whole ELO system originally turned subjective win and loss statistics into tighter mathematics.)
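Here is a rough sketch of that cap on the regular player's side (the provisional player's own accelerated formula isn't spelled out in Days of Wonder's description, so it is omitted here):

```python
def capped_loss(k, provisional_games, threshold=20):
    """Maximum points a regular player can lose to a provisional player.

    Scales linearly with how many games the provisional player has completed,
    converging on the normal K once they reach the threshold (20 at Days of Wonder).
    """
    return k * min(provisional_games, threshold) / threshold

print(capped_loss(32, 1))   # 1.6 -- a near-meaningless loss against a brand-new player
print(capped_loss(32, 20))  # 32.0 -- back to the normal ELO maximum
```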
There are 4 players in a Gang of Four game. Let's name A the winning player, B the second one, C the third one and D the last one. We consider that there were 6 duels: A won against B, C and D. B won against C and D. C won against D. We compute independently the new scores for each duel, and then we average the values for each player.
It's a fairly elegant answer that not only rewards or penalizes all players separately, but also encourages playing for second place, or even third, if first isn't possible.
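A sketch of that pairwise approach, reusing the standard two-player formulas (again with K fixed at 32; the exact constants Days of Wonder uses may differ):

```python
def expected_score(rating_a, rating_b):
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def multiplayer_elo(ratings_in_finish_order, k=32):
    """Score an n-player game as all pairwise duels, each won by the higher
    finisher, then average each player's new rating across their n-1 duels."""
    new_ratings = []
    for i, r_i in enumerate(ratings_in_finish_order):
        duels = []
        for j, r_j in enumerate(ratings_in_finish_order):
            if i == j:
                continue
            score = 1 if i < j else 0  # earlier in the list means a better finish
            duels.append(r_i + k * (score - expected_score(r_i, r_j)))
        new_ratings.append(sum(duels) / len(duels))
    return new_ratings

# Four players, A through D, finishing in that order.
print([round(r) for r in multiplayer_elo([1600, 1550, 1500, 1450])])
```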
There have been continued discussions of the Days of Wonder ELO variant in their forums, and the questions raised there are common to many different ranking systems. Some players wanted unranked games, while others thought that having unranked games would discourage people from playing good competitors except in unranked games. There has also been a lot of discussion regarding Ticket to Ride, a strategy game that supports 2-5 people, and whether the ELO variant system discourages multiperson play.
The various lessons learned at Days of Wonder underline two basic ideas about rankings. First, even with a well-studied system like ELO, there's still a lot to understand, and, second, any ranking system needs to reflect the specifics of what it's ranking -- and what its purpose is.
An even more recent large-scale ranking system is the TrueSkill system developed by Microsoft for use with the XBox 360. It appears to be an expanded variant of the Glicko ranking system used by the Free Internet Chess Server.
Many of the problems identified by Microsoft were the same as those already noted by Days of Wonder and others, including: the uncertainty of provisional ratings and the need to rank players in multiplayer games. However, the TrueSkill system notably expands both issues. Ranking uncertainty is now defined as a mathematical concept and the rankings now support not just multiple players, but also multiple teams.
TrueSkill explicitly includes two values in any ranking: a skill level and an uncertainty level. The first, like the more common ELO ranking, tells how good a player is. The second states how sure that ranking is. The uncertainty rating is effectively a margin of error, similar to those we saw in polling systems. If a first-time player has a skill rating of 25 with an uncertainty rating of 8.3 that means that his skill is probably somewhere in the range of 16.7 to 33.3, a pretty wide range, but then this is a totally untested player. According to benchmarks that Microsoft produced, 99.99% of actual skill levels were within 3x of the uncertainty rating, and 100% were within 4x.
The rest of TrueSkill's innovations are built around this model of uncertainty. All players win or lose skill points, based upon how many players they beat or lose to, and they also decrease their uncertainty rating as they play more games. However, uncertainty is decreased more for players toward the middle of a pack within a game than for those around the edges (because on the edges the players could actually be much better or much worse than it is possible to see from a specific game). In addition, TrueSkill is only a zero-sum ranking system for players at the exact same level of uncertainty. The more uncertainty that an opponent possesses, the smaller the weighting of any gain or loss (much like the simpler system that Days of Wonder uses, which weights games against provisional players according to how many games they have played).
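One practical consequence of tracking uncertainty explicitly is that a displayed rank can be stated conservatively -- roughly the skill estimate minus a multiple of the uncertainty, in the spirit of the "within 3x" benchmark quoted above. A minimal illustration (the starting values of 25 and 8.3 are the ones quoted above; treating 3 as the multiplier is an assumption for illustration):

```python
def conservative_rank(skill, uncertainty, multiplier=3):
    """A displayed rank the player is very likely to actually exceed."""
    return skill - multiplier * uncertainty

# A brand-new player versus one whose uncertainty has shrunk over many games.
print(conservative_rank(25, 8.3))   # about 0.1 -- almost nothing is claimed yet
print(conservative_rank(30, 1.5))   # 25.5 -- a much stronger, narrower claim
```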
Overall TrueSkill is a somewhat complex system that is described more fully at Microsoft's web site. Some of their expansions had already been considered by others, but still their system is notably innovative in two ways:
Expanding a competitive ranking system to include concepts of teams.
Incorporating the uncertainty of ratings further into the core mathematical model, rather than using a somewhat more subjective model such as that described by Days of Wonder for provisional players.
The TrueSkill calculations are a bit complex. In general, that's not a problem for a computer-based ranking model, because the computer does all the computations and players only need to understand the results. However, the two-part ranking system used by TrueSkill, which reports both skill level and uncertainty, does raise a potential problem on this latter point. Can players understand it? In general, the concept of uncertainty will not be understood by people other than statisticians, raising a real user-interface question for the TrueSkill system -- and the exact sort of thing that designers of new ranking systems will need to consider.
The online game A Tale in the Desert identified a different problem with the ELO system: cheating. This is a uniquely Internet-based problem, because online a player can create fake accounts and then defeat those accounts to win points. This can also be done more subtly, by having multiple additional accounts build up the rating of a fake account before that account is defeated. So a totally new ranking system, the eGenesis Ranking System, was created.
Each player is ranked through a 256-bit vector, half of which is initially set to 0 and half of which is set to 1 (therefore creating an average ranking of 128). Whenever a match occurs between players a hash function based on the players' names mathematically selects 32 of those bits, 8 of which are then randomly selected. Among those bits, any 1s in the loser's vector which correspond to 0s in the winner's vector are "transferred".
This simple design corresponds in some ways to ELO's more complex formula. A good player will have more 1s and thus more to lose, and he will lose correspondingly more to a poor player who has more 0s in his vector.
However, the system also prevents the collusion earlier noted. Statistically, a single player will only ever gain 8 ranking points from another new player, since out of the 32 bit hash only eight of those will, on average, be in the correct 0-1 configuration. Expanding a group of players expands the number of points that can potentially be gained, but within real limits.
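Here is a minimal sketch of that bit-transfer mechanic in Python. eGenesis hasn't published its exact hash, so using SHA-256 of the two names and mapping each digest byte to a bit position are stand-ins for illustration only; the transfer rule itself follows the description above.

import hashlib
import random

BITS = 256

def new_player():
    # Half the bits start as 1s and half as 0s, so every new player scores 128.
    vector = [1] * (BITS // 2) + [0] * (BITS // 2)
    random.shuffle(vector)
    return vector

def score(vector):
    return sum(vector)

def contested_positions(name_a, name_b):
    # Stand-in for the name-based hash: each byte of the digest picks one of
    # the 256 positions, so the same pair always contests the same (up to) 32 bits.
    digest = hashlib.sha256("|".join(sorted((name_a, name_b))).encode()).digest()
    return sorted(set(digest))

def record_win(winner, loser, winner_name, loser_name, per_game=8):
    positions = contested_positions(winner_name, loser_name)
    for pos in random.sample(positions, min(per_game, len(positions))):
        # A 1 moves from the loser to the winner only where the winner holds a 0.
        if loser[pos] == 1 and winner[pos] == 0:
            loser[pos] = 0
            winner[pos] = 1

alice, bob = new_player(), new_player()
record_win(alice, bob, "alice", "bob")
print(score(alice), score(bob))   # e.g. 130 126; the two scores always sum to 256

Because only the hash-selected positions ever change hands for a given pair of players, repeatedly beating the same fake account runs out of transferable bits very quickly, which is the anti-cheating property discussed next.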
In fact, the eGenesis system prevents cheating by measuring the size of social networks, then limiting the number of ranking points that can be earned within a social network. It's not necessarily the only way to measure social network size, but its methodology points toward social software as an interesting area for additional study of ranking systems.
As with XBox's TrueSkill, the eGenesis algorithms are overall fairly sophisticated, perhaps even more confusing than TrueSkill itself. However, unlike TrueSkill, the output is very simple: a skill number between 0 and 255. The intricacies are hidden by the system.
Ultimately, as we mentioned when discussing Days of Wonder, any ranking system has to be measured by what it's trying to do and how well it does that. ELO and similar numerical, long-term ranking systems are most likely trying to achieve one of three goals:
Hierarchy: Players are listed in order from best to worst, so that everyone can see where they stand relative to everyone else.
Matching: Players can play with other players at their same skill level, rather than having to play beginners or experts whose skill is far from their own. This generally increases everyone's enjoyment. For computer games, the complexity of a matching system can be largely moderated by the computer, thus ensuring better competition.
Handicapping: If players do play against others of different skill levels, the better players can be handicapped in automatic, appropriate ways for the game in question, again increasing the fairness of games and everyone's enjoyment. For instance, in Go a 3-kyu player facing a less experienced 7-kyu player would give the weaker player a four-stone starting advantage to make for better competition.
The ELO system may be a good matching system, which allows players to easily find other players of their same skill level and play against them. However it doesn't provide any way to handicap players, nor would the ELO method necessarily be a good one to analyze handicaps (and conversely a golf handicap might not do a good job of finding like players nor measuring players' ability in a hierarchy).
More recently the XBox system has stated that it's explicitly for matchmaking, with the goal being to always match up players at nearly the same skill level. It's also used for hierarchy (or "leaderboards" as it's described in the TrueSkill docs), but that's clearly a subsidiary purpose.
All of these systems would be ineffective for measuring a winner in a live event, which is a very different goal:
Tourney: A single player is listed as an absolute winner, the "King of the Hill". Often, second, third, and fourth place winners are measured too.
And, the systems we've discussed thus far may not be useful for measuring privileges, yet another goal:
Threshold: The top-ranked players can be given special privileges, including the ability to create games and form tournaments. Alternatively, they can be given privileges totally outside the game, again giving them something extra to strive for.
For each of these additional goals we may need to consider very different ranking systems, not just variations of ELO.
The simplest is the single-elimination tournament, where the winner of each competition moves on to compete with other winners, until there is only one. However, this style of tournament is quite cut-throat and is poorly suited to events where a competition may end in a draw, or where chance is a notable factor in the results. It also has a very subjective factor in the initial seeding of the rounds. Nor does the single-elimination tournament rank the losers, though by having the eliminated players compete with each other, Swiss-style, their relative strengths can still be ranked.
An improvement is the double-elimination tournament, which is now one of the best known tournament systems in sports. Players compete in a series of two-player matches, and a player has to lose twice before he's eliminated. This is done through a system of winner and loser brackets, wherein players drop from the winners' bracket to the losers' bracket when they lose once, and drop out altogether when they lose twice.
One problem with standard double-elimination is that there are unusual situations where a significantly inferior player can still make it to the final round, or the last player to remain undefeated can lose only once and still be eliminated. These can be addressed through variants such as face-off (requiring the last two remaining competitors to compete again if the undefeated team is defeated for the first time in the finals) or by reconfiguring the loser's brackets.
Round-robin tournaments, such as official Scrabble tournaments, involve every player playing a set number of games (24 in the 2005 World Scrabble Championship), facing opponents with similar win-loss records. Players are then ultimately ranked by their win-loss ratios.
The advantage of these sorts of tournaments over an ELO-style ranking is that they're easily understandable and seem fair. In addition, they measure ranking in a much more topical manner: how well someone is playing at a single event, rather than over a longer career. As a result they work much better for a live tournament.
As we discussed in our original article on Collective Choice, thresholds are ranking barriers above which members get a special ability--or alternatively levels below which members lose a special ability. They can also act as another goal for a ranking system.
In the game of Go there are both amateur and professional players. Although they aren't technically in the same hierarchy of rankings, the highest Go amateur ranking (7 dan) is approximately equal to the lowest Go professional ranking (1 dan), forming a de facto threshold.
Likewise the United States Chess Federation uses its ELO rankings to denote Chess Masters. Anyone who achieves a 2200 USCF rating is given a National Master threshold ranking, and anyone who maintains it for 300 games is given a Life Master threshold ranking.
The American Contract Bridge League uses a threshold system where you have to win a certain number of tournaments, and thus earn masterpoints, in order to achieve official rankings such as "section master". Furthermore, players may earn different "colors" of masterpoints depending on the difficulty of the tournament, and some ranks require that you earn at least some specific colored masterpoints in order to meet the requirements for the next threshold.
These thresholds are fairly explicitly based on other hierarchical ranking systems, but this doesn't need to be the case. Since determining the purpose of a ranking system is often the first step in designing it, as we delve further into the area of thresholds we may well find that systems designed specifically to measure thresholds do so better.
In our next article we'll consider, among other things, the Advogato reputation system, which manages thresholds in such a way as to prevent cheating.
There's actually a lot of variety in ranking systems, and even though we'd like them to be totally objective, various subjective elements often creep into these systems. In addition, there's a lot of variety in what ranking systems can do. For competitive systems, hierarchy, privilege, matching, and handicapping are some of the top purposes of ranking. Determining what a ranking system is going to do is a necessary first step in designing the system, as different systems will accomplish various goals to a better or worse degree.
ELO, in several variants, is the best studied and most used competitive ranking system. It works particularly well as a matching system. However, even ELO has flaws in it, among them: issues with new player rankings; its core two-player basis; its lack of provisions for teams; a few minor subjective elements; and problems with cheaters. New systems continue to be rolled out on the Internet to resolve these issues, and overall, it's an area of interesting new study.
Tournament systems and threshold systems offer a few good examples of competitive ranking systems with very different purposes, underlining the need to understand what you're doing before you do it.
Ranking systems also lie very near yet another type of Collective Choice: reputation systems. We briefly addressed reputation systems when talking about threshold systems and will return to them in our next article.
Related articles from this blog:
2005-12: Systems for Collective Choice 2005-12: Collective Choice: Rating Systems 2006-08: Using 5-Star Rating Systems 2007-01: Experimenting with Ratings
Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:
Technorati Tags: a tale in the desert, algorithmic ranking, bridge, chess, collective choice, competition, days of wonder, double-elimination, egenesis, egenesis ranking system, elo, face-off, football, go, golf, grand master, grandmaster, handicap, handicapping, heirarchy, life master, matching, psuedo-double-elimination, ranking, ranking goals, round-robin, single-elimination, social network, statistics, threshold, ticket to ride, tournament, tourney, trueskill, uncertainty, xbox 360, xbox live
by Christopher Allen & Shannon Appelcline
[This is the second of a series of articles on collective choice, co-written by my colleague Shannon Appelcline. It will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]
In our previous article we talked about the many systems available for collective choice. There are selection systems, which are primarily centered on voting and deliberation; opinion systems, which indicate how people are likely to vote or have voted; and finally comparison systems, which rank or rate different people or things in a simple, comparative manner.
One purpose of our previous article was to create a dictionary of terms for talking about these related, but clearly different, systems. Another was to start offering analyses of these systems, many of which had not been well studied before their introduction onto the Internet.
However at best our previous article provided an overview of what should be further investigated in each system. This article provides more in-depth coverage of one of the systems we previously outlined: rating systems.
As we wrote in our previous article, in comparison rating systems "the value of individual items (most frequently goods) rise or fall based upon the largely subjective judgment of individual users." Ratings systems should be clearly differentiated from the closely related ranking systems. Ratings systems have a more subjective component, while ranking systems are largely objective. Amazon, Netflix, BoardGameGeek, and even the Stock Market were offered up as examples of ratings systems. Another example of a comparison rating system, and one of the earliest that appeared on the modern Internet, is eBay. The techniques they use are now beginning to show their age.
Most rating systems center around rating content, often user-contributed content, and they frequently help apply community values and acclaim to that content. However, the idea of ratings can go far beyond that narrow niche (though that will doubtless be its greatest use as the Internet continues to expand). eBay, an early Internet site, was one of the first to make wide use of user-submitted ratings, and it used them in a different manner: to identify the good traders on its auction site.
Unfortunately, as one of the first in this field, eBay made many mistakes which now leave its rating system only marginally helpful. However, its failures can also provide us with insights for creating new rating systems on the Internet.
eBay allows you to leave positive, negative, or (more recently) neutral feedback for each transaction you conduct in their society. These are aggregated into two numbers. "Feedback Score" is calculated as unique positive feedback received minus unique negative feedback received, and results in a whole number like "32" or "10,302". "Positive Feedback" is calculated as positive feedback received divided by all feedback received, and results in a percentage like "100%" or "99.8%".
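Both numbers are simple arithmetic. A quick sketch in Python, using the seller from the next paragraphs (the 1,765 positives and 3 negatives are back-solved from the "1,762" and "99.8%" figures, and the handling of neutrals simply follows the description above rather than eBay's exact current rules):

def feedback_score(positives, negatives):
    # "Feedback Score": unique positive feedback minus unique negative feedback.
    return positives - negatives

def positive_feedback(positives, negatives, neutrals=0):
    # "Positive Feedback": positive feedback divided by all feedback received.
    total = positives + negatives + neutrals
    return 100.0 * positives / total if total else 0.0

print(feedback_score(1765, 3))                # 1762
print(round(positive_feedback(1765, 3), 1))   # 99.8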
Unfortunately, for reasons discussed below, almost all feedback is positive, and thus the Feedback Score acts almost entirely as a track record of how many trades someone has made; it could largely be replaced by a simple count of trades. You can look at a score of "27" and say, "That's an amateur trader, or someone just getting started", at a score of "3" and say, "That person may or may not know what they're doing", and at a score of "10,302" and say, "That person has done a lot of trades." But you still don't know how good the trader is.
Theoretically, the Positive Feedback percentage should give a more meaningful number, but people so infrequently give bad ratings that, even when they do appear, they look like noise. Does a percentage of "99.8%" on a user with a score of "1,762" mean that the seller has a genuine problem or not? Do those 3 unhappy customers really represent another 30 who were unwilling to actually click the negative feedback? And did those people have slightly bad experiences or really bad ones? It's pretty hard to say.
Overall, eBay has a few major problems with their rating system:
It's non-granular, with only two options (positive/negative), or more recently three (positive/negative/neutral).
It's non-distinct, with no useful guidelines on what behaviors should result in each rating.
It's non-statistical, and thus ends up showing only a gross number of sales, not a real subjective measure.
It's bilateral, with buyers and sellers rating each other simultaneously, and thus people are afraid to give bad ratings lest they get them in return.
It's meaningless, because there are no good tools to control who bids on an auction based on Feedback numbers. (Technically it may be legitimate to ban low feedback bidders from an auction, then cancel their bids if they enter the auction, but this is neither obvious, automatic, nor simple.)
We're going to address each of these issues in turn, to offer insight into creating new comparison rating systems. The first three topics--granularity, distinction, and a statistical basis--are the most important elements of a good comparison rating system. Bilateralism and meaningfulness will only be relevant on certain sites.
(As a final caveat: in some ways eBay falls closer in ultimate result to a reputation system, a topic which we'll be covering more in a few articles down the road, but its lessons learned are still entirely accurate for rating systems of all sorts.)
In general, people want to be nice. There are exceptions to that rule, perhaps even great numbers of them, but the average, well-adjusted person would prefer to make other people happy, not sad.
This has a notable effect on any comparison rating system, because it means that people are less likely to use the bottom half of any rating scale. If you did a statistical run on eBay, you'd certainly find that more than 99 out of every 100 ratings are positive. This is largely influenced by concerns of bilateral revenge, as discussed below, and by the fact that eBay suggests other means of dispute resolution when you try to leave negative feedback. However, RPGnet, a roleplaying site which reviews games, comics, books, movies, and more, shows a similar trend despite the lack of bilaterality.
RPGnet uses two 5-point scales for reviews, resulting in a total rating of 2-10. Of all the ratings at RPGnet, 6,983 reviews have an above-average total (a total rating of 6 or more) and 795 have a below-average total (a total rating of 5 or less). Perhaps more people sit down to write a review because they really like a game than because they really hated it, but the result of ~90% of reviews being above average is still stunning.
The following table shows all the ratings for each of the two categories that RPGnet uses, "Style" and "Substance":
This evidence confirms what we'd already suspected. Only 10% of raters use the bottom two ratings in a 5-point scale, and only 2% use the bottom rating. The median of the 5-point scale is actually the fourth point, with a neat bell curve arranged around it.
Because users are innately unwilling to give bad ratings, as evidenced here, useful comparison ratings truly come about only through fractional differences between good ratings. In this case, the difference between "3", "4", and "5" is meaningful, and becomes more meaningful as more ratings are entered. Eventually you can look at a ranked list of ratings and see that "4.2" is a good rating while "3.5" is not.
In order to do this, however, you need enough levels of good ratings to be able to distinguish between them. eBay, only offering one positive rating, does not provide enough differentiation. RPGnet, with its three positive ratings, might. However, sites that offer a 10-point scale are the ones that really seem to be able to produce meaningful statistics. On those sites we can expect that 90% of users will choose between six different numbers, from "5" to "10", and as the number of ratings builds up, this will produce enough differentiation to be meaningful. If you have already adopted a 5-point scale, consider allowing users to select the half-points, giving users a greater ability to differentiate their ratings.
No two users are ever going to rate the same; different rating numbers will mean different things to each person. This can introduce minor discrepancies into ratings, if a single individual rates particularly low or high. However, because most ratings are eventually used for comparisons, if that low- or high-rater rates many different things, the ratings equalize. "Item A" is rated low by this person, but so is "Item B", and so they end up in the correct positions in relation to each other.
A bigger problem occurs when an individual is inconsistent in his ratings over time. If an individual rates everything low for a while, then rates everything high, then he has a greater chance of biasing the overall rating pool. Worse, his individual ratings aren't meaningful, because you can't look at two items, see that one is a "6" and another is an "8", and truly believe that he likes the "8" a fair amount more than the "6". This reduces the usability of an individual recommendation system or a friends system where one user might look at what other users thought about products, because their unaggregated numbers are not accurate.
You thus want to help individuals to stay consistent, and the best way to do that is to make the criteria for your ratings distinct. BoardGameGeek, a board game web site that supports a 10-point rating system for games, does a good job of offering distinction in its ratings.
- 10 - Outstanding. Always want to play and expect this will never change.
- 9 - Excellent game. Always want to play it.
- 8 - Very good game. I like to play. Probably I'll suggest it and will never turn down a game.
- 7 - Good game, usually willing to play.
- 6 - Ok game, some fun or challenge at least, will play sporadically if in the right mood.
- 5 - Average game, slightly boring, take it or leave it.
- 4 - Not so good, it doesn't get me but could be talked into it on occasion.
- 3 - Likely won't play this again although could be convinced. Bad.
- 2 - Extremely annoying game, won't play this ever again.
- 1 - Defies description of a game. You won't catch me dead playing this. Clearly broken.
If you offer a distinct rating listing like this, some users will still come up with their own rating ideas, but if they do, they're more likely to remember what each number means to them. For everyone else, a very clear, specific rating system is the most likely to produce meaningful and consistent results. As long as users aren't puzzled by the distinctions, they'll be consistent in picking the same numbers for the same rating every time.
The last big topic that you have to think about in creating most comparison rating systems is whether they're statistically sound.
The best way to make your ratings statistically sound is with volume. If you can manage thousands or tens of thousands of ratings for each item, any anomalies are going to become noise. However, the fewer ratings you have, the more likely it is that your ratings are inaccurate in relationship to your database of ratings as a whole. (And thus one of the failures of eBay is that it tries to claim meaningfulness for users with very few ratings, where there's clearly no statistical basis.)
Ideally you want to give items with fewer ratings less weight, and those with more ratings greater weight. One simple way to do this is to apply a bayesian average. Variants of this are used by the aforementioned BoardGameGeek and by IMDB. RPGnet is using it for some unreleased software as well.
The idea behind a bayesian average is that you normalize ratings by pushing them toward the average rating for your site, and you do that more for items with fewer ratings than those with more ratings. The basic formula looks like this:
b(r) = [ W(a) * a + W(r) * r ] / [ W(a) + W(r) ]
r = average rating for an item
W(r) = weight of that rating, which is the number of ratings for the item
a = average rating for your collection
W(a) = weight of that average, which is an arbitrary number, but should be higher if you generally expect to have more ratings for your items; 100 is used here, for a database which expects many ratings per item
b(r) = the new bayesian rating
Say three "shill" users had come onto your site and rated a brand new indie film a "10" because the producer asked them to. However, you use a bayesian average with a weight of 100, and thus 3 ratings won't move the movie very far from the average site rating of 6.50:
b(r) = [ 100 * 6.50 + 3 * 10 ] / [ 100 + 3 ]
b(r) = 680 / 103
b(r) = 6.60
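As a small sketch, the whole thing fits in one Python function; the 100-point site weight is the same assumption used in the worked example above.

def bayesian_rating(item_avg, item_count, site_avg, site_weight=100):
    # b(r) = [ W(a) * a + W(r) * r ] / [ W(a) + W(r) ]
    return (site_weight * site_avg + item_count * item_avg) / (site_weight + item_count)

# The shill example above: three 10s against a site-wide average of 6.50.
print("%.2f" % bayesian_rating(10.0, 3, 6.50))   # 6.60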
WowWebDesigns uses a similar model and even offers a good explanation of their methods on their web site.
With everything that's been described thus far, including granularity and distinction, a bayesian average (or some other similar method) will probably be enough to give your ratings a good, sound statistical basis. However, sites with a low volume of ratings may still be concerned with "shills" or "crappers" who come in to your site just to put "10"s on their favorite items or "1"s on their least favorite. RPGnet's reviews are an example of a site that could experience this issue, because only a few people are ever going to write reviews for an individual item, and this small number of reviews could compromise any comparisons generated by the rating system.
In short summary the following additional methods may help with this issue:
Rate the Raters: Reviews are low volume, but presumably readers of those reviews are high volume, and you can take advantage of that to then have your readers rate the reviews. Amazon and Netflix are two examples of sites which use this method by asking "how many readers found this helpful".
Altruistic Punishment: An alternative method for rating raters is to use altruistic punishment. Herein users can punish someone whose contributions hurt the community, but at a cost to themselves. So, a reader could flag a poor rating or a poor review at some minor cost to their own rating. Though this method may seem somewhat paradoxical, game theory suggests that it is a generally effective technique for improving the commons.
Adjust Ratings Based on Ratings: Ratings can be self-adjusted based upon the rater's own behavior. The simplest method here is to map a rater's average rating to the average rating for the site (see the sketch after this list). For example, if the average rating of a site is 6.50 and a shill's average rating is 10.0, then those 10s should be treated as 6.50s. This has the possibility for some intensive calculations, however, and may lead to additional bias in your rating pool if shills figure out the methods you use to adjust ratings.
Allow Editorial Fiat: Another method is to allow editorial fiat, where editors are expected to come in and remove bad ratings (or proactively not release them). This clearly results in time issues, but they may not be major since only sites with small numbers of ratings/item will have to do this type of adjusting. Further, automated systems could flag "suspicious" rating patterns which are outside the norms for average, speed of rating, etc. (RPGnet supports editorial fiat by requiring editorial release of all reviews.)
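As promised above, here is a minimal sketch of the simplest "map the rater's average to the site average" adjustment. Shifting every score by the gap between the two averages (and clamping to the scale) is just one possible mapping and an assumption here; rescaling around the scale's midpoint would be another.

def normalize_rating(rating, rater_avg, site_avg, lo=1.0, hi=10.0):
    # Shift each of a rater's scores by the gap between their personal
    # average and the site-wide average, then clamp to the scale.
    return max(lo, min(hi, rating - (rater_avg - site_avg)))

# The shill from the list above: a rater who averages 10.0 on a site that
# averages 6.50 has their 10s treated as 6.50s.
print(normalize_rating(10.0, rater_avg=10.0, site_avg=6.50))   # 6.5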
The idea of adjusting ratings based on ratings bears a bit of additional discussion because it's somewhat similar to another well-known rating system: Slashdot. Herein you have both ratings and meta-ratings. People can rate threads and articles, then other people can agree or disagree with those ratings, which in turn makes it more or less likely that the original rater will be allowed to rate in the future (depending on whether people agree or disagree with his ratings). Under a more general classification, this is probably a meta-rating system based on a reputation system, so it's something we'll look at further a couple of articles down the road.
90% of the rating issues that sites will face are covered by the above. However eBay in particular raised two other issues -- bilateralism and usefulness -- that aren't as generally relevant but do deserve some consideration.
Bilateralism: One of the reasons that eBay's ratings fall apart is that they're bilateral. Buyers and sellers rate each other simultaneously, and thus there's the fear of revenge if you rate someone badly. It's a big enough issue that eBay has a FAQ on the topic, though they don't offer any good answers.
The following solution would address some issues of bilateral revenge:
Put a time limit on bilateral ratings
Release bilateral ratings simultaneously at the end of the time limit
Don't allow additional ratings after the time period
This would work well on a site like eBay, where you're unlikely to conclude an additional deal with someone you rated badly, and thus there's never any possibility of rating revenge. On a game site, however, where people are arbitrarily put into games with each other, you could end up in a game with someone you rated poorly, so there might be room for revenge down the road. This would have to be addressed to truly feel comfortable with bilateral ratings.
Additional investigation might reveal more variations of this method, or offer good answers for alternatives, like anonymous ratings.
In addition, good privacy protections are really needed to make bilateral ratings work, as well as Terms of Service that protect users from lawsuits over ratings. There have already been cases of physical threats based upon eBay ratings. eBay has also produced cases where people threatened slander or libel lawsuits over bad ratings, and this further chills the possibility of honest ratings appearing on eBay.
Usefulness: Finally, you want to make sure your ratings are useful at your sites. Rankings are a good way to achieve this. You can see the "best games ever" ranked, or you can see the most interesting user content rise to the top of a long listing, and the least interesting sink to the bottom.
eBay offers a counterexample of frustration with the usefulness of ratings. As already mentioned, you can theoretically ban "bad" users from bidding on your items, and then cancel bids from these users if they appear. However, there are multiple issues with this approach. First, how do you define "bad" users on eBay? Insufficient feedback? Too much negative feedback? Too high a percentage of negative feedback? Second, there is no automated method for doing this, so you must remain ever vigilant on your auctions to make sure that "bad" users aren't involved. Third, there's no way to keep a bad bidder from returning after you've cancelled his bid. Fourth, these bad bids and cancellations have the possibility of corrupting your auction, as you could lose other bidders who came in, saw the higher bid while the bad bidder was involved, and then left before the bid was reduced by his removal. Finally, greed is a powerful motivator on eBay, which might lead to the retention of bad users.
You also need to be careful with your user interface for ratings.
Comparison ratings are going to be an increasingly important force as the Internet continues to mature. To produce meaningful comparison ratings for your site, you need to concentrate on four important factors: granularity, distinction, sound statistics, and usefulness. And, if you offer bilateral ratings, make sure you understand the subtleties of that as well.
Related articles from this blog:
2005-12: Systems for Collective Choice 2006-01: Collective Choice: Competitive Ranking Systems 2006-08: Using 5-Star Rating Systems 2007-01: Experimenting with Ratings
Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:
Technorati Tags: altruistic punishment, bayesian, bilateralism, board game geek, collective choice, comparison systems, ebay, editorial fiat, feedback, granular ratings, meta-ratings, rating scale, rating systems, rating weight, retaliation, rpgnet, specific ratings, statistical ratings, usefulness, user-contributed content, wow web designs
by Christopher Allen & Shannon Appelcline
[Shannon Appelcline is a friend and colleague of mine at Skotos, an online game company. Over the last few years we've had many discussions about how decisions are made, and how our society collectively makes choices. The origins of these discussions have varied from "what makes this board game work?", to "how can we give our players more control of our online games?", to "how do we make decisions in our company?", and of course "how did we collectively make such a mess of decision making in America?". This article, and some followup articles, summarize our thoughts on these topics, and will be jointly posted in Shannon's Trials, Triumphs & Trivialities online games column at Skotos.]
Collective choice systems have been around for a long time. Since at least the birth of democracy in ancient Greece people have made joint decisions about important issues, and since at least the knightly tournaments of the late Middle Ages people have competed to be ranked against their peers. Today Western culture especially values diversity of input when implementing any type of choice, believing that wide input from a variety of people provides the fairest result.
The Internet expands this long history of collective choice. However, as we bring collective choice systems onto the Internet, quantifying and programming them, we discover the need to be more analytical and more methodical in the techniques used. Thus we're beginning to learn that we don't know nearly as much about these collective choice systems as we should. There is a need to analyze and study them further, to understand their strengths and weaknesses, and to evaluate their social impact. Fortunately, the social software and online games of the Internet provide the perfect petri dish for doing so.
Before any analysis can occur, however, there is a need for a categorization of systems and a definition of terms. That is the purpose of this article: to lay out at least some of the ways in which collective choices can be made, to organize them, to define them, and to briefly consider them.
Broadly, there seem to be three methods of collective choice, divided by the intended result: selection, opinion, or comparison.
Selection systems allow for the purposeful choice between multiple items. There are many types of selection systems, but three in particular are worth noting: representative systems, deliberative systems, and consensus systems.
Representative Systems: In a representative system, individuals cast a ballot for someone who will represent their interests. They're by definition voting systems and the heart of any republican system of government. When you're voting for a president, prime minister, senator, congressman, director, or board member, that's representative voting.
In most representative voting a winner is selected by plurality, meaning the winner had more votes than any other candidate. This works well in a simple two-candidate election, but begins to fall apart when there are multiple candidates, because similar candidates can steal votes from each other, and thus allow a candidate with less popular ideas to be elected.
The simplest solution is to require a majority victory, meaning that the winner must have at least 50% of the votes. Some places in the United States use this system for their representative elections, holding a first election, eliminating all but the two biggest vote-getters, and then holding a runoff between those two.
Another solution is a primary-based election system, wherein all like-minded candidates compete against each other before participating in the real election. This requires buy-in from all like-minded candidates, however, and recent U.S. elections with third-party candidates like Ralph Nader and Ross Perot show the flaws in a voluntary primary system.
Many other types of voting systems are possible, most of which allow voters to select multiple candidates at the same time. These systems then eliminate the lowest ranked candidates and give their votes to others based upon those voter selections.
Instant Runoff Voting, or IRV, is a fairly commonly used multiple candidate system (though not necessarily the best one). It's technically a single transferable vote preferential voting system. Wikipedia describes the process like this:
Each voter ranks at least one candidate in order of preference.
First choices are tallied. If no candidate has the support of a majority of voters, the candidate with the least support is eliminated. A second round of counting takes place, with the votes of supporters of the eliminated candidate now counting for their second choice candidate. After a candidate is eliminated, he or she may not receive any more votes.
This process of counting and eliminating is repeated until one candidate has over half the votes.
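The counting procedure is short enough to sketch in a few lines of Python. This is only a toy: ties, exhausted ballots, and the other edge cases that real election rules spell out are glossed over, and the Bush/Gore/Nader ballots are invented purely for illustration.

from collections import Counter

def instant_runoff(ballots):
    # Each ballot is a list of candidates in the voter's order of preference.
    candidates = {c for ballot in ballots for c in ballot}
    while True:
        # Count every ballot toward its highest-ranked surviving candidate.
        tally = Counter({c: 0 for c in candidates})
        for ballot in ballots:
            for choice in ballot:
                if choice in candidates:
                    tally[choice] += 1
                    break
        total = sum(tally.values())
        leader, votes = tally.most_common(1)[0]
        if votes * 2 > total or len(candidates) == 1:
            return leader
        # No majority yet: eliminate the candidate with the least support.
        candidates.remove(min(tally, key=tally.get))

ballots = [["Gore", "Nader", "Bush"]] * 4 + [["Bush", "Gore"]] * 3 + [["Nader", "Gore"]] * 2
print(instant_runoff(ballots))   # Gore, once Nader is eliminated and his votes transfer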
It's simple to understand, but also flawed. That's because every voting system ultimately has some flaw in it, as evidenced by the fact that, given the same ballots, different systems will often declare different winners. Different systems may also allow voters to "game" the system in different ways. This is often called tactical voting or strategic voting. Similar to analyzing various "attacks" when studying a cryptosystem, looking at which tactical voting approaches each voting system is vulnerable to helps you evaluate the voting systems.
For instance, one type of tactical voting that can be used against an Instant Runoff Vote is the push-over strategy, which Wikipedia describes as this:
Push-over is a type of strategic voting in which a voter ranks a perceived weak alternative higher, but not in the hopes of getting it elected. This primarily occurs in runoff voting when a voter already believes that his favored candidate will make it to the next round - the voter then ranks an unpreferred, but easily beatable, candidate higher so that his preferred candidate can win later. A United States analogy would be voters of one party crossing over to vote in the other party's primary to nominate a candidate who will be easy for their favorite to beat.
For example, in an IRV election between Bush, Gore, and Nader, some Democrats might insincerely rank Nader first, ahead of Gore. The hope would be that this gives Nader enough first-choice votes to keep him from being eliminated, thus knocking Bush out instead in the first round. Afterward the Republican Bush votes transfer to the less progressive Gore rather than the fringe Nader, allowing Gore to beat the "push-over" Nader in the final, when he might not have beaten Bush in a straight-up fight.
There are several different tactical voting "attacks" against various representative voting systems. One of the technically better multiple-candidate voting methods is the Condorcet method. It's immune to most tactical voting strategies and more people would consider its result "correct", but unfortunately it is much harder on the voters, who have to rank every single candidate. Maybe if we can create a better Internet user interface for Condorcet voting we can make this more sophisticated representative voting system more broadly available.
Deliberative Systems: In a deliberative system, individuals directly make a decision, rather than selecting a representative to do so. Deliberative systems do not have to include voting (the subcategory of consensus systems described below technically doesn't), but most modern deliberative systems do. A deliberative system is the heart of true democracy. Traditionally it's been relatively infeasible, because voters were not expected to be educated enough to make governmental decisions and because they didn't have the time or capability to regularly decide on issues. The spread of the Internet alleviates at least the latter problem, since millions of people can now simultaneously decide on any issue if they so desire.
In the United States the best known deliberative system is the initiative system found in some states, including California. It allows for issues to be put directly before the voters through the submission of sufficient signatures, and then allows the voters to pass or fail those issues, based on either plurality (most votes), majority (at least 50% of votes), or else super majority (some percentage of votes in excess of 51%). In California, for example, 66% approval is required for new tax initiatives.
The United States constitution defines a large and very complex deliberative system. It creates three branches of government to support deliberation and voting, and uses a system of checks and balances to allow the different branches to have different effects upon a vote. The main voting is done by the legislature, which requires majorities from two different groups of people to pass a vote. Then the executive branch has a singular opportunity to veto legislation, which then requires a super majority (here, 66%) to override that veto. Once a law is established, the judicial branch may declare that legislation unconstitutional, but that in turn may be overcome by an even greater super majority (typically, two-thirds of each house of Congress plus three-fourths of the states) who want to amend the constitution.
The constitution also shows how deliberation can span beyond simple voting, because it includes specific rules for how to debate, when debate can be closed, and so forth. In today's very fractured Congress, however, it's unclear whether individuals are ever actually swayed by deliberations on the floor of the legislature, or whether they've already decided to follow their party lines or their specific interests long before they entered the Capitol buildings.
A smaller example of a true deliberative system, based on guiding discussions as much as holding votes, is found in Robert's Rules of Order, a guide for conducting meetings. These rules detail explicit methods not just for voting, but also for the deliberation and discussion surrounding the voting. Various majority and minority votes can be taken to allow for certain actions.
Because deciding directly upon ideas rather than just voting for representatives can have a greater effect upon a community, deliberative systems may need to be more complex to avoid abuse, as evidenced by the complexities of the U.S. Government and Robert's Rules of Order. However, these very complexities can make these systems more prone to purposeful gaming. The benefits and deficits of more complex deliberative systems have not yet been fully studied, nor has there been as much analysis of "attacks" against them.
Consensus Systems: In consensus systems people jointly come to a consensus as a group through group interactions. This sort of decision making theoretically avoids the "tyranny of the majority" and likewise can produce more informed decision making. It's a variant of the broader deliberative systems, but one with more group and less individual power.
One example of consensual selection is cabinet government as laid out under the Westminster System. Wikipedia describes it as follows:
Members of the Cabinet are collectively seen as responsible for government policy. All Cabinet decisions are made by consensus, a vote is never taken in a Cabinet meeting. All ministers, whether senior and in the Cabinet, or junior ministers, must support the policy of the government publicly regardless of any private reservations. If a minister does not agree with a decision he, or she, can resign from the government; as did several British ministers over the 2003 Invasion of Iraq. This means that in the Westminster system of government the cabinet always collectively decides all decisions and all ministers are responsible for arguing in favour of any decision made by the cabinet.
Quaker-based consensus offers a similar example. Herein a facilitator helps to identify disagreements and agreements to move a discussion forward until an end result is embraced by all individuals.
As a final note, it's important to differentiate consensus from coercion. The end result of unanimity isn't the sole definition of a consensus system, nor is it entirely required. What is required is a more open and thoughtful selection process.
Opinion systems are a clear subsidiary category to selection systems. An opinion system's main use is as a decision indicator, to show how people will decide or did decide in a representative system, a deliberative system, or both. Current opinion systems tend to be oriented toward actual votes, as opposed to more freeform selection systems (though the delphic polling system shows a more freeform version of the category itself). Opinion systems tend to be push-based (meaning people are asked for their opinions rather than actively offering them), but this isn't required.
All opinion systems tend to have the same general problem, which is figuring out how to use scientific means to predict (or reconstruct) the actual results of a decision. This means weighting respondents to offset categories of people who are more or less likely to vote. For example, one 1998 poll showed that 62% of Republicans were absolutely certain they were going to vote, while only 51% of Independents could say the same. This means that every Republican voter the poll contacted that year might have been weighted about 1.2x over every Independent contacted. Of course the actual calculations are much more complex than that, since they tend to depend upon traditional voter turnout and lots of analysis, but the core idea is sound: every polled individual should not be considered equal.
All opinion system results tend to be reported with margins of error. The margin of error is a percentage spread which the poll is expected to fall within, 90-99% of the time (depending on how conservative a confidence level is chosen). If a poll shows that a politician is expected to take 48% of the vote, for example, and the margin of error is 4%, that means he is expected to take 44-52% of the vote with 90-99% surety. Margins of error are typically given much greater importance in the modern media than they deserve, as they're calculated solely from the total number of respondents to a poll.
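Both of those calculations are simple enough to sketch. The weighting below reproduces the 1.2x Republican/Independent example, and the margin of error is the textbook formula that depends only on sample size (z = 1.96 corresponds to 95% confidence, inside the 90-99% band mentioned above); real pollsters layer far more elaborate turnout and demographic models on top of this, and the 600-person sample is invented purely for illustration.

import math

def turnout_weight(group_likelihood, baseline_likelihood):
    # 62% certain-to-vote Republicans vs. 51% Independents gives each
    # Republican respondent roughly a 1.2x weight.
    return group_likelihood / baseline_likelihood

def margin_of_error(respondents, z=1.96, p=0.5):
    # Widest at p = 0.5, which is what pollsters conventionally report.
    return z * math.sqrt(p * (1 - p) / respondents)

print(round(turnout_weight(0.62, 0.51), 2))              # 1.22
print(round(100 * margin_of_error(600), 1), "points")    # 4.0 points for a 600-person sample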
There are two general categories of opinion systems: pre-voting (subjective) polling systems and post-voting (objective) polling systems. A different type of opinion system, delphic polling, which could apply to either pre- or post-voting systems, is also covered. Polling systems not directly related to selection systems are covered later, as subjective rating systems, since they tend to have issues very different from other polling systems: their goal isn't to try to match the "true" number of an actual vote.
Pre-voting Polling Systems: These are polls made before a vote is cast. They're often called "opinion polls" and tend to be conducted via phone. They try to isolate "likely voters" and determine how they will vote. This question of voter likelihood is one of the first issues with a pre-voting system, because there's no guarantee that the polled people will actually later vote. Likewise, pre-voting systems have to accommodate "undecided voters" and the fact that no voter has truly made up their mind until they cast their final ballot. Unlike post-voting polling systems, pre-voting systems also have considerably more possibility for bias (which is not accounted for by margins of error), based upon how questions are asked, in what order, and with what additional text.
Post-voting Polling Systems: These are polls taken after a vote is cast. They're typically called "exit polls", as most are conducted as people are leaving a "polling" station (where they cast a vote). One would expect these to be much more reliable than pre-voting polls, but as the 2004 U.S. Presidential Election showed, exit polls can be wildly inaccurate.
One of the problems with post-voting polling systems, shared with pre-voting systems, is that the results must be adjusted to make sure that respondents to the poll match the percentages of those constituencies in the overall population. For example, in the 2004 exit polls it appears that women were initially overrepresented, and because of increased black turnout it appears that blacks were underrepresented. It can easily be seen how either of these misrepresentations could cause notable changes in an exit poll result.
When conducted and weighted correctly, exit polls are supposed to be quite reliable.
Delphic Polling Systems: An interesting polling method applicable to all sorts of opinion systems is the "delphi poll". This is a specific method of polling which is iterative and anonymous and which supports confidence ratings and feedback. The general idea is that people are polled on a question using not just binary responses, but a full confidence rating (e.g., you would state that you are 60% sure that Bush would be elected, rather than stating that you think Bush would be elected). After polls are collected, the anonymous results--or at least a summary of those results--are shared with the participants, who then poll again. This iterative process continues until a consistent answer is settled upon. By incorporating feedback into the polling process there's the possibility for greatly increased reliability.
In some ways delphic polling systems can be seen as an analogy to consensus systems, since both involve more iterative processes that eventually result in a more commonly-held decision.
Comparison systems allow individual items to be measured up against each other. There are three general categories: comparison ranking systems, which are largely objective and which typically rank people; comparison rating systems, which more often mix subjective and objective opinions, and which more frequently rate things; and reputation rating systems, which again tend to rank people, but also have a subjective and objective mix.
Comparison Ranking Systems: In a ranking system, items in a hierarchy (most frequently people) rise or fall based upon specific, objective, and well-known rules. This is the heart of most multiplayer competitive systems.
The ELO System is an example of a ranking system used for two-player games, and is used by the U.S. Chess Federation. Days of Wonder uses a multiplayer variant of the ELO system for their online games. Each system builds a simple distribution of player ratings around a norm (typically 1500 points), then awards or deducts points based upon wins and losses, with the total sum of all points in the system staying constant. Players are then ranked according to their comparative scores.
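The two-player update at the core of these variants is short. A minimal sketch in Python, where the K-factor of 32 and the 1500-point norm are common conventions assumed for illustration rather than any particular federation's exact parameters:

def elo_update(winner, loser, k=32):
    # Expected score of the eventual winner, given the two current ratings.
    expected = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    delta = k * (1.0 - expected)   # the winner's gain is exactly the loser's loss
    return winner + delta, loser - delta

print(elo_update(1500, 1500))   # (1516.0, 1484.0): evenly matched players
print(elo_update(1500, 1700))   # an upset win over a stronger player pays out more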
There are flaws in ranking systems like ELO. For example, two players could collude, with one purposefully throwing games so that his opponent could increase his ranking. Alternatively if a player gets a few lucky victories against good opponents, his rating might temporarily skyrocket above its normative value. However, these tend to be well-known and well-researched problems.
There are numerous other ranking systems which are used for competitions, from double-elimination seeded tournaments (e.g., a tennis tournament) to ranked comparisons based upon win-loss ratios (e.g., baseball standings). Objective rankings are also (less commonly) used to rank items, such as a ranking of cars based upon safety ratings.
Most ranking systems create a hierarchy of positive rankings (e.g., "best chess players ever"). However, a hierarchy of negative rankings may also be used, most commonly based on negative criteria (e.g., "biggest Player Killers (PKers)"). In addition, either direction of ranking can use threshold systems to mark positive or negative rankings that meet a certain criterion. A positive threshold might be a "Grand Master" ranking for anyone with a Chess rating of 2700, while a negative threshold might be a "Player Killer" ranking for anyone with sufficient "accidental" PKs.
Ranking systems are somewhat removed from the other collective choice systems listed here, since there isn't a collaborative decision, only a collective result. However, their problems and results remain closely related to those of the more collective rating and reputation systems, hence their inclusion.
Rating Systems: In a rating system, the value of individual items (most frequently goods) rise or fall based upon the largely subjective judgment of individual users.
Amazon and Netflix are two examples of stores which provide subjective rating systems. Individual users rate items from 1 to 5 stars, then an average user rating is calculated. BoardGameGeek offers a slightly different example because it not only lets users rate individual items, but also ranks items against each other based upon those ratings.
Flaws in these systems are similar to those in ranking systems: low numbers of ratings producing bad rankings, and individual users purposefully biasing ratings. Some mathematical methods may be used to smooth out these issues, among them bayesian averages, which give ratings weight based upon total number of ratings for an item.
The Stock Market offers an example of a different sort of rating system, because there's theoretically some objective basis to it. In a perfect Stock Market system, stock prices are based upon a solid cost analysis, such as a multiplier on yearly revenues or profits. However, as the Internet bubble of the late 1990s conclusively showed, there's also a high irrational component to stock purchases: thus subjective and objective views are combined in the rating (cost) of a stock.
Reputation Systems: Finally, reputation systems are very similar to ranking systems: items in a hierarchy (most frequently people) rise or fall based upon specific and well-known rules. However, unlike true ranking systems, reputation systems instead base their rules for rise and fall upon other user feedback.
The goal of a reputation system is ultimately to create a trust metric that often allows different users access to different powers. We'll be covering reputation systems a bit more thoroughly in a couple of weeks.
There are a variety of ways to measure the collective choices of a large group of people. We've outlined nine here: representative, deliberative, and consensus selection systems; ranking, rating, and reputation comparison systems; and three varieties of opinion systems. When developing social software it is important to understand the difference between these broad categories of systems and to use lessons already learned from the appropriate category in your own social software designs.
Related articles from this blog:
2005-12: Collective Choice: Rating Systems 2006-01: Collective Choice: Competitive Ranking Systems 2006-08: Using 5-Star Rating Systems 2007-01: Experimenting with Ratings
Related articles from Shannon Appelcline's Trials, Triumphs & Trivialities:
Technorati Tags: collective choice, comparison, deliberation, delphic poll, democracy, exit poll, opinion, polling, ranking, rating, representation, robert's rules of order, rules, selection, social choice, tactical voting, voting, voting system attacks, voting system criteria, voting systems
There is some more excellent research this week by Nick Yee and Nicolas Ducheneaut in the PlayOn blog. Again, their research provides good insight into social group dynamics as they appear in online games.
I last mentioned their research on guild sizes in my blog post Dunbar & World of Warcraft, where I compare the distribution of guild sizes in Ultima Online vs PlayOn's results from World of Warcraft. However, both distribution tables suffer from a variety of biases due to the nature of the different game designs, many of which are discussed in the comments on the post. For instance, both guild graphs include "alts", which are alternative characters of individual players. Thus one player might be represented multiple times in a guild.
In Nic & Nick's more recent research, they are looking deeper and are mapping social networks (now 404, use Internet Archive) on World of Warcraft. Using tools available to them, over the course of the month of August they looked at 241,378 characters and 5,569 guilds. From that data they were able to discover how often guild members were co-present (online at the same time) or co-located (online and in the same zone in the game).
For example, in this map you can see:
...we have a guild where we see more distinct cliques. There's a somewhat hard-core 4 character cluster on the left-hand side, a mid-level triad on the bottom left, a mid 20's clique that's held together by the druid in the middle, and finally a more casual low-level clique on the top right.
Using this same data, they then looked at the max subgraph size of guilds (now 404, use Internet Archive). A "subgraph" in a social network can be thought of as a "clique" of people that interact with each other.
For instance, in the social network of this guild there are 5 subgraphs: 4 of them contain just 2 people, and 1 contains 6. The 4 two-person subgraphs are members of the guild who did not participate much in the guild. But the subgraph of 6 is much more interesting, as it shows that these members are "cohesive".
When the max subgraph size of a guild is plotted against the guild size, you get some interesting results -- the maximum guild cohesiveness occurs around a guild size of 50. Larger than that, guilds have a much more difficult time remaining cohesive.
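For the curious, the measurement itself is straightforward to reproduce on any co-presence data. A minimal sketch using the networkx library; the member names and links below are invented, and PlayOn's actual criteria for when two characters count as linked aren't reproduced here. Note that "subgraph" is used above in the loose sense of a connected group, which is what this computes.

import networkx as nx

def max_subgraph_size(members, co_presence_links):
    # Build a graph of who has been online together, then take the size
    # of the largest connected group of guild members.
    graph = nx.Graph()
    graph.add_nodes_from(members)
    graph.add_edges_from(co_presence_links)
    return max(len(component) for component in nx.connected_components(graph))

members = ["A", "B", "C", "D", "E", "F", "G", "H"]
links = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E"), ("F", "G")]
print(max_subgraph_size(members, links))   # 3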
These results strongly support my hypothesis in my original Dunbar Number post, where I...
hypothesize that the optimal size for active group members for creative and technical groups -- as opposed to exclusively survival-oriented groups, such as villages -- hovers somewhere between 25-80, but is best around 45-50. Anything more than this and the group has to spend too much time "grooming" to keep group cohesion, rather than focusing on why the people want to spend the effort on that group in the first place -- say to deliver a software product, learn a technology, promote a meme, or have fun playing a game. Anything less than this and you risk losing critical mass because you don't have requisite variety.
Note that the mean amount of time that guilds spend together in the 31-60 band is only 95 minutes, and that the standard deviation there is the lowest, as compared to groups larger than 120, where the mean time is 141 minutes and the standard deviation is higher. This says to me that keeping a large group cohesive requires significantly more time in social "grooming".
In my original Dunbar Number post I had a secondary hypothesis where I talk about a peak in satisfaction in smaller groups:
In my opinion it is at 5 that the feeling of "team" really starts. At 5 to 8 people, you can have a meeting where everyone can speak out about what the entire group is doing, and everyone feels highly empowered. However, at 9 to 12 people this begins to break down -- not enough "attention" is given to everyone and meetings risk becoming either too noisy, too boring, too long, or some combination thereof.
Here you can see a strong peak at 10 people that falls rapidly with guild size. This shows that guild cohesion is relatively easy to maintain up to a size of 10, but becomes much more difficult to maintain as guilds grow larger.
I've been saying for some time that studying online games is a valuable way to understand social software. I'm quite pleased that this research is proving this to be true.
Some other posts about the Dunbar Number and group size issues: