What does an 18th-century English minister have in common with such modern innovations as e-mail spam filters, Google searches, and even Clippy, the iconic (and sometimes reviled) paperclip-shaped Microsoft Office Assistant?
In The Theory That Would Not Die, Sharon Bertsch McGrayne depicts the quiet birth and controversial coming of age of Bayes’ rule, a one-line theorem invented by the Reverend Thomas Bayes sometime in the 1740s. The theorem itself seemed simple enough, stating that we can arrive at new and improved beliefs by modifying initial beliefs in accordance with objective new information. However, it was actually groundbreaking in its transformation of probability from “a gambler’s measure of frequency into a measure of informed belief.”
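In modern notation, the one-line theorem is usually written as below. The symbols are this post's own labels rather than McGrayne's: H is a hypothesis, E is the new evidence, P(H) is the initial belief, and P(H | E) is the updated belief.

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```

Read from right to left, the formula takes the initial belief P(H), weights it by how well the hypothesis explains the evidence, and rescales by how likely the evidence was overall.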
After achieving this breakthrough, Bayes seems to have abandoned it. His essay gathered dust with all of his other papers and was published posthumously with assistance from his friend Richard Price. Bayes’ rule remained in relative obscurity until Pierre Simon Laplace, a major figure in the development of mathematics, astronomy, and statistics, discovered a version of the theorem on his own in 1774 and advanced it into its current form. These days, Bayes’ (or really, as McGrayne argues, Laplace’s) rule has permeated systems and networks all around us; if you are reading this blog post, you have probably interacted with some Bayesian ideas just now. In The Theory That Would Not Die, McGrayne draws attention to some of the ways in which the theorem intersects with Internet technologies that have grown indispensable to us:
Spam filters
Bayesian methods attack spam by using words and phrases in the messages to determine the probability that the message is unwanted. An e-mail’s spam score can soar near certainty, 0.9999, when it contains phrases like “our price” and “most trusted”; coded words like “genierc virgaa”; and uppercase letters and punctuation like !!! or $$$. High-scoring messages are automatically banished to junk mail files. Users refine their own filters by reading low-scoring messages and either keeping them or sending them to trash and junk files.
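The scoring described above can be sketched roughly as follows, assuming a naive Bayes model in which each phrase is treated as independent evidence. The token table, its probabilities, and the function name spam_probability are all hypothetical and chosen only for illustration; real filters learn these values from each user's mail.

```python
import math

# Hypothetical likelihoods: (P(token | spam), P(token | not spam)).
# The numbers are invented for illustration, not taken from any real filter.
TOKEN_LIKELIHOODS = {
    "our price": (0.20, 0.01),
    "most trusted": (0.15, 0.01),
    "!!!": (0.30, 0.02),
    "meeting": (0.01, 0.10),
}


def spam_probability(tokens, prior_spam=0.5):
    """Return P(spam | tokens), assuming tokens are independent given the class."""
    # Start from the prior odds and work in log space to avoid underflow.
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for token in tokens:
        if token in TOKEN_LIKELIHOODS:
            p_spam, p_ham = TOKEN_LIKELIHOODS[token]
            # Each known token multiplies the odds by its likelihood ratio.
            log_odds += math.log(p_spam / p_ham)
    # Convert the log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


suspicious = spam_probability(["our price", "most trusted", "!!!"])
ordinary = spam_probability(["meeting"])
```

With these made-up numbers, three spam-like phrases together push the score very close to 1, while a message containing only "meeting" falls well below one half. Retraining on the messages a user keeps or junks would amount to updating the table.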
Microsoft
Bayesian theory is firmly embedded in Microsoft’s Windows operating system. In addition, a variety of Bayesian techniques are involved in Microsoft’s handwriting recognition; recommender systems; the question-answering box in the upper right corner of a PC’s monitor screen; a data-mining software package for tracking business sales; a program that infers the applications that users will want and preloads them before they are requested; and software to make traffic jam predictions for drivers to check before they commute.
Bayes was blamed—unfairly, say [David] Heckerman and [Eric] Horvitz—for Microsoft’s memorably annoying paperclip, Clippy. The cartoon character was originally programmed using Bayesian belief networks to make inferences about what a user knew and did not know about letter writing. After the writer reached a certain threshold of ignorance and frustration, Clippy popped up cheerily with the grammatically improper observation, “It looks like you’re writing a letter. Would you like help?” Before Clippy was introduced to the world, however, non-Bayesians had substituted a cruder algorithm that made Clippy pop up irritatingly often. The program was so unpopular it was retired.
Netflix
Bayes and Laplace would probably be appalled to learn that their work is heavily involved in selling products. Much online commerce relies on recommender filters, also called collaborative filters, built on the assumption that people who agreed about one product will probably agree on another. As the e-commerce refrain goes, “If you liked this book/song/movie, you’ll like that one too.” The updating used in machine learning does not necessarily follow Bayes’ theorem formally but “shares its perspective.” A $1-million contest sponsored by Netflix.com illustrates the prominent role of Bayesian concepts in modern e-commerce and learning theory. In 2006 the online film-rental company launched a search for the best recommender system to improve its own algorithm. More than 50,000 contestants from 186 countries vied over the four years of the competition. The AT&T Labs team organized around Yehuda Koren, Christopher T. Volinsky, and Robert M. Bell won the prize in September 2009.
Interestingly, although no contestants questioned Bayes as a legitimate method, almost none wrote a formal Bayesian model. The winning group relied on empirical Bayes but estimated the initial priors according to their frequencies. The film-rental company’s data set was too big and too filled with unknowns for anyone to, almost instantaneously, create a model, assign priors, update posteriors repeatedly, and recommend films to clients. Instead, the winning algorithm had a Bayesian “perspective” and was laced with Bayesian “flavors.” However, by far the most important lesson learned from the Netflix competition originated as a Bayesian idea: sharing.
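To give a feel for what "empirical Bayes" with priors estimated from frequencies can mean, here is a minimal sketch. It is not the winning team's method. It assumes the prior for every film's average rating is simply the overall mean of the observed ratings, and it uses a fixed, hypothetical prior_strength (a fuller empirical Bayes treatment would estimate that from the data as well). The film names and ratings are invented.

```python
def shrunk_mean(ratings, prior_mean, prior_strength=5.0):
    """Estimate a film's mean rating, pulled toward prior_mean.

    prior_strength acts like a number of imaginary ratings at prior_mean,
    so films with few real ratings are pulled harder toward the prior.
    """
    n = len(ratings)
    return (sum(ratings) + prior_strength * prior_mean) / (n + prior_strength)


# Hypothetical ratings on a 1-5 scale.
ratings_by_film = {
    "film_a": [5, 5],                           # two ratings only
    "film_b": [4, 4, 5, 4, 3, 4, 5, 4, 4, 4],   # ten ratings
}

# The "empirical" step: estimate the prior mean from the data itself.
all_ratings = [r for rs in ratings_by_film.values() for r in rs]
global_mean = sum(all_ratings) / len(all_ratings)

estimates = {
    film: shrunk_mean(rs, global_mean) for film, rs in ratings_by_film.items()
}
```

The film with only two perfect scores ends up well below 5, because two ratings are weak evidence against the overall average, while the film with ten ratings barely moves from its raw mean.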
Volinsky had used Bayesian model averaging for sharing and averaging complementary models while working in 1997 on his Ph.D. thesis about predicting the probability that a patient will have a stroke. But the Volinsky and Bell team did not employ the method directly for Netflix. Nevertheless, Volinsky emphasized how “due to my Bayesian Model Averaging training, it was quite intuitive for me that combining models was going to be the best way to improve predictive performance. Bayesian Model Averaging studies show that when two models that are not highly correlated are combined in a smart way, the combination often does better than either individual model.” The contest publicized Bayes’ reputation as a fertile approach to learning far beyond mere Bayesian technology.
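Volinsky's point about combining models can be illustrated with a toy example. The sketch below uses a plain equal-weight average, not Bayesian model averaging itself, and every number in it is invented: two hypothetical models are each off by half a star on every film, but their errors are not correlated with each other.

```python
import math


def rmse(predictions, truth):
    """Root-mean-square error between predicted and true ratings."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared) / len(truth))


# Hypothetical true ratings and two hypothetical models' predictions.
truth = [3.0, 4.0, 2.0, 5.0, 1.0, 4.0]
model_a = [3.5, 3.5, 2.5, 4.5, 1.5, 3.5]
model_b = [3.5, 4.5, 1.5, 4.5, 1.5, 4.5]

# Equal-weight blend: where the models err in opposite directions,
# their errors cancel; where they agree, the error is unchanged.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

error_a = rmse(model_a, truth)
error_b = rmse(model_b, truth)
error_blend = rmse(blend, truth)
```

Here the blend has a lower error than either model alone. If the two models made identical mistakes, averaging them would gain nothing, which is why the lack of correlation matters.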
Google searches
Web users employ several forms of Bayes to search through the billions of documents and locate what they want. Before that can happen, though, each document must be profiled or categorized, organized, and sorted, and its probable interconnectedness with other documents must be calculated. At that point, we can type into a search engine the unrelated keywords we want to appear in a document, for example, “parrots,” “madrigals,” and “Afghan language.” Bayes’ rule can winnow through billions of web pages and find two relevant ones in 0.31 seconds. “They’re inferential problems,” says Peter Hoff at the University of Washington. “Given that you find one document interesting, can you find other documents that will interest you too?”
Excerpted from The Theory That Would Not Die by Sharon Bertsch McGrayne. Copyright © 2011 by Sharon Bertsch McGrayne. All rights reserved.