University of Pennsylvania
The development of a perceptron classifier for Inform, a language for the development of interactive fiction (IF), is described and its effectiveness at a sample task is assessed. The Perform kit consists of Inform libraries and scripts in Perl and Octave that can be used to train a binary classifier, specifically, a perceptron. To demonstrate the perceptron classifier and as a test of its utility, it was trained and implemented to do a "real world" task that is done in an existing piece of interactive fiction. While there remain many practical and conceptual barriers to use of machine learning techniques, this work demonstrates that such approaches can be incorporated into the creations of IF authors and have the potential to contribute to the advancement of the form.
Interactive fiction authors have innovated in many ways in recent years, but, as work on a recent article (Montfort 2005) led me to realize, artificial intelligence has not been involved in this innovation. A strong case has been made that the development of creative artificial intelligence systems can not only further art and literature, but can also advance artificial intelligence research (Mateas 2001). As I find this idea persuasive, I have undertaken an exploration of how machine learning techniques can be used in interactive fiction developed in Inform. The result is Perform, a library which will hopefully be played with and built upon by others who will use it to construct parts of their interactive fiction programs.
Inform is a flexible, effective language for creating interactive fiction that was developed by Graham Nelson in 1993 (Nelson 2001). Nelson also developed well-crafted libraries for Inform, including a capable parser. The language and libraries have been used to create hundreds of games and interactive literary works, and they have evolved thanks to the contributions of many in the IF community.
The platform for which Inform was originally developed to compile to is the Z-Machine (Nelson 1997). The Z-Machine was developed in 1979 by Joel Berez and Marc Blank of Infocom. It is the first commercial virtual machine, predating the now-popular Java VM by about fifteen years. It was revived as a target platform for non-commercial IF authors in the 1990s. Z-Machine interpreters are available today for literally dozens of platforms, including discontinued older systems, current desktop/laptop operating systems, Web browsers, and PDAs. The Z-Machine system is compact and typical interactive fiction runs quickly in it.
As useful and popular as the Inform/Z-Machine system is, it is also extremely austere. The size of story files, in which programs reside, cannot exceed 512KB; readable memory is only 64KB; there is no built-in support for floating point numbers, long integers, or manipulable strings. Rather surprisingly, an Inform extension does provide floating point support on the Z-Machine, although the example program demonstrating this library does not function properly when compiled with Inform 6.30, the current version.
A new virtual machine, Glulx, has been developed by Andrew Plotkin, which Inform 6.30 can target. The Glulx VM is rich in cross-platform multimedia capabilities and has 32-bit addresses to allow much larger games, but it was designed with traditional IF requirements in mind and so lacks features that have no obvious use in traditional IF, such as support for floating point numbers and the ability to communicate with other processes through sockets.
Since interest in compiling Inform to the Z-Machine is likely to continue for some time, Perform was developed with the Z-Machine in mind. Inform code developed as part of the project compiled to both target platforms.
First, a classification task was devised that was known to be linearly separable and synthetic data was created for training. A large inflected lexicon of English was divided into those with fewer than two vowels and those with two or more vowels, where vowel was defined lexically as [aeiou], not phonetically. There were 6542 negative examples including aft, biffs, yttric, and zyzzyvas. The list of 103040 positive examples contained abaci, braillewriter, yelped, and zwieback.
A random subsample of 1000 positive and 1000 negative examples was used for training, with the remaining data set aside to use as a test set. A perl program created a vector of 780 binary features for each word:
The sets can be shown to be linearly separable simply by describing a weight vector that classifies feature vectors above according to this rule. Set the bias term to -2. Set all other weights except the ones associated with a,e,i,o, and u "occur anywhere" and "occur more than once" to 0. Set the remaining ten values to 1. Now, all words that have at least two vowels will sum to nonnegative values, while words with one vowel will sum to -1 and words with no vowels to -2. Since this weight vector and bias term divides the training data perfectly, the data, as expressed in these features, is linearly separable. Thus, a separator can be learned by a perceptron hooked to the 780 features above using the perceptron learning algorithm. The set of features above, created for this synthetic test, is an intentionally "sloppy" set, containing 770 unnecessary features that can only impair the system's performance.
An Octave implementation of the perceptron learning algorithm was used to learn a weight vector and bias term. After 43 iterations train.m converged. test.m was then used to see how the weight vector and bias term classified the test data. Of the 107582 points remaining in the test set, 169 of the 5542 negative points were misclassified, as were 1800 of the 102040 positive points, for an error rate of about 1.83%. The set of words that were misclassified includes those in table 1.
The misclassified words kindled a suspicion that the features associated with "c" or "l" occurring anywhere, or "c" occurring at the end of a word, may be one of the factors the blame. The weight vector did in fact have a rather large component associated with the occurrences of "c" and "l": 136 and 176, the two largest components associated with consonant occurrence. The averaged component associated with a vowel occurrence was 572, and two of those would (by themselves) exceed the bias of -892. The large features for "c" and "l" occurrences may simply result from these consonants being least frequent in words with fewer than two vowels. From examination of the weight vector, it seems that culling low-magnitude features and retraining could help.
The algorithms for training and testing were also implemented in Perl, as train.pl and test.pl. The redundant implementation was undertaken to allow interactive fiction authors, who generally don't have the same software available as do research computer scientists, to begin working with machine learning techniques more quickly, without purchasing Matlab or installing Octave. Confirming that the runs of both scripts were identical also provided a check of sorts on the correctness of the implementation, although, of course, the same error might of course have been duplicated in both implementations. There is certainly a price to using the more readily available but less appropriate Perl: While Octave completed startup, loading the matrices, and training in 1 minute 37 seconds on a Macintosh 450 MHz PowerPC G4, training in Perl was almost exactly five times slower, taking 7 minutes 53 seconds on the same G4.
A specific aspect of Infocom's 1986 interactive fiction Moonmist, by Stu Galley and Jim Lawrence, motivates the next task. In Moonmist, the interactor is required to type the name and title (Lord, Professor, Nurse, etc.) of the player character when the PC is at the gate of the mansion in which the main action of the game takes place. What seems to be a simple rule-based system — matching against a word list — determines whether the title is masculine (Baron, Sir, Mister, etc.) or feminine (Countess, Lady, Miss, etc.). If the title is classified as masculine, a female character flirts with the player character and the player character is referred to as male. The gender references and the gender of the flirter are reversed if the title is classified as female, altering the whole experience of the game in subtle ways. If there is no match, either because a non-gendered title has been entered or because a gendered title not on the list has been entered, the game proceeds without reference to the gender of the player character. Steven Meretzky's Leather Goddesses of Phobos (Infocom, 1986) was similarly gender-sensitive, with the PC's gender determined by whether the PC enters the men's or women's bathroom at the beginning of the game.
Taking inspiration from the way Moonmist determines gender and colors the experience of the game accordingly, a perceptron was trained to distinguish male and female first names, so that the system could guess the gender associated with the name. Although a large margin classifier would have been more suitable in terms of the information it provides, a simpler system, a perceptron, was trained to evaluate its performance. Data from the 1990 United States census (the top 1218 male names and 4275 female names) was used; the top 500 names were taken from each list to use for training. Clearly, this data is not linearly separable: some points are classified both negatively and positively. "Pat" is both the 387th female name and the 457th male name.
To find an optimal weight vector for this data, the perceptron was trained using the pocket algorithm with ratchet (Gallant 1990). The algorithm draws examples at random from the training data and checks them against the current weight vector. It also keeps one set of weights "in its pocket." Both current and pocket weights initially have all elements set to 0. The current weights are adjusted in response to misclassifications that are encountered as usual, by subtracting false negatives from them and adding false positives. Only if the current weights have a run of consecutive correct classifications that is longer than the run the pocket weights had, and if the current weights classify strictly more of the training data correctly than do the pocket weights, do the pocket weights get replaced by the current weights. After a large number of iterations leaves the pocket weights unchanged, the pocket weight vector is, with high probability, optimal.
A vector of 2054 features was used, representing the presence or absence of multiple single letter occurrences, initial bigrams, final bigrams, and medial bigrams. After 131875 iterations a weight vector was determined that misclassified only 57 of the 1000 points. For this task, simply obtaining good performance on training data is of substantial benefit, as authors are likely to appreciate the ability to compactly and quickly distinguish between the 1000 most popular male and female names without taking up too much of the at most 512KB story file, and the limited memory of the Z-Machine, with a huge name list.
|Male names misclassified as female|
|Joseph, Kenneth, Gary, Jose, Larry, Randy, Johnny, Allen, Ronnie, Dean, Shane, Jessie, Lance, Kelly, Casey, Johnnie, Rene, Dana, Donnie, Shannon, Kenny, Jean, Ira, Bennie, Robin, Benny, Kim, Guadalupe, Jody, Josh, Terence, Pat, Moses|
|Female names misclassified as male|
|Alice, Doris, Cheryl, Mildred, Dawn, Connie, Tracy, Leslie, Jamie, Lynn, Marion, Willie, Jackie, Courtney, Terry, Christy, Iris, Jan, Lee, Francis, Angel, Ginger, Kerry, Jaime|
Considering the next 150 male names (in order of commonness), 47 were misclassified (31% error). Of the next 150 female names, 24 were misclassified (16% error). Note that because of demographics and the differences in names associated with different nations, cultures, and languages, the 501th through the 650th most popular names are probably not drawn from the same distribution as are the top 500, so this probably overstates the generalization error on samples from the same distribution. Since the true distribution of names that will be encountered is not known, there is no principled way to test generalization error without collecting actual names that are typed in by interactors.
Disambiguating common names reliably is likely to be considered more important than is dealing with uncommon names. For instance, in an interactive fiction application a character whose reaction is determined by the classifier might be expected to make mistakes about rare names. Thus, despite the error rate of 24% for the next 150 most popular names in each class, the performance of the perceptron seemed quite acceptable for this application.
A set of 2054 features is certainly not large by ordinary standards, and it does make for a small enough set in the current situation, but even a number of this magnitude could strain a Inform/Z-Machine program if multiple classifiers were to be implemented for numerous different tasks. In general, the selection of fewer features will be necessary in many cases when implementing classifiers in Inform. Feature selection is also motivated by the possibility of improved performance, which could come with removing irrelevant features.
A method like PCA that requires projection or other mathematical manipulation of inputs was deemed undesirable for the rather mathematically challenged Inform/Z-Machine platform. Instead, sensitivity analysis was used to determine which of the four sets of features (multiple occurrences of same letter, initial bigram, medial bigram, final bigram) could be omitted with the least degradation (or greatest improvement) in performance. The technique, which has performed well on different sorts of data (Addison et al. 2003), has the disadvantage of taking a great amount of time, as it involves removing features or sets of features, training with each reduced feature set, testing with each, and then removing the one "worst" feature set and repeating the whole process on the new, reduced data. However, classification is rapid once the process is finished.
|All Features||No Multiple Letters||No Initial Bigram||No Final Bigram||No Medial Bigrams|
The results of sensitivity analysis are shown in table 3. Intuitively, people find that the ending of names contains information about the gender of those names, and they sometimes use this information to guess the gender of an unknown name. The final bigram feature set was revealed by the analysis to be the most important, as is consistent with this intuition. The feature that counts multiple occurrences of letters could probably have been dispensed with little effect, demonstrating that even a crude form of sensitivity analysis, done to reduce the size of the weight vector, can be useful.
Finally, the trained classifier was implemented in Inform. An example of a session in a classifier-equipped IF program is shown in figure 1. All 2054 features were included, but to reduce the storage required for the weight vector, elements were each stored in a single byte, with 256 added to negative elements and the elements converted back before summing.
The completed classifier, with code and weight vector, runs quickly and adds only 3KB to the story file. The list of 943 correctly classified words from the training set exceeds 6KB in size by itself, so a brute force approach to matching these names would consume significantly more space towards the 512KB limit. Z-Machine memory use increased only 3440 bytes with the addition of the perceptron. The most problematic resource exhausted by the perceptron code was the variable/array space; 2117 additional bytes were allocated; in the version 5 Z-Machine, 10000 is the limit. A "brute force" name matching would require the storage of 6734 characters, however, so the perceptron is certainly more parsimonious with this resource. The brute force matching would also likely be much slower. Pattern-matching is not built into Inform, so character-by-character comparisons through a long array would be needed — unless all names were added as vocabulary words, which would uses up another a limited region of memory. Of course, the list-matching approach also would not generalize at all.
Figure 1. An example session in mfclass.z5, The Z-Machine classifier.
Perform (Perceptron Classifier in Inform) v1.0 Nick Montfort http://nickm.com 2004-06-25 Release 0 / Serial number 051126 / Inform v6.30 Library 6/11 S Testing the Classifier Type "WHO IS [NAME]?" to see an Inform classifier in action. The classifier is trained on first names recorded in the U.S. census. >who is will? Will? I suppose he's just some random person. >who is jill? Jill? I suppose she's just some random person. >who is bill? Bill? I suppose he's just some random person. >who is franklin? Franklin? I suppose he's just some random person. >who is winifred Winifred? I suppose she's just some random person. >who is jack Jack? I suppose he's just some random person. >who is jacqueline Jacqueline? I suppose she's just some random person. >who is fernando Fernando? I suppose he's just some random person. >who is lawrence Lawrence? I suppose he's just some random person. >who is fei? Fei? I suppose he's just some random person. >who is boulos? Boulos? I suppose he's just some random person. >who is kilian? Kilian? I suppose she's just some random person.
The perceptron, as implemented, makes for an effective classifier of words. An approach that provided a principled measure of uncertainty would improve on the perceptron, since it would allow the system to ask for disambiguation in borderline cases. While lacking this useful ability, the perceptron sits easily in an interactive fiction program, consuming little memory, returning a label quickly, and providing accuracy that is adequate for many literary and gaming purposes. It seems likely that similar classifiers, which can derive features from the current state of the game, will not only be useful for classifying words, but will also find other application in interactive fiction.