[x]Blackmoor Vituperative

Saturday, 2009-02-28

Doublet solver for Python

Filed under: Programming — bblackmoor @ 17:50

I was recently shown an interesting programming puzzle. After reading it, I recognized it as a variation on the classic “doublet” game (I read a lot of Lewis Carroll as a kid, as well as being a huge fan of Alice in her various guises).

The Rules

  1. Call two words “adjacent” if you can change one word into the other by adding, deleting, or changing a single letter.
  2. Using the computer language of your choice, write a program which takes two words as inputs and creates an ordered list of unique words where each word in the sequence is adjacent to the previous and next words in the sequence.
  3. If you know Perl, a Perl solution is preferred.

The puzzle suggests using Perl to solve the puzzle, but I do not care for Perl (despite having worked with it longer than any other language I know — or perhaps because of that). I have been learning more about Python (and liking it more the more I use it, unlike Perl), so I decided to give a Python based solution a go.

I started with a basic plan for the script:

  • Input the two starting words
  • Load the dictionary
  • Filter the dictionary to words of the appropriate length
  • Find every variation of every word in the filtered dictionary.
  • Find variations of the starting word
  • If none of them are the ending word, find variations of those variations
  • Repeat until the ending word is found

(This was not the best way to start. More on that later.)

A couple of things immediately occurred to me. First, that removing letters would be easier than adding them, if the two end words are of different length. Second, I decided that I wanted to be able to pass arguments and options to the script on the command line: the two words, obviously, but I also wanted to be able to specify a dictionary file of valid words, as well as being able to pass an option that would show me every word tried while the script was running.

I am relatively new to Python, so I did not know how to handle command-line parameters gracefully. My first thought was to test each argument for a leading dash “-” character, and then split them up into arrays. Fortunately, I stumbled across a site with a sample script demonstrating the Python “getopt” module, which does just what I needed. While reading up on that, I then discovered another Python goodie called “doc strings“. I was really rolling now!

My next task was to load the dictionary, and filter out words of an inappropriate length. My initial idea was to build a cross-reference table where the first column would be all of the words of an appropriate length, and the other columns would be every possible single-step variation on those words. For example, for “dog” there would be “cog”, “log”, “dogs”, and so on.

But then I hit a hurdle. When reading in the dictionary file, I only wanted words which are no longer than the longer of the two starting words, and no shorter than the shorter of the two starting words. So naturally, after reading each line, I would test for that, but it seemed that Python was not automatically stripping off the newline characters. But then, why would it, since it does not know which style of newline I happen to be using (I am working on this project on a Windows laptop, but I usually use Unix line endings because I do most of my real work on Linux machines). How does Python treat newline characters when doing readline? And how do I strip them off?

There are two steps to that answer. First, open the file with the U option, for “universal newline support” (e.g., dictionary_file = open(dictionary,"rU")). Second, use strip() to remove whitespace (e.g., line = line.strip()).

Now for my next problem: as far as I know, Python’s arrays (aka, “lists”) are only able to use integers as indexes. For my initial dictionary idea to work, I needed some way of using strings as indexes (aka, “associative arrays”), the way PHP can. As it turns out, the Python feature that offers this functionality is actually called a dictionary. Go figure. Python’s dictionaries aren’t treated like normal (normal for PHP, that is) associative arrays: it does not have the same syntax or methods, and personally, this is an area where I think PHP is superior to Python. Of course, after learning all of this, I went back and thought of a simpler way of building the dictionary that did not use “dictionaries” (aka associative arrays), and simply used lists (aka arrays).

(Why do people insist on renaming everything? Why not call something the name by which it is nigh-universally known? “Arrays” and “associative arrays”. Is that so difficult?)

Before too long, I ran into another problem: how backslashes are treated in strings. I wanted to insert “\b” (for a word boundary) into a string, but Python interprets this to mean something special. So I needed to add an “r” before the opening quotation mark to indicate that this is a “raw string”. This is the first time I have encountered something in Python which reminds me of Perl (which is littered with bizarre, nonintuitive things like this).

These hurdles overcome, I tested my script using a small set of sample data (a dictionary of about twenty words), and it worked great. The problem came with larger dictionaries. When using a dictionary with thousands of words, creating the initial cross-reference index took forever. I also soon learned that my first idea, which was to try every possible variation of a word before giving up and going to the next word resulted in ridiculously long word paths. It soon dawned on me that it would probably be faster to search one letter-layer at a time (what people in computer science classes call a “breadth-first search“) than going down the full path of a particular word path (aka, a “depth-first search“). Amusingly enough, the code to do so was actually simpler than my initial “build a massive cross-reference table” approach.

(By the way, while refreshing my memory on breadth-first searches, I also ran across an algorithm for finding shortest paths called Dijkstra’s algorithm. That looks really interesting, and I want to play with that some day, but I figured I would have enough on my plate just implementing the search in Python, so Dijkstra’s algorithm will have to wait for another day.)

The final hitch I ran into was getting the recursive function to return its value. I must have banged my head against that for three hours before realizing I had made a stupid mistake. I was only returning the value when the function was successful, not when it was calling itself recursively. Since it’s recursive, it needed a return statement for every step of the recursion. A silly mistake on my part, but hey, it happens.

So here is a link to the script, for anyone who might find it educational (rename it to doublet.py in order to run it), and the official Scrabble word list (you will need to unzip it, of course).

For instructions on how to use the script, run doublet.pl -h.

Here is some typical output:

doublet.py -d scrabble_dictionary.txt beginning end
Fail: could not find a path between words
2 potential paths were examined, taking 1.5630 seconds

doublet.py -d scrabble_dictionary.txt begin end
707 potential paths were examined, taking 5.8280 seconds
begin >> began >> bean >> bead >> bend >> end

And here are my answers to the “bonus questions” asked by the original puzzle:

  • What is the shortest path between “crawl” and “run”?
    In theory, the shortest path would be five (including the two words at the ends). As it happens, that is also the shortest path when using the dictionary I used.
  • What is the shortest path between “mouse” and “elephant”?
    In theory, the shortest path would be eight (including the two words at the ends). However, I could not find a path using the dictionary I used, possibly due to one of the assumptions I used (see below).
  • Does your program necessarily return the shortest path?
    Yes.
  • What assumptions did you make in your program?
    Other than the parameters provided by the puzzle, I set a limitation that the words which would be examined when trying to make a path would be no shorter than one character less than the shorter of the two words, and no longer than one character more than the longer of the two words. I realize that this will, in some cases, result in not finding a path where one may actually exist.
  • How did you test your program?
    Using the words given in the puzzle, as well as a few words that I did not think had a path between them. I also tested the various failure conditions.
  • What is the Big-O complexity of your program?
    I think it’s O(n^2), but it’s been a number of years since I studied big-O notation, so I could be wrong.
  • Suppose each letter had a point value (for example, as in Scrabble). Discuss (but don’t code) how your algorithm would change if you wanted to find the the path with the lowest possible point total.
    I would have had to continue the script after it finds a path, rather than immediately ending. Then I would have had to contemplate its point value. Finally, when all of the paths are found, or some arbitrary limit is reached, the script would return the found path with the lowest point value.
  • Sometimes the shortest path isn’t unique. Discuss (but don’t code) how your algorithm would change if you needed to return all of the shortest paths between two words.
    I would have had to continue the script after it finds a path, rather than immediately ending. Finally, when all of the paths of that length are found, the script would return the found paths.
  • Discuss (don’t code) how you might change your program if your concern was minimizing memory usage.
    I would probably replace the search path array with files for each path.
  • Discuss (don’t code) how you might change your program if your concern was minimizing CPU time.
    I could be wrong, but I think it already minimizes CPU time. Slicing up the dictionary file by word length would probably help, though.
  • Discuss (don’t code) how you might change your program if your concern was minimizing programmer time.
    If conserving my time was important, I would not have added the various optional parameters or tested for so many error conditions.
  • Discuss (but don’t code) how your algorithm would change if you wanted to find the longest path between two words.
    I would have had to continue the script after it finds a path, rather than immediately ending. Then, when all of the paths are found, or some arbitrary limit is reached, the script would return the last path found.

Final thoughts:

Just thinking about this took about three days. I wasn’t even sure how to approach the problem at first. Once I had a plan, though, things progressed pretty quickly. I would say that writing the actual code took about four days (an hour here, an hour there). Much of this time was simply spent learning how Python does things, but a good chunk was spent redesigning the script after I realized the problems with my first design.

The script works, and I do not think I will refine it any further. However, I am not entirely satisfied with it, and I can see several things I would like to do differently next time. I did not even attempt to make this object oriented, and that is something that I want to tackle in Python soon. Also, I am not entirely pleased with how I handled the global variables in this script: there are far too many of them for my taste. I am also slightly disillusioned about Python itself. While I think it is a significant improvement over Perl, it still has some peculiar deficiencies. However, I do consider it a significant improvement over Perl, and considering how much time I spent writing and modifying scripts, I think learning Python will be time well spent.