[x]Blackmoor Vituperative

Saturday, 2009-02-28

Doublet solver for Python

Filed under: Programming — bblackmoor @ 17:50

I was recently shown an interesting programming puzzle. After reading it, I recognized it as a variation on the classic “doublet” game (I read a lot of Lewis Carroll as a kid, as well as being a huge fan of Alice in her various guises).

The Rules

  1. Call two words “adjacent” if you can change one word into the other by adding, deleting, or changing a single letter.
  2. Using the computer language of your choice, write a program which takes two words as inputs and creates an ordered list of unique words where each word in the sequence is adjacent to the previous and next words in the sequence.
  3. If you know Perl, a Perl solution is preferred.

The puzzle suggests using Perl to solve the puzzle, but I do not care for Perl (despite having worked with it longer than any other language I know — or perhaps because of that). I have been learning more about Python (and liking it more the more I use it, unlike Perl), so I decided to give a Python based solution a go.

I started with a basic plan for the script:

  • Input the two starting words
  • Load the dictionary
  • Filter the dictionary to words of the appropriate length
  • Find every variation of every word in the filtered dictionary.
  • Find variations of the starting word
  • If none of them are the ending word, find variations of those variations
  • Repeat until the ending word is found

(This was not the best way to start. More on that later.)

A couple of things immediately occurred to me. First, that removing letters would be easier than adding them, if the two end words are of different length. Second, I decided that I wanted to be able to pass arguments and options to the script on the command line: the two words, obviously, but I also wanted to be able to specify a dictionary file of valid words, as well as being able to pass an option that would show me every word tried while the script was running.

I am relatively new to Python, so I did not know how to handle command-line parameters gracefully. My first thought was to test each argument for a leading dash “-” character, and then split them up into arrays. Fortunately, I stumbled across a site with a sample script demonstrating the Python “getopt” module, which does just what I needed. While reading up on that, I then discovered another Python goodie called “doc strings“. I was really rolling now!

My next task was to load the dictionary, and filter out words of an inappropriate length. My initial idea was to build a cross-reference table where the first column would be all of the words of an appropriate length, and the other columns would be every possible single-step variation on those words. For example, for “dog” there would be “cog”, “log”, “dogs”, and so on.

But then I hit a hurdle. When reading in the dictionary file, I only wanted words which are no longer than the longer of the two starting words, and no shorter than the shorter of the two starting words. So naturally, after reading each line, I would test for that, but it seemed that Python was not automatically stripping off the newline characters. But then, why would it, since it does not know which style of newline I happen to be using (I am working on this project on a Windows laptop, but I usually use Unix line endings because I do most of my real work on Linux machines). How does Python treat newline characters when doing readline? And how do I strip them off?

There are two steps to that answer. First, open the file with the U option, for “universal newline support” (e.g., dictionary_file = open(dictionary,"rU")). Second, use strip() to remove whitespace (e.g., line = line.strip()).

Now for my next problem: as far as I know, Python’s arrays (aka, “lists”) are only able to use integers as indexes. For my initial dictionary idea to work, I needed some way of using strings as indexes (aka, “associative arrays”), the way PHP can. As it turns out, the Python feature that offers this functionality is actually called a dictionary. Go figure. Python’s dictionaries aren’t treated like normal (normal for PHP, that is) associative arrays: it does not have the same syntax or methods, and personally, this is an area where I think PHP is superior to Python. Of course, after learning all of this, I went back and thought of a simpler way of building the dictionary that did not use “dictionaries” (aka associative arrays), and simply used lists (aka arrays).

(Why do people insist on renaming everything? Why not call something the name by which it is nigh-universally known? “Arrays” and “associative arrays”. Is that so difficult?)

Before too long, I ran into another problem: how backslashes are treated in strings. I wanted to insert “\b” (for a word boundary) into a string, but Python interprets this to mean something special. So I needed to add an “r” before the opening quotation mark to indicate that this is a “raw string”. This is the first time I have encountered something in Python which reminds me of Perl (which is littered with bizarre, nonintuitive things like this).

These hurdles overcome, I tested my script using a small set of sample data (a dictionary of about twenty words), and it worked great. The problem came with larger dictionaries. When using a dictionary with thousands of words, creating the initial cross-reference index took forever. I also soon learned that my first idea, which was to try every possible variation of a word before giving up and going to the next word resulted in ridiculously long word paths. It soon dawned on me that it would probably be faster to search one letter-layer at a time (what people in computer science classes call a “breadth-first search“) than going down the full path of a particular word path (aka, a “depth-first search“). Amusingly enough, the code to do so was actually simpler than my initial “build a massive cross-reference table” approach.

(By the way, while refreshing my memory on breadth-first searches, I also ran across an algorithm for finding shortest paths called Dijkstra’s algorithm. That looks really interesting, and I want to play with that some day, but I figured I would have enough on my plate just implementing the search in Python, so Dijkstra’s algorithm will have to wait for another day.)

The final hitch I ran into was getting the recursive function to return its value. I must have banged my head against that for three hours before realizing I had made a stupid mistake. I was only returning the value when the function was successful, not when it was calling itself recursively. Since it’s recursive, it needed a return statement for every step of the recursion. A silly mistake on my part, but hey, it happens.

So here is a link to the script, for anyone who might find it educational (rename it to doublet.py in order to run it), and the official Scrabble word list (you will need to unzip it, of course).

For instructions on how to use the script, run doublet.pl -h.

Here is some typical output:

doublet.py -d scrabble_dictionary.txt beginning end
Fail: could not find a path between words
2 potential paths were examined, taking 1.5630 seconds

doublet.py -d scrabble_dictionary.txt begin end
707 potential paths were examined, taking 5.8280 seconds
begin >> began >> bean >> bead >> bend >> end

And here are my answers to the “bonus questions” asked by the original puzzle:

  • What is the shortest path between “crawl” and “run”?
    In theory, the shortest path would be five (including the two words at the ends). As it happens, that is also the shortest path when using the dictionary I used.
  • What is the shortest path between “mouse” and “elephant”?
    In theory, the shortest path would be eight (including the two words at the ends). However, I could not find a path using the dictionary I used, possibly due to one of the assumptions I used (see below).
  • Does your program necessarily return the shortest path?
    Yes.
  • What assumptions did you make in your program?
    Other than the parameters provided by the puzzle, I set a limitation that the words which would be examined when trying to make a path would be no shorter than one character less than the shorter of the two words, and no longer than one character more than the longer of the two words. I realize that this will, in some cases, result in not finding a path where one may actually exist.
  • How did you test your program?
    Using the words given in the puzzle, as well as a few words that I did not think had a path between them. I also tested the various failure conditions.
  • What is the Big-O complexity of your program?
    I think it’s O(n^2), but it’s been a number of years since I studied big-O notation, so I could be wrong.
  • Suppose each letter had a point value (for example, as in Scrabble). Discuss (but don’t code) how your algorithm would change if you wanted to find the the path with the lowest possible point total.
    I would have had to continue the script after it finds a path, rather than immediately ending. Then I would have had to contemplate its point value. Finally, when all of the paths are found, or some arbitrary limit is reached, the script would return the found path with the lowest point value.
  • Sometimes the shortest path isn’t unique. Discuss (but don’t code) how your algorithm would change if you needed to return all of the shortest paths between two words.
    I would have had to continue the script after it finds a path, rather than immediately ending. Finally, when all of the paths of that length are found, the script would return the found paths.
  • Discuss (don’t code) how you might change your program if your concern was minimizing memory usage.
    I would probably replace the search path array with files for each path.
  • Discuss (don’t code) how you might change your program if your concern was minimizing CPU time.
    I could be wrong, but I think it already minimizes CPU time. Slicing up the dictionary file by word length would probably help, though.
  • Discuss (don’t code) how you might change your program if your concern was minimizing programmer time.
    If conserving my time was important, I would not have added the various optional parameters or tested for so many error conditions.
  • Discuss (but don’t code) how your algorithm would change if you wanted to find the longest path between two words.
    I would have had to continue the script after it finds a path, rather than immediately ending. Then, when all of the paths are found, or some arbitrary limit is reached, the script would return the last path found.

Final thoughts:

Just thinking about this took about three days. I wasn’t even sure how to approach the problem at first. Once I had a plan, though, things progressed pretty quickly. I would say that writing the actual code took about four days (an hour here, an hour there). Much of this time was simply spent learning how Python does things, but a good chunk was spent redesigning the script after I realized the problems with my first design.

The script works, and I do not think I will refine it any further. However, I am not entirely satisfied with it, and I can see several things I would like to do differently next time. I did not even attempt to make this object oriented, and that is something that I want to tackle in Python soon. Also, I am not entirely pleased with how I handled the global variables in this script: there are far too many of them for my taste. I am also slightly disillusioned about Python itself. While I think it is a significant improvement over Perl, it still has some peculiar deficiencies. However, I do consider it a significant improvement over Perl, and considering how much time I spent writing and modifying scripts, I think learning Python will be time well spent.

Friday, 2009-02-27

It’s time to shut down RIAA

Filed under: Intellectual Property,Music — bblackmoor @ 18:44

Ray Beckerman is calling for new legislation to protect investors in tech companies from frivolous RIAA lawsuits. Why should investors get special treatment? The problem is not that investors are being sued. The problem is that anyone is being sued.

What we need is not additional legislation to protect investors. What we need is the arrest and conviction of RIAA’s board of directors under federal Rico statues.

Thursday, 2009-02-26

Treehouse games

Filed under: Gaming — bblackmoor @ 12:15

I just stumbled across this earlier today: Treehouse Games. They look really fun!

Friday, 2009-02-20

How to wreck an economy

Filed under: Society — bblackmoor @ 14:25

“If this crisis proves nothing else, it proves you cannot help people by lending them more money than they can pay back. […] it should give us pause in responding to the financial crisis of today to realize that this crisis itself was in part an unintended consequence of the monetary policy we employed to deal with the previous recession.”

read more…

Tuesday, 2009-02-17

The Ultimate War Simulation Game

Filed under: Entertainment,Society — bblackmoor @ 20:58

This is the funniest thing I have read this week.

Wednesday, 2009-02-11

Good for the gander

Filed under: Art,Intellectual Property — bblackmoor @ 19:01

Hope springs eternal

Last April, Shepard Fairey mobilized his legal team to send Baxter Orr a cease and desist order threatening legal action against him. The Austin, Texas, artist made a parody of Fairey’s Andre the Giant design, adorning it with a SARS mask and the title “Protect Yourself.”

Now the Associated Press is suing Fairey for copyright infringement for using a wire service photo as a model for his artwork

What comes around goes around. Personally, I think both suits are absurd, and simply illuminate the absurdity of our copyright laws and the legal system that keeps them in place.

A real stimulus plan

Filed under: Society — bblackmoor @ 14:47

“One-off bailouts and rushed rescue packages are not the answer to our financial woes. See why the FairTax — which solves the root problem of current economic crisis not just the effect on Wall Street — is the answer that America needs now more than ever.”

read more…

Tuesday, 2009-02-10

Economists against Obama

Filed under: Society — bblackmoor @ 10:46

President Obama went on national television this evening to urge Congress to set aside “petty differences” and vote his “recovery package” into law. Bold words: it makes President Obama sound like the brave visionary, and those who oppose him as “same old, same old” politicians. But what if his bold plan is tantamount to steering the Titanic straight at the iceberg? What if pouring trillions of dollars down a hole is not the way to restore fiscal responsibility in this country?

What if the economists against Obama are right?

Call me crazy, but I think that when one is in difficult financial straights, spending money like it’s going out of style is the exact wrong thing to do.

Saturday, 2009-02-07

Causes of death in the USA

Filed under: Science,Society — bblackmoor @ 17:47

According to the CDC, roughly 440,000 deaths each year are associated with smoking.

Also according to the CDC, roughly 400,000 deaths each year are associated with obesity.

Many more people die each year in the USA from motor vehicle accidents (roughly 40,000) than in airplane crashes (fewer than 1,000). But people spend a lot more time in cars than in airplanes. The per-hour death rate of driving versus flying is about equal.

And according to the NIH, roughly 12,000 deaths (excluding suicides) each year are associated with firearms.

The leading causes of death in 2000 were tobacco (435 000 deaths; 18.1% of total US deaths), poor diet and physical inactivity (400 000 deaths; 16.6%), and alcohol consumption (85 000 deaths; 3.5%). Other actual causes of death were microbial agents (75 000), toxic agents (55 000), motor vehicle crashes (43 000), incidents involving firearms (29 000), sexual behaviors (20 000), and illicit use of drugs (17 000).

(from CDC: Obesity approaching tobacco as top preventable cause of death, DoctorsLounge)

So “sexual behavior” is just behind “firearms” in terms of the raw number of people killed — and is far ahead of firearms when the roughly 17,000 suicides who used firearms are excluded (as they should be, for obvious reasons).

Interesting.

Surreal Estate

Filed under: Entertainment,Travel — bblackmoor @ 14:15

PointClickHome has a slideshow of unusual architectural projects that is pretty fun to look at.

Next Page »