Jan. 3, 2015

Our Daily Bard

I finally shipped a project and put it on GitHub.

My only New Year’s Resolution this year is to ship more projects, now that I have several sitting on my laptop drive in various states of undress.

This first project is called “Our Daily Bard”. It’s a very simple set of pages which let you read Shakespeare works via RSS, a few lines each day. I wrote it back in May but then didn’t go the final mile and actually set it up on a server. You can now find it here.

All the data is taken from the Open Source Shakespeare project.

The code is quite straightforward: the only complexity is how to break up the plays into chunks in such a way that the work isn’t rendered incomprehensible. Simply doing it by scene ends up with very uneven breaks, quite unsuitable for reading day-by-day. Instead of trying to mark up every play manually, I went for an algorithmic approach.

The process I finally settled on is a simple scoring system. The play is split into individual lines, and each line is scored as a candidate break point using some simple rules:

  • downscore a line if it would leave a very small chunk at the end of the play;
  • downscore a line if it would break up a character’s speech (unless the speech is very long); breaking very near the start or end of a speech is penalized more heavily than breaking in the middle;
  • upscore a line if it is the start of a scene.

We then scan the play for a break point within a certain window, currently 50 to 150 lines from the previous break. We check the scores in that window, pick the line with the best score, break there, and repeat.
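As a rough sketch, the greedy pass might look something like this. The rule weights, the speaker-tag heuristic, and the function names are my own illustrations here, not the actual project code:

```python
# Hypothetical sketch of the scoring-and-chunking pass described above.
# Weights and heuristics are illustrative, not the real project's values.

MIN_CHUNK, MAX_CHUNK = 50, 150

def score_line(index, lines, total):
    """Score breaking the play just before lines[index]."""
    score = 0
    # Downscore a break that would strand a very small final chunk.
    if total - index < MIN_CHUNK:
        score -= 10
    # Upscore a break at the start of a scene.
    if lines[index].startswith("SCENE"):
        score += 5
    # Downscore a break right after a speech begins (speaker tags are
    # assumed, for illustration, to be all-caps lines).
    if index > 0 and lines[index - 1].isupper():
        score -= 8
    return score

def chunk_play(lines):
    """Greedily split a play into chunks of MIN_CHUNK..MAX_CHUNK lines."""
    chunks, start, total = [], 0, len(lines)
    while total - start > MAX_CHUNK:
        window = range(start + MIN_CHUNK, start + MAX_CHUNK + 1)
        best = max(window, key=lambda i: score_line(i, lines, total))
        chunks.append(lines[start:best])
        start = best
    chunks.append(lines[start:])  # the remainder becomes the last chunk
    return chunks
```

The greedy choice means each break is locally best within its window, which is exactly why a second relaxation pass (mentioned below) could still improve things globally.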

The solution isn’t yet quite as good as manual markup by a human: it lacks the “overall view” that a human forms very quickly when scanning blocks of text. To improve this I’d like to add a second pass that relaxes the boundaries between chunks once the initial scoring has been done. Perhaps that will come later. Ship early!

The only other work was making the scripts as easy as possible to deploy. My target environment was a bare-bones Unix system with Python and some kind of CGI support (no Node or Django, thanks!). The design is such that running the scripts in that environment downloads the corpus data, generates the chunked play data, then builds templated CGI files, with the correct paths stamped in, that invoke the core Python code. Combined with Fabric for the remote deployment, it ended up being quite painless.
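The path-stamping step can be sketched with Python’s standard `string.Template`. The wrapper text, the `bard` module, and its `serve` function below are hypothetical stand-ins for the real scripts:

```python
# Hypothetical sketch of stamping deployment paths into a CGI wrapper.
# The template text and the `bard.serve` call are illustrative only.
from string import Template

CGI_TEMPLATE = Template("""\
#!/usr/bin/env python
import sys
sys.path.insert(0, "$lib_dir")   # make the core code importable
import bard
bard.serve("$data_dir")          # render today's chunk of the play
""")

def build_cgi(lib_dir, data_dir):
    """Return the text of a CGI script with concrete paths stamped in."""
    return CGI_TEMPLATE.substitute(lib_dir=lib_dir, data_dir=data_dir)
```

Generating the wrappers at deploy time, rather than hard-coding paths, is what lets the same scripts target any bare-bones host.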

Anyway, all the code is there on GitHub waiting for you to fork it. I hope someone can find a use for it.