Wednesday, December 12, 2012

Another Tool: Emacs

The tool chain I use for these experiments in predicting NCAA basketball games consists primarily of three tools:  Web Harvest, Emacs, SBCL (an implementation of Common Lisp), and RapidMiner.   I have written previously on this blog about RapidMiner, and today I'm going to touch on Emacs.

Most people use editors in a fairly static way.  They figure out how to do the editing tasks they need as some combination of commands and keystrokes, and that becomes part of their editing knowledge.  For example, suppose you're the sort of typist who regularly transposes letters, writing "the" as "teh" and "and" as "adn".  You'll fairly quickly figure out how to fix that problem -- backspace, backspace and retype, or maybe mark with a mouse, delete and retype -- and that becomes part of your editing knowledge.

A well-designed editor ensures that most of what you need to do is efficient.  The most common tasks have easy keystrokes and so on.  Of course, the editor is designed for some general, idealized user, who may be very like you in some ways and very different from you in other ways.  So it's likely that much of what you do in your favorite editor is efficient, but some of it is very inefficient and repetitive.

Emacs takes a different philosophy -- probably because it grew out of a community of programmers.

I recently read a blog posting saying that the essence of programming is to avoid repetition.  Programmers go to great lengths creating subroutines, libraries and sometimes whole new programming languages just to avoid mindless repetition of some task.  The ideal programmer spends all his time creating unique solutions to unique problems -- everything else is automated.  Emacs captures this same philosophy in an editor.  Emacs is designed so that it can be customized/programmed by the user in powerful and flexible ways, so that the user spends all his time being "maximally" productive.  (Expert Emacs users do so much customization of the editor that a big topic is how to best manage the customizations!)

To return to the above example, if you're a typist who often transposes letters and Emacs is your editor, your response is to customize/program Emacs to take care of your transposing problem more efficiently.  For example, you'd likely define a key to reverse the transposition of the previous two characters, so that you could just hit that key and fix the problem whenever it occurred.

(As it happens, Emacs already has this capability -- it's ctl-t -- but you see the point.)

This philosophy changes the way you use your editor -- it becomes a kind of Swiss Army knife tool for solving all text-related tasks (and often other types of tasks as well).  For example, in the basketball predictor I scrape scores and other statistics from the Web and use them in both Common Lisp and in RapidMiner.  For use in Common Lisp, it is convenient to have the data in a list format like this:
("Miss. Valley St" 40 18 70 4 29 9 14 16 42 70 "2012-12-10")
For use in RapidMiner (and Excel) it is more convenient to have the data in a comma-separated values format like this:
"Miss. Valley St", 40, 18, 70, 4, 29, 9, 14, 16, 42, 70, "2012-12-10"
It's not difficult to translate from one to the other, but it is boring and repetitious.  The natural response in Emacs is to automate the task.

Emacs has a variety of ways to do this.  (If you know Emacs, this won't surprise you!)  One of the simplest is a "keyboard macro".  You tell Emacs you want to define a keyboard macro, and then you start editing.  When you tell it your done, it captures all the editing you did in-between and allows you to repeat that with a single keystroke.  In this case, I would start a keyboard macro, go through all the editing necessary to convert one line of my data file from one format to the other, and then end the macro.  Then I could go to the next line and tell Emacs to "execute the keyboard macro" and -- voila! -- that line would get the same editing. It takes some practice and thought to create an editing sequence that will do the right thing when it is repeated on the next line, but this turns out to be a very powerful and handy feature.

One of the drawbacks of the keyboard macro is that it disappears when you end your editing session.  So it's not useful per se for a task that you're going to want to repeat another day on a different file.  Fortunately, Emacs provides a way to save a keyboard macro in a format that looks like this:

(fset 'fix-sched
   [escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?^ return ?\( ?\" return escape ?< escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?  ?* ?, return ?\" ?  ?\" return escape ?< escape ?r ?\" ?+ return return escape ?< escape ?r ?\" ?- return return escape ?< escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?$ return ?\) return])
That's exactly what it looks like -- a literal transcription of the keystrokes of the macro.  In this form, it can be saved your Emacs configuration file so that the next time you start up Emacs it will be available for reuse.

At a more complex level, you can program Emacs using a form of Lisp.  You can use this to create arbitrary functionality.  For example, if Emacs didn't provide a way to save a keyboard macro, you could program that yourself.  This allows you to build functionality that isn't easy to capture in a keyboard macro.  For example, here's an Emacs function I wrote for fixing a certain type of score file:
(defun fix-scores ()
 "Fix the scores from Marsee"
 (interactive "*")
 (let ((dt (format-time-string current-date-format-marsee)))
   (replace-regexp
    "^\\([A-Za-z \\&]+[A-Za-z]\\)\\s +\\([0-9]+\\)\\s +\\([A-Za-z \\&]+[A-Za-z]\\)\\s +\\([0-9]+\\).*$"
    (concat "(\"" dt "\" \"\\1\" \\2 \"\\3\" \\4)")
    )
   )
 )
Without going into the gory details, you can see that part of this function reformats the date from the score file into a more desirable format, using an Emacs function called "format-time-string".   Emacs Lisp is infinitely powerful, so if you're a good programmer you can extend the Emacs functionality in unlimited ways.

I have more to say on Emacs, but this posting has gotten fairly long so I'll leave further thoughts to another day.

2 comments:

  1. I particularly like using Google Refine for this type of string manipulation

    ReplyDelete
    Replies
    1. I've never tried using it, but it looks good. If I needed to do something like this as part of an automated workflow, something like Refine would be a good choice.

      Delete