Two Guys Arguing

of stats and scraping

Posted in lisp by youngnh on 05.13.09

This all started when I wanted to share some stats I had been calculating privately with the rest of my fantasy baseball league.

Last season I wrote a handful of emacs lisp functions to scrape Yahoo!’s fantasy baseball site with the intention of having a status update on my team every time I opened an emacs session to do my day’s work (The faithful among us will note that I am far out of my depth and Doing It Wrong if I actually close my emacs sessions at all, much less do so every morning. Duly noted). So I had the raw numbers within my reach.

I decided that a CouchDB database was the best place for this data to live. It’s new, so there are cool points for making something useful with it, but more than that, data retrieved from a CouchDB instance comes back as JSON. That decided it for me. JavaScript is a language near and dear to my heart. Just about every developer in my league has been in web development, so they’ve all earned, the hard way, some twisted semblance of competence in the language, and if that experience has scarred them irreparably, then just about every other major language out there has a JSON library. Best of all, the data has a public HTTP URL for easy access (The CouchDB quarter will say here that CouchDB is RESTful, which is kind of a sly techno-descriptive pun as CouchDB’s slogan is “Time to Relax”).
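To make the "public URL plus JSON" point concrete, here is a rough sketch in Python (standing in for whatever language a league-mate might reach for). The database name `fantasy`, the document id, and the stat fields are all invented for illustration; only the `_id` and `_rev` fields and the default port 5984 come from CouchDB itself.

```python
import json

COUCH_PORT = 5984  # CouchDB's default HTTP port


def doc_url(host, database, doc_id):
    """Build the URL of a single document: http://host:5984/database/doc_id."""
    return f"http://{host}:{COUCH_PORT}/{database}/{doc_id}"


# Roughly what a GET on such a URL might return. _id and _rev are CouchDB's
# own bookkeeping fields; every other field here is hypothetical.
sample_response = """
{"_id": "2009-05-13", "_rev": "1-abc123",
 "team": "Example Team", "hits": 9, "at_bats": 31}
"""


def batting_average(doc):
    """Derive a stat from the raw numbers stored in the document."""
    return round(doc["hits"] / doc["at_bats"], 3)


doc = json.loads(sample_response)
print(doc_url("localhost", "fantasy", doc["_id"]))
print(doc["team"], batting_average(doc))
```

Nothing here talks to a real server. The point is only that once the document arrives as JSON, any language with a JSON library can take it from there.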

With data in CouchDB, anybody could get access to my data (Yahoo!’s data), manipulate it, and make something cool for themselves. The driving forces behind CouchDB, though, hold even more potential. Users could write JavaScript functions and upload them back to the database, and then everyone would share the cool thing instead of each keeping a cool thing of their own. CouchDB calls itself a database, but it can host code, keep revisions of data, and serve documents. “Excuse me, but you got your wiki in my GitHub”, if you will. If your users have some ability to write code, the range of applications that can grow out of a single CouchDB instance is enormous. However, nothing would sprout in my database if I couldn’t put data into it.

This was a bit of a problem as all of my fantasy baseball code was written in emacs lisp. Emacs can invoke programs from your system and redirect their output to one of its buffers. My code leaned heavily on this feature to invoke wget, using it to handle sessions and cookies while I slurped up pages from our league’s site, and then used emacs’ extensive buffer search facilities to ferret out and collect the data I was interested in.
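For readers who have never scraped this way, the shape of the approach is roughly the following, sketched in Python rather than emacs lisp. This is not my actual code: the HTML markup, the class names, and the cookie file are all hypothetical, and a regular expression stands in for emacs’ buffer searches. The wget flags used (`--quiet`, `--load-cookies`, `--output-document=-`) are real ones.

```python
import re
import subprocess


def fetch_page(url, cookie_file):
    """Shell out to wget, reusing saved session cookies, and return the page text."""
    result = subprocess.run(
        ["wget", "--quiet", "--load-cookies", cookie_file,
         "--output-document=-", url],
        capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical markup: one table cell naming the stat, the next holding its value.
ROW = re.compile(
    r'<td class="stat-name">([^<]+)</td>\s*<td class="stat-value">([^<]+)</td>')


def extract_stats(html):
    """Search the fetched text for name/value pairs, like searching a buffer."""
    return {name.strip(): value.strip() for name, value in ROW.findall(html)}


# A stand-in for a fetched page, so the sketch runs without a network.
sample_page = """
<tr><td class="stat-name">HR</td> <td class="stat-value">3</td></tr>
<tr><td class="stat-name">AVG</td> <td class="stat-value">.287</td></tr>
"""
print(extract_stats(sample_page))
```

Letting wget own the cookies keeps the session handling out of the scraping code, which was the whole appeal of leaning on it.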

In order for the web site to be useful at all, it would need to be updated daily (something that Yahoo! itself has a spotty track record of). It would have been possible to cron a task that started emacs and executed my emacs lisp scripts, but, again, those faithful among us would be wincing in pain at the thought of opening and closing emacs so often. It’s not an elegant solution.

I settled on using Steel Bank Common Lisp to solve my problems. It adds threads and function scheduling to the Common Lisp toolkit. I decided I would start an SBCL image and keep it running for the duration of the season. I would schedule a function to run every morning that would scrape stats and then insert those stats into a CouchDB database. It would work beautifully, harmoniously. The faithful would be pleased. Problem was, I had never used Common Lisp like this before. My exposure was limited to starting up an image, hacking around until I produced some string or file or number useful to me, and then shutting it down and going on my way. I had no right to expect that I could keep an SBCL session running for 182 games during the summer.
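The "one long-lived process that wakes up every morning" idea looks something like this when sketched in Python. It is an illustration of the plan, not the SBCL code, and the names `seconds_until` and `run_every_morning` and the 6 a.m. default are made up for the example.

```python
import threading
from datetime import datetime, timedelta


def seconds_until(hour, now):
    """Seconds from `now` until the next time the clock reads hour:00."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed (or is right now); aim for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


def run_every_morning(job, hour=6, clock=datetime.now):
    """Run `job` at the next hour:00, then re-arm the timer for the next day."""
    def fire():
        try:
            job()  # e.g. scrape the stats, then insert them into the database
        finally:
            # Re-arm even if the job raised, so one bad morning
            # does not end the season.
            run_every_morning(job, hour, clock)

    timer = threading.Timer(seconds_until(hour, clock()), fire)
    timer.daemon = True
    timer.start()
    return timer


# Example arithmetic only; no timer is started here.
print(seconds_until(6, datetime(2009, 5, 13, 5, 0)))
```

Re-arming inside `finally` is the design choice that matters most for a process meant to live all summer: a failed scrape costs one day of stats rather than every day after it.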

If I did not have a touch of mental illness — a little miswiring in my wetware subroutines — I might have let that thought stop me, and my next few posts would be about a wonderfully prosaic, conventional and dull system written in Java. It won’t be.