Two Guys Arguing

stumbling towards the clojure api

Posted in clojure by youngnh on 08.26.10

Here are two examples of code I’ve written and re-written lately:

My first example came about from dealing with DOM Document elements and Nodes, specifically getting a named attribute from a given Node:

;; bad, throws a NullPointerException if any of the Java methods return nil
(defn get-attribute1 [elt attr]
  (.. elt getAttributes (getNamedItem attr) getValue))

;; this works great
(defn get-attribute2 [elt attr]
  (when-let [attrs (. elt getAttributes)]
    (when-let [item (. attrs (getNamedItem attr))]
      (. item getValue))))

;; because lord knows I love monads
(require '[clojure.contrib.monads :as m])
(defn get-attribute3 [elt attr]
  (m/domonad m/maybe-m
    [attrs (. elt getAttributes)
     item (. attrs (getNamedItem attr))]
    (. item getValue)))

;; a slight improvement
(defn get-attribute4 [elt attr]
  ((m/with-monad m/maybe-m
     (m/m-chain [(memfn getAttributes)
		 #(. % (getNamedItem attr)) ;; take note that (memfn getNamedItem attr) will _not_ work here
		 (memfn getValue)]))
   elt))

;; recommended (look at the source of .?. for yet another way to express this, as well as some other nil-safe variants)
(use '[clojure.contrib.core :only (.?.)])
(defn get-attribute5 [elt attr]
  (.?. elt getAttributes (getNamedItem attr) getValue))

My second example came about while trying to remove duplicates from a list while keeping the unique elements in the order they were originally given.


;; this is no good, it doesn't preserve order
(defn no-dupes0 [lst]
  (into #{} lst))

;; this only works for objects implementing comparable, and even then would
;; need to be done so in a clever way that incorporates the list which the elements came from
;; as it's not a natural ordering
(defn no-dupes1 [lst]
  (sorted-set lst))

;; at a loss for what to do, I banged out this loop/recur solution quickly
(defn no-dupes2 [lst]
  (loop [result (list)
	 seen #{}
	 xs lst]
    (if (empty? xs)
      (reverse result)
      (let [x (first xs)]
	(if (contains? seen x)
	  (recur result seen (rest xs))
	  (recur (conj result x) (conj seen x) (rest xs)))))))

;; this took me a long time to come up with as I'd never tried to write
;; a lazy sequence without higher level functions like map, iterate, et al
;; http://clojure.org/lazy was a great help figuring out how to use lazy-seq
(defn no-dupes3 [lst]
  ;; I have long wondered when to use letfn instead of let when defining functions
  ;; this is a good example, as a fn defined in a let isn't visible to itself
  (letfn [(internal [lst seen]
	    (lazy-seq
	     (when (seq lst) ;; learned: (seq lst) can be used as the opposite of (empty? lst)
	       (let [x (first lst)]
		 (if (contains? seen x)
		   (internal (rest lst) seen)
		   (cons x (internal (rest lst) (conj seen x))))))))]
    (internal lst #{})))

;; recommended (look at the source to see the pros at work.
;; it uses destructuring to avoid my (let [x (first lst)] clause
;; as well as recur, which probably saves the function from blowing the
;; stack on ridiculously long seqs)
(defn no-dupes4 [lst]
  (distinct lst))

Sum to a Million

Posted in clojure by youngnh on 08.09.10

I spent this past weekend trying to sum the numbers from 1 to a million as fast as possible. Here’s what I learned.

This function ran very fast:

(defn bigsum1 []
  (let [sum (atom 0)]
    (dotimes [i 1000000]
      (swap! sum + (inc 1)))
    @sum))

This one only barely slower:

(defn bigsum2 []
  (apply + (range 1 (inc 1000000))))

I ran each one of the above in a “test harness” like so:

(dotimes [i 100]
  (time (bigsum1)))

It’s not a perfect solution, but it gives you a pretty good idea of how long a typical run lasts.

Out of curiosity, I ran the following to get a sense of how long it takes to walk a list, since, if I understand what Clojure’s doing under the hood correctly, last and count both have to walk the given lazy list from head to tail without taking any shortcuts.

(last (range 1 (inc 1000000)))
(count (range 1 (inc 1000000)))

Interestingly enough, last and count both took longer than the dotimes and apply forms.

I thought perhaps spawning a bunch of threads, each to update a counter might perform better, which was the motivation for my next variation:

(defn bigsum3 []
  (let [sum (agent 0)]
    (dotimes [i 1000000]
      (send sum + (inc i)))
    (await sum)
    @sum))

I think agents nicely match the solution I was trying for. I created a thread-safe object to hold the sum and then dispatched a bunch of updates to it using the send function which would be executed in a Clojure-managed threadpool. + is associative and commutative so it doesn’t matter what order those updates were actually executed in as long as they all got executed. So finally, the code makes sure that all of those updates got executed by calling await before returning the current value of the sum.

This function took a looong time. Much longer than I expected it to, and much longer than any variant before it. Before actually using them, I had it in my mind that agents used threads and threads were another word for speedspeedspeed. I thought that, for sure, this code would run faster than bigsum1 and 2. Not out of any reasoned analysis, but simply based on the fact that they used those magical things called threads. I’m not saying that Clojure agents are slow, or that with some tweaking I couldn’t get an agent-based function to run very very fast, merely that my initial impressions of them were askew. I would love to hear from the Clojure-verse how I might tweak my agents example to make it fly.


Side-bar: What not to do:
Faced with the initial slowness of the agents solution, I was tempted to write this:

(defn bigsum4 []
  (let [sum (atom 0)]
    (dotimes [i 1000000]
      (.send (Thread. (fn []
        (swap! sum + (inc i))))))
    @sum))


I figured that maybe if I spawned the threads myself and used a thread-safe atom, my code would be faster than bigsum3. Not only was this code no faster than bigsum3, it’s incorrect. All of the threads it spawn may not have finished by the time it derefs sum. This code is very likely to return some intermediate result, atomically updated, but no less wrong. The lesson I learned here was not to confuse thread-safe with “don’t have to think about threads”.

Follow

Get every new post delivered to your Inbox.