Two Guys Arguing

Sum to a Million

Posted in clojure by youngnh on 08.09.10

I spent this past weekend trying to sum the numbers from 1 to a million as fast as possible. Here’s what I learned.

This function ran very fast:

(defn bigsum1 []
  (let [sum (atom 0)]
    (dotimes [i 1000000]
      (swap! sum + (inc i)))
    @sum))

This one was only barely slower:

(defn bigsum2 []
  (apply + (range 1 (inc 1000000))))

I ran each one of the above in a “test harness” like so:

(dotimes [i 100]
  (time (bigsum1)))

It’s not a perfect solution, but it gives you a pretty good idea of how long a typical run lasts.

Out of curiosity, I ran the following to get a sense of how long it takes to walk a list, since, if I understand what Clojure’s doing under the hood correctly, last and count both have to walk the given lazy list from head to tail without taking any shortcuts.

(last (range 1 (inc 1000000)))
(count (range 1 (inc 1000000)))

Interestingly enough, last and count both took longer than the dotimes and apply forms.

I thought perhaps spawning a bunch of threads, each updating a counter, might perform better, which was the motivation for my next variation:

(defn bigsum3 []
  (let [sum (agent 0)]
    (dotimes [i 1000000]
      (send sum + (inc i)))
    (await sum)
    @sum))

I think agents nicely match the solution I was trying for. I created a thread-safe object to hold the sum and then dispatched a bunch of updates to it using the send function; those updates get executed on a Clojure-managed thread pool. + is associative and commutative, so it doesn't matter what order those updates are actually executed in, as long as they all get executed. Finally, the code makes sure that all of those updates have been executed by calling await before returning the current value of the sum.

This function took a looong time. Much longer than I expected it to, and much longer than any variant before it. Before actually using them, I had it in my mind that agents used threads and threads were another word for speedspeedspeed. I thought that, for sure, this code would run faster than bigsum1 and 2. Not out of any reasoned analysis, but simply based on the fact that they used those magical things called threads. I’m not saying that Clojure agents are slow, or that with some tweaking I couldn’t get an agent-based function to run very very fast, merely that my initial impressions of them were askew. I would love to hear from the Clojure-verse how I might tweak my agents example to make it fly.


Side-bar: What not to do:
Faced with the initial slowness of the agents solution, I was tempted to write this:

(defn bigsum4 []
  (let [sum (atom 0)]
    (dotimes [i 1000000]
      (.start (Thread. (fn []
                         (swap! sum + (inc i))))))
    @sum))


I figured that maybe if I spawned the threads myself and used a thread-safe atom, my code would be faster than bigsum3. Not only was this code no faster than bigsum3, it's incorrect: all of the threads it spawns may not have finished by the time it derefs sum. This code is very likely to return some intermediate result, atomically updated but no less wrong. The lesson I learned here was not to confuse thread-safe with "don't have to think about threads".
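
For what it's worth, here is a minimal sketch of how the raw-thread idea could be made correct: keep references to the threads and join them all before reading the atom. Spawning a million threads would be impractical, so this sketch (the function name and thread count are my own, not code from the post) gives each of a handful of threads a slice of the range:

(defn bigsum4-joined []
  (let [sum      (atom 0)
        nthreads 4
        chunks   (partition-all (/ 1000000 nthreads) (range 1 (inc 1000000)))
        ;; one thread per chunk, each adding its partial sum into the atom
        threads  (doall
                   (for [chunk chunks]
                     (doto (Thread. #(swap! sum + (reduce + chunk)))
                       (.start))))]
    ;; the step bigsum4 is missing: wait for every thread to finish
    (doseq [thread threads]
      (.join ^Thread thread))
    @sum))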


7 Responses


  1. Alex Miller said, on 08.09.10 at 11:45 pm

    This is a great example of how task granularity affects parallel computation. In this case, you’ve broken an overall task of (naively) 1000000 units into 1000000 tasks, which is very fine-grained parallelism. The problem here is that the overhead of sending and receiving those tasks (not to mention the contention on updating the single output result) dwarfs the gains from using multiple threads.

    Two things to try:

    1) Reduce granularity by having each “task” involve counting 1 to N instead of just 1. Run this for a range of N’s and find the sweet spot (see the sketch below).

    2) Reduce contention – instead of using agents, explicitly spawn (+ 2 (.availableProcessors (Runtime/getRuntime))) threads (which is what agent is using) and have each thread count to 1000000/THREADS, then combine the result at the end.
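
    A rough sketch of suggestion 1, batching the sends so each task sums a whole slice of the range, might look something like this (the function name and batch size are illustrative, not Alex's code):

    (defn bigsum-batched [n batch-size]
      (let [sum (agent 0)]
        ;; one send per batch instead of one send per number
        (doseq [batch (partition-all batch-size (range 1 (inc n)))]
          (send sum + (reduce + batch)))
        (await sum)
        @sum))

    ;; try a range of batch sizes to find the sweet spot:
    ;; (time (bigsum-batched 1000000 1000))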

  2. benjaminplee said, on 08.10.10 at 7:36 am

    Firstly, did you try (/ (* 1000000 1000001) 2) ?

    ;-) I would hope that would be the fastest … sum of all numbers 1 to n = (n*(n+1))/2

    Snide remarks aside, you do pose an interesting problem. A surprisingly small amount of the code we write is actually CPU bound; most is bound by some form of I/O. Since there is absolutely no I/O in our calculation, it stands to reason that you “could” see significant performance gains by calculating in parallel up to the number of processing units available (e.g. CPU cores). More than that and you would probably be wasting time switching between tasks on the same CPU.

    It should be easy to run some metrics and produce graphs showing the performance differences of calculating the value across varying numbers of agents.

    It would also be interesting to see if you could leverage your machine’s GPU. I have lost touch with how newer GPU internals work, but if I recall correctly they are specialized to do large amounts of simple math calculations, highly parallelized.

    • Amos King said, on 08.10.10 at 9:58 pm

      Darn you beat me to it. I had this loaded up the other day, but didn’t get to read it until now. I posted my comment and then noticed there were more comments. Damn you BEN!

  3. Justin Kramer said, on 08.10.10 at 11:35 am

    (n*(n+1))/2 notwithstanding, here’s a version that uses primitive longs and unchecked ops:


    (defn triangle-num [n]
      (let [n (long n)]
        (loop [sum (long 0)
               i   (long 1)]
          (if (> i n)
            sum
            (recur (unchecked-add sum i)
                   (unchecked-inc i))))))

    user> (time (triangle-num 1000000))
    "Elapsed time: 0.868 msecs"
    500000500000

    I’d be surprised if a parallel version could match this.

  4. Justin Kramer said, on 08.10.10 at 1:13 pm

    Follow-up: I was curious and wrote a parallel version. It seems to perform comparably when summing to 1 million and about twice as fast summing to 100 million. Apparently spawning threads is fast!

    It divides the work into batches similar to how Alex suggests and then spawns futures to do the crunching. Here’s the code: http://gist.github.com/517696
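
    The gist is the authoritative version; as a hedged illustration of the batches-plus-futures idea, something along these lines works (the name, batch count, and structure are guesses, not necessarily what the gist does):

    (defn psum [n nbatches]
      (let [batch-size (long (Math/ceil (/ n (double nbatches))))
            batches    (partition-all batch-size (range 1 (inc n)))
            ;; one future per batch; each computes its partial sum on another thread
            partials   (doall (map #(future (reduce + %)) batches))]
        ;; deref blocks until the corresponding future has finished
        (reduce + (map deref partials))))

    ;; (time (psum 1000000 4))  ;=> 500000500000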

  5. Amos King said, on 08.10.10 at 9:56 pm

    n(n+1)/2 done

    (/ (* n (+ n 1)) 2), or is that better for you?

    n = 1000000

  6. Ngoc said, on 08.17.10 at 4:07 am

    > I might tweak my agents example to make it fly

    I think you cannot fly with an agent because an agent only solves the locking problem; all computations are sent to only one unit of work.

    To fly, you should have many units of work. For example, use “pmap” to sum in parallel with 2 units of work (sketched below):
    * Sum from 1 to 500000
    * Sum from 500000 + 1 to 1000000
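
    A hedged sketch of that idea with two fixed sub-ranges (the function name is an illustration):

    (defn bigsum-pmap []
      ;; sum each half of the range on its own thread, then combine
      (reduce +
              (pmap #(reduce + %)
                    [(range 1 (inc 500000))
                     (range 500001 (inc 1000000))])))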

