Two Guys Arguing

Competing Parsers

Posted in clojure, haskell by youngnh on 09.28.11

So, I’m writing a parser for the Turtle serialization format for RDF. In addition to being a format we use all the time at Revelytix, its a decently compact grammar, giving me a good chance to implement it using The Parsatron and suss out some of the library’s rough edges.

I hit my first rough edge with the longString production:

longString ::= """" lcharacter* """

but, having already implemented the lcharacter parser already, I didn’t see the subtleties in this production and plowed ahead with this straightforward definition:

(defparser long-string []
  (between (times 3 (char \")) (times 3 (char \"))
           (many (lcharacter))))

Which looks great and compiles without complaint, but when you feed it input, it immediately complains:

> (run (long-string) "\"\"\"roughshod\"\"\"")

Unexpected end of input at line: 1 column: 16
[Thrown class java.lang.RuntimeException]

The message here could be better, and I’ll work on that. I would want it to say Unexpected end of input, expected '"""', because what happened was the (many (lcharacter)) parser consumed too much.

Turns out, lcharacter is defined in the grammar to include double quotes, so (many (lcharacter)) ate as many as it could until it literally ran out of input.

A good regex can handle this:

> (re-find #"\"\"\".*\"\"\"" "\"\"\"roughshod\"\"\"")
"\"\"\"roughshod\"\"\""

So we should be able to as well. To keep track of whether or not we’ve consumed a hat trick of quotes, my first attempt looked something like this:

(defparser long-string []
  (letfn [(middle-part [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (two-left s)
                (middle-part (concat s [c])))))
          (two-left [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (one-left s)
                (middle-part (concat s [\" c])))))
          (one-left [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (always s)
                (middle-part (concat s [\" \" c])))))]
    (>> (times 3 (char \"))
        (middle-part []))))

Which uses 3 local, mutually-recursive functions to “count” each consecutive double quote. And they all look a lot alike. I refactored to this:

(defparser long-string []
  (letfn [(middle-part [s n]
            (let->> [c (lcharacter)]
              (if (= c \")
                (case n
                      0 (middle-part s 1)
                      1 (middle-part s 2)
                      2 (always s))
                (middle-part (concat s (repeat n \") [c]) 0))))]
    (>> (times 3 (char \"))
        (middle-part [] 0))))

The above works well, but the problem arises in the first place because lcharacter and """ share the same single-character lookahead. By examining only the next character in the input, we can’t
tell if it should belong to lcharacter or """. This suggests that we can lookahead 3 characters at a time and if we receive """, then we can interpret that not as 3 lcharacters, but as a terminating triple double quote.

(defparser long-string []
  (between (times 3 (char \")) (times 3 (char \"))
           (many
            (let->> [cs (lookahead (times 3 (lcharacter)))]
              (if (= cs [\" \" \"])
                (never)
                (lcharacter))))))

I’m not sure quite which way to go, nor can I immediately see a way to make a higher-level lookahead parser that ensures that 2 parsers don’t stomp all over each other, though that would be quite ideal. If you can, chime in below in the comments.

If you’d like to follow the development of The Parsatron, it’s on github

Tagged with:

One fn to bind them

Posted in clojure, haskell by youngnh on 11.21.10

I had a chance to work on my parsec port a little this weekend. Say hello to one of the most important and ubiquitous parsers in the parsec arsenal, parser-bind.

The idea behind parser-bind is that it should squish two parsers together. It represents parsing one thing after another. The only other parser we’ve built that squishes two parsers together is parser-plus, which operates more like “or” in that if the first one fails, it tries the second. This parser will quit immediately whenever either fails. If this parser succeeds, it’s because each matched successive input.

(defn parser-bind [m n]
  (fn [state cok cerr eok eerr]
    (letfn [(mcok [item state]
              (n state cok cerr cok cerr)))
            (meok [item state]
              (n state cok cerr eok eerr)))]
      (m state mcok cerr meok eerr))))

If the first parser, m, consumes ok, but the second one, n, does not consume, our combined parser will still call the cok continuation. Conversely, if the first one is empty and ok, but the second one consumes, we will also escape via the cok continuation. parser-bind does not override any of the error handling continuations because if something goes wrong, we use them to exit immediately.

The useful part of parser-bind isn’t in the the above implementation. It isn’t how parsec implements the idea. Parsec’s implementation does take the first parser, m, but for it’s second argument, it takes a function that, when executed, returns the second parser.

This is a neat idea because the unlike a parser that has to be fully specified at write-time, a function can bind intermediate, runtime results. Those intermediate results, once bound and named can be used to create further parsers. It allows us to write let-like forms:

(p-let [c (one-of "abc")]
  (char c))

Where each binding form in the parser let has to be a destructuring form and parser pair. The above is a parser that parses a character, and then looks for a duplicate of what it just parsed, similar to capture groups in regular expressions. p-let uses parser-bind under the covers:

(defmacro p-let [[& bindings] & body]
  (let [[bind-form p] (take 2 bindings)]
    (if (= 2 (count bindings))
      `(parser-bind ~p (fn [~bind-form] ~@body))
      `(parser-bind ~p (fn [~bind-form] (p-let ~(drop 2 bindings) ~@body))))))

Given only a single binding pair, we make the parser in it the first argument to parser-bind, and wrap a function with it’s destructing form as args, returning the body. In longer binding forms, we produce a recursive structure that macroexpand will continue to expand one binding form at a time.

Tagged with:

Porting Haskell’s Parsec

Posted in clojure, haskell by youngnh on 11.11.10

Parsec is a parser-combinator library. Parser combinators are built around the idea of making a bunch of very small and focused parsers and combining them using operators that one would more usually see in regular expressions. Ultimately leading to parsers that feel more like function calls in a program that a stiff declaration of a grammar. Parsec is king in Haskell-land. In Clojure, however, there are a number of libraries fully- and not-so-fully written that can be used to write parsing programs. fnparse, Amotoen, clarsec, parser, clj-peg are just a few, feel free to mention your favorites in the comments. I don’t mean to leave any out, but rather point to out that what I’m doing here is not new. I do hope it’s illuminating for some.

Parsec, as I see it, boils down to 2 ideas.
A parser either consumes input or doesn’t. Consumed or Empty.
A parser either succeeds in parsing or it fails. Ok or Err.

These outcomes can be combined into 4 continuation functions that are passed to every parser:

  • cok – Consumed & Ok
  • cerr – Consumed & Err
  • eok – Empty & Ok
  • eerr – Empty & Err

As for errors, Parsec defines two types of them. Those that we can say something about, and those that we can say nothing about. These are errors with messages and unknown errors, respectively. Of the errors that we can say something about, some are the result of not finding input that the parser was expecting, which lead to messages like “expected ‘a’ and found ‘b’”, and some are the result of not finding input where we expected to, which lead to messages like “unexpected end of input”.

Finally, Parsec keeps tabs on the thing it’s parsing, it maintains state. The state is made up of 2 elements, the input stream itself, of which a Clojure seq models nicely and the current source position, itself made up of the name of the input, and one’s current line and column location in it.

The Most Basic Parsers

The simplest parser is the one that no matter what, returns a constant value. This is called parserReturn in Haskell, but in Clojure, it’s more akin to the constantly function, so I’ve named it always, and here’s it’s simplified implementation:

(defn always [x]
  (fn [state cok cerr eok eerr]
    (eok x state)))

This implementation makes sense. No matter what, it returns a new parser. A parser is merely a fn that takes a state and 4 continuations. The always parser always calls the Empty & Ok continuation. Nothing was removed from the stream (hence the Empty part), and everything should continue on as normal (the Ok part).

Equally simple is the parser that always fails. This is called parserZero in Haskell, since it represents a “nothing” parser.

(defn parser-zero []
  (fn [state cok cerr eok eerr]
    (eerr (unknown-error state))))

(defn unknown-error [{:keys [pos] :as state}]
  (ParseError. pos []))

More Interesting Parsers

One of the more basic parsers in Parsec is tokenPrim, which processes a single element from the underlying stream. It unconses the first element from the head of the input, tests if it is supposed to be consumed and then updates the state’s current position in the input. To do this, it takes 3 functions.

nextpos calculates a new source position based on the item consumed and the old position.
test takes a single element from the underlying stream and returns whether or not to consume it
showToken is used to create readable error messages by returning a string representation of stream elements

(defn token-prim [show-f nextpos-f consume?]
  (fn [{:keys [input pos] :as state} cok cerr eok eerr]
    (if-let [s (seq input)]
      (let [item (first s)
            rest-of-input (next s)]
        (if (consume? item)
          (let [newpos (nextpos-f pos item rest-of-input)
                newstate (InputState. rest-of-input newpos)]
            (cok item newstate))
          (eerr (unexpect-error (show-f item) pos))))
      (eerr (unexpect-error "" pos)))))

There are three ways the above function continues. Two are through eerr, one when there is nothing left in the seq when we were expecting to parse something, and one when we did parse something, but our test told us not to consume it. In the second case we can produce a decently readable description of the item so that we can later present it to the user. Finally, if our test tells us to go ahead and consume the item, we call cok passing it the item and a newly calculated state with a new position and the input without our consumed item on the front.

There’s a lot of parsers we can implement on top of token-prim, however, it’s got no brain. You can only line up a number of token parsers one after another and let them tell you if the input matched in the order you thought it would. We can’t express the idea of “or” with it. For that, Parsec relies on the parserPlus parser. It’s called “plus” because it’s used to glue multiple parsers into a single one, analagous to how addition of numbers glues them all together into a new, single number (I never used to think about things like this. Haskell has made me re-understand everything I already knew).

The strategy for implementing parserPlus is that it will take 2 parsers and try the first one. If that succeeds, we’ll go with that. If it doesn’t, we try the second one, and if it succeeds we want our combined parser to be indistinguishable from that second parser. If neither work, then our parser didn’t work and we want to escape like any other parser would if it failed. Calling the first parser is easy. For the sake of staying close to the original Haskell, we’ll call this parser m. Parsers in Haskell and Clojure are simply functions, so in order to try it, we can invoke it and pass the current state and the 4 continuations it expects.

The continuations are our hook to intercept failures. We know that if m fails, it will call the fourth continuation we pass it. So to try the second parser, n second we’re going to wrap the eerr function (the 4th continuation) by trying that second parser before giving up and calling eerr. Here’s how it looks in Clojure:

(defn parser-plus [m n]
  (fn [state cok cerr eok eerr]
    (letfn [(meerr [err]
               (letfn [(neok [item state-prime]
                          (eok item state-prime))
                        (neerr [err-prime]
                          (eerr (merge-error err err-prime)))]
                 (n state cok cerr neok neerr)))]
      (m state cok cerr eok meerr))))

The loacally nested functions aren’t exactly readable at a glance, but combined with the knowledge of what’s happening it’s a really elegant way to express the idea. Also, as a small note, there aren’t great names for some of the nested function parameters. state-prime and err-prime? Well, that’s a holdover from Haskell to express that the thing is an altered version of the thing it came from. In mathematics, this is expressed as a tick, state' and err'. Those aren’t legal Clojure 1.2 identifiers, so I opted to be verbose. Starting with the Clojure 1.3 alphas available now, tick is a legal constituent character, which means you can use it anywhere in an identifier except as the first character.

The last parser I’d like to tackle in this blog post is manyAccum. This parser wraps behavior around an existing parser and so becomes a tangle of continuation functions just like parser-plus was, but unlike parser-plus, manyAccum only accepts one parser and attempts to apply it 0 or more times. This is the Parser equivalent of the Kleene operator.

Just like parser-plus, we’re going to invoke the parser manyAccum is given and create a new parser by manipulating the continuations we pass to it. Specifically, if the parser we’re given fails to consume any input (calls eerr), we’re going to hijack that and report that it was instead an eok with an empty list. If the parser succeeds in consuming input, we’re going to try to get it to do it again. And again. And again forever. Here’s what it looks like:

(defn many-accum [p]
  (fn [state cok cerr cok eerr]
    (letfn [(many-err [err] (throw (RuntimeException. "combinator '*' is applied to a parser that accepts an empty string")))
             (continue [coll item state-prime]
               (p state-prime (partial continue (cons item coll)) cerr many-err (fn [_] (cok (cons item coll) state-prime))))]
      (p state (partial continue (seq [])) cerr many-err (fn [_] (eok [] state))))))

We define many-err to immediately quit with an exception if the third continuation, eok, is called since that means that p accepts empty strings and would spin forever if we let it. The only other trick to many-accum is that we create continue to accumulate items by first calling it with an empty seq, (seq []) and then consing further consumed items onto the front. Haskell’s many-accum takes a cons-like operator in addition to p as a more flexible way of creating a list of elements.

A final Note

I intentionally stayed away from Monads in this post (which is no easy task when porting Haskell), averting my eyes from Konrad Hinsen’s clojure.contrib.monad and trying wherever possible to make Clojure functions feel less like Haskell functions obsessed with parentheses. Not because Monads are particularly special or complex, but rather just the opposite. Monads fall out of designs that favor composability and uniformity. The first parser of this post, always, is half of an implementation of Monad. parser-zero and parser-plus are 100% of an smaller class of monads called MonadPlus. Reading clj-http’s source, I felt like it was such clean and idiomatic Clojure, with fantastic composablity properties that made it easy to build on top of, but also like it would be very easy to expess in Haskell and not feel forced or awkward. So it’ll be interesting to finish this port and see if I can succeed in doing the same in the opposite direction.

Tagged with:

OpenGL and Haskell

Posted in haskell by youngnh on 02.20.10

Haskell doesn’t sacrifice speed for power, or abstraction for control

Haskell is an amazing language for this type of program for a couple of reasons. It’s static typing eliminates an entire class of errors before your program will even compile. It is garbage collected, so tricky memory-management code is non-existent, it has a fantastic foreign function interface allowing it to wrap and call any C code ever written (and vice versa), and it compiles to native libraries and executables, making it not only fast, but a legitimate candidate to run on platforms that don’t yet have or won’t allow virtual machine code interpreters (iPhones, Pres and their ilk).

So where do we start learning how to write OpenGL in Haskell?

The first thing to realize is that there are a million OpenGL tutorials in C and in comparison, far fewer in Haskell. That’s ok, because Haskell’s bindings to OpenGL are low-level enough that you can actually use C examples to guide your Haskell code. For example, I found this OpenGL example demonstrating how to use GLUT to make drawing a cube super-easy. To make the executable, I had to download a ton of libraries, but they were all available via apt in Ubuntu and I could finally write this Makefile to make compiling and linking a one-command affair:

cube: cube.o
	gcc -o cube cube.o -lglut

Running the resultant cube executable, produces a pretty picture:

Red Cube

The rest is getting familiar with the Haskell OpenGL and GLUT libraries. The functions reside in the Graphics.Rendering.OpenGL and Graphics.UI.GLUT modules. You can find their documentation here and here.

I wrote my first version in a manner that I thought most closely resembled the C syntax. The n, faces and v functions in the Haskell version are standins for the arrays the C version uses.

n :: [Normal3 GLfloat]
n = [(Normal3 (-1.0) 0.0 0.0),
     (Normal3 0.0 1.0 0.0),
     (Normal3 1.0 0.0 0.0),
     (Normal3 0.0 (-1.0) 0.0),
     (Normal3 0.0 0.0 1.0),
     (Normal3 0.0 0.0 (-1.0))]

faces :: [[Vertex3 GLfloat]]
faces = [[(v 0), (v 1), (v 2), (v 3)],
         [(v 3), (v 2), (v 6), (v 7)],
         [(v 7), (v 6), (v 5), (v 4)],
         [(v 4), (v 5), (v 1), (v 0)],
         [(v 5), (v 6), (v 2), (v 1)],
         [(v 7), (v 4), (v 0), (v 3)]]

v :: Int -> Vertex3 GLfloat
v x = Vertex3 v0 v1 v2
    where v0
              | x == 0 || x == 1 || x == 2 || x == 3 = -1
              | x == 4 || x == 5 || x == 6 || x == 7 = 1
          v1
              | x == 0 || x == 1 || x == 4 || x == 5 = -1
              | x == 2 || x == 3 || x == 6 || x == 7 = 1
          v2
              | x == 0 || x == 3 || x == 4 || x == 7 = 1
              | x == 1 || x == 2 || x == 5 || x == 6 = -1

And here’s the C code:

GLfloat light_diffuse[] = {1.0, 0.0, 0.0, 1.0};  /* Red diffuse light. */
GLfloat light_position[] = {1.0, 1.0, 1.0, 0.0};  /* Infinite light location. */
GLfloat n[6][3] = {  /* Normals for the 6 faces of a cube. */
  {-1.0, 0.0, 0.0}, 
  {0.0, 1.0, 0.0}, 
  {1.0, 0.0, 0.0},
  {0.0, -1.0, 0.0}, 
  {0.0, 0.0, 1.0}, 
  {0.0, 0.0, -1.0} };
GLint faces[6][4] = {  /* Vertex indices for the 6 faces of a cube. */
  {0, 1, 2, 3}, 
  {3, 2, 6, 7}, 
  {7, 6, 5, 4},
  {4, 5, 1, 0}, 
  {5, 6, 2, 1}, 
  {7, 4, 0, 3} };
GLfloat v[8][3];  /* Will be filled in with X,Y,Z vertexes. */
  /* Setup cube vertex data. */
  v[0][0] = v[1][0] = v[2][0] = v[3][0] = -1;
  v[4][0] = v[5][0] = v[6][0] = v[7][0] = 1;
  v[0][1] = v[1][1] = v[4][1] = v[5][1] = -1;
  v[2][1] = v[3][1] = v[6][1] = v[7][1] = 1;
  v[0][2] = v[3][2] = v[4][2] = v[7][2] = 1;
  v[1][2] = v[2][2] = v[5][2] = v[6][2] = -1;

Also, the drawBox function was interesting to write, because Haskell has no for loop, so I had to rethink what was going on and translate the idea of executing a block of code over a list of data into it’s functional equivalent.

drawBox :: IO ()
drawBox = let nfaces = zip n faces
          in do mapM (\(n, [v0, v1, v2, v3]) -> do
                        renderPrimitive Quads $ do
                          normal n
                          vertex v0
                          vertex v1
                          vertex v2
                          vertex v3) nfaces
                return ()

The Haskell OpenGL bindings have no glBegin/glEnd functions, but rather, renderPrimitive, which takes a PrimitiveMode and a function/block of vertex-related actions.

Beyond that, the only other thing that tripped me up was that I couldn’t figure out how to enable depth testing. I passed the option to display mode to use a depth buffer:

initialDisplayMode $= [DoubleBuffered, RGBMode, WithDepthBuffer]

and in C, there’s a single call to enable it:

glEnable(GL_DEPTH_TEST);

which I couldn’t find anywhere in the Haskell API. I figured it would look like the call to enable lighting:

lighting $= Enabled

so I ended up choosing depthMask $= Enabled, but my first clue that things weren’t working was what my program displayed:

Red Cube Without Face

More of a box than a cube, really.

The key, I found, was that the C version makes a call to enable depth buffering, but omits a call to set the actual depth test the depth buffer uses, relying on default behavior. The Haskell API only provides the second, depth test-setting operation.

depthFunc $= Just Lequal

in place of my misguided depthMask, that code did the trick.

You can find my finished Haskell source in this gist right above the original C program I translated. If you’re interested in running it, runhaskell Cube.hs should do the trick.

This is by no means the best looking Haskell code. It could be more idiomatic, but I wrote it this way to show how similar it looked to the C code it came from. It ended up still being 20 lines shorter and allows you to focus on more important ideas than properly updating index counters or setting integer bit-flags properly.

Tagged with: ,

rock smashes scissors

Posted in haskell by youngnh on 01.27.10

as simple as A -> B -> C

So, almost two months ago I issued a challenge to Ben to implement a silly spec of a problem I had recently been playing around with. The response and the knowledge I got out of this has been tremendous.

Kevin Berridge wrote an implementation.
Josh Schramm started his own fork.
Ben, my co-blogger, wrote his own implementation.
Heath Borders wrote an implementation and then gave a fantastic presentation on his solution, which sparked a lot of discussion amongst some very smart folks at our office.

I first implemented this in Haskell. It took me a shade under 2 hours and came in at 100 lines of code even. Prodding from the peanut gallery on twitter eventually led me to cut even that number by 25 lines.

To my eye, the solution that I came up was eminently readable, flexible, and precisely solved the problem with no bloat. What amazed me more, was that this was my first try at the problem. I’ve produced code I fell in love with on more than one occasion and in more than one language, but never on the first try. Something about Haskell made turning thoughts and words into working software immensely easy.

I think it is Haskell’s type declarations. Originally, before anybody else had written their implementations, I conceived of this challenge as a way of showing how cool and natural and easy it is to go from a series of type declarations to a fully working program. This was my thought process while writing my Haskell RPS:

The two basic datatypes are Player, which is a container type for a name, and Throw of which there are Rock, Paper and Scissors specializations. Everything else comes from that.

A Haskell type declaration looks like this:


A -> B -> C

This declaration describes a function that takes a type A, a type B and returns a type C.

If you want to display a message to the user? Haskell has a type signature for that:


String -> IO ()

The IO () part is a peculiarity of Haskell’s purity. It basically says that nothing is returned, that’s the () part, but that input and output have been performed, that’s the IO part.

Now, say you want to read that string back into an object your program can use?


(Read a) => String -> a

This signature says that given a String, the function will return a type that implements the Read typeclass. Read is most like a Java interface, but the function implementing this type declaration would utilize it using techniques more akin to Java generics.

Those two functions can then be combined into a function that takes a prompt string and returns an object:


(Read a) => String -> IO a

What the above signature says is that, given a string, this function will perform IO and return any type that implements Read. Exactly which type that is depends on the context the function is used in. That the same function depending on how it is used can return any conforming datatype, is the true power of generics and type inference.

This signature really encompasses the first couple of lines of the spec. By implementing the Read typeclass for Player and Throw, we can now prompt the user and get back Player and Throw objects that we can then go on to do processing with. Being able to prompt and read in responses gets us halfway through the spec. Halfway!

What’s left is to define when a player has won and to play the game until that situation occurs. There are a couple of ways you could choose to determine a winner, but my type signature would look like this:


(Player, Integer) -> (Player, Integer) -> Maybe Player

Which says that given a tuple for each player and their score, you may or may not be able to return a winner. The cool part is that the “may not” part is the same type as the “may” part, which makes handling each case super-explicit and easy to follow.

With how to win the game in place, finally, it is time to implement the overall game. Here’s the signature I came up with:


WinLogic -> (Player, [Throw]) -> (Player, [Throw]) -> Player

I called the operation of determining a winner WinLogic. This says that if you can pass in a WinLogic object and a tuple of each player and all of their throws, then you can finally return a winning player. You’re thinking, wait, how can you pass in all of a player’s throws if they won’t have given them to you until the game’s over? Well, that’s where Haskell’s lazy evaluation makes things easy, but it’s really no more magic than passing an Iterator object in Java.

With a few extra flourishes, the Haskell implementation is done. This is the post that I had written in my mind when I first issued my co-blogger the Rock Paper Scissors challenge. Of course, what happened after made things a bit messier, and there’s some other interesting lessons to learn from everybody else’s efforts, so I’ll cover some of the ensuing happenings in later posts.

Follow

Get every new post delivered to your Inbox.