Two Guys Arguing

HTML Entities

Posted in javascript by youngnh on 04.01.09

I found John Resig’s excellent env.js a few weeks ago, immediately decided that I could do a better job, and tried writing my own version.  The promise of the tool, for me, is that it lets me scrape web pages with very natural JavaScript DOM methods instead of complicated and fragile regular expressions.

Part of my decision to write my own version stemmed from the fact that when I first tried out env.js, now being maintained by thatcher on github, it choked all over the very first web page I gave it.  It’s a wonder I noticed the errors it spit out at all, though, as env.js logs an awful lot of useless information to the console.  Maybe it’s a unix aesthetic, but I feel solid, working programs are quiet programs.

Those two minor quibbles aside, though, there are a lot of neat ideas going on in env.js.  Besides, with git, I could very easily fork thatcher’s work and go whichever way I wanted to.  Long story short, I sat back down with env.js last night and fed it some web pages.  It did choke all over them, but this time, I stuck around and figured out what the problem was.

env.js uses an internal SAX parser, written by David Joham and Scott Severtson, that looks for any ampersands in its input and tries to interpret them as HTML escape sequences.  That in itself isn’t all that illogical: if you type &amp; into a web page, env.js should be able to handle it properly.  However, it was doing so inside of <script> tags and dying when it encountered the javascript && operator.  Furthermore, the unescaping logic searched from the first ampersand to the first semicolon after it to decide what escape sequence it was looking at.  That is bad news when the input is a language descended from C, which uses operators that start with & and statements that end with ;
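
To make the failure concrete, here is a minimal sketch of that kind of unescaping logic. It is not the actual env.js parser code: the entity table, the function name naiveUnescape, and the choice to throw on an unknown sequence are all assumptions made for illustration.

```javascript
// Hypothetical, deliberately naive unescaper (NOT the real env.js code).
// It treats everything between an ampersand and the next semicolon
// as the name of an HTML escape sequence.
const ENTITIES = new Map([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
]);

function naiveUnescape(text) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const amp = text.indexOf("&", i);
    if (amp === -1) {
      // No more ampersands: copy the rest through unchanged.
      out += text.slice(i);
      break;
    }
    const semi = text.indexOf(";", amp);
    if (semi === -1) {
      throw new Error("unterminated escape sequence");
    }
    const name = text.slice(amp + 1, semi);
    if (!ENTITIES.has(name)) {
      // Assumption: the parser gives up on a sequence it does not know.
      throw new Error("unknown escape sequence: &" + name + ";");
    }
    out += text.slice(i, amp) + ENTITIES.get(name);
    i = semi + 1;
  }
  return out;
}
```

Fed ordinary page text such as "a &amp; b", this returns "a & b". Fed the body of a script such as "if (a && b) { run(); }", it takes everything from the first & up to the semicolon after run() as a single escape sequence and throws.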

I put in a ticket at env.js’s lighthouse site, and am currently working on a patch.

2 Responses

  1. benjaminplee said, on 05.27.09 at 9:35 am

    Did you ever finish the patch?

    • youngnh said, on 05.27.09 at 9:51 am

      I did actually. Take a look at my github commit for it: Line 3446 is the start of the meat of the change. (on a side note, I wish I knew how to commit only textual and not whitespace changes in git)

      I sent Thatcher (the current maintainer of env-js) a pull request and I think he incorporated my changes.

      It basically consisted of checking while parsing if the parser was in a script tag or not and then only replacing html entities if it wasn’t.

      I wasn’t very happy with my solution because it’s really only a band-aid over the larger problem of env-js lacking a proper HTML parser. The parser it has right now, I believe, is only a simple XML SAX parser. There’s been a flurry of activity on env-js that I haven’t been following very well lately, so this may no longer be the case.
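
The band-aid described in the reply above (track whether the parser is inside a script tag, and only replace HTML entities when it is not) can be sketched roughly like this. It is not the code from the actual commit: the handler shape, the names makeHandler and unescapeEntities, and the small entity table are hypothetical, and a regular expression stands in for the parser's own unescaping routine.

```javascript
// Hypothetical sketch of the fix (NOT the code from the actual commit).
// Small illustrative table; numeric entities such as &#38; are not handled.
const ENTITY_TABLE = new Map([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
]);

// Replace only sequences that name a known entity; leave anything else alone.
function unescapeEntities(text) {
  return text.replace(/&(\w+);/g, (match, name) =>
    ENTITY_TABLE.has(name) ? ENTITY_TABLE.get(name) : match
  );
}

// A made-up SAX-style handler that remembers whether it is inside <script>.
function makeHandler() {
  let inScript = false;
  const chunks = [];
  return {
    startElement(name) {
      if (name.toLowerCase() === "script") inScript = true;
    },
    endElement(name) {
      if (name.toLowerCase() === "script") inScript = false;
    },
    characters(text) {
      // Script bodies pass through untouched; other text gets unescaped.
      chunks.push(inScript ? text : unescapeEntities(text));
    },
    result() {
      return chunks;
    },
  };
}
```

A parser driving this handler would call startElement("script") before the script body and endElement("script") after it, so text like "if (a && b)" reaches characters() while the flag is set and is left alone. As the reply says, this only works around the problem; it does not turn an XML SAX parser into an HTML parser.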
