Two Guys Arguing

HTML Entities

Posted in javascript by youngnh on 04.01.09

I found John Resig’s excellent env.js a few weeks ago, and immediately decided that I could do a better job, and so tried writing my own version.  The promise of the tool for me, is that with it I can use very natural javascript dom methods to scrape web pages instead of complicated and fragile regular expressions.

Part of my decision to write my own version stemmed from the fact that when I first tried out env.js, now being maintained by thatcher on github, it choked all over the very first web page I gave it.  Its a wonder I noticed the errors it spit out at all, though, as env.js logs an awful lot of useless information to the console.  Maybe its a unix aesthetic, but I feel solid working programs are quiet programs.

Those two minor quibbles aside, though, there are a lot of neat ideas going on in env.js.  Besides, with git, I could very easily fork thatcher’s work and go whichever way I wanted to.  Long story short, I sat back down with env.js last night and fed it some web pages.  It did choke all over them, but this time, I stuck around and figured out what the problem was.

env.js uses an internal SAX parser written byDavid Joham and Scott Severtson, that looks for any ampersands in its input and tries to interpret them as HTML escape sequences.  That in itself isn’t all that illogical.  If you type &amp; into a web page, env.js should be able to handle it properly.  However, it was doing so inside of <script> tags and dying when it encountered the javascript && operator.  Furthermore, the unescaping logic searched from the first ampersand to the first semicolon after that ampersand to decide what escape sequence it was looking at.  Bad news when it was looking at a language descended from C, which uses operators that start with & and statements the end with ;

I put in a ticket at env.js’s lighthouse site, and am currently working on a patch.

Tagged with: , ,