Two Guys Arguing

One Infinite Loop

Posted in common lisp by youngnh on 08.03.09

I wanted to write a small bot to scrape my netflix queue from their website, but making a simple request to their homepage redirected in a loop forever.

I’ve used Edi Weitz’s Drakma library for this sort of thing before and I really like it. Here was the code that I wrote:

(let ((cookies (make-instance 'cookie-jar)))
  (http-request "https://www.netflix.com/"
                :method :get
                :cookie-jar cookies))

Requesting http://www.netflix.com would result in being redirected to http://www.netflix.com/Default?tcw=1&cqs=, which would result in being redirected back to netflix.com and so on forever. Drakma actually handled the situation very gracefully. After 6 redirect attempts, it threw an exception saying it had exceeded its redirection limit. I just couldn’t figure out why.

I tried messing with my user agent. Writing scrapers has always made me a little paranoid that the developers on the other end know what I’m up to, are locked in a struggle-to-the-death to prevent my unauthorized uses and of course use the most sophisticated and obvious option available to them: my User-Agent header. Drakma has a remarkably convenient built-in facility for spoofing major browser versions. It’s literally a 20-character change to the function call. I tried that. No dice. Apparently the Netflix developers are as overworked as everyone else and bots just aren’t a big enough problem to justify snuffing them out by filtering User-Agent headers.
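For reference, the spoofing change is Drakma's :user-agent keyword argument, which accepts a browser keyword such as :firefox as well as a literal string. The sketch below assumes Quicklisp is available to load Drakma, and fetch-as-firefox is a made-up name used only for illustration:

```lisp
;; Assumption: Quicklisp is installed; it is only used to load Drakma.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :drakma))

;; FETCH-AS-FIREFOX is a hypothetical helper, not part of Drakma.
(defun fetch-as-firefox (url)
  "Request URL while presenting a Firefox User-Agent string."
  (drakma:http-request url
                       :method :get
                       :cookie-jar (make-instance 'drakma:cookie-jar)
                       ;; The 20 characters in question:
                       :user-agent :firefox))
```

Calling (fetch-as-firefox "http://www.netflix.com/") would send the same request as before with only the User-Agent header changed.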

The redirection eventually terminated in Firefox. Otherwise I and millions of other users wouldn’t be able to see the page at all. Maybe Firefox has a built-in limit to redirection, at which point it just says “fuck it, I’m just gonna go with this page”. But that doesn’t make any sense either, because 302s don’t come with any HTML to display just in case the browser decides it’s tired of bouncing from page to page.

Firebug’s net panel showed that with cookies on, I didn’t redirect at all, netflix.com returned a 200 OK. With cookies off, I redirected once to /Default, then back to the homepage and that time, a 200 came back with gloriously renderable HTML.

I sent Drakma’s HTTP headers to *standard-output* with a handy little one-liner:

(setq *header-stream* *standard-output*)
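The setq above changes the variable globally, so every later request keeps printing headers. A scoped alternative is to bind the variable with let around a single call. This is a minimal sketch, assuming Quicklisp is available to load Drakma; trace-request is a hypothetical name:

```lisp
;; Assumption: Quicklisp is installed; it is only used to load Drakma.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :drakma))

;; TRACE-REQUEST is a hypothetical helper, not part of Drakma.
(defun trace-request (url)
  "Fetch URL, echoing request and response headers to *standard-output*."
  ;; *header-stream* is a special variable, so this binding only lasts
  ;; for the dynamic extent of the LET.
  (let ((drakma:*header-stream* *standard-output*))
    (drakma:http-request url
                         :cookie-jar (make-instance 'drakma:cookie-jar))))
```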

Here’s what I saw:

GET / HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:31 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/Default?tcw=1&cqs=
Content-Length: 0
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=X; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixSession=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:32 GMT; Path=/
Set-Cookie: asearch=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=93
Connection: Keep-Alive
Content-Type: text/plain
Set-Cookie: NSC_xxx.ofugmjy.dpn=XXX;path=/

GET /Default?tcw=1&cqs= HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Cookie: NSC_xxx.ofugmjy.dpn=XXX; asearch=XXX; NetflixCookies=XXX; NetflixSession=XXX; nflxsid=XXX; country=XXX; VisitorId=XXX
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/
Content-Length: 0
Set-Cookie: lastHitTime=XXX; Domain=.netflix.com; Path=/
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:32 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=84
Connection: Keep-Alive
Content-Type: text/plain

GET / HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/Default?tcw=1&cqs=
Content-Length: 0
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixSession=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:32 GMT; Path=/
Set-Cookie: asearch=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=69
Connection: Keep-Alive
Content-Type: text/plain
Set-Cookie: NSC_xxx.ofugmjy.dpn=XXX;path=/

GET /Default?tcw=1&cqs= HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Cookie: lastHitTime=XXX; NSC_xxx.ofugmjy.dpn=XXX; asearch=XXX; NetflixCookies=XXX; NetflixSession=XXX; nflxsid=XXX; country=XXX; VisitorId=XXX
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/
Content-Length: 0
Set-Cookie: lastHitTime=XXX; Domain=.netflix.com; Path=/
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:33 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:33 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=92
Connection: Keep-Alive
Content-Type: text/plain

GET / HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/Default?tcw=1&cqs=
Content-Length: 0
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:33 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixSession=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:33 GMT; Path=/
Set-Cookie: asearch=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:33 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=93
Connection: Keep-Alive
Content-Type: text/plain
Set-Cookie: NSC_xxx.ofugmjy.dpn=XXX;path=/

GET /Default?tcw=1&cqs= HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Cookie: lastHitTime=XXX; NSC_xxx.ofugmjy.dpn=XXX; asearch=XXX; NetflixCookies=XXX; NetflixSession=XXX; nflxsid=XXX; country=XXX; VisitorId=XXX
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/
Content-Length: 0
Set-Cookie: lastHitTime=XXX; Domain=.netflix.com; Path=/
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:33 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:33 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=15
Connection: Keep-Alive
Content-Type: text/plain

Huh, the issue was in Drakma. I would request their homepage with no cookies, and Netflix would respond with a redirect to a different URL and a bunch of Set-Cookie headers. Drakma would follow, dutifully passing a Cookie header, and be redirected back to the original URL, also with Set-Cookie headers. Drakma would again follow, but this time not include cookies in the request.
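One way to narrow this down would be to turn off Drakma's automatic redirect handling and follow each hop by hand, passing the same cookie jar every time. The sketch below is a diagnostic aid, not a confirmed fix. It assumes Quicklisp is available to load Drakma and that the Location header is an absolute URL, as it is in the trace above. fetch-step-by-step is a hypothetical name:

```lisp
;; Assumption: Quicklisp is installed; it is only used to load Drakma.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :drakma))

;; FETCH-STEP-BY-STEP is a hypothetical helper, not part of Drakma.
(defun fetch-step-by-step (url &key (max-hops 5))
  "Follow redirects manually, reusing one cookie jar for every request."
  (let ((jar (make-instance 'drakma:cookie-jar)))
    (loop repeat (1+ max-hops)
          do (multiple-value-bind (body status headers)
                 ;; :redirect nil makes Drakma return the 302 itself.
                 (drakma:http-request url :cookie-jar jar :redirect nil)
               (format t "~&~A -> ~D (~D cookies in jar)~%"
                       url status
                       (length (drakma:cookie-jar-cookies jar)))
               ;; Response headers come back as an alist keyed by keyword.
               (let ((location (cdr (assoc :location headers))))
                 (if (and (member status '(301 302 303 307)) location)
                     (setf url location) ; assumes an absolute Location
                     (return (values body status)))))
          finally (error "Gave up after ~D requests" (1+ max-hops)))))
```

If the jar reports cookies on every hop and a request still goes out without a Cookie header, that would point at how the jar's contents are matched against the request rather than at the redirect logic.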

I think Netflix is doing the redirects to ‘prime the pump’. If you don’t send cookies to the second URL, you get redirected to a “you need to turn cookies on” page. If you do, Netflix figures you’ll end up back at their homepage with a bunch of initialized cookies. It’s weird behavior on the part of Drakma that’s breaking the system.
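The handshake can be modelled without any networking. The sketch below is a toy: toy-server and follow are made-up names, the server rules are a guess based on the trace above, and the “turn cookies on” page is simplified to a plain 200. It shows a client that sends cookies on every request after the first finishing in three requests, while a client that only sends them on every other request (as in the trace) never gets out:

```lisp
;; Toy model of the redirect handshake. The "server" is a pure function
;; of the request path and whether a Cookie header was sent.
(defun toy-server (path cookies-sent-p)
  "Return (values status location) for a request to PATH."
  (cond ((and (string= path "/") cookies-sent-p)
         (values 200 nil))          ; cookies primed: serve the page
        ((string= path "/")
         (values 302 "/Default"))   ; no cookies: set them and bounce
        ((and (string= path "/Default") cookies-sent-p)
         (values 302 "/"))          ; cookies came back: send home
        (t
         (values 200 nil))))        ; stand-in for the cookies-off page

(defun follow (send-cookies-p &key (limit 5))
  "Follow redirects starting at \"/\". SEND-COOKIES-P is called with the
0-based request number and says whether that request carries cookies.
Returns the final status and the number of requests made."
  (loop with path = "/"
        for n from 0 to limit
        do (multiple-value-bind (status location)
               (toy-server path (funcall send-cookies-p n))
             (if (= status 302)
                 (setf path location)
                 (return (values status (1+ n)))))
        finally (return (values :limit-exceeded (1+ limit)))))

;; Well-behaved client: cookies on every request after the first.
(follow (lambda (n) (>= n 1)))   ; => 200, 3

;; Client from the trace: cookies only on the /Default requests.
(follow #'oddp)                  ; => :LIMIT-EXCEEDED, 6
```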

I wonder who to talk to about this? Edi Weitz? #lisp?


3 Responses

  1. benjaminplee said, on 08.03.09 at 8:57 am

    I wonder whether, if you wrote a really simple HTTP server that mimicked Netflix’s behavior, you could verify that it isn’t because of some strange header setup with the client. You might be able to verify that Drakma doesn’t resend cookies during the same session as a rule.

    You also might try using something like WebScarab to “pause” the responses as they come back to the browser and/or your Drakma client and try removing various headers. It might have something to do with how your Drakma client handles (or doesn’t handle) Keep-Alive connections or Cache-Control: private. Also, if you injected those cookies manually you could at least verify that Drakma handles the response correctly.

    Just some thoughts. Good post.

  2. benjaminplee said, on 08.03.09 at 2:25 pm

    This is probably just a messed up copy/paste job but I also noticed that your second GET contains a cookie: NSC_xxx.ofugmjy.dpn which isn’t set on the first response nor is it sent with the initial request.

    One more thing to look into. Good luck.

  3. youngnh said, on 08.03.09 at 4:55 pm

    So, the NSC_xxx.ofugmjy.dpn cookie gets set as the last header of the first 302 response.

    I really think that Drakma did something clever and functional, like reusing the output from a previous request or improperly binding a function in a closure, instead of something boring and stateful and imperative like using a loop.

    My point was supposed to be that I worked myself into a tizzy over the Netflix server’s behavior before I realized that my tooling was probably to blame.

    I’ll dip into the source code later this month and hopefully post back here with an update (and a fix?).

