Wednesday, December 8, 2010

cookie monster

So I had an interesting web scraping issue with systems behind a cookie-based "paywall". Turns out you can quickly and simply extract a working cookie file from Chrome's sqlite3 database. There is a whole bunch of stuff scattered across the web with bits and pieces of how to do this and, as with all things, there are a million ways to do it.

So assume you have carefully auth'd to the site using Chrome. You can then delve into the cookie database and use its contents with curl etc. later on. There are also ways to embed UID/PASSWD inside curl/wget command lines. However, plain text on the CLI is always something that gives me the heebie jeebies ;-)

Here's what I found (oh, and cygwin on windows is totally awesome! Don't try and run windows without it ;-))

First up, the Chrome schema. You can find the database file in:


C:\Documents and Settings\user\Local Settings\Application Data\Google\Chrome\User Data\Default\Cookies

PRAGMA table_info() is the sqlite magic to get at the schema (the output columns are cid, name, type, notnull, dflt_value and pk):
$ sqlite3.exe /cygdrive/c/Documents\ and\ Settings/user/Local\ Settings/Application\ Data/Google/Chrome/User\ Data/Default/Cookies "PRAGMA table_info(cookies)"
0|creation_utc|INTEGER|99||1
1|host_key|TEXT|99||0
2|name|TEXT|99||0
3|value|TEXT|99||0
4|path|TEXT|99||0
5|expires_utc|INTEGER|99||0
6|secure|INTEGER|99||0
7|httponly|INTEGER|99||0
8|last_access_utc|INTEGER|99||0

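Next, the cookie file that curl reads needs to be tab separated (TSV), and because of this documented sqlite3 issue with \t as a separator: http://old.nabble.com/.separator-/t-not-working-td18622714.html you have to be a little careful about how the tab gets in there. You can either:

1) Use CTRL-V + TAB to insert an actual tab between the quotes of the sqlite3.exe -separator '' cli option
2) Or, more twisted, plonk a quick sed 's/,/\t/g' on the end of a sqlite3.exe -csv command line (note this will mangle any cookie value that itself contains a comma)
3) But it turns out a really simple sqlite3.exe -separator $'\t' works just fine, since bash expands $'\t' to a real tab before sqlite3 ever sees it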
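If your shell does not understand the $'\t' quoting (a plain POSIX sh, for example), one more way to get a real tab is to capture it from printf. This is only a sketch; the names TAB and show_sep are made up for illustration and are not part of sqlite3 or cygwin.

```shell
# Capture a literal tab without relying on bash's $'\t' quoting.
# (TAB and show_sep are hypothetical names used only in this sketch.)
TAB=$(printf '\t')

# Dump the separator character by character so you can see what it really is.
show_sep() {
  printf '%s' "$1" | od -An -c | tr -d ' \n'
}

show_sep "$TAB"   # prints \t
echo

# The variable can then be handed to sqlite3 in place of $'\t', e.g.:
#   sqlite3.exe -separator "$TAB" <path to Cookies> "<query>"
```

Command substitution only strips trailing newlines, so the tab survives the assignment.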

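This example extracts just the cookies for a particular host you are interested in. You can also dump the whole DB by removing the 'where ... like' piece. The seven columns follow the Netscape cookie file layout that curl reads (domain, subdomain flag, path, secure flag, expiry, name, value), with the flag, path and secure fields simply hard-coded here:
$ sqlite3.exe -separator $'\t' \
    /cygdrive/c/Documents\ and\ Settings/user/Local\ Settings/Application\ Data/Google/Chrome/User\ Data/Default/Cookies \
    "select host_key, 'TRUE', '/', 'FALSE', expires_utc, name, value from cookies where host_key like '%jcuff.net%'" > cookie.txt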
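One wrinkle worth knowing about: Chrome stores expires_utc as microseconds since 1601-01-01 UTC, while the Netscape cookie file format expects seconds since 1970-01-01. The raw value evidently still did the job here (it just looks like a date absurdly far in the future), but if you would rather have honest expiry times, the conversion could be sketched roughly like this. The helper names chrome_to_unix and fix_expiry are hypothetical, and the sketch assumes the seven-column layout produced by the query above.

```shell
# Convert a Chrome timestamp (microseconds since 1601-01-01 UTC)
# to Unix epoch seconds. chrome_to_unix is a hypothetical helper name.
chrome_to_unix() {
  if [ "$1" -eq 0 ]; then
    # 0 marks a session cookie; pass it through unchanged.
    echo 0
  else
    # 11644473600 = seconds between 1601-01-01 and 1970-01-01.
    echo $(( $1 / 1000000 - 11644473600 ))
  fi
}

# Read seven tab-separated fields per line from stdin and rewrite the
# fifth one (the expiry). Assumes no field other than the last is empty.
fix_expiry() {
  tab=$(printf '\t')
  while IFS="$tab" read -r host flag path secure expires name value; do
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
      "$host" "$flag" "$path" "$secure" \
      "$(chrome_to_unix "$expires")" "$name" "$value"
  done
}
```

Usage would be something like: fix_expiry < cookie.txt > cookie-unix.txt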

Then you can simply use curl with this freshly formatted cookie.txt file to go about your scripted scrapings!
curl -L --cookie ./cookie.txt http://mypaywalledsite.net/
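If curl appears to ignore the cookies, one thing worth ruling out is a separator that never actually became a tab. A quick sanity check might look like the sketch below; check_cookie_tsv is a hypothetical helper (not something curl or sqlite3 ships), and it only counts fields, it does not validate their contents.

```shell
# Report any line of a cookie file that does not have exactly seven
# tab-separated fields. check_cookie_tsv is a hypothetical helper name.
check_cookie_tsv() {
  awk -F '\t' '
    /^#/ || /^$/ { next }    # skip lines starting with # and blank lines
    NF != 7 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
    END { exit bad }         # non-zero exit status if anything looked wrong
  ' "$1"
}
```

Run it as check_cookie_tsv ./cookie.txt; no output and a zero exit status means every line has seven fields.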
QED ;-)

[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff