So you want your own data
The AOL search data has spawned a number of websites. Besides the one I mentioned the other day, there are AOL Data Search, AOL Search Logs, and Don’t Delete, which offer similar services. AOL Stalker offers a nice, slick interface and was actually how I had originally thought of implementing it. It not only allows you to search the logs, but also displays what other people are searching for. It’s quite interesting to see what people searched for on AOL, but it is just as telling to see what people look for in those search queries.
Anyway, you can easily set up your own database. The first step is to get the data, which is still available for download. You can either download the 10, 400 MB zipped text files separately or get them through BitTorrent. Once you have the data you can do whatever you want with it: write Perl scripts or examine it in any language you want. The real problem is that the files are huge, and it is hard to move around in a flat file like that. So I would recommend tossing them into a database so you can pull out the information you want, whenever you want, and in reasonable quantities. I wrote a bit of code in Java that just grabs each line from the file and puts it in the database. It is simple and could probably be cleaned up to remove duplicate queries and such. If you really wanted to be efficient, you could start collecting statistics as you were putting the data into the database.
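For anyone who would rather see the idea than download the file, here is a minimal sketch of that kind of line-by-line import. It is not the Java file mentioned in this post. It assumes the log files are tab-separated with five columns (AnonID, Query, QueryTime, ItemRank, ClickURL) and a header row, and the table name `searches`, its column names, and the batch size of 1000 are all made up for illustration. You would still need a JDBC driver on the classpath and a table created to match.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

class AolLogLoader {

    /** One parsed log line; rank and url are null when no result was clicked. */
    static final class Row {
        final int anonId;
        final String query;
        final String time;
        final Integer rank;
        final String url;

        Row(int anonId, String query, String time, Integer rank, String url) {
            this.anonId = anonId;
            this.query = query;
            this.time = time;
            this.rank = rank;
            this.url = url;
        }
    }

    /** Splits one tab-separated line; returns null for the header row or a short line. */
    static Row parseLine(String line) {
        String[] f = line.split("\t", -1); // -1 keeps trailing empty fields
        if (f.length < 3 || f[0].equals("AnonID")) {
            return null;
        }
        Integer rank = (f.length > 3 && !f[3].isEmpty()) ? Integer.valueOf(f[3]) : null;
        String url = (f.length > 4 && !f[4].isEmpty()) ? f[4] : null;
        return new Row(Integer.parseInt(f[0]), f[1], f[2], rank, url);
    }

    /** Reads the file line by line and inserts rows in batches inside one transaction. */
    static void load(Connection conn, String path) throws IOException, SQLException {
        // Hypothetical table and column names; adjust to your own schema.
        String sql = "INSERT INTO searches (anon_id, query, query_time, item_rank, click_url)"
                   + " VALUES (?, ?, ?, ?, ?)";
        conn.setAutoCommit(false);
        // ISO-8859-1 accepts any byte sequence, so odd bytes will not abort the import.
        // The real encoding of the files is an assumption here.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(path), StandardCharsets.ISO_8859_1);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            String line;
            int pending = 0;
            while ((line = in.readLine()) != null) {
                Row r = parseLine(line);
                if (r == null) {
                    continue;
                }
                ps.setInt(1, r.anonId);
                ps.setString(2, r.query);
                ps.setString(3, r.time);
                if (r.rank == null) {
                    ps.setNull(4, Types.INTEGER);
                } else {
                    ps.setInt(4, r.rank);
                }
                if (r.url == null) {
                    ps.setNull(5, Types.VARCHAR);
                } else {
                    ps.setString(5, r.url);
                }
                ps.addBatch();
                if (++pending == 1000) { // flush every 1000 rows
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();
            }
            conn.commit();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: AolLogLoader <jdbc-url> <user> <password> <file>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2])) {
            load(conn, args[3]);
        }
    }
}
```

Batching the inserts and committing once at the end is usually much faster than sending one statement per line, which matters with files this size. Removing duplicates or collecting statistics could be added inside the loop in `load`.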
You can grab the Java file to save some time. You will need to edit it for your configuration and grab MySQL Connector/J or whatever database connector you use. Use, modify, and redistribute it at will; it isn’t anything impressive and probably has bugs. I suggest you import into a local database, otherwise it will take forever to send all the queries over the Internet. If you want to put it on a separate server somewhere, you should probably parse the file into the database locally, dump the database, FTP the dump to the server, and then source it in there. If you don’t have MySQL installed, grab XAMPP; it’s perfect for this kind of stuff.
Speaking of data, I really, really want the Google N-gram data. Too bad it’s $150.
January 21st, 2007 at 4:47 am
Check http://aolpsycho.com, it’s a community project to discuss AOL data.