So you want your own data

January 20th, 2007

The AOL search data has spawned a number of websites. Besides the one I mentioned the other day, there are AOL Data Search, AOL Search Logs, and Don’t Delete, which offer similar services. AOL Stalker offers a nice slick interface and is actually how I had originally thought of implementing my own. It not only allows you to search the logs, but also displays what other people are searching for. It’s quite interesting to see what people searched for on AOL, but it is just as telling to see what visitors to these sites look for in those search queries.

Anyways, you can easily set up your own database. The first step is to get the data, which is still available for download. You can either download the ten 400 MB zipped text files separately or grab them through BitTorrent. Once you have the data you can do whatever you want with it: write Perl scripts or examine it in any language you like. The real problem is that the files are huge, and it is hard to move around in a flat file of that size. So I would recommend tossing them in a database so you can pull out the information you want, whenever you want, and in reasonable quantities. I wrote a bit of code in Java that just grabs each line from the file and puts it in the database. It is simple and could probably be cleaned up to remove duplicate queries and such. If you really wanted to be efficient you could start collecting statistics as you were putting the data into the database.
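To give a rough idea of what such an importer looks like, here is a minimal sketch. It is not the Java file mentioned in this post. It assumes the files have been unzipped and that each line holds five tab-separated columns (AnonID, Query, QueryTime, ItemRank, ClickURL) with a header row; the table name `searches`, its column names, and the command-line arguments are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

class AolImport {
    // Assumed layout: AnonID <TAB> Query <TAB> QueryTime <TAB> ItemRank <TAB> ClickURL.
    // Returns five columns (empty or missing ones become null), or null for
    // the header row and for lines that are too short to be a record.
    static String[] parseLine(String line) {
        String[] raw = line.split("\t", -1);
        if (raw.length < 3 || raw[0].isEmpty() || raw[0].equals("AnonID")) {
            return null;
        }
        String[] cols = new String[5];
        for (int i = 0; i < 5; i++) {
            cols[i] = (i < raw.length && !raw[i].isEmpty()) ? raw[i] : null;
        }
        return cols;
    }

    // Inserts every record of one file into a hypothetical "searches" table.
    static void importFile(String path, Connection conn) throws IOException, SQLException {
        String sql = "INSERT INTO searches "
                   + "(anon_id, query, query_time, item_rank, click_url) VALUES (?, ?, ?, ?, ?)";
        conn.setAutoCommit(false); // commit once at the end instead of once per row
        // ISO-8859-1 maps every byte to a character, so odd bytes cannot abort the read.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(path), StandardCharsets.ISO_8859_1);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            String line;
            int pending = 0;
            while ((line = in.readLine()) != null) {
                String[] c = parseLine(line);
                if (c == null) {
                    continue;
                }
                ps.setInt(1, Integer.parseInt(c[0]));
                ps.setString(2, c[1]);
                ps.setString(3, c[2]);
                if (c[3] == null) {
                    ps.setNull(4, Types.INTEGER); // no click-through on this line
                } else {
                    ps.setInt(4, Integer.parseInt(c[3]));
                }
                ps.setString(5, c[4]);
                ps.addBatch();
                if (++pending == 1000) { // send rows to the database in batches
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();
            }
            conn.commit();
        }
    }

    // Hypothetical usage: java AolImport <file> <jdbc-url> <user> <password>
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(args[1], args[2], args[3])) {
            importFile(args[0], conn);
        }
    }
}
```

Batching the inserts and committing once per file is just one way to keep the import from crawling. The database driver still has to be on the classpath, and the table has to exist before you run it.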

You can grab the Java file to save some time; you will need to edit it for your configuration and grab the MySQL connector or whatever database connector you use. Use, modify, and redistribute it at will; it isn’t anything impressive and probably has bugs. I suggest you import into a local database, otherwise it will take forever to send all the queries over the Internet. If you want to put it on a separate server somewhere, you should probably parse the file into the database locally, dump the database, FTP the dump to the server, and then source it in. If you don’t have MySQL installed, grab XAMPP; it’s perfect for this kind of stuff.

Speaking of data, I really, really want the Google N-gram data; too bad it’s $150.

AOL Search Database

January 13th, 2007

Software is meaningless without data. This is especially true in fields that rely on that data to learn, test, or train the system. I have done some research into machine learning, natural language processing, AI, A-Life, etc., and while the subjects are complex, perhaps the hardest part of it all is obtaining data. Data makes all the difference in the world.

In the middle of last year AOL released about 21 million search queries. When I heard about free data I raced home from work and started downloading it. Later that night I imported it all into a database and made a PHP page to search it. I didn’t have a webhost or anything so that was about all I could do with it except gather statistics. I thought I would finally toss all that on this server. It isn’t very pretty and I only added about 2.5 million records to the database. AOL Search Database already has a much nicer setup so I didn’t really see a need to completely duplicate it. I just wanted something up so I could play with it and try to come up with something a bit more interesting with it.

Here are the stats on the data:

  • 36,389,567 lines of data
  • 21,011,340 search queries
  • 7,887,022 requests for “next page” of results
  • 19,442,629 user click-through events
  • 16,946,938 queries w/o user click-through
  • 657,426 different users

One of the hard parts about gathering ‘statistics’ on the data is that each ‘query’ shows up multiple times. For example, if someone searched for ‘cat’ and then clicked on four links, it would show up four times, even though it is only one query with four click-throughs. If they click the ‘next page’ link, it shows up as a new query with a new time stamp, even though it is the same search. Most of these items can be found by checking the time stamp and keeping track of the rank of the link they clicked on.
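As a rough illustration of that clean-up, the sketch below folds consecutive lines from the same user with the same query text into one search, counting clicks (lines that carry a rank) and later requests (lines with a new time stamp). It is only a heuristic under a few assumptions: the rows arrive grouped by user and ordered by time, and a repeated query with a new time stamp is a ‘next page’ request rather than someone deliberately running the same search again. The class and field names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class QueryCollapser {
    // One line of the log; rank is null when the line has no click-through.
    static final class Row {
        final int anonId;
        final String query;
        final String time;
        final Integer rank;

        Row(int anonId, String query, String time, Integer rank) {
            this.anonId = anonId;
            this.query = query;
            this.time = time;
            this.rank = rank;
        }
    }

    // One logical search, built from one or more consecutive rows.
    static final class Search {
        final int anonId;
        final String query;
        String lastTime;
        int clicks;        // rows that carried a clicked rank
        int laterRequests; // rows with the same query but a new time stamp

        Search(int anonId, String query, String time) {
            this.anonId = anonId;
            this.query = query;
            this.lastTime = time;
        }
    }

    // Assumes rows are grouped by user and ordered by time.
    static List<Search> collapse(List<Row> rows) {
        List<Search> out = new ArrayList<>();
        Search cur = null;
        for (Row r : rows) {
            boolean sameSearch = cur != null
                    && cur.anonId == r.anonId
                    && cur.query.equals(r.query);
            if (!sameSearch) {
                // A different user or different query text starts a new search.
                cur = new Search(r.anonId, r.query, r.time);
                out.add(cur);
            } else if (!cur.lastTime.equals(r.time)) {
                // Same query, new time stamp: treated as a "next page" request.
                cur.laterRequests++;
                cur.lastTime = r.time;
            }
            if (r.rank != null) {
                cur.clicks++;
            }
        }
        return out;
    }
}
```

With this, the ‘cat’ example (four lines with the same time stamp, each carrying a clicked rank) comes out as a single search with four clicks.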

This is pretty old news by Internet standards, but now it is up and I can play with the search pages, and maybe I can come up with some interesting statistics or experiments that others haven’t done yet. Speaking of which, I like KrazyDad’s statistics and write-up of the data.
