AOL Search Database

January 13th, 2007

Software is meaningless without data. This is especially true in fields that rely on that data to learn, test, and train the system. I have done some research into machine learning, natural language processing, AI, A-Life, etc., and while the subjects are complex, perhaps the hardest part of it all is obtaining data. Data makes all the difference in the world.

In the middle of last year AOL released about 21 million search queries. When I heard about free data I raced home from work and started downloading it. Later that night I imported it all into a database and made a PHP page to search it. I didn’t have a webhost or anything, so that was about all I could do with it except gather statistics. I thought I would finally toss all that on this server. It isn’t very pretty and I only added about 2.5 million records to the database. AOL Search Database already has a much nicer setup, so I didn’t really see a need to completely duplicate it. I just wanted something up so I could play with it and try to come up with something a bit more interesting.

Here are the stats on the data:

  • 36,389,567 lines of data
  • 21,011,340 search queries
  • 7,887,022 requests for “next page” of results
  • 19,442,629 user click-through events
  • 16,946,938 queries w/o user click-through
  • 657,426 different users

One of the hard parts about gathering ‘statistics’ on the data is that each ‘query’ shows up multiple times. For example, if someone searched for ‘cat’ and then clicked on four links, it would show up four times, even though it is only one query with four click-throughs. If they click ‘next page’, it shows up as a new query with a new time stamp, even though it is the same search. Most of these duplicates can be found by checking the time stamp and keeping track of the rank of the link they clicked on.
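
As a rough sketch of that idea, here is some Python that collapses repeated rows into one search each. It is an illustration, not the PHP behind this site. The `Row` layout and the `collapse` function are made-up names. It also leans on two simplifying assumptions: rows are sorted by user and then by time, and clicks on the same results page share a time stamp, while the same query with a later time stamp is a ‘next page’ request. Real data may well be messier than that.

```python
from collections import namedtuple

# Hypothetical row layout: user id, query text, time stamp,
# rank of the clicked link (None if no click), clicked URL (None if no click).
Row = namedtuple("Row", "user query time rank url")


def collapse(rows):
    """Merge consecutive rows that belong to the same search.

    Assumes rows are ordered by user and then by time.
    """
    searches = []
    current = None
    for row in rows:
        same_search = (
            current is not None
            and current["user"] == row.user
            and current["query"] == row.query
        )
        if not same_search:
            # A different user or query starts a new search.
            current = {
                "user": row.user,
                "query": row.query,
                "last_time": row.time,
                "next_pages": 0,
                "clicked_ranks": [],
            }
            searches.append(current)
        elif row.time != current["last_time"]:
            # Same query with a new time stamp: treat it as a "next page" request.
            current["next_pages"] += 1
            current["last_time"] = row.time
        if row.rank is not None:
            # Remember which result rank was clicked.
            current["clicked_ranks"].append(row.rank)
    return searches


# Made-up sample rows, not taken from the real data.
rows = [
    Row(1, "cat", 100, 1, "a.example"),
    Row(1, "cat", 100, 3, "b.example"),
    Row(1, "cat", 160, None, None),
    Row(1, "dog", 200, None, None),
    Row(2, "cat", 100, 2, "c.example"),
]
searches = collapse(rows)
print(len(rows), "rows ->", len(searches), "searches")
```

With the sample rows, the three ‘cat’ rows from user 1 end up as a single search with two clicks and one ‘next page’ request, so counting searches instead of rows gives a more honest number.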

This is pretty old news by Internet standards, but now it is up and I can play with the search pages, and maybe I can come up with some interesting statistics or experiments that others haven’t done yet. Speaking of which, I like KrazyDad’s statistics and write-up of the data.
