So you want your own data

January 20th, 2007

The AOL search data has spawned a number of websites. Besides the one I mentioned the other day, there are AOL Data Search, AOL Search Logs, and Don’t Delete, which offer similar services. AOL Stalker offers a nice slick interface and was actually how I had originally thought of implementing it. It not only allows you to search the logs, but also displays what other people are searching for. It’s quite interesting to see what people search for on AOL, but it is just as telling to see what people look for in those search queries.

Anyway, you can easily set up your own database. The first step is to get the data, which is still available for download. You can download the 10 zipped text files (roughly 400 MB) separately or grab them through BitTorrent. Once you have the data you can do whatever you want with it: write Perl scripts or examine it in any language you like. The real problem is that the files are huge, and it is hard to move around in a flat file like that. So I would recommend tossing them in a database so you can pull out the information you want, whenever you want, and in reasonable quantities. I wrote a bit of code in Java that just grabs each line from the file and puts it in the database. It is simple and could probably be cleaned up to remove duplicate queries and such. If you really wanted to be efficient you could start collecting statistics as you put the data into the database.
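For what it's worth, here is a rough sketch of that kind of line-by-line import. It is written in Python against the built-in sqlite3 module rather than Java and MySQL, purely so it runs without any setup, so treat it as an illustration and not as the Java file mentioned in this post. The column layout (AnonID, Query, QueryTime, ItemRank, ClickURL, tab-separated, with a header row) is an assumption about the log files, and the table name `searches` and the function names are made up for the example.

```python
import sqlite3


def parse_line(line):
    """Split one tab-separated log line into a 5-tuple, or return None to skip it.

    Assumed layout: AnonID, Query, QueryTime, ItemRank, ClickURL.
    """
    fields = line.rstrip("\r\n").split("\t")
    if not fields[0].isdigit():
        return None  # header row or malformed line
    fields += [""] * (5 - len(fields))  # pad rows that carry no click data
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return (
        int(anon_id),
        query,
        query_time,
        int(item_rank) if item_rank.isdigit() else None,
        click_url or None,
    )


def import_lines(conn, lines, batch_size=10000):
    """Insert every parseable line into a 'searches' table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS searches ("
        "anon_id INTEGER, query TEXT, query_time TEXT, "
        "item_rank INTEGER, click_url TEXT)"
    )
    insert = "INSERT INTO searches VALUES (?, ?, ?, ?, ?)"
    batch, total = [], 0
    for line in lines:
        row = parse_line(line)
        if row is None:
            continue
        batch.append(row)
        if len(batch) >= batch_size:  # insert in batches to keep memory flat
            conn.executemany(insert, batch)
            total += len(batch)
            batch.clear()
    if batch:
        conn.executemany(insert, batch)
        total += len(batch)
    conn.commit()
    return total


if __name__ == "__main__":
    # Hypothetical file name; point this at one of the unzipped log files.
    connection = sqlite3.connect("aol.db")
    with open("user-ct-test-collection-01.txt", encoding="utf-8", errors="replace") as log:
        print(import_lines(connection, log), "rows imported")
```

Passing the open file straight to `import_lines` means the file is read one line at a time, so it never has to fit in memory. Batching the inserts is the main thing that keeps the import from crawling.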

You can grab the Java file to save some time. You will need to edit it for your configuration and grab MySQL Connector/J or whatever database connector you use. Use, modify, and redistribute it at will; it really isn’t anything impressive and probably has bugs. I suggest you import into a local database, otherwise it will take forever to send all the queries over the Internet. If you want to put it on a separate server somewhere, you should probably parse the file into the database locally, dump the database, FTP it to the server, and then source it in. If you don’t have MySQL installed, grab XAMPP; it’s perfect for this kind of stuff.

Speaking of data, I really, really want the Google N-gram data. Too bad it’s $150.

Some changes

January 18th, 2007

I grabbed some plugins and made some small changes. I installed a random quote plugin so that I could do something with all the random quotes I have in various text files all over my computers. So far I have only put in one quote so that would be why it doesn’t change.

I also installed the Wordpress Ultimate Gamer’s Pack, which is supposed to render the site nicely for the PSP, Wii and DS.

Finally, I used the Mii editor over at Joystiq to create a nice icon for my about page.

AOL Search Database

January 13th, 2007

Software is meaningless without data. This is especially true in fields that rely on that data to learn/test/train the system. I have done some research into machine learning, natural language processing, AI, A-Life, etc., and while the subjects are complex, perhaps the hardest part of it all is obtaining data. Data makes all the difference in the world.

In the middle of last year AOL released about 21 million search queries. When I heard about free data I raced home from work and started downloading it. Later that night I imported it all into a database and made a PHP page to search it. I didn’t have a webhost or anything so that was about all I could do with it except gather statistics. I thought I would finally toss all that on this server. It isn’t very pretty and I only added about 2.5 million records to the database. AOL Search Database already has a much nicer setup so I didn’t really see a need to completely duplicate it. I just wanted something up so I could play with it and try to come up with something a bit more interesting with it.

Here are the stats on the data:

  • 36,389,567 lines of data
  • 21,011,340 search queries
  • 7,887,022 requests for “next page” of results
  • 19,442,629 user click-through events
  • 16,946,938 queries w/o user click-through
  • 657,426 different users

One of the hard parts about gathering ‘statistics’ on the data is that each ‘query’ shows up multiple times. For example, if someone searched for ‘cat’ and then clicked on four links, it would show up four times, even though it is only one query with four click-throughs. If they click the ‘next page’ link, it shows up as a new query with a new time stamp, even though it is the same search. Most of these items can be found by checking the time stamp and keeping track of the rank of the link they clicked on.
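A minimal sketch of that bookkeeping might look like the following. It is only a heuristic under a few assumptions: the rows are already sorted by user and then by time, a repeated row with the same user, query, and time stamp is another click on the same results, and the same query with a new time stamp is a ‘next page’ request. The tuple layout and the `summarize` name are made up for the example, and real data will surely have cases this misses (someone honestly repeating a search later, for instance).

```python
from collections import Counter


def summarize(rows):
    """Count new queries, next-page requests, and click-throughs.

    rows: iterable of (anon_id, query, query_time, item_rank) tuples,
    assumed to be sorted by anon_id and then by query_time.
    item_rank is None when the row has no click-through.
    """
    counts = Counter()
    previous = None  # (anon_id, query, query_time) of the previous row
    for anon_id, query, query_time, item_rank in rows:
        if item_rank is not None:
            counts["click_throughs"] += 1
        if previous is None or previous[:2] != (anon_id, query):
            # A different user or different query text starts a new search.
            counts["new_queries"] += 1
        elif previous[2] != query_time:
            # Same user and query, new time stamp: treated as "next page".
            counts["next_page_requests"] += 1
        # Same user, query, and time stamp: another click on the same results.
        previous = (anon_id, query, query_time)
    return counts


if __name__ == "__main__":
    sample = [
        (1, "cat", "2006-03-01 07:17:12", 1),
        (1, "cat", "2006-03-01 07:17:12", 2),
        (1, "cat", "2006-03-01 07:19:40", None),
        (2, "dog", "2006-03-01 08:00:00", None),
    ]
    print(dict(summarize(sample)))
```

The same counting could be done while the rows are being inserted into the database, which saves a second pass over the data.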

This is pretty old news by Internet standards, but now it is up and I can play with the search pages, and maybe I can come up with some interesting statistics or experiments that others haven’t done yet. Speaking of which, I like KrazyDad’s statistics and write-up of the data.

Just to finish it up

January 10th, 2007

I found a host, learned CSS, fell in love with MODx and built my perfect site. Except not. As I said before, that didn’t quite happen. I couldn’t get MODx to do exactly what I wanted. At this point I just wanted to launch the site, so I started looking at what other blogs use. It seems that if you are not using Blogger, then you are using Wordpress. As it turns out, DreamHost has Wordpress as a ‘one-click’ install. I clicked it, installed Wordpress, and decided to ignore the imperfections. It would work.

I ported my CSS design from my MODx setup and created a custom theme for Wordpress fairly easily. In fact everything was fairly easy and ‘just worked.’ I feel lame for not sticking with MODx and one day I hope to go back to it, but for now I will stick with Wordpress and thus the site was born.

Alright, I will try to start posting some much more interesting items soon.

Oh and APPLE your iPhone kills me!!! I didn’t flinch at the price. The Cingular exclusive bugged me, but I understand why. But I can’t ever go back to a phone with a camera on it!! I suppose come June I will go to the Apple Store and cry over beautiful technology containing an unforgivable flaw.

Inspiration from CSS

January 7th, 2007

I have seen CSS used all over the place, but have never bothered to even glance at it. It was just always there. Unfortunately it is an integral part of a website now so I had to learn it. I began my journey where I begin every journey…the Internet. Typically I learn new languages entirely on the Internet. I love my books though so after I learn what I need to know I spend a bunch of money on books about the language…without learning anything new. I just have to have the books though. It is also nicer to grab a book, turn it to the page you need, and use that as a reference, rather than switch to a web browser and try to remember which website has the best writeup of x feature that you don’t exactly remember the specifics of. One of these days I’ll get a dual LCD setup.

Luckily CSS has great Internet references; the most useful is at W3Schools. However, to figure out exactly what I can do with CSS and how to do it, I read some articles at A List Apart and Eric Meyer’s css/edge. Eric Meyer is where I got the idea for the & symbol in the background throughout the site. I visited the CSS Zen Garden and thought to myself, “I guess this is a pretty page.” Then I looked elsewhere, completely missing the point of the site. That is, until I was rummaging around at Barnes & Noble and came across The Zen of CSS Design.

I didn’t really know how I wanted my site to look, and reading The Zen of CSS Design was amazing. It doesn’t exactly teach you CSS, but it shows tons of examples of how to use it. The way it presented the information was perfect and I ended up learning most of what I know about CSS from it. It also explains a lot of aspects of web design beyond CSS: contrast, minimalism, how to make a good user interface, browser hacks, etc. Most importantly it is full of ideas and inspiration. After you read it, it makes a beautiful coffee table book.

Now I am no artist; in fact I am horrible at pretty much anything artistic (this includes writing). The art on this site basically consisted of typing the & symbol into Photoshop and changing the colors. This was something my limited artistic talent could do and I think it turned out well. It is a little busier than most blogs that I like to read. I am definitely prone to reading extremely simple-looking blogs: white page, black text over a plain unobtrusive background, and a busy side bar that I usually ignore. But I like the way the & symbol looks. Most people read blogs through RSS now anyways.

There are still a lot of things I don’t like about the site, but most of it has to do with the way the pages interact with each other, not the design. This site is also entirely geared towards Opera/Firefox. I have barely glanced at it in IE. Unless I decide that I can somehow make a profit on this site, or that the information here is so interesting and useful everyone should have access to it, I probably won’t worry about how it looks in IE. People shouldn’t be using it anyways. I may even put a redirection page at the beginning that says the site won’t work in IE and you should download Opera, regardless of whether it works in IE or not. Sort of revenge for all the Firefox/IE sites that don’t support Opera (even though Opera works fine in most of them).

I’m sure I will fiddle with the CSS and design more, especially as I start learning more about Wordpress. The beauty of CSS is that I can leave the page the same and then simply add new style sheets to give it a completely different look, so at some point I may create alternate CSS designs (perhaps a ‘simpler’ design for people to use). I am definitely a fan of CSS now.

I am definitely interested in what other people think about it though, so send me your criticism.

Finding a host

January 2nd, 2007

I couldn’t very well host a site on my puny cable connection with my dual PIII 450 Linux box. I love my little box and I’m sure it had the power to host the limited views the site would get, but I couldn’t spare the upload bandwidth. I found that most sites offering cheap web hosting were just that. Cheap web hosting. You had to pay a lot extra to get any additional features. What I really wanted was shell access. Shell access costs money, lots of it. Any place that even offered shell access was insanely expensive for the amount of bandwidth and disk space that was available. Then there were CPU usage restrictions...

Then I found DreamHost. They offered more disk space, more bandwidth, and more features than I could ever ask for. DreamHost lets me host an unlimited number of websites and I can use it for all my programming projects. I just log in and write some Perl scripts, compile some C code, etc. I can write some code on my computer and use their SQL databases, so I can test my programs from any computer without setting up MySQL on everything. It’s great.

Shameless plug: If you are interested in hosting, I don’t have a bad thing to say about these guys, nor have I read anything bad about their hosting service. Use the promo code ANDAMP40 and get $40 off your purchase, and I’ll be able to get some money off my hosting costs as well. They even offer a 97-day full refund if you don’t like it.
