On Aug 4, 2006, AOL released the search histories of more than 650,000 of its users (21 million queries) on its new research website. Although the data was stripped of personal identifiers, it still made privacy advocates extremely upset. AOL issued an apology a few days later and yanked the data from its site, but it had already been replicated.
For a researcher involved in information retrieval, this data is a gold mine. Most researchers don’t have access to data like this. Unless you work for a search engine, you have to rely on search data from your institution or beg for it from other locations.
On the other hand, some search data could be linked to specific individuals, and I can see why that would be alarming to some. Perhaps there’s a middle ground? What if location data could be randomly swapped? For example, a search for “boston hair cut” could be changed to “denver hair cut”. Although this would make the location information worthless, all the other important information (query length, word length, subject matter) would still be present. Other heuristics could be applied to muddle the location. Of course this doesn’t address all the privacy issues, but it’s a start.
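To make the swapping idea concrete, here is a minimal sketch in Python. It assumes a tiny, hypothetical list of single-word city names (`CITIES`) and a made-up function name (`swap_locations`). A real version would need a full gazetteer and would have to handle multi-word places, zip codes, street names, and so on.

```python
import random
import re

# Hypothetical gazetteer for illustration only; real data would need far more places.
CITIES = ["boston", "denver", "seattle", "atlanta", "phoenix"]

# Match any known city as a whole word, ignoring case.
_CITY_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CITIES) + r")\b",
    re.IGNORECASE,
)

def swap_locations(query, rng=random):
    """Replace each known city in the query with a different, randomly chosen city."""
    def pick(match):
        current = match.group(0).lower()
        # Exclude the original city so the real location never survives the swap.
        return rng.choice([c for c in CITIES if c != current])
    return _CITY_RE.sub(pick, query)

# Seeded generator so a run is repeatable.
print(swap_locations("boston hair cut", random.Random(0)))
```

Because every city in this toy list is a single word, the number of words in the query is unchanged and only the place name differs. Queries with no recognized city pass through untouched.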
Many of the queries are very disturbing, dealing with pornography, grief, and revenge. The queries are like the random private thoughts of their owners. Although they would likely never mutter this stuff to a friend, they have no problem entering it into a search box. One thing is very clear: there are a lot of hurting people out there.
What I also found very interesting was the way people phrase their queries. Many queries are very long; users are apparently adding more words to get better precision. As the Web has grown much larger, it has become necessary to use more words. Just six years ago a long query would return very few hits, but not anymore. People also sometimes use slang or misspellings, which would likely match fewer results. For example, one user entered “u” in several queries where “you” would obviously be more appropriate. Search engines may need to adapt to the use of slang and make automatic substitutions when possible.
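The automatic substitution could be as simple as a lookup table applied to each word before the query is run. The sketch below is only an illustration: the `SLANG` table and the `normalize_query` name are my own assumptions, not something any search engine is known to use, and a real system would need to worry about cases where “u” really is the intended term.

```python
# Hypothetical slang table for illustration; a real one would be learned from query logs.
SLANG = {
    "u": "you",
    "r": "are",
    "ur": "your",
    "2nite": "tonight",
}

def normalize_query(query):
    """Replace whole-word slang terms with their standard spelling."""
    words = query.split()
    # Look up each word in lowercase; keep the original word when there is no match.
    return " ".join(SLANG.get(word.lower(), word) for word in words)

print(normalize_query("how do u fix a flat tire"))
```

Only whole words are replaced, so the “u” inside a word like “music” is left alone.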
It’s really too bad that AOL has received so much heat for what has happened, especially since other companies like Excite and AltaVista have done the same thing in the past. The difference today is that we are much more aware of privacy issues, and the queries are becoming much more tuned to individuals. I would still like to see Google, MSN, Yahoo, and others also give up some detailed search data like this in the future.