Tuesday, January 29, 2008

Internet Archive and censorship

Just yesterday, an anonymous Wikipedia editor (identified only by an IP address) claimed that the Internet Archive had "censored" pages from around the 9/11 time frame. To prove their point, they provided a link to a snapshot showing the gaps in the Archive. They cited no external source that discusses the issue.

I'm sure there is a good explanation for the pages not appearing in the Archive... most likely Alexa (which supplies a majority of the archived pages to IA) just didn't crawl those sites on those dates. The Web is a very large place, and Alexa doesn't crawl every website on the same schedule.

I've changed the wording to read "missing pages" instead of "censored", but I suspect my change will be reverted sometime today. This anonymous editor obviously thinks there is something sinister going on.

Update on 1/30/2008:

Sure enough, the conspiracy theory will not die. Someone at added further "evidence" that IA has yanked the pages, and apparently agrees. is apparently Victoria Sachs, or at least that's the name given in this Internet Archive forum posting which asks about the missing pages. Brewster Kahle, founder of the IA, responded to her post:
There was nothing special happening before 9/11/2001-- the attempt is to crawl every site ever 2 months. sometimes things would be more frequent, but mostly 2 months.

with 9/11 events, the crawler team tried to archive things much more frequently. the news sites had trouble staying up, so that record is a bit spotty.

I hope this helps.

Sorry, Brewster, but that explanation will not appease Victoria and company. There are other websites that were archived in the weeks before 9/11, so it must be a huge conspiracy to keep the world in the dark. Victoria also wrote:
Yet the Internet Archive's records clearly show the major sites listed above have never before experienced such an enormous 'missing cache' gap in the history of the Internet Archive.
Of course this is not true. Here's one counterexample where Time.com is missing all of July 2000, a 50+ day gap. I could give a number of other counterexamples, but alas I have students to teach. (BTW, Victoria, the IA doesn't cache pages, they archive them. There is a difference. wink)
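For anyone who wants to check a claim like this themselves, here is a minimal sketch of how gaps in a site's capture history could be measured. The snapshot dates below are made up for illustration (they are not real Internet Archive data), and `find_gaps` is just a hypothetical helper name; in practice you would substitute the capture dates listed for the site you care about.

```python
from datetime import date


def find_gaps(snapshot_dates, min_days=30):
    """Return (start, end, days) for consecutive snapshots at least min_days apart."""
    ordered = sorted(set(snapshot_dates))
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        days = (later - earlier).days
        if days >= min_days:
            gaps.append((earlier, later, days))
    return gaps


# Hypothetical capture dates, invented for this example only.
snapshots = [
    date(2000, 6, 20),
    date(2000, 6, 21),
    date(2000, 8, 15),
    date(2000, 8, 16),
    date(2000, 10, 18),
]

for start, end, days in find_gaps(snapshots):
    print(f"{start} -> {end}: {days} days")
```

With a list like this for each site, it is easy to see whether a gap around a particular date is unusual or just one of many gaps of similar length.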

Victoria's Wikipedia article edit concludes:
Unfortunately, future generations will never be able to read what Newsweek.com, Reuters.com, Altnet.org, ABC.com, Time.com, MSNBC.com, ABCnews.com, Nasdaq.com, Bloomberg.com, LAtimes.com, Timesofindia.com, CNN.com, UAL.com, CBSnews.com, and NYtimes.com, published during the weeks preceding September 11th, 2001.
With this I do agree... it is too bad IA doesn't have those pages archived. But I seriously doubt they purposely removed those pages from their Archive; that would run completely contrary to what they are attempting to do.

I am visiting with some folks at IA in San Francisco next month. I'll bring up this matter with them and report back what they say.

Update on 2/12/08:

I think the controversy has finally subsided.

On Feb 10, an anonymous user (Victoria?) removed their posting and signed off with
OK, censorship wins.
The Talk page basically outlines the rambling (and unsigned) thoughts of this editor and how other editors tried to reason with him/her. It makes for some interesting reading. Even Mr. Wikipedia got in on the action:
This guy has been emailing me as well. It is an unsubstantiated crackpot conspiracy theory that doesn't pass the basic sniff test. Even the links he sends to show he is right, thoroughly refute his claims. (As said above, the gaps, while regrettable, are normal for that era of the archive.)--Jimbo Wales (talk) 17:39, 6 February 2008 (UTC)
I think that about sums it up.