Saturday, September 09, 2006

Google's cached date = last request date

Vanessa Fox, a member of the crawl team at Google, announced on Tuesday that Google would start posting the last request date on its cached pages. Google used to indicate only the date that the page was last retrieved, so if Google made an If-Modified-Since request and the web server responded with a 304 (not modified) response, the cached date would be left unchanged. Now the cached date will indicate the date of the most recent 200 or 304 response. Matt Cutts also discussed this change and even made a little video for those who need a visual explanation.
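To make the difference concrete, here is a minimal Python sketch of the two policies as described above. It is only an illustration, not Google's actual code: the function names, the "old"/"new" policy labels, and the crawl history are all made up for the example, and leaving the date unchanged on any other status code is my own assumption.

```python
from datetime import date

def cached_date(previous_date, request_date, status, policy):
    """Return the date shown on the cached page after one crawl request."""
    if status == 200:
        # Full retrieval: both policies record the request date.
        return request_date
    if status == 304:
        # Old policy keeps the date of the last full retrieval;
        # new policy records the date of this request.
        return request_date if policy == "new" else previous_date
    # Any other status: leave the date unchanged (assumption).
    return previous_date

def replay(history, policy):
    """Apply a list of (request_date, status) pairs in order."""
    shown = None
    for request_date, status in history:
        shown = cached_date(shown, request_date, status, policy)
    return shown

# Hypothetical crawl history for a page that has not changed since August 1.
history = [
    (date(2006, 8, 1), 200),
    (date(2006, 8, 15), 304),
    (date(2006, 9, 5), 304),
]

print(replay(history, "old"))  # 2006-08-01
print(replay(history, "new"))  # 2006-09-05
```

With the old policy the cached page keeps showing August 1 even though the crawler checked the page twice afterward; with the new policy it shows the date of the latest check.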

Frankly, I was very surprised to learn that Google’s cache date worked this way. In effect, it was much like Yahoo’s Last-Modified date: it was really just an indication of when they noticed the page had changed. I have crawl data from 2005 indicating that Google would periodically issue regular (unconditional) HTTP GET requests, possibly just to verify that the content had indeed not changed.

I’m not totally sure what MSN’s cache date indicates. From my 2005 crawl data, MSN apparently never issued an If-Modified-Since request. If they are still operating with the same crawl policy, then they are storing the time they last contacted the web server, so their cache date would indicate the same thing as Google’s new cached date.

What this means for Warrick: Google will now more frequently have the most recent version of a page. Therefore Google’s overall percentage of contributed resources will likely increase in the reconstructions I’ve been performing over the last few weeks.
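A rough sketch of why the percentages would shift, assuming the reconstruction simply keeps the copy with the most recent cached date. This is not Warrick's actual code; the function name and all of the dates below are hypothetical.

```python
from datetime import date

def pick_most_recent(copies):
    """Return the repository whose cached copy has the latest date.

    copies maps a repository name to its cached date, or None if the
    repository has no dated copy of the page.
    """
    dated = {name: d for name, d in copies.items() if d is not None}
    if not dated:
        return None
    return max(dated, key=dated.get)

# Hypothetical page, unchanged since August 1 but re-checked by both crawlers.
old_policy = {"google": date(2006, 8, 1), "msn": date(2006, 9, 3)}
new_policy = {"google": date(2006, 9, 5), "msn": date(2006, 9, 3)}

print(pick_most_recent(old_policy))  # msn
print(pick_most_recent(new_policy))  # google
```

The content of the page is identical in both cases; only the date Google reports has changed, and that alone is enough to change which repository gets credit for the resource.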

On a side note, someone asked Matt Cutts why Google does not post the cached date of PDFs, and Matt said he was going to ask the crawl team about it.