Tuesday, May 09, 2006

Server encoding caching experiment

To determine whether my server-side component encodings could be inserted into indexable/cacheable HTML files, I ran a little experiment. I created 3 HTML files that contained encoded chunks in HTML comments at the end of each file:

html_encoded1.html - 2 KB
html_encoded2.html - 45 KB
html_encoded3.html - 99 KB

If you view the source of the pages, you’ll see something like this at the end:

<!-- BEGIN_FILERECOVERY
chunks = 4
filename = xor.o
recover = 2
orig_size = 1105
block_size = 554
block_num = 3

fY/xaGQn0V5MOOpLnM1WIsIUMirrVBQ2XNhidvc5yjL9tEyKTmNjNPjcrJzcPWvs
INxxHl1Gt5lKQAYoNi1DXOhFI5ExBm15Nxx1T/hFCwVvsyaHsQQdd3lcqWJl+WTw
BTlkiI8yWcPPoy38dqgTVnc4aSNd+0YQWW0bDl67/6XTnych3rSXn5YEYhVMU2eS
LCR/0N4pAhKgeMb7SXtdJNQ6WykqDXYJAjtTOIrT2CLaPNRdKbU/ydsvUSDenSt+

Etc…
END_FILERECOVERY -->
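For anyone who wants to pull a block like this back out of a page, here is a minimal Python sketch. It assumes the layout shown above (header lines of the form `key = value`, a blank line, then base64 text); the function name `parse_recovery_block` and the regular expression are my own illustration, not part of the original encoder.

```python
import base64
import re

# Matches one recovery block inside an HTML comment (layout assumed from the sample).
BLOCK_RE = re.compile(
    r"<!--\s*BEGIN_FILERECOVERY\s*\n(.*?)\n\s*END_FILERECOVERY\s*-->",
    re.DOTALL,
)


def parse_recovery_block(html):
    """Return (header_dict, payload_bytes) for the first block, or None."""
    match = BLOCK_RE.search(html)
    if match is None:
        return None
    # The header and the payload are separated by the first blank line.
    header_text, _, payload_text = match.group(1).partition("\n\n")
    header = {}
    for line in header_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            header[key.strip()] = value.strip()
    # Base64 ignores layout, so drop all whitespace before decoding.
    payload = base64.b64decode("".join(payload_text.split()))
    return header, payload


# Hypothetical sample page; the payload here is made up for the demo.
sample = (
    "<html><body>page text</body></html>\n"
    "<!-- BEGIN_FILERECOVERY\n"
    "chunks = 4\n"
    "filename = xor.o\n"
    "block_num = 3\n"
    "\n"
    + base64.b64encode(b"example payload").decode("ascii")
    + "\nEND_FILERECOVERY -->\n"
)
header, payload = parse_recovery_block(sample)
print(header["filename"], len(payload))
```

Header values come back as strings, so a caller would convert fields such as `block_num` to integers before using them.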
I placed these files in my public_html folder on April 19, and linked to them from my index.html page. Today I checked Google, MSN, Yahoo, and Ask to see if any of them were cached. Here are the results:

Google – cached all three
MSN – cached 1 and 2
Yahoo – indexed 2 only (not available in their cache)
Ask – nada

Looks like 99 KB is too large for MSN. Yahoo’s cache is really inconsistent: maybe 2 is in there, maybe it’s not. Why didn’t they grab 1? To see if Google can handle any more, I have created 4 new files of 150, 200, 250, and 300 KB.
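Test pages of a chosen size can be produced with something like the sketch below. It is only an illustration: it assumes 1 KB = 1024 bytes, pads the comment with random base64 text rather than real encoded chunks, and the names `HEAD`, `TAIL`, and `build_test_page` are hypothetical.

```python
import base64
import os

HEAD = (
    "<html><body><p>Cache size test page</p></body></html>\n"
    "<!-- BEGIN_FILERECOVERY\n"
)
TAIL = "END_FILERECOVERY -->\n"


def build_test_page(target_kb):
    """Return an ASCII HTML page that is exactly target_kb * 1024 bytes long."""
    target = target_kb * 1024
    room = target - len(HEAD) - len(TAIL)
    # Random filler; base64 never contains "-", so it cannot close the comment early.
    text = base64.b64encode(os.urandom(room)).decode("ascii")
    lines = [text[i:i + 64] for i in range(0, len(text), 64)]
    # Cut the filler to fit and finish with a newline before the end marker.
    body = "\n".join(lines)[:room - 1] + "\n"
    return HEAD + body + TAIL


for size_kb in (150, 200, 250, 300):
    page = build_test_page(size_kb)
    print(size_kb, len(page.encode("ascii")))
```

Because the filler is cut to length, the last base64 line is generally not decodable; that is fine for a size test but not for real recovery data.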

I’ll check back in a couple of weeks and see if anything else has been cached.

Update: 6/20/06

Google and MSN have cached all of the files, up to 300 KB. Yahoo has indexed only the first 3 (none are cached), and Ask has nothing.

Now I'm going to create 400 KB, 500 KB, and 1 MB files and see what happens.

Update: 2/21/07

The cache limits for the search engines appear to be the following: Google - 977 KB, Yahoo - 214 KB, and MSN - 1 MB. I still cannot tell for sure what Ask's limit is, but in one experiment I found 984 KB cached from a document that was 1.6 MB. Google's limit has been confirmed by others.
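One rough way to measure a limit like these is to compare a cached copy against the original and check whether the closing marker survived. The sketch below assumes the cached bytes have already been downloaded and stripped of any markup the search engine adds; `cache_report` is a hypothetical helper, not a tool used in the experiment.

```python
def cache_report(cached, original_size):
    """Summarize how much of a page survived in a cached copy.

    cached: bytes of the cached page body (assumed free of search engine banners).
    original_size: size in bytes of the file as served.
    """
    return {
        "cached_kb": round(len(cached) / 1024, 1),
        "truncated": len(cached) < original_size,
        # If the end marker is missing, the recovery block was cut off.
        "end_marker_present": b"END_FILERECOVERY" in cached,
    }


# Made-up data: a 2 KB filler page whose cached copy stops after 1 KB.
original = b"x" * 2048 + b"END_FILERECOVERY -->"
report = cache_report(original[:1024], len(original))
print(report)
```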