Thursday, August 03, 2006

Yahoo transforming FRAME tags

For the past several months I’ve been ramping up for a large experiment in which I’ll be reconstructing several hundred websites. I’ve been learning to use Heritrix and process ARC files, and I’ve been periodically tweaking Warrick. Today I found out that Yahoo has changed the way it caches HTML pages that contain frames.

For example, the page at http://www.harding.edu/comp/ contains the following HTML:

<FRAMESET COLS="195,*" FRAMEBORDER=no FRAMESPACING=0>
<FRAME SRC=menu.html NAME="MENU" MARGINWIDTH=0 MARGINHEIGHT=0>
<FRAME SRC=welcome.html NAME="MAIN">
</FRAMESET>

In Yahoo’s cached page for this URL, the FRAME tags are converted to the following (I’ve added some white space for readability):

<frameset rows="200,*"><frame scrolling="no" noresize="" frameborder="0" marginwidth="0" marginheight="0" src="http://216.109.125.130/search/cache?.intl=us&u=www.harding.edu%2fcomp%2f&
w=%22harding+.edu%22&d=XS7fRmP9NNYx&origargs=p%3durl%253Ahttp%253A%252F%252F
www.harding.edu%252Fcomp%252F%26toggle%3d1%26ei%3dUTF-8%26_intl%3dus&frameid=-1">


<FRAMESET COLS="195,*" FRAMEBORDER=no FRAMESPACING=0>

<frame security="restricted" MARGINHEIGHT="0" MARGINWIDTH="0" NAME="MENU" SRC="http://216.109.125.130/search/cache?.intl=us&u=www.harding.edu%2fcomp%2f&
w=%22harding+.edu%22&d=XS7fRmP9NNYx&origargs=p%3durl%253Ahttp%253A%252F%252F
www.harding.edu%252Fcomp%252F%26toggle%3d1%26ei%3dUTF-8%26_intl%3dus&frameid=1" >


<frame security="restricted" NAME="MAIN" SRC="http://216.109.125.130/search/cache?.intl=us&u=www.harding.edu%2fcomp%2f&
w=%22harding+.edu%22&d=XS7fRmP9NNYx&origargs=p%3durl%253Ahttp%253A%252F%252F
www.harding.edu%252Fcomp%252F%26toggle%3d1%26ei%3dUTF-8%26_intl%3dus&frameid=2" >


</frameset></FRAMESET>

Yahoo is placing its own FRAMESET tags around mine and loading the two column frames with pages directly from its cache. Notice the use of security="restricted" within the FRAME tag, an Internet Explorer attribute that tells the browser to place security constraints on the frame sources; this disables any JavaScript in my pages.
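To see what the rewritten SRC values actually carry, here is a rough Python sketch that splits the two cache URLs from the example above into their query arguments and reports which ones differ. The helper name cache_params is made up for illustration, and the URLs are simply the ones quoted above with the added white space removed.

```python
from urllib.parse import urlsplit, parse_qs

# The cache URL from the example above, rejoined into one string.
# Everything up to "frameid=" is identical for the MENU and MAIN frames.
BASE = (
    "http://216.109.125.130/search/cache?.intl=us&u=www.harding.edu%2fcomp%2f&"
    "w=%22harding+.edu%22&d=XS7fRmP9NNYx&origargs=p%3durl%253Ahttp%253A%252F%252F"
    "www.harding.edu%252Fcomp%252F%26toggle%3d1%26ei%3dUTF-8%26_intl%3dus&frameid="
)


def cache_params(src):
    """Return the query arguments of a cache URL as a flat dict (hypothetical helper)."""
    query = urlsplit(src).query
    return {name: values[0] for name, values in parse_qs(query).items()}


menu = cache_params(BASE + "1")  # SRC Yahoo gave the MENU frame
main = cache_params(BASE + "2")  # SRC Yahoo gave the MAIN frame

# Names of the arguments whose values are not the same in both frames.
differing = sorted(name for name in menu if menu[name] != main[name])

print("u argument:", menu["u"])
print("arguments that differ:", differing)
```

In this example the u argument decodes to the parent page (www.harding.edu/comp/) for both frames, and frameid is the only argument that changes between them.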

While this conversion of FRAME tags makes the page easier to view from Yahoo’s cache, it completely destroys the original HTML. The rewritten SRC values differ only in their frameid argument, and the u argument names the parent page rather than the frame’s source, so there’s no way I can parse the arguments to tell what URL used to be in the SRC attribute. ARG! Now I’m going to have to add a rule to Warrick that tells it to ignore Yahoo cached pages that contain FRAME tags. Google and MSN have yet to implement this “trick”, and hopefully they never will.
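A rule like that could look something like the following Python sketch. This is not Warrick's actual code; the names FrameDetector, has_frame_tags and should_skip, and the "yahoo" source label, are all assumptions made up for illustration. It uses the standard library's HTMLParser, which lowercases tag names, so FRAME and frame are treated alike.

```python
from html.parser import HTMLParser


class FrameDetector(HTMLParser):
    """Remembers whether any FRAME or FRAMESET start tag was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in ("frame", "frameset"):
            self.found = True


def has_frame_tags(html):
    """Return True if the HTML contains a FRAME or FRAMESET tag."""
    detector = FrameDetector()
    detector.feed(html)
    detector.close()
    return detector.found


def should_skip(source, html):
    """Hypothetical rule: ignore a cached page only if it came from Yahoo and uses frames."""
    return source == "yahoo" and has_frame_tags(html)


framed = '<FRAMESET COLS="195,*"><FRAME SRC=menu.html NAME="MENU"></FRAMESET>'
plain = "<html><body><p>No frames here.</p></body></html>"

print(should_skip("yahoo", framed))   # framed page from Yahoo
print(should_skip("google", framed))  # same page from another cache
print(should_skip("yahoo", plain))    # Yahoo page without frames
```

Keying the check on the cache source means framed pages from Google or MSN, which still preserve the original SRC values, would not be thrown away.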