Friday, January 11, 2008

Search engine class, Nutch, and Wikia Search

This is the final week before classes begin, and I'm frantically preparing for my Search Engine Development class. There are just a handful of courses taught like this that I'm aware of, and thankfully most of the lecture notes are available online.

I've really been struggling with how much development work to give my students... do I require them to write the complete engine from scratch or use existing components? There are advantages and disadvantages to both approaches, so I'm shooting for something in the middle.

I've decided that we'll write a few components ourselves, but we're also going to use Nutch, an open source search engine written in Java. I hope we'll be able to make a major contribution to Nutch, although I'm not sure exactly what that will be yet. By using a somewhat mature open source project, my students will get to experience what it's like to learn a large pre-existing code base and understand how software is developed in the open source arena.

Just a few days ago, Wikia Search (alpha) was launched to less than stellar reviews. Wikia Search is Jimmy Wales' attempt to create an open source search engine that uses human feedback. Wales expects Wikia Search to compete with Google and hopes it will some day capture around 5% of all searches. Wikia Search is using Nutch, although they don't make that clear on their website. (I wrote a little about Wikia Search [or Wikiasari] about this time last year.)

I've tried out Wikia Search myself, and the results are pretty poor. But, as Wales points out, this is an attempt to build a search engine, not the final product. And had people judged Wikipedia's quality when it first launched, they would have thought it useless.

2 comments:

  1. On search-engine-from-scratch vs. use of frameworks: In my view, software development approaches, at least outside academia, are more and more reliant on how effectively frameworks can be evaluated and utilized. For personality types like mine, spending time writing your own is just a waste of time.

    The times I have gone more low-level with code are when a given framework needs modding or improving - then my knowledge of the low level flows in the opposite direction (framework > code) rather than the, in my view, static approach of learning the low-level first.

    I'm no coder, but this fits with the pressures of the real world to "get it done". I think it also fits the millennials' mindset of knowledge management: know the tools to help you find the tools (just-in-time knowledge) to do your job.

    Have fun!

  2. Some of us still think Wikipedia is useless, and worse than useless - positively misleading and dangerous.
