Tailored Search Engines

I've had quite a bit of experience with Lucene over the last few years, so I'm pleased to see that it and its related projects are maturing.

Over the last week I've been putting together a web-based search engine using Nutch, Version 9 of which was released last month. Performance-wise, I'm noticing a big difference, though quite a few of the tools I was used to using have been renamed, and there are plenty of deprecation comments on old scripts that I have to update.

I know there are quite a few free tailored search engine offerings out there, including Google's Custom Search Engine, but I have tried these and not been that happy with them, even though Google are now offering a Business Edition which starts at $100 per year.

Anyway, in preparation for developing the search engine I went back and browsed some classic advice on optimisation, such as Anna Patterson's "Why Writing Your Own Search Engine is Hard" and Mike Cafarella and Doug Cutting's article, "Nutch: Open Source Web Search". There's plenty of other stuff out there on Crawling the Web in general.

As part of this I did consider other possibilities, such as A9's (aka Amazon's) paid-for OpenSearch service, but for a search engine I want a bit more control than their index allows. Perhaps somewhere down the line, using their elastic cloud web service for what I'm doing with Lucene and Nutch may be the answer to scaling. For now I'm just honing the focus.

Aside from this, the Carrot2 project, which adds clustering to Nutch and Lucene, looks useful.

Of all these projects and developments, though, the most interesting is the news that the Taste project (collaborative filtering for Java) has donated its code to the Apache Mahout project (part of the Lucene group).



