Organizatorium: April 2009

I recently spent a day immersed in the world of alternative search engines, at the AltSearchEngines Day II conference in San Francisco. It was a low-key, boutique conference, easily contained in a single medium-sized conference room at the downtown InterContinental. According to Charles Knight, the engine behind the event, AltSearchEngines aims to represent everything in search except Google. A year earlier it apparently meant "without any of the major search engines". Not this year: Yahoo!'s BOSS was one of the highlights of the conference, and PowerSet/Microsoft was the lead sponsor.

BOSS stands for Build your Own Search Service and is the embodiment of Yahoo!'s initiative to open up their search index. At the time of the talk it looked like I was the only one in the room who didn't know what Yahoo! BOSS was, judging by the number of hands that rose when Bill Michaels, the General Manager and the Director of the Open Search Platform at Yahoo!, asked the audience whether they were familiar with, or already using, BOSS.

Yahoo! decided to open up access to their index and expose it over a RESTful API, apparently to encourage companies to innovate on top of it by adding their "special sauces" (this is how their web site describes it, not me). According to Bill, such "sauces" are social graphs, semantic technologies, and third-party structured data, all layered on top of a 50 billion document index. There's no point in re-crawling the web, he said. Yahoo! will do it for you and offer the index data as a commodity, on a pay-as-you-go or flat-fee basis. The details of the pricing model are being worked on as we speak. Tapping into this resource would save your company capital expenditures in the range of $300 million, which I assume is how much Yahoo! is spending on its crawling and indexing infrastructure.

BOSS gives you query handling, ranking, indexing and crawling, which is pretty convenient considering the amount of computing power and bandwidth you would need to do all that on your own. All this, plus link counting, is what made Google. Yahoo! exposes web index, image search, spelling, and result re-ranking APIs. They also came up with the concept of "Vertical Lenses", which are "highly customized and tuned vertical and niche search engines". If you're only interested in a subset of the index, you have the option of specifying a "white list" of URLs that restricts the search domain. Companies like TechCrunch, OneRiot and SurfCanyon are already using BOSS.
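To make the "white list" and "special sauce" ideas a bit more concrete, here is a minimal Python sketch of what layering your own logic on top of somebody else's index could look like. Everything in it is an assumption made for illustration: the result fields (`url`, `title`, `rank`), the sample data, and the `social_score` signal are hypothetical and do not reflect the actual BOSS response format or API.

```python
from urllib.parse import urlparse

# Hypothetical results, shaped like what a web search API might return.
# These field names are assumptions, not the real BOSS response format.
RESULTS = [
    {"url": "http://www.techcrunch.com/boss-launch", "title": "BOSS launches", "rank": 1},
    {"url": "http://spam.example.com/boss", "title": "Cheap BOSS", "rank": 2},
    {"url": "http://www.oneriot.com/realtime", "title": "Real-time search", "rank": 3},
]

# The "white list": only results from these domains are kept.
WHITE_LIST = {"techcrunch.com", "oneriot.com"}


def host_allowed(url, white_list):
    """Return True if the URL's host is a white-listed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in white_list)


def rerank(results, white_list, social_score):
    """Filter results by the white list, then re-rank them with our own signal.

    social_score maps a URL to a number (our hypothetical "special sauce").
    Higher scores come first; the engine's original rank breaks ties.
    """
    kept = [r for r in results if host_allowed(r["url"], white_list)]
    return sorted(kept, key=lambda r: (-social_score.get(r["url"], 0), r["rank"]))


if __name__ == "__main__":
    scores = {"http://www.oneriot.com/realtime": 5}
    for r in rerank(RESULTS, WHITE_LIST, scores):
        print(r["rank"], r["url"])
```

The point of the sketch is only the division of labor: the index provider does the crawling and the base ranking, and the vertical adds filtering and its own signal on top.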

The way I see it, Yahoo! is trying to do with web data what Amazon EC2 did for computing cycles and storage. I was actually intrigued by the idea, so I floated it to Bill, and I don't remember him saying it's not true. Or maybe he did; it was a crowded and noisy room. Anyway, I suggested he could repeat this presentation, possibly with a bit more of a technical edge, in front of our gathering of Java geeks at SDForum JavaSIG. Check out the JavaSIG page of upcoming events or sign up for our mailing list if you're interested in the subject.

Then there was the Semantic Search panel, hosted by speakers from Digger, OrcaTec, TrueKnowledge, and PowerSet. Some random points: everyone on the panel and in the audience seemed to agree that blind use of synonyms makes a mess and is more trouble than help; fully understanding what is in one's index is a very difficult proposition, and not many companies are attempting it, if any; and yes, semantic search is absolutely possible without a semantic web. Has anyone heard of "psyche"? Apparently it is a long-running project whose goal is to capture all the common knowledge of the world. I couldn't find it in the first three minutes of searching, so maybe I didn't get the name right.

TrueKnowledge creates structured knowledge technology, enabling you to find the answer to a question directly, instead of being presented with a list of web pages where you may (or may not) find the answer. The knowledge base they work with is built by automatically mining the web and databases, or is entered manually by contributors. If you go to their site you can't miss the "Add Knowledge" tab, which allows users to add entities and facts. To my delight, I found out that I may be an "agreement-making entity" (which, it's true, TrueKnowledge currently doesn't know anything about). They have an API for integrating their technology into third-party systems.

OrcaTec is the technology provider behind truevert.com, and they take a different approach to semantic search: they don't use structured information, taxonomies, ontologies or thesauri, but derive meaning directly from documents. Apparently, they do so by expanding the query on the basis of a language model, learning the meaning of words from their context and using those words for search. When it comes to ranking, they use proprietary language modeling techniques and statistical linguistics. They claim they can build a vertical in about an hour. This sounds great, indeed.
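OrcaTec's techniques are proprietary, so the following is not their method. It is only a toy Python sketch of the general idea of learning related words from context and using them to expand a query: two words are treated as related when they tend to appear alongside the same other words. The corpus, the stop-word list, and all function names are made up for illustration.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus standing in for the documents of a vertical.
DOCS = [
    "the doctor prescribed medicine for the patient",
    "the physician prescribed medicine for the patient",
    "the doctor examined the patient",
    "the physician examined the patient",
    "the engine powers the car",
]
STOP = {"the", "for"}


def context_vectors(docs):
    """For each word, count the other words it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in docs:
        words = [w for w in doc.split() if w not in STOP]
        for w in words:
            for other in words:
                if other != w:
                    vectors[w][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


def expand(query, vectors, top_n=1):
    """Append to each query term the top_n words with the most similar contexts."""
    expanded = []
    for term in query.split():
        expanded.append(term)
        base = vectors.get(term, Counter())  # unknown terms get an empty context
        scored = sorted(
            ((cosine(base, vec), w) for w, vec in vectors.items() if w != term),
            reverse=True,
        )
        expanded += [w for score, w in scored[:top_n] if score > 0]
    return expanded


if __name__ == "__main__":
    print(expand("doctor", context_vectors(DOCS)))
```

In this corpus "doctor" and "physician" occur in identical contexts, so the query "doctor" is expanded with "physician" without any thesaurus being consulted. A real system would need far more data and a way to keep unrelated but co-occurring words out of the expansion, which is the kind of mess the panel warned about with blind synonym use.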

Digger is another provider of semantic search technology. Unfortunately, their process for obtaining a beta invitation is not as streamlined as TrueKnowledge's, so I have to rely exclusively on what I remember from their presentation. That is mainly a control panel that guides the user in refining the query, by asking her to validate the terms of the query and further define the topic she is researching. My impression was that they gather possible synonyms and conceptually close words and ask the user to validate them.

Other more or less new and cool stuff: mobile search (taptu), visual search that is not based on keywords but on comparing images with visual examples (imprezzeo, gazopa.com), real-time search (collecta.com, oneriot.com; the OneRiot presenter candidly admitted they don't have any revenue model, they're just burning money building a really cool product), federated search (DeepWeb), medical verticals (searchmedica.com, righthealth.com, yottalook.com), visualization (viewzi.com).

Wednesday, April 1, 2009

The World of Alternative Search Engines
