Dumb search

Here’s something I’ve been thinking about for a while.

Most papers in the Information Retrieval field begin with a sentence along the lines of “In this era of information overload the need for better search mechanisms is evident.” Or something. Indeed there is an information overload. The situation is actually worse, because it isn’t just any kind of overload: we’re overloaded with pointless stupid information, e.g. MySpace and Facebook user pages.

All right, most people would call me a cynical bastard and, granted, I am. But searching these days is, at best, torture. Here I definitely agree with the folks from Google who say that the problem of searching is far from solved.

So here are my two cents to improve the search experience (and hopefully the searcher’s IQ in the meantime). Let’s call it dumbsearch.

Let me begin with a very simplified view of how a search engine works. The process begins with a user submitting a query to the search engine. In turn the search engine’s super-sweet ranking algorithm goes through its oh-so-comprehensive database of web pages looking for matches. All the matches (pages potentially containing the information being searched for) are then ranked according to some criteria (hopefully some approximation of relevance).
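To make that simplified view concrete, here is a toy sketch of the match-then-rank loop. Everything in it is an assumption for illustration: the `search` function, the dictionary standing in for the database, and the use of raw keyword counts as the "relevance" score. No real engine is this naive.

```python
def search(query, pages):
    """Return (url, score) pairs for pages matching the query, best first.

    `pages` is a hypothetical stand-in for the engine's database:
    a dict mapping a URL to the plain text of that page.
    """
    terms = query.lower().split()
    scored = []
    for url, text in pages.items():
        words = text.lower().split()
        # Toy relevance: how many times the query terms occur in the page.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((url, score))
    # Rank the matches by score, highest first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `search("cats", pages)` on a handful of pages returns only the pages mentioning "cats", with the page that mentions it most often at the top.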

Simple, right? Right.

Now, fancier algorithms (such as Google’s PageRank) look not only at keyword matching but also at other types of information (usually query independent) such as the authority of a document, the structure of the document, etc., and this is where I suggest we should work.

What I propose is rather simple. In the domain of text processing a very common thing to do is to calculate different metrics that are supposed to measure how similar documents are to each other. These similarities are usually interpreted as being semantic, even though they might only be geometric.
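One common example of such a "geometric" metric is cosine similarity over word counts. The sketch below is just one possible choice, not something dumbsearch depends on; the function name and the whitespace tokenisation are my own simplifications.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)
```

Two texts with the same word distribution score close to 1, and texts sharing no words score 0. Note that this only measures the angle between word-count vectors, which is exactly the "geometric, maybe semantic" caveat.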

So I’ll accept that these geometric similarities actually do capture some semantic similarities. All right. Assuming that, the rest is really easy.

Let’s take a look at how dumb search could operate.

Again the user submits a query. The search engine looks for matches and ranks them. However this time the search engine (dumbsearch for short) does two extra searches (which should be rather cheap as it already has an oh-so-comprehensive database). The first search is a search for the entered keywords but restricted to Wikipedia. The second search is also for the entered keywords but this time restricted to Facebook, MySpace, etc.

Now dumbsearch has three retrieved document sets: the original document set (let’s call it origset), the one restricted to Wikipedia pages (let’s call it smartset) and the one restricted to Facebook, etc. pages (let’s call it dumbset). I suggest that the documents in origset should be re-ranked according to their similarity (or dissimilarity) to the documents (or only the first one, take your pick) in both the smartset and the dumbset.
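A rough sketch of how the three sets could be built is shown here. It assumes the engine already holds the list of matching URLs, so the two "extra searches" reduce to filtering by host; the host lists and the helper names `host_matches` and `build_sets` are hypothetical.

```python
from urllib.parse import urlparse

# Assumed host lists, for illustration only.
SMART_HOSTS = ("wikipedia.org",)
DUMB_HOSTS = ("facebook.com", "myspace.com")


def host_matches(url, suffixes):
    """True if the URL's host is one of the suffixes or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == s or host.endswith("." + s) for s in suffixes)


def build_sets(matches):
    """Split the matching URLs into origset, smartset and dumbset."""
    origset = list(matches)
    smartset = [url for url in matches if host_matches(url, SMART_HOSTS)]
    dumbset = [url for url in matches if host_matches(url, DUMB_HOSTS)]
    return origset, smartset, dumbset
```

Whether origset should keep or drop the pages that also land in smartset or dumbset is left open here; this sketch keeps them.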

Dumbsearch should ideally favour documents that are most similar (hopefully in terms of language) to Wikipedia articles, which are edited and proofread by many many people, and penalise documents that are most similar to Facebook, MySpace, etc. pages, which are edited by teenagers whose hormone levels are usually sky high.

A simple formula would be something along the lines of:

score(d) ~ rank(d) + lambda * sim(d, wp_p) + (1 - lambda) / sim(d, fb_p)

where wp_p is a page from Wikipedia, fb_p is a page from Facebook and lambda is a tuning parameter. In the formula above we can see that the more similar a page is to a Wikipedia article, the more the lambda term contributes to the new score. This is coupled with the inverse of the similarity to a Facebook page, so the more dissimilar the page is to a Facebook page, the more the (1 - lambda) term will contribute to the score of document d.

Then all you have to do is rank the documents by their new score(.)
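As a minimal sketch, the re-ranking could look like the code below. It makes a few assumptions the formula leaves open: rank(d) is treated as the engine's original score where higher is better, the two similarities are already computed, and a small `eps` is added to the denominator so that a similarity of zero doesn't blow up. The names `dumb_score` and `rerank` are made up for the example.

```python
def dumb_score(base, sim_wp, sim_fb, lam=0.5, eps=0.01):
    """score(d) ~ rank(d) + lambda * sim(d, wp_p) + (1 - lambda) / sim(d, fb_p).

    `eps` is my own addition: it keeps the inverse term finite
    when the similarity to the Facebook page is zero.
    """
    return base + lam * sim_wp + (1 - lam) / (sim_fb + eps)


def rerank(docs, lam=0.5):
    """Sort (doc_id, base, sim_wp, sim_fb) tuples by their new score, best first."""
    return sorted(
        docs,
        key=lambda d: dumb_score(d[1], d[2], d[3], lam),
        reverse=True,
    )
```

One thing the sketch makes obvious: the inverse term is unbounded, so for very small Facebook similarities it can swamp both the original rank and the Wikipedia term. Something would have to be normalised or capped before this was usable.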

The assumptions behind dumbsearch are twofold:
1) Wikipedia is a source for intelligent content
2) Facebook, MySpace, etc. are a source for stupid pointless content

I think both assumptions are fair.

Would we get “smarter information”? (if there’s such a thing) I don’t know. I’m just ranting here.

PS: I don’t intend to offend anybody working for Facebook, MySpace, etc., as they are not responsible for the content uploaded to their websites. Actually the people behind those sites are rather bright; it’s just their users giving them a bad reputation.
