Introduction to the Open Telecom Platform @ spawn_link

September 6th, 2008  |  Published in programming

I’ve recently started dabbling with Erlang and one of the things that I still have to fully grok is OTP. The Open Telecom Platform (OTP) is, allegedly, one of the strongest selling points of Erlang as it provides you with a structure to create fault tolerant, highly available applications … for free. Go figure.

In the introduction to a new series of Erlang/OTP articles for the beginner, Mitchell introduces some basic concepts and makes us a promise: he will hold our hands trough the creation of an OTP baking application. Nice.

I will definitely follow this series of articles and perhaps so should you (if you’re interested in the Erlang/OTP world that is).

Thanks Mitchell!

Forcing indentation: yes, it is a good thing

September 2nd, 2008  |  Published in programming, rants

At work I’m trying to get some of my colleagues to give Python a try. Every time I bring up the subject I have to hear things like but there are not brackets … and what’s up with the forced indentation?

Needless to say I always reply that indentation is good as the end result is usually code that is easier to read and that once you have blocks of code defined by their indentation level you don’t need brackets anymore. But no, they will have none of that.

At work I’m currently writing (more like learning how to) code in C#. I’ve recently posted on C# scope oddities that puzzled me and here I am writing about C# (although this rant applies to other languages as well) again.

Assume for a second that you had written a piece of code along the lines of

if(condition_1_is_met)

if(condition_2_is_met) {

// do your stuff here

}

else {

// handle condition_1_not_met here? Wrong! condition_2_not_met is handled here!

}

Unfortunately, for me and you, this doesn’t work. The else clause is actually, possibly due to design reasons, associated to the second if clause (condition_2_is_met) and not to the first one. Oh yes, I can already hear you saying but you should group condition_1 and condition_2!!! And I could almost agree to that if that didn’t require an extra if in case we wanted to handle the else clauses independently. So, to actually be able to attach an else clause to the first if clause we need an extra else that does nothing as you can see below:

if(condition_1_is_met)

if(condition_2_is_met) {

// do your stuff here

} else {

// here’s where you handle condition_2_not_met

}

else {

// handle condition_1_not_met here

}

That is just fine and possibly makes the whole code clearer (?) Unless you don’t care for having an else clause attached to the second if …

And in comes Python with it’s forced indentation:

if condition_1_is_met:

if condition_2_is_met:

# do your stuff in here

else:

# handle condition_1_not_met here

How nice is that then? Indentation there is not only telling you (when you read the code) to which if clause the else clause is attached to but also is telling the compiler the same thing. No extra empty else clauses, etc. Just indentation guiding the flow … as it should be.

I rest my case.

C# scopes — am I missing something here?

August 29th, 2008  |  Published in programming  |  1 Comment

Last time I checked, the scope of a variable drew the line after which it ceased to exist. For instance, in Java, when you defined a variable in a for loop, the minute the loop was over the variable ceased to exist:

for(int i = 0; i < N; i ++) {

int j = 0;

// do stuff with j

}

// j should cease to exist after here

I hoped that scopes would be the same in C# (with the language being very similar to Java) however in this situation the compiler whinges to death:

foreach(int i in some_array) {

int j = 10;

// do stuff here
}

// after here j should be no more yet …

int j = 20;

And that’s where the compiler goes Whoa, hang on a minute, I just saw and int j right above this other one… and it goes poop.

In the C# scope specs you can read:

The scope of a local variable declared in a for-initializer of a for statement (Section 8.8.3) is the for-initializer, the for-condition, the for-iterator, and the contained statement of the for statement.

I’m assuming here that a foreach statement qualifies as a for statement, so I definitely don’t understand what’s going on here.

What’s even more interesting, although reasonable, is that if we sorround the new definition of j with curly brackets, therefore creating a new scope, we make the compiler happy.

Oh well …

Fiddling with an Internet Tablet (Nokia)

August 25th, 2008  |  Published in hacking, misc

A few months ago I bought an old internet tablet. More specifically a Nokia 770. I bought it mainly because it runs GNU/Linux and I wanted to play with it.

Needless to say the first thing I tried to do was to replace the distribution that comes with it (rolled by Nokia engs.) and put a Debian in it. I can’t say that it was a total success as I couldn’t replace the kernel and that means that:

  • Sound didn’t work
  • Wireless worked to some extent

Having an Internet tablet with no internet access is, of course, pointless. Still I kept on playing with it. After debootstrapping Debian onto it I set to compile and install the wonderfull Enlightenment 17, e17 for short. Below you can see a video of e17 running on the tablet. Sorry for the quality and length, it’s an old video and I just found it and uploaded it (meant to do it a long time ago):

Toying with the tablet has proved to be much much fun … but also an unnecessary distraction at the moment so I have, unfortunately, stopped doing it. If you’re interested in the details of both deboostrapping a custom Debian onto your tablet and compiling/running e17 on top of it please refer to the thread (possibly dead now) on the Internet Tablet Talk forum.

Sampling from a given probability distribution

August 10th, 2008  |  Published in programming

In a project I’m working on I’ve come across the necessity to sample a vector of values from a Dirichlet distribution.

I don’t know you, but during my undergraduate degree we were never taught how to write code to do this type of stuff, i.e. sampling from different probability distributions, etc. Fortunately Wikipedia, being a good starting point for most topics, comes to the rescue.

After reading about random number generators it became sort of clear to me (yes, sorry, if you thought otherwise, no I’m not clever) how I should go about it.

According to Wikipedia generating random numbers from a probability distribution f(x) is equivalent to transforming, somehow, a random number generated from a normal distribution. Amongst all the possible methods, two methods are explicitly mentioned there: The inversion method and The acceptance-rejection method.

Bluntly said, the inversion method involves sampling a random number u from a uniform distribution and then finding x such that u = F(x) where F(.) is the cumulative distribution function (CDF) of the distribution we wish to sample from (always making sure that some conditions are met). Simple, right? Right.

Being the lazy sod that I am this method seemed all right for me so I went for it.

Even though the algorithm is straightforward there’s one issue which must be addressed: searching for that x which corresponds to u = F(x).

From the top of my head three different types approaches could be tried:

  1. Do a linear search for x so simply try F(i * epsilon) = u. As soon as you have a match, return i * epsilon (epsilon is the precision desired) — this approach is not too bad (and by bad I mean good) for generating a single random value
  2. Pre-calculate all the F(x) values and store them in a array. Then array[i] = F(i * epsilon). Do a search on the array. This approach is not too bad for generating several values (it improves on the previous approach)
  3. Do a binary search using F(x) = u as stopping criteria. So, do a binary search for x in [a, b) and test if |F(x) - u| < epsilon if so, stop and return x

In a follow up post I will describe all three approaches and include some java code for each of them. In the meantime, if you can come up with a better approach I’d love to hear from you.

PS: If you want to read more about sampling from probability distributions you can always read this free book: Non-Uniform Random Variate Generation.

Dumb search

August 10th, 2008  |  Published in rants

Here’s something I’ve been thinking about for a while.

Most papers in the Information Retrieval field begin with a sentence along the lines of “In this era of information overload the need for better search mechanisms is evident.”. Or something. Indeed there is an information overload. This situation is actually worse because it isn’t any kind of overload, we’re overloaded with pointless stupid information, e.g. MySpace and Facebookuser pages.

All right, most people would call me a cynical bastard and, granted, I am. But searching these days is, at best, a torture. Here I definitely agree with the folks from Google who say that the problem of searching is far from solved.

So here are my two cents to improve the search experience (and hopefully the searcher’s IQ in the meantime). Let’s call it dumbsearch.

Let me begin with a very simplified view of how a search engine works. The process begins with a user submitting a query to the search engine. In turn the search engine’s super-sweet ranking algorithm looks at its oh-so-comprehensive database of web pages looking for a match. All the matches (pages potentially containing the information being searched for) are then ranked according to some criteria (hopefully some approximation of relevance).

Simple, right? Right.

Now, fancier algorithms (such as Google’s pagerank) not only look at keyword matching but at other types of information (usually query independent) such as authority of a document, structure of the document, etc. and here’s where I suggest we should work on.

What I propose is rather simple. In the domain of text processing a very common thing to do is to calculate different metrics that are supposed to measure how similar documents are to each other. These similarities are usually interpreted as being semantic, even though they might only be geometric.

So I’ll accept that these geometric similarities actually do capture some semantic similarities. All right. Assuming that, the rest is really easy.

Let’s take a look at how dumb search could operate.

Again the user submits a query. The search engine looks for matches and ranks them. However this time the search engine (dumbsearch for short) does two extra searches (which should be rather cheap as it already has an oh-so-comprehensive database). The first search is a search for the entered keywords but restricted to Wikipedia. The second search is also for the entered keywords but this time restricted to Facebook, MySpace, etc.

Now dumbsearch has three retrieved document sets, the original document set (let’s call it origset), the one restricted to Wikipedia pages (let’s call it smartset) and the one restricted to Facebook, etc. pages (let’s call it the dumbset). I suggest that the documents in origset should be re-ranked according to their similarity (or disimilarity) to the documents (or only the first one, you take your pick) in both the smartset and the dumbset.

Dumbsearch should ideally favour documents that are most similar (hopefully in terms of language) to Wikipedia articles, which are edited and proofread by many many people and penalise documents that are most similar to Facebook, Myspace, etc. which are edited by teenagers whose hormone levels are usually sky high.

A simple formula would be something along the lines of score(d) ~ rank(d) + lambda * sim(d, wp_p) + (1 - lambda) / sim(d, fb_p)

where wp_p is a page from Wikipedia, fb_p is a page from Facebook and lambda is a tuning parameter. In the formula above we can see that the more similar a page is to a Wikipedia article the the more the left factor contributes to the new score. This is coupled with the inverse of the similarity to a Facebook page, so the more disimilar the page is to a Facebook page the more the right factor will contribute to the score of document d.

Then all you have to do is to rank the document by their new score(.)

The assumptions behind dumbsearch are twofold:
1) Wikipedia is a source for intelligent content
2) Facebook, Myspace, etc. are a source for stupid pointless content

I think both assumptions are fair.

Would we get “smarter information”? (if there’s such a thing) I don’t know. I’m just ranting here.

PS: I don’t intend here to offend anybody working for Facebook, Myspace, etc. as they are not responsible for the content uploaded to the website. Actually the people behind those sites are rather bright, it’s just their users giving them a bad reputation.

When there’s no knowledge

August 10th, 2008  |  Published in rants

People are raving (ranting?) about www.cuil.com and I just have to do it too.

For me it’s simple: search engines can help you find the stuff you think is relevant but not give you stuff they thinks is relevant.

When I tried cuil today the first thing that ticked me off (for a start) was it’s Apple look and feel. All right, we get it, iPods, iPhones, they look nice. But c’mon, get over it already. Let’s let that one fly though.

The people behind Cuil claim that they will find relevant stuff and that is what really gets to me. I’m sorry but relevance is a highly subjective and human notion. Computers, hands off. If you fancy a read there’s tons of literature on the subject and only recently (since 50 odd years) it’s has been being discussed in the Information {Retrieval/Science/Seeking} communities.

Size also seems to be Cuilt’s self-proclaimed strength, together with a large random number on their main page. So, I would assume that they’re bigger than the well established search engines and whatever I can find through them I could also find through Cuil, but even more relevant?All right, let’s give it a go then…

I tried searching for myself (being an egotistical bastard) and I did indeed find my homepage through Google. So obviously Google thinks I’m important enough to be indexed. Cuil? No thanks, they don’t seem to even acknowledge my sheer existence. This raises the question of size vs. coverage. Does bigger size equal better coverage? Of course not. Think of a search engine indexing every possible web page of about 10 different sites totalling a whopping 10,000 pages. Now think of a search engine indexing only the index page of about 10,000 different sites. They both have the same number of pages indexed (same size) however I’d argue that the coverage of the second might be a bit better.

All and all, and in total fairness, my opinion that cuil is just another search engine derives mainly from my belief that search is a really difficult task but that we could start making some serious progress as soon as we started acknowledging (instead of ignoring) that there are certain things that a search engine cannot (and possibly never will) do.

Cuil? No thanks.

Baby steps

August 10th, 2008  |  Published in programming

I’ve been taking my first steps in Erlang during these past few days. Of course, being the busy boy that I am, when I say the past few days I mean that I’ve managed to fiddle with it for about 1-2 hours.

At first sight I have to say that it is rather nice to get back in touch with an old acquaintance of mine: the functional language. I remember writing a compiler, back in the day, in ML and rather enjoying things the language offered such as pattern matching and anonymous variables (yes, call me frivolous!)

I haven’t really done anything impressive in Erlang as of yet, just printing values and manipulating tuples and lists but I foresee a near future in which I’ll be coding cool stuff such as … err … well, just cool stuff.

If you too want to get started perhaps a good starting point would be the Getting Started document or even buying Joe’s own book: Programming Erlang: Software for a Concurrent World.

Needles to say in www.erlang.org you’ll find tons of documentation and useful resources to get you started and keep you going.

All right, that’s me out for the day. Happy Erlang-ing!