Some time ago I posted here that I was trying to learn Erlang. My first encounter with the language had been quite a number of years back and thanks to a dear friend who was my teacher at the time and I must admit it hadn’t been the happiest of all my encounters.
It seems to be that the new trend is all about map/reduce, cloud computing and RoR. Well, fine, we can drop the last one. But still, all the hype seems to be about fault tolerance, high availability, replication, speed, massive number crunching and data processing (feel free to thrown in a few extra buzz words/concepts). So why not jump in and see what this is all about. Enters Erlang.
Much like with spoken languages a programming language is only worth learning if it teaches you how to think, and therefore code, differently. I will work one example in the remainder of this post (I promise I will write a second post with a second and more interesting example) in two languages: Python and Erlang. The two languages are going to be focused on text processing so I expect one of the two to end up looking like the most approriate for both tasks yet I also expect a few surprises to creep up along the road.
Parsing text
Parsing text in some languages is really trivial. In this example I’d like to have a very simple tokeniser, i.e. a function that given a string returns a bunch of tokens according to some criteria. The simplest case of such beast would be the one that uses whitespace as separator. When passed “foo bar” this simple tokeniser would return _foo_ and _bar_ in the appropriate data structure. The Python code for it is insultingly trivial:
def simple_tokeniser(line): return line.split(' ')
Granted, if passed anything but a string simple_tokeniser will whinge as if there was no tomorrow but you surely wouldn’t do that, now would you?
In Python, string objects have a split method (much like in Java and other languages) which takes two parameters: the separator and the maximum number of tokens to return. If the limit is omitted all tokens are returned. In our case we want all of them so we just leave it out. Trying it out on a python console we get:
>>> def simple_tokeniser(line): ... return line.split(' ') ... >>> simple_tokeniser("foo bar baz") ['foo', 'bar', 'baz']
Not only it works as expected but now we also know in which data structure we get our tokens: a list! Fair enough, let’s try to mimic the behaviour in Erlang then.
My code is largely based motivated on the code and explanation appearing in Joel’s excellent post on parsing binary text in Erlang which I adapted to suit my needs. Please bare with the lack of syntax highlighting:
1 2 3 4 5 6 7 8 9 10 11 12 | -module(simple_tokeniser).
-compile(export_all).
tokenise(Bin) ->
tokenise(Bin, [], []).
tokenise(<<$ , Rest/binary>>, Chars, Words) ->
tokenise(Rest, [], [lists:reverse(Chars)|Words]);
tokenise(<<Char, Rest/binary>>, Chars, Words) ->
tokenise(Rest, [Char|Chars], Words);
tokenise(<<>>, Chars, Words) ->
[lists:reverse(Chars)|Words]. |
Not so trivial I reckon (perhaps yes to the experienced Erlanger though!) so an explanation is in order. Lines 1 and 2 are telling the compiler that we will be working in the module called simple_tokeniser and that it should export everything in it, i.e. make all functions, etc. visible to the outside world. Then on line 4 we define a function tokenise that takes a single parameter which we denote as Bin. The result of this function actually depends on the value of yet another function also called tokenise but this time with three parameters (has the world gone mad?) It turns out to be that this is absolutely valid and that tokenise/1 and tokenise/3 are actually 2 different functions (the notation fn/arity is explained here).
This other tokenise/3 function is defined further down and we first meet it on lines 7 and 8. A nice feature of Erlang (and some other languages) is the fact that you can define your functions in a pattern-matching fashion so that when a parameter matches a particular pattern that particular definition of the function is invoked. In line 7 the pattern to match is
(<<$ , Rest/binary>>, Chars, Words)
that is an Erlang binary (see 2.4), anything which we match with Char and again anything which we match with Words (the first time tokenise/3 is called though Chars will be matched to the empty list [] and so will Words).
This is probably a good time to get back to a tangent point I tried to make earlier on: that a (programming) language is only worth learning if it teaches you to think differently. Taking this example as a base case, should I bother learning Perl? Not really as it surely provides a split method that behaves quite similarly to the one provided by Python? What about Ruby? Probably the same but smugger.
So what is it in Erlang that forces you to think about a problem differently? Well first of all the fact that there is no string data type. But most importantly that Erlang belongs to a very different school of thought from the one Python, Perl and others belong to. In this other school there are no for loops, no side effects and other rules you must abide to at all costs. It is customary for members of this school to require to do tasks in a different way so we comply.
As there are no for loops in Erlang (although there are equivalent constructs) it is fairly common to do recursion. For maximum efficiency tail recursion is always preferred by the Erlang compiler so accumulators are used. For example the following definition of factorial is not tail recursive:
factorial(0) ->
1;
factorial(N) ->
N * factorial(N - 1).because the value returned depends not only on factorial/1 itself but also on *. A proper tail recursive definition looks like:
factorial(N) ->
factorial(N, 0).
factorial(0, A) ->
A;
factorial(N, A) ->
factorial(N - 1, A * N).because the returned value depends ultimately on factorial/2 alone. But I digress. Let’s get back to the example. I said that in lines 7 and 8 we encountered the first definition of tokenise/3 in which we can see that we will use 2 accumulators: Chars and Words. In pseudo-code the algorithm could look like this:
input: string, Chars, Words while there are letters in the string do: if next_letter in string is whitespace: start again using string = rest of string, Chars = empty, Words = Words + reverse of Chars else: start again using string = rest of string, Chars = next_letter + Chars, Words = Words return Words
Lines 7 and 8 of the Erlang listing correspond to the first if in the pseudo-code and lines 9 and 10 to the else bit. How do we check that there are still more characters in the string we’re trying to parse? In lines 11 and 12 we’re stating that when the string is empty (<<>>) our job is done so we should append whatever is left in the characters accumulator and finish returning the list of words. Simple eh?
The only stone left in our shoe, in case you haven’t noticed yet, is that call to lists:reverse(Chars) we have in line 8. The reason behind that is that we are accumulating letters in a reversed fashion, i.e. appending them at the beginning of the Chars list. This is because appending elements to a list to its head is more efficient than appending elements to its tail and you will have to trust me on this one (and pretty much everybody else) however I can’t really tell you why as I don’t quite grok that myself yet.
In my next post I will look at a second example in which we’ll see the differences in building a Trie Tree in both Python and Erlang
Bonus code
What if we wanted to use a different token separator or several such as whitespace or newline (\n)? Well, the following code would do this for you, a trivial extension to is_separator/1 would be in order to add as many new separators as you like:
is_separator($ ) -> true;
is_separator($\n) -> true;
is_separator(_) -> false.
better_tokeniser(<<Char, Rest/binary>>, Chars, Words) ->
case is_separator(Char) of
true ->
better_tokeniser(Rest, [], [lists:reverse(Chars)|Words])
false ->
better_tokeniser(Rest, [Char|Chars], Words);
end;
better_tokeniser(<<>>, Chars, Words) ->
[lists:reverse(Chars)|Words].Enjoy!
UPDATE 7 Dec 2008: Paul Bonser recently pointed out that I could’ve simply used the Erlang standard library string:tokens/2 to parse tokens from a string. He is right, but moreover the comparison between the manual parsing of tokens in Erlang vs. string.split() in Python is fundamentally flawed and biased against Erlang. A more fair comparison would have been the manual parsing of tokens in Python like:
def simple_tokeniser(string, separators): results = [] acc = [] for c in string: if c in separators: results.append(acc) acc = [] else: acc.append(c) results.append(acc) # include whatever is left at the end return results
Thanks for pointing this out Paul!
Comments 2
You could also use string:tokens/2 from the standard library:
1> string:tokens(”foo bar\nbaz_bazzle”, ” \n_”).
["foo","bar","baz","bazzle"]
http://www.erlang.org/doc/man/string.html#tokens-2
Posted 06 Dec 2008 at 6:16 pm ¶@Paul
Spot on. I’m still new to Erlang so most BIFs are still strangers to me. I probably should’ve made more emphasis on the ‘tangential’ point of learning a new language only if it forces you to think and code differently.
Thanks for stopping by though
U
Posted 06 Dec 2008 at 7:49 pm ¶Post a Comment