"We can only see a short distance ahead, but we can see plenty there that needs to be done" (Alan Turing)

Software and tools

Automatic Language Detection...

Download application
Language: English
Supported systems: Windows XP/Vista/7

The idea...

Automatic language identification is not only a tool in itself; it is also incorporated into other natural language processing systems. You may have encountered such a tool when using Google Translate or even a word processor.

My motivation for designing such a tool is two-fold:

1) How hard is it really to implement such a system?

2) It is a very good basis for other tools.

To be clear, we are talking about the Indo-European language family: languages such as French, Spanish, German, English, Italian and Russian, about 439 in number.

The other language group discussed here is made up of Chinese and its dialects and varieties. The major difference between the two families is the writing system: while the first uses alphabets based on letters, Chinese uses characters rather than an alphabet. Chinese also lacks inflections, and tense and gender are expressed through word order.

Of course, other language families could be mentioned, but my point is to highlight the differences that arise when building a language detection system.

With that in mind, one can easily realize that writing a general language detection tool is not as easy as writing one for English alone.

The mechanics of the application...

There are two approaches when building such a tool:

1) Linguistic

2) Probabilistic

The linguistic method uses certain 'constructs' to identify the language. It is usually not general, and some languages may be identified incorrectly. One might think that a language could be identified by the presence or absence of certain words: for instance, take the 20 most frequent words from each language and build a system in which the computer matches each word of the text to be recognized against those lists. Let's take the Romanian and French languages.

In both languages, the word 'un' is an article, so it would appear in each 20-word list. One can easily see that such collisions decrease the accuracy of this method.
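A minimal sketch of this word-list approach makes the collision concrete. The frequency sets below are illustrative examples, not real corpus data, and the function names are my own:

```python
# Illustrative (made-up) top-frequency word sets, not real corpus statistics.
FREQUENT_WORDS = {
    "french":   {"le", "la", "de", "un", "et", "les", "des", "en"},
    "romanian": {"de", "un", "si", "la", "cu", "ce", "nu", "este"},
}

def score_by_word_lists(text):
    """Count how many tokens of the text appear in each language's list."""
    tokens = text.lower().split()
    return {lang: sum(t in words for t in tokens)
            for lang, words in FREQUENT_WORDS.items()}

# Words like 'un', 'la' and 'de' sit in both lists, so a short text made
# of shared words scores identically for both languages and cannot be
# classified reliably.
scores = score_by_word_lists("un de la")
```

Here `scores` comes out equal for French and Romanian, which is exactly the ambiguity described above.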

The probabilistic method is general: it uses an n-gram model. n-grams are widely used in computational linguistics for text processing. An n-gram is a contiguous sequence of n items from a given text or speech; the items can be letters, words, sentences etc.

An n-gram model is a probabilistic language model for predicting the next item in a sequence from the previous n-1 items. For instance, take a sentence like:

I go to the cinema everyday.

As a bigram model, the sentence would be broken up into 7 bigrams:

1-'Begin_I' 2-'I_go' 3-'go_to' 4-'to_the' 5-'the_cinema' 6-'cinema_everyday' 7-'everyday_End'
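The splitting above can be sketched in a few lines of Python; note that pairing every adjacent token, including 'everyday', yields seven bigrams:

```python
def word_bigrams(sentence):
    """Split a sentence into word bigrams, padded with Begin/End markers."""
    tokens = ["Begin"] + sentence.rstrip(".").split() + ["End"]
    # Pair each token with its successor.
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

# word_bigrams("I go to the cinema everyday.")
# → ['Begin_I', 'I_go', 'go_to', 'to_the', 'the_cinema',
#    'cinema_everyday', 'everyday_End']
```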

The system was trained for each language on various texts, and a bigram language model was implemented. For the application to work properly, at least 200 characters have to be typed in.
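The text does not describe the exact implementation, so here is a hedged sketch of one common way such a detector is built: per-language character-bigram counts with add-one smoothing, scored by log-probability. The training samples, function names and smoothing choice are my own illustration, not the application's actual code:

```python
import math
from collections import Counter

def char_bigrams(text):
    """All overlapping two-character sequences of the text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """Build a bigram count table per language.

    samples: dict mapping language name -> training text.
    """
    models = {}
    for lang, text in samples.items():
        counts = Counter(char_bigrams(text))
        models[lang] = (counts, sum(counts.values()), len(counts) + 1)
    return models

def detect(models, text):
    """Return the language whose model gives the text the highest score."""
    best_lang, best_score = None, float("-inf")
    for lang, (counts, total, vocab) in models.items():
        # Add-one smoothing so unseen bigrams do not zero out the score.
        score = sum(math.log((counts[bg] + 1) / (total + vocab))
                    for bg in char_bigrams(text))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

In practice the training texts would be far larger; this is also why the application asks for at least 200 characters of input, since very short texts give too few bigrams for the scores to separate clearly.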

Currently, the application supports the following languages: English, French, Romanian, Spanish and German.

If you have any questions, comments or need some help, please feel free to contact me.

Application preview...