Detecting Luxembourgish using a spam filter and Wikipedia

How the langdetect library can easily be extended to accurately recognise Luxembourgish text.

[Screenshot: a Luxembourgish post reading "Schweiere Otem"]

Language detection tools tend to ignore Luxembourgish; instead, it usually gets detected as German or Dutch.

"Translate from German"
Translate from what?

Detecting the language of a piece of text isn’t straightforward. Do you use a dictionary, and choose the language with the fewest mistakes? This naive solution quickly runs into problems. Luxembourgish spelling is notoriously variable, and Luxembourgish texts tend to mix in many other languages: at least six in the Wikipedia article on Belval, for example. Luxembourgish is also a living language, with a constant stream of new words: the dictionary approach won’t recognise Frangipanestaart or pétitiounsmidd.
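To make that naive idea concrete, here is a toy sketch in Python – the word lists are invented and absurdly small, purely for illustration:

# Toy dictionaries; real ones would hold tens of thousands of words.
dictionaries = {
    "lb": {"ech", "hunn", "dat", "net", "fir", "mat"},
    "de": {"ich", "habe", "das", "nicht", "für", "mit"},
}

def detect_by_dictionary(text):
    # Pick the language whose dictionary recognises the most words,
    # i.e. the one that makes the fewest "mistakes".
    words = text.lower().split()
    return max(dictionaries,
               key=lambda lang: sum(w in dictionaries[lang] for w in words))

print(detect_by_dictionary("ech hunn dat net fir dech gemaach"))  # 'lb'
# ...but Frangipanestaart is in neither word list, so novel or mixed
# vocabulary quickly makes the scores meaningless.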

Bayesian filters are the classic machinery behind spam filtering. By training one with bits of Luxembourgish words instead of spam, we can classify text by its probability of Luxembourgishness without having seen every word that exists, just as you’ll recognise Dutch or Italian even if you don’t speak them fluently.
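As a rough sketch of the idea – this is not how langdetect is implemented internally, and the training snippets below are invented – a Bayesian classifier over character trigrams can look like this:

from collections import Counter
import math

def trigrams(text):
    text = " " + text.lower() + " "
    return [text[i:i+3] for i in range(len(text) - 2)]

# Tiny made-up training snippets; a real filter would be fed whole corpora.
corpora = {
    "lb": "moien wéi geet et dir ech schwätzen haut zu lëtzebuerg iwwer d'wieder",
    "de": "guten tag wie geht es dir ich spreche heute in deutschland über das wetter",
}
counts = {lang: Counter(trigrams(text)) for lang, text in corpora.items()}

def detect_naive(text):
    # Score each language by the smoothed log-probability its trigram model
    # assigns to the text, the same arithmetic a spam filter does with words.
    scores = {}
    for lang, freq in counts.items():
        total, vocab = sum(freq.values()), len(freq) + 1
        scores[lang] = sum(math.log((freq[t] + 1) / (total + vocab))
                           for t in trigrams(text))
    return max(scores, key=scores.get)

print(detect_naive("wéi geet et dir haut"))  # 'lb'

langdetect applies the same principle, with per-language profiles of character 1- to 3-grams built from much larger corpora.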

The open source language-detection library uses Bayesian filters to detect 55 different languages. Originally written in Java and hosted on Google Code, it was later ported to Python as langdetect, which I’ll use here. It doesn’t detect Luxembourgish out of the box, but it’s easy to add new language profiles to it.
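You can check this for yourself once langdetect is installed (the install command is further down): the Python port keeps its bundled profiles in langdetect.detector_factory.PROFILES_DIRECTORY, and there is no lb among them.

>>> import os
>>> from langdetect.detector_factory import PROFILES_DIRECTORY
>>> 'lb' in os.listdir(PROFILES_DIRECTORY)
False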

To train a Bayesian filter, you need to feed it a significant amount of data. The Luxembourgish Wikipedia is a rich, diverse corpus of modern Luxembourgish, and it can be downloaded in its entirety as open data – in short, it is ideal for training our filter.

Other possible candidates would be the recent lod.lu open data export, or chd.lu if it provided an open data archive of its transcripts. I haven’t tried those, but I suspect that the variability of the spelling and the diversity of the subjects and domains of the Luxembourgish Wikipedia actually make it a better training corpus.

I have already generated a downloadable Luxembourgish language profile which you can use with the Java library or its Python port. Reproducing my build sequence is simple, and it can easily be adapted to other languages.

Download langdetect.jar and the latest lbwiki-*-abstract.xml, and run the tool on the dump:

mkdir -p abstracts/profiles
wget https://github.com/shuyo/language-detection/raw/master/lib/langdetect.jar
wget -P abstracts https://dumps.wikimedia.org/lbwiki/20170701/lbwiki-20170701-abstract.xml
java -jar langdetect.jar --genprofile -d abstracts lb

After a few seconds, you will see something like lb:19413, and have a brand new abstracts/profiles/lb file.
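If you are wondering what such a profile actually contains: it is a plain JSON document of character n-gram frequencies, in the format used by the language-detection tools. A quick peek (the field names below are those of that profile format):

python3 - <<'EOF'
import json
with open('abstracts/profiles/lb') as f:
    profile = json.load(f)
# 'name' is the language code, 'freq' maps character 1- to 3-grams to counts,
# 'n_words' holds the total number of n-grams seen per n-gram length.
print(profile['name'], len(profile['freq']), 'distinct n-grams')
EOF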

It’s then very easy to add that profile to the Python langdetect library, for example:

python3 -m venv venv; source venv/bin/activate; pip install langdetect
cp abstracts/profiles/lb venv/lib/python3*/site-packages/langdetect/profiles/
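If the glob above doesn’t match your layout, you can instead ask langdetect where its profiles live (the same PROFILES_DIRECTORY as earlier) and copy the file there, from inside the activated venv:

python3 -c "import shutil; from langdetect.detector_factory import PROFILES_DIRECTORY; shutil.copy('abstracts/profiles/lb', PROFILES_DIRECTORY)"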

You can also simply use my already generated Luxembourgish language profile – download it, rename it to lb, and place it in the right folder.

That’s it! You can now pretty accurately detect Luxembourgish!

$ python
[…]
>>> from langdetect import detect
>>> detect("War doesn’t show who’s right, just who’s left.")
'en'
>>> detect("D’Aart ass an Europa, Asien an Afrika wäit verbreet. A munneche Regiounen ass de Mitock seele ginn")
'lb'
>>> detect("Renert oder de Fûuss am Frack an a Mansgre’sst. Op en Neis photographe’ert vun engem Letzebreger.")
'lb'
>>> detect("heinsdo wan de stress an da scheul oda am job ze krass sin, muss en einfa mol fochtfurn an d sèil bauml losn vie sj zerhueln")
'lb'

Of course, it isn’t perfect, and fails on extreme examples of short, badly mangled text:

>>> from langdetect import detect_langs
>>> detect_langs("seugue di krastn wessnshaflta moosn choohet fun sachen an kevin")
[en:0.5714275487707763, nl:0.28571371508809157, lb:0.14285852207676758]
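Note also that langdetect’s algorithm is non-deterministic, so borderline inputs like this one can give different results between runs; the library’s README recommends pinning the seed if you need reproducible output:

>>> from langdetect import DetectorFactory
>>> DetectorFactory.seed = 0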

I’d be curious to know whether, and how, this is useful to anyone. Please leave a comment.

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
