sequential tagger for chunking, NER-tagging, etc.

Webcorpus Creator
pipeline for the rapid creation of large webcorpora

Hundict
bilingual dictionary extractor from parallel and comparable corpora

Wikt2dict
Wiktionary parser tool for many language editions

Multilingual dictionaries

Here we present dictionaries built in various ways:
  1. Wiktionary extraction using Wikt2dict
  2. Triangulating on Wiktionary data using Wikt2dict
  3. Dictionary extraction from parallel corpora (Bible) using Hundict
  4. Dictionary extraction from comparable corpora (Wikipedia articles) using Hundict
We ran each method on at least 40 languages from the following list: Arabic, Azerbaijani, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Kazakh, Korean, Latin, Limburgish, Lithuanian, Macedonian, Malagasy, Malay, Norwegian, Occitan, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese.

In one file                          By language pair
Wiktionary (1 file)                  Wiktionary
Triangulating (confidence=2)         Triangulating (conf 2)
Triangulating (confidence=5)         Triangulating (conf 5)
From parallel corpora (Bible)        By language pair
From comparable corpora (Wikipedia)  By language pair
(Every file is UTF-8.)
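
The triangulation method listed above can be illustrated with a minimal sketch. It assumes that the confidence of a proposed translation pair is the number of distinct pivot words that link the two words; this is an interpretation of the "confidence=2" and "confidence=5" labels, not a description of how Wikt2dict actually computes them. The function name `triangulate` and the `(language, word)` tuple format are hypothetical.

```python
from collections import defaultdict


def triangulate(pairs, min_confidence=2):
    """Propose new translation pairs by pivoting through shared translations.

    pairs: iterable of ((lang, word), (lang, word)) known translations.
    Returns {(source, target): confidence} for pairs not already known,
    where confidence is the number of distinct pivot words (an assumption).
    """
    # Build an undirected translation graph from the known pairs.
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Two words in different languages that share a neighbor are linked
    # through that neighbor; collect every such pivot per candidate pair.
    pivots = defaultdict(set)
    for pivot, linked in neighbors.items():
        for src in linked:
            for tgt in linked:
                if src < tgt and src[0] != tgt[0]:
                    pivots[(src, tgt)].add(pivot)

    # Keep only new pairs supported by enough independent pivots.
    return {
        pair: len(found)
        for pair, found in pivots.items()
        if len(found) >= min_confidence and pair[1] not in neighbors[pair[0]]
    }


known = [
    (("en", "dog"), ("hu", "kutya")),
    (("en", "dog"), ("de", "Hund")),
    (("fr", "chien"), ("hu", "kutya")),
    (("fr", "chien"), ("de", "Hund")),
]
candidates = triangulate(known, min_confidence=2)
```

In this toy input, the German and Hungarian words are never paired directly, but both are linked to the English and the French word, so the pair is proposed with two supporting pivots. Raising `min_confidence` trades coverage for precision, which would be consistent with the two thresholds offered in the downloads.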


Webcorpora and frequency dictionaries

a silver standard corpus for Hungarian Named Entity Recognition


Multilingual HunSpell/HunStem/HunMorph resources

Wikipedia languages

4lang concept lexicon

Digital Language Death joined tsv data

Digital Language Death metadata

MBO raw list