Software

HunTag
sequential tagger for chunking, NER-tagging, etc.

Webcorpus Creator
pipeline for the rapid creation of large webcorpora

Hundict
bilingual dictionary extractor from parallel and comparable corpora

Wikt2dict
Wiktionary parser tool for many language editions

Multilingual dictionaries

The dictionaries below were built with four methods:
  1. Wiktionary extraction using Wikt2dict
  2. Triangulating on Wiktionary data using Wikt2dict
  3. Dictionary extraction from parallel corpora (Bible) using Hundict
  4. Dictionary extraction from comparable corpora (Wikipedia articles) using Hundict
We ran each method on at least 40 languages from the following list: Arabic, Azerbaijani, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Kazakh, Korean, Latin, Limburgish, Lithuanian, Macedonian, Malagasy, Malay, Norwegian, Occitan, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese.
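
The triangulation method (item 2) can be illustrated with a minimal sketch. It assumes that a source-target pair is proposed whenever both words share a translation in a pivot language, and that the "confidence" of a pair is the number of distinct pivot words supporting it. That reading of the confidence value is an assumption made for illustration, not a description of how Wikt2dict computes it; the function names and the toy data are hypothetical as well.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Propose source-target pairs that share at least one pivot word.

    Returns a dict mapping (source_word, target_word) to the number of
    distinct pivot words linking them (the assumed "confidence").
    """
    support = defaultdict(set)
    for src_word, pivot_words in src_to_pivot.items():
        for pivot_word in pivot_words:
            for tgt_word in pivot_to_tgt.get(pivot_word, ()):
                support[(src_word, tgt_word)].add(pivot_word)
    return {pair: len(pivots) for pair, pivots in support.items()}


def filter_by_confidence(candidates, min_confidence):
    """Keep only pairs supported by at least min_confidence pivot words."""
    return {pair: n for pair, n in candidates.items() if n >= min_confidence}


# Toy data, invented for illustration only (not taken from the dictionaries).
hu_to_en = {"kutya": {"dog", "hound"}, "macska": {"cat"}}
en_to_de = {"dog": {"Hund"}, "hound": {"Hund", "Jagdhund"}, "cat": {"Katze"}}

candidates = triangulate(hu_to_en, en_to_de)
confident = filter_by_confidence(candidates, min_confidence=2)
print(confident)
```

In this toy example only the pair supported by two pivot words survives the threshold of 2. A higher threshold trades coverage for precision, which is presumably why the triangulated dictionaries are offered at two confidence levels.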

In one file                         By language pair
Wiktionary (1 file)                 Wiktionary
Triangulating (confidence=2)        Triangulating (conf 2)
Triangulating (confidence=5)        Triangulating (conf 5)
From parallel corpora (Bible)       By language pair
From comparable corpora (Wikipedia) By language pair
(All files are UTF-8 encoded.)
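
Because the files are UTF-8, it is safest to state the encoding explicitly when reading them. The sketch below assumes one entry per line with tab-separated fields. That layout is an assumption, since the file format is not described here, and `read_rows` is a hypothetical helper name.

```python
import csv
import io


def read_rows(stream):
    """Yield each non-empty line of a tab-separated stream as a list of fields."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# In-memory stand-in for a downloaded file (invented sample content).
sample = io.StringIO("kutya\tdog\nmacska\tcat\n")
rows = list(read_rows(sample))
print(rows)
```

For a real file, the stream would come from `open(path, encoding="utf-8", newline="")`, so that the platform's default encoding is never used.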

Corpora

Webcorpora and frequency dictionaries

hunNERwiki
a silver standard corpus for Hungarian Named Entity Recognition

Misc

Multilingual HunSpell/HunStem/HunMorph resources

Wikipedia languages

4lang concept lexicon

Digital Language Death joined tsv data

Digital Language Death metadata

MBO raw list