Software
HunTag
sequential tagger for chunking, NER-tagging, etc.
Webcorpus Creator
pipeline for the rapid creation of large webcorpora
Hundict
bilingual dictionary extractor from parallel and comparable corpora
Wikt2dict
Wiktionary parser tool for many language editions
Multilingual dictionaries
The dictionaries presented here were built with four methods:
- Wiktionary extraction using Wikt2dict
- Triangulation over Wiktionary data using Wikt2dict
- Dictionary extraction from parallel corpora (Bible) using Hundict
- Dictionary extraction from comparable corpora (Wikipedia articles) using Hundict
We ran each method on at least 40 languages from the following list:
Arabic, Azerbaijani, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Kazakh, Korean, Latin, Limburgish, Lithuanian, Macedonian, Malagasy, Malay, Norwegian, Occitan, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese.
(All files are UTF-8 encoded.)
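
The triangulation method listed above can be illustrated roughly as follows: if a word in one language and a word in another both translate to the same word in a third (pivot) language, the two are proposed as a candidate translation pair. This is a minimal sketch of the general idea, not Wikt2dict's actual implementation; the function name `triangulate` and the toy word pairs are hypothetical.

```python
from collections import defaultdict


def triangulate(pairs_ab, pairs_bc):
    """Propose (a, c) candidate pairs that share at least one pivot word b.

    pairs_ab: iterable of (a, b) translation pairs (source -> pivot)
    pairs_bc: iterable of (b, c) translation pairs (pivot -> target)
    Returns a dict mapping (a, c) to the set of pivot words supporting it.
    """
    # Index the pivot -> target pairs by pivot word.
    by_pivot = defaultdict(set)
    for b, c in pairs_bc:
        by_pivot[b].add(c)

    # Join source -> pivot pairs against that index.
    candidates = defaultdict(set)
    for a, b in pairs_ab:
        for c in by_pivot.get(b, ()):
            candidates[(a, c)].add(b)
    return dict(candidates)


# Toy data (hypothetical): Hungarian-English and English-German pairs.
hu_en = [("kutya", "dog"), ("macska", "cat")]
en_de = [("dog", "Hund"), ("cat", "Katze")]

result = triangulate(hu_en, en_de)
for (hu, de), pivots in sorted(result.items()):
    print(hu, de, sorted(pivots))
```

Pairs supported by several independent pivots are usually more reliable than pairs supported by one, since a polysemous pivot word can link unrelated words; keeping the set of pivots per candidate makes that kind of filtering possible.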
Corpora
Webcorpora and frequency dictionaries
hunNERwiki
a silver standard corpus for Hungarian Named Entity Recognition
Misc
Multilingual HunSpell/HunStem/HunMorph resources
Wikipedia languages
4lang concept lexicon
Digital Language Death joined tsv data
Digital Language Death metadata
MBO raw list