hunNERwiki: a silver standard corpus for Hungarian Named Entity Recognition

Supervised Named Entity Recognizers require large amounts of annotated text. Because manual annotation is expensive, reducing its cost is essential. To this end, we have created hunNERwiki, a silver standard corpus for Hungarian Named Entity Recognition (NER). The corpus was generated automatically from the Hungarian Wikipedia, using the entity categorization of DBpedia. At 19 108 597 tokens, it is by far the largest Hungarian NER training corpus; by comparison, the largest gold standard corpus for Hungarian, the Criminal NE corpus, comprises 562 822 tokens. The corpus uses the Szeged NER annotation scheme.

Our NER system, trained and tested on hunNERwiki, achieved an F-score of 89.76%. For reference, when trained and tested on the Szeged NER corpus, the same system reached 94.5%.
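
For readers unfamiliar with the metric, the sketch below shows how a balanced F-score is commonly computed from entity-level counts. It assumes the usual F1 definition (the harmonic mean of precision and recall); the exact evaluation procedure behind the figures above is described in the referenced papers, not here. The function name `f_score` and the example counts are hypothetical and chosen only for illustration.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score (F1): the harmonic mean of precision and recall.

    Assumption: counts are taken at the entity level, i.e. a predicted
    entity is a true positive only if both its span and its type match.
    """
    predicted = true_positives + false_positives  # entities the system proposed
    gold = true_positives + false_negatives       # entities in the reference annotation
    if true_positives == 0 or predicted == 0 or gold == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not taken from the papers).
print(f"{100 * f_score(90, 10, 10):.2f}%")  # prints 90.00%
```

With equal numbers of false positives and false negatives, precision and recall coincide, so the F-score equals both of them.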

Data Files

Download the corpora from here:

Reference

If you use this corpus in your academic work, please include a reference to the following papers:

Paper	Download
Eszter Simon, Dávid Márk Nemeskey. 2012. Automatically generated NE tagged corpora for English and Hungarian. In: Proceedings of the 4th Named Entity Workshop (NEWS) 2012, pages 38--46. Jeju, Korea, July 2012. Association for Computational Linguistics.	article	bibtex
Nemeskey Dávid Márk, Simon Eszter. 2013. Automatikus korpuszépítés tulajdonnév-felismerés céljára. In: IX. Magyar Számítógépes Nyelvészeti Konferencia, pages 106--117. Szeged, Hungary, January 2013.	article

Licensing

In accordance with the Wikipedia copyright policy, the hunNERwiki corpus is available under the Creative Commons Attribution-Sharealike 3.0 Unported License. A summary of the license is available here.