Webcorpus Creator

This project is a collection of scripts and programs for creating a webcorpus from crawled data. The input is data extracted by the Wire crawler; the output is a text file containing the raw text of each document, with separators between documents.
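
To illustrate how such an output file might be consumed, here is a minimal Python sketch that splits a separator-delimited text file into documents. It is not part of Webcorpus Creator. The separator string `<DOC>` and the function name `iter_documents` are assumptions made for this example; the actual separator format is described in the repository documentation.

```python
from typing import Iterable, Iterator

# Hypothetical separator line, chosen only for illustration.
# The real marker written by Webcorpus Creator may differ.
DOC_SEPARATOR = "<DOC>"


def iter_documents(lines: Iterable[str],
                   separator: str = DOC_SEPARATOR) -> Iterator[str]:
    """Yield the raw text of each document from separator-delimited lines."""
    buffer = []
    for line in lines:
        if line.rstrip("\r\n") == separator:
            # A separator line closes the document collected so far.
            text = "".join(buffer).strip()
            if text:
                yield text
            buffer = []
        else:
            buffer.append(line)
    # Emit the last document if the file does not end with a separator.
    text = "".join(buffer).strip()
    if text:
        yield text
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so a large corpus is read one line at a time instead of being loaded into memory at once.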

Download and documentation

The Webcorpus Creator is hosted on GitHub. Documentation is included in the downloadable repository.


Webcorpus Creator was written by Attila Zséder and Dániel Varga.


Webcorpus Creator is made available under the GNU Lesser General Public License v3.0.


If you use the tool, please cite this paper.