This project is a collection of scripts and programs for creating a web corpus
from crawled data. The input data is extracted by the Wire crawler, and the output is a text file containing raw text with document separators.
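To illustrate how output in this shape might be consumed downstream, here is a minimal sketch that splits a separator-delimited text stream into individual documents. The separator string `<<<DOC>>>` and the function name `iter_documents` are assumptions made for this example only; the actual marker used by Webcorpus Creator is described in the repository documentation and may differ.

```python
from typing import Iterable, Iterator, List

# Hypothetical separator line, chosen for illustration only. Check the
# project documentation for the marker the tool really writes.
ASSUMED_SEPARATOR = "<<<DOC>>>"


def iter_documents(
    lines: Iterable[str], separator: str = ASSUMED_SEPARATOR
) -> Iterator[str]:
    """Yield the raw text of each document in a separator-delimited stream."""
    buffer: List[str] = []
    for line in lines:
        if line.rstrip("\n") == separator:
            # A separator closes the document collected so far.
            text = "".join(buffer).strip()
            if text:
                yield text
            buffer = []
        else:
            buffer.append(line)
    # Emit the last document, which has no separator after it.
    text = "".join(buffer).strip()
    if text:
        yield text
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `for doc in iter_documents(open(path, encoding="utf-8"))`, so that a large corpus is processed one document at a time instead of being loaded into memory.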
Download and documentation
The Webcorpus Creator is hosted on GitHub. Documentation is included in the downloadable repository.
Webcorpus Creator was written by Attila Zséder and Dániel Varga.
Webcorpus Creator is made available under the GNU Lesser General Public License.
If you use the tool, please cite this paper.