This paper describes the methods used to create the following web corpora and frequency dictionaries.

NOTE: the numbers cited in the paper are outdated. The current sizes of the individual corpora are listed in the tables below.

Language | sorted by frequency | sorted alphabetically |
---|---|---|
catalan | X | X |
croatian | X | X |
czech | X | X |
danish | X | X |
dutch | X | X |
finnish | X | X |
indonesian | X | X |
lithuanian | X | X |
norwegian | X | X |
polish | X | X |
portuguese | X | X |
romanian | X | X |
serbian_sh | X | X |
serbian_sr | X | X |
slovak | X | X |
spanish | X | X |
swedish | X | X |

Language | tokens (millions) | size (gzipped) | |
---|---|---|---|
catalan | 658 | 998M | X |
croatian | 1491 | 2.7G | X |
czech | 612 | 1.1G | X |
danish | 496 | 816M | X |
dutch | 1989 | 3.1G | X |
finnish | 846 | 2.1G | X |
indonesian | 310 | 539M | X |
lithuanian | 1405 | 2.8G | X |
norwegian | 1620 | 2.7G | X |
polish | 1426 | 2.6G | X |
portuguese | 963 | 1.9G | X |
romanian | 1067 | 2.2G | X |
serbian_sh | 2337 | 1.5G | X |
serbian_sr | 845 | 176M | X |
slovak | 862 | 2.1G | X |
spanish | 1397 | 2.7G | X |
swedish | 893 | 1.5G | X |