This paper describes
the methods used in creating the following webcorpora and frequency
dictionaries.
NOTE: the numbers cited in the paper are outdated. The current sizes of the
individual corpora are listed in the tables below.

| Language | sorted by frequency | sorted alphabetically |
|---|---|---|
| catalan | X | X |
| croatian | X | X |
| czech | X | X |
| danish | X | X |
| dutch | X | X |
| finnish | X | X |
| indonesian | X | X |
| lithuanian | X | X |
| norwegian | X | X |
| polish | X | X |
| portuguese | X | X |
| romanian | X | X |
| serbian_sh | X | X |
| serbian_sr | X | X |
| slovak | X | X |
| spanish | X | X |
| swedish | X | X |

| Language | tokens (millions) | size (gzipped) | |
|---|---|---|---|
| catalan | 658 | 998M | X |
| croatian | 1491 | 2.7G | X |
| czech | 612 | 1.1G | X |
| danish | 496 | 816M | X |
| dutch | 1989 | 3.1G | X |
| finnish | 846 | 2.1G | X |
| indonesian | 310 | 539M | X |
| lithuanian | 1405 | 2.8G | X |
| norwegian | 1620 | 2.7G | X |
| polish | 1426 | 2.6G | X |
| portuguese | 963 | 1.9G | X |
| romanian | 1067 | 2.2G | X |
| serbian.sh | 2337 | 1.5G | X |
| serbian.sr | 845 | 176M | X |
| slovak | 862 | 2.1G | X |
| spanish | 1397 | 2.7G | X |
| swedish | 893 | 1.5G | X |