From the very beginning, the Swiss Text Corpus was structured to cover the vocabulary of 20th-century Standard German in Switzerland as broadly as possible. The corpus consists of printed and typewritten texts in a wide variety of production and publication forms. It is balanced according to formal, temporal and content criteria:
- Text class: formal criterion
- Quarter of century: time criterion
- Domain: content criterion
With this structure, the Swiss Text Corpus serves as a balanced data resource for a wide range of linguistic research questions.
The Swiss Text Corpus contains the following amounts of text (according to the criteria mentioned above):
text class | 1900-1924 d | 1900-1924 w | 1925-1949 d | 1925-1949 w | 1950-1974 d | 1950-1974 w | 1975-1999 d | 1975-1999 w | 2000-2018 d | 2000-2018 w | total d | total w |
functional texts | 1'042 | 1'122'547 | 1'465 | 1'235'998 | 969 | 1'165'808 | 1'417 | 1'036'198 | 1'238 | 944'778 | 6'131 | 5'505'329 |
factual texts | 167 | 1'447'644 | 433 | 2'043'191 | 804 | 1'943'462 | 276 | 1'846'198 | 898 | 985'400 | 2'578 | 8'265'832 |
journalistic texts | 833 | 501'527 | 1'107 | 1'006'662 | 993 | 970'560 | 1'929 | 1'117'639 | 1'267 | 973'282 | 6'129 | 4'569'670 |
fiction | 188 | 1'116'823 | 50 | 1'248'864 | 159 | 1'122'446 | 59 | 1'147'943 | 40 | 942'760 | 496 | 5'578'836 |
total | 2'230 | 4'188'541 | 3'055 | 5'534'715 | 2'925 | 5'202'276 | 3'681 | 5'147'978 | 3'443 | 3'845'700 | 15'334 | 23'919'667 |
d = documents
w = words (tokens minus punctuation characters)
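
The table's figures use an apostrophe as the thousands separator, so they have to be normalised before any arithmetic. The following is a minimal sketch, not part of the corpus tooling, that parses the "total" columns of the table and derives the average document length per text class. The names `parse_count`, `TOTALS` and `mean_words_per_document` are hypothetical and chosen only for this illustration.

```python
def parse_count(text: str) -> int:
    """Convert a figure such as "1'122'547" to an integer."""
    return int(text.replace("'", ""))


# "total" columns copied from the table above: (documents, words) per text class.
TOTALS = {
    "functional texts": ("6'131", "5'505'329"),
    "factual texts": ("2'578", "8'265'832"),
    "journalistic texts": ("6'129", "4'569'670"),
    "fiction": ("496", "5'578'836"),
}


def mean_words_per_document(totals: dict) -> dict:
    """Return the average number of words per document for each text class."""
    return {
        text_class: parse_count(words) / parse_count(docs)
        for text_class, (docs, words) in totals.items()
    }


if __name__ == "__main__":
    for text_class, mean in mean_words_per_document(TOTALS).items():
        print(f"{text_class}: {mean:.0f} words per document")
```

Read this way, the totals show that the text classes differ far more in document count than in word count: fiction contributes the fewest documents but by far the longest ones on average.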