PolDi – a Polish Diachronic Online Corpus
PolDi is a collection of texts from the history of the Polish language, made accessible online for linguistic research. The corpus has been experimentally annotated morphosyntactically using the modern Polish tagger Morfeusz (Saloni, Gruszczyński, Woliński and Wołosz 2011). Part of the material was also annotated manually with syntactic information relevant to the DFG project "Corpus linguistics and diachronic syntax: Grammaticalization of non-canonical subjects in Slavonic languages": subtypes of null subjects, subtypes of reflexive verb forms, passives and -no/-to forms. All textual content is given in normalized orthography.
Most of the texts in PolDi were transcribed and/or digitized by external contributors, notably by the Instytut Języka Polskiego of the Polish Academy of Sciences in Kraków, and by Gerd Hentschel (Oldenburg) and former colleagues from Göttingen. Thanks are also due to Rafał Górski and Thomas Menzel for advice, for making contacts and for supporting our requests. We gratefully acknowledge all these efforts, and hope that the way in which the texts are presented here will be useful for everybody. Note that these and more Old Polish texts from the IJP in Kraków, which are also available online as XML files, were re-published in a nicely readable form on DVD: Biblioteka zabytków polskiego piśmiennictwa średniowiecznego. Edycja elektroniczna. Instytut Języka Polskiego PAN, Kraków 2006.
On our part in Regensburg,
- Arek Danszczyk, Björn Hansen, Thomas Menzel, and Roland Meyer discussed and decided upon the selection of texts;
- Arek Danszczyk was responsible for textual encoding and bibliographical research, checked for errors and reliability of the editions, and added structural markup;
- Roland Meyer supervised the encoding, wrote the necessary convertors into GATE and from GATE to PAULA, integrated Morfeusz into GATE, did the manual annotation of "subject" categories, set up the ANNIS-2 database/web interface, and maintains and further develops PolDi.
At the present stage, the following texts are online:
- Modlitewnik Nawojki [Naw], 1st h. 15th c.
- Kazania gnieźnieńskie [Gn], 1st h. 15th c.
- Ewangeliarz Zamojskich [EwZam], 2nd h. 15th c.
- Modlitwy Wacława [MW], 1482
- Żywot świętego Błażeja [ZywBlaz], 1st h. 16th c.
- Jewłaszewski: Pamiętnik [PamJewl], 2nd h. 16th c.
- Konstytucja 3 maja [konstytucja], 1791
This list is updated continuously as the SaltNPepper convertor processes more texts. Forty texts are ready for integration and will be available for querying soon.
The purpose of devising PolDi within our current project was to do research into the diachronic development of various subtypes of null subjects and reflexive constructions in Polish. This requires quite deep annotation, as it depends on formal, but also semantic and textual (coreferential) information. We planned to combine the mature automatic tools available for modern Polish with a good deal of manual annotation, at least for larger excerpts, as has been common in diachronic corpus linguistics (cf. the Helsinki Corpus). In our experience, the annotation process should be maximally flexible as to the forging of "shortcuts", such as (regular expression) rules and easy replacements over annotations; but, at the same time, it should be restrictive as to the available feature values and manual input routines, in order to avoid typing errors and the like. In the ideal case, external automatic annotators (e.g., taggers) can be easily applied at any point without breaking the current annotation. A tool which combines these properties in a very convenient way is GATE. (Another one would be UIMA.)
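The two requirements named here, flexible rule-based shortcuts and a restricted set of feature values, can be illustrated outside GATE with a small Python sketch. This is not GATE or JAPE code and not the procedure actually used for PolDi; the feature name `verbform`, its value set, and the token dictionaries are assumptions made for illustration only.

```python
import re

# Hypothetical closed value set for a hypothetical "verbform" feature.
ALLOWED_VERBFORM_VALUES = {"no_to", "passive", "reflexive"}


def set_feature(annotation, feature, value, allowed):
    """Set a feature, rejecting any value outside the closed value set."""
    if value not in allowed:
        raise ValueError(f"{value!r} is not an allowed value for {feature!r}")
    annotation[feature] = value


def apply_shortcut(tokens, pattern, feature, value, allowed):
    """Assign `value` to every token whose form fully matches `pattern`.

    Returns the number of tokens that were annotated.
    """
    regex = re.compile(pattern)
    changed = 0
    for tok in tokens:
        if regex.fullmatch(tok["form"]):
            set_feature(tok, feature, value, allowed)
            changed += 1
    return changed


# Illustrative tokens (assumed, not taken from the corpus).
tokens = [{"form": "czytano"}, {"form": "dom"}, {"form": "zrobiono"}]
changed = apply_shortcut(
    tokens, r".+(no|to)", "verbform", "no_to", ALLOWED_VERBFORM_VALUES
)
```

A form-based rule like this one overgenerates (it would also match nouns ending in -no or -to), so in practice it can only propose candidates for manual checking. The closed value set is what catches typing errors: a misspelled value raises an error instead of silently entering the annotation.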
We used GATE for the whole annotation process. Morfeusz was integrated as an external "Generic Tagger", supplemented by some tricky JAPE rules for postprocessing. The main issues were (i) that Morfeusz outputs all possible lemma/tag annotations for a given token (not only a single one), and (ii) that it uses a linguistically correct, but non-orthographic tokenisation for the clitic auxiliaries (-śmy, -ście, -by- etc.). In the present version of PolDi, all possible modern Polish tags according to Morfeusz are provided in the "tag" annotation tier, separated by |. The same holds for the possible lemmas, given in the "lemma" annotation tier.
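For users who export query results, the |-separated alternatives in the "tag" and "lemma" tiers can be unpacked with a few lines of Python. The following is a minimal sketch: the token dictionary and its values are illustrative assumptions, and the helper `has_pos` assumes that the part-of-speech label is the first colon-separated field of each tag. No pairing between the n-th lemma and the n-th tag is assumed, since the text above does not state one.

```python
def split_alternatives(value, sep="|"):
    """Split a tier value such as 'a|b|c' into a list of alternatives."""
    return [part for part in value.split(sep) if part]


def has_pos(tag_value, pos):
    """True if any alternative tag begins with the given part-of-speech label."""
    return any(
        tag.split(":")[0] == pos for tag in split_alternatives(tag_value)
    )


# Illustrative token; these values are assumptions, not corpus data.
token = {
    "tok": "mam",
    "lemma": "mama|mieć|mamić",
    "tag": "subst:pl:gen:f|fin:sg:pri:imperf|impt:sg:sec:imperf",
}

lemmas = split_alternatives(token["lemma"])
tags = split_alternatives(token["tag"])
is_possibly_finite = has_pos(token["tag"], "fin")
```

Because the tagger output is not disambiguated, a check like `has_pos` only tells you that a reading is possible for a token, not that it is the correct one in context.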
GATE uses a standoff-XML format in one large file per text. This is already close to PAULA, the input format for ANNIS-2, but some conversion is still necessary. A tool which came in handy was the Exporter from GATE devised by the American National Corpus. Its output consists of XML elements for the GATE annotation types, together with the respective spans on the token baseline, and annotation feature values. From this, we convert further into the EXMARaLDA format with the help of a small Python program, which orders the information in tiers and relabels annotation features according to a configurable specification. EXMARaLDA XML can be processed automatically by the SaltNPepper convertor into the respective ANNIS-2 database tables.
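The tier-building step of such a conversion might look roughly as follows. This is a sketch under stated assumptions, not the program used for PolDi: the input is taken to be already parsed into dictionaries with `type`, `start`, `end` and `features`, the specification `TIER_SPEC` and all annotation and feature names in it are hypothetical, and serialization to EXMARaLDA XML is left out.

```python
from collections import defaultdict

# Hypothetical relabeling specification:
# (annotation type, feature name) -> name of the output tier.
TIER_SPEC = {
    ("Token", "string"): "tok",
    ("Token", "lemma"): "lemma",
    ("Token", "category"): "tag",
    ("Subject", "kind"): "subject",
}


def build_tiers(annotations, spec):
    """Group annotation features into tiers of (start, end, value) events.

    Features that are not listed in `spec` are dropped.
    """
    tiers = defaultdict(list)
    for ann in annotations:
        for feature, value in ann["features"].items():
            tier = spec.get((ann["type"], feature))
            if tier is None:
                continue  # feature not configured for export
            tiers[tier].append((ann["start"], ann["end"], value))
    # Order the events of each tier along the token baseline.
    for events in tiers.values():
        events.sort(key=lambda event: (event[0], event[1]))
    return dict(tiers)


# Illustrative input; spans are offsets on the token baseline.
annotations = [
    {"type": "Subject", "start": 0, "end": 2, "features": {"kind": "null"}},
    {"type": "Token", "start": 1, "end": 2,
     "features": {"string": "b", "lemma": "b", "internal": "x"}},
    {"type": "Token", "start": 0, "end": 1,
     "features": {"string": "a", "lemma": "a"}},
]
tiers = build_tiers(annotations, TIER_SPEC)
```

Keeping the mapping in a separate specification means that tiers can be renamed or excluded without touching the conversion logic, which is the point of making the relabeling configurable.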
ANNIS-2 is the database and web interface of choice for this type of corpus. Its main purpose is to visualize and make queryable "complex multilevel linguistic corpora with diverse types of annotation".
Why so complicated?
First of all, because the desired annotation itself is complicated. It applies to overlapping, but non-identical units at several levels. It represents, within the corpus and visibly for everybody, the kind of linguistic categorisations that used to be hidden in old-fashioned card files, but also in personal electronic databases. Secondly, because the annotation process is difficult, depending on a flexible combination of automatic, semi-automatic, and fully manual steps. And finally, because the procedure described will pay off in the long run. It is strong enough to allow for future extensions, notably into real syntactic annotation.
PolDi's corpus composition and all technical aspects of the project are discussed in more detail in Roland Meyer's 2011 habilitation thesis (ch. 2), available upon request from the address below.
At present, experimental access is provided to some texts (login poldi/poldi). This will change as soon as all 40 texts are online. If you would like to use PolDi (for research purposes only), please print out and sign the license agreement, and return it to us by fax or scan it in and send it by email. You will then receive credentials to access the full corpus.