Digital Scholarship in the Humanities
This article motivates and details the first implementation of a freely available part-of-speech tag set and tagger for Coptic. Coptic is the last phase of the Egyptian language family and a descendant of the hieroglyphs of ancient Egypt. In contrast to classical Greek and Latin, few resources for digital and computational work have existed for ancient Egyptian language and literature until now. We evaluate our tag set in an inter-annotator agreement experiment and examine some of the difficulties in tagging Coptic data. Using an existing digital lexicon and a small training corpus taken from several genres of literary Sahidic Coptic in the first half of the first millennium, we evaluate the performance of a stochastic tagger applying a fine-grained and a coarse-grained set of tags, both within and outside the domain of literary texts. Our results show that a relatively high accuracy of 94–95% correct automatic tag assignment can be reached for literary texts, with substantially worse performance on documentary papyrus data. We also present some preliminary applications of natural language processing to the study of genre, style, and authorship attribution in Coptic and discuss future directions in applying computational linguistics methods to the analysis of Coptic texts.
Schroeder, C. T.,
Computational Methods for Coptic: Developing and Using Part-of-Speech Tagging for Digital Scholarship in the Humanities.
Digital Scholarship in the Humanities, 30(1), i164–i176.