Quantifying the impact of messy data on historical text analysis

Activity: Talk or presentation types: Invited talk

Mark J. Hill - Speaker

Simon Hengchen - Speaker

Quantitative methods for historical text analysis offer exciting opportunities for researchers interested in gaining new insights into long-studied texts. However, the methodological underpinnings of these methods remain underexplored. In light of this, this paper takes two datasets made up of identical early eighteenth-century titles (periodicals) and compares them. The first corpus is a collection of clean versions of the texts drawn from various sources, while the second is made up of messy (in terms of OCR) versions extracted from Eighteenth Century Collections Online (ECCO). With these two corpora the aim is to achieve four things. First, to offer some descriptive analyses, including differences and similarities in word, sentence, and paragraph counts; average sentence length; and variance in the differences between correctly and incorrectly recognised words. Second, to use this information for statistical analyses: that is, to use the recorded differences between OCR errors and clean data to quantify the significance of those errors (i.e., to what extent the messy version is representative of the clean version). Third, to offer some more qualitative reflections on differences in the outputs of specific text analysis methods, including inter-corpus and cross-corpus similarities in vector space, LDA topic modelling, and outputs from the Stanford Lexical Parser. Finally, the paper concludes with some thoughts on how those engaging with messy data can (or cannot) move forward, in particular by quantifying the problem and highlighting methods which are less susceptible to errors caused by bad OCR.
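
As a concrete illustration of the kind of corpus-level comparison described above, the following is a minimal sketch (not the authors' code) of how a clean transcription and its OCR'd counterpart might be compared: simple descriptive counts alongside cosine similarity between bag-of-words vectors. The file paths and the naive tokeniser are assumptions for illustration only.

```python
# A minimal sketch, assuming one clean and one OCR'd version of the same title.
# Paths, file names, and the tokenisation scheme are hypothetical.
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase alphabetic tokens; a deliberately naive tokeniser."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

clean = tokens(open("clean/periodical_001.txt", encoding="utf-8").read())  # hypothetical path
messy = tokens(open("ecco/periodical_001.txt", encoding="utf-8").read())   # hypothetical path

clean_bow, messy_bow = Counter(clean), Counter(messy)

# Descriptive comparison: token counts, vocabulary sizes, overlap, and vector similarity.
print("tokens (clean / OCR):", len(clean), "/", len(messy))
print("types  (clean / OCR):", len(clean_bow), "/", len(messy_bow))
print("shared vocabulary   :", len(set(clean_bow) & set(messy_bow)))
print("cosine similarity   :", round(cosine(clean_bow, messy_bow), 3))
```

Summary numbers of this kind could then be aggregated across the whole corpus pair to estimate how far OCR noise shifts the messy corpus away from its clean counterpart.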
4 Jul 2018

ID: 108627795