New Advance in Automated Transcription and Implications for Record Linkage

Christian Møller Dahl, University of Southern Denmark
Christian E. Westermann, University of Southern Denmark

In this study, we address the challenge of achieving accurate machine-learning-based transcription for very large collections of dense tabular documents, such as US Census tables. Transcription accuracy relies heavily on effective table segmentation, i.e., the extraction of the table cells of interest. To this end, we introduce tableParser, a pipeline that offers precise, fast, and robust identification and segmentation of tables within documents. Our approach leverages deep learning architectures, specifically SegFormer (Xie et al., 2021) for semantic segmentation and PCRNet (Sarode et al., 2019) for point set registration. Our transcriptions operate at the character/token level and cover names, locations, and dates. An innovative aspect of our approach is that the tableParser configuration can be adapted to produce different visual representations, or "looks," of the segmented table cells of interest, and each look yields its own variant of the transcribed content. In addition, our system provides confidence measures for character/token recognition, which allow us to identify not only the most likely transcription of an entity but also a ranked list of "top candidates." We showcase the practical application of this methodology to record linkage: using the lists of "top candidates" across the various "looks" has the potential to improve overall matching rates.
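The idea of linking on ranked candidates from several looks can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the rule of keeping each candidate's highest confidence across looks, the exact-match comparison, the `min_conf` threshold, and the sample names and scores are all assumptions made for illustration.

```python
from collections import defaultdict


def pool_candidates(looks):
    """Merge ranked (text, confidence) lists from several looks.

    Assumption: a candidate keeps its highest confidence across looks.
    """
    best = defaultdict(float)
    for candidates in looks:
        for text, conf in candidates:
            key = text.strip().lower()
            best[key] = max(best[key], conf)
    # Highest confidence first; ties broken alphabetically for determinism.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


def link(looks, reference_names, min_conf=0.05):
    """Return the best pooled candidate found in the reference set, or None."""
    reference = {name.strip().lower() for name in reference_names}
    for text, conf in pool_candidates(looks):
        if conf < min_conf:
            break  # remaining candidates are too unlikely to trust
        if text in reference:
            return text, conf
    return None


# Hypothetical recognition output for one name cell under two looks.
look_a = [("Jansen", 0.58), ("Jensen", 0.37), ("Jonsen", 0.05)]
look_b = [("Jensen", 0.61), ("Jansen", 0.30)]
reference_names = ["Jensen", "Hansen", "Nielsen"]

# Linking on the single top candidate of one look finds no match...
top1_only = link([look_a[:1]], reference_names)
# ...while pooling the candidate lists of both looks recovers one.
pooled = link([look_a, look_b], reference_names)
```

In this toy case the top candidate of the first look has no counterpart in the reference list, whereas the pooled candidate list does. A realistic linkage step would likely compare several fields at once and use fuzzy rather than exact matching.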

No extended abstract or paper available

 Presented in Session 50. Methodological Innovations in Linking