Martha Bailey, University of California, Los Angeles
Jonas Helgertz, University of Minnesota/Lund University
Connor Cole, Office of Tax Administration
Joaquin Serrano, University of California, Los Angeles
New data linking technology is revolutionizing economic history by allowing the creation of large-scale longitudinal and intergenerational datasets. However, recent work calls into question the accuracy of the machine methods used to create these “big data.” This paper reconciles seemingly disparate conclusions about the accuracy of machine methods and explores how error rates vary across subgroups and linked census datasets. Using the Oldest Old data from the Early Indicators project and the IPUMS-MLP/CLP links, we first show that the seemingly different error rates reflect different parameters. Second, we examine how the number of unlinkable individuals affects linking errors and how linking errors vary across subpopulations. Linking algorithms perform worse for more disadvantaged or more mobile groups, potentially adding significant measurement error to links for lower socio-economic status groups, racial subgroups, and immigrants. A larger share of individuals who are unlinkable due to death or emigration also raises linking error rates, meaning that error rates are likely higher in datasets like the census than in the Oldest Old data, where deaths are observed. Third, we illustrate how consequential the differences between conditional and unconditional linking errors can be in practice.
Presented in Session 50. Methodological Innovations in Linking