Developing programmatic workflows to activate the UC Davis archives for wine economists and historians.

This project is a collaboration between the DataLab and the UC Davis Library, funded by the Sloan Foundation, to extract historical price data from an archive of wine catalogs published by Sherry-Lehmann. The primary goal of the project was to create a database of historical price information that could help wine economists study wine markets over time. Secondary goals included developing open-source software for extracting tables from images, built upon the Rtesseract package (an R interface to the Tesseract optical character recognition, or OCR, engine), and hosting hackathons that promote authentic data science skills for UC Davis students.

Bounding boxes for text recognition and parsing from a Sherry-Lehmann wine catalog.

The resulting R package for table extraction was deployed on a Google Cloud architecture using Docker and integrated with a PostgreSQL database. Using this workflow, we extracted more than 365,000 wine and spirit prices.
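To give a sense of what table extraction from OCR output involves, here is a minimal sketch in Python. It is not the project's code (the package itself was written in R on top of Rtesseract); it only assumes that the OCR step returns words with bounding boxes. The `Word` class, the function names, the tolerance value, the price pattern, and the sample values are all hypothetical and invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'd word and its bounding box in pixels (hypothetical structure)."""
    text: str
    left: int
    top: int
    width: int
    height: int


# Assumed price format: optional dollar sign, digits, two decimal places.
PRICE_RE = re.compile(r"^\$?(\d{1,4}\.\d{2})$")


def group_into_rows(words, tolerance=0.6):
    """Group words into rows when their vertical centers lie close together."""
    rows = []
    for w in sorted(words, key=lambda w: w.top + w.height / 2):
        center = w.top + w.height / 2
        if rows:
            last = rows[-1]
            last_center = sum(x.top + x.height / 2 for x in last) / len(last)
            if abs(center - last_center) <= tolerance * w.height:
                last.append(w)
                continue
        rows.append([w])
    # Within each row, read left to right.
    return [sorted(r, key=lambda w: w.left) for r in rows]


def extract_prices(words):
    """Return one record per row that has both a description and a price."""
    records = []
    for row in group_into_rows(words):
        tokens = [w.text for w in row]
        prices = [float(m.group(1)) for t in tokens if (m := PRICE_RE.match(t))]
        name = " ".join(t for t in tokens if not PRICE_RE.match(t))
        if prices and name:
            records.append({"name": name, "prices": prices})
    return records


# Invented sample input, not taken from the catalogs.
sample = [
    Word("CHATEAU", 10, 100, 80, 12),
    Word("MARGAUX", 95, 101, 80, 12),
    Word("1959", 180, 100, 40, 12),
    Word("4.95", 300, 102, 35, 12),
    Word("53.50", 350, 101, 40, 12),
    Word("BEAUNE", 10, 120, 70, 12),
    Word("1961", 85, 121, 40, 12),
    Word("$3.49", 300, 120, 35, 12),
]

for record in extract_prices(sample):
    print(record)
```

Grouping by vertical center rather than by exact `top` coordinate tolerates the small baseline jitter that is common in scanned pages. A real catalog page would also need column detection and OCR error correction, which this sketch leaves out.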

We presented this work, titled “Mining Historic Realia: Automatic Generation of Historic Wine Pricing,” at the 13th Annual Conference of the American Association of Wine Economists in Vienna in 2019.

Wine project database.