Getty Foundation-funded project leveraging content-based image recognition for interoperable descriptive metadata and image cataloging of early modern works.

Project Description

Through the generous support of The Getty Foundation, DataLab is working to develop an infrastructure that leverages Content-Based Image Recognition (CBIR) to facilitate shared cataloging of printed images from the early modern period. Our vision is to develop an environment in which a cataloger or archivist who is describing an image can use CBIR to search across collections and institutions for copies of the same or similar images, retrieve the cataloging records for matched images, and easily ingest the retrieved cataloging data into the local datastore. In short, we intend to provide an infrastructure that allows image catalogers to quickly and easily ask, “Has anyone else described an image like this?” and, if so, “How was it described?” Such a system would improve the quality and interoperability of descriptive metadata and speed up image cataloging efforts, thereby improving access to collections worldwide.

Work on this platform began in 2012 with a Digital Humanities Start-Up award from the National Endowment for the Humanities (NEH) to deploy CBIR technologies to facilitate cataloging of the woodcut impressions in the English Broadside Ballad Archive.[1] Under this award, we produced a software platform called Archive Vision (Arch-V).[2] Arch-V is currently deployed by the English Broadside Ballad Archive both as a cataloging tool and as a means for users to perform image-based searches of the archive. We are also in the process of indexing the digital collections of our second major adopter, the Folger Shakespeare Library. The statistics of our GitHub repository indicate that an additional 12 users have downloaded the code. Arch-V has been available as an open source CBIR software platform since 2015, and the UC Davis Library has supported its ongoing development since the completion of the NEH Start-Up award period.

We currently seek funding to allow us to package a native, web-based cataloging and management interface as part of the Arch-V distribution. The current version of Arch-V is controlled via command-line operations, and users communicate with the system via a collection of APIs. The web-based interface will provide an easy-to-use, browser-based way for non-technical users (such as catalogers) to index local image collections for searching, search the collections of other Arch-V adopters for the same or similar images, and create catalog records for local items based on metadata associated with matching images found in other collections.

Funding from this proposal will also be used to enhance Arch-V’s image matching capabilities. At present, Arch-V is able to find duplicate copies of the same image and to cluster individual image “items” under a single “manifestation” entity.[3] The system can, for example, find all examples of a particular ship with a particular pattern on its sails. Funding from this proposal would be used to extend this capability to allow us to produce higher-level clusters based on abstractions such as “all sailing ships” or “all ships.”

We also plan on hosting two symposia focused on using CBIR technologies in general, and Arch-V specifically, to enhance image cataloging.  Each of the symposia will bring together current Arch-V adopters and key potential adopters of the system to demonstrate the potential for leveraging these technologies, to promote the adoption of the Arch-V platform by a wider community of users, and to solicit feedback to guide product feature development.  Arch-V’s power lies not in its image matching capabilities but in its ability to connect and make communally searchable a distributed network of image collections housed at multiple institutions around the world.  The Arch-V symposia will provide the platform for establishing a working community of network peers, thereby enhancing the overall value of the platform.

Design and Information Architecture

The current version of Arch-V consists of three primary technology components: 1) a C++ application that includes a collection of libraries for performing image feature extraction, analysis, indexing, clustering, and searching; 2) a Java Servlet API that allows users to interact with the C++ application via RESTful HTTP requests; and 3) a MySQL database that stores indexing, cluster, and other cached information and is accessed by both the C++ application and the Java Servlet API. All user interaction with Arch-V is conducted via the Unix command line or the API. Making images searchable through Arch-V requires the following steps:

  1. A “processimages” command is run to extract features from each image in a designated directory. The output of this process is a new directory of text files, one for each processed image, each containing the parameters of the features found in the associated image file. The standard Arch-V configuration uses SURF features, but the system can be configured to use other feature definitions.
  2. A “computecnnclusters” command is run, which uses a trained convolutional neural network (CNN) to group images by the similarity of their feature files as a means of associating like images. The results of this clustering are saved to the MySQL database.
  3. A “computehclusters” command is run to create an alternate, hierarchical clustering model of sameness. The results of this clustering are also saved to the MySQL database.
  4. A “deriveclusters” command is run to reconcile the neural net and hierarchical clustering models into a single cluster model. Generally speaking, we rely on the hierarchical clustering to identify isolates from the CNN clustering model that should belong to a larger cluster; any matches found are moved into the appropriate cluster.
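
The reconciliation described in step 4 can be illustrated with a short Python sketch. This is not the Arch-V code: the function name `derive_clusters`, the dictionary inputs, and the majority-vote rule for placing an isolate are all assumptions made for illustration. The sketch treats an image that sits alone in its CNN cluster as an isolate and moves it into the CNN cluster shared by most of its neighbors in the hierarchical model.

```python
from collections import Counter, defaultdict


def derive_clusters(cnn_labels, hier_labels):
    """Merge two clusterings (hypothetical sketch, not the Arch-V implementation).

    Both arguments map the same image ids to a cluster label.
    An image alone in its CNN cluster (an "isolate") is moved into the CNN
    cluster most common among the non-isolate images that share its
    hierarchical cluster. Isolates with no such neighbors stay where they are.
    """
    cnn_sizes = Counter(cnn_labels.values())

    # Group image ids by their hierarchical cluster label.
    hier_members = defaultdict(list)
    for image, label in hier_labels.items():
        hier_members[label].append(image)

    merged = dict(cnn_labels)
    for image, label in cnn_labels.items():
        if cnn_sizes[label] > 1:
            continue  # already part of a multi-image CNN cluster
        neighbors = [
            other
            for other in hier_members[hier_labels[image]]
            if other != image and cnn_sizes[cnn_labels[other]] > 1
        ]
        if neighbors:
            votes = Counter(cnn_labels[n] for n in neighbors)
            merged[image] = votes.most_common(1)[0][0]
    return merged


# Toy data: c.jpg is a CNN isolate that the hierarchical model groups with a.jpg and b.jpg.
cnn = {"a.jpg": 1, "b.jpg": 1, "c.jpg": 2, "d.jpg": 3}
hier = {"a.jpg": "x", "b.jpg": "x", "c.jpg": "x", "d.jpg": "y"}
print(derive_clusters(cnn, hier))
```

In this toy run, c.jpg joins cluster 1, while d.jpg, which has no hierarchical neighbors, remains an isolate.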

Once images have been indexed, they are searchable via the Unix command line or the API. The following search and retrieval functions are currently available:

  1. A “scandatabase” command receives an input (seed) image as an argument and executes a search of the system as follows: 1) features are extracted from the seed image; and 2) the features from the seed image are matched, using a distance calculation, against the most representative image (the one with the most features) from each indexed cluster.
  2. A “showkeypoints” command receives an input (seed) image as an argument and returns a copy of that image with all found features visually identified. See Figure 7 for an example. This feature helps human users understand how the computer is “seeing” the image.
  3. A “drawmatches” command receives two images as arguments and returns a new image that shows all feature matches that the system has identified between the two images. See Figure 8 for an example. This feature helps users understand exactly how and why the computer is matching two images.
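
The search logic described for “scandatabase” can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Arch-V implementation: the function names, the Euclidean distance measure, the `max_distance` threshold, and the match-count score are hypothetical, and the two-dimensional descriptors stand in for real feature descriptors, which are much longer.

```python
import math


def representative(cluster):
    """Return (image_id, features) for the image with the most features."""
    return max(cluster.items(), key=lambda item: len(item[1]))


def count_matches(seed_features, candidate_features, max_distance):
    """Count seed descriptors whose nearest candidate descriptor is close enough."""
    matches = 0
    for seed in seed_features:
        nearest = min(math.dist(seed, cand) for cand in candidate_features)
        if nearest <= max_distance:
            matches += 1
    return matches


def scan_database(seed_features, clusters, max_distance=0.25):
    """Rank clusters by how well the seed matches each cluster's representative.

    `clusters` maps a cluster id to {image_id: [descriptor, ...]}.
    The threshold and the scoring rule are assumptions for illustration.
    """
    results = []
    for cluster_id, cluster in clusters.items():
        image_id, features = representative(cluster)
        if not features:
            continue  # nothing to compare against
        score = count_matches(seed_features, features, max_distance)
        results.append((cluster_id, image_id, score))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Toy index with two clusters.
clusters = {
    "ships": {
        "ship1.jpg": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
        "ship2.jpg": [(0.0, 0.0)],
    },
    "trees": {"tree1.jpg": [(5.0, 5.0), (6.0, 6.0)]},
}
seed = [(0.1, 0.0), (1.0, 1.1), (9.0, 9.0)]
print(scan_database(seed, clusters))
```

Comparing the seed against one representative per cluster, rather than against every indexed image, keeps the number of comparisons proportional to the number of clusters.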

We intend to maintain the above-described system architecture but will add a Java Server Page (JSP) based web admin and cataloging interface to the package. We chose JSP as the technology platform for serving this interface because it currently serves as the server platform for Arch-V’s RESTful API. Using the packaged JSP server to deliver the admin and cataloging interface will therefore simplify software installation and reduce software requirements.

The new web-based interface will provide a user-friendly, password-protected, secure GUI for interacting with the system, through which adopters will be able to perform all Arch-V functions. This interface will also provide the gateway that catalogers will use for image cataloging across collections. We currently imagine a configuration setting that lets users select which participating network sites they wish to search for any given activity. Users will then be able to use the interface to submit a seed image, which the system will submit for search via the API across the desired network sites. Images and metadata returned by network sites will then be compiled and delivered to the user, organized by relevance, so that they can select matches, copy metadata, alter and/or supplement the captured metadata, and then save a cataloging record for the seed image locally for import into the local catalog/archival system. As part of this process, the system will also validate data against known controlled vocabularies and make suggestions (based on distance matches and on transitive relationships with other similar images) designed to normalize cataloging to controlled vocabularies. Whenever possible, the system will capture URIs as well as string literals for cataloging terms. Created cataloging records will be made available in a variety of formats as determined by the needs of the network participants.
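
The cross-site workflow described above (query the selected sites, merge the returned matches by relevance, then derive a local record from a chosen match) might look roughly like the following Python sketch. Everything here is hypothetical: the function names, the shape of a hit, and the `derived_from` field are assumptions, and each site is represented by a plain callable where a real deployment would issue a request to that site's API.

```python
def search_network(seed_image, sites, selected):
    """Query each selected site and merge the hits, best score first.

    `sites` maps a site name to a callable that takes the seed image and
    returns a list of {"image", "score", "metadata"} dicts. The callable is
    a stand-in for an API request to that site (hypothetical interface).
    """
    hits = []
    for name in selected:
        for hit in sites[name](seed_image):
            hits.append({**hit, "site": name})
    return sorted(hits, key=lambda h: h["score"], reverse=True)


def draft_record(seed_image, hit, overrides=None):
    """Copy metadata from a chosen match and apply the cataloger's edits."""
    record = dict(hit["metadata"])
    record.update(overrides or {})
    record["image"] = seed_image
    record["derived_from"] = f'{hit["site"]}:{hit["image"]}'
    return record


# Stub sites that return canned results instead of calling a remote API.
sites = {
    "site_a": lambda img: [
        {"image": "a1.jpg", "score": 0.90, "metadata": {"subject": "ship"}}
    ],
    "site_b": lambda img: [
        {"image": "b1.jpg", "score": 0.95, "metadata": {"subject": "sailing ship"}}
    ],
}
hits = search_network("seed.jpg", sites, ["site_a", "site_b"])
print(draft_record("seed.jpg", hits[0], {"date": "1620"}))
```

Keeping the merge step separate from the record-drafting step mirrors the text: the cataloger reviews the ranked matches first and only then copies and edits metadata.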

Project Outcomes and Deliverables

The proposal has two distinct lines of effort: one technological, devoted to improving the software; and one administrative, devoted to stimulating and managing communication and discussion among external adopters (sites that have agreed or will agree to deploy the software and join the network) and to ensuring that the features of the software meet the needs of the community. Each line of effort has its own set of project deliverables:

Technology development will focus on the following specific tasks:

  1. Implementation of a Neural Network Classification system to perform initial categorization of image type (woodblock, engraving, painting, drawing, etc.) prior to feature point extraction and matching: Over the past several months we have been experimenting with training a neural net to recognize major divisions in image type. This allows us to tune feature point extraction parameters more finely for specific image types, which in turn improves overall matching capabilities. We will use funding from this proposal to perfect this method and integrate it into the Arch-V platform.
  2. Improve algorithms for clustering semantically similar images into FRBR Expression and Work clusters: We have already had significant success using hierarchical clusters of image features to group images according to semantic similarity. Over the past several months, we have been working to improve this functionality by using predictive modeling methods to project feature spaces for “ideal” images, matching individual images against that ideal (as opposed to more conventional methods of principal component classification), and using the results of these comparisons to tune the hierarchical network. We will use funding from this proposal to perfect this technique and to write code libraries that use the network to mint Uniform Resource Identifiers for Expressions and Works. These code libraries will be packaged as part of the Arch-V platform so that if a new Expression or Work is created at one participating site it becomes immediately available to all sites, thereby ensuring alignment of metadata records across institutions.
  3. Modify the Arch-V API so that it delivers both index data (the feature point definitions used by the CBIR system) and descriptive metadata as Linked Data according to a shared standard: In 2011-2014 we collaborated with the Bodleian Library in a JISC-funded initiative to develop a Linked Data ontology for describing broadside ballads. For the present initiative, we will build on that work to develop a common schema for delivering image metadata as Linked Data and modify the API to work with this standard.
  4. Add a Cataloging Web Interface to Arch-V: The current Arch-V software is an API-based web application. After installing the software, users communicate with the system, both for indexing collections and for searching indexed collections, via JSON requests to and responses from the API. At present, there are two site implementations of the software: the English Broadside Ballad Archive (EBBA) <http://ebba.english.ucsb.edu> and an Arch-V demonstration site <http://ds.lib.ucdavis.edu/archv>. Each maintains its own website that makes calls to Arch-V and formats the results of these calls for return to the user. Arch-V itself has no web-based user interface at present; all interaction is via the Unix command line or the API. We will use funding from this proposal to develop an integrated management and cataloging web interface for Arch-V. Users who adopt the software will still be able to build custom connections between existing websites and the software using the API, but the new cataloging interface will provide an easy-to-use gateway that allows users to catalog across the network of adopting sites from within their own instance of Arch-V.
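
As an illustration of the Linked Data delivery described in item 3, the following Python sketch serializes a descriptive record as JSON-LD, pairing each subject term's URI with its string literal. It is a sketch only: the shared schema has not yet been defined, so the use of Dublin Core terms and RDFS labels is an assumption, the function name `to_linked_data` is hypothetical, and the example.org URIs are placeholders.

```python
import json


def to_linked_data(image_uri, title, subjects):
    """Build a JSON-LD document for one image record (hypothetical schema).

    `subjects` is a list of (label, uri) pairs so that each cataloging term
    carries both a string literal and a URI.
    """
    return {
        "@context": {
            "dcterms": "http://purl.org/dc/terms/",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        },
        "@id": image_uri,
        "dcterms:title": title,
        "dcterms:subject": [
            {"@id": uri, "rdfs:label": label} for label, uri in subjects
        ],
    }


# Placeholder URIs; a real record would point at a controlled vocabulary.
record = to_linked_data(
    "http://example.org/images/123",
    "Woodcut of a ship",
    [("sailing ships", "http://example.org/vocab/sailing-ships")],
)
print(json.dumps(record, indent=2))
```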

Administrative activity will focus on the following specific areas:

  1. Arch-V Symposia: At the time of this writing, Arch-V is actively deployed at two sites (EBBA and the Arch-V demonstration site), and we have just begun indexing a large collection of digital images from the Folger Shakespeare Library as well. The value of the proposed platform, however, increases directly with the number of participating sites and collections indexed and made searchable through the network. At each symposium we will demonstrate the platform, with presentations from current adopters that highlight use value and success stories, and solicit feedback from both current and potential adopters regarding feature development. We will also seek Arch-V adoption commitments from participants, encouraging them to allow their assets to be indexed and made searchable using the platform. We believe that bringing a group together in person so we can demonstrate the software will have a direct, positive impact on overall adoption.
  2. Meetings of Project Principal Investigators and Consultants: As outlined in the project proposal, the core project team includes team members from UC Davis and consultants from UC Berkeley and XXXXX. We are also requesting funds for a project startup meeting to bring the core team together for an initial in-person meeting. The core team will also meet virtually, as well as in person in breakout sessions at the Arch-V symposia.
  3. We also request funding to send the Principal Investigator, the project Data Scientist, and the project Postdoc to conferences to promote the project.  This will allow us to make the wider scholarly community aware of the work.

[1] Video about the EBBA project.

[2] Arch-V project description.

[3] Cluster map of all images in the English Broadside Ballad Archive.