Source: Carl Stahmer, PhD

This post is in response to a flurry of Twitter and blogging activity surrounding #EEBOGate (ProQuest's now-retracted announcement that access to Early English Books Online (EEBO) [1] would no longer be made available to users through membership in scholarly societies and organizations). The specific topic of the post is the potential role the English Short Title Catalogue (ESTC) could play in the development of a community-supported corpus to replace EEBO [2]. While I am affiliated with the ESTC, the opinions expressed in this post are entirely my own and do not reflect an official position of the ESTC.

Several years ago the ESTC began to add links to external digital surrogates of the items in its catalogue. While the change was (and remains) invisible to many, it significantly altered the nature of the catalogue. Whereas the ESTC had historically served as an inventory of extant textuality, a kind of record of the history of early print and a finding aid to the hidden reaches of rare book collections, it began to function as a textual gateway. It ceased being a catalogue and became, in digital parlance, an Archive, if only to a limited degree.

Becoming a digital archive was not the vision of the original ESTC, which was developed before the birth of the digital archive. However, the inclusion of links to full-text surrogates situates it as a potentially important aggregator of and gateway to textual material. The next-generation ESTC, the ESTC21, currently under development thanks to the generous support of the Andrew W. Mellon Foundation, will amplify this potential.

Currently, external links in the ESTC are, or have been, contributed by the various holding institutions with items in the catalogue. Libraries and museums deliver their holding data to the ESTC and the data are ingested into the catalogue by staff working at UC Riverside and the British Library. The system works reasonably well, and it has allowed us to ingest and make available links to many external resources.

But what happens when a new surrogate becomes available? Is it reasonable to assume that every library will track, record, and resubmit data to the ESTC every time it makes available a new digital text? And what of the cases where digital surrogates are created by a person or institution other than the holding institution? With increasing frequency, corporate or individual bodies are producing high-quality surrogates for items held in another institution's collection. Good examples of this include the English Broadside Ballad Archive, the Shelley-Godwin Archive, and the Whitman Archive [3].

With the increased availability of Open Access digital texts and images of texts, we will see an ever-growing corpus of texts produced autonomously by scholars with no affiliation with the institutions that hold those items. Some of these will be diplomatic transcriptions. Others will be rich editorial editions. Some will be optimized for human readability. Others will be optimized for machine readability. All will add significantly to our ability to study the works in the ESTC catalogue, and all need to be made available through a common search and navigation gateway.

The #EEBOGate controversy prompted many scholars to suggest that the ESTC could and should assume the role of EEBO in the scholarly community by offering access to the same body of text through an open-access, community-supported portal. Foremost among these were @BenjaminPauley, @djp2025, @wynkenhimself, and @john_overholt.

In a blog post titled “Together, we can FrEEBO,” Overholt concisely sums up the flurry of Twitter discussion that followed ProQuest’s now-retracted announcement that it planned to deny access to scholars currently subscribed through the Renaissance Society of America (RSA). The primary ethos of that discussion was that it was time for scholars to take action and build their own open-access textual corpus [4]. As Overholt puts it:

There is literally no reason for these centuries-old books to be the monopoly of a commercial publisher who owns not a single one of them. It is entirely within the power of the libraries of Great Britain and the US to make this invaluable resource available to everyone in the world without a massive subscription fee or even a relatively more modest but still expensive society membership. These books are part of our cultural heritage, and it’s high time we made them available to everyone.

I agree with Overholt that it is time for such an effort. In fact, scholars such as Martin Mueller have already begun developing an infrastructure for community correction and curation of the EEBO/TCP TEI texts as a step towards achieving exactly this goal. In a 2014 article in the Spenser Review, Mueller, writing specifically about the TCP texts, issues a call to arms for scholars and librarians to “join a movement” to produce a community corpus of early modern texts. According to Mueller, “It is a social rather than a technical challenge to get to a point where early modernists think of the TCP as something that they own and need to take care of themselves” [5].

Mueller is correct that, as with most problems in the Digital Humanities, the social problems are far more difficult to solve than the technical ones. But some attention to the technical must still be paid. How would hundreds, or even thousands, of scholars collaborate on such a curation project? And once the texts are made properly fit for scholarly consumption, how would they be made available to the community at large? How would they be situated such that they can interact with other relevant texts from the period? How would users find and navigate such a large body of texts?

The above questions are at the forefront of the ongoing redesign of the ESTC, the ESTC21. With the generous support of the Andrew W. Mellon Foundation, the ESTC has been transforming itself into a collaborative, scholarly environment rather than an updated library catalogue. Central to the ESTC redesign is the addition of functionality to allow scholars to “Annotate” items in the catalogue (suggesting corrections, additions, and the like) and to allow other scholars to peer review these annotations.

Annotations are concrete and directed assertions about items in the catalogue. They can be related to a record as a whole, such as matching an institutional holding record to an ESTC entry, or to items within a record, such as correcting a date, identifying a publisher, or correcting a typographic error introduced during the cataloguing process.

Once a user has made an annotation, it becomes immediately visible to other ESTC users, but it is situated as a provisional piece of information and not as part of the official item record. Other users are then invited to vote on whether or not they agree with the assertion. Annotations that receive positive review by the community will then be reviewed by ESTC staff for migration into the official ESTC record. Whether or not an Annotation ultimately makes its way into the official record, the entire transaction history is recorded and made visible to ESTC21 users.

ESTC Social Curation Model
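
To make the review cycle described above concrete, here is a minimal sketch in Python. It is an illustration only and not the ESTC21 implementation: the class and field names, the record identifier, the example URL, and especially the vote threshold are all assumptions of mine, since the rule for what counts as "positive review by the community" is not specified here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROVISIONAL = "provisional"        # visible to users, not in the official record
    AWAITING_STAFF = "awaiting staff"  # community review was positive
    ACCEPTED = "accepted"              # migrated into the official record
    DECLINED = "declined"              # staff chose not to migrate it


@dataclass
class Annotation:
    """A single assertion about a catalogue record (all names are hypothetical)."""
    record_id: str
    field_name: str        # e.g. "date", "publisher", "surrogate_link"
    proposed_value: str
    author: str
    status: Status = Status.PROVISIONAL
    votes: dict = field(default_factory=dict)    # voter -> True (agree) / False
    history: list = field(default_factory=list)  # full transaction log

    def __post_init__(self):
        self.history.append(
            f"{self.author} proposed {self.field_name}={self.proposed_value!r}"
        )

    def vote(self, voter, agrees, threshold=3):
        """Record one user's vote; threshold is an assumed net-agreement cutoff."""
        if self.status is not Status.PROVISIONAL:
            raise ValueError("community voting is closed for this annotation")
        self.votes[voter] = agrees
        self.history.append(f"{voter} voted {'agree' if agrees else 'disagree'}")
        net = sum(1 if v else -1 for v in self.votes.values())
        if net >= threshold:
            self.status = Status.AWAITING_STAFF
            self.history.append("community review positive; queued for staff")

    def staff_review(self, reviewer, accept):
        """Staff decide whether the annotation enters the official record."""
        if self.status is not Status.AWAITING_STAFF:
            raise ValueError("annotation has not passed community review")
        self.status = Status.ACCEPTED if accept else Status.DECLINED
        self.history.append(f"staff reviewer {reviewer} set status to {self.status.value}")


# Usage: a scholar links an externally hosted surrogate to a record.
note = Annotation("example-record-1", "surrogate_link",
                  "https://example.org/surrogate/1", author="scholar_a")
for user in ("scholar_b", "scholar_c", "scholar_d"):
    note.vote(user, agrees=True)
note.staff_review("staff_1", accept=True)
for entry in note.history:
    print(entry)
```

The point of the sketch is that the history list is append-only and is kept whatever the outcome, mirroring the commitment to record the entire transaction whether or not an annotation reaches the official record.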

Of particular significance in the context of #EEBOGate is that users working in the ESTC21 interface can annotate records to include links to external digital surrogates. This functionality will allow individual users, or consortia of users, to develop texts independent of the ESTC and then make these texts easily and widely available to the public by entering them into the ESTC21. Additionally, the social review system described above provides a mechanism for peer review of these texts.

The development of this kind of community-supported corpus is more within reach today than it has ever been. Digitization of texts is now relatively cheap and easy. Images of texts taken with just a smartphone surpass the images in EEBO in both quality and readability. Additionally, software tools such as Tesseract and its extended eMOP suite of software, developed and maintained by the IDHMC at Texas A&M University, make it easy to convert these images to text with a higher level of accuracy than is reflected in EEBO/TCP [6]. In other words, the technology currently exists to allow the community to develop its own corpus. We have an app for that!

It is worth noting that there are other efforts underway to develop and make available a similarly comprehensive body of texts. Foremost amongst these is JISC Historical Texts [7]. Historical Texts brings together texts from EEBO, Eighteenth Century Collections Online (ECCO), and the British Library’s 19th century collection in a common search and discovery interface. The project provides valuable access to a wide body of texts, but it is only fully available to those affiliated with institutions that are members of the UK Higher Education and Further Education Councils. Additionally, it lacks any social curation functionality that would allow community members to include links to additional texts or to peer review newly included texts.

The ESTC21’s social functionality offers the potential for the community of ESTC users to grow a digital collection organically. This is important, as large, top-down efforts are seldom successful without major financial backing, which we, as a community, do not have. The digital surrogates already connected to the ESTC provide a basis from which to begin such an effort. The addition of the TCP texts, even in their current form, would add significantly to the effort, as would a community effort to simply add links to digital editions that already exist. Finally, individual scholars (or groups of scholars), working in their own domains, could produce editions of their own and immediately add them to the growing library aggregated by the ESTC21.

The ESTC21 will launch as beta in the Spring of 2016. It is our hope that, when it does, the community will capitalize on its new infrastructure to develop and make available a reliable and rich collection of texts that surpasses in both quality and number that which is currently accessible through EEBO.

[1] See http://eebo.chadwyck.com/home.

[2] See http://estc.bl.uk.

[3] See http://ebba.english.ucsb.edu, http://shelleygodwinarchive.org, and http://www.whitmanarchive.org/.

[4] See https://medium.com/@john_overholt/together-we-can-freebo-b33d39618f8#.lkhbsvrgw

[5] See http://www.english.cam.ac.uk/spenseronline/review/volume-44/442/digital-projects/the-eebo-tcp-phase-i-public-release/

[6] See https://code.google.com/p/tesseract-ocr/, http://emop.tamu.edu/, http://src-online.ca/index.php/src/article/view/226/448, and http://idhmc.tamu.edu/.

[7] See http://historicaltexts.jisc.ac.uk/.