Digital Humanities Application Development in the Cloud

This paper outlines an attempt to migrate some humanistic research into the cloud. The undertaking raises a number of questions, but it will be clarifying to focus on two: why would humanists want to use the cloud, and why should the cloud have more active humanists on it? To answer these questions properly, the paper first examines what we mean by "the cloud" when we talk about it in our roles as academic research computing specialists; second, it lays out the processes and products of my particular use case in this context; and finally, it reflects on why this might be significant both for research computing and for what is broadly called the digital humanities.


Cloud Work
It is interesting that the singular, definite article "the" is used for a noun as nebulous as "cloud." What this tells us is that the phrase "the cloud" is an argument: that the availability of computation should be taken for granted. Computation is everywhere now, not so much a pool as an aquifer. And in academic computing, whether we believe that the meter should be running when the faucet is turned on or that the resource should be a freely available commons, we generally agree that it should be understood as a standing reserve: theoretically finite but always there, to be used [14].
This availabilization of resources by the cloud plays out in two ways: scalability through networking and fungibility through virtualization. The networking of assets theoretically reduces the friction of scaling processes: in universities, for instance, approved users no longer request resources from IT but instead use web interfaces to self-provision storage and compute cycles as needed for projects. Virtualization theoretically makes these resources more fungible, allowing computational power to be presented in a more immediately utilizable form.
It is important to critique these promises that the cloud makes about the availabilization of computing, because they are fundamentally ideological. Above all, valorizations of "the cloud" take it as an unquestioned good that researchers can turn on the spigot without talking to a plumber: the point is that researchers should just be able to get on with their work and scale it in a relatively unregulated way. It is worth asking why that is held up as a necessary good.
In an academic context, we can formulate the question this way: if the cloud is justified by the claim that it accelerates research, what is the primary impediment it is seen as overcoming? One answer is labor. Defenders of administrative and research assistant labor point out that their roles cannot be wished away. In the case of research computing support staff, there are technologies that automate certain tasks, but the notion of "the cloud" plays the role of making labor invisible. This labor theory of added value in cloud computing is implicit in the focus of the Indiana University Pervasive Technology Institute's 2019 panel at PEARC, "Humans in the Loop: Enabling and Facilitating Research on Cloud Computing." The focus, as Brian Voss puts it, is specifically on "the human element" of cyberinfrastructure [1]. Daniel Sholler's paper for the panel, on the ways in which cloud computing is fetishized, draws on Susan Leigh Star's work to make this point about the invisibilization of labor quite clearly [15]. We ask, "But who will containerize the containers?" "What will the facilitators facilitate?" The argument here is that the desire for an always-on, seamless, scalable allocation of virtual storage, environments, and cycles depends on a constant, background hum of IT.
In other words, the IT labor response to the automation and virtualization arguments in the cloud is to show that IT labor isn't getting in the way of research; in fact, it's what's facilitating research, even if "the cloud" makes that hard to see. Someone has to maintain and build these machines, and universities will eventually pay for these people and machines one way or another: whether when deferred maintenance on containerized technologies brings systems crashing down as a result of widespread security vulnerabilities [10], or when short-term cost savings are wiped out by price hikes from the commercial cloud provider you have migrated everything to, or in more nebulous ways, as when speculative ventures of uncertain value never come into being because they are not profitable as a turnkey service and the university IT department has lost its institutional memory by being hollowed out. With regard to the last: while such system-effects are obviously difficult to measure, some legal theorists have begun to revisit them as a rich site for thinking through the consequences of monopolistic behavior, reviving a postwar liberal understanding of marketplace diversity and competition [12].
The IT labor critique of cloud computing, in other words, is quite strong! But it's not enough. We know well enough, under a Silicon Valley economic regime that various critics have called platform capitalism or surveillance capitalism, that customers participate in the hollowing out of industries even when they know it's going to lead to their being fleeced later on [17, 18]. IT has been hollowing out other industries for years, and its disintermediating effects have hollowed out middle management as much as they have traditional labor; from this perspective, the cloud is just IT turning on itself.
But it didn't have to be that way, with IT hollowing out various labor forces, and it doesn't have to be this way either: it was a matter of institutional politics and economic forces, not a technologically determined outcome. As Alan Liu has argued, drawing on Manuel Castells, the technology itself is a double-edged sword [13]. With respect to the institutional and infrastructural relevance of cloud technology to research computing, this perspective allows us to make a simple claim: the availabilization of virtual computing resources can lead academic researchers to reinvest in finding novel uses for computing technology.
But what would that look like? It would at least involve the customers (that is, academic researchers) being able to articulate the value of such institutional knowledge and capabilities, and it would involve researchers being able in some way to quantify this value when advocating for it to administration. Below, then, I outline a use case of developing custom software in "the cloud" whose process and products demonstrate one way that we might better serve humanists and increase their investment in institutional support for such projects.

Use Case
What I will present here, then, is a use case for getting humanities researchers invested in cloud technology, along with reflections on why I think that is a good thing for them in this case, and on the tradeoffs compared to the alternatives. I will close by contextualizing this strategy for developing humanities software in the cloud within the broader field of digital humanities (DH) project development.
The work of a humanities research computing facilitator falls into two main categories: connecting researchers to resources and developing capacity for unmet needs. And because most research computing infrastructure was built with the needs of non-humanists in mind, most solutions for humanists involve some form of custom development.
An example of an unmet need caused by a platform lock-in problem: a while back, a professor asked me to help her export material from Zotero, a popular citation-management system. She is a historian who had stored thousands of notes in the platform, attached to thousands of bibliographical entries, many including photos. Archival search is a grind: the organizational structure you must maintain to keep track of manuscripts and rare books is highly complex, and the descriptive metadata can be even worse. She wanted to do a structured data dump: search on a few terms or select a few collections, then export her notes to sift through as she wrote her book. However, the built-in export functions did not produce output she could parse.
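As a hedged illustration of the kind of structured dump she wanted, the sketch below pulls child notes for each top-level item through the Zotero Web API using the pyzotero library and flattens them into a CSV. The library ID, API key, and output columns are placeholders, and the snippet assumes notes are stored as child items; it is a sketch of the approach, not the professor's actual workflow.

```python
import csv
from pyzotero import zotero  # third-party: pip install pyzotero

# Placeholder credentials; a real run needs the user's library ID and API key.
zot = zotero.Zotero("1234567", "user", "YOUR_API_KEY")

with open("notes_dump.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["item_key", "title", "note_html"])
    # everything() pages through the whole library, not just the first batch.
    for item in zot.everything(zot.top()):
        data = item["data"]
        # children() returns the notes and attachments hanging off an entry;
        # one API call per item, which is slow but fine for a one-off dump.
        for child in zot.children(data["key"]):
            if child["data"]["itemType"] == "note":
                writer.writerow([data["key"], data.get("title", ""),
                                 child["data"]["note"]])
```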
The professor hand-migrated all of her good Zotero content over to Airtable. She now has thousands of cross-linked entries that she can draw on and onto which she can layer further annotations. She is deeply involved in the use of digital technologies to accelerate historical research, being one of the principals on a diachronic mapping platform that allows users to navigate Rio de Janeiro's cityscape in detail over the course of centuries. And she teaches her graduate students how to leverage digital technology to deepen and expand their own work.
This past semester, she came to me with another export problem. She wanted to show her students how to use Tropy, an image archiving platform, but became concerned when she found she was having trouble getting data out of it. As of this writing, one can write plugins, parse JSON exports, or perform structured exports to OMEKA, but it is difficult to get a good CSV dump [2] or an export to networked storage.
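Tropy's JSON export is JSON-LD, so even without a built-in CSV function a researcher can flatten it with a short script. The following is a minimal sketch that assumes an export file whose top-level "@graph" array contains items with "title" fields and nested "photo" entries; actual field names vary with the export and the project template, so treat this as a pattern rather than a universal converter.

```python
import csv
import json

# Load a Tropy JSON(-LD) export; the filename is a placeholder.
with open("tropy-export.json", encoding="utf-8") as f:
    export = json.load(f)

# Assumption: items live in a top-level "@graph" array, as in JSON-LD exports.
items = export.get("@graph", [])

with open("tropy-items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["item_title", "photo_path"])
    for item in items:
        title = item.get("title", "")
        # Each item can hold several photos; emit one row per photo.
        for photo in item.get("photo", []):
            writer.writerow([title, photo.get("path", "")])
```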
It is worth saying, then, that lock-in is not a problem unique to commercial platforms. There are good reasons that nonprofit, open-source software developers do not go out of their way to make their platforms export-friendly: it is not irrational to think that, for instance, making a Google Drive sync plugin for your image archive could lead to your platform being replaced by Google Drive itself. Indeed, Google has already taken a step toward allowing more flexible, custom metadata fields for objects via the Drive API; its public development plan suggests that it will soon be surfacing this capability in the Drive interface [8].
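For concreteness: the Drive API v3 already lets an application attach custom key-value metadata to a file through its "properties" (or app-scoped "appProperties") field. The sketch below shows the shape of such a call using the google-api-python-client library; the credential file, file ID, and metadata values are placeholders for illustration.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build  # pip install google-api-python-client

# Placeholder credential file; a real deployment would manage this securely.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/drive"],
)
drive = build("drive", "v3", credentials=creds)

# Attach archival metadata as custom properties on an existing Drive file.
drive.files().update(
    fileId="FILE_ID_GOES_HERE",
    body={"properties": {"archive": "Arquivo Nacional", "folio": "12r"}},
).execute()
```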
That isn't particularly fair to the researcher, however, who on the one hand has an unlimited institutional Google Drive storage option and on the other has a software package whose photos can only be accessed through its own application.
The lock-in strategy has in fact led Tropy down a development path that makes it quite difficult to leverage cloud technology: all of the traffic ultimately goes through the local machine, which makes scalability impractical.
We therefore built a bare-bones web application that assumes users will keep their photos on networked storage. The researchers I spoke to were storing everything on external hard drives, sometimes with multiple backups in different locations. Getting them onto scalable networked storage would remove any hard limit on the size of their archives, allow backups of both photos and metadata to be automated, and let large batch operations such as OCR run at scale in the background. An added benefit, of particular interest to our client, is that the data would be stored in an SQL database we controlled, making custom exports more practicable.
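Because we control the database, a custom export can be a few lines of web code. Here is a minimal sketch, assuming a Flask application and a MariaDB connection via the pymysql library; the connection parameters, table, and column names (assets, title, path) are hypothetical stand-ins for the real schema.

```python
import csv
import io

import pymysql  # pip install pymysql
from flask import Flask, Response

app = Flask(__name__)

@app.route("/export.csv")
def export_csv():
    # Hypothetical connection parameters and schema.
    conn = pymysql.connect(host="localhost", user="archive",
                           password="secret", database="archive")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT title, path FROM assets ORDER BY title")
            rows = cur.fetchall()
    finally:
        conn.close()

    # Write the result set to an in-memory CSV and return it as a download.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["title", "path"])
    writer.writerows(rows)
    return Response(buf.getvalue(), mimetype="text/csv",
                    headers={"Content-Disposition": "attachment; filename=assets.csv"})
```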
This summer, two Rice students, David Yi and Jiacheng Wang, have been hired as interns to create a working prototype of this system under my guidance and with input from the client. We are completing the project on Rice infrastructure, but it could be deployed on any networked assets (see Fig. 1). A CentOS 7 VM on Rice's private cloud, ORION:
• Mounts a networked SMB share from Rice's Isilon that holds all the image assets,
• Hosts an SQL/MariaDB instance that tracks all these assets and their metadata,
• Hosts an International Image Interoperability Framework (IIIF) Loris server (https://github.com/loris-imageserver/loris) that serves these images on demand,
• Runs a Python Flask web interface, using a modified bootstrap.js template, that allows users to organize, annotate, and transform these images (rotation, contrast, cropping),
• Will run background cron jobs to automate image importing.
When a user drops new photos into the networked storage, the import job will move them into a hashed directory structure and update the database with entries for the new assets (a sketch of this job follows below). The interface (Figs. 2-3) queries the database for assets, displays their metadata, and uses the IIIF server to render the images. If this basic functionality is completed before the summer is finished, we will build, in order: a Google Drive backup function that uses the API's custom metadata capabilities, and an automated OCR batch job submission function.
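As a sketch of what that import job might look like: it scans a drop directory, files each image under a two-level directory derived from its content hash, and records it in the database. The mount points, hashing scheme, table name, and pymysql connection are illustrative assumptions, not the interns' exact implementation.

```python
import hashlib
import pathlib
import shutil

import pymysql  # pip install pymysql

# Hypothetical mount points on the networked SMB share.
DROP_DIR = pathlib.Path("/mnt/isilon/dropbox")
STORE_DIR = pathlib.Path("/mnt/isilon/assets")

def import_new_photos():
    conn = pymysql.connect(host="localhost", user="archive",
                           password="secret", database="archive")
    try:
        for photo in DROP_DIR.glob("*.jpg"):
            digest = hashlib.sha256(photo.read_bytes()).hexdigest()
            # Two-level hashed directory structure, e.g. assets/ab/cd/abcd....jpg
            dest = STORE_DIR / digest[:2] / digest[2:4] / f"{digest}{photo.suffix}"
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(photo), str(dest))
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO assets (sha256, path, original_name) "
                    "VALUES (%s, %s, %s)",
                    (digest, str(dest), photo.name),
                )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    # Intended to be run periodically from cron.
    import_new_photos()
```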
The OCR plugin functionality is particularly interesting to me because it is one of the few areas in which scalable CPU power has been of demonstrable utility in the humanities, and the workflow is relatively straightforward. If a researcher could push thousands of images of typewritten (or even handwritten?) files onto networked storage and then, with the press of a button, see the metadata fields for those images populated with full-text transcriptions, that would be a major selling point for humanities researchers to use cloud storage. For this reason, my other two interns, Shengjing Zhang and Hongfei Ye, have built an automated OCR high-throughput computing (HTC) pipeline in Python that scales linearly. A user can currently dump typewritten PDFs into a directory, and the pipeline will allocate resources as needed, split the PDFs, and write the outputs in a structured manner. This is, in effect, what people want out of cloud supercomputing: a job that scales efficiently as needed. Rice's CRC, like most research computing centers, does not allow automated job submission on our clusters; however, as requests for this kind of cloud supercomputing continue to grow, that policy may need to be revisited.
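A minimal sketch of such a pipeline follows, assuming the poppler pdftoppm and Tesseract command-line tools are installed and parallelizing across local cores with a process pool; an HTC deployment would submit each PDF to a cluster scheduler instead, but the split-then-transcribe structure is the same. The directory names are placeholders.

```python
import pathlib
import subprocess
import tempfile
from concurrent.futures import ProcessPoolExecutor

IN_DIR = pathlib.Path("pdfs")      # drop typewritten PDFs here
OUT_DIR = pathlib.Path("ocr_out")  # one .txt per page appears here

def ocr_pdf(pdf: pathlib.Path) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        # Split the PDF into one PNG per page (poppler's pdftoppm).
        subprocess.run(["pdftoppm", "-png", str(pdf), f"{tmp}/page"], check=True)
        for page in sorted(pathlib.Path(tmp).glob("page-*.png")):
            out = OUT_DIR / f"{pdf.stem}-{page.stem}"
            # Tesseract writes <out>.txt containing the page's transcription.
            subprocess.run(["tesseract", str(page), str(out)], check=True)

if __name__ == "__main__":
    OUT_DIR.mkdir(exist_ok=True)
    # One worker per PDF; throughput scales roughly linearly with cores.
    with ProcessPoolExecutor() as pool:
        list(pool.map(ocr_pdf, IN_DIR.glob("*.pdf")))
```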

Development Model
This platform is not particularly technically novel; the point of the exercise is to rethink the form that IT support for humanities research can take. Simple workflows and interfaces can go a long way towards getting funding for more robust platforms, but much of the programming work that goes into making them is not reproducible, because the needs are so particularized and the development model is so individualized. In a science lab, a doctoral candidate would just write the needed code and get a publication or two out of the results; but in the humanities, there is neither the training for such work, nor the collaborative working environment necessary for the organic development of such tools, nor a widely accepted credit model for such work. And so if we want humanists to have the tools to use scalable research computing infrastructure, we will have to find a way to ensure that those tools are built for them.
Much current work in DH project development aims to involve humanists in the design and construction process from the beginning. One model of project funding along these lines is the "jump-start" or "start-up" grant [5]: the P.I. is given a small block grant to prototype a project. The potential downside is that the humanities P.I. ends up wearing many hats: archival researcher, project manager, and oftentimes coder or designer to boot. Jump-start P.I.s can be spread thin when so much is asked of them and only a small amount of block-grant funding is offered in return.
In fact, it is not quite fair to call this a downside, because some DH practitioners use it as an opportunity to build communities of practice and deepen skill bases; I myself have benefited enormously from this model and have seen it build cross-institutional capacity [3]. This community-based approach can therefore be understood as responding to the same problem of the overburdened P.I. that our prototyping approach takes as its impetus. King's Digital Lab at King's College London offers a useful reference point for triangulating these approaches: its team, directed by James Smithies, provides custom software as a service to humanities researchers [16].
As the above narrative shows, the custom-development-in-the-cloud model is different from the jump-start model but could be deployed as a supplement or augmentation to it. It attempts to unburden the researcher of being project manager or coder, and instead offers quickly-built custom solutions that can be scaled if successful. The research computing facilitator takes on the role of project manager (with a little coding), our coding interns build (and sometimes design) the back end and front end of a web service, and the researcher provides regular feedback while readying their data and collection processes for integration into the platform. This could be aligned with the jump-start funding model simply by redefining the humanities researcher's role from start-up entrepreneur to client.
There are potential downsides to my development model, of course, such as reinventing the wheel, building something that is quickly superseded by a slicker version, committing to maintenance of the platform, or dampening the possibility that humanities researchers might learn how to use research computing on their own. But there are tangible upsides as well, such as onboarding user groups who have otherwise been underrepresented in academic research computing, deepening humanists' commitment to institutional support for research computing infrastructure, and discovering new uses for scalable computing on humanities datasets. I can say that my own use of Rice's supercomputers, networked storage, and virtual machine pool has transformed the way that I do digital humanities work (my code almost never runs on a local machine anymore), and I have spent the last year helping researchers scale their projects using standard tools: migrating them to hosted databases, getting them to store data on networked storage, or building more robust data-driven websites customized for their specific needs.

The provisional measures of success for such projects would be: 1) the number of users of a given platform, 2) the size of the datasets that users put on networked storage, 3) the frequency and scale of their use, and 4) the plugins built onto, or branches built out of, the codebase. Currently, we have two prospective users of the platform described above: the tenured faculty researcher and one of her graduate students. We are OCR'ing the graduate student's collection of rare Portuguese printed books in order to demonstrate the utility of scaled computing, and he has already uploaded 10 GB of his photographic data to networked storage for this purpose.
What I believe is unique about the project I have outlined is that we are developing custom applications specifically in order to get researchers onto a cloud. This goal rests on three interrelated assumptions: that, to the extent researchers in the humanities need custom applications or workflows built for them, cloud technology accelerates such development; that there is value to be had in applying scalable computing to their datasets; and that, these days, this means relatively frictionless access to networked assets, which is what we usually mean by "cloud." By using cloud technology to make research computing resources more accessible and functional for humanities researchers, our model can also encourage their buy-in to research computing more broadly.

If I am justified in this hope that cloud technology can be used to cultivate humanist investment in research computing and vice versa, then this would provide a prime example of how the cloud can be used not to disintermediate academic IT but rather to reinvigorate it as an area of institutional innovation. And even more than that: having humanists invested in what you do is a very good way of gaining clarity and new insights into what it is that you do, even if it slows things down and one has to put up with our annoying habit of asking unexpected questions all the time. One need only look at humanistic and sociological work on infrastructure and institutions to see that practices benefit from having their core ideas critiqued and experimented with by these disciplines; again, I refer the reader to work by Alan Liu and James Smithies, this time the Critical Infrastructure Studies initiative [4].

This paper has attempted to: 1) articulate how the development of custom software for humanities research relates to the technological milieu that we call the cloud; 2) introduce a broader audience to a conversation about this mode of cloud development in humanities research; and 3) point toward how this sort of experimental collaboration can be beneficial for both clients and practitioners going forward.