CNI 2019 Spring Membership Meeting

April 17, 2019

Busch Stadium, St. Louis – CC Joe Penniston, flickr

Many thanks to CNI’s staff and leadership for a wonderful and engaging event. At this iteration of CNI, I took part in their series of Executive Roundtables, this one exploring how institutions are moving services and infrastructure to the cloud. For those of us at Canadian institutions, this is of particular interest given that most major cloud vendors are U.S.-based. I look forward to the summary report that CNI will issue based on all of our input on the topic.

Of the talks below, I would highlight Herbert van de Sompel and Martin Klein’s talk on archiving the bits of intellectual property that researchers sprinkle around the Web in the course of their work. I’ve already raised it several times in the last week in conversations with faculty and administrators here and it finds immediate resonance.

A Research Agenda for Historical and Multilingual OCR
Ryan Cordell – Northeastern U

OCR quality was diminishing the efficacy of text mining research. Led to a report on the state of OCR with an eye toward steps that could be taken to make it work better.

Cordell noted that we tend not to think about OCR when it’s working or when we perceive it to be working well enough. Showed an example from a Vermont newspaper illustrating just how poor OCR can be. Mentioned the Trove example in Australia, which offers users the ability to transcribe newspapers. Great work, but does not scale, at all.

His interest and his collaborator’s interest arose from working with 19th-century newspapers and magazines. Part of their work involves finding reprinted information; locating these requires some fuzzy matching, not least because OCR is not 100% accurate. He observed at conferences that many people would show a slide with dirty OCR and everyone would groan and say, what can you do. Began to get interested in exploring it as an academic topic, exploring some of the common failures and issues of OCR. This led to some insights. If you have, for example, 300 copies of a text to which OCR has been applied, one could perhaps use them to improve the overall quality of the text captured.
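To make the fuzzy-matching idea concrete, here is a minimal sketch using Python's standard `difflib`. It is not the method Cordell and his collaborator use; the sample passages, the `likely_reprint` helper, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two OCR'd passages."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_reprint(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two passages as probable reprints (threshold is an assumption)."""
    return similarity(a, b) >= threshold


# Invented sample passages: one clean, one with typical OCR substitutions,
# and one unrelated text.
clean = "The quick brown fox jumps over the lazy dog near the river bank."
noisy = "Tbe qu1ck brown f0x jumps ovcr the lazy dog ncar the rivcr bank."
other = "Markets closed higher on Tuesday after a quiet morning of trading."

print(likely_reprint(clean, noisy))
print(likely_reprint(clean, other))
```

Exact string comparison would miss the noisy copy entirely, which is why some tolerance for OCR errors is needed when hunting for reprints.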

Their project had a fairly wide potential audience, including computer scientists, libraries and librarians, archives and archivists, OCR tool developers, funders, scholars working with various text corpora, and scholarly societies. Their project involved a survey sent to various segments of this potential audience (62 responses). Also conducted 26 interviews with select scholars or teams. They also held five virtual working group discussions, a workshop at Northeastern, and ran experiments with various corpora.

Main research question:

  • Given adequate time, attention, and funding, what innovation(s) in OCR would most significantly move research forward in your discipline?

The full report is available via NU’s repository. The findings/recommendations include:

  1. Improve statistical analysis of OCR output – scholars know what dirty OCR is, but there are few measures of this. How does one communicate the issues to peers, clearly? More than a “fuzzy sense.”
  2. Formulate standards for annotation and evaluation of document layout – when looking at a document, what part of the page needs to be captured and how? This is a particular issue with newspapers and some non-Latin character texts.
  3. Exploit existing digital editions for training and test data – there are a lot of extant corpora, many with very high quality because they are derived via manual transcription. This could be excellent training data; however, these projects often ignore attributes such as page positioning and segmentation. As he noted, many doing that work would not know how to capture these attributes, but with some minimal level of engagement and knowledge sharing, a little bit of effort could yield huge results.
  4. Develop a reusable contribution system for OCR ground truth – in other words, creating a platform where this work can come together so that those who need it can find it and make use of it. One example is Transkribus, but he noted it’s not entirely community oriented.
  5. Develop model adaptation and search for comparable training sets – it’s not always possible or practical to have training data, e.g., a subset of a corpus that has been manually transcribed. So this recommendation is about extending processes across mass corpora so that training models from a variety of projects can be pulled together to work on a specific project (this is a very loose summary of something Cordell was summarizing from his collaborator, David Smith).
  6. Train and test OCR on linguistically diverse texts – showed examples of pages that are familiar to many scholars where multiple languages and even typefaces appear inline adjacent to each other.
  7. Convene OCR Institutes in critical research areas – they are convinced that there are domains where a great deal of progress could be made relatively quickly if the right people got together, now. They are targeting this recommendation to funders. This could be languages or periods. This would be a kind of “challenge grant” situation.
  8. Create an OCR assessment toolkit for cultural heritage institutions – builds on the assumption that if one were to apply OCR to older text collections that we would now get better results. The earliest projects were often very significant publications–e.g., the New York Herald–but these have terrible OCR. The toolkit would align with existing collection evaluation and work so that improving OCR is just part of what gets done with a collection, not some onerous additional step.
  9. Establish an “OCR Service Bureau”
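
As an illustration of the first recommendation, one widely used measure of OCR quality is the character error rate: the edit distance between a trusted transcription and the OCR output, divided by the length of the transcription. The sketch below is my own example rather than something taken from the report, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edits needed to turn the OCR output into the reference, per character."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


print(character_error_rate("abcd", "abxd"))
```

A number like this requires ground truth for at least a sample of pages, which ties it to the recommendations on training data and shared ground truth.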

Ended with a plea for collaboration in this area, not just across the humanities but into computer science and libraries/archives.

An Institutional Perspective to Rescue Scholarly Orphans
Herbert Van de Sompel – DANS; Martin Klein – LANL

Researchers use various platforms for collaboration. Showed a researcher profile that had GitHub, Publons, ORCID, etc. Lots of examples; we all know people doing this (likely ourselves, too).

Some of these platforms are scholarship specific, e.g., Figshare, but there are many that are general purpose, such as GitHub or Wikidata. This creates three problems.

  • The artifacts deposited are invisible to the institution.
  • There are also questions around long-term access to the information and objects. Commercial products can change conditions, raise paywalls, etc., while non-profit portals have inconsistent funding streams.
  • None of these are systematically archived. No LOCKSS, for example. Web archives such as Internet Archive also do not capture these objects.

He showed an example of a scholar’s SlideShare artifact, which generated zero Mementos (searching across a couple dozen public archives). A GitHub example had one Memento in IA. Question: how do we capture these scholarly orphans for long-term archiving?

Their approach starts from the premise that this should be institutionally driven because it is the intellectual property of their researchers. Also aligns with the mission of their libraries. Such institutions also have a long life and thus are suited to the task.

Their second point is that we should take a Web archiving approach. Much of this will have to be automated because scholars will not deposit in the IR. There are too many portals to make this based on bilateral agreements, too.

They have built a prototype tool to do this second piece. Their prototype tracks artifacts by polling portal APIs for anything new created by their researchers. When it finds something, the artifact goes to a capture engine that grabs it and puts it into an institutional archive. Beyond that, they deposit it in a cross-institutional Web archive.

To do the tracking, one needs the Web identity of the researcher. How do you get these? Algorithmic discovery for one, but works well only with unique names, not so much with Paul Smith. It can also occur via a registry. More realistically–and what they are doing with their prototype–one collects them manually. The targets must have an API that allows for a query that scopes to what you want to pull out (e.g., new objects since a point in time).
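
The tracking step can be sketched roughly as follows. This is not their prototype's code: the `Artifact` record, the `Fetcher` callable standing in for a portal API, and the sample data are all hypothetical, and no real portal is contacted.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable


@dataclass(frozen=True)
class Artifact:
    portal: str
    owner: str         # the researcher's Web identity on that portal
    url: str           # what a capture engine would later archive
    created: datetime


# A fetcher stands in for one portal's API: given a Web identity and a
# cutoff date, it returns that user's artifacts.
Fetcher = Callable[[str, datetime], Iterable[Artifact]]


def track(identities: dict[str, list[str]],
          fetchers: dict[str, Fetcher],
          since: datetime) -> list[Artifact]:
    """Poll every portal for every known identity and collect new artifacts."""
    found: list[Artifact] = []
    for portal, users in identities.items():
        fetch = fetchers.get(portal)
        if fetch is None:  # portal without a usable API: nothing to poll
            continue
        for user in users:
            found.extend(a for a in fetch(user, since) if a.created > since)
    return sorted(found, key=lambda a: a.created)


def fake_portal(user: str, since: datetime) -> list[Artifact]:
    """Invented sample data in place of a real API call."""
    sample = [
        Artifact("example-portal", "jdoe", "https://example.org/a/1",
                 datetime(2018, 7, 1)),
        Artifact("example-portal", "jdoe", "https://example.org/a/2",
                 datetime(2018, 9, 15)),
    ]
    return [a for a in sample if a.owner == user]


new = track({"example-portal": ["jdoe"], "no-api-portal": ["jdoe"]},
            {"example-portal": fake_portal},
            since=datetime(2018, 8, 1))
for artifact in new:
    print(artifact.url)
```

The sketch skips a portal that has no fetcher, which mirrors the limitation described here: without an API that can be queried by user and date, there is nothing to poll.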

Many of the portals do support access by Web identity, but in practice these are only the commercial portals. The scholarly portals tend not to do so. Some of these have no API and no way to pull objects based on a user ID. Also, how does one distinguish between “personal” and “professional” contributions to these portals?

Capture relies on a URL for the artifact that comes via the API. Capture is in the form of a WARC and is deposited in an institutional archive. There is a boundary problem: sometimes there is more that needs to be captured beyond that URL. How do you do that? Also very difficult to capture Web pages because of the ubiquity of interactive and dynamic content.

Described the third step–archiving in a cross-institutional archive–as the icing on the cake. He mentioned Open Wayback as one such place. Did not encounter many challenges with this part of the process.

Klein then showed a demo of the tool. They created a pool of 16 researchers as their “institution.” All have ORCIDs, all use some of these portals. They tracked about ten portals: Figshare, Blogger, Publons, Stack Overflow, etc. They tracked for artifacts created after August 1, 2018. They skipped the notion of personal/professional for this trial run; as he noted, a picture of a puppy could be personal or professional depending on your field. Showed the interface and how it can be sorted by researcher, portal, artifact type, date, etc.

In their demo, about 75% of the artifacts were from GitHub. Not too surprising given the nature of the platform. When looking across researchers, one sees an irregular distribution; some have thousands of objects, some just a few.

Their (very) functional prototype is publicly available. It shows that this can be done, that the technology exists.

Web Archives Analysis at Scale with the Archives Unleashed Cloud
Nick Ruest – York U; Ian Milligan – U of Waterloo

Milligan opened with an idea I have heard him express in other venues: historians who want to study periods after the 1990s will have to use Web archives. That said, they are not ready for this and this is the gist of their project. He notes that access at scale is still an issue. One can trawl through the Wayback Machine, but it’s only useful for specific research queries: one has to have the URL and know fairly narrowly what one needs.

The other option is to work with the WARC files behind the Wayback Machine and to work directly with them. This opens the door to text analysis, network analysis, and other methods familiar to us from the digital humanities. One can move between distant reading and close reading. It’s not easy to work with these files, however, as the tools aren’t accessible to many humanists. The datasets are quite large; some exceed 10TB and that’s beyond the scale of what most researchers can handle. In sum: the tools aren’t there, which is where their Archives Unleashed project picks up.

AU uses an interdisciplinary team: Milligan (historian), Ruest (librarian), and Lin (computer scientist). They are building a toolkit, an open source platform that uses Apache Spark and is scalable. It’s built on the FAAV cycle: filter, analyze, aggregate, visualize. It’s neat, so why doesn’t everyone use it? Well, the interface was too hard, i.e., command line (“a bridge too far”).
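
To give a feel for the FAAV cycle, here is a toy sketch in plain Python. It does not use the Archives Unleashed Toolkit or Spark; the records, field names, and helper functions are invented stand-ins for pages extracted from WARC files.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical records standing in for pages extracted from WARC files.
records = [
    {"url": "http://example.org/news/1", "mime": "text/html", "year": 2012},
    {"url": "http://example.org/news/2", "mime": "text/html", "year": 2012},
    {"url": "http://example.com/logo.png", "mime": "image/png", "year": 2012},
    {"url": "http://example.net/about", "mime": "text/html", "year": 2013},
]


def filter_pages(recs, year):
    """Filter: keep only HTML pages crawled in the given year."""
    return [r for r in recs if r["mime"] == "text/html" and r["year"] == year]


def analyze(recs):
    """Analyze: reduce each record to the domain it came from."""
    return [urlparse(r["url"]).netloc for r in recs]


def aggregate(domains):
    """Aggregate: count pages per domain."""
    return Counter(domains)


def visualize(counts):
    """Visualize: render a plain-text bar chart, largest domain first."""
    return "\n".join(f"{d:<12} {'#' * n}" for d, n in counts.most_common())


counts = aggregate(analyze(filter_pages(records, 2012)))
chart = visualize(counts)
print(chart)
```

Each stage consumes the previous stage's output, which is the shape of the cycle; the real toolkit applies the same pattern to collections far too large for a Python list.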

To address that, they have developed the Archives Unleashed Cloud. It’s a Rails application that sends files and data to Apache Spark. Can be used centrally on their server or run locally. Currently, it’s free, although Ruest alluded to the fact that they have some thoughts around sustainability. The tool pulls a file down and creates a cached copy, so the archived copy remains on the far end and doesn’t require back and forth processing. Currently works with Archive-It and will eventually work with other tools (e.g., WASAPI).

Ruest closed by noting that, in addition to building tools, they want to build a community. To achieve this, they have run four datathons and have four more planned. These datathons lower barriers and bring disparate groups of people together. They are also working hard to build a sustainable model, reaching out to various entities and institutions.


Purposeful Space Design for Libraries
Joan Lippincott – CNI; Thomas Hickerson – U of Calgary; Kelly Miller – U of Miami

Miller spoke about an event in which she participated (sponsored by CNI and the Learning Spaces Collaboratory) where the group identified various types of spaces and developed “job descriptions” for them. The types they came up with were not too revolutionary, but pointed toward realistic types of space use, e.g., staff space or quiet space.

Miller showed an example of a flexible event space with a group of faculty and leaders working together, under the label “learning oriented.” Also showed a “problem-oriented” space being used for a specific type of research. They also asked the question: how can spaces “encourage serendipitous collisions?” Another facet: facilitating conversations rather than transactional interactions. Other categories she mentioned were biophilic (our need for proximity to nature), learner-focused, fostering a sense of belonging (called this the most important).

Hickerson introduced the concept of “permeable” spaces. These are open, transparent spaces where various people can take ownership. These spaces have design that does not dictate behaviour. They have high technology, but with a human touch. Noted that first floors give visitors a sense of the building as a whole. In that light, a cafe is not just about coffee, but indicates a space that is not someone’s personal space and demonstrates the interaction of social use and learning use.

There are elements we can include that help us keep our spaces vital, e.g., raised flooring and de-mountable walls. He made the general observation that we’ve now focused for a while on putting technology in spaces, but that now we need to focus on the human element and how humans want to work and interact in spaces.

Lippincott noted that the LSC has plenty of data on user preferences, so we don’t necessarily need to reinvent the wheel by studying students and users. We know already what they want: light, power, etc. What are students trying to do and what can’t they find available to them? Often, campuses have certain spaces, but they are not open to all members of the community. One question she asked: how do we measure if our innovative spaces achieve their goals? We can do this and we should convey this information to our donors and administrators. She cautioned that we can’t assess everything, so we should be selective and assess the elements that are important to our campus.

Advancements in the ORCID US Community: Supporting Researchers & Adding Value for Research Institutions
Karen Estlund – Penn State U; Jason Ronallo – North Carolina State U; Sheila Rabun – Lyrasis; Carmelita Pickett – U of Virginia

Estlund opened her remarks by noting ORCID data that shows how mobile researchers are. They change institutions and an ORCID can be linked to various institutions (showed her own example). Penn State is a Pure institution, and she showed how it is possible to export profile information from Pure to ORCID and she noted that this makes ORCID something of a “fail-safe” behind Pure. In an aside, she also pointed out that vended systems aren’t simple or easy to implement, but require staffing and lobbying.

At NC State, the library has been gathering citation data for faculty publications for over ten years and offering it via an API for faculty use on departmental and college pages (four colleges use it). Believe I heard him say that the main source is Web of Science (or was; they now also harvest from ORCID). Ronallo noted that at NC State they are sometimes asked just to come to faculty meetings and get people to sign up for an ORCID. They even offer a CV service, where a faculty member can just send their CV and they will do the work of ensuring that their works are identified and captured. Showed a lot of other neat things they are doing based on this collection (e.g., using the DOI with Unpaywall to find alternative access to full text).

Ronallo pointed out that we only need a few minutes of a faculty member’s time to sign up for an ORCID. No need for them to understand why; “they can thank us later.” They ask faculty to unlock services by registering their ORCID with them, by linking it to their campus identifier (Unity ID).

Pickett noted some of the challenges they encounter at Virginia, for example, no central graduate school and multiple faculty reporting systems. They have linked their IR to ORCID. One step they are taking is integrating ORCID into their campus open access publication system. They are now building a UVA-ORCID connector; as with NC State, the idea, I believe, is to firmly associate ORCIDs with campus identifiers.

Building Systems Interoperability for User Discovery of High-Density Storage Collections
Jeff Carrico – Georgia Tech U; Bob Fox, Bruce Keisling – U of Louisville

Keisling and Fox started with a review of some of their 2006 assumptions. One was that there would be high demand for materials. Also, the planners then assumed that demand would remain constant and that users would require and demand rapid fulfillment. Last, this was the most cost-effective solution, per their 2006 thinking. Worth noting that neither Fox nor Keisling was part of that process.

Their system has capacity for 600,000 volumes with two robotic cranes. The original plan was for twice the space but a spike in steel prices led to cancellation of the second half. Retrieval is integrated into their ILS, originally Voyager and now WMS. Sounds like retrieval with WMS took them a bit longer than they had planned. WMS lacked an API when they adopted it, but rather than develop an API, they built the functionality into WMS. It took longer, but it works.

They think differently about capacity now. Originally, they were promised in 2006 that adding cranes and capacity would be fairly straightforward. Instead, it’s a ship in a bottle project. Moreover, use has not been what they anticipated in 2006 (it’s a long tail situation); parallel to this, they learned from the period of non-integration with WMS that they don’t need “push button” retrieval. It’s also not cost effective: “cranes are like Ferraris,” i.e., everything about them is expensive. Their expansion is “compact shelving on steroids,” 40 ft. high and 100 ft. long. It’s entirely manual. Capacity is 300,000 volumes. It’s a Spacesaver product, but Spacesaver offers no inventory software solution, so they had to figure out how to map out the shelving and make items retrievable. They looked at commercial solutions and some open source solutions, and also at building their own MySQL database with an Access front end (this is what they ultimately chose). It’s not connected to WMS. This means that they now have two retrieval streams, one automated for their robotic system, the other mediated for their Spacesaver shelving.

Georgia Tech has a facility shared with Emory. GT’s print collections aren’t growing rapidly, but Emory still buys a fair bit of print. Their facility is based on the Harvard model: no robotics, long aisles, tall shelves, manual pickers that follow wire guides with a human pulling from the shelf. GT put nearly all of their print collections into the facility, so they care a great deal about service, while Emory has other options and plans and isn’t as concerned about retrieval times. Both schools are now Alma schools. They used an inventory system designed by W B Meyer, but then they migrated in 2018 to CAIA Software Solutions when Meyer wasn’t able to continue to evolve the product to meet their desires. The new solution will enable them to track items en route because various transit mechanisms can be tracked.

Research Innovation Trends and Priorities in Canadian Research Libraries
Merrilee Proffitt – OCLC; Vivian Lewis – McMaster U

Lewis showed demographic information about Canadian directors. About 65% have been in their roles for less than five years. A majority of 60% believe that use of their physical facilities will increase over the next five years, while 92% say the same about online activity. The number one service they expect faculty/staff to seek from libraries is research support services (95%). They expect other services to decline, particularly physical loans, ILL, etc. The directors tend to see low volatility in student usage of the library, although most seem to predict that it will shift from being a passive study or collaboration space toward a technology space.

Research data management, digital scholarship services, publishing support, IR discovery, and research information management represent the top five categories where we need to see innovation in the coming years (not in that order; this is how I captured them!). Many of us are active in these spaces, but it would be worth asking how we are resourcing these areas (my editorial comment).

Proffitt showed contrasting results from Australia and New Zealand. There is some overlap, of course, in priorities, but other areas emerge clearly as priorities in that region that were not top of mind in Canada, such as demonstrating value to funders.

Closing Plenary: Web Archives at the Nexus of Good Fakes and Flawed Originals: “You’re in a Desert Walking Along in the Sand When All of a Sudden You Look Down, and You See a Tortoise…”
Michael L. Nelson – Old Dominion U

I rarely take notes during keynotes. This was a great one, however, even if parts of it made the humanist in me squirm because, well, truth and reality have always been subjective and subject to the whims of power. Watch for the recording to go live. There was a bit of over-the-top painting of worst-case scenarios and no exploration of how text-based disciplines such as history have long dealt with the inaccuracies and inconsistencies in the record, but still a great talk for surfacing various concerns.

