CNI Fall 2014 notes

December 9, 2014

tags: cni14f, data management, libraries, wikipedia

My editorial comments are in italics.

Improving the Odds of Preservation

David S. H. Rosenthal, LOCKSS

Various studies have shown that large portions of the digital world are not archived. Over 50% of the journals we hold are not preserved, most content linked from e-theses is no longer available, etc. He refers to this as the ‘half-empty archive’ and notes that the bad news is that even this is overly optimistic. It’s actually worse. We tend to prefer archiving information that’s easy to access and presents no technical hurdles, e.g.- archiving Elsevier’s output isn’t doing anything terribly useful since it’s well situated content. In other words, we do not skew our activities toward risk. Put simply, large, obvious, and well linked collections of information are more likely to be preserved, while all of the smaller yet critical portions go unpreserved.

More issues: we look backwards, not forwards. In other words, we prefer books and journals as preservation objects to more modern forms of information such as social media output and Web content in general. Dynamic and ephemeral content has little chance of being preserved.

Noted briefly, as he has before, that we typically have relied on Kryder’s law (where storage gets cheaper over time) to drive our preservation, because if we can afford to store something for a few years, Kryder’s law dictates that we can store it forever, at least in a financial sense. Kryder’s law is, however, breaking down and the curve between price and capacity is flattening.
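To make the "store it for a few years, store it forever" reasoning concrete, here is a minimal sketch of the arithmetic. It assumes a simple model that is mine, not the speaker's: the yearly cost of keeping the same data falls by a fixed fraction each year, so the lifetime cost is a geometric series. The function names and the rates in the usage note are hypothetical illustrations, not figures from the talk.

```python
def total_storage_cost(first_year_cost, annual_price_drop, years):
    """Sum the yearly cost of keeping the same data for `years` years.

    Assumes (hypothetically) that the cost falls by the fraction
    `annual_price_drop` every year, e.g. 0.4 for a 40% yearly drop.
    """
    return sum(first_year_cost * (1 - annual_price_drop) ** t for t in range(years))


def cost_forever(first_year_cost, annual_price_drop):
    """Limit of the geometric series: first_year_cost / annual_price_drop.

    If prices stop dropping, the series no longer converges and the
    cost of storing "forever" is unbounded.
    """
    if annual_price_drop <= 0:
        return float("inf")
    return first_year_cost / annual_price_drop
```

With a hypothetical 40% yearly drop, storing forever costs 2.5 times the first year's bill; with a 10% drop it costs 10 times, and with no drop at all the total is unbounded. That is why a flattening curve matters so much for preservation budgets.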

As I’ve heard him describe before, he pointed out myriad reasons why he finds the commercial cloud to be a poor solution for preservation. One is that we are faced with significant costs if we need to get the data back out, such as with Amazon Glacier. Since we are facing a massive spike in storage needs (as he demonstrated on a graph using industry data), this is all the more critical. An audience member took severe issue with this point, but he defended himself by noting that if the volume of data grows faster than the Kryder rate, we have a problem.

As an aside, he pointed out that ingest (in particular metadata work) and dissemination are going to become more expensive as well.

The crux of this talk, perhaps, is that when we store data on a system, we will lose bits. As he noted, the question is “how much damage for how many $.” The threats they face are media failure, hardware failure, software failure, network failure, obsolescence, and natural disaster. Now he also identifies operator error, external attack, insider attack, economic failure, and organizational failure. He noted that using less expensive systems, even ones where failure is endemic, means that over time you preserve more content.

His main point is that the greatest risk to digital preservation is “never preserved,” which exceeds all other risks, such as bit rot. How do we resolve this? Cutting costs. We need major cost reductions, but to achieve those we need more and better cost data. As he noted, some of the actors in the preservation market will undermine these efforts since they will always try to position themselves as the “better” preservation solution. Urged everyone to use the Curation Cost Exchange to upload their cost data.

During the Q&A, it came out more clearly than in his talk (although it was clear there) that he is suggesting that we need to be less concerned with metadata, since it’s a major cost point and may or may not be all that critical as the ability to search within data stores gets better. That’s a very radical notion for libraries, who put great value in this work.

A Decade In: Assessing the Impacts and Futures of Internet Identity
Ken Klingenstein, Internet2

The talk emphasized the experience of the United States, noting the contributions of the US academy to what has been built. That said, he did make clear that interfacing with various international groups was essential, and noted that much of what happened in the US had parallels around the world built on this collaboration.

Noted that in recent years, commercial services have surpassed the academy in terms of identity management (a slight paraphrase). It’s critical for multifactor authentication that we have federated identity management in place, otherwise we end up having a bevy of second factors, which defeats the purpose of that mechanism. These commercial interests are good at determining that they are working with the same person, i.e.- they can be precise with identity, but they do not know “who” that person really is. That’s a key difference between commercial and research needs.

Definitions:

Identity is you and your account
Identifiers are unique values tied to a person, often offering privacy instead of identity
Attributes provide privacy, access control, and scale (two categories: verified and self-asserted)

It’s the attributes that “unlock all the doors.”

Described two gateways, SAML2Social and Social2SAML, that allow exchange of attributes between social and academic identity systems. As he noted, it’s quite attractive for institutions, but taking in that social data exposes them to various potential liabilities.

What Role(s) Should the Library Play in Support of Discovery?
Roger Schonfeld, Ithaka S+R

[Arrived late due to a call, hence the paucity of notes on a session I had really wanted to hear.]

Noted that the metaphor of a flood is apt for researchers and their struggle to keep up with the literature of their discipline. His broader point was that discovery is more than search, and he gave examples of how services such as Google Scholar help researchers ‘discover’ papers of potential interest.

Under The Mattress? The Current Landscape of Confidential Data Storage
Jamene Brooks-Kieffer, Kansas

Her premise is that researchers store confidential and sensitive data in unsuitable locations, creating risk for the institution. The most common source of these data is disciplines that study human subjects, but she did draw a line around the health sciences, which are not popping up in the literature she reviewed. These data come in various formats and sizes, have little grant funding attached to them, and are confidential because they use direct or indirect identifiers and/or are subject to an institution’s data classification policy (e.g.- covered by some sort of non-disclosure agreement).

Ideally, the storage for such data would be secure and well managed. In fact, researchers choose their storage options based on what they know of the options, which may or may not be well known to them. Her literature review uncovered that from 2007-2014, various data management reports cover some aspects of these issues–the nexus of researchers, data, and storage–but not all aspects in tandem. Showed some specific examples from various institutions where certain needs are highlighted, but where one also sees some fairly alarming practices and ideas (e.g.- storing data indefinitely, which may or may not be wise or necessary). Other studies show that practices vary across broad disciplinary groupings, i.e.- humanities has different data practices than the social sciences, and so on. In all, though, each grouping has different concerns, but an unsure grasp of what best practice might be.

What do researchers want? Easy. Most research shows that storage needs to be easy. She showed a list of storage options that appeared in the various studies she read, ranging from local machine to an HPC cluster, with a variety of options in between, such as institutional networked storage and external media such as optical disks. She broke these into two groups, one of which she puts under the mattress heading (such as local machine) and the other under secure, such as the HPC cluster or an institutional cloud service.

If there are good options, why don’t people use them? Many reasons: low capacity, high cost, lack of support, difficult to share and collaborate across institutions, device incompatibility (mobile devices), etc. Some of the research represented here also stems from field research, which presents yet another challenge depending on location.

The cloud seems like a good option, since it’s familiar to most researchers and also presents a fairly easy to use set of tools. Alas, there are risks, such as data loss and lack of accountability (she used the Dedoose example). She noted that this is not so much an indictment of researchers as bad data stewards, but indicative of how little professional management is applied to research data by institutions. That professional management is what is needed, not a specific technical solution.

What are the threats to confidential data?

researchers “fending for themselves” (Rochkind)
sync, share, copy, backup
multiple devices and media
research teams

There are other threats as well: for example, poor data-management practice may impinge on a researcher’s ability to get funding. Looking further ahead, we risk losing data or frankly just creating a mess that makes it impossible to sort out what one might need.

She summed up her remarks by noting that the researcher is responsible ethically and professionally for their data and that institutions are responsible for creating environments where employees can do their work successfully. She feels that libraries are a logical place to solve many of the issues, not least as a mediator between various cultures. We also provide navigation and advocacy.

The real issue is not cloud vs. local, which is an easy way to frame it, but rather shifting from individually managed to professionally managed.

Exposing Library Collections on the Web: Challenges and Lessons Learned
Ted Fons, OCLC; Janina Sarol, Timothy Cole, U of Illinois; Kenning Arlitsch, Montana State

Ted pointed out during his comments that this isn’t about “us” (he meant OCLC, I think) creating/proposing relationship graphs on the Internet, but rather something that is emerging on the Web at the widest scale. A phrase he used was moving from “records to entities,” showing a mockup of a quasi-knowledge card for Malcolm Gladwell. This leads to what he called managing entities rather than managing records. He further noted that the major actors on the Internet agree that entities are the way to go; 15% of the Web uses schema.org, for example.

Timothy reported on a project at UIUC to expose its catalogue on the Web as linked open data. Janina reported on the details of how they did this, noting that the project is still a work in progress. They have nearly 11 million records, which they exported as MARCXML and then converted to MODS. From there, they apply various linked data sources, the details of which I didn’t catch, but it involved VIAF and schema.org at the very least. They added WorldCat links, VIAF links (for many, but not all names based on how many matches they got), and links for LCSH subject headings. After modifying their MODS records in this fashion, they generated RDF triples, mapping MODS to schema.org, which is fairly, but not entirely, straightforward from the sound of it. For complex subjects, they used madsrdf.
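For readers who want a feel for the last step, here is a minimal sketch of turning an enriched MODS record into schema.org triples. It is not the UIUC team's code, and I did not catch their actual mapping: the choice of schema:Book, schema:name, schema:author, and schema:about is my assumption, and the sample record, the VIAF identifier, and the record URI are all placeholders.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical record; the VIAF number is a placeholder, not a real identifier.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>An Example Title</title></titleInfo>
  <name valueURI="http://viaf.org/viaf/000000000">
    <namePart>Author, Example</namePart>
  </name>
  <subject><topic>Linked data</topic></subject>
</mods>"""


def literal(text):
    """Format a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def mods_to_triples(mods_xml, record_uri):
    """Map a few MODS fields to schema.org properties (an assumed mapping)."""
    root = ET.fromstring(mods_xml)
    subject = f"<{record_uri}>"
    triples = [(subject, f"<{RDF_TYPE}>", f"<{SCHEMA}Book>")]

    title = root.find("mods:titleInfo/mods:title", MODS_NS)
    if title is not None and title.text:
        triples.append((subject, f"<{SCHEMA}name>", literal(title.text)))

    for name in root.findall("mods:name", MODS_NS):
        uri = name.get("valueURI")
        part = name.find("mods:namePart", MODS_NS)
        if uri:
            # A matched name links to its authority URI (e.g. VIAF).
            triples.append((subject, f"<{SCHEMA}author>", f"<{uri}>"))
        elif part is not None and part.text:
            # An unmatched name falls back to a plain string.
            triples.append((subject, f"<{SCHEMA}author>", literal(part.text)))

    for topic in root.findall("mods:subject/mods:topic", MODS_NS):
        if topic.text:
            triples.append((subject, f"<{SCHEMA}about>", literal(topic.text)))
    return triples


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
```

Calling to_ntriples(mods_to_triples(SAMPLE_MODS, "http://example.org/catalog/1")) yields one line per triple. The fallback to a plain string for unmatched names mirrors the point that VIAF links were added for many, but not all, names.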

Kenning spoke about a specific entity issue, namely how our library organizations are represented on the Web, specifically the semantic Web. His talk focused on the knowledge card, the information that often appears to the right of search results when using Google or Bing search, for example. He also showed examples of answer boxes and carousels, two other representations, but focused on libraries and knowledge cards.

His survey showed that most of the 125 ARL libraries lack knowledge cards or have very poor ones. He rated them on a robustness scale of 1-5, where a five is a fully detailed KC and a one means you get just a plain map. Specifically, 43 had no knowledge card. Of the 82 that have cards, ten were incorrect and 29 had a robustness of one, so only a minority have anything resembling decent representation on the Web. He used his own library as an example, showing how in 2012 it had a poor card, one that actually pointed to the wrong campus, while by 2014 they had been able to correct this by correctly using the underlying platforms such as Google+. He now gives theirs a three on his scale, so he acknowledges that there is still work to do.
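A quick check of the "only a minority" claim, using just the counts I noted. I did not catch whether the ten incorrect cards overlap with the 29 that scored a one, so the sketch below computes both extremes; treating everything else as "decent" is my simplification, not his.

```python
TOTAL_LIBRARIES = 125
NO_CARD = 43
INCORRECT_CARD = 10
ROBUSTNESS_ONE = 29

with_card = TOTAL_LIBRARIES - NO_CARD  # 82 libraries have some card

# Best case (assumption): every incorrect card is also a robustness-one card.
best_case = with_card - ROBUSTNESS_ONE
# Worst case (assumption): incorrect and robustness-one cards never overlap.
worst_case = with_card - ROBUSTNESS_ONE - INCORRECT_CARD


def share(count):
    """Percentage of all surveyed libraries, rounded to one decimal place."""
    return round(100 * count / TOTAL_LIBRARIES, 1)


print(f"Decent cards: between {worst_case} ({share(worst_case)}%) "
      f"and {best_case} ({share(best_case)}%) of {TOTAL_LIBRARIES}")
```

Either way the share stays below half, which is consistent with his conclusion.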

So where does this information come from? Mainly DBpedia, but also Freebase, which Google acquired a number of years ago. What this means is that the library must have an article in Wikipedia, and it helps all the more to be present in Freebase, Google Places, and Google+. His recommendations are to define the library in Wikipedia (noting that their culture requires caution) and to engage with the other platforms that feed into Google’s semantic model.

Wikipedia and Libraries: Increasing Library Visibility
Merrilee Proffitt, OCLC; Jake Orlowitz, Wikimedia Foundation

Link analysis in Wikipedia shows that the most linked resources are those that are freely available on the Web, e.g.- Google Books, Internet Archive, IMDb, etc. Linking to our collections and resources makes a great deal of sense, since it gives readers access to the best source material.

There has been collaboration between Wikipedia and libraries in terms of getting authority data into articles. But her dream is to have links from Wikipedia articles back to the library and its resources.

Wikipedia is currently at 30 million articles in 286 languages. The other metrics are similarly impressive and tell us what we know, which is that it’s become ubiquitous on the Web. It ranks as the fifth most popular site, for example. There are 20 million registered users, of which about 80,000 are active, but only 1400 admins. As Jake put it, it’s only as good as its sources, and we have the best sources. The idea is to connect the eyeballs they have with the sources we have. They routinely eliminate references to blogs, social media, tabloids, etc. As he said, they have these high standards, but that doesn’t mean that every article is at that level, of course.

The project he started to address this is the Wikipedia Library, which is about creating the same service infrastructure for their editors that they might have if they were part of a university’s community. This will get them access to paywalled sources and connect them to libraries. They have approached various content owners–JSTOR, Oxford, et al.–and asked them to donate access to qualified and vetted editors. They now have about 20 partners who have donated access. This group is about 2000 editors and it encompasses about 3000 accounts and has a value of about $1.2 million; 22% of it is non-English.

Big question: “what if every publisher donated free access to the 1000 most active Wikipedians in that subject area?”

They are now setting up a visiting scholars program with four universities–Montana State, Rutgers, George Mason, and UC Riverside–where a very small set of selected editors get access to the libraries’ resources (as ‘hosted’ researchers). Now he’d like to see that expand to all institutions, who could have one Wikipedian on their staff to do this work. They call these individuals “Wikipedia visiting scholars,” and these individuals generally have a very narrow topical focus.

While discussing an undergraduate student research project at Rutgers, Merrilee observed that students behave oddly around Wikipedia because they have always been told not to use it. Their attitudes and practices have been shaped by this.

Jake mentioned a tool that they are building to enable readers and editors to check references. Clearly, for paywalled materials, this is a challenge. They want to get around this by utilizing COinS and OpenURL to get as many readers as possible to full text. The WorldCat Knowledge Base API is an essential part of this, if I understood him correctly. Of course, this means that institutions need to populate the Knowledge Base with their information.
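As a rough illustration of how COinS and OpenURL fit together, here is a minimal sketch that builds an OpenURL context object for a journal article, wraps it in a COinS span, and appends it to a link resolver address. This is not the tool Jake described; the function names and the resolver address in the usage note are hypothetical, and looking up the right resolver for a given reader (the part the Knowledge Base would handle) is left out entirely.

```python
from html import escape
from urllib.parse import urlencode


def context_object(article_title, journal_title, year, issn=None):
    """Build an OpenURL 1.0 (Z39.88-2004) key/encoded-value string for an article."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", article_title),
        ("rft.jtitle", journal_title),
        ("rft.date", str(year)),
    ]
    if issn:
        fields.append(("rft.issn", issn))
    return urlencode(fields)


def coins_span(kev):
    """Embed the context object in a page as an empty COinS span."""
    return f'<span class="Z3988" title="{escape(kev)}"></span>'


def openurl(resolver_base, kev):
    """Point the same context object at one institution's link resolver."""
    return f"{resolver_base}?{kev}"
```

For example, kev = context_object("An Example Article", "Journal of Examples", 2014) can be passed to coins_span(kev) for embedding next to a reference, or to openurl("https://resolver.example.edu/openurl", kev) to send a reader to a made-up resolver. The same citation data serves both purposes, which is what makes the approach attractive for getting readers to full text.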

Another big question: “what if every reference in a Wikipedia article had a link to the full text next to it?” The related dream is that every reader would be able to access that full text free of charge, as he pointed out.

He then turned his attention to the question of how we can link our special collections to articles, when we have significant holdings on the topic or person. He suggests using either the external links or further reading sections. He admits that editors have sometimes handled libraries that do this a bit roughly. To get around that: create an account, populate the user page with some information (no red links!), and be clear about what one is trying to do. More tips: link to other institutions that have similar or related holdings, and create a link circle by linking to Wikipedia from our sites. Those who have successfully navigated this course have seen increased traffic to their resources coming from Wikipedia.
