CNI spring 2015 notes

April 17, 2015

tags: cni15s, data management, digital preservation, open access, publishing

This spring’s CNI moved west to lovely Seattle. As always, I’ve offset my metacommentary on the talks with italics. If you find these notes useful, or have further thoughts, please leave a comment.

What price Open Access?
Stuart Shieber, Harvard; Ivy Anderson, California Digital Library; Ralf Schimmer, Max Planck Digital Library

Shieber started with a thought experiment: what would it cost Harvard to pay article processing charges (APCs) for all of the articles its faculty publish? Not surprisingly, at ~$3,000 per article this would be a staggering sum (8,000 × $3,000 = $24,000,000). Their spend on subscriptions was around $5,000,000, so clearly the APC route is not feasible.

Stepping out of the university context, he noted that in 2011 the publishing industry had $9.4 billion in revenue and produced 1.8 million articles, which works out to about $5,222/article. In that light, even the high cost of the APC route seems reasonable at $3,000/article.
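The two back-of-the-envelope figures above are easy to reproduce. This is only a sketch using the rounded numbers quoted in the talk; the function names are mine, not Shieber's.

```python
def total_apc_cost(articles: int, apc: float) -> float:
    """Cost of paying an APC for every article an institution publishes."""
    return articles * apc


def revenue_per_article(industry_revenue: float, articles: int) -> float:
    """Average industry revenue per published article."""
    return industry_revenue / articles


# Rounded figures as quoted in the talk.
harvard_apc_bill = total_apc_cost(8_000, 3_000)
industry_average = revenue_per_article(9.4e9, 1_800_000)

print(f"Harvard, APC on every article: ${harvard_apc_bill:,.0f}")
print(f"Industry revenue per article:  ${industry_average:,.0f}")
```

Set against the roughly $5,000,000 subscription spend, the first figure is almost five times higher, while the second shows that the industry-wide average per article is well above a $3,000 APC.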

This analysis ignores that the size and productivity of a university skews the numbers. The cost goes up at other institutions, depending on the relationship between reading and writing articles, as he put it. As he mentioned, Harvard’s serial expenditure per published article is very low compared to a wide range of institutions.

So who should fund research? One principle posits that those who fund the research should pay for its dissemination, e.g., granting agencies, foundations, and co-authors’ institutions.

Data shows that there isn’t a strong correlation between article influence and APC, i.e., a higher APC does not mean that the article is de facto of greater influence.

Looping back to the question of Harvard’s spend, he noted that if the APC averaged $1,000, and only 25% were not externally funded, and 60% were internal authors, the cost would be $1.2 million (8,000 × $1,000 × 0.25 × 0.6). He noted that this is based on a whole host of assumptions and changes that would have to occur to make it real. His co-speakers have done this kind of analysis for their institutions.
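The adjusted estimate is just the product of the four quantities he named. A minimal sketch, assuming the two percentages act as independent multipliers (which is how the formula in the talk treats them); the function and parameter names are my own.

```python
def institutional_apc_cost(
    articles: int,
    apc: float,
    unfunded_share: float,
    internal_author_share: float,
) -> float:
    """APC bill left to the institution after external funders pay their part.

    unfunded_share: fraction of articles with no external funding.
    internal_author_share: fraction attributed to the institution's own authors.
    """
    return articles * apc * unfunded_share * internal_author_share


# Figures as quoted in the talk.
cost = institutional_apc_cost(8_000, 1_000, 0.25, 0.6)
print(f"Institutional APC cost: ${cost:,.0f}")
```

Changing any one assumption moves the result proportionally, which is why he stressed how much would have to change to make the figure real.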

Schimmer offered the example of the Max Planck Digital Library, which has had a combined budget for subscriptions and APCs for ten years. He showed the current state of the subscription market, where comparing revenue with output yields prices starting at $5,000/article. After a potential OA transformation, even at $2,500 per article, the funds required would be halved.

As he noted, Open Access is taking publications from the subscription side, but not the costs, i.e., we are spending as much as always on subscriptions. That’s what needs to change. Their data shows that 70% of articles have authors from multiple nations. I didn’t follow his argument entirely, but he noted that French and British studies have overestimated costs based on this overlap. Essentially, it’s a matter of deduplicating to eliminate the ~30% that are being counted twice, where the APC is only being paid once, not twice as the cost calculations currently state.

He showed some simple pie charts that make clear that while Elsevier, Wiley, and Springer make up <50% of the titles published, about 2/3 of our serials budgets go to these three. He showed the actual numbers for their APC invoices (which are on GitHub); the conclusion of their analysis was that the MPG “as a heavily output-oriented research organization is able and committed to make the transformation.” Their leadership supports this.

In a closing summary, he noted that all subscription spending must be stopped. All of that money must be diverted and repurposed for OA services.

Anderson reported on Mellon-funded projects to study APCs in North America. The core question is whether a conversion to open access via the APC model is viable and sustainable for large North American research-intensive institutions. She noted that in North America, the policy framework tends toward green open access, while in Europe, the preference is gold. Given that each realm is about one third of the global scholarly output, this divergence of direction is troublesome.

An earlier study from 2013 analyzed their costs across nine different publishers, and demonstrated that for some publishers, even using a fairly low APC, the outlay would still be more than under the subscription model (for the UC system). They then ran the numbers by discipline, which showed that the level of APC varies widely from one discipline to another. They then factored in corresponding authorship using 60% in the life and physical sciences, and 100% for the social sciences and humanities. Not surprisingly, this adjustment to the cost side reflects favourably on the APC threshold, resulting in a spend that is about 20% less than their subscription spend. She noted that this was a rough calculation, and their Mellon-funded Pay it Forward project is an attempt to solidify these findings through more fundamental analysis.

She showed the list of project participants, which includes Ohio State, UBC, Harvard, and the 10 UC institutions. They count this as four partners, but it seems more like 13, so the results may skew heavily based on UC figures. The question this raises, of course, is whether UC is typical or in some way different from other systems and institutions.

Publishing Ada: A retrospective look at the first three years of an open peer review multi-modal journal
Karen Estlund, Sarah Hamid, Bryce Peake, U of Oregon

Ada was a journal that “didn’t fit in OJS.” Estlund worked on her own faculty time to help get it off the ground; in the meantime, the library has embraced it and supports it.

They wanted to be open access, but leave authors the choice of Creative Commons licence. In practice, their authors have selected from the full gamut of options, from the most open to the most restrictive.

Their multi-level peer review has some interesting elements. Rejection isn’t a flat-out rejection; rather, they suggest alternative outlets or make suggestions for a resubmission. They use CommentPress for their open review. They also try to include multimodal and interactive pieces.

Since it’s a library conference, she touched on indexing. Now that they have three years, they can apply to ISI. They are also indexed in Google Scholar; they optimized tags to get that to happen correctly. It’s also listed in MLA, but nowhere else at present.

Hamid spoke about some of the challenges they have faced, ranging from requests for post-publication edits and infrastructure issues to the coaching necessary to teach people how to access and use the review tools.

Peake pointed out that “the web is set up for articles about the web to succeed.” In other words, articles that link to and discuss other web pages tend to be linked back to by the object of their analysis. They have some fairly complex processes related to the integration of the comments and open reviews, among other sources.

One of their goals is to mentor authors, not act as a gatekeeper to scholarly publication. Peake mentioned some of the critical points that have been made about blind peer review. It’s not as objective as purported, which most people likely know intuitively.

Estlund spoke about some of the costs. Their infrastructure costs are mostly “inconsequential,” while others are borne by the U of Oregon Library (leveraging their existing tools/services). Labour is mostly volunteer, it seems, but not free since they use graduate students. They are still working to quantify costs based on their 73 articles.

She closed by talking about the invisible review labour that goes into journals. Other discussions talk about changing the economic model, but they are interested in the publishing model, i.e., how scholarship is produced and disseminated.

Software curation as a digital preservation service
Keith Webster, Carnegie Mellon; Euan Cochrane, Yale

Webster made the case for curating software based on our experience with digitizing and curating other content forms: text, images, data, etc.

Cochrane spoke about what they are doing at Yale, but first offered more background. One point is that we have a great deal of software-dependent content, which underscores the need to curate software. As he put it, “old software is required to authentically render old content,” since opening old files in newer software often fails to render them faithfully (he showed a WordPerfect to LibreOffice Writer conversion example). Research results can be lost without the original software.

This means we need to curate and preserve operating systems, applications, fonts, scripts, plug-ins, entire desktop environments, and disk images. The latter would allow software to run on emulated hardware. Without all of these components, we don’t have all the pieces we need.

The main methods for doing this work are emulation and virtualization. The latter is “emulation with compatible hardware.” What it allows us to do is “bridge the gap” between recently obsolete hardware and the arrival of new hardware that’s powerful enough to emulate it.

What do we need to do this? We need unique, persistent identifiers for software and for disk images. As he put it, we don’t need hundreds of Win95 disk images. He named a number of other attributes we need, but noted that we have none of these.

We typically provide access to emulated environments in specific physical places, which makes them difficult to access. This is where the idea of emulation as a service emerges. The basic idea is making emulated environments available via any web browser. This takes the abstraction and technical challenges away from the user, and it’s possible to dump changes at the end of a session. The base disk image remains unaltered. It’s fairly easy to restrict copying, downloading, etc.
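The "dump changes at the end of a session" idea is essentially a copy-on-write overlay. The toy model below is my own illustration of that general technique, not a description of how any particular emulation service implements it; the class and method names are hypothetical.

```python
from types import MappingProxyType


class EmulationSession:
    """Toy session: reads fall through to a shared base image,
    writes are kept in a per-session overlay that can be thrown away."""

    def __init__(self, base_image):
        # Read-only view of the base image (block number -> bytes).
        self._base = MappingProxyType(dict(base_image))
        self._overlay = {}

    def read(self, block: int) -> bytes:
        # Prefer the session's own changes, else fall back to the base image.
        return self._overlay.get(block, self._base.get(block, b""))

    def write(self, block: int, data: bytes) -> None:
        # Changes never touch the base image.
        self._overlay[block] = data

    def discard(self) -> None:
        # End of session: drop every change the user made.
        self._overlay.clear()


base = {0: b"boot sector", 1: b"win95 system files"}
session = EmulationSession(base)
session.write(1, b"user edited this")
print(session.read(1))   # the session sees its own change
session.discard()
print(session.read(1))   # back to the untouched base image
```

Because each session only carries its own overlay, many users can share a single base disk image, which fits the point that we don't need hundreds of copies of the same image.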

Yale uses bwFLA, a tool developed at the University of Freiburg in Germany. Emulation as a service (EaaS) solves a lot of problems. It makes content accessible to users who can’t handle more complex emulation tools. Also, old software can be hard for modern users to understand. He gave many other reasons for EaaS, one of which is that one can outsource specific components, i.e., emulation can be provided remotely, but the disk images and content can be maintained locally.

In response to a question about getting one’s hands on the installers, Webster noted that if you go to anyone on campus over 40 and ask them to open their desk, you tend to find an interesting array of old software (lots of laughs). I suspect most of us know of such a stash.

Digital Preservation Network progress report
Evviva Weinraub Lajoie, David Pcolar, DPN

Their current technical partners are Academic Preservation Trust (APTrust), Chronopolis/DuraSpace, Stanford Digital Repository, U of Texas Data Repository, and HathiTrust. They use heterogeneous nodes. As Pcolar put it, they are looking out 100 years into the future, so they need to be highly flexible. The heterogeneous nodes together form a preservation-level dark archive, but they are different and operated by the various partners. Data replicates across the nodes, with the idea being that in some major disaster, one or more of the nodes will survive.

Pcolar showed a number of slides around their ingest and replication process. It is fairly straightforward, web-accessible, and REST-based. Content hits a node packaged according to the BagIt spec, then is replicated across the network. Their architecture relies on the “core capabilities founded on proven institutions and repositories.”
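For readers who haven't met BagIt, a bag is just a directory with a declaration file, a `data/` payload directory, and a checksum manifest. The sketch below builds and verifies a minimal bag with the Python standard library. It follows the general BagIt layout only; the notes don't describe DPN's own bag profile, so the file names in the payload, the choice of SHA-256, and the helper names `make_bag` and `verify_bag` are assumptions of mine.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write a minimal bag: bagit.txt, data/ payload, SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration; 0.97 was the BagIt version current in 2015.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "example_bag"
    make_bag(bag, {"notes.txt": b"CNI spring 2015"})
    print(verify_bag(bag))  # intact payload
    (bag / "data" / "notes.txt").write_bytes(b"tampered")
    print(verify_bag(bag))  # altered payload no longer matches the manifest
```

The same checksum comparison is what lets each replica be checked independently after it has been copied to another node.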

They ran their first successful pilot in late 2014. They are now working on implementing Agile methodologies across the development team. Their production release is slated for July 1, 2015 (soft launch). At that point, data will replicate, but reporting tools will be developed after the soft launch. What they won’t have in July are SLAs, a published business model, clear node storage capacity, timeline for onboarding infrastructure, cost models for ingest/admin nodes, ongoing fixity checks, or a reporting dashboard.

Their model sounds expensive to me. For one, they have staff at the DPN level, which creates expense layered on top of the base infrastructure and institutional costs. Then they have to build and manage a membership, which consumes resources as well, so members are paying to recruit other members. For example, they have a communication strategy, which makes sense given the need to build and grow the network, but it also brings expense with it. As Pcolar put it, DPN isn’t just a storage network, but rather a cooperative. He bases the benefit of DPN on the expense of doing it yourself. I get that point at the institutional level, but our work in Ontario demonstrates that at a certain scale, these costs are not only bearable, but potentially very low. Also, within an existing consortium, there is no need to recruit and manage membership. Pcolar referred to the past years as the “venture capital” phase.

Their costs are a $20,000 membership fee, which provides 5TB per year for the first six years. They estimate the one-time payment per TB for 20 years of preservation and storage in DPN at $5,000-6,000. If one breaks down the $20K assuming full utilization of the 5TB for six years (30TB in total), it comes out to $667/TB/yr, which isn’t as high as some preservation options (e.g., DuraCloud), but is high compared to some other existing projects. If one is not using the full 5TB, then DuraCloud and others become more attractive from a price viewpoint. As one commenter noted during the Q&A, DPN’s model is more of a capital model than an operational model.
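To see how sensitive that per-TB figure is to utilization, here is the same breakdown as a small function. It follows my own arithmetic above (the fee divided by 5TB times six years); the `utilization` parameter is an addition of mine for illustration, not part of DPN's pricing.

```python
def cost_per_tb_year(
    fee: float, tb_per_year: float, years: int, utilization: float = 1.0
) -> float:
    """Membership fee spread over the storage actually used.

    utilization: fraction of the annual allocation a member really deposits.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    tb_used = tb_per_year * years * utilization
    return fee / tb_used


full = cost_per_tb_year(20_000, 5, 6)
half = cost_per_tb_year(20_000, 5, 6, utilization=0.5)
print(f"Full use of the allocation: ${full:,.0f}/TB/yr")
print(f"Half of the allocation:     ${half:,.0f}/TB/yr")
```

A member using only half of the allocation pays twice as much per TB, which is where the other options start to look more attractive.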

Picture this! Supporting data visualization research at scale
Carol Hunter, Joe Williams, Jill Sexton, UNC Chapel Hill

Williams showed an example of how they are using Voyant to visualize a text corpus extracted from the Documenting the American South project (data download here). During their presentation, it was generally impressive how many colleagues they mentioned being involved in this work. Showed another example that used ArcGIS Online and Tableau to visualize data on underemployment in North Carolina from an undergraduate course.

Another example Sexton described documented the incidence of catcalls on the UNCCH campus as well as in the larger community. The students’ first iteration was a large cloth map on which they had people place an orange paint thumbprint where they had experienced catcalls. Taking it further, they used ArcGIS Online to capture the data, and extend it with annotations, etc. Perhaps unsurprisingly, this map caught the attention of various administrators on campus who sought to use it as a basis for an anti-catcalling campaign.

Libraries, Technology, and other matters
