CNI Spring 2013 notes
Another great CNI membership meeting is behind us. I saw a wide range of presentations, and found myself wishing, as usual, that there weren’t so many offered in parallel. As always with these notes, I’ve placed my own editorial comments in italics to differentiate them from the speakers’ words.
Opening Plenary – From the Version of a Record to a Version of the Record
Herbert van de Sompel, Los Alamos National Laboratory
Scholarly communication now includes many assets beyond textual articles: datasets, software, blogs, and more. The challenges are to group these assets and to version them.
Gave a tour of the history of efforts to modernize scholarly communication.
1999 – OAI (Open Archives Initiative) – communication via preprints and non-peer-reviewed papers. They tried technical means to achieve this, leveraging metadata exchange. Next came the Protocol for Metadata Harvesting (OAI-PMH), but it ignored search engines as an essential part of the Web infrastructure, which they have become in the intervening years.
He pointed out that Web architecture wasn’t really firmly articulated until 2004, so they were working in an era when things had yet to be established, so to some degree they were creating their own architecture. They lacked faith in HTTP, because, as he humorously put it, “we had just lost Gopher” and had been burned by that experience.
With compound objects, the goal now is not to look at the issue from a repository perspective, but from a Web architecture perspective. What is needed are machine-readable resource maps that aggregate objects. What tools can help with this? URI, RDF, ORE (Object Reuse and Exchange) [n.b. – I may have gotten this list wrong as he flipped slides quickly]. It was a major career shift for him to move away from thinking about digital libraries toward working with Web architecture. What this enables is real interoperability with other existing Web applications, and it also allows other communities (not involved in the development) to make use of the work without having to know anything about the community that created it. Conversely, tools they create can be used by libraries. To manage aggregations, they were able to use Web tools that already existed.
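The resource-map idea needs no special tooling to illustrate: an ORE resource map is essentially a set of RDF triples stating that a map describes an aggregation, and that the aggregation aggregates its member resources. A minimal sketch in Python follows; the ORE namespace is real, but the example URIs are invented placeholders, and a real implementation would use an RDF library and serialize to RDF/XML or Turtle:

```python
# Sketch of the triples at the heart of an OAI-ORE resource map.
# The ORE vocabulary namespace is real; the example URIs are made up.
ORE = "http://www.openarchives.org/ore/terms/"

def resource_map_triples(rem_uri, agg_uri, aggregated_uris):
    """Return (subject, predicate, object) triples linking a resource
    map to its aggregation, and the aggregation to its parts."""
    triples = [
        (rem_uri, ORE + "describes", agg_uri),
        (agg_uri, ORE + "isDescribedBy", rem_uri),
    ]
    triples += [(agg_uri, ORE + "aggregates", part) for part in aggregated_uris]
    return triples

# A compound object: an article plus its dataset and software.
triples = resource_map_triples(
    "http://example.org/rem/42",
    "http://example.org/aggregation/42",
    ["http://example.org/article.pdf",
     "http://example.org/dataset.csv",
     "http://example.org/code.tar.gz"],
)
```

Because the map itself is just Web resources and links, any RDF-aware application can consume it without knowing anything about the repository that produced it, which is exactly the interoperability point he was making.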
Critical to remember that compound asset groups can exist in multiple repositories. This raises questions of stewardship. Who controls access and access rights?
Memento, which has been in development for a number of years now, tackles the issue of versioning. A URI only points to the current version of an asset, while there may be legacy versions available via a number of mechanisms. For example, even an article may have multiple manifestations in the form of drafts and review copies.
As he put it, it’s going from the version of record to a version of the record; “fixity is challenged.” It challenges the notion of a scholarly record, and it will become more important to fix certain versions at a point in time, or to learn their place in time. Versioning is inherent in the Web, as he showed using the example of a W3C document that listed its previous versions and their unique URIs. Memento seeks to take one from the generic, time-insensitive version to a given time-specific version.
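For the curious, the datetime negotiation that Memento standardizes is simple at the HTTP level: a client asks a TimeGate for an archived version by sending an Accept-Datetime header alongside the original URI, and the TimeGate redirects to the time-specific version. A minimal sketch; the TimeGate endpoint shown is illustrative, though the header format itself comes from the Memento protocol:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime(dt: datetime) -> str:
    # Memento negotiates in the HTTP-date format, always expressed in GMT.
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

def timegate_request(original_uri: str, dt: datetime):
    # An aggregator-style TimeGate prefix (illustrative endpoint).
    timegate = "http://timetravel.mementoweb.org/timegate/"
    url = timegate + original_uri
    headers = {"Accept-Datetime": accept_datetime(dt)}
    # A real client would now GET `url` with `headers` and follow the
    # redirect to the time-specific version (the "memento").
    return url, headers

url, headers = timegate_request(
    "http://www.cni.org/", datetime(2010, 4, 1, tzinfo=timezone.utc)
)
```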
What’s ironic about watching this talk is that I installed Memento when it was a brand new beta with some wicked little bugs, and now, about five years later, I’m seeing a breakdown of how it works. The irony that this is occurring time-shifted is rich given the topic.
What’s the bad news? Many resources are not archived, or if they were, they may not be from the time of original publication. To address this, he outlined some steps one can take with one’s own resources, including using tools that version, such as wikis. It occurs to me that OJS is actually a really good versioning system, but the drafts aren’t publicly accessible. I wonder if any thought has been given to developing a plugin to release those post-publication, with the author’s consent, of course. Would be a great resource.
He finished by talking about scholarly activity in social media arenas, particularly those, such as SlideShare, that have object stores. Many assets living in various portals, all linked by user identities. There are also metrics associated with these, which can be used for review purposes. Under the heading “surface the scholars,” he made a plea for institutions to promote scholars who take part in these public portals. The benefits are visibility for both the scholar and the institution, but also more metrics. For society, it’s a way to show return on investment.
Managing Large-Scale Library Digitization Projects Via the Cloud
Timothy Logan, Darryl Stuhr, Baylor U
They run a fairly complex environment: 15 workstations, three specialty scanners, audio/visual capture, etc. They use redundant mirrors, including one in-house, for storage.
Their project management flow uses a wide range of free or almost free tools: Evernote, Basecamp, OmniGraffle, GoogleDocs Spreadsheets, GoogleDocs Documents, Hojoki. In order, these handle: documentation, collaboration, design, tracking development, training, and monitoring.
Evernote is used for meeting notes, as well as to store documents and audio files or images. Evernote actually runs OCR across uploaded documents. Of course, it can be shared with other users. Basecamp is project collaboration software. Seems similar to Redmine, which we use here at McMaster for project management. Basecamp is not free, but it’s fairly inexpensive.
OmniGraffle is not cloud-based, but it’s a MacOS/iOS tool that allows visual workflow design (flowcharts, basically). Also not free, but fairly affordable.
Needless to say, Google Drive is entirely free, and they use both documents and spreadsheets. Hojoki is a productivity application that they use to monitor activity on the other platforms. It issues notifications whenever there are changes. It also compiles daily briefings and weekly metrics. It integrates with a wide variety of services. It’s app-based and available for iOS and Android. It’s currently free, but they believe it will move to a subscription model soon.
Economical Big Local Storage
Tom Klingler, Kent State
As with everyone, they have a lot of data and were looking for cheap storage alternatives.
They calculate the price for using Backblaze pods at $0.30/GB. Their numbers are slightly lower than ours at McMaster, because they are paying US prices.
From his description, it sounds as if the pods are available as storage for a number of staff via multiple upload means. They limit their use to 41 basic file types. Since the library also deals with university archives, items receive an expiration date, and the system sends reminder emails that items are expiring. Items can also be flagged as significant, and considered for long-term storage. They organize the pods by workgroup. They’ve offered options to automate metadata creation, since this is a step where many struggle. Also use IPTC data for images, if present.
Users can be in multiple workgroups, can create any number of projects, and can upload items as long as space exists. They use CentOS, RAID6, and have 36TB in each of three pods (only using 15 drive bays). The three pods sync against each other daily, using checksums, fixity checking, etc. They get notification when bad files are found.
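The daily pod-to-pod verification they describe boils down to comparing checksum manifests between mirrors. A rough sketch of that idea; the function names and layout are my own invention, not Kent State’s code:

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 digest of a file, read in chunks to handle large items."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    return {str(p.relative_to(root)): checksum(p)
            for p in root.rglob("*") if p.is_file()}

def compare(local: dict, mirror: dict) -> list:
    """Report fixity problems between two pods' manifests."""
    problems = []
    for name, digest in local.items():
        if name not in mirror:
            problems.append((name, "missing on mirror"))
        elif mirror[name] != digest:
            problems.append((name, "checksum mismatch"))
    problems += [(n, "missing locally") for n in mirror if n not in local]
    return problems
```

A nightly job could run `compare(manifest(pod_a), manifest(pod_b))` and mail out the problem list, which matches the notification behaviour described above.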
Access is Web-based, username/password. User sees first a list of workgroups and proceeds as desired. Also have other menu options, such as search. The software that drives this is locally developed, and they intend to put it out with an open source license.
Taking scholarly note-taking to the Web
Michael Buckland, UC Berkeley; Ryan Shaw, UNC-Chapel Hill
Creating scholarly editions is critical for the humanities, but the workflows are still bound to the print codex. Beyond that, publishers are no longer interested in putting volumes with those page counts out on the market.
Documentary editing has numerous steps and is generally tedious, detailed work. Many documents are required to create even a small portion of an edition. Projects have often spanned decades, so backups can be on magnetic tape, floppies, etc. Many of the notes are ephemeral in nature, e.g.- written on slips of paper and referencing people who may no longer be part of the project. Some of the questions asked and answered (particularly queries with negative results) do not even find their way into the published volumes. In short, this is expensive work done by experts, but much of it never sees the light of day nor contributes to future scholarship. Space is also a critical issue, i.e.- the space limitations necessary for publishers to make their profit margins.
There are also issues with fact checking, and having time and space to include falsifications and dead ends, or tangential biographical details.
Goal: “finding a safe place for the ‘debris’ of research.” That’s a good line. In other words, improving the return on investment for document editing projects.
They built the software using existing tools where possible. For example, it integrates with Zotero for bibliographical metadata. The most challenging component was building the notes section. Users want a little messiness, a bit of chaos, since that’s part of the process. Ultimately, notes have multiple sections (multipart) reflecting the variety of sources that can inform notes. Their notes are more topical in nature, rather than being built on the basis of a specific document such as a letter. Notes have a status, too: open, closed, hibernating. Notes can have users assigned to them by an editor. They’re stored as HTML with a full revision history.
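The note model described above, multipart notes with a status, editor-assigned users, and a full revision history, could be sketched as a simple data structure. The field names here are my guesses at the shape, not the project’s actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    OPEN = "open"
    CLOSED = "closed"
    HIBERNATING = "hibernating"

@dataclass
class Note:
    topic: str                                     # topical, not tied to one document
    sections: list = field(default_factory=list)   # multipart: one entry per source
    status: Status = Status.OPEN
    assignees: list = field(default_factory=list)  # assigned by an editor
    revisions: list = field(default_factory=list)  # full history of HTML bodies

    def revise(self, html: str):
        # Append rather than overwrite, preserving the revision history.
        self.revisions.append(html)

    @property
    def html(self) -> str:
        return self.revisions[-1] if self.revisions else ""
```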
What changes when using this tool? Notes move from being free text documents to being structured blocks that can be managed and recombined, etc. Creates explicit links to entities that are cited in notes, as opposed to implicit references. Instead of burying things in filing cabinets, they can be made available via open access mechanisms.
One benefit is that it allows the “outer edges” of humanities research to be visible. That’s good PR for those who wonder what humanities scholars do with their time and grant money.
All open source, available via GitHub at https://github.com/editorsnotes. Also built using mainly open source components: Django, PostgreSQL, Haystack, Google Refine, etc.
There are many desirable enhancements, such as visualization tools (temporal, geospatial, etc.). Doing some really interesting work with harvesting information via linked data, too (pilot project). They learned it was possible to do this, but that the editorial control over what’s been harvested needs to be better integrated into the note creation process. They also didn’t articulate the benefits all that well, so they need to figure out how to better exploit it (in-process reconciliation). In short, they need to create incentives for editors to do this, where they see immediate benefit from using structured data coming from external sources.
Scholarly Communication: New Models for Digital Scholarship Workflows
Stephen Griffin, U of Pittsburgh; Ed Fox, Virginia Tech; Micah Altman, MIT
Griffin views the data potential in the humanities as immense. It’s rich and interesting data, but he notes that there’s a lack of funding for arts and humanities scholarship.
In his work, they kept running across a recurring theme: in science, good research can be repeated and reproduced. The ability to do this defines good and valid research. To do so, one needs access to more than just research outputs, but also to process, methodology, workflows, etc. The issue becomes how one can capture and preserve such information and make it part of scholarly communication. Also, how does one deal with artifacts that are not captured well by journal articles?
Pointed out that project Websites are also a record of research and should be considered part of the research workflow and linked together with other research assets. This meshes with what one reads in the Short Guide to Digital Humanities.
Fox noted that a place to start with changing how we do things would be electronic theses and dissertations. Students are more likely to be flexible when it comes to new tools. The outcome would be an enriched ETD, with related artifacts available. Rather than ‘vanilla’ theses, the idea is to have a “macadamia nut fudge sundae” or some such concoction. Useful metaphor.
Altman pointed out that we need more evidence for what is being proposed: case studies, etc. His segment was so brief that it was hard to get his main points, which was unfortunate. He has an economist’s view, as he put it, so ended with a comment about the need to do research and then create forecasts that I wanted to hear more about.
Rights, Research, Results: The Copyright Review Management System
Melissa Levine, Richard Adler, U of Michigan
CRMS has gained an impressive list of partners, all U.S., but a diverse group of institutions. There are CRMS-US and CRMS-World components, which is good to see. World in this case means Australia, Canada, and the UK for now. VIAF has proven to be a valuable resource.
CRMS has developed a review process that involves multiple reviews of candidates drawn from a pool. If the reviewers agree, then it moves forward, if there is conflict, then further review is necessary. Their Website has detailed visual descriptions of their processes.
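The double-review logic they describe can be sketched in a few lines: agreement between independent reviewers moves a determination forward, disagreement escalates it for further review. The status labels and rights codes below are my own placeholders, not CRMS’s actual vocabulary:

```python
def adjudicate(determinations: list) -> str:
    """Route a candidate based on independent reviewer determinations,
    e.g. 'pd' (public domain) or 'ic' (in copyright)."""
    if len(determinations) < 2:
        return "awaiting second review"
    if len(set(determinations)) == 1:
        # Reviewers agree: the determination moves forward.
        return "resolved: " + determinations[0]
    # Conflict: the candidate goes on for further (expert) review.
    return "escalated for further review"
```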
They’ve built a custom interface for reviewers, and also maintain a documentation wiki based on issues that have arisen in the review process (basically, an FAQ). It’s a secure interface, and it even withholds copyrighted material from people who lack view rights. In other words, they are to review the title without reading it (we’re curious beings!). They ask themselves how much of a work they can show (5 pages, 10 pages?) to help reviewers make their decisions without violating rights.
They’ve discovered that it’s difficult to get valid information such as author death dates. Crowdsourcing is an option, but requires quality control. They feel that at this point a crowdsourcing project is more than they can handle. They do see some general QC issues with their review work. They train the reviewers, and require a 25% time commitment, but given that there is turnover, it’s a challenge to keep up with training and quality. It’s working fairly smoothly in the U.S. at the moment. It is essential that the reviewers have training and comfort with books as bibliographic objects and with the tools of bibliography. Typically, those are librarians, but they’re not dogmatic about using only librarians. They don’t want to use students, however, because of the training costs and turnover issues.
Michigan carries the liability, so they have a need to maintain the quality standards. They have not received a great number of takedown notices, some of which are legitimate (e.g.- essay collections not registered as books and therefore not in the Stanford database). They did a trial using the LC copyright office, and it cost around $5000 and six months to review 100 titles. That’s not cost- or time-effective.
They discovered at one point that they could pull death dates from their own catalogue (Mirlyn), which added 80,000+ dates to CRMS-World. That was a pleasant surprise for them and greatly aided review work.
They are considering pilots with other sites, e.g.- Berlin (Michael Seadle), Madrid, and perhaps Japan. Major portions of the works in the HathiTrust are in other languages, of course, with German in second position behind English. Many other pilots are on the table, too: linked data initiatives, women’s names issues, durationator, etc. Language diversity is also an issue. There are many languages (e.g.- Estonian) where there are small numbers of titles in HathiTrust. Durationator is being developed at Tulane, and seeks to deal with determining copyright term (*grossly oversimplified definition*). In short, this tool shows more works in the public domain than CRMS has found.
Research Data Management Services in Germany: Funding Activities of the German Research Foundation
Klaus Tochtermann, ZBW Leibniz Information Centre for Economics; Peter Schirmbacher, Humboldt U, Berlin
Full disclosure: I nearly always attend talks given by Germans at CNI. For one, I’ve spent a lot of my career working in or focusing on Germany, so have a personal interest in what’s going on there. Beyond that, I want to help spread the word in North America about some of the stellar work done by our German colleagues, which too often goes unnoticed on this side of the Atlantic. I’m grateful to CNI that they continue to offer opportunities to speakers from Germany and other countries.
EDaWaX is a joint project between the Leibniz Information Centre for Economics and the German Data Forum, funded by the DFG. Their project addresses the increase of empirical publications in economics, where there is often no means to replicate the results. This results from lack of sharing, and the absence of journal policies that would make data available. He showed an example where one paper was refuted by another based on the ability to repeat the experiment and independently assess the results. That kind of exchange is their motivation with this project.
They seek to implement a data archive for an economics journal and create incentives for economists to publish their data. They started by assessing the current state: how many share, how many journals have data policies, and do those policies facilitate replication? Not surprisingly, they found that 90% of economists do not share at all, about 8% share some of their data, and 2% share it all. With journals, about 72% have no policy at all, while 20% have a data availability policy and 8% a replication policy. They did ascertain that the number of journals with policies is growing, however. Ironically, half of the existing policies do not permit replication.
Even where journals gather data, their research demonstrated that that does not lead to good archiving and dissemination. Much of it lands on a publisher Website, but lacks metadata and identifiers that would make it findable or citable. Also, no standards are applied; various formats and software are used, which may or may not be viable for all.
re3data.org has the goal of being a global registry of research data repositories. Their workflow includes an ingest process (suggested entries from researchers and others), but they review submissions against established criteria before including a repository in the registry. They try to pull in a great deal of metadata about the repositories, based on a vocabulary that they created. They use 31 metadata elements and 22 controlled vocabularies, which result in an icon system that makes it easy to see what a given repository offers.
Closing plenary – The Ithaka S+R Faculty Survey US 2012: First Release of Key Findings
Roger Schonfeld, Deanna Marcum, Ithaka S+R; Judith Russell, U of Florida
We got a preview of the latest Ithaka S+R faculty survey results. Since the results will be public by the time I publish these notes, I opted not to take detailed notes.
It was mildly surprising to see that the library catalogue has rallied somewhat as a starting point for research. Just barely, though, and I wonder if people don’t actually mean the discovery tool.
At one point, a colleague and I were exchanging thoughts on Twitter wondering if there isn’t some mild conflict of interest going on with the faculty survey, given that Ithaka represents a publisher, in essence. Some of the questions seem, for example, formulated to elicit a publisher-friendly response. It’s a mild criticism, and there’s lots of good in the survey, but it seems worth noting and considering.