
CNI Fall 2017 Membership Meeting notes

December 12, 2017

CNI was informative and enriching as always. This time around I had the opportunity to participate in the executive roundtable on moving data to the cloud, and I enjoyed that environment.

At this event, I chose to take most of my notes using a pen and paper, after reading some thoughts on this from Mita Williams and viewing a video on the topic she shared. I found that I took more copious notes when using the pen yet could follow talks much more closely. Also, with my laptop closed, I wasn’t tempted to check email or look up various bits related to the talk. It was kind of refreshing and this might become a new habit.

Archival Collections, Open-Linked Data, and Multi-modal Storytelling
Andrew White – Rensselaer Polytechnic

They have multiple digital archive platforms running in parallel (Digitool, Archon, Inmagic Genie), all end-of-life systems. Right now, they scrape data out of these various tools, augment it, and present information, e.g., on historical buildings, via static HTML pages.

In a rough nutshell, they are now pulling the metadata into a gaming engine (Unity) to build visualizations, as well as using it for other similar transformations. In the process, they created linked data triples to establish connections that allow them to create rich displays with multidimensional navigation.

It was an interesting project, but it seemed uncritical in many ways. As White noted, their work uncovered various new relationships between people and buildings, but it does not seem that they approached the project with any critical questions in mind or with the intent to create a tool that fosters critical inquiry. He did point out that such a tool can surface latent relationships, which would seem to point in that direction, toward interpretations of institutional history that differ from canonical tales, but he didn’t indicate whether this was their intent or an outcome.

Ensuring Access to Culturally Significant, At-Risk Audiovisual Recordings: The EMI Music Canada Archive
Tom Hickerson, Annie Murray – U of Calgary

The EMI Music Canada collection spans the years 1949-2012, when EMI was purchased by Universal Music. Physically the materials are in great condition, having been properly stored and managed. There are about 5,500 boxes and two million items in all. There are four partners involved in this work: Universal Music Canada, University of Calgary, National Music Centre (in Calgary), and the Mellon Foundation. Universal provided over $1 million to support processing the collection and covered the costs of transport from Toronto to Calgary. It was the director of the NMC in Calgary who facilitated the connection between the U of C and Universal. The negotiations took about three months, and then there was a year where lawyers from the two sides worked out a legal agreement. The gift was announced on March 31, 2016. Prior to the announcement, they used the code name “Milky Way,” a bit from an Anne Murray song. The announcement was timed to coincide with the Juno awards event in Calgary.

Beyond making it accessible and preserving it, Calgary wants to create a sustainable and scalable model for working with AV materials. They have a temporary reformatting studio and are building a permanent space in their remote storage facility, as well as a cold storage area for the AV materials. The Mellon funding assisted with planning and pilot projects. The goal is to advance models for treating AV materials.

Capitol Records started in Canada in 1949. EMI bought them in 1955, but the Capitol label persisted. In 1992, EMI bought Virgin, and the Canadian company's name changed to EMI Music Canada. In 2012, Universal Music bought EMI. Tom noted that some UK artists, such as the Beatles and Pink Floyd, entered the US market via Canada, where their albums appeared first.

There are demo tapes in the collection for music never produced or released. These are unique resources, i.e., there is only this one copy. There are 18,213 audio recordings in 20 formats, 13,000 studio masters (unique items), 2,000,000 documents and photos, 18,617 video recordings in 22 formats, and 4,972 optical disks in five formats. Materials will arrive or have arrived in five transfers. Their initial focus is on the AV materials for preservation reasons.

They have two processes: archival processing (supported by the gift from Universal) and media migration, the planning and execution of which is funded by Mellon. By 2020 they want to have migrated every AV file, and they are currently looking for appropriate platforms for their digital asset management system and digital preservation system.

They have a long list of technical advisors and supporters. Most of the migration is occurring in-house. This requires not only legacy hardware and machines, but also people who can operate and maintain them. They showed their standards for migration, which far exceed what a typical CD offers in terms of bit depth and frequency range. In a nutshell, these are very high and robust quality standards.

Collaboration and Platform Integration in Support of a Federated Research Data Management Service in Canada
Donna Bourne-Tyson, Dalhousie U; Lee Wilson, Portage

Provided an overview of Portage. Six working groups, two more on the way: Institutional RDM Strategy and Ethical Treatment of Sensitive Data. DMP Assistant 2.0 is also on the way.

Introduced FRDR: Federated Research Data Repository, which uses the Open Collections technology from UBC to create a federated search across 31 Canadian data repositories that collectively host 125,000 datasets. FRDR can also host data as well as federating metadata from other repositories. The goal is better discovery and to drive traffic back to the repositories, breaking down silos. FRDR uses the Globus file loader to enable it to ingest large files, which may be too large for other repositories. It is designed for scalability and can use Compute Canada or other storage on the backend. Also has Archivematica integration, which works up to 300GB or 25,000 files.

They also spoke about Dataverse North, which aims to create equitable access to Dataverse for all Canadian researchers. They want to coordinate all of this work in order to make it scalable. It also works with other national-level Dataverse initiatives; again, the goal is to break down silos. Lee showed a visualization I failed to capture in my notes, but what he called the “Network of Expertise,” i.e., the human experts involved with the work, acts as “the glue that holds it all together.” The takeaway here for me is the recognition that it’s not just about infrastructure, but a fully supported environment. The goal of FRDR is not to compete with Dataverse but to complement it.

Creating a New Way to Search
Alex Humphreys, JSTOR Labs; Barbara Rockenbach, Columbia U

Alex introduced Text Analyzer, a new search tool from JSTOR Labs. It’s already in public beta release. In a nutshell, rather than searching by keywords, a researcher feeds JSTOR a text and JSTOR ingests and analyzes it. He did a quick live demo and showed the very cool interface that includes “equalizer” sliders so that one can add to or reduce emphasis on various topics. As he noted, the tool helps break down silos by bringing back articles that are in areas the researcher might not have considered. Currently it is English-only, but they are working on other languages.

The tool extracts text even if the document consists of image files by performing OCR on ingest. The terms identified stem from the JSTOR Thesaurus and the LDA Topic Model, and they do some entity recognition using OpenCalais, IBM’s Alchemy, and others I didn’t catch. The topic model is a labeled LDA topic model, which they trained using Wikipedia and JSTOR documents. At the core it uses Mallet and has 11,000 topics at its disposal. Each topic represents a distribution of word probabilities, as he put it.
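To make the "distribution of word probabilities" idea concrete for myself, here is a minimal toy sketch. It is not JSTOR's implementation and does not use Mallet; the two topics, their words, and their probabilities are all invented for illustration, and the `score_topics` function and its `floor` parameter are my own hypothetical names. It simply ranks topics by how likely the words of a submitted text are under each topic's distribution.

```python
import math
from collections import Counter

# Hypothetical toy topics (invented for illustration): each label maps
# words to probabilities that sum to 1.
TOPICS = {
    "archives": {"archive": 0.4, "metadata": 0.3, "collection": 0.2, "music": 0.1},
    "music": {"music": 0.5, "recording": 0.3, "album": 0.1, "archive": 0.1},
}


def score_topics(text, topics=TOPICS, floor=1e-6):
    """Rank topic labels by the average log-probability of the text's words.

    Words a topic has never seen get a small floor probability so that
    the logarithm stays defined.
    """
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return []
    scores = {}
    for label, dist in topics.items():
        loglik = sum(n * math.log(dist.get(word, floor)) for word, n in counts.items())
        scores[label] = loglik / total
    # Highest average log-probability first.
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(score_topics("music recording album"))
```

A real labeled LDA model learns these distributions from a corpus and works with thousands of topics; the sketch only shows why a whole text, rather than a handful of keywords, can be matched against topics.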

The next steps for the tool are to improve the algorithms, open an API to beta partners, and to integrate article recommendations based on this tool in the search result sidebar in a way that makes sense to users. They’re still learning. Is it a feature, a product, or a service, he asked. Right now, it’s a feature.

They used Columbia’s IR as a corpus to test their tool. It was a widget that showed both JSTOR and Columbia content related to the viewed item.

How do you change researcher behaviour? As he put it, they’re not trying to replace keyword searching, just augment it.

Barbara explained how Columbia used this tool with two groups. One consisted of members of their student library advisory group. The graduate students in this group saw the utility immediately. One asked about the terms of service since one is uploading texts to JSTOR, an astute question.

They also introduced it into undergraduate instruction, but faced some resistance from library liaisons. Some saw it as an advanced tool or noted that it made research “too easy.” They ultimately did not learn as much as they had hoped due to this resistance, but did say that some librarians embraced it and were interested in sharing it with students.

Barbara pointed out that we need to stop assuming we know what our users want or need. A service such as this is another tool in the toolkit, which is a positive thing. She also made the critical point that we need to steward these purchases of resources such as JSTOR, not just pay for them, turn them loose, and then leave them alone.

Open Encyclopedia System: Open Source Platform for Open Access Encyclopedias
Christoph Schimmel, Freie U Berlin

Open Encyclopedia System (OES) is being developed at the Center for Digital Systems (CeDiS) at the Freie Universität in Berlin. As he noted, print encyclopedias are receding from the market, with electronic encyclopedias starting to appear in the mid-1990s. Wikipedia also made a major impact when it appeared early in this century. Noted a recent electronic encyclopedia, the 1914-1918 Online International Encyclopedia of the First World War, an open access, peer-reviewed work with about 1,200 articles and another 500 planned.

The key facets of encyclopedia development are publication workflow, a participation model, a peer review model, and community engagement. Their view is that each of these needs to be configurable, not fixed. To this point, online encyclopedias have been built either with proprietary or custom developed tools, but standardization and efficiency are necessary.

1914-1918 used Semantic MediaWiki, but it had limitations, e.g., the participation model is not configurable and modifying the code to make it so would be time and labour intensive. The need is to shift from unidirectional publication to interactive publication, where one can accept and integrate user input, as he put it, a “living encyclopedia.”

Their goal is to achieve this with OES and they are just starting down this road. They want a highly configurable solution that is tailored to the humanities and social sciences. Also: Web-based, modular, open source licenced, reusable. They want to build it using existing tools as components: WordPress, DSpace, MySQL, Solr, and external data sources such as Zotero or WorldCat (he noted during Q&A that they want to build it so that these components could be substituted, e.g., another CMS for WordPress or Fedora for DSpace). They want to offer community engagement tools: call for papers, blog, personal work environment, comments, annotations, etc. The key point is to create these living encyclopedias where users can contribute.

They want an improved publication workflow, particularly with regard to versioning, online/offline submission, and individual workflows for various roles. Their versioning will improve on Wikipedia, not least because it’s configurable. For example, small edits can be deemed not to constitute a new version so as not to create version profusion.
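The configurable versioning rule could look something like the sketch below. This is my own guess at the idea, not OES code: the `is_new_version` function and its `min_change` threshold are hypothetical, and I am assuming that "small edit" can be approximated by text similarity, using Python's standard `difflib` module.

```python
import difflib


def is_new_version(old_text, new_text, min_change=0.05):
    """Return True when an edit is large enough to count as a new version.

    min_change is a hypothetical, per-encyclopedia setting: the share of
    the text that must differ before a new version is recorded.
    """
    similarity = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return (1.0 - similarity) > min_change


if __name__ == "__main__":
    old = "The war began in 1914 and ended in 1918."
    typo_fix = "The war began in 1914 and ended in 1918!"
    print(is_new_version(old, typo_fix))
```

An encyclopedia that wants every edit recorded could set `min_change` to 0, while one that wants to avoid version profusion could raise it.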

The German Research Foundation (DFG) has funded this for 2016-2019. They are working with three extant encyclopedias, including 1914-1918. So far they already have prototypes and added versioning. A 1.0 release is slated for late 2018. They are looking for partners to help develop or at least deploy OES.

Annotation and Publishing Standards Work at the W3C
Tim Cole, U of Illinois Urbana-Champaign

This talk was hard to follow. Dense, small-print slides and I think Tim had a slight cold, so hearing him was a challenge at times. Also, I must confess that at the start I lacked context and background and was scrambling to catch up.

The timeline for Web annotation work begins in 2009. The core idea is to make a framework for annotations that makes them interoperable and treats them as separate objects. A URI is not sufficient for this because annotations have structure, not just data, so they are expressed in JSON.
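To help with the context I was missing, here is a small sketch of what such a structured annotation looks like. The field names (`@context`, `type`, `body`, `target`, `TextualBody`, `TextQuoteSelector`) come from the W3C Web Annotation Data Model, which serializes annotations as JSON-LD; the `make_annotation` helper and the example.org URLs are hypothetical and only for illustration.

```python
import json


def make_annotation(anno_id, target_url, comment, exact_quote=None):
    """Build a Web Annotation as a plain dict ready for JSON serialization."""
    annotation = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        # The body is the comment itself, kept separate from the target.
        "body": {"type": "TextualBody", "value": comment, "format": "text/plain"},
        # In the simplest case the target is just the annotated page's URL.
        "target": target_url,
    }
    if exact_quote is not None:
        # To point at a passage rather than the whole page, the target
        # becomes a structure with a selector.
        annotation["target"] = {
            "source": target_url,
            "selector": {"type": "TextQuoteSelector", "exact": exact_quote},
        }
    return annotation


if __name__ == "__main__":
    anno = make_annotation(
        "http://example.org/anno1",  # hypothetical identifier
        "http://example.org/page1",  # hypothetical target page
        "Worth a closer look.",
        exact_quote="living encyclopedia",
    )
    print(json.dumps(anno, indent=2))
```

The point of the structure is that the annotation lives as its own object, with its own identifier, apart from the page it describes.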

W3C issued a new (February 2017) Web Annotation Protocol. Currently, annotation requires plug-ins as it’s not yet integrated into browsers. One common tool is Hypothesis, another is Pundit, both Chrome extensions. Europeana Annotations API–very new–stores annotations in Europeana.

Apache Annotator is now incubating and moving along slowly. The idea is to integrate the annotation specification into Apache.

Tim also mentioned the Publishing Working Group, which seeks to bring the publishing world into the Web. As he put it, it’s not sufficient just to toss up HTML docs with stylesheets, JS, etc. This is not a publication. Web Publication operates on the Web as a single resource, even if its components are Web resources. They are distributable as a single file by using a packaging format. There are still many questions to resolve: accessibility, offline access, archiving, etc.

From Stock to Flows
Kristin Antelman, David Kremers, Stephen Davison – California Institute of Technology

David, the scientist in the trio, started off by introducing us to flow. Also used the word churn to describe it. He noted that his work generates stocks as well, e.g., datasets and other research artifacts, which can then feed back into new research flows.

He tried to get the library at Caltech interested in archiving these materials in the mid-1990s but didn’t get anywhere. Data from back then is now nearly entirely lost other than the few images in his slide deck. Now he gets a better reception for curating a flow of information.

Stephen pointed out that at present they have no infrastructure to do this. They are working to address this. They currently have multiple repositories: an IR on Eprints, a research data repository on Invenio, an Islandora DAMS, and ArchivesSpace. The fragmentation here is troublesome, not least since they now want to include external data sources to enrich their metadata: ORCID, Crossref, Fundref, etc. Gluing this together with scripts and middleware creates a jumbled mess.

They now want to pull all the data out of all of the repositories and store it in open formats in one place, harvesting nightly from their disparate repositories. They want to “re-embrace the command line” and move their development focus away from systems and toward tools. The tools they are building are collectively known as Dataset: command line tools that work in any environment, even on a Pi as he pointed out. Dataset can pull from any API, so that external data sources can be easily pulled in.

At harvest, they build temporary JSON files which they read from and write to local repositories. These can be regenerated nightly if needed. Because the tools are so simple, the flow can go in both directions, i.e., to or from the repository.
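As I understood it, the harvest step boils down to something like the following sketch. This is not the Caltech Dataset tool or its API; the `harvest` and `load` functions are hypothetical, and I am assuming each record is a dict with a filesystem-safe `id` field. Records pulled from a repository are written out as one JSON file each, and the same files can be read back for pushing data in the other direction.

```python
import json
from pathlib import Path


def harvest(records, out_dir, key="id"):
    """Write each record to its own JSON file and return the paths written.

    Assumes record[key] is safe to use as a file name.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        path = out / f"{record[key]}.json"
        path.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
        written.append(path)
    return written


def load(out_dir):
    """Read every harvested JSON file back into a dict keyed by file stem."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(out_dir).glob("*.json"))
    }
```

Because the files are rewritten on each run, rerunning `harvest` nightly simply regenerates them from whatever the repositories currently hold.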

Their team working on this consists of four people, two developers and two librarians, one of whom has strong coding skills while both have broad technical skills. The tools all reside in GitHub.

Kristin started by stating that organizationally they need to develop “flow” skills to support this new environment. They are offering three carpentry clusters–software, data, author–to teach their community these skills. All of this supports the move to open science.

She showed a graphical representation of the stock model, which was stable but no longer suffices. They want to support flows by removing barriers and ambiguity. She talked a bit about Jupyter notebooks, which made me wonder if researchers at McMaster are using Jupyter. She noted that the new environment still includes stocks, e.g., published articles.

Data provided to them by 1science shows that about 55% of the articles they licence are available via open access, not counting content available via SciHub. She calls licensed content that is also available via open access “pseudo-stocks.” She used the metaphors of pyramids and the Ise Grand Shrine in Japan to illustrate stocks versus flows. The former are permanent and enduring and represent stocks. The latter is rebuilt every 20 years, and has been for 1,200 years, which is flow. The research lifecycle is a type of flow. We already know and acknowledge this model. In other words, we already do flow and just need to go further in embracing it.

I would have enjoyed taking notes on Herbert Van de Sompel’s closing plenary, but alas my flight back to Toronto was cancelled as I sat down, so I spent his talk on hold as I attempted to rebook my flight and get home.

