CNI Spring 2016 Notes

[Photo: Pedro Szekely, Flickr]
CNI Spring 2016 in San Antonio was, as always, a worthwhile trip. Somewhat different this time was that I gave an “issues” talk with Lisa Janicke Hinchliffe on our ongoing struggles to integrate IT functions and staff successfully into the library mainstream (title: From Invasive to Integrated: Information Technology and Library Leadership, Structure, and Culture – slides). We had a great turnout and people came ready to contribute, which was both encouraging and much appreciated. We both plan to write reflective blog posts on what transpired in that session.
Monday
Defining the Scholarly Record for Computational Research
Victoria Stodden, U of Illinois, Urbana-Champaign
Noted several ways that technology has impacted research; one that caught my eye was the notion that there are “deep intellectual contributions” embedded only in software. It’s not sufficient simply to capture such work in a methods discussion in an article, so how do we surface these intellectual contributions? Another impact is that nearly all of research has been digitized and has become accessible via the Internet. These objects also have intellectual property rights attached to them that enable (or hinder) reproducibility and other extended work.
She spoke about three types of reproducibility:
- empirical
- statistical
- computational
She noted that the first two are fairly well established and defined, and focused on the third. Traditional publications do not enable computational reproducibility, since they were designed for the empirical kind. It remains to be seen what will enable the computational piece.
She quoted David Donoho paraphrasing Stanford professor Jon Claerbout, who said in the 90s that the article produced by research is not the scholarship but advertising for the scholarship, which itself exists in the work being done, the methods, etc. I missed a bit of the latter part of the quote, but the advertising point alone is a strong one.
She also addressed the intellectual property issues that arise with software, highlighting Stallman’s work that largely led to the creation of open source licenses. Her suggestions appear in a proposed Reproducible Research Standard:
- release media components under CC BY
- distribute code under the Modified BSD license or similar
- release data to the public domain or attach an attribution license (she noted that data/facts bear no copyright)
She notes that this would move us toward longstanding scientific norms and remove copyright as a barrier to reproducibility.
She closed by posing some very challenging queries to the scholarly record, e.g.- show me a table of effect sizes and p-values from all phase 3 clinical trials for melanoma published after 1994. These are incredibly difficult queries to execute today, but she wants us to move toward an environment where they are possible.
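As a purely hypothetical illustration of the kind of query she has in mind, here is a sketch against an imaginary structured scholarly-record API. None of these endpoints, parameters, or fields exist today, which is exactly the gap she wants the cyberinfrastructure to close.

```python
import requests

# Hypothetical only: no such service exists today. The sketch just shows the
# shape of the query Stodden wants the scholarly record to be able to answer.
QUERY_ENDPOINT = "https://scholarly-record.example.org/query"  # imaginary service

resp = requests.get(QUERY_ENDPOINT, params={
    "study_type": "clinical_trial",
    "phase": 3,
    "condition": "melanoma",
    "published_after": 1994,
    "fields": "effect_size,p_value",  # return these values for each trial
})
for row in resp.json()["results"]:
    print(row["trial_id"], row["effect_size"], row["p_value"])
```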
We should set broad goals for the cyberinfrastructure:
- minimize time commitment by user for learning and using the CI
- automate discovery and dissemination as much as possible
- facilitate queries across record
- capture all information required to assess the findings
She noted that there have been some community responses coming from various corners, e.g.- ICERM 2012 and XSEDE 2014.
A Campus Master Plan for Research Storage: A Case in Progress
David Millman, Scott Collard, Lynn Rohrs – New York U
NYU, like many universities, has seen a number of named research entities emerge in recent years, and is also pushing to add a large number of new science faculty. These things have an impact, and they discovered deficiencies in their service structure, particularly with regard to IT services. They analyzed what their research peers were doing, which led to recommendations that went to the CIO and the library dean. What emerged was funding for new positions. The library added six, including a GIS Librarian, a Digital Scholarship Specialist, etc.
After doing some internal and external reviews of repository services, they created four service bands (scenarios, really) that characterize four different ways researchers and library curators interact with storage. These bands aren’t for the public to digest, but are connected to end-user services, such as the institutional repository, HPC, etc. Behind the service bands there can actually be shared infrastructure, but the user doesn’t need to know or understand this.
There is quite a bit of oversight for this, which they named DRSR (didn’t catch the acronym’s meaning). The executive committee consists of the library dean, campus Chief Digital Officer, and the library AD for information technology. Then there’s a larger group that forms the full committee with various stakeholders represented. They have also spun up a number of DRSR working groups: architecture design, functional validation & prioritization, policy development, and technical design, the last of which follows architecture design. The groups are staffed from the libraries, IT, and some faculty research centres.
Rethinking Library Services
Nassib Nassar, Index Data
This was something of a product introduction for Index Data’s foray into creating a new type of integrated library system, specifically another open source ILS (nothing there yet other than a signup form, but soon). Clearly, it’s mostly conceptual at this point. I had heard some bits and pieces about this project, so it’s good to see it hit the light of day at CNI.
They intend to use the Apache 2 license, which should enable both libraries and vendors to contribute, since it’s a permissive license. Architecture, not surprisingly, is highly modular and built on microservices. As he put it, it’s trendy right now to work with microservices.
He showed their architecture, and the slide made sense. At the base there are data stores: metadata, users, items, and a knowledge base. Above that are elements that look like modules we know: catalogue, user management, circulation, electronic and print acquisitions, etc. On top sits a UI layer, but it is flexible, and people could roll their own. New elements/modules could also be added at the “module” level.
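As a toy illustration of that modular idea (entirely my own sketch, not the project’s design), here is a minimal, stand-alone “circulation” microservice that a UI layer or another module could call over HTTP. The framework (Flask), route, and data are assumptions made for the example.

```python
from flask import Flask, jsonify, request

# Toy "circulation" module: one self-contained service behind an HTTP interface.
# A dict stands in for the items data store; a real module would talk to storage.
app = Flask(__name__)
items = {"bib-001": {"title": "Example title", "status": "available"}}

@app.route("/circulation/checkout", methods=["POST"])
def checkout():
    data = request.get_json(silent=True) or {}
    item = items.get(data.get("itemId"))
    if item is None or item["status"] != "available":
        return jsonify(error="item not available"), 409
    item["status"] = "checked_out"
    return jsonify(itemId=data["itemId"], status=item["status"])

if __name__ == "__main__":
    app.run(port=8081)  # each module can be deployed and scaled independently
```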
They are looking to foster a highly decentralized development environment, but with microservices, things can get chaotic very quickly, as he noted. One phrase he used was that they want to create an operating system for library functions. That’s a useful mental model.
From his remarks, this is clearly just in its earliest stages. Glad to see these various groups pull together–in addition to Index Data, EBSCO and OLE are involved–but I have a bit of concern about how this will be different from previous attempts. EBSCO is funding it, so that’s a change.
First bits of core infrastructure should hit GitHub this summer. Late 2017-early 2018 is when there might be code that one could implement. TIND is also involved as a partner. Many other unnamed partners (soon to be named) on board, per the EBSCO representative present.
In response to a question about successful vs. unsuccessful open source projects (in libraries and beyond), he noted that they are focusing very closely on the developer experience, much as one focuses on user experience. Earlier in his talk, he noted that they want to make it ‘fun’ for developers to contribute. That’s a noble goal. The CEO of Index Data added, in the Q&A, that it should be “fun and exciting” for developers. That’s even more challenging.
Tuesday
Access to DBpedia Versions Using Memento and Triple Pattern Fragments
Herbert Van de Sompel, Los Alamos National Laboratory; Miel Vander Sande, Ghent U
[slides]
Herbert started out by giving a quick overview of Memento and how it relates to linked data. Memento is “time travel for the Web,” a protocol that allows a client to retrieve prior versions of a Web resource. The requested resource points to a TimeGate, which is aware of previous versions (Mementos) and can point to them. It works in the other direction, too, of course: legacy versions can point to current versions.
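A minimal sketch of the client side of that datetime negotiation, using Python’s requests library; the TimeGate URL is a placeholder, not the actual LANL or DBpedia endpoint.

```python
import requests

# Placeholder URIs for illustration; substitute a real Memento TimeGate.
TIMEGATE = "https://timegate.example.org/"
RESOURCE = "http://dbpedia.org/resource/San_Antonio"

# Datetime negotiation (RFC 7089): ask the TimeGate for the Memento of the
# resource closest to the requested datetime.
resp = requests.get(
    TIMEGATE + RESOURCE,
    headers={"Accept-Datetime": "Tue, 01 Apr 2014 00:00:00 GMT"},
    allow_redirects=True,
)

print(resp.headers.get("Memento-Datetime"))  # datetime of the version that was served
print(resp.headers.get("Link"))              # rel="original", "timemap", neighbouring Mementos
```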
Starting in 2010, he and his collaborators began applying this concept to linked data. In this context, rather than just pointing directly to the current LD representation, one uses the TimeGate to point to previous representations. Showed a quick example of how this works: a graph of nations vs. their GDP per capita across versions of DBpedia.
To further this work, they’ve been building a DBpedia archive, capturing various dumps and putting them in a MongoDB archive. This required custom software, and an upload took 24 hours. At some point this became unworkable: there was too much complexity (not scalable), it supported lookup by subject URI only, and they had to freeze it, i.e.- stop adding data. They went back to the drawing board to figure out how to scale it up and make it affordable and usable.
Gave an overview of the thought process they followed to do this, weighing costs and benefits across a number of facets: availability, bandwidth, cost, as well as various functionality elements such as interface expressiveness, LOD integration, Memento support, and cross time/data (e.g.- the nations vs. GDP example). Too many details to capture in notes, but the moral of the story is that each method has its pros and cons, but on balance, some are better than others. The three he reviewed don’t look too great, but he tipped his hand by noting that there’s a fourth way.
Miel started by talking about linked data fragments, which describes the state we are in: various access methods return fragments of the linked data (this is a rough summary). As he put it, a fragment is defined by:
- selector: what questions can I ask?
- controls: how do I get more fragments?
- metadata: helpful information for consumption?
“You can’t have it all,” he noted, one has to decide how to balance the tradeoffs inherent in different interfaces. Triple Pattern Fragments was their approach to dealing with this; the server does very little, putting the complexity out to the clients.
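To make that concrete, here is a rough sketch of what a Triple Pattern Fragments request looks like from the client side; the endpoint URL is an assumption (DBpedia has hosted TPF interfaces at fragments.dbpedia.org, but the exact path may differ).

```python
import requests

# A TPF server answers only single ?s ?p ?o pattern requests; the client is
# responsible for combining fragments into answers to richer queries.
ENDPOINT = "http://fragments.dbpedia.org/2015/en"  # assumed; check the current interface

params = {
    "subject": "http://dbpedia.org/resource/San_Antonio",
    "predicate": "http://dbpedia.org/ontology/populationTotal",
    # object left unbound: "give me every matching triple"
}
resp = requests.get(ENDPOINT, params=params, headers={"Accept": "text/turtle"})

# The response carries the matching triples (selector), links to further
# fragments (controls), and an estimated total count (metadata).
print(resp.text)
```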
He moved on to HDT (Header, Dictionary, Triples), a binary RDF representation. Didn’t catch much of this, to be honest, but took it to be a key piece of ‘stabilizing’ the environment.
Herbert noted that the addition of these new technologies solves the dilemma of cost vs. access that he had analyzed earlier: “Cheap for the publisher, but valuable for a client.” They now take the DBpedia dumps, convert them to binary HDT, and store them as such. Loading and converting a dump takes only four hours, and the result is ~5 billion triples.
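For a sense of what those HDT files give you, here is a small sketch that queries one locally. It assumes the community pyHDT bindings (the pip package is named hdt) and a local file path; neither detail comes from the talk.

```python
from hdt import HDTDocument  # pyHDT bindings, assumed installed (pip install hdt)

# Open a (hypothetical) local HDT dump. HDT files are compressed yet queryable
# in place, which is what keeps serving them cheap for the publisher.
doc = HDTDocument("dbpedia-2015.hdt")

# Look up a single triple pattern; empty strings mean "unbound".
triples, cardinality = doc.search_triples(
    "http://dbpedia.org/resource/San_Antonio", "", ""
)
print(cardinality)  # estimated number of matches
for s, p, o in triples:
    print(s, p, o)
```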
Closed by noting that as long as publishing linked data remains hard, we won’t be able to do these things very well. It needs to be easier, so he gave a quick tutorial on how to publish your own linked data in this fashion (i.e.- as HDT). Links are in the slide deck. The steps are:
- convert archival data set(s) to HDT
- set up a linked data fragment server
- configure the server
The code for all of this is on GitHub. He stressed that this is not hard to do.
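As a rough sketch of the first step (my own assumptions about tooling, not the exact recipe from the slides): the hdt-cpp tools include an rdf2hdt converter, which a small script could drive before pointing a Linked Data Fragments server at the resulting file.

```python
import subprocess

# Convert an N-Triples dump to HDT with the rdf2hdt tool from hdt-cpp.
# Assumes the tool is installed and on PATH; file names are placeholders.
subprocess.run(["rdf2hdt", "dump-2016.nt", "dump-2016.hdt"], check=True)

# The Linked Data Fragments server is then configured (via its JSON config)
# with an HDT datasource pointing at dump-2016.hdt and started separately.
print("HDT file ready: dump-2016.hdt")
```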
On the CUSP: Canadian Universities and Sustainable Publishing
Martha Whitehead, Queen’s U; Brian Owen, Simon Fraser U
Martha started out by noting that it is possible to be pessimistic about whether we are making any progress in the realm of scholarly communication, but then enumerated the advances Canadian institutions have made in the last two decades.
The call to action for CUSP speaks of repatriating scholarly publishing to the academy. There are major issues to address:
- subscription journal models and costs
- alternative journal model
- article processing charges
- sustainable “long-form” publishing
She noted that there is a lot of pushback with regard to APCs, even from the health sciences.
Brian reviewed the Canadian environment and made the case for a distinctive Canadian solution. It’s not about national pride for pride’s sake, but about recognizing and addressing local, regional, and national issues involved in academic publishing.
Both Martha and Brian spoke about the need to elevate these issues to higher levels of our universities. Universities, not just libraries, need to address and solve them. They are also looking beyond journals to other forms, such as monographs and textbooks.
Open SESMO: Innovations in Surfacing Paid Content to Today’s Learners
Jason Clark, Doralyn Rossmann – Montana State U
SESMO = Search Engine Social Media Optimization. Doralyn opened by noting the research that shows that the vast majority of students start their research anywhere but the library Website. Quoted Roger Schonfeld:
- “the library is not the starting point”
- “the campus is not the work location”
- “the proxy server is not the answer”
We tend to spend a lot of effort promoting our local collections; we want people to find our photos, documents, etc. Yet we lavish money on purchased collections and spend little effort on surfacing them. Our Website is structured as a teaching and browsing environment, but what about people who use search engines? Their goal is to “meet users in the search” and make that a teachable moment. They want to push not only locally developed collections outward, but all collections, even the paid stuff.
They tried an experiment: they took a collection–databases licensed by the university–and applied optimizations in stages so that they could measure the impact of each optimization using Google Analytics. The goals were to increase traffic to the resources and to demonstrate increased semantic understanding on the part of search engines.
How? They set up a revised Website architecture based on schema.org concepts, terminating in a sitemap. Details for this are in the slides. In a nutshell, their ‘about’ pages for databases have schema.org types and elements; in sum, they make a ‘product’ of the page, which is another schema.org concept.
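A minimal sketch of what that kind of ‘product’ markup could look like as schema.org JSON-LD; the property choices, names, and URLs below are illustrative assumptions, not MSU’s actual implementation (their details are in the slides).

```python
import json

# Hypothetical database record; values are placeholders, not MSU's data.
database = {
    "name": "Example Licensed Database",
    "description": "Multidisciplinary full-text database licensed by the library.",
    "url": "https://www.lib.example.edu/databases/example",
}

jsonld = {
    "@context": "http://schema.org",
    "@type": "Product",  # the database 'about' page described as a product
    "name": database["name"],
    "description": database["description"],
    "url": database["url"],
    "brand": {"@type": "Organization", "name": "Example University Library"},
}

# Embed in the page head inside <script type="application/ld+json"> ... </script>
print(json.dumps(jsonld, indent=2))
```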
Showed their social media optimization, which is built on putting Twitter Card data into Web page headers. When this data is present and the link is tweeted, Twitter preloads the card into the tweet. Showed the tagging schema, and it is neither exotic nor difficult to understand. Brilliant. I haven’t seen any library do anything with this before. The cards also help you track stats for any page that has card metadata.
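A minimal sketch of the kind of Twitter Card metadata involved, generated here with a few lines of Python; the handle, title, and image URL are placeholders, and the exact card type MSU used may differ.

```python
# Twitter Card metadata lives in <meta> tags in the page <head>; Twitter reads
# it when a URL is tweeted. All values below are placeholders for illustration.
card = {
    "twitter:card": "summary",
    "twitter:site": "@examplelibrary",
    "twitter:title": "Example Licensed Database",
    "twitter:description": "Multidisciplinary full-text database licensed by the library.",
    "twitter:image": "https://www.lib.example.edu/images/databases.png",
}

meta_tags = "\n".join(
    f'<meta name="{name}" content="{value}">' for name, value in card.items()
)
print(meta_tags)  # paste into the page header template
```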