Open Repositories 2016

June 23, 2016

tags: data management, libraries, open source, repositories, software

Dublin Castle doorway

After hearing great things about it for years and trying to fit it into the work calendar, I made it to Open Repositories, held this year in Dublin, Ireland. Part of the reason it was important this year is that I’m trying to get the word out about the Ontario Library Research Cloud, both to share our model and to seek input and feedback on some of our choices and on the challenges we still face (slides). The sheer number of updates, tools, projects, acronyms, and initiatives represented at OR is dizzying, but below I’ve recorded some notes on what I saw, along with some ideas and sites I plan to explore further.

Tuesday, June 14

Opening keynote: Knowledge Inequalities: A Marginal View of the Digital Landscape
Laura Czerniewicz, U of Cape Town [slides]

Showed maps that highlight how much the influence of science is located in the global north. Really stark depictions. She asked what causes this, but noted that perhaps “shapes” is a better word. For one: funding. Research funding as a percentage of GDP is much higher in the north, e.g.- 2.76% in the U.S. vs. 0.73% in South Africa. China is at 1.96%, and increasing.

It’s also a question of dissemination. Tools such as Web of Science only capture one type of knowledge exchange, missing talks and other forums. There are also multiple types of research; for example, in Africa it is often consultancy research which is not considered valuable by others. Reward systems also play a role. In South Africa, the national department of education pays a premium for each article published in specific places. This alters scholars’ behaviour in negative ways.

There is also a valorization of citations. These are “closed” networks, favouring the north. Altmetrics are making slow progress. Accessibility is another issue, i.e.- much research is difficult to access because libraries are not well funded, and are in fact losing funding.

She asked what we mean when we say “international journal.” 30% of the articles in these journals are from the US. 20% are from developing countries, but only 1% comes from sub-Saharan Africa. If one breaks it down by empirical studies, the US/European dominance becomes even starker. Even in African studies journals (published in the north), the percentage of articles by African-based authors has declined, due to declining acceptance rates.

Why does this matter? Local knowledge needs to be available to others, and is often missing from global science. Plurality is good for science.

Digital does not mean open. Licensing and DRM mean that much content is closed. New layers of complexity are added: social relations, audiences, forms.

Pointed out that much of this stems from basic infrastructure, e.g.- reliable access to electricity, computers, network bandwidth. She did note that in recent years, mobile internet users have surpassed desktop internet users, and that smartphones have expanded what people can do with technology. We have the devices, but now it is about the cost of data.

Spoke about the role of search engines in making information findable, noting that they are not neutral. Their algorithms are neither obvious nor known. She was particularly pointed about personalization, specifically profile personalization, where search results are matched to one’s own search history and the histories of similar individuals. “The world gets smaller, rather than wider.”

Did a fairly straightforward experiment, asking searchers from around the world to repeat two queries in Google and Google Scholar (‘poverty alleviation’ and ‘poverty alleviation south africa’). Not surprisingly, South African research–which is extensive on this topic–is nearly invisible, even to those in South Africa. In all locales, the top result was Wikipedia, and the only common result between Google and Google Scholar was an Elsevier article that has a copy in an IR.

Repository Roundup

This roundup was intended as an introduction for people new to OR and the field. We work with some of these tools, so my interest was in seeing how the speakers would attempt to differentiate what they do.

DSpace

2,000+ global sites, 120 countries, largest repo platform on the planet. SEO is a priority; they work with Google on this. Usual acronym soup applies: OAI, REST, RDF, DOI, et al.

DSpace 6 coming this year, DSpace 7 in 2017 (new user interface/UX).

EPrints

Development continues. Includes ability to handle research data. Features available via a third party app store.

Fedora

No paid Fedora developers at Duraspace; 100% community coded. 300+ public sites, but this is likely underreported, as is typical with open source software. 24 active developers, 10 committers. 2 full-time staff.

Fedora stores assets with metadata, and maintains a full version history.

Core features/standards:

create/read/update/delete
versioning – Memento
authorization – WebAC
batch operations
fixity

Fedora is middleware, he emphasized.

Hydra

Fedora is flexible, but it needed a frontend application on top of it, which is what Hydra is meant to be: one body, many heads. These heads can handle various object types we need to manage, from books to research data and so forth.

Hydra is more of a framework than an application. Not turnkey, as he put it. Deployed at a wide array of institutions for various purposes. Most people roll their own Hydra heads. Now they are working toward solution bundles, e.g.- Avalon (Indiana and Northwestern) and Sufia (Penn State). Now moving toward Hydra in a Box, the idea being something easy to install and maintain.

70-80 Hydra adopters and partners, mostly North American and Western European. Various communication channels (lists) and a robust list of interest and working groups.

Three big trends. First is the move to linked data, which they are realizing via PCDM (the Portland Common Data Model). Second is architecting layers and gems for code reuse. Third is the emergence of Hydra in a Box, which is grant funded (IMLS) to take the framework and make it an installable application that doesn’t require development resources.

Islandora

Showed the cheeseburger depiction, where Islandora is the patty between the buns of Fedora and Drupal. The condiments and toppings are optional tools and features. The Drupal frontend makes it easier to drop in to smaller sites. There are at least 130 installations worldwide. 20 official committers and 104 GitHub members.

Stressed the community aspects of Islandora; many roles, many of which have nothing to do with software development.

Invenio

Invenio is “a digital library framework.” Largely a European community, open source via GitHub, as with pretty much everything in this session.

Got a good laugh with their development roadmap, which looked like a tangled mess.

Researching Researchers: Avalon Media System’s Ethnographic Study of Media Repository Usage
Carolyn Caizzi, Deborah Cane – Northwestern U

They got a Mellon grant to do research into how humanities researchers use audio and video collections; previous research tended to explore instructional issues. We know we need to develop software for users, not just direct it at them, so they undertook this ethnographic study of their users.

They are interested in positioning Avalon to address research use cases, which is of course a way of saying that they want to steer development in directions that support how the tool is or could be used.

Noted that they are doing 10 users at each institution–Indiana and Northwestern–and that one can do about two sessions per day, since it requires intense focus on details of how they work. Takes about 90 minutes per researcher, but is exhausting. Are also doing a diary study to get longitudinal results. Researchers are asked to record (after the observational session and interview) their actions that are relevant to the service. This allows them to record their actions and responses outside of a controlled setting. Diary studies are good for “how” questions: how do you do this, how does this factor into your habits, etc. They happen to be using their LMS for this part of the study. They are hoping for six to twelve entries per user.

Results? For scholars at all levels: more emphasis on content than on interface. “They will use whatever player/tool is available to access the film/music/footage they need for their product.” This includes stealing media, i.e.- illegal copies. Second insight: they use tons of tools. May be ripping a movie (HandBrake) while viewing another and writing at the same time. She showed a large array of tools they use. Showed a horrifically cluttered Mac desktop that got a good laugh; the point is that people are terrible file managers. Her research also highlighted a lack of technology training for their research needs. Scholars use the wrong tool or can’t use any tool, and so put their own research materials at risk, i.e.- video they have captured themselves. Another insight: everyone wants a clean screenshot. Important for publication and close analysis. There is also a lack of structural metadata in global or local searches, i.e.- no good way to identify subsections or moments within a longer film. They often rely on scribbled notes to do this.

Wednesday, June 15

Value Added Services to Garner Repository Adoption
Jack Reed, Chris Beer, Jessie Keck – Stanford U

Didn’t capture the details, but their main point was that we need to build open services around our open repositories to drive usage and adoption. They showed this using a Stanford-specific toolkit, e.g.- Blacklight-based tools, but the general idea is clear and one could simply use different tools.

Integration Challenges & Rewards: Heterogeneous Solutions with Fedora4 at the Core
Robin Ruggaber, Ellen C Ramsey – U of Virginia

Point of their talk, in a sense, was to explain why a Hydra institution chose non-Hydra solutions for some of their work. Example: needed data management and chose Dataverse. One downside was its lack of native integration with Fedora.

The Headless Repository
Peter Sefton, Michael Lynch – U of Technology Sydney

In an aside, mentioned that PCDM is a tool that could help with archiving, say, an Omeka site, i.e.- by facilitating interoperability it allows one to have an exit strategy for tools such as Omeka which are unlikely to persist over decades. That said, he also noted that PCDM needs to “settle some implementation arguments.” Asked the scary question: “is PCDM the new OAI-ORE?” and the like. Made the joke that OAI-ORE was last seen disappearing up its own ontology.

Noted that tools such as DSpace combine what he called library workflow states with repository archiving and preservation type states, and questioned whether that is a good thing. His short answer was no, and that we should separate these, doing the latter with Fedora, which as he put it, does nothing in terms of workflows. This reflects some discussions we have had around how to enact preservation for our DSpace instance at McMaster. In general, he was speaking against monolithic solutions, and in favour of using a set of services that are loosely coupled if at all.

24/7 Talks

Since these were lightning talks, I didn’t take full notes and just jotted down tidbits and things to check out or explore further.

JISC – Research Data Discovery Service (currently in alpha, but nifty) – idea is to surface datasets in scattered repos.

At the Czech Technical University, their whole routine for doing authority control in their repository is based at its root on their campus identity management system. We have no such authoritative source at McMaster, so often tap around in the dark when it comes to authorizing use of our systems or making intelligent (and useful) inferences about individuals.

SHERPA REF leverages the SHERPA data to allow a researcher to determine quickly and easily if a given journal will enable them to meet the REF open access standards that are coming into force in 2020(?).

The Visual Arts Data Service, VADS, is a UK-based collection of quality and rights-cleared images from over 300 UK libraries and institutions, 140K or so images, many unique. Did some searching and while I could surface some interesting images, they were incredibly low resolution and I didn’t see a way to get at a higher resolution image. If the only purpose here is to lead a user to an image they have to buy, I’m less interested.

Publons – open database of peer review, i.e.- who has done peer reviews for which journals. McMaster faculty are using this, slightly. Is it worth making it part of the recommendations we make? I might have to do more reading about its origins and ties before going all in. To help investigate, I did sign up for an account.

RightsStatements.org: Developing Internationally Interoperable Standardized Rights Statements for Cultural Heritage
Emily Gore, DPLA; Dave Hansen, U of North Carolina; Karen Estlund, Penn State

The general goal here is to create machine readable rights statements. Currently, as most of us know, rights statements are messy throw-ins in the metadata we attach to digital objects. Hansen gave a good overview of copyright law and the categories that emerged as a result of categorization of copyright applications.
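As a rough illustration of what “machine readable” buys you here: when a record carries a rights statement URI instead of free text, software can act on it directly. The sketch below is mine, not something shown in the session. The record layout and the `rights_code` helper are hypothetical; the three URIs follow the pattern RightsStatements.org publishes, and only three of its statements are included.

```python
from urllib.parse import urlparse

# Three of the statements published by RightsStatements.org (not the full set).
LABELS = {
    "InC": "In Copyright",
    "NoC-US": "No Copyright - United States",
    "CNE": "Copyright Not Evaluated",
}


def rights_code(value):
    """Return the statement code from a rightsstatements.org URI, or None.

    Hypothetical helper: anything that is not a URI of the form
    http://rightsstatements.org/vocab/<code>/<version>/ yields None.
    """
    parsed = urlparse(value)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "rightsstatements.org":
        return None
    if len(parts) < 2 or parts[0] != "vocab":
        return None
    return parts[1]


# Made-up records: one with a URI, one with a typical free-text throw-in.
records = [
    {"title": "Harbour photograph",
     "rights": "http://rightsstatements.org/vocab/InC/1.0/"},
    {"title": "Parish map",
     "rights": "Copyright status unknown, contact the archive"},
]

for record in records:
    code = rights_code(record["rights"])
    label = LABELS.get(code, "free text, needs human review")
    print(record["title"], "->", label)
```

The second record is the “messy throw-in” case: a person can read it, but a harvester or aggregator cannot do anything with it.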

There was a lot of information in this session, and since it was new to me I tried to focus on the speakers rather than taking extensive notes. I heard enough to get me very interested in this.

Thursday, June 16 aka Bloomsday

Accuracy, Comparability and Standardisation of Usage Statistics in Open Access Institutional Repositories

Daniel Beucke, SUB Göttingen

Why does one need usage data? For one, repository content isn’t included in existing citation models, e.g.- Web of Science. It’s a value added service for repos.

What do we need? Accuracy, comparability, and standards. Various entities are involved, and we need to observe COUNTER rules and standards. Mentioned other projects: PIRUS, OAS (Open Access Statistik), Knowledge Exchange, et al. COAR includes the Open Metrics Interest Group, which is working on these issues. The idea is knowledge exchange between various international usage data initiatives, of which there are many.

Joseph Greene, U College Dublin

As a repo manager, he needs statistics for advocacy and to drive deposit rates. As he put it, nothing works as well as being able to show downloads. Since we rely so much on them, he has concerns about their accuracy. We need to ground these numbers in reality and defend our model.

Did a study based on a two-year data sample from UCD’s repo. He manually checked a sample of it to determine if usage was human or robot, then compared his results to the results from their robot detection technique, based on U Minho’s DSpace Stats add-on. Results coming out in a paper next month.

Preliminary results: 85% of unfiltered downloads are from robots. Others have identified similar (or higher) numbers. Showed an exponential model demonstrating that by catching just a few more robots, we get much better statistics. The converse is, of course, true. Their detection catches 94% of all robots, and 98.9% of the downloads it flagged were actually robots (not bad, as he put it). So how accurate are their numbers? Only 73%, i.e.- only 73% of the download statistics represent human traffic.
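The way these figures fit together can be checked with some quick arithmetic. The sketch below is my own reconstruction, not the speaker’s model: it assumes the 94% is recall (the share of all robot downloads that get flagged) and the 98.9% is precision (the share of flagged downloads that really are robots). The function name is made up for illustration.

```python
def human_share_after_filtering(robot_fraction, recall, precision):
    """Share of the downloads left after filtering that come from humans.

    robot_fraction: fraction of unfiltered downloads made by robots
    recall:         fraction of all robot downloads the filter flags (assumed)
    precision:      fraction of flagged downloads that are robots (assumed)
    """
    humans = 1.0 - robot_fraction
    robots_caught = robot_fraction * recall
    flagged = robots_caught / precision       # everything the filter removes
    humans_removed = flagged - robots_caught  # false positives
    robots_left = robot_fraction - robots_caught
    humans_left = humans - humans_removed
    return humans_left / (humans_left + robots_left)


# Figures quoted in the talk: 85% robots, 94% caught, 98.9% correctly flagged.
share = human_share_after_filtering(0.85, 0.94, 0.989)
print(f"human share after filtering: {share:.1%}")

# Small changes in the catch rate move the result a lot.
for recall in (0.90, 0.94, 0.98):
    result = human_share_after_filtering(0.85, recall, 0.989)
    print(f"recall {recall:.0%}: {result:.1%} human")
```

Under those assumptions the result lands between 73% and 74%, in line with the 73% he reported. The loop shows why a few more caught robots matter so much: with robots at 85% of raw traffic, the handful that slip through still make up a large share of what remains.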

He then applied DSpace’s and EPrints’ robot detection tools to his data. He stressed that this was an experiment. These tools do not use nearly as many filters as the Minho DSpace Statistics add-on. The results: DSpace out of the box is 62% human, EPrints 55%, Minho with no outlier checking 59%, and adding manual checking to Minho raises it to 73%. With no filtration at all it would be 14%.

Stefan Amshey, bepress

Laid out their model for detecting robots, following fairly standard practice. Noted many of the shortcomings of looking at user agent or IP address, and so forth. Walked through their methodology. Pointed out that robots evolve, so our detection techniques must also evolve.

John Salter, White Rose U Consortium

Spoke on behalf of IRUS-UK and COUNTER. By this point, we had heard many of the points from the other speakers, so he brushed past some of his slides and points.

What is interesting about their model, and something to consider, is that they aggregate IR data from 107 UK repos, and then apply the filters and rules to this joint dataset. This means that download numbers from any of those repos can be accurately compared to others, or at least that this comparison is more likely to be accurate (some IRs may face bot traffic that others do not, of course).
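A minimal sketch of that idea: pool the raw download events from every repo, then run one shared filter over the pool so each repo’s count comes out of the same rules. Everything here is assumed for illustration. The event format, the marker list, and the function names are hypothetical, and the real COUNTER rules are far more extensive than a user agent substring check.

```python
from collections import Counter

# Illustrative markers only; this is not the COUNTER robot list.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_robot(user_agent):
    """Crude check: does the user agent string contain a known marker?"""
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def human_downloads(events):
    """Count downloads per repo after applying the one shared filter.

    events: iterable of (repo_name, user_agent) pairs pooled from all repos.
    """
    counts = Counter()
    for repo, user_agent in events:
        if not is_robot(user_agent):
            counts[repo] += 1
    return counts


# Made-up events from two repos, pooled into one list.
events = [
    ("repo-a", "Mozilla/5.0"),
    ("repo-a", "Googlebot/2.1"),
    ("repo-b", "Mozilla/5.0"),
    ("repo-b", "Mozilla/5.0"),
]

for repo, count in sorted(human_downloads(events).items()):
    print(repo, count)
```

Because both repos pass through the same `is_robot` check, any error in the filter affects them in the same way, which is what makes the resulting numbers comparable even if they are not perfectly accurate.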
