PASIG Prague 2016
This is apparently the first time that PASIG has been held in Europe; it’s also the first time I’ve attended. My goal was to come speak about the Ontario Library Research Cloud, both to get the word out about the model we followed as well as to expose that model to a bit of peer critique and commentary. In that regard, it’s been a successful journey and I hope that it just started a conversation that will continue on.
It was a very well organized and catered event. The reception even included a chartered vintage tram that took us to Obecní dům, a somewhat touristy yet glorious Jugendstil monument in Prague. Given the modest registration fee, the quality has been remarkable. Many thanks to those who planned and carried out this event.
Digital Preservation Bootcamp
OAIS Model, Theory & Practice
Neil Jefferies, Oxford U
Gave an overview of OAIS. Noted, after the overview, that one of the problems is what he defines as a SIP (submission information package) problem, i.e.- we often don’t have good (or any) metadata for certain collections. That’s not a good reason not to ingest them into an archival system. Also, DIP (dissemination information package) problems exist. Much of the audience for our data hasn’t been born yet, for example, so it’s hard to make an economic case for preserving data. DIPs also become richer in information over time, so the AIP (archival information package) must be updated to reflect this enrichment. Last, AIPs also have problems. They must be preserved, so as he put it “don’t mangle it to fit a standard!” Archived objects are also not static, so the AIP may require updating (incremental curation).
Much of the meaning or significance of a file depends on its context and provenance, not necessarily the intrinsic information in the object. File names are easily changed, as are formats. As he noted, a file “is a meaningless stream of bytes.” The metadata can be more meaningful than the data itself, given the primacy of context and provenance.
Pointed out that, contrary to the experience of corporations, our institutions (i.e.- universities) are remarkably durable. They nearly never fail, so we will clearly be migrating systems and entire ways of doing things over time. Long-term preservation is a fact of life, not tied to the rise and fall of a corporate entity.
Made some recommendations for setting up preservation systems. One of his first points was not to “bake decisions into systems at a low level.” The more generic/abstract the lower levels are, the more flexible they can be in terms of tolerating multiple formats and techniques for ingest, etc. Best is to use standards and to work modularly and in layers. Break down processes into “simple, small, atomic tasks” so that parts can be deferred, done in parallel, etc. Keep originals and metadata, if possible. Use versioning, leave an audit trail.
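A minimal Python sketch of the "small, atomic tasks plus audit trail" advice, not anything shown in the talk. The step functions, the `run_pipeline` helper, and the audit record fields are all hypothetical choices made for illustration; the point is only that each step is tiny, the original is never overwritten, and every step leaves a record.

```python
import hashlib
from datetime import datetime, timezone


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the payload."""
    return hashlib.sha256(data).hexdigest()


def normalize_newlines(data: bytes) -> bytes:
    """Atomic step: convert CRLF line endings to LF."""
    return data.replace(b"\r\n", b"\n")


def strip_trailing_whitespace(data: bytes) -> bytes:
    """Atomic step: remove trailing whitespace from every line."""
    return b"\n".join(line.rstrip() for line in data.split(b"\n"))


def run_pipeline(original: bytes, steps):
    """Apply each step in order, keeping every version and an audit record."""
    versions = [original]  # index 0 is always the untouched original
    audit_log = []
    for name, func in steps:
        result = func(versions[-1])
        audit_log.append({
            "step": name,
            "when": datetime.now(timezone.utc).isoformat(),
            "input_sha256": checksum(versions[-1]),
            "output_sha256": checksum(result),
        })
        versions.append(result)
    return versions, audit_log


STEPS = [
    ("normalize_newlines", normalize_newlines),
    ("strip_trailing_whitespace", strip_trailing_whitespace),
]

if __name__ == "__main__":
    versions, audit_log = run_pipeline(b"title: notes  \r\nbody: text\r\n", STEPS)
    for record in audit_log:
        print(record["step"], record["input_sha256"][:8], "->", record["output_sha256"][:8])
```

Because each step only takes bytes and returns bytes, steps can be reordered, deferred, or run on different machines without touching the others, which seems to be the flexibility he was arguing for.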
Trustworthy Digital Preservation Systems
David Minor, U of California San Diego
Trust has to be defined and it’s an iterative process. Funding comes and goes, partners change, etc. Questions to ask include institutional commitment, infrastructure demands, technical systems, sustainability, and so forth. If everything else fails, what then? Are there “fail-safe” partners?
Gave an overview of European and (North) American frameworks for certifying repositories. In Europe, it’s known as European Framework for Audit and Certification of Digital Repositories, and has three levels: basic, extended, and formal. The first is self-assessment, while the last, of course, involves investigation of claims. Outcomes are Data Seal of Approval (basic), nestor (extended), TRAC/ISO 16363 (formal), and DRAMBORA (mix).
TRAC is divided into three sections: organizational infrastructure, digital object management, and technologies / technical infrastructure / security. In North America, the Center for Research Libraries has been the certifying body and has certified four U.S. and two Canadian (Canadiana.org and Scholars Portal) entities. Now the push is to make it a standard, hence ISO 16363, which came out in 2012 (TRAC dates to 2007). The two are similar, but the ISO version is a formal standard. To date, no one has gone through the ISO 16363 process; at present there are no certifying bodies for ISO 16363. There is now a separate standard (from 2014) for becoming an auditing body (ISO 16919), so that’s just getting rolling (training for certifiers). DRAMBORA is UK-based and follows similar patterns.
Mentioned the Digital POWRR Tool Evaluation Grid maintained by NIU. POWRR stands for Preserving (Digital) Objects With Restricted Resources. It breaks down tools by the needs they address and provides basic information and evaluations.
Ended by asking if we are moving toward a single tool or framework. He noted that right now one has to ask which certification one needs to pursue since multiples exist.
Applying DP Standards for Assessment and Planning
Bertram Lyons, AVPreserve
Didn’t take a lot of notes for his talk, but one key point he emphasized is that the non-technical aspects of preservation–planning, documentation, etc.–are often the areas that require more focus and attention. Technical aspects such as hardware and software decisions and management are areas where we tend to focus our efforts.
Another key point is to break the work into smaller chunks and not to attempt to do everything at once, but rather in a prioritized and planned order.
Long-term Digital Preservation Hardware & Systems
Tiered Storage Architectures
Donna Harland, Oracle
How can you lose data? Cannot find it, cannot read it, cannot validate authenticity, or cannot interpret it. This is not a new problem, she noted. Books have burned, tapes have been lost (e.g. – Apollo 11 moon walk original data feed), etc. We have recreated Alexandria, to use her words: obsolete applications, inaccessible documents, unreadable devices, family photos, etc.
PKX / Practitioners Knowledge Exchange: Case Studies in Preservation & Archiving Architectures and Operations
Qatar National Library
Krishna Chowdhury, QNL
She described the institution and the national digitization program. At one point, she showed a slide detailing their technical stack. Clearly they had a bit of money when putting this together. We could never afford anything so luxurious in our context. Worked closely with Oracle on this, so it’s made up of a lot of ‘Sun’ hardware as well as their storage solutions. One number that caught my eye was that they have 28 physical servers in their second phase deploy. That’s a serious deploy, and I wonder what their user population looks like, i.e.- whom does this stack serve/whom will it serve.
Then again, what she had was the opportunity to build this all from scratch. That’s a chance to do things right, so to speak, rather than to inherit a broad set of perhaps less than optimal decisions. In that light, it might make sense to build it more robustly than we ever could.
The University of Oklahoma’s Galileo’s World: Creating New Demands for Digital Archiving & Preservation
Carl Grant, U of Oklahoma
Literally about Galileo. Oklahoma is one of two libraries–the Vatican is the other–to hold first editions of all 12 of Galileo’s works, four of which have marginalia in his hand. Wanted to create an exhibition around these works, not least to celebrate OU’s 125th anniversary. Another goal was to create an exhibition that would live on into the future, as well as to make it appealing to a wide audience, both scholarly and public.
Evolving the LOCKSS Technology
David S.H. Rosenthal, Stanford U
Started with a brief history of LOCKSS, telling the anecdotes related to how he and Vicky came up with the original idea as an analogue to print preservation, of sorts. Much has been written about LOCKSS, so I won’t repeat the details here.
Also gave a tour of the impact of LOCKSS, i.e.- who is using it and how. It has gone well beyond preserving journals. In various contexts, it is being used for dissertations, research data, government information, and so on.
The Ontario Library Research Cloud: Future Considerations and Cost Models
Dale Askey, McMaster University
I gave this talk, so no notes!
Status of Long Term Preservation Service in Finland 2016
Mikko Tiainen, CSC – IT Centre for Science
Noted while speaking about the technical specifications that there is no proprietary software in their entire stack; it’s all either open source or CSC-created. It’s also highly modularized, so there’s no monolithic piece to break down or fail. Individual components can be replaced as needed. Does not use object storage, but rather POSIX. I’ll need to consult some colleagues when I return who can explain the implications of this.
In addition to their open source tools, they’ve got about 15,000 lines of in-house Python code in their production systems, using Python Luigi as their workflow engine. The terminal storage for their data is tape, if I saw correctly: three copies, two tied to the production systems in IBM and Oracle tape vaults, with the third being a dark archive copy.
Project Updates and Digital Preservation Community Developments
EUDAT: a Data Infrastructure for European Research
Rob Baxter, U of Edinburgh
Have built a suite of research data services, all under the name B2xxxxx. They found it challenging to work with a large number of communities, but found it worth doing so rather than just building something and hoping people would come use it. Tying together all of the disparate services (from across Europe) is also a challenge, not least due to varying metadata standards and practices.
Trying to model better research data management. Have now added more B2 services, e.g. B2HANDLE and B2ACCESS, the former creating persistent identifiers and the latter about identity and authorization.
Showed an interesting data pyramid. The base is transient data, which has individual value; the middle is registered data, which has community value; and at the top of the pyramid is citable data, which has societal value. The pyramid model makes sense, as it implies that of the great mass of project-immanent data, only some of it emerges to go to the next stage, and so forth. I often get the sense that researchers hear “data sharing” and assume that society is asking them to share everything. Not the case.
The DPC Community: Growth, Progress, and Future Challenges
Paul Wheatley, Digital Preservation Coalition
Gave an overview of the DPC, and also discussed the reaction of the digital preservation community to Vint Cerf’s comments last year about no one doing digital preservation and advocacy for print copies.
Also noted the potential impact of trade agreements on copyright and our ability to do digital preservation, noting that the community needs to continue to push back against these.
Tossed in a nice dig at those who mock emulation as a mechanism for preserving software. It’s possible to do, and we should be taking on this challenge.
Preservation of Electronic Records
Luis Faria, Keep Solutions
Keep Solutions is a spinoff of the University of Minho. Spoke about E-ARK, which is a pan-European effort to create an international methodology for archiving digital records and databases. Talked about the preservation of relational databases, something he noted has not been much discussed as yet at this event. There is a preservation standard known as SIARD 2.0, which they hope will emerge as, in his words, the one standard to rule them all with regard to DB preservation.
To go with the standard, there’s a tool to support it, currently in beta and only available as a command line tool: the Database Preservation Toolkit.
There is also a push to create pan-European formats for SIP, AIP, and DIP. As he put it, yes, another standard, but necessary. Currently in draft form and open for comment. Tools go with this, too, e.g.- RODA 2.0, currently in alpha. Enables creation of many SIPs easily. Supports both E-ARK and BagIt.
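For readers who have not seen a packaging format up close, here is a minimal Python sketch of a BagIt-style package: a `data/` payload directory, a `bagit.txt` declaration, and a `manifest-sha256.txt` listing a checksum per payload file, following the layout in the BagIt specification (RFC 8493). It has nothing to do with how RODA builds SIPs; the `make_bag` and `validate_bag` helpers are hypothetical names, and a real bag would normally carry more tag files (e.g. `bag-info.txt`).

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write payload files under data/ plus the declaration and manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def validate_bag(bag_dir: Path) -> bool:
    """Return True only if every file in the manifest exists and matches."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            return False
        if hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"
        make_bag(bag, {"letter.txt": b"Dear archive,\n", "scan.bin": b"\x00\x01"})
        print("valid after creation:", validate_bag(bag))
        (bag / "data" / "letter.txt").write_bytes(b"edited later\n")
        print("valid after tampering:", validate_bag(bag))
```

The sketch only checks the payload against the manifest; it does not look for stray files in `data/` that the manifest omits, which a full validator would also flag.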
Evolving the LOCKSS Technology
David Rosenthal, Stanford U
Currently, LOCKSS remains financially viable using a “Red Hat” model to sustain itself. A grant a few years back did allow them to increase the amount of programming activity.
Changes in Web architecture, however, necessitate changes in LOCKSS, not least the presence of forms that must be filled to access documents (he suggests that this is done deliberately by some to thwart harvesting) as well as the emergence of AJAX content from some publishers. Other integration desires include Memento as well as Shibboleth for access control by identity rather than by location.
They have also added new polling types to ensure that nodes have identical copies: local, symmetric, and proof of possession. Previously, polls looked for disagreement and enabled repairs by agreement; they were remote and asymmetric. Using new metrics, they have shown that the delay between new agreements has dropped from 30 to 10 days.
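To give a feel for what a poll does, here is a heavily simplified Python sketch. It is not the LOCKSS protocol: the `vote` and `run_poll` functions, the majority rule, and the in-memory "peers" are all assumptions for illustration. The one idea it borrows is that each vote hashes a fresh nonce together with the content, so a peer has to actually hold the bytes at poll time rather than replay an old hash.

```python
import hashlib
import os
from collections import Counter


def vote(content: bytes, nonce: bytes) -> str:
    """Hash the nonce together with the content; a stored hash cannot be reused."""
    return hashlib.sha256(nonce + content).hexdigest()


def run_poll(own_copy: bytes, peer_copies: dict):
    """Compare our copy with the peers' copies.

    Returns (status, peers), where status is:
      "ok"           - a majority of peers agree with our copy
      "repair"       - a majority of peers agree with each other, not with us
      "inconclusive" - no majority either way
    and peers lists the peers in the winning majority.
    """
    if not peer_copies:
        return "inconclusive", []
    nonce = os.urandom(16)  # fresh challenge for this poll only
    own_vote = vote(own_copy, nonce)
    votes = {peer: vote(copy, nonce) for peer, copy in peer_copies.items()}

    agreeing = sorted(p for p, v in votes.items() if v == own_vote)
    if len(agreeing) * 2 > len(votes):
        return "ok", agreeing

    winning_vote, count = Counter(votes.values()).most_common(1)[0]
    if count * 2 > len(votes):
        return "repair", sorted(p for p, v in votes.items() if v == winning_vote)
    return "inconclusive", []


if __name__ == "__main__":
    good = b"article text"
    print(run_poll(good, {"peer1": good, "peer2": good, "peer3": b"bit rot"}))
    print(run_poll(b"bit rot", {"peer1": good, "peer2": good, "peer3": good}))
```

In the "repair" case a node would fetch the content from one of the returned peers; that step, along with everything about peer selection and defence against malicious peers, is left out here.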
They are now replacing components of the existing LOCKSS software with open source components, e.g.- Hadoop and OpenWayback. The idea is to make it more scalable, reduce costs, and to contribute to open source technology. Deploying this is more difficult than installing a monolithic package, so they are opting to Dockerize the components to make it easier.
One Site Among Many: Stanford and Collaborative Technical Development for Web Archiving
Nicholas Taylor, Stanford U
Beyond link rot, there is the issue of content drift, i.e.- content is not stable over time. The link may be fine, but the content has changed. Showed a graph that made visible how serious this issue is.
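The distinction between link rot and content drift can be sketched in a few lines of Python. This is only an illustration, not the method behind the graph he showed: `classify_link` is a hypothetical helper, the texts are passed in directly rather than fetched, and the similarity score from `difflib` is just one rough way to quantify how far a page has drifted.

```python
import hashlib
from difflib import SequenceMatcher


def classify_link(archived_text, current_text):
    """Compare an archived snapshot with what the link returns today.

    current_text is None when the link no longer resolves.
    Returns (status, similarity) with similarity between 0.0 and 1.0.
    """
    if current_text is None:
        return "link rot", 0.0
    archived_hash = hashlib.sha256(archived_text.encode("utf-8")).hexdigest()
    current_hash = hashlib.sha256(current_text.encode("utf-8")).hexdigest()
    if archived_hash == current_hash:
        return "stable", 1.0
    # The link still works, but the content behind it has changed.
    similarity = SequenceMatcher(None, archived_text, current_text).ratio()
    return "content drift", similarity


if __name__ == "__main__":
    snapshot = "Conference programme 2016: preservation bootcamp"
    print(classify_link(snapshot, None))
    print(classify_link(snapshot, snapshot))
    print(classify_link(snapshot, "Conference programme 2017: new venue"))
```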
Most organizations are relying on external providers for Web archiving. There is at least an upward trend in the number of institutions that are doing Web archiving, which is good. Archive-It has contributed to this. As he put it, it’s never been easier to Web archive, but the scale of content has made it hard for organizations to do it themselves. Most Web archives are stored internally, i.e.- within the organization.
The Web itself has changed. Less static content, more JS and other dynamic content.
What has Stanford done? They have a Web Archive Portal (SWAP – based on OpenWayback). Pushing those into their general catalogue (SearchWorks, based on Blacklight). They also collect specific sites related to Stanford interests and projects. Their architecture is hybrid. They use Archive-It for capture and push it into their own digital repository.
Wants to push the Web archiving toolkit toward using more APIs and make the stack more actively developed (think I got this right). Better than large systems is to build modular components that can interoperate, not least since none of the large systems cover all facets of the work required. First API they are tackling with their grant is an export API to allow a Web archive to push data out to other repositories. The larger goal is to build a community that roadmaps the creation of further APIs and tools to exploit/deploy them.
Neil Jefferies, Oxford U
Gave an introduction to Fedora, noting that it is not so much a software application as a system. As a Fedora institution, most of this is familiar to me. Referred to Fedora as middleware (it sits on your storage) and referred to Islandora as a presentation layer. I often tend to think of Fedora as the foundation and Islandora as the middleware, so this is some food for thought.
Core features: linked data platform, Memento versioning, fixity, etc. It is also extensible.
There are pluggable components that use a standardized API, as well as external components. The details of the latter go beyond my technical ken. Scalability tests have shown that one can upload a 1TB file via the REST API.
These are some things I heard about that I’ll hope to hear/learn more about in the future:
PREFORMA project – EU-funded project to establish standard preservation formats and verification tools for text, graphical, and video objects. Betas available via GitHub.
DMPS – Data Management Planning Systems – Automation of DMP tools, many of which are in development. Currently, they are text documents, but not executable, i.e.- they cannot trigger action in rule-based data management systems. In short, machine readable so that rules and operations can be read from them to ensure/actualize compliance with funding agency requirements. Also referred to it as ADMP (A = Active).
iRODS – Came up in the above talk on DMPS, but have heard about it before, of course. Have never quite wrapped my brain around it or understood if it has potential application in our environment and/or is something we could handle.
AXF – Archive eXchange Format – Oracle uses this, but it’s an open standard.
RODA – My question is whether this is a parallel development to Archivematica, or something else entirely. Would it work in our environment, i.e.- with our standards and related tools, or is it “Europeanized”?
Archive-It – Came up numerous times, most clearly in Nicholas Taylor’s talk. We are not doing much at all with regard to Web archiving. Should perhaps get started.
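As a way of picturing the "active DMP" idea from the DMPS item above, here is a minimal Python sketch in which a plan is data that a program can read and turn into actions. The rule names, the `Dataset` fields, and the action strings are all hypothetical; none of this reflects a real DMP schema or how iRODS rules are written.

```python
from dataclasses import dataclass

# A hypothetical machine-readable plan: each key is a rule a funder might impose.
EXAMPLE_DMP = {
    "min_copies": 3,
    "retention_years": 10,
    "required_checksum": "sha256",
}


@dataclass
class Dataset:
    name: str
    copies: int
    age_years: int
    checksum_algorithm: str


def actions_for(dataset: Dataset, dmp: dict) -> list:
    """Return the actions needed to bring a dataset into line with the plan."""
    actions = []
    if dataset.copies < dmp["min_copies"]:
        missing = dmp["min_copies"] - dataset.copies
        actions.append(f"replicate:{missing}")
    if dataset.checksum_algorithm != dmp["required_checksum"]:
        actions.append(f"rechecksum:{dmp['required_checksum']}")
    if dataset.age_years >= dmp["retention_years"]:
        actions.append("review_retention")
    return actions


if __name__ == "__main__":
    survey = Dataset("survey-2006", copies=1, age_years=10, checksum_algorithm="md5")
    print(actions_for(survey, EXAMPLE_DMP))
```

A text-only DMP states the same requirements, but someone has to read it and act; here the plan itself drives the checks.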