CNI Fall 2012 notes

January 15, 2013

tags: digital humanities, libraries, linked data

A bit late with these, but hopefully there are still some useful bits here for people. Missed the entire first day of sessions due to a series of unfortunate airline and airport events, with the result being a 12-hour trip to DC rather than just a 90-minute flight. Tuesday’s sessions more than made it worth the time to go despite missing half the content. The three project briefings I caught were three of the better ones I’ve seen at CNI, and that’s saying something since the general quality is high. As always, my editorial comments are in italics to differentiate them from the speaker’s words.

Piloting Linked Data to Connect Library and Archive Resources to the New World of Data, and Staff to New Skills
Zheng (John) Wang, Notre Dame (ex-Emory); Laura Akerman, Emory

Linked data teaches machines to understand semantics. Everything starts with the RDF triple: subject/predicate/object. John did a good job of introducing the concepts, using CNI speakers as an example.
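
For readers new to the model, here is a minimal sketch of what a handful of triples and a simple pattern query might look like. It uses plain Python tuples rather than a real RDF library, and every identifier (the `ex:` names, the `match` helper) is made up for illustration; it is not the example John showed.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical data: all names below are invented for this sketch.
TRIPLES: List[Triple] = [
    ("ex:speaker1", "ex:name", "Speaker One"),
    ("ex:speaker1", "ex:affiliatedWith", "ex:universityA"),
    ("ex:speaker2", "ex:name", "Speaker Two"),
    ("ex:speaker2", "ex:affiliatedWith", "ex:universityA"),
    ("ex:speaker1", "ex:presentedAt", "ex:conference2012"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that agree with every non-None part of the pattern."""
    for subj, pred, obj in triples:
        if s is not None and subj != s:
            continue
        if p is not None and pred != p:
            continue
        if o is not None and obj != o:
            continue
        yield (subj, pred, obj)


# "Who is affiliated with university A?" expressed as a triple pattern.
same_institution = sorted(
    subj for subj, _, _ in match(TRIPLES, p="ex:affiliatedWith", o="ex:universityA")
)
print(same_institution)
```

A SPARQL query does essentially the same thing at scale: it leaves some positions of the triple open and asks the store to fill them in.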

Emory’s pilot was intended to link existing resources: catalogue, EAD, etc. As he pointed out, it leverages knowledge and expertise already found in libraries. One challenge is the lack of time to learn new things, but one has to take the time and learn the tools.

How did they bring staff on board? They ran classes: learn RDF, learn SPARQL, learn linked data, etc. Two sessions per month, five months in all. They then ran a three-month pilot that included tech staff, librarians, archivist, etc., where the time commitment was 1-3 hours per week except for the leader.

To narrow the scope of the pilot, they selected the U.S. Civil War as a topic. They had grand ambitions, but weren’t able to get full support from the development staff, so in the end the results were more modest. They gained valuable insights about the kind of staff and skills one needs to make progress, though.

They are convinced that linked data is coming at us fast, and that it’s going to be a huge issue, but also that, in their words, it’s not entirely “cooked” yet. That reflects what I’ve been hearing elsewhere, and speaks to the need for tools and methods that one can adopt rather than having to create from scratch as they did.

Many other challenges, such as less than ideal data, e.g. dates buried in a notes field. Need more skills with HTML5 and other tools. Despite the pains, they consider it a success on balance. One key conclusion is that they are “beginning to realize how this can be so much more than a better way to provide search,” which is a point that we should all bear in mind.

Their general recommendations:

focus on unique digital content
publish unique triples
reuse existing linked data

Also, within the community we need to create standards and best practices, expand our skill sets, develop and test tools. This resonates with me. Generally speaking, in recent years talks about linked data have been high level and very technical, given by people who are not going to be doing the work of description and processing. This knowledge now needs to be translated into practice and tools so that this can become our work, rather than a topic of discussion and theorizing. This emerged during the Q&A. Many in the audience are quite advanced, but Laura’s point was that not everyone is at an institution that can lead innovation; some just need simpler tools.

A unique role for libraries is interdisciplinary linking. As John put it, we are well positioned and perhaps in a unique role to do this. We should leverage that strength.

Project site.

In response to a question about DBPedia, John quipped that Wikipedia is for humans, and DBPedia is for machines.

Virtual Research Environments in Germany: Funding Activities of the German Research Foundation (DFG)
Sigrun Eckelmann, DFG; Steffen Vogt, University of Freiburg; Rüdiger Glaser, University of Freiburg; Yvonne Rommelfanger, University of Trier

Amazing, as always, to see how the DFG embraces a new direction and creates substantial funding streams. 16 million Euros since 2004 for 34 projects, across all disciplines. Now they are keen to move it from a funded environment to a “stable” environment, i.e. part of the normal budgetary process, as I understood it.

These VREs are tied to collaborative research centres that exist at various universities. Notion is to create “core research foci” at various sites. One common requirement: a team of infrastructure experts and researchers. This is a new challenge.

Tambora.org – Historical climatology and environmental history.

Historical climatology has three main thrusts:

reconstruction of past weather and climate (mainly documentary, i.e. written, sources)
impact of climate on societies
discourses and social representation of climate

Methods include critical source analysis, hermeneutics, as well as statistical analysis. Find texts, transcribe texts, code them, analyze them; that’s the basic workflow. This work underscores that digitizing texts is simply the first step. Coding required to make them sources for analysis. Last, but not least, publish the results (preferably with a lovely DOI assigned to them!).

This is a classic example of the digital humanities. This work was possible before technology, but was laborious at best. With technology, far more is possible, and new insights can be mined and analyzed. Also allows layering with other data sources and alternative representations, such as graphing of precipitation from textual descriptions.

Tambora has useful data back to 1000, and it runs to the present day. That’s an amazing body of work to study. From their data, they can reconstruct temperature records from texts. This information can be assigned to maps; in other words, textual information becomes quantifiable.
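
To make "textual information becomes quantifiable" concrete, here is a rough sketch of how coded descriptions could be turned into a yearly index. The codebook, the seven-step scale, and the records are all invented for illustration; Tambora's actual coding scheme and data model were not shown in that detail.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical codebook: coded phrase -> ordinal temperature index (-3 .. +3).
CODEBOOK: Dict[str, int] = {
    "extremely cold": -3,
    "very cold": -2,
    "cold": -1,
    "normal": 0,
    "warm": 1,
    "very warm": 2,
    "extremely warm": 3,
}

# Invented records: (year, place, coded phrase taken from a transcribed source).
RECORDS: List[Tuple[int, str, str]] = [
    (1501, "Place A", "extremely warm"),
    (1501, "Place B", "very warm"),
    (1502, "Place A", "very cold"),
    (1502, "Place B", "extremely cold"),
    (1502, "Place C", "cold"),
]


def yearly_index(
    records: List[Tuple[int, str, str]], codebook: Dict[str, int]
) -> Dict[int, float]:
    """Average the coded index values of all records that share a year."""
    by_year: Dict[int, List[int]] = defaultdict(list)
    for year, _place, phrase in records:
        by_year[year].append(codebook[phrase])
    return {year: mean(values) for year, values in sorted(by_year.items())}


series = yearly_index(RECORDS, CODEBOOK)
print(series)
```

Keeping the place with each record is what would allow the same values to be put on a map rather than only into a time series.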

What are the challenges: various disciplines (historians, linguists, meteorologists), different tools, locations, working hours, etc. Availability of tools and data to all partners is also difficult. Also, how do you cite data? Since publication is key, citation becomes a possible incentive to engage in collaboration.

The VRE levels this: 24/7, all tools for all. They have a Web interface for uploading and coding texts, also a geolocation tool based on Google data. They have a glossary of historical place names to support geolocation. The interface also permits searching and finding available data for reuse.

For quality control (among other factors), they use project managers who determine how data finds its way into the repository and ensure that standards and procedures are observed. Beyond that role, they have an entire quality control mechanism. Content quality comes from the scientists, but on the library side, it’s reviewed (metadata completeness, e.g.), has a DOI assigned and registered (DataCite), and is published.
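
The library-side review step could be sketched roughly as follows. The list of required fields is an assumption loosely modelled on the kind of core properties a DOI registration asks for, not the project's actual checklist, and the sample record (including its DOI-like identifier) is invented.

```python
from typing import Dict, List

# Assumed minimum set of fields; the real review criteria were not spelled out.
REQUIRED_FIELDS = ("identifier", "creators", "title", "publisher", "publication_year")


def missing_fields(record: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def ready_for_doi(record: Dict[str, object]) -> bool:
    """A record passes the completeness check when nothing is missing."""
    return not missing_fields(record)


# Hypothetical dataset descriptions.
complete = {
    "identifier": "10.1234/example-dataset",
    "creators": ["Example, Researcher"],
    "title": "Coded weather descriptions, sample collection",
    "publisher": "Example University Library",
    "publication_year": 2012,
}
incomplete = {"title": "Untitled upload", "creators": []}

print(ready_for_doi(complete))
print(missing_fields(incomplete))
```

The point of such a gate is the division of labour described above: researchers vouch for the content, while the library checks that the description is complete before anything is registered and published.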

Their project was so successful that the Freiburg library adopted the repository for other research data management needs.

Worth noting how strongly he stressed the publication piece. The DOI is a mechanism, in this case, to enhance global awareness, since it transcends a national or regional framework. Were all such data projects so assiduous about assigning and publishing DOIs, we would be a long way down the road of making data discoverable.

FuD – Uni Trier, founded 2004. German acronym for Research Network and Database System.

Started as support system for a collaborative research centre around the topic of “strangers and poor people,” which involved 70 researchers in 25 projects. The emphasis now is on creating a virtual research environment for the humanities. As of 2015 it should become a “regular and permanent operation.”

It has three subsystems:

data collection and analysis
editing and publication
archive and information (repository, longterm archiving)

Much like Tambora, it sets out to create a tool where texts can be entered, encoded, and analyzed. Intended to be full featured. From the screenshots, it appears to be an installed application rather than Web-based.

The FuD archiving system uses Fedora for the repo and Blacklight for the public interface.

She pointed out that a system such as FuD can drive any number of applications: creation of text editions (complete works), revision and publication of various documents, preparing print publications, etc. Currently only have German interfaces, but are working on English interfaces.

The Future of Fedora
Edwin Shin, MediaShelf; Matthias Razum, FIZ Karlsruhe; Tom Cramer, Stanford; Jonathan Markow, DuraSpace; Mark Leggott, Discovery Garden

Why has the level of Fedora development tapered off? One reason: the Moore Foundation gave money to get it going, and as the money has tapered off and now gone away, the task has been to engage the community in supporting the existing projects. Also, it’s 12 years old, which means the code becomes harder to maintain, and it’s also difficult to attract new developers to take up the task.

Move now is to do “something significant” to move Fedora forward, improve the existing codebase, and develop new features and elements. Fedora Futures is a movement to push that forward. There’s a steering committee, a tech group, and a fundraising committee, all of which resulted from meetings held in 2012 between interested parties. Mark Leggott is the chair of the steering committee.

So what are Fedora’s existing strengths? Cramer’s conclusion is that Fedora is a “winner.” It works, has a large user base, and fills a distinct niche. Beyond that:

flexible & extensible
support for durability
decade of maturity and use (way past vaporware stage)
large adopter community, many contributors
established in the linked data world (“can meet today’s needs”)

What does it lack?

performance
fault tolerance and scalability (built in a different Web age)
complexity
code base getting old
relatively small cohort of committers

Objectives of Fedora Futures:

preserve the existing strengths (community and architecture)
address needs for robust and full-featured repo services (which are now much clearer)
extend the utility of Fedora (as stable platform) another 5-10 years

In its current state, Cramer believes that without this push, the 5-10 year goal is unrealizable.

Many goals:

work for institutions of all sizes
support traditional IR use cases (as well as other existing use cases)
support data management
interoperability
go ‘native’ in the Web (missed his explanation of this point)

Organizational goals:

more contributors
remake codebase and dev environment (promote “Joy of Coding”)
get the community involved in support and governance

The Fedora Futures groups identified around 30 use cases for Fedora, which were then reduced to four major topics:

manage research data
improve administratibility (heck of a word)
handle heterogeneous content more efficiently (particularly with regard to size; ability to handle massive data sets)
interact with linked data/semantic Web

As Matthias put it, it was encouraging to see that all of these use cases converge into these four directions. They also discovered that there aren’t that many actors: curators, administrators, researchers, developers. That’s manageable. “Keep developers happy”–if you don’t, they seek other frameworks and projects.

Technical requirements:

improve scalability and performance
more flexible storage options
support for dynamic metadata
globally unique and reliable identifiers
improved audit trail

Non-technical requirement: easy and fun to use API. That would be a remarkable achievement in itself.

Eddie did a good job of giving context. Pointed out that Fedora is a platform upon which one should build and develop services, and so part of this push is to simplify the core code to emphasize this role. This panel is a good demonstration of that. There are individuals representing projects (and companies) that build services that use Fedora as their platform. This kind of vendor harmony is great to see, so hopefully it continues.

Eddie also was rather blunt (refreshingly so) about how projects go off the rails. Named names, and detailed how they went off track to some degree. Also pointed out that saying that a project is agile is merely stating something about its technical structure and positioning, not an attribute that makes everything good or leads de facto to success.

How are they going to proceed? A lean methodology: build, measure, learn. Continuity and results (i.e. shipped product). The whole business launches December 12, 2012. One example of a quick win he provided was functionality for Amazon Glacier. He sees it as a matter of a couple of weeks to turn such a feature around. He hopes to see continual interaction between developers and the product, rather than occasional monumental releases.

Good first question: so is this further development of the existing code base, or a complete rewrite? The answer is that both are under consideration. Another answer was that even with major versioning, it’s still Fedora, and that continuity will be paramount.

Follow-up question concerned the timeline, which looks to be around three years (for the Fedora Futures project). But Eddie pointed out that it’s not a three-year development cycle, but rather a series of accomplishments, so that there’s usable code within six or seven months. As he put it, “We’re not asking the community to wait and hold their breath” for a product that will take three years. Even if it were a “green field project,” he thinks it would still be possible to move the idea/spirit of Fedora forward.

Another question concerned support for multiple Fedora versions. One answer was that it’s not possible, long term, to support multiple versions. Another was that there has to be a clear migration path, and then it’s up to adopters to decide when to move forward. In essence, it won’t be an entirely new product that lacks a migration path from the existing state. As Matthias put it, he needs Fedora for his own projects too, so this is in his own interest.

In response to another question, Eddie made clear that the core product needs to be leaner. Practice has shown that adopters use other tools for certain tasks and functions, so there’s no need for those tools to exist in Fedora. He gave examples, but I couldn’t catch them since I’m not familiar with Fedora details.

Closing Plenary
Hunter Rawlings III, Association of American Universities

Didn’t take a lot of notes during this talk. For one, it was a fairly high-level gloss of various current topics in academia. He also was speaking about issues–scholarly communication–to a highly knowledgeable audience where many likely know more about the topic than he. Not his fault, really, but I don’t think he was a good fit for a CNI plenary.

Key question: what is college for?

Gave a good overview of the current challenges that we all hear about daily in the press and on campus: value for money, future of education, purpose of an education, governments reducing higher ed to its financial outcomes, etc.

He highlighted several trends that are driving higher ed, first and foremost the incredible “flood” of Chinese students into American higher education. He ran the numbers through, and he pointed out that the trend is not likely to abate anytime soon since there are huge numbers of students in China and they rank US institutions (and other non-Chinese schools) above their own domestic universities. He called it a “tidal wave of students” coming as undergraduates, where earlier it was primarily graduate students. He feels this wave will alter US higher ed, as well as alter China. He feels that their exposure to US higher ed will change China because they will have experienced academic freedom. Hard not to see his point, but could it work the other way around, and is that worth considering?
