CNI Fall 2011 notes
After the spring Coalition for Networked Information (CNI) meeting, I posted my fairly raw notes from the sessions I attended. That proved to be a popular post, so I thought I’d do the same for the recent fall meeting. Lots of good speakers, as always. My editorial comments are in italics to differentiate them from the speakers’ words.
CNI Fall 2011 Membership Meeting
December 12-13, 2011
Clifford Lynch, CNI Executive Director
Interest in “big data” is coming on strong. It is even popular outside the academy, e.g. The Economist. We should remember that some “big data” isn’t so hard: it’s often small in size and found in places like Excel spreadsheets.
Lynch got a good laugh by asking who should archive medical records for dead people. Libraries or insurance companies? Funny but serious question.
Cloud solutions: points out that bandwidth and the time needed to replicate data is a major issue. Not possible/smart to change vendors casually. Heard this mentioned repeatedly in various talks.
Ebooks have hit the point where they impact the economics of popular publishing. Public libraries are suffering with ebooks. No one is even asking the questions about how to preserve ebooks. Another impact will be the disappearance of the used book market. This all impacts our mission to preserve the public record. Also, authors are pulling out of the typical publishing stream and our tools are built for a legacy model.
Building your content into apps flies in the face of all the work that’s been done to develop universal content standards. The iPad is currently hot, but it’s not the only thing.
Ideas That Drive Technology Innovation: Perspectives From Two Institutions
Dean Krafft, Cornell University – Beth Sandore Namachchivaya, University of Illinois
Sandore from UIUC gave a survey of their software development activities. Neat stuff, but fairly retro, in that we’ve seen similar apps in the same timeframe. Don’t recall that any of these were ever open sourced. The main question to ask about such projects: do they get used? She called them innovative, but they’re less innovations than iterations of other innovations. They show DIY spirit, but is that enough?
Her talk also reiterated the familiar divide: librarians on one side, IT on the other. When can we finally merge them?
Cornell has an army of developers. Hard to find points of comparison.
Paying For Long-Term Storage
David S. H. Rosenthal, Stanford University
If you charge twice what storage actually costs, you can store forever. The costs drop so quickly (Kryder’s Law) that this becomes true.
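The arithmetic behind that claim is a geometric series: if storage cost declines by a constant fraction each year, the cost of storing data forever converges to a finite “endowment.” A minimal sketch (the decline rates below are illustrative assumptions, not Rosenthal’s figures):

```python
# Sketch of the "endowment" arithmetic behind Kryder's Law pricing.
def endowment(first_year_cost: float, annual_decline: float) -> float:
    """Total cost of storing data forever as a geometric series.

    Each year costs (1 - annual_decline) times the year before, so
    the sum first_year_cost * (1 + r + r^2 + ...) with r = 1 - decline
    converges to first_year_cost / annual_decline.
    """
    return first_year_cost / annual_decline

# Illustrative: year one costs $100 and costs fall 25% per year.
total_25 = endowment(100.0, 0.25)  # 100 + 75 + 56.25 + ... = 400
# With a 50% annual decline, the endowment is exactly twice the
# first-year cost -- the "charge twice, store forever" rule of thumb.
total_50 = endowment(100.0, 0.50)  # = 200
```

The 50% case shows where the “charge twice” rule comes from; slower declines simply mean a larger multiple of the first-year cost.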
Data doesn’t survive benign neglect as paper does. This represents a fundamental shift in our obligations.
Is Kryder’s Law endangered by the fact that the desktop PC market is dwindling? If we were still on Kryder’s curve, we’d have 4TB drives by now, but we’re stuck at 3.
Pointed out that paying upfront for long-term storage may not be a good idea. Moving data is not trivial or cheap. This idea recurred in various talks and chats at CNI, becoming one of those things that sticks in your head. The good news: Kryder sees his law continuing until 2020. Bad news: even if you get a 14TB drive for $40, how do you move that data around?
He doesn’t think solid-state drives are going to save us. They’d have to be building the manufacturing capacity now and they aren’t. Simply not enough wafers and semiconductor fabricators. Then again, he’s done an economic analysis that shows it could work. Admits, however, that his model is not entirely realistic.
There is no global solution. Need to always be thinking about what comes next because life of data goes beyond any given hardware system.
Mused on a future-oriented investment plan where you set money aside now to pay for future costs based on a standard investment model (bond-linked, for example). He then trashed the whole idea by showing that discounted cash flow (upon which the model relies) is entirely bogus.
What’s needed is a simulation environment that includes more variables and more volatility. Include all the costs: capital costs, running costs, move-in costs, move-out costs, and service life.
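A miniature version of the simulation he’s describing: instead of a fixed Kryder rate, draw each year’s cost decline from a distribution, and fold in the one-time move-in/move-out costs he lists. Every number here (rates, dollar figures, horizon) is an invented placeholder for illustration:

```python
import random

# Miniature Monte Carlo storage-cost simulation: the Kryder rate is
# volatile (drawn each year) rather than constant, and the one-time
# move-in/move-out costs from the talk are included.
# All figures below are invented placeholders.
def simulate_total_cost(years: int = 30, seed: int = 0) -> float:
    rng = random.Random(seed)
    media_cost = 100.0   # capital cost of media, year one (assumed)
    running_rate = 0.5   # running costs as a fraction of media cost (assumed)
    move_in = 20.0       # one-time ingest cost (assumed)
    move_out = 20.0      # one-time exit cost (assumed)
    total = move_in
    for _ in range(years):
        total += media_cost * (1 + running_rate)
        # Volatile Kryder rate: this year's decline is 5%-40%.
        decline = rng.uniform(0.05, 0.40)
        media_cost *= (1 - decline)
    return total + move_out

# Many trials give a spread of outcomes, not a single point estimate.
trials = [simulate_total_cost(seed=s) for s in range(1000)]
spread = max(trials) - min(trials)
```

The point of the exercise is the spread across trials: a single discounted-cash-flow number hides exactly the volatility this makes visible.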
Rosenthal batted aside questions about format migration, because he doesn’t think these migrations are going to happen, as he stated in a recent article.
Crowd Sourcing Metadata
Barbara Taranto, New York Public Library
NYPL had a nice scanned collection of menus, but needed transcriptions and metadata. Got grants (NEH) to create a crowdsourcing application that would draw in users.
Result was What’s on the Menu? application that displays menus and allows user input. The app allows users to export the data for further use.
While listening to this talk, I transcribed a couple of menu pages, which is both satisfying and kind of fun. What I could not understand during Taranto’s talk nor afterward when I asked her a question was why she seemed so negative about the project and crowdsourcing in general. Her points about contributors making mistakes, being inconsistent, and drifting away when they get bored are all well taken, but given that the transcription would likely never be done internally, isn’t something better than nothing? Also, if any topic would attract a general audience these days, wouldn’t food be the one?
Understanding Use of Networked Information Content: MINES for Libraries® Implementations at Scholars Portal
Dana Thomas, Scholars Portal
Alan Darnell, Scholars Portal
Terry Plum, Simmons College
Martha Kyrillidou, Association of Research Libraries
MINES is an intercept survey, presented to random users when they access an SFX menu. For Scholars Portal, it’s an every-nth survey, currently every 250th access.
Surveys are short and direct: a few demographic questions and then a brief substantive question. Terry Plum referred to the acquired data as “nuanced COUNTER data.”
Survey tool was LimeSurvey, which was running branded surveys for each school (20 total).
MINES allows one to get some sense of how users in the building are or are not using electronic resources. It shows most usage coming from off campus. Since SFX is the vehicle for the survey, one can parse the OpenURL for attributes and do further analysis. It’s possible, for example, to sort usage by consortial origin, e.g. OCUL, CRKN, Knowledge Ontario.
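Because each survey response rides on an SFX request, the OpenURL’s key-value pairs come along for free. A rough sketch of that kind of parsing (field names follow the common OpenURL 0.1 style; the consortium lookup table is an invented illustration, not Scholars Portal’s actual mapping):

```python
from urllib.parse import parse_qs, urlparse

# Invented illustration: map a link-source prefix to a consortium.
CONSORTIAL_PACKAGES = {"elsevier_sd": "CRKN", "jstor": "OCUL"}

def attributes_from_openurl(url: str) -> dict:
    """Pull analyzable attributes out of an SFX OpenURL."""
    params = parse_qs(urlparse(url).query)
    sid = params.get("sid", [""])[0]       # source that generated the link
    provider = sid.split(":")[0].lower() if sid else ""
    return {
        "issn": params.get("issn", [""])[0],
        "genre": params.get("genre", [""])[0],   # article, book, etc.
        "source": sid,
        "consortium": CONSORTIAL_PACKAGES.get(provider, "local"),
    }

attrs = attributes_from_openurl(
    "http://sfx.example.edu/sfx_local"
    "?genre=article&issn=0028-0836&sid=Elsevier_SD:ScienceDirect"
)
```

Once requests are reduced to records like this, sorting usage by consortial origin, genre, or title is ordinary tabulation.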
Not sure I believe their low open access usage numbers. I suspect a lot of OA titles hide in packages that get labelled as local or consortial subscriptions. Also, one may not need to use the SFX menu to get an appropriate copy, since Google Scholar, among others, links directly to OA.
Using MINES to identify resource type is less than ideal since one is relying on indexers to include new content such as ebooks, and most don’t.
If 4% of Elsevier usage is from humanities (UofT), then I have doubts about representativeness. They have, at best, a handful of humanities titles.
WissKI: An Architecture For A Transdisciplinary Virtual Research Environment
Guenther Goerz, University of Erlangen-Nuremberg
Siegfried Krause, Germanic National Museum, Nuremberg
DFG (Deutsche Forschungsgemeinschaft) funded, of course, as with nearly all academic or scientific research in Germany. Various partners, including the Germanisches Nationalmuseum, the University of Erlangen-Nuremberg, and the Zoological Research Museum Koenig (ZFMK).
Project is about breaking down what they call workspace silos. These silos arise for publication purposes. Ultimately much of what they gather does not find its way into publications. Silos have value and should be archived, but how does one do this practically? With paper, “it was easy” because you just tossed the papers in the archive.
The tool has to be easy for curators and researchers to use and they have to like it. Goal was to shield the academic user from the technology. Presentation layer of the tool uses the wiki concept. Entirely built on open source tools. Uses Drupal as its CMS. Adopted standards for semantic linking and import/export. Supports OAI, even though, as he put it, that is anachronistic.
Simple goal: federate cultural heritage with scientific data. Leads from data to information to knowledge. Higher level of understanding. Tagging not enough, for semantic inference you need a reasoning framework.
Delved into the details of the CIDOC CRM, which made sense while he was going through it, but I can’t reproduce in notes.
This talk really brought the use of triples to light. Experts build relationships that give meaning to the metadata that staff enter when doing their work.
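The mechanics are easy to see in miniature: each entry becomes a subject-predicate-object triple, and an expert-authored rule lets software infer statements nobody typed in. The vocabulary and rule below are a toy invented for illustration, not actual CIDOC CRM classes:

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# Vocabulary invented for illustration; not actual CIDOC CRM.
triples = {
    ("painting_42", "was_produced_by", "workshop_duerer"),
    ("workshop_duerer", "is_located_in", "nuremberg"),
}

def infer_production_places(facts: set) -> set:
    """Expert-built rule: if X was produced by Y, and Y is located
    in Z, infer that X was produced in Z."""
    inferred = set()
    for s, p, o in facts:
        if p == "was_produced_by":
            for s2, p2, o2 in facts:
                if s2 == o and p2 == "is_located_in":
                    inferred.add((s, "was_produced_in", o2))
    return inferred

new_facts = infer_production_places(triples)
# Infers ("painting_42", "was_produced_in", "nuremberg") even though
# no one entered that statement directly.
```

This is the sense in which tagging isn’t enough: the inference comes from the relationship structure the experts built, not from any individual record.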
Cost Forecasting Model For New Digitization Projects
Karim Boughida, Martha Whittaker, Dan Chudnov, George Washington University
Linda Colet, DaoPoint Digital, LLC
GWU has major collections from Eurasia and the Middle East. For digitization they use a Kirtas scanner and ABBYY OCR, like many do.
They studied existing cost models and looked at their processes to identify bottlenecks and variables. Looked at project budgets from grant applications and actual expenditures. They now have an Excel spreadsheet tool and want to get to having a Web tool for doing cost models.
Some public cost models, such as the Internet Archive’s, don’t necessarily seem realistic. Not sure how they arrived at ten cents per page. The British Library Lifecycle model has a formula to follow, and includes preservation and long-term costs.
GWU works with a 3-5 year cost model, which mainly helps with calculating costs for grants. It considers four phases: project planning, scanning, processing, and making collections available. Metadata creation and storage are typically the major costs. Even after process improvements, they average four hours to scan and process a book. Overall, planning is the most labour-intensive activity. Scanning takes the smallest share, but processing and making available involve more time. Over time, planning decreases in terms of its share of overall resource investment.
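A model like this boils down to summing per-phase costs over a multi-year horizon. The phase breakdown and the four-hours-per-book figure follow the talk; every dollar figure and percentage split below is an invented placeholder, not GWU’s data:

```python
# Sketch of a four-phase digitization cost model, following the
# talk's phase breakdown. All rates and splits below are invented
# placeholders, not GWU's actual figures.
def project_cost(books: int, years: int = 3) -> dict:
    hours_per_book = 4.0     # scan + process time, per the talk
    hourly_rate = 25.0       # assumed labour rate
    labour = books * hours_per_book * hourly_rate
    phases = {
        "planning": 0.30 * labour,          # most labour-intensive phase
        "scanning": 0.10 * labour,          # smallest share
        "processing": 0.35 * labour,
        "making_available": 0.25 * labour,
    }
    phases["storage"] = books * 2.0 * years  # assumed $/book/year
    phases["total"] = sum(phases.values())
    return phases

costs = project_cost(books=500)
```

Even a toy like this makes the later point concrete: the model is only as good as what you track, so the act of building it tells you which quantities (time per item, storage, phase shares) you need to start recording.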
Dan’s role is to transform it from a project to a program and make it routine in the organization.
One thing a cost model does is teach you what you need to track: time, state of items, storage float, etc. One can get carried away with quality. Scale it to the needs.
They plan to find ways to give clients access to raw scans before processing so that they can watch progress and have assurance that things are progressing. Dan refers to this as getting value from the float, in other words from items on the assembly line.
Closing plenary: Five New Paradigms for Science and Academia and an Introduction to DataONE
William Michener, University of New Mexico
Pointed out the profusion of sensors, all of which are creating data, even the four billion cell phones. He asserts we are heading into an age of big challenges: global warming, population, food supplies, etc. Also touted the rise of citizen science, with hundreds of organizations doing a wide array of work. Scientists now work in larger teams across broader spatial, temporal, and thematic scales.
Data are increasingly seen as a valuable product of research. The NSF mandate is a manifestation of this rise. Showed Heidorn’s long tail of orphaned data. A small number of repos hold a massive amount of bytes, but most of the datasets are in the long tail. Mentioned DRYAD’s work.
A survey he cited of environmental scientists found that 80% of them were willing to share their data with others to use in different ways. They’re willing, but they confront challenges, also captured in surveys.