CNI Fall 2013 notes
Another useful and informative CNI just ended. As I typically do, I took fairly copious notes on some of the sessions and am sharing them here.
As always with my notes, I’ve put my commentary, where possible, in italics.
Monday, December 9
Barn Raising in a Virtual World or How Innovative Approaches in Funding Led to an Architecture We Can All Use
Allan Bell, UBC; Alex Garnett, Simon Fraser; Carla Graebner, Simon Fraser; Geoff Harder, U of Alberta
Barn raising is a good metaphor for this work, not least because it is collaborative and community based. As it turns out, between BC and Alberta, one has a wide range of local resources at hand for such a project, both informal and formal. These range from code4libBC to CARL and the Council of Pacific and Prairie University Libraries (COPPUL). Within these various entities, and there are more than I’ve mentioned here, there is a variety of expertise at hand in working groups, initiatives, project groups, etc.
In passing, Carla mentioned that one of the assets they had in hand was a local IT staff with extensive experience with open source solutions. That’s an understatement, to say the least, but I’m glad she pointed it out, since it makes clear that one of the benefits of doing your own thing with open source is that your staff develop and can then apply their expertise to other work. As those of us who by nature lean toward open source often say to those who trot out evergreen statements such as “open source isn’t free, you know,” it is wiser to invest in staff capacity than to fund a software company’s bottom line.
Interesting sidenote: when Alex started speaking, he noted that his title is Digital Preservation and Data Curation Specialist, but then said in passing that he’s a systems librarian. It’s a positive development to note that someone doing his work now has a much more descriptive (and forward-thinking) title than “systems librarian,” which is a legacy from ILS days.
Archivematica was a natural choice for a project in their environment: open source and based in BC. They are now engaged in creating customizations for their work, related in part to automated ingest. Archivematica is generally conceived of as a tool for active use by depositors, curators, and archivists, but their needs are more at the tail end, since they use other front end technologies, such as Islandora. Auto ingest can also work well with theses, which are largely homogeneous.
Their customizations largely concern engaging a specific set of the available microservices and applying them to batches. Alex used some screenshots of Archivematica to illustrate this concept.
They want to use Archivematica with a local LOCKSS network, since LOCKSS offers more for bit-level preservation than does Archivematica (which only uses checksums). LOCKSS generates multiple copies which can then be compared and used for repair. They are calling their central LOCKSS instance the LOCKSS-o-matic, a name that Alex attributes to Mark Jordan. He should trademark that ASAP.
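A minimal sketch of the difference between checksum-only fixity and multi-copy repair, for readers who want to see the idea in code. It is not how LOCKSS or Archivematica actually implement this; the function names, the sample bytes, and the simple majority vote are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def fixity_ok(data: bytes, recorded_digest: str) -> bool:
    """Checksum-only check: it can detect damage but cannot repair it."""
    return sha256(data) == recorded_digest


def repair_by_majority(copies: list[bytes]) -> list[bytes]:
    """Compare copies, pick the most common content, and overwrite the outliers.

    Hypothetical simplification: assumes a strict majority of copies is intact.
    """
    digests = [sha256(c) for c in copies]
    winner, votes = Counter(digests).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no majority; cannot repair safely")
    good = copies[digests.index(winner)]
    return [good for _ in copies]


original = b"thesis.pdf contents"
copies = [original, b"thesis.pdf c0ntents", original]  # second copy is damaged

print(fixity_ok(copies[1], sha256(original)))  # False: damage is detected
repaired = repair_by_majority(copies)
print(all(c == original for c in repaired))  # True: damage is repaired
```

With a single copy and a stored checksum, the first function is all one has: it reports the problem but offers no way back. The second function only works because other copies exist to compare against.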
Allan spoke mainly about the benefits of collaborative effort, which seem clear in this case. Together they are going to go further much faster than any of them would on their own. It seems that this is something that these institutions do well (as one has seen from previous projects). He made the clear statement that digital preservation strategies and processes require open source solutions for transparency reasons.
They are developing an “Archivematica-as-a-Service” model for other BC and Alberta institutions for whom it doesn’t make sense to build systems at this scale on their own. It has gold, silver, and bronze levels based on services provided/needed. BC, as is well known in Canada, has very strict privacy laws, which limit the ability to use commercial cloud services.
Geoff made many of the same points about monolithic systems: “trust what we can see, not what we’ve been told.” He pointed out some of their pain with such solutions, using an example that was documented at a recent Access conference by Peter Binkley (some notes on that talk can be found here).
He also noted that working in this way enables collaboration beyond the immediate partners. They are looking to be able to exchange research data nationally and internationally, for example. He used the Canadian Polar Data Network as an example of successful collaboration, but pointed out that this needs to happen routinely, not just related to special initiatives such as the International Polar Year.
He briefly raised some of the definitional problems we have, e.g.- what exactly is a ‘dataset’ and how should the related AIPs be created? One answer could be to use an AIC (Archival Information Collection). AICs are structurally identical to AIPs, but contain metadata related to a class of AIPs. This allows certain metadata to be stored once and only once, with AIPs associated to the ‘parent’ AIC.
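The "store once, associate many" idea can be sketched as a small data structure. This is an illustration only, not Archivematica's actual package format; the class fields, identifiers, and metadata values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AIC:
    """Archival Information Collection: holds metadata shared by a class of AIPs."""
    identifier: str
    shared_metadata: dict[str, str]


@dataclass
class AIP:
    """Archival Information Package: keeps its own metadata plus a link to its parent."""
    identifier: str
    parent: AIC
    own_metadata: dict[str, str] = field(default_factory=dict)

    def metadata(self) -> dict[str, str]:
        # Shared values come from the AIC; package-level values win on conflict.
        return {**self.parent.shared_metadata, **self.own_metadata}


# Hypothetical example values, not taken from the talk.
survey = AIC("aic-001", {"project": "example survey", "funder": "example funder"})
season_1 = AIP("aip-001", survey, {"season": "2012"})
season_2 = AIP("aip-002", survey, {"season": "2013"})

print(season_2.metadata())
```

Both packages point at the same parent object, so a correction to the shared metadata is made in one place and is visible from every associated AIP.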
Kudos to Geoff for working in a Rob Ford joke.
They have many tasks ahead:
- linking Archivematica with the CPDN workflows
- connecting Archivematica to new and existing infrastructure: Fedora, OpenStack, Dataverse, etc.
- sustainability – build partnerships to make sure the tools continue to develop and remain viable.
Digital Humanities and Arts Projects at Columbia
Mark Newton, Jackson Harvell, Leyla Williams, Tad Shull
Jazz and Music Information Retrieval – Tad Shull
MIR is based in part on the notion that music and mathematics have a connection. Underneath the digital audio files that we hear as music, there is just data, and this data can be analyzed.
Jazz is something of a special case, and few MIR specialists work with Jazz. “It’s free, it’s chaotic.” Improvisation is the creative element, which creates obvious issues for MIR. Implied beats, polyphony, no score: all of these present challenges.
What can/could one do with Jazz MIR? Identifying soloists, and creating an artistic profile of an individual artist would be on the list. One could use it to organize collections that come to archives and libraries with poor metadata, e.g.- loft jazz from the 1970s, which would facilitate retrieval and access.
Women Film Pioneers Project – Newton, Harvell, Williams
As he was talking about the technical platform for WFPP, Mark mentioned that they were hoping to use this project as a model that they can easily replicate for other work. It’s based on WordPress, so that would seem possible.
Jackson noted that some of his design choices, e.g.- the women in the photo montage on the header, were undone when later considerations came into play about what the images actually represent. They originally intended to use wiki software, with the goal of creating a wiki that would evolve and grow. Eventually, they abandoned that idea, moved to WordPress and got into content and data modeling.
When I see the amount of design work that went into creating this site–something Jackson underscored several times–concerns about scalability come to mind. How many sites like this can we develop and maintain as libraries? Such projects are major commitments, and while in spots and places we can have huge successes, it just doesn’t seem that this boutique model is sustainable. We all do this, and we all seem to find it troublesome, but it seems like we are stuck doing this work for various reasons. It’s certainly something we hear a lot from faculty: we need a Website!
In response to a question, Mark made the clear point that having a centre, i.e.- an actual place where people gather, is key to bringing together diverse functional teams.
Mobile Technologies to Support Field Research
Wayne Johnston, U of Guelph
Wayne showed some interesting photos of researchers in the field carting around an array of tools, including paper printouts of spreadsheets, pencils, wire flags (to mark plants in a field), etc., all of which, as he noted, can be replaced by a mobile device. One researcher self describes as a technophobe, and works mainly with paper, making photocopies of his work as a backup medium.
Gave the example of a large undergraduate class on biodiversity where he is helping introduce mobile technologies. The benefits in terms of both recording and preserving the work are clear. Time is saved, fewer errors, new capabilities such as GPS coordinates, video/audio, etc.
Researchers have concerns about using this technology in environments where the technology can be damaged by the conditions. Wayne’s been investigating ways to protect against physical damage. Theft, loss, and hacking are also concerns.
He gave a brief tour through a number of services and tools that have emerged around the OpenDataKit (University of Washington, funded in part by Google). The intent is to create open source tools that can be used for data research, and to create connectivity between the device and a server where the data can be stored and secured.
Visualizing: A New Data Support Role for Duke University Libraries
Started with a brief tour of why visualization is important, and how it helps humans make sense of statistical data. Also described the history of visualization at Duke, which actually started in 2001. Quite a number of activities have been going on, so it makes sense that they have identified this as a specific area to support.
A question that arose at Duke was where a visualization support service could/should be located. As many of us know, putting it in a specific college or department makes it less likely that it will serve the entire campus. She noted that there are other entities who play a role in this realm, some in the library, some elsewhere (e.g.- the Teaching and Learning Center, Social Science Research Institute).
Her Data Visualization Coordinator position reflects the decision to place this in the library. She’s been there since June 2012 in this role, which reports both to the library (through Data & GIS Services) and to Research Computing in the Office of Information Technology.
Her greatest successes have been visualization workshops, online instructional material, ad hoc consulting (she called it just-in-time), an ongoing visualization seminar series, and a student data visualization contest. The workshops cover tools (Tableau, d3), data processing (text analysis, network analysis), and best practices for designing academic posters/graphics (top 10 dos and don’ts).
She ventured into something of a best practices workshop with us, showing how to use Git/Gist to use some of the js visualization tools, and talking about how one can simplify certain information on the axes to make things more accessible or position a pie chart to make the point you want to make with it even clearer.
Her challenges include convincing certain people on campus that this service exists; the work she’s doing for those who do seek her out makes it hard to market to those people and bring them in. Keeping up with the field has also been a challenge; new tools emerge rapidly. She noted how useful Twitter is for this. She has also encountered issues with disciplinary silos and conventions. There are well established practices in some fields that have not kept up with broader changes, so it’s hard to make headway. Last but not least, there is no data visualization curriculum at Duke, which leads to a skills gap. She has lots of interest, but a lot of people lack data processing skills, such as:
- visualization types and tools
- spreadsheet/database familiarity
- data management practice
- basic graphic design
She wants to address this by creating an active student training program. She also wants to find more opportunities for physical and digital exhibits, as well as to develop more projects and workshops. One of her recommendations for coordinators is to have a stockpile of interesting datasets on hand to create sample visualizations.
Research Data Management at the Smithsonian Using Sidora
Sidora is based on both Islandora and Fedora, hence the need for a new name. He gave a snapshot background of the Smithsonian as an institution, noting that even people who work there don’t know the scope. It was interesting, and it made clear to me that they are more of a research organization than I had realized, which means they have many issues related to data management that are similar to what we experience at universities.
His goal is to support research; he explicitly stated that someone else is going to have to come along to archive it. That’s a different task. They want to capture data as it’s created. Their infrastructure should do the work for the researchers; there’s no desire to turn them into archivists or cataloguers. As one of his slides put it, “researchers will have a workspace, not an archive, curators will make sense of it later.”
Other goals include integrating the software tools into the repository. He also noted that while security is necessary, it should not be a barrier. He spoke quite eloquently about the fact that the Web is their model, i.e.- that whenever researchers go into these things what emerges looks like a network that is full of links and connections that represent human intellectual work. It’s hard to capture in notes the essence of what he said, but it was a convincing vision and the notion that we need to connect, to network, came across loud and clear.
Their model is built around concepts and resources. Concepts have a hierarchy, of sorts, starting with researcher, then project, then various concepts that make sense for a kind of research, e.g.- place, person, dataset, animal or plant, event, etc. Resources can be attached to one or more concepts, and are endpoints in themselves, i.e.- there are no dependencies that descend from resources. A concept and its related resources can be treated as a set. The ontology they use is based on the CASRAI data dictionary.
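A rough sketch of how such a concept/resource model might look as a data structure. This is not Sidora's actual schema; the class names, fields, and sample labels are hypothetical, chosen only to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """An endpoint: nothing descends from a resource."""
    name: str


@dataclass
class Concept:
    """A node in the hierarchy (researcher, project, place, dataset, ...)."""
    kind: str
    label: str
    children: list["Concept"] = field(default_factory=list)
    resources: list[Resource] = field(default_factory=list)

    def as_set(self) -> list[str]:
        """Return the names of the resources attached to this concept."""
        return [r.name for r in self.resources]


# Hypothetical hierarchy: researcher -> project -> place / event.
researcher = Concept("researcher", "Example Researcher")
project = Concept("project", "Example Project")
place = Concept("place", "Field Site A")
event = Concept("event", "Site Visit")
researcher.children.append(project)
project.children.extend([place, event])

# One resource attached to two concepts.
photo = Resource("photo-001.jpg")
place.resources.append(photo)
event.resources.append(photo)

print(place.as_set(), event.as_set())
```

Because resources carry no children of their own, a concept together with its attached resources is a self-contained unit that can be handled as a set.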
In response to a question about searching the data, he gave a funny response where he said he is a curmudgeon about search interfaces. While he admits they are important, he is wary of them being the whole point. It will be searchable, but it’s not really a focus. In passing while doing a live demo, he did mention that they are working with a contractor to convert functionality currently in dropdowns to drag and drop features, so clearly user interface and utility for the researcher are higher priorities, which makes sense given what they have set out to do.
In response to a question about permissions, he riffed a bit on their design process, which did not start with asking researchers what they wanted. He noted that he has experience working with researchers, as do others on the project, so it was more important to have something functional for the researchers and then to ask about changes. His point was that if you ask people what they want when they’ve never had such a tool, it really won’t be all that helpful and you will bog down in that process and never create anything. To someone in libraries, this rings very true.
On the discovery side, they want to offer ways to deal with sets, so that once they are found, they can be moved, intact, to a local file system, for example. Beyond that, they’re creating conversions so that a Sidora set can be converted into a Galaxy set and thus be ready for analysis when added to Galaxy.
This is oversimplifying, but in answer to a question he spoke about the connection between the repository and local user file systems, i.e.- a quasi-Dropbox function where data moves into and out of the repo via file structures. This is an exciting prospect, since it means that the researcher’s workflow and habits don’t need to change much, but their work is being synced to the repo. They don’t have to go through the door, or use any particular interface. Seems like the way of the future.
Digital Natives or Digital Naives? The Role of Skill in Internet Use
Eszter Hargittai, Northwestern, webuse.org
Started with the simple insight that being online is not synonymous with being an efficient or effective user of technology. Surely this resonates with nearly everyone in the audience. Also noted that even though the Web has been around for a while now, we still have those who hype its potential and those who find it all overblown.
Pointed out that online behaviours have real world consequences, e.g.- people being fired over Facebook posts. The Web is also an excellent platform for spreading disinformation, such as all of the anti-vaccine bits that have such traction.
Debunked some of the myths about the ‘net generation.’ They are users, but not necessarily savvy about it.
She then pointed out that we have a lot of data about Internet uses, but not a lot of fundamental data about the average Internet user, and that this is a key distinction. The data she relies on stems from in-person observations and interviews and surveys that she has done over the last decade or so. In one of her major surveys of UIC students, she used pencil and paper surveys, noting that if one is assessing or exploring the notion of skill, using an online survey isn’t a wise method. The survey has been repeated three times, from 2009 to 2012.
An aside: to ascertain data quality, she puts questions in her surveys that tell the respondent “for this question, check very often” just to see if people are reading the questions. Any wrong response to this question means that the results are pulled from the pool. She has a 4% rejection rate based on this method.
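The screening step described above is easy to express in code. A minimal sketch follows, assuming responses are stored as dictionaries; the field name `attention_check` and the sample responses are made up for illustration and are not data from the talk.

```python
def filter_attentive(responses, check_key="attention_check", expected="very often"):
    """Keep only responses that answered the instructed item as told.

    Returns the kept responses and the share that was rejected.
    """
    kept = [r for r in responses if r.get(check_key) == expected]
    rate = (len(responses) - len(kept)) / len(responses) if responses else 0.0
    return kept, rate


# Made-up responses for illustration only.
responses = [
    {"id": 1, "attention_check": "very often"},
    {"id": 2, "attention_check": "never"},
    {"id": 3, "attention_check": "very often"},
    {"id": 4, "attention_check": "very often"},
]

kept, rate = filter_attentive(responses)
print([r["id"] for r in kept], rate)  # [1, 3, 4] 0.25
```

The whole response is dropped when the check fails, not just the one answer, which matches the description of results being pulled from the pool.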
In the survey, she assesses their knowledge of some basic Internet terminology, e.g.- reload, bcc, pdf, etc. With regard to bcc, 34% could not–in a multiple choice format–identify what it is. Phishing performed similarly poorly. She gave them a list of four URLs and asked which likely represented the URL of a major bank. Only 12% got this right. Frightening stuff, but not surprising for those engaged in Web work.
Her results show that socioeconomic status relates to skill level, rising with status. The trend from 2009 actually didn’t change in 2012. Everyone went up overall, but the curve looks the same. In other words, people aren’t catching up.
She articulated that socioeconomic status correlates clearly to participation in social media, with participation going up with status. This has implications, she pointed out, for our marketing efforts via these platforms.
One general conclusion she illustrated is that although participation online does improve any number of outcomes (such as academic achievement) those higher in status benefit more, so while everyone is going up, the gap between those at the top and bottom of the SE scale is widening.