DH 2015 Sydney notes – Thursday
Morning Long Paper Session
An Entity-based Approach to Interoperability in the Canadian Writing Research Collaboratory
Susan Brown – U of Guelph, U of Alberta; Jeffery Antoniuk, Michael Brundin, John Simpson, Mihaela Ilovan – U of Alberta; Robert Warren – Carleton University
Proposing how applying linked open data to CWRC projects would facilitate “rudimentary interoperability.” Susan briskly described their entities and their use of authorities. They are trying to balance two kinds of projects: sophisticated DH projects that use well-regularized metadata, and those that use highly irregular data and metadata. The latter cannot be remediated, as she noted, but the goal is to make them interoperate with other data sets.
One result of doing this work is, for example, the ability to do a multi-lingual entity lookup. It becomes possible to find persons by variant names, etc.
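The kind of variant-name lookup this enables can be sketched against a toy authority record; everything below (the entity ID, the name forms, the `lookup` helper) is invented for illustration, not CWRC's actual data model:

```python
# Toy authority file: one entity ID with preferred and variant name forms,
# including a multilingual variant. All names and IDs are invented.
AUTHORITY = {
    "cwrc:person/0001": {
        "preferred": "Montgomery, Lucy Maud",
        "variants": {"L. M. Montgomery", "Lucy Maud Montgomery",
                     "Монтгомери, Люси Мод"},  # Russian-language variant
    },
}

def lookup(name):
    """Return the entity IDs whose preferred or variant names match."""
    name = name.strip()
    return [eid for eid, rec in AUTHORITY.items()
            if name == rec["preferred"] or name in rec["variants"]]
```

Because every variant form resolves to the same entity ID, a search in any language lands on the same person.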
Noted that the CWRC Writer is designed to make it possible to generate linked open data when doing general XML-based text editing. One aspect it includes is the ability to create an entity, since otherwise when working with LOD one is dependent on external sources, which will often lack information on individuals found within a given realm of textual sources.
From the Holocaust Victims’ Names to the Description of the Persecution of the European Jews in Nazi Years: The Linked Data Approach and a New Domain Ontology. The Italian Pilot Project.
Silvia Mazzini, Laura Brazzo – CDEC
Their aim is to show how to go from the names of the victims of the Holocaust to the victims themselves. Many sites exist with the names of victims, as well as many well-known walls and memorials bearing names. Multiple entries in the various sources can point to the same individual; they showed an example of a man who appeared eight times in a single database, with slight variations between the records. Their question is whether new technologies can help solve this issue. Put simply, in their words:
Millions of Names. How many Victims, though?
They asked: why not create a resource with unique IDs for victims, which would allow the many autonomous databases to continue to exist while these IDs are accessed using linked data? Their pilot project worked with Italian victims who had been documented previously in various resources (papers, books, video testimonies, photographs), which led to their being entered in various CDEC databases (the source was CDEC’s archive and library). Showed an example of a woman who existed in various sources, and thus in the various systems they maintain, and how the data on her varied by source. Now they have a unified entity that points to the various sources.
Linked data seemed to be the best mechanism for publishing their data to the Web. They created a specific Shoah ontology, which can encompass such actions as arrest, detention, and deportation.
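The unifying move can be sketched with triples written as plain tuples; the URIs and predicate names below are invented placeholders, not the actual terms of their Shoah ontology:

```python
# One unified entity URI, with "sameAs"-style links to the several source
# records that describe the same person, plus an event link of the kind the
# ontology models (arrest, detention, deportation). All URIs are invented.
UNIFIED = "http://example.org/shoah/victim/0042"

triples = [
    (UNIFIED, "sameAs", "http://example.org/db-archive/record/17"),
    (UNIFIED, "sameAs", "http://example.org/db-library/record/903"),
    (UNIFIED, "subjectOf", "http://example.org/shoah/event/arrest/5"),
]

def records_for(entity):
    """All source records asserted to describe this entity."""
    return [o for s, p, o in triples if s == entity and p == "sameAs"]
```

The source databases keep their own records; the unified URI is just an extra layer that lets "how many victims?" be answered by counting entities rather than entries.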
The History and Provenance of Cultural Heritage Collections: New Approaches to Analysis and Visualization
Toby Nicolas Burrows – King’s College London
Focused on one particular EU-funded project related to the Thomas Phillipps manuscript collection. Phillipps was an avid, perhaps somewhat overeager, 19th-century collector. In today’s figures, he spent over 100 million pounds on this collection, nearly 2/3 of his annual income; the funds came from his father (Phillipps was the illegitimate son of a mill owner). Burrows showed a recreation of Phillipps’s shelves at the Grolier Club in New York that makes visually clear how obsessive he was about this hobby.
After his death, the collection was widely dispersed. During his life, he attempted to sell it to various parties, including the Bodleian, but failed to do so. His heirs sold it off over the next century, but the dispersal continues; it’s still possible to purchase manuscripts on the market.
Burrows’s project attempts to document not only where Phillipps’s manuscripts came from, but also where they went. Four questions/tasks:
- show the Irish manuscripts
- show all events that link Phillipps to an earlier or later collector
- how many Phillipps manuscripts are in North America?
- what can we learn about the sources of the collection, the nature of its contents, and the extent of its dispersal?
One of his primary research questions, aside from the goal of establishing provenance, was how one can use linked data for this work (he coined the phrase “network archaeology” for it).
Started by consulting the Schoenberg Database of Manuscripts at Penn, which records sales transactions. Consulted other resources as well, since the Schoenberg’s scope is medieval and Phillipps had non-medieval manuscripts in his collection.
Used Neo4j to create a graph database. Useful for displaying network relations. Enabled them to record and describe the various transaction events related to manuscripts. Also worked with Nodegoat. Noted that one of Nodegoat’s strengths is the out-of-the-box visualizations.
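Their graph lives in Neo4j and is queried with Cypher; as a stand-in, here is a pure-Python sketch of the event-centred model they describe, where ownership transactions are the nodes that link collectors to a manuscript. The sample data below is invented for illustration, not drawn from the Schoenberg Database:

```python
# Provenance modelled as transaction events, each linking a seller, a buyer,
# and a manuscript. All records below are invented for illustration.
events = [
    {"type": "sale", "year": 1946, "from": "Phillipps estate",
     "to": "Dealer A", "manuscript": "MS 1234"},
    {"type": "sale", "year": 1977, "from": "Dealer A",
     "to": "A North American library", "manuscript": "MS 1234"},
]

def provenance_chain(ms):
    """Owners of a manuscript, in chronological order of its transactions."""
    evs = sorted((e for e in events if e["manuscript"] == ms),
                 key=lambda e: e["year"])
    if not evs:
        return []
    return [evs[0]["from"]] + [e["to"] for e in evs]
```

Making the transaction an entity in its own right (rather than a bare edge) is what lets questions like “show all events that link Phillipps to an earlier or later collector” be answered directly.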
Afternoon Plenary Panel
Plenary Panel: Indigenous Digital Knowledge
Peter Read – Australian National U / U of Western Sydney; Susan Beetson – Queensland U of Technology; Peter Radoll – University of Newcastle; Julia Torpey – Australian National University; Hart Cohen – University of Western Sydney
Found it somewhat difficult to follow this panel, frankly, because of my overwhelming ignorance of Australia’s history and people. Ended up with the impression that while our work often tackles ‘global’ topics, we need to represent the local and the indigenous in what we do. Heard interesting things about giving voice to those who are typically ignored, storytelling, and recording and preserving stories.
Afternoon Short Paper Session
Remembering Books: A Within-book Topic Mapping Technique
Peter Organisciak, Loretta Auvil, J. Stephen Downie – U of Illinois at Urbana-Champaign
Topic modelling smaller things with less data makes it harder to train the tool. Also, with books, an obvious rebuttal to topic modelling is: “why not just read the book?” The goal of their work, however, is to aid reader understanding and to enhance a reader’s ability to communicate what a book contains.
They work with texts from the HathiTrust Research Center, since these have markup that allows the identification of parts of speech, headers, footers, etc. That makes it easier to focus on content.
Explained why they used LDA and chose to use the page as the unit of content. Used a ‘sliding frame’ (window of multiple pages) to get a bit beyond that. All of their code and tutorials can be found on GitHub.
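The “sliding frame” idea can be sketched as a simple windowing of pages before topic modelling; the window size and step below are arbitrary choices for illustration, not the values they used:

```python
def sliding_frames(pages, size=3, step=1):
    """Group consecutive pages into overlapping windows, so each modelling
    unit carries a bit more context than a single page would."""
    return [pages[i:i + size]
            for i in range(0, max(len(pages) - size + 1, 1), step)]

# Each frame would then be fed to LDA as one "document".
frames = sliding_frames(["p1", "p2", "p3", "p4", "p5"], size=3)
# frames[0] == ["p1", "p2", "p3"], frames[1] == ["p2", "p3", "p4"], ...
```

The overlap means a topic that straddles a page break still shows up in at least one frame intact.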
Visualizing the Digital Mitford Project’s Prosopography Data
Elisa Beshero-Bondar – U of Pittsburgh at Greensburg
First Prezi I’ve seen this year.
Trying to produce the first database of the works and letters of Mary Russell Mitford; a secondary mission is to share knowledge of TEI XML and other humanities computing techniques with scholars who work in this area.
The major challenge with the project is to ensure that all editors are encoding similarly and consistently. They’ve created training materials to achieve this. The backbone of the project is their “site index,” an XML document that contains all of the people (real and fictional) related to Mitford, as well as places, events, artworks, etc.
They’ve produced network graphs, but their real goal is to create graphs that are navigational aids that could be used as a site interface. Not sure how to do that and looking for assistance.
Exploratory Search through Interactive Visualization of Topic Models
Patrick Jähnichen, Patrick Österling, Tom Liebmann, Gerhard Heyer, Christoph Kuras, Gerik Scheuermann – U Leipzig
Presentation written in LaTeX. Made a little joke about this. Looked great, of course.
Used datasets from Stasi records and Eighteenth Century Collections Online (ECCO) for their work. The goal was to create a visual tool that uses the results of topic modelling to create a usable interface that connects works to each other in various ways. He noted that, as computer scientists, they asked themselves which tasks a person might need to perform could be facilitated by topic modelling. In other words, as he put it, it’s an interactive visualization of topic modelling outcomes. They tried to avoid presenting topics as lists of words, as is typical of topic modelling outputs. Now they are working to integrate the beta into the Leipzig Corpus Miner.
Interactive Visual Analysis Of German Poetics
Markus John, Steffen Koch, Florian Heimerl, Andreas Müller, Thomas Ertl, Jonas Kuhn – U Stuttgart
Built a tool called the VarifocalReader to create visual abstractions of document structures. Users can choose their abstractions. Showed a video demo. Hard to follow given the tiny font, but one interesting element is that it includes the scanned pages of texts to allow comparison to source.
Textal: Unstructured Text Analysis Workflows through Interactive Smartphone Visualisations
Steven James Gray, Melissa Terras, Rudolf Ammann, Andrew Hudson-Smith – University College London
Problems with word clouds:
- flat image
- black box
- no statistics behind the words chosen
and that’s just the start.
Textal is a mobile app that allows one to create word clouds from any Web or textual source. Has been on my phone for a couple of years, although I tend to use it rarely. Beyond the flat image, it provides statistics, shows common pairs, calculates the Scrabble score (!), etc. It’s possible to export from the app via email.
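The kind of statistics that lift Textal above a flat word-cloud image can be sketched in a few lines; this is my own toy version of the idea (word frequencies, adjacent pairs, Scrabble scores using the standard English tile values), not Textal's implementation:

```python
from collections import Counter

# Standard English Scrabble tile values.
SCRABBLE = {**dict.fromkeys("aeilnorstu", 1), **dict.fromkeys("dg", 2),
            **dict.fromkeys("bcmp", 3), **dict.fromkeys("fhvwy", 4),
            "k": 5, **dict.fromkeys("jx", 8), **dict.fromkeys("qz", 10)}

def text_stats(text):
    """Word frequencies, common adjacent pairs, and Scrabble scores:
    the numbers a flat word-cloud image hides."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    freqs = Counter(words)
    pairs = Counter(zip(words, words[1:]))
    scores = {w: sum(SCRABBLE.get(c, 0) for c in w) for w in freqs}
    return freqs, pairs, scores
```

Exposing these counts directly addresses the “black box / no statistics” complaints about conventional word clouds.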
Done by a very small team. Goals? Lots of them, but the main one was to build an app, and they wanted to work together.
Described the process they used to build it, since an app needs to be fast and seamless. It relies on some server-side operations to enhance speed; the device and the server race against each other, one doing a good job, the other a “sloppy job,” as he put it. This means that doing the same request twice can yield different results in the app. Textal has an API to take the functionality out beyond the app, although he did say something about this being in the future.