Although our blog has been a bit silent for the past several months, we've been busy behind the scenes here at the Carnegie Hall Data Lab. If you saw our Request for Proposals post, you know we've been engaged in a deep review of our data management practices, with an eye towards modernizing, streamlining, and better harmonizing the management and flow of information in the Carnegie Hall Rose Archives. While we're close to wrapping up that work, an unexpected predicament arose with our linked data tech stack that challenged us to turn a potential crisis into an opportunity.
Facing a sudden, steep (nearly 40%) increase in the yearly license fee for the cloud-based, managed triplestore service that powers our SPARQL endpoint, we quickly researched alternatives and found a much more cost-effective solution. In the process, we made a few changes under the hood that we think not only improve our data pipeline, but also simplify and clarify our data model.
Simplifying the data model
When we first published our LOD in 2017, one of our primary challenges was that no single ontology existed that could fully express our performance history data model. Rather than attempt to create our own ontology, we decided to work with existing ones, favoring widely-used and well-understood vocabularies such as Schema.org, the Music Ontology, the Event Ontology (itself based on the Music Ontology), Dublin Core, and the Friend of a Friend (FOAF) Ontology. Since we considered the release to be experimental, we took a "let's throw some stuff against the wall and see what sticks" approach, applying overlapping classes and properties for some of the model. While we'd hoped this might provide a crosswalk of sorts, in reality it probably just made querying our data more convoluted and difficult.
You might recall from some of our other posts that we currently manage our performance history data natively in a SQL database, using a data pipeline to map from that proprietary table structure to our RDF data model. Our original pipeline was a functional means to an end and ran smoothly enough, but it was essentially a "processized" version of Rob's original Python scripts from the 2014 LOD prototype. The data schema was a bit buried and difficult to update, and we needed more robust documentation. While we probably could have simply swapped out one triplestore for another at the end of this pipeline, our tech partner, Gabe Mangiante, asked a lot of questions in an effort to understand our data, which in turn led to discussions about how we could clarify and streamline our model. Gabe also ended up completely revamping the pipeline, using a simplified architecture based on functional programming – no classes, minimal global variables, largely side-effect-free functions (other than file I/O as necessary) – that ingests streams of data, performs transformations, and sends the resulting RDF data to the cloud-based triplestore as efficiently and with as little overhead as possible.
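To give a flavor of that functional style, here is a minimal sketch in Python. It is not Gabe's actual pipeline: the table layout, column names, and `example.org` URIs are all assumptions made up for illustration, and it writes N-Triples lines rather than talking to a real triplestore. The point is only the shape: small functions chained over a stream of rows, with no classes and no shared state.

```python
import sqlite3
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespaces and table layout, for illustration only.
BASE = "http://example.org/events/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"


def read_rows(conn: sqlite3.Connection) -> Iterator[tuple]:
    """Stream rows lazily from the (hypothetical) events table."""
    yield from conn.execute("SELECT id, title, date FROM events ORDER BY id")


def literal(value: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def row_to_triples(row: tuple) -> Iterator[Triple]:
    """Pure transformation: one SQL row in, a handful of triples out."""
    event_id, title, date = row
    subject = f"<{BASE}{event_id}>"
    yield (subject, f"<{RDF_TYPE}>", f"<{SCHEMA}Event>")
    yield (subject, f"<{SCHEMA}name>", literal(title))
    yield (subject, f"<{SCHEMA}startDate>", literal(date))


def to_ntriples(triples: Iterable[Triple]) -> Iterator[str]:
    """Serialize each triple as one N-Triples line."""
    for s, p, o in triples:
        yield f"{s} {p} {o} ."


def pipeline(conn: sqlite3.Connection) -> Iterator[str]:
    """Compose the steps; nothing is materialized until the caller iterates."""
    triples = (t for row in read_rows(conn) for t in row_to_triples(row))
    return to_ntriples(triples)


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway in-memory database with one made-up row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, title TEXT, date TEXT)")
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (1, 'Sample "Gala" Concert', "1959-01-27"),
    )
    return conn


if __name__ == "__main__":
    for line in pipeline(demo_connection()):
        print(line)
```

Because each step is a generator, rows flow through one at a time, and the only side effects sit at the edges (reading from the database, writing the output).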
Prioritizing Schema.org classes and properties
As far as our data model is concerned, the main change was to remove as many of the additional layers as possible, flattening and simplifying the structure. We did this by pulling back to Schema.org wherever equivalent classes and properties could be found. For example, in our original model, a Carnegie Hall event was assigned three overlapping classes, one each from Schema.org, the Event Ontology, and CIDOC-CRM.
The Event Ontology and CIDOC-CRM classes have now been removed, leaving a single classification as a schema:Event.
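As a rough illustration of that flattening step, the Python sketch below drops any non-Schema.org `rdf:type` from a resource that already has a Schema.org type. This is an assumption about how such a cleanup could be done, not a description of our pipeline. The event URI is hypothetical, and since the specific CIDOC-CRM class from the original model isn't named here, `E7_Activity` stands in purely as a placeholder.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_NS = "http://schema.org/"


def flatten_types(triples: Iterable[Triple]) -> List[Triple]:
    """Keep only Schema.org types for subjects that have one.

    Subjects with no Schema.org type keep all their types, so nothing
    is left unclassified. Non-type triples pass through untouched.
    """
    triples = list(triples)
    has_schema_type = {
        s for s, p, o in triples if p == RDF_TYPE and o.startswith(SCHEMA_NS)
    }
    return [
        (s, p, o)
        for s, p, o in triples
        if not (
            p == RDF_TYPE
            and s in has_schema_type
            and not o.startswith(SCHEMA_NS)
        )
    ]


# Hypothetical event URI; the CIDOC-CRM class is a placeholder assumption.
EVENT = "http://example.org/events/1"
ORIGINAL = [
    (EVENT, RDF_TYPE, "http://schema.org/Event"),
    (EVENT, RDF_TYPE, "http://purl.org/NET/c4dm/event.owl#Event"),
    (EVENT, RDF_TYPE, "http://www.cidoc-crm.org/cidoc-crm/E7_Activity"),
    (EVENT, "http://schema.org/name", "Sample Concert"),
]

if __name__ == "__main__":
    for triple in flatten_types(ORIGINAL):
        print(triple)
```

The practical payoff is on the query side: anyone looking for events only needs to ask for `schema:Event`, instead of guessing which of three classes to match.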
In cases where Schema.org didn't have a representative class or property, we kept our original mapping, but scaled back to a single namespace. For example, since Schema.org couldn't represent musical genres as classes or properties, we retained our use of the Music Ontology and eliminated the DBpedia Ontology's class and property for genre. So, for this January 27, 1959 performance by the American Opera Society (the Carnegie Hall debut of the legendary soprano, Maria Callas), the event has the property mo:genre, with a value of our own Carnegie Hall genre URI for opera, which is classified as a mo:Genre.
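In triple form, the genre pattern described above looks roughly like the following Python sketch. The event and genre URIs are made-up `example.org` placeholders, not our real identifiers, and the lookup function is just a stand-in for what a SPARQL query against the endpoint would do.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MO = "http://purl.org/ontology/mo/"  # Music Ontology namespace

# Hypothetical URIs standing in for the real event and genre identifiers.
EVENT = "http://example.org/events/callas-debut"
OPERA = "http://example.org/genres/opera"

TRIPLES: List[Triple] = [
    (EVENT, RDF_TYPE, "http://schema.org/Event"),
    (EVENT, MO + "genre", OPERA),    # the event points at a genre URI
    (OPERA, RDF_TYPE, MO + "Genre"), # which is itself classed as mo:Genre
]


def events_with_genre(triples: Iterable[Triple], genre_uri: str) -> List[str]:
    """Return subjects linked via mo:genre to a URI typed as mo:Genre."""
    triples = list(triples)
    known_genres = {
        s for s, p, o in triples if p == RDF_TYPE and o == MO + "Genre"
    }
    if genre_uri not in known_genres:
        return []
    return sorted(
        s for s, p, o in triples if p == MO + "genre" and o == genre_uri
    )


if __name__ == "__main__":
    print(events_with_genre(TRIPLES, OPERA))
```

With only one namespace in play for genre, there is a single property to follow from event to genre and a single class to check on the other end.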
Below is a basic visual representation of our simplified data model:
More information and documentation can be found in our GitHub repository.
Questions? Please email us!
Header image: Men fixing pipeline, Darlington, The History Trust of South Australia, South Australian Government