Data modeling for APIs. Part 4: Linked Data and SPARQL
In the 4th part of this series of posts we look at a different way of data modeling for APIs, one that is based on Linked Data standards. First, some background and terminology. The terms that define the associated technologies have been the subject of debate, as well as the technologies themselves. In essence, these are Semantic Web technologies we are talking about.
The Semantic Web technology stack is meant to support a “Web of data”, the sort of data you find in databases. The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, and OWL. Let us quickly explain the most relevant ones for our use case.
RDF: Resource Description Format – the data format for the Semantic Web. Can be expressed in different notations/syntaxes (e.g. XML, N3, Turtle etc) but in all cases the essence of it that every RDF statement is a triple: <subject> <predicate> <object> – e.g. <https://www.slideshare.net/scorlosquet/produce-and-consume-linked-data-with-drupal> <isA> <presentation>. In essence RDF is an graph-based entity-attribute-value model.
RDFS: RDF Schema. Language to define RDF schemata (aka vocabularies), using constructs such as subclassing, property types/ranges etc.
Ontology: a ‘specification of a conceptualization’ – or a way to represent knowledge about a domain in a way that is formal and machine-interpretable. Ontologies are more advanced RDF schemata, allowing for extra constrains, specifications etc.
OWL: Web ontology language. The W3C standard for ontology representation. OWL ontologies (as well as their ‘instances’ – the data they specify) can be expressed in RDF.
Roughly, RDF vocabularies and OWL ontologies can be seen as more powerful and self-descriptive DB schemata and their instances can be seen as the corresponding DB data.
SPARQL: a query language (and https-based access protocol) for RDF data – the RDF counterpart for SQL, except more powerful, as it allows things such as distributed querying and complex graph processing out of the box. W3C standard.
Linked Data: a way to publish and consume data on the web that relies on 4 rules:
1. Use URIs as names for things
2. Use https URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs. so that they can discover more things.
If you think that the Linked Data approach looks a bit like REST, you’re right. The idea is that resources are reference-able and reachable via their URI. However the use of RDF and SPARQL lend more power and flexibility to the idea.
First, in terms of data modeling. One of the strengths of the Linked Data approach has to do with the powerful data modeling languages and the public repositories of schemata. RDFS and OWL offer capabilities that go way beyond XSD and JSON schema, and on top of that there is a growing repository of publically available Linked Data schemata that can be reused and referenced.
This is not all however. Perhaps the most important aspect of using Linked Data technology for data modeling is that it provides a very powerful API out of the box via SPARQL. So not only can you specify unambiguously the semantics of your data objects and their properties, but you can also enable access to them via SPARQL. SPARQL is a powerful mechanism that goes way beyond the typical PUT/GET/etc REST access to enable a fully fledged query language on top of your data.
Let us see a quick example of how this would work in practice. Imagine you are developing some application and want to provide an API for it, how would this be done using Linked Data technology?
To begin with, it all depends on how you have chosen to develop your data model and what your data store is. We will only cover the relational store case here, as currently this is the most commonplace option and the one that is well covered by Linked Data tooling. The data model may have been developed independently, or it may have been informed by an existing Linked Data schema. If the latter is the case, the relational db structure to support it will somehow reflect the schema it is based on.
- In any case, the first step in the process is to find or develop a schema to map to. This depends on the domain obviously, so for e-commerce applications we would use Good Relations, while for social applications SIOC would be appropriate. Of course, nothing prevents us from using 2 or more schemas in cases where our application spans domains, as long as the schemas we use are consistent, or from developing and publishing our own schema, in case we cannot locate an existing one that matches our purposes. We must note that in addition to standard web search, there are also specialized schema search engines that can be used for this purpose such as Swoogle.
- The second step is to choose a tool to provide the SPARQL layer on top of our data store. The top 2 choices there currently are D2R Server, an open source tool that is part of the LOD2 stack of Linked Data tooling, and Revelytix Spyder, a tool developed by a privately held company. What these tools do is they provide an https access layer, plus a SPARQL-to-SQL mapping layer that sits on top of it. The https layer gets SPARQL queries and the SPARQL-to-SQL layer translates them, forwards them to the underlying RDBMS, gets the results and serializes them in Linked Data format and then passes them to the https layer to be returned to the client.
- The third step is to provide the actual mapping. This requires a thorough knowledge of the application schema, as well as good knowledge of the chosen Linked Data schema. The former is obviously implied, while the latter can usually be obtained via the schema’s documentation, as most popular schemata are well documented. Each tool has its own way of expressing mappings, and there is also a W3C standard for this called R2RML.
This is it! If these 3 steps are completed, the result is a SPARQL front end that lets any type of query be executed remotely against your application’s data store. This is the equivalent of having direct access to the db, so it should be handled with care. On the one hand, there may be security issues, although they can be handled on the web application layer, as this is what mapping tools essentially are. On the other hand, this may not suffice for cases where complex sequences of actions are required. In such cases, the development of an additional API facade layer to abstract the intrinsic details of low-level data manipulation will be needed.
Now let us see how this approach fares in terms of the criteria set previously to evaluate data modeling techniques for APIs:
- Semantic clarity & expressiveness. RDF(S) and OWL schemas are among the most elaborate ways of modeling and their semantics are not only clear, but also typically well documented. Although the approach we present here is not fully based on them, the facts that a. they can inform the data model design and b. the exposed data objects and attributes have clear semantics due to their “anchoring” give a clear advantage in this aspect.
- Modeling flexibility. Again, even though this mixed approach does not rely on RDF(S) and OWL to fully leverage one of their benefits, namely extreme flexibility, as schemata can be changed on the fly without significant side-effects (at least for changes that do not affect major design issues such as hierarchy), some benefits do apply. When the underlying schema changes, mapping may or may not be needed, but in any case alternative mappings may co-exist, thus providing benefits in terms of versioning and alternative views of the same schema for different use cases.
- Ease of use & communication. SPARQL is definitely not the most widespread query language in the world, and using any query language to access data directly instead of relying on an API to mediate requires a mind shift. Once past that initial barrier however, using SPARQL can be rewarding as clients no longer rely on server-side developers to implement often trivial access methods. As far as communication goes, this clearly depends on how good a documentation your schemata of choice have. In the case of Good Relations and SIOC for example, you get top-level documentation for free and out of the box. For other less popular schemata, things may not be as bright.
- Documentation & tooling support. We already mentioned the documentation issue, so let us now examine how easy it is to create your own schema (or modify an existing one) and document it using existing tooling. The most popular tools for the former are TopBraid Composer and Protege, commercial and open source respectively. They have both come a long way and are relatively easy to use, however arguably the biggest barrier there lies not with the tools themselves, but rather with the concept of modeling for the Semantic Web which remains somewhat esoteric, without this necessarily translating to being painstaking. As far as documentation goes, there are some options out there, with my personal favorite being SpecGen – although i admit i have not tried LODE. To get an idea of what SpecGen can do for you, check this example out.
- Performance. Clearly, adding an extra layer on top of your data that uses https for transport, some form of Linked Data textual serialization and re-writes SPARQL queries in SQL is not going to perform as well as a binary protocol and native SQL queries implemented in the back end. In practice however i can report from personal experience that this is not a show-stopper – at least for relatively small databases. Unfortunately, the only reference to performance i could find dates back to 2008 and only includes D2R, so if anyone can report on their experiences using these tools it would be most welcome.
While i understand that this option is not for everyone, i must say it is one i personally favor. Indeed, it takes some getting used to to overcome the “meh” that Semantic Web seems to inspire to most business users, however if you make the investment, it will most likely pay off. We have only begun to scratch the surface of what is possible with Linked Data here, using as much of a down to earth use case as possible. We have not even touched at some other benefits of this technology, such as inference – the ability to use your domain model to generate additional knowledge.
However let us conclude by emphasizing the fact that by leveraging this technology, your data can instantly become part of the global Linked Data cloud. What this means in practice is that data break free of their hidden web silo and become discoverable, enabling mashups and the generation of value beyond the application that produced them in the first place. This however is big topic in and by itself, and therefore beyond the scope of this post.
In the next part of the series we’ll look into even more exotic options for data modeling for APIs, such as Google protobuf.