SPARQL City and Benchmarks
We have written in the past about SPARQL, Hadoop and benchmarks. In this post, we take a look at a company that combines all of these subjects, SPARQL City, on the occasion of the results they released after subjecting their product, SPARQLVerse, to the SP2 benchmark.
This year’s NoSQLNow conference was colocated with SemTechBiz, providing vendors a welcoming venue to showcase their offerings and achievements. One announcement in particular caught my eye: SPARQL City’s SP2 benchmark results. As I have long been interested in Linked Data and SPARQL benchmarks, I took the opportunity to look more closely at SPARQL City, the benchmark and what they do. Their CEO, Barry Zane, and the rest of the team were kind enough to brief me about their current status and future plans, so the following is a mix of my own findings and thoughts and their insights.
Let’s start with the basics. SPARQL is a (W3C standardized) query language, similar to SQL, used to query RDF databases much in the same way SQL is used to query relational databases. The difference is that RDF is a graph data structure, which makes SPARQL a graph query language. SPARQL City relies on SPARQL to offer graph analytics, which is a concept I have explored myself: originally, Linked Data Orchestration was conceived as a Linked Data analytics company before it turned into a consultancy.
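To make the "graph query language" point concrete, here is a minimal sketch of how a SPARQL basic graph pattern is matched against RDF triples. It is a toy in-memory matcher written for illustration, not how SPARQL City or any real RDF store works; the `ex:` data and the `match_pattern` and `solve` helpers are assumptions of mine.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical toy data: every RDF statement is a (subject, predicate, object) edge.
TRIPLES: List[Triple] = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:carol", "ex:worksFor", "ex:acme"),
]


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier, if any.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant must match the triple exactly.
            return None
    return result


def solve(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Join the triple patterns one by one (naive nested loops)."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        extended_solutions: List[Binding] = []
        for binding in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    extended_solutions.append(extended)
        solutions = extended_solutions
    return solutions


# Roughly: SELECT * WHERE { ?a ex:knows ?b . ?b ex:knows ?c }
FRIENDS_OF_FRIENDS = [("?a", "ex:knows", "?b"), ("?b", "ex:knows", "?c")]
results = solve(FRIENDS_OF_FRIENDS, TRIPLES)
print(results)
```

The shared variable `?b` is what turns two separate patterns into a two-hop walk over the graph, which in SQL would be a self-join on an edge table.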
RDF databases are not the only ones that accommodate graph structures, but SPARQL is an attractive choice: it is a standardized query language and an HTTP-based communication protocol all in one, and it provides federated querying out of the box. Even though SPARQL has not managed to ride the Big Data and NoSQL wave as much as it could have, there is an ecosystem of RDF stores that keep improving, the language itself is evolving, and adopters using it for data analysis applications include the likes of the BBC and Chevron.
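As a rough illustration of the "query language plus HTTP protocol plus federation" combination, the sketch below builds a SPARQL Protocol GET request for a query that uses the SPARQL 1.1 `SERVICE` keyword to pull part of its data from a second endpoint. Both endpoint URLs are hypothetical placeholders, and the request is only constructed, never sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoints, used only for illustration.
LOCAL_ENDPOINT = "http://example.org/local/sparql"
REMOTE_ENDPOINT = "http://example.org/remote/sparql"

# The SERVICE block asks the local endpoint to evaluate that part of the
# pattern against the remote endpoint and join the results.
FEDERATED_QUERY = f"""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name WHERE {{
  ?person a foaf:Person .
  SERVICE <{REMOTE_ENDPOINT}> {{
    ?person foaf:name ?name .
  }}
}}
""".strip()


def build_query_url(endpoint: str, query: str) -> str:
    """Encode a query as a SPARQL Protocol GET request (`query` parameter)."""
    return endpoint + "?" + urlencode({"query": query})


request_url = build_query_url(LOCAL_ENDPOINT, FEDERATED_QUERY)

# Content negotiation: ask for the standard JSON results format.
REQUEST_HEADERS = {"Accept": "application/sparql-results+json"}

# Round-trip check: decoding the URL recovers the original query text.
decoded_query = parse_qs(urlsplit(request_url).query)["query"][0]
print(request_url)
```

Any HTTP client can then issue the request, which is a large part of why SPARQL endpoints are easy to integrate with.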
There are a few benchmarks used to evaluate the performance of RDF stores, such as LUBM, BSBM and SP2. My personal favorite is BSBM (the Berlin SPARQL Benchmark), because it is the most comprehensive one, both in the use cases it covers and in the number of stores it includes, and it offers a single point of comparison for all included stores.
It has evolved through the years to include use cases covering aspects unique to SPARQL, as opposed to porting SQL benchmarks to SPARQL. Most important in this case, it includes a specific business intelligence use case to address graph queries for typical analytics tasks. In addition, the BSBM team is rigorous in their effort to include as many RDF store vendors as possible in their periodic execution of the benchmark, as well as in their publishing of results.
So this raises the question: why did SPARQL City choose to evaluate their offering using the SP2 benchmark? I could think of a number of potential explanations:
- They do not consider themselves an RDF store per se. Certainly, a turn-key analytics solution is much more appealing than infrastructure to build analytics on. So SPARQL City may not want to present itself as “just” an RDF store to be listed among the likes of Virtuoso or 4store, as such solutions do not enjoy the kind of popularity analytics solutions do. Not to mention, some of these may have been hard to beat.
- They did not want to wait until the next round of BSBM execution was scheduled. BSBM is executed periodically by the team that maintains the benchmark, and it may have been messy and time-consuming to set it up and run it outside that process in time to have results to announce at the conference.
- They preferred the SP2 benchmark over BSBM. Universal vendor policy: if you are going to run benchmarks, pick something that will make you look good. Available results for the competition (such as Dydra and Stardog) date back to 2011, so looking good is easier.
It seems that all three apply to some extent. Although SPARQL City relies on RDF and SPARQL, it does not position itself in this market, but rather in the analytics market. This has some interesting implications for the way their solution works on the technical side, which we are going to cover in the next part of this post. For the time being, suffice it to say that SPARQL City is Hadoop-based.
In all honesty though, benchmarking is always a tricky topic, unless, as in the BSBM case, all participating vendors compete at the same time, on the same configuration and under the auspices of a regulating entity. But just to close the benchmarking issue before moving on to see what goes on under the hood in SPARQL City, I need to note one last thing.
For a vendor like SPARQL City that wants to be considered an analytics solution, perhaps it would have made more sense to go for BSBM. SP2 includes a number of queries, while BSBM also groups its queries into use cases, and an analytics use case is the latest addition. Plus, BSBM is more recent and up to date than SP2, which dates back to 2009, when SPARQL 1.1 had not been introduced yet. This is important, as it is in this version that many features pertaining to analytics, such as aggregates, were introduced.
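To give a feel for what SPARQL 1.1 added for analytics, here is a small sketch of an aggregate query using `GROUP BY`, `COUNT` and `AVG`, none of which exist in SPARQL 1.0. The query is not taken from BSBM or SP2; the `ex:` vocabulary, the sample rows and the `aggregate` helper are assumptions, and the Python function merely emulates what the aggregation computes.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# A SPARQL 1.1 aggregate query over a hypothetical ex: vocabulary.
AGGREGATE_QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?vendor (COUNT(?offer) AS ?offers) (AVG(?price) AS ?avgPrice)
WHERE {
  ?offer ex:vendor ?vendor ;
         ex:price  ?price .
}
GROUP BY ?vendor
"""

# Hypothetical solutions of the WHERE clause, before aggregation.
ROWS: List[Dict[str, object]] = [
    {"offer": "ex:o1", "vendor": "ex:v1", "price": 10.0},
    {"offer": "ex:o2", "vendor": "ex:v1", "price": 20.0},
    {"offer": "ex:o3", "vendor": "ex:v2", "price": 5.0},
]


def aggregate(rows: List[Dict[str, object]]) -> Dict[str, Dict[str, float]]:
    """Emulate GROUP BY ?vendor with COUNT(?offer) and AVG(?price)."""
    prices_by_vendor = defaultdict(list)
    for row in rows:
        prices_by_vendor[row["vendor"]].append(row["price"])
    return {
        vendor: {"offers": len(prices), "avgPrice": mean(prices)}
        for vendor, prices in prices_by_vendor.items()
    }


summary = aggregate(ROWS)
print(summary)
```

With SPARQL 1.0 this kind of roll-up had to happen in application code after fetching every row, which is why a benchmark designed before 1.1 cannot exercise it.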