Stardog:

A Deep Dive

Michael Grove

Clark & Parsia, LLC

About Us

Founded in 2005; offices in Washington DC & Boston
Customers in US Gov't, banking/financial, energy, health/bio, retail
Strong academic partnerships in US, UK, Europe, and Mexico
Expertise in

Information Integration, Expertise Location, Policy Management, Enterprise Decision Support, Application Development
OWL/RDF/SPARQL/SWRL

Use Cases

Customer 360

Unify customer information: integrate all data about a customer as it's discovered

Past, Present, and Future
Pull from a variety of sources, many of them non-relational, including unstructured ones
Take advantage of flexible nature of semantic technology

Data Provenance

Capture the who-what-when-where-why-how of data throughout its lifecycle
Utilize this information to enable data governance and regulatory compliance
Annotate data as it comes in; continuously updated
W3C spec dedicated to this: PROV

Reference Data

Create a 'gold standard' for names, labels and identities

Represent core industry terms and concepts
Ties into Data Provenance

Modeling complex relationships between entities can be trivial using semantic technology
FIBO is a great example

Compliance

Reduce compliance efforts to query answering and graph analytics
Legal regulations are complex

And tracking related policies is a time-consuming job
Cost of implementation high, cost of a failure catastrophic

Utilize reasoning & rules

Express regulations and policies as complex relationships
Workflows and compliance checking can be performed by a reasoner

Automated compliance analysis with explanations

Analytics & Decision Support

Empower human decision making with contextualized, relevant information
There is a lot of value in querying structured information automatically extracted from unstructured data

As you build up a structured corpus from data sources, you create actionable information
Sift through the data to find the facts so a human can make decisions more quickly and easily

What's the Common Thread?

All information integration problems

i.e. not really financial services problems

So how do you solve them?

Specifically, what's the best way to perform information integration?

Semantic Graphs


Create graphs with meaning

Encoded within the graph

By giving formal, declarative definitions of the nodes and edges
Using a high-level language

Specifically, to create computer understandable meaning

So the computer can help

This lets us use the appropriate abstractions
And is the obvious choice for information integration problems

Benefits of Declarative

Let non-programmers perform complex information processing tasks without writing code
More directly capture expertise

By letting the actual experts author the business logic

Easier and more maintainable for programmers, too

Using the appropriate abstractions
Inference rules & queries

So the computer can do the work

Prospects...

Fortune 50 IT...OEM
Fortune 250 Publishing
Top 50 private American firm in business publishing

Stardog

Leading RDF graph database
Pure Java
Community, Developer & Enterprise Editions
Great developer experience

Rich feature set
Currently version 2.2.1
56 public releases in 3 years (!)

Enterprise Features

HA Cluster (beta) offers strong consistency guarantees (2PC)

Open source cluster deployment tool for AWS

JMX server monitoring
Hot Backup & Restore
Access/Audit logging
Web console built on Stardog Web Framework
PROV and SKOS support
ACID Transactions
Rich Security model

Stardog Cluster

HA Cluster
Active Replication

2PC-based commit protocol for strong consistency
Writes processed by coordinator to determine order of operations
Reads are distributed over all nodes
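
The commit flow above can be sketched as a toy two-phase commit. This is a minimal illustration, not Stardog's implementation: `Node`, `Coordinator`, and the "unhealthy node votes no" behavior are assumptions made for the example, and a real cluster would handle a failed node differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One replica in a toy cluster (hypothetical, not a Stardog class)."""
    name: str
    healthy: bool = True
    data: List[str] = field(default_factory=list)
    staged: Optional[List[str]] = None

    def prepare(self, batch: List[str]) -> bool:
        """Phase 1: stage the write and vote yes, or vote no if unhealthy."""
        if not self.healthy:
            return False
        self.staged = list(batch)
        return True

    def commit(self) -> None:
        """Phase 2: make the staged write visible."""
        self.data.extend(self.staged or [])
        self.staged = None

    def abort(self) -> None:
        """Phase 2 (failure path): discard anything staged."""
        self.staged = None


class Coordinator:
    """Orders writes and applies each one to every node or to none."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes
        self._reads = 0

    def write(self, batch: List[str]) -> bool:
        votes = [node.prepare(batch) for node in self.nodes]
        if all(votes):
            for node in self.nodes:
                node.commit()
            return True
        for node in self.nodes:
            node.abort()
        return False

    def read(self) -> List[str]:
        """Round-robin read: any node can answer, all hold the same data."""
        node = self.nodes[self._reads % len(self.nodes)]
        self._reads += 1
        return list(node.data)


cluster = Coordinator([Node("n1"), Node("n2"), Node("n3")])
cluster.write(["t1", "t2"])        # every node votes yes, so the write commits
cluster.nodes[1].healthy = False
cluster.write(["t3"])              # one "no" vote aborts the write everywhere
```

Because a write either lands on all nodes or on none, a read served by any node sees the same data, which is what makes distributing reads safe.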

Performance

Query

Querying 100M triples: throughput of 3M+ queries per hour; 1B triples: ~500k queries/hour; 10B triples: ~40k queries/hour

This is BSBM with 64 concurrent clients

Fastest SP2B benchmark results at 5M, only known implementation to complete 25M, close to completing 100M

Scale

Up to 50B triples/quads on modest hardware

Load rates up to 500k triples/second

That's 100M triples in 3 minutes, 1B in 30, and 20B in 20 hours.
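
As a back-of-the-envelope check on these load figures, the sketch below compares the time each load would take at the quoted peak rate with the average rate implied by the quoted duration. The helper names are made up for the example; only the numbers come from the slide.

```python
PEAK_RATE = 500_000  # triples per second, the "up to" figure quoted above


def seconds_at_peak(triples: int, rate: int = PEAK_RATE) -> float:
    """Load time if the peak rate were sustained for the whole load."""
    return triples / rate


def implied_rate(triples: int, seconds: float) -> float:
    """Average rate implied by a quoted (size, duration) pair."""
    return triples / seconds


quoted = {
    100_000_000: 3 * 60,        # 100M in 3 minutes
    1_000_000_000: 30 * 60,     # 1B in 30 minutes
    20_000_000_000: 20 * 3600,  # 20B in 20 hours
}

for size, secs in quoted.items():
    print(f"{size:>14,} triples: {seconds_at_peak(size) / 60:7.1f} min at peak, "
          f"quoted time implies {implied_rate(size, secs):,.0f} triples/s")
```

The 100M and 1B durations appear to be rounded from roughly 3.3 and 33 minutes, while the 20B figure implies an average rate well below the peak, which is consistent with the "up to" wording.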

Query

SPARQL 1.1

Update, query, graph protocol

Custom query planner, optimized for complex queries

Targets BI/analytic queries
And also reasoning
But does not sacrifice performance at low scales or with simple queries

Scalable query answering

Intermediate results can get big, and fast
Runtime will automatically flow results off-heap, and then to disk as needed

Query management

Developers

First and foremost, we are developers too

We aim for the best out-of-the-box experience possible
Excellent documentation
Easy installation, just unzip

We <3 the command line

Modeled after the Git command line, with autocomplete support

Like Java?

Jena, Sesame, Spring, SNARL?

Prefer JavaScript, Ruby, Python, .NET, Groovy, or Clojure?

Annex middleware: pure REST plus JSON-LD to shield developers from semantic graph details

Full Text Search

Embeds Lucene

Automatically managed by database as if another RDF index

Enables full-text searches over your RDF

Literals are indexed by Lucene
Uses the Lucene query language

Seamless integration via SPARQL

Join results of full-text searches with regular SPARQL query

Also available via SNARL Java API
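
The idea of joining full-text hits with a graph pattern can be sketched without Lucene at all. This toy uses a whitespace tokenizer and an in-memory index; the `Literal` marker class, the function names, and the sample data are assumptions for the example, and none of it reflects Stardog's actual search syntax.

```python
from collections import defaultdict


class Literal(str):
    """Marks an object value as a literal rather than a graph node."""


def build_text_index(triples):
    """Map each lowercase token in a literal to the subjects that carry it."""
    index = defaultdict(set)
    for s, _, o in triples:
        if isinstance(o, Literal):
            for token in o.lower().split():
                index[token].add(s)
    return index


def search_typed(triples, index, term, rdf_type):
    """Join full-text hits for `term` with the pattern `?s a rdf_type`."""
    hits = index.get(term.lower(), set())
    typed = {s for s, p, o in triples if p == "a" and o == rdf_type}
    return sorted(hits & typed)


triples = [
    ("b1", "a", "Book"),
    ("b1", "title", Literal("Graph Databases")),
    ("p1", "a", "Person"),
    ("p1", "name", Literal("Graph Fan")),
]
index = build_text_index(triples)
books = search_typed(triples, index, "graph", "Book")
```

Both subjects match the text search, but only `b1` survives the join with the type pattern.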

Graph Analytics

Coming in Stardog 2.3
RDF graphs are still just graphs
Graph measures: in-degree, out-degree, PageRank, betweenness centrality
Clustering: weakly/strongly connected components, clique finding
Path finding: BFS and shortest path
Seamless SPARQL integration
Adding support for (de facto) graph standard: TinkerPop 3
Native implementation for Gremlin, TinkerPop 3 based on PSW/PAL work from CMU
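
Since RDF graphs are still just graphs, two of the measures listed above are easy to sketch over plain triples. This is a stdlib-only illustration with made-up data, not the Stardog 2.3 API.

```python
from collections import Counter, deque

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "knows", "carol"),
    ("carol", "knows", "dave"),
]


def degrees(triples):
    """Return (in_degree, out_degree) counters keyed by node."""
    out_deg = Counter(s for s, _, _ in triples)
    in_deg = Counter(o for _, _, o in triples)
    return in_deg, out_deg


def shortest_path(triples, start, goal):
    """Breadth-first search over directed edges; returns a node list or None."""
    adjacency = {}
    for s, _, o in triples:
        adjacency.setdefault(s, []).append(o)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


in_deg, out_deg = degrees(triples)
path = shortest_path(triples, "alice", "dave")
```

Edge labels are ignored here; a real implementation would let the caller restrict traversal to particular predicates.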

Graph Versioning

Version control is insanely useful

Sometimes I wonder how people live without it
So why not for an RDF database?

Stardog adds commit management features similar to those of many popular version control systems

Add metadata, like comments, to commits
Create tags
Revert to a previous version
Get diffs between versions

Oh, all of this is stored as RDF

So you can query your version history
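
The commit, tag, diff, and revert operations above can be sketched over sets of triples. This is a toy under simplifying assumptions: `VersionedGraph` is a hypothetical class, it stores a full snapshot per commit, and it does not store its history as RDF the way the slide describes.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]


class VersionedGraph:
    """Toy commit log over a set of triples (hypothetical, snapshot-based)."""

    def __init__(self) -> None:
        self.commits: List[Tuple[FrozenSet[Triple], str]] = [(frozenset(), "empty graph")]
        self.tags: Dict[str, int] = {}

    def commit(self, add: Iterable[Triple] = (), remove: Iterable[Triple] = (),
               comment: str = "") -> int:
        """Apply removals then additions to the head; return the new version."""
        head = self.commits[-1][0]
        snapshot = (head - frozenset(remove)) | frozenset(add)
        self.commits.append((snapshot, comment))
        return len(self.commits) - 1

    def tag(self, name: str, version: int) -> None:
        self.tags[name] = version

    def diff(self, old: int, new: int) -> Dict[str, FrozenSet[Triple]]:
        a, b = self.commits[old][0], self.commits[new][0]
        return {"added": b - a, "removed": a - b}

    def revert(self, version: int) -> int:
        """Record a new commit whose content equals an earlier version."""
        snapshot, _ = self.commits[version]
        self.commits.append((snapshot, f"revert to version {version}"))
        return len(self.commits) - 1


g = VersionedGraph()
v1 = g.commit(add=[("alice", "knows", "bob")], comment="add alice")
v2 = g.commit(add=[("bob", "knows", "carol")],
              remove=[("alice", "knows", "bob")], comment="swap edges")
g.tag("release-1", v1)
v3 = g.revert(v1)
```

Reverting appends a new commit instead of rewriting history, so the log itself stays queryable.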

What is reasoning?

Make implicit information explicit

Implicit in the schema, or data, or both
Represent domain knowledge in a formal declarative model

Called an ontology

Like UML, but with formal semantics

W3C specification called OWL, Web Ontology Language

Reasoners consume ontologies to derive new information

Answer queries, find inconsistencies

Complex, but manageable

OWL divided into profiles with less expressivity, but better computational properties
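
"Make implicit information explicit" can be shown with the simplest kind of inference, a subclass hierarchy. The sketch below is a toy: the class names are invented, and it materializes results eagerly, whereas a real OWL reasoner covers far more than subclass axioms.

```python
def infer_types(schema, data):
    """Derive type assertions implied by subclass axioms.

    schema: set of (subclass, superclass) pairs
    data:   set of (individual, class) pairs that were explicitly asserted
    Returns only the newly inferred (individual, class) pairs.
    """
    known = set(data)
    changed = True
    while changed:  # repeat until no new facts appear (a fixpoint)
        changed = False
        for individual, cls in list(known):
            for sub, sup in schema:
                if sub == cls and (individual, sup) not in known:
                    known.add((individual, sup))
                    changed = True
    return known - set(data)


schema = {("Manager", "Employee"), ("Employee", "Person")}
data = {("alice", "Manager")}
inferred = infer_types(schema, data)
```

Nothing in `data` says that alice is a Person; that fact is implicit in the schema and data together, and the loop makes it explicit.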

Reasoning

Unmatched OWL support

All OWL2 profiles (RL, EL, QL, DL) and Stardog profile (SL)
Caveats: no equality reasoning, no datatype reasoning, no DL reasoning over your ABox

Query time reasoning

No write performance penalty
Pay for what you use

Explanations

Inference you don't understand?
Reasoner will give you the proof used to derive it!

Reasoning Services

Consistency checking, satisfiability

Stardog Rules

Stardog supports SWRL

Part of the SL profile
You cannot realistically write it by hand; SWRL/RDF is unusable
Much easier to use Stardog Rules

If-Then style rules based on SPARQL syntax:

PREFIX :
PREFIX math: 
IF {
    ?c a :Circle ;
         :radius ?r
    BIND (math:pi() * math:pow(?r, 2) AS ?area)
}
THEN {
    ?c :area ?area
}
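
To make the rule's effect concrete, here is a rough Python equivalent applied to in-memory triples. It is only an illustration: the tuple encoding and the `rdf:type` spelling are assumptions, and it computes the result eagerly, while Stardog evaluates rules at query time.

```python
import math


def apply_area_rule(triples):
    """IF ?c is a :Circle with :radius ?r THEN ?c :area pi * r^2."""
    circles = {s for s, p, o in triples if p == "rdf:type" and o == ":Circle"}
    inferred = set()
    for s, p, o in triples:
        if p == ":radius" and s in circles:
            inferred.add((s, ":area", math.pi * math.pow(o, 2)))
    return inferred


triples = {
    ("c1", "rdf:type", ":Circle"),
    ("c1", ":radius", 2.0),
    ("s1", "rdf:type", ":Square"),
    ("s1", ":radius", 1.0),   # not a circle, so the rule does not fire
}
areas = apply_area_rule(triples)
```

Only `c1` gets an area, because both conditions in the IF block must hold for the same `?c`.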

ICV

Integrity Constraint Validation keeps data safe and consistent
Prevent modifications that violate your integrity constraints

'Guard mode'
Constraint violations abort transactions

Also support 'oracle' mode, aka 'middleware' mode

Outside of a transaction
Check whether data is valid w.r.t. some constraints

Violations can be explained
Inferences can satisfy or violate a constraint
Constraints expressed in SPARQL, OWL, SWRL, or Stardog Rules

High-level declarative languages make it easy to write simple constraints, possible to write complex ones
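
The difference between the two modes can be sketched with a toy store. Everything here is hypothetical (`GuardedStore`, `ConstraintViolation`, and the sample constraint are invented for the example); it only shows the control flow: guard mode validates the would-be state and aborts on violation, while the oracle-style check validates data without changing anything.

```python
class ConstraintViolation(Exception):
    """Raised when a transaction would leave the data invalid."""


def employees_need_name(triples):
    """Sample constraint: every Employee must have a name."""
    employees = {s for s, p, o in triples if p == "a" and o == "Employee"}
    named = {s for s, p, _ in triples if p == "name"}
    return sorted(employees - named)


class GuardedStore:
    def __init__(self, constraints):
        self.triples = set()
        self.constraints = constraints  # callables: triples -> list of violators

    def validate(self, triples):
        """Oracle-style check: report violations without modifying the store."""
        return [v for check in self.constraints for v in check(triples)]

    def transact(self, add=(), remove=()):
        """Guard mode: apply the change only if the resulting state is valid."""
        candidate = (self.triples - set(remove)) | set(add)
        violations = self.validate(candidate)
        if violations:
            raise ConstraintViolation(violations)  # abort, store is unchanged
        self.triples = candidate


store = GuardedStore([employees_need_name])
store.transact(add=[("bob", "a", "Employee"), ("bob", "name", "Bob")])
try:
    store.transact(add=[("eve", "a", "Employee")])  # eve has no name
except ConstraintViolation as err:
    rejected = err.args[0]
```

The rejected transaction leaves no trace in the store, which is the point of guarding writes instead of cleaning up afterwards.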

ICV Example

Every supervisor should supervise at least one employee

Supervisor subClassOf supervises some Employee

IF { 
    ?x a Supervisor 
} 
THEN { 
    ?x supervises ?y . 
    ?y a Employee 
}

select * { 
    ?x a Supervisor. 
    FILTER NOT EXISTS {
        ?x supervises ?y . 
        ?y a Employee 
    } 
}
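
The SELECT query above finds violators by asking for supervisors with no matching `supervises` edge to an employee. The same check over in-memory triples might look like the following; the tuple encoding and the sample data are assumptions for the example.

```python
def supervisor_violations(triples):
    """Supervisors that do not supervise at least one Employee."""
    supervisors = {s for s, p, o in triples if p == "a" and o == "Supervisor"}
    employees = {s for s, p, o in triples if p == "a" and o == "Employee"}
    satisfied = {s for s, p, o in triples if p == "supervises" and o in employees}
    return sorted(supervisors - satisfied)


triples = {
    ("Alice", "a", "Supervisor"),
    ("Carol", "a", "Supervisor"),
    ("Bob", "a", "Employee"),
    ("Carol", "supervises", "Bob"),
    ("Alice", "supervises", "Printer"),  # not an Employee, so it does not count
}
violators = supervisor_violations(triples)
```

Alice is reported even though she supervises something, because the constraint requires the supervised individual to be an Employee.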

Another ICV Example

If a project is funded by only internal funding sources, then it should be approved by the internal budget office


Project and (fundedBy only InternalFundingSource) subClassOf approvedBy value InternalBudgetOffice

select * where { 
    ?x a Project . 
    FILTER NOT EXISTS {
        ?x fundedBy ?y . 
        FILTER NOT EXISTS { 
            ?y a InternalFundingSource 
        } 
    } . 
    FILTER NOT EXISTS {
        ?x approvedBy InternalBudgetOffice 
    } 
}
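
The nested FILTER NOT EXISTS above encodes "funded only by internal sources" as "there is no funder that is not internal". A sketch of the same logic in Python, with invented sample data, makes the double negation easier to read:

```python
def funding_violations(triples):
    """Projects funded only internally but not approved by the budget office."""
    projects = {s for s, p, o in triples if p == "a" and o == "Project"}
    internal = {s for s, p, o in triples if p == "a" and o == "InternalFundingSource"}
    violations = []
    for project in sorted(projects):
        funders = {o for s, p, o in triples if s == project and p == "fundedBy"}
        only_internal = funders <= internal  # no funder outside the internal set
        approved = (project, "approvedBy", "InternalBudgetOffice") in triples
        if only_internal and not approved:
            violations.append(project)
    return violations


triples = {
    ("p1", "a", "Project"), ("p1", "fundedBy", "f_int"),
    ("p2", "a", "Project"), ("p2", "fundedBy", "f_int"),
    ("p2", "approvedBy", "InternalBudgetOffice"),
    ("p3", "a", "Project"), ("p3", "fundedBy", "f_ext"),
    ("p4", "a", "Project"),
    ("f_int", "a", "InternalFundingSource"),
}
violators = funding_violations(triples)
```

Note that `p4`, which has no funders at all, is also reported: "only internal" is vacuously true for it, and the SPARQL query above behaves the same way.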

ICV Explanations

If you are using ICV

You may not understand why a violation occurred
Or want to communicate it to the user

Explanations

Tells you why the violation occurred

Shows exactly the data that caused the violation
Gives you the proof used to derive the violation

ICV Explanation Example

Every Supervisor should supervise at least one Employee

Supervisor subClassOf supervises some Employee
Alice a Supervisor

VIOLATED Supervisor subClassOf (supervises some Employee)
   ASSERTED     Alice a Supervisor
   NOT_INFERRED x a Employee
                Alice supervises x

Admin Console

In Stardog 2.0 we added the Web Console

Expose the features of the stardog CLI in an easy-to-use web interface

Add/Remove data, execute queries, etc.
Or simply browse your data

In 2.2, we added an administrative web console

Create and drop database, manage security, etc.
Everything you can do via the stardog-admin CLI

Questions?

Thanks!

http://clarkparsia.com

http://stardog.com

Transactions & Security

Transactions

ACID
Guarded (optionally) by ICV
2 Phase Commit over all database components

RDF Index, Lucene, KB, etc.
Automatically managed by the database

Security

RBAC model

Based on Apache Shiro
R/W ACLs for access to individual databases
Administrative controls for actions against DBMS

Online/offline a database, modify security settings, etc.

Reasoning Example

For example, enforcing security (ACLs)
Can Bob access Resource1?

Bob is-a Admin OR Bob created Resource1 OR (Bob hasRole ?r AND ?r canAccess Resource1) OR ...

Hard to maintain; encodes domain knowledge in the query
Can leverage reasoning to simplify

Bob canAccess Resource1

More concise and maintainable

Reasoner handles the implementation logic transparently
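
A small sketch of that shift: the access rules live in one inference step, and the question "Can Bob access Resource1?" becomes a single lookup. Only the three conditions spelled out above are modeled; the function names and sample facts are assumptions, and a real reasoner would evaluate such rules at query time.

```python
def infer_can_access(facts, resources):
    """Derive canAccess facts from the three rules named in the slide."""
    inferred = set()
    for s, p, o in facts:
        if p == "a" and o == "Admin":            # admins can access everything
            inferred |= {(s, "canAccess", r) for r in resources}
        elif p == "created":                      # creators can access their own
            inferred.add((s, "canAccess", o))
        elif p == "hasRole":                      # users inherit their roles' access
            inferred |= {(s, "canAccess", res)
                         for role, p2, res in facts
                         if role == o and p2 == "canAccess"}
    return inferred


def can_access(closure, user, resource):
    """The simplified query: a single triple lookup."""
    return (user, "canAccess", resource) in closure


facts = {
    ("Bob", "hasRole", "Editor"),
    ("Editor", "canAccess", "Resource1"),
    ("Carol", "a", "Admin"),
    ("Dave", "created", "Resource2"),
}
closure = facts | infer_can_access(facts, {"Resource1", "Resource2"})
```

Adding a new way to gain access means adding a rule in one place; callers of `can_access` do not change.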
