Stardog Unleashed



Michael Grove
Chief Software Architect - Clark & Parsia

About Us


  • Founded in 2005; offices in Washington DC & Boston
  • Application development & consulting
  • Customers in US Gov't, banking/financial, energy, health/bio, retail
  • Strong academic partnerships in US, UK, Europe, and Mexico
  • Experts in all things semantic
    • OWL/RDF/SPARQL/SWRL and any other acronym
    • Information Integration, Expertise Location, Policy Management, Enterprise Decision Support, Application Development

Graphs are for everybody


  • Graphs are everywhere
    • Natural way to model many problems and domains
    • Ideal for data integration
  • Semantics are a natural complement
    • Declarative, formal descriptions of nodes, edges, and their relationships
    • Semantic Graphs
      • Graphs with meaning
      • Obvious choice for data integration and analysis problems

Semantic Graphs


  • Non-programmers exist!
    • Not everything needs to be in code
    • Especially the business logic
    • Can't teach everyone programming
  • What can we do?
    • Encode the business logic using semantics
      • Frees it from the codebase, and thus programmers
        • By encoding it in the graph
    • Lets non-programmers do complex information processing without having to write code
      • Actual experts in business logic can implement it

Smart Data


  • Scale is not a necessary condition for utility
    • Not all problems are solved by adding more data
    • Getting value from data comes down to how easily you can do something with it
  • Smart data is data with semantics attached to it
    • Gives data meaning
      • More specifically, data with a computer understandable meaning
    • Which means the computer can help
      • And that makes it easier to utilize data
      • Analysis, BI, decision support, etc.

Semantic Technologies


  • Based on W3C standards
    • HTTP as the protocol, URL/URI as identifiers
      • RDF: graph data model
      • SPARQL: Query over RDF
      • OWL & Rules: encode business logic in the graph
    • Data format, protocol, semantics; none are proprietary
      • Promotes re-use and makes interoperability easier
        • Semantics do not change; data retains original meaning
  • Both schema-less and schema-rich
    • And everything in between
  • Maintain advantages of graph database while adding features valuable to an enterprise




Stardog


Performance


  • Query
    • Loading lots of data is not useful if you cannot query it
    • Query 100M triples with a throughput of 3M+ queries per hour, 1B with nearly 500k queries/hour, and 10B with nearly 20k queries/hour
      • This is BSBM with 64 concurrent clients
    • Fastest SP2B benchmark results at 5M, only known implementation to complete 25M, close to completing 100M
  • Scale
    • Up to 50B triples/quads on modest hardware
  • Load rates up to 500k triples/second
    • That's 100M triples in 3 minutes, 1B in 30, and 20B in 20 hours.
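As a rough check of the load figures above, the arithmetic can be sketched as follows. The 500,000 triples/second constant is the peak rate quoted on the slide; the helper names are illustrative only, and the slide's times are rounded, so the results match only approximately.

```python
# Back-of-the-envelope check of the quoted load figures.
PEAK_RATE = 500_000  # triples per second, the "up to" rate quoted above


def seconds_at_peak(triples: int, rate: int = PEAK_RATE) -> float:
    """Time to load `triples` if the peak rate were sustained throughout."""
    return triples / rate


def implied_rate(triples: int, seconds: float) -> float:
    """Average load rate implied by a quoted (size, duration) pair."""
    return triples / seconds


minutes_100m = seconds_at_peak(100_000_000) / 60    # a little over 3 minutes
minutes_1b = seconds_at_peak(1_000_000_000) / 60    # a little over 30 minutes
rate_20b = implied_rate(20_000_000_000, 20 * 3600)  # average rate for 20B in 20 hours
```

The 20B figure implies an average rate below the peak, which is consistent with the "up to" wording.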


  • Stardog powers the Best Buy Metis "Like for Like" service
    • Given a product SKU, return most similar products
  • Searches over 60 million triples representing the Best Buy product catalog
  • Also used for the backend to their 2013 Stocking Stuffer promo

US Government Customer


  • Represent security policies in XACML
  • Automatically convert policies to OWL
  • Load into Stardog and use reasoning for ...
    • Are there holes in the policy set?
      • i.e. Do my policies do what I think they do?
    • Do any policies allow actions that other policies deny?
    • Are there any policies which are redundant?
    • Create a unit test suite for policy sets
  • Explanations for traceability into policy issues
  • ICV to keep policy data consistent and correct



  • Rapidly standing up new internal SPARQL endpoints
  • Storing voiD and SPARQL service descriptions in Stardog
    • Use as a catalog of what data is available, and where
  • Using reasoning to specify ACLs for data sources 


  • NASA is a big, interesting organization
    • 100k+ employees across 12 centers throughout the US
    • They have a universe of data
  • But one simple, terrestrial problem: finding experts
    • COTS solutions were not working
    • NASA had all the information they needed in-house
  • Enter POPS
    • Treat expertise location as a data integration problem
      • Create model to unify distinct, but relevant, data sources
    • Use reasoning to infer new connections in the data
    • Saved NASA $38M a year
    • An official W3C case study for use of semantic technologies




Developers



ICV


  • Integrity Constraint Validation keeps data safe and consistent
  • Prevent modifications that violate your integrity constraints
    • 'Guard mode'
    • Constraint violations abort transactions
  • Also support 'oracle' mode, aka 'middleware' mode
    • Outside of a transaction
    • Check if data is valid w.r.t. some constraints
  • Violations can be explained
  • Inferences can satisfy or violate a constraint
  • Constraints expressed in SPARQL, OWL, SWRL, or Stardog Rules
    • High-level declarative languages make it easy to write simple constraints, possible to write complex ones
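The difference between the two modes above can be sketched in a few lines of Python. This is a toy model over an in-memory set of (subject, predicate, object) tuples, not Stardog's API; `commit_guarded`, `missing_supervisees`, and `ConstraintViolation` are hypothetical names introduced for illustration.

```python
class ConstraintViolation(Exception):
    """Raised when a candidate commit would break a constraint."""


def missing_supervisees(triples):
    """Toy constraint: every Supervisor must supervise someone."""
    supervisors = {s for s, p, o in triples if p == "a" and o == "Supervisor"}
    supervising = {s for s, p, o in triples if p == "supervises"}
    return sorted(supervisors - supervising)


def commit_guarded(database, additions, constraint):
    """'Guard mode': validate the would-be state, abort on any violation."""
    candidate = set(database) | set(additions)
    violations = constraint(candidate)
    if violations:
        raise ConstraintViolation(violations)
    return candidate


db = {("Bob", "a", "Supervisor"), ("Bob", "supervises", "Carol")}
change = {("Alice", "a", "Supervisor")}

# 'Oracle' mode: ask whether the data would be valid, without committing.
oracle_report = missing_supervisees(db | change)

# 'Guard' mode: the same check runs inside the commit and aborts it.
try:
    db = commit_guarded(db, change, missing_supervisees)
    aborted = False
except ConstraintViolation:
    aborted = True
```

In both modes the constraint itself is the same; only the point at which it is evaluated, and what happens on a violation, differs.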

ICV Example


Every supervisor should supervise at least one employee
Supervisor subClassOf supervises some Employee  
IF { 
    ?x a Supervisor 
} 
THEN { 
    ?x supervises ?y . 
    ?y a Employee 
}  
select * { 
    ?x a Supervisor. 
    FILTER NOT EXISTS {
        ?x supervises ?y . 
        ?y a Employee 
    } 
} 
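To make the logic of the query above concrete, here is a minimal Python sketch of the same check over an in-memory set of triples. The tuple representation, the function name, and the sample data are assumptions made for illustration; `"a"` stands in for rdf:type.

```python
def supervisors_without_employees(triples):
    """Return subjects typed Supervisor that supervise no Employee."""
    employees = {s for s, p, o in triples if p == "a" and o == "Employee"}
    supervisors = {s for s, p, o in triples if p == "a" and o == "Supervisor"}
    # A supervisor satisfies the constraint if at least one
    # 'supervises' edge points at something typed Employee.
    satisfied = {
        s for s, p, o in triples
        if p == "supervises" and s in supervisors and o in employees
    }
    return sorted(supervisors - satisfied)


data = {
    ("Alice", "a", "Supervisor"),
    ("Bob", "a", "Supervisor"),
    ("Carol", "a", "Employee"),
    ("Bob", "supervises", "Carol"),
}
violations = supervisors_without_employees(data)
```

With this sample data only Alice is reported, since Bob supervises Carol and Carol is an Employee.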

Another ICV Example


If a project is funded by only internal funding sources, then it should be approved by the internal budget office

Project and (fundedBy only InternalFundingSource) subClassOf approvedBy value InternalBudgetOffice 
select * where { 
    ?x a Project . 
    FILTER NOT EXISTS {
        ?x fundedBy ?y . 
        FILTER NOT EXISTS { 
            ?y a InternalFundingSource 
        } 
    } . 
    FILTER NOT EXISTS {
        ?x approvedBy InternalBudgetOffice 
    } 
} 
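The nested FILTER NOT EXISTS pattern above reads as "no funder that is not internal, and no approval". A minimal Python sketch of that reading, assuming triples are plain tuples and using hypothetical project and funder names:

```python
def unapproved_internal_projects(triples):
    """Projects funded only by internal sources but lacking approval."""
    triples = set(triples)
    projects = {s for s, p, o in triples if p == "a" and o == "Project"}
    internal = {
        s for s, p, o in triples
        if p == "a" and o == "InternalFundingSource"
    }
    violations = []
    for project in sorted(projects):
        funders = {o for s, p, o in triples if s == project and p == "fundedBy"}
        # Double negation: there is no funder that is not internal.
        # As in the query, a project with no funders passes vacuously.
        only_internal = all(f in internal for f in funders)
        approved = (project, "approvedBy", "InternalBudgetOffice") in triples
        if only_internal and not approved:
            violations.append(project)
    return violations


data = {
    ("P1", "a", "Project"), ("P1", "fundedBy", "InternalFund"),
    ("P2", "a", "Project"), ("P2", "fundedBy", "InternalFund"),
    ("P2", "approvedBy", "InternalBudgetOffice"),
    ("P3", "a", "Project"), ("P3", "fundedBy", "ExternalGrant"),
    ("InternalFund", "a", "InternalFundingSource"),
}
violations = unapproved_internal_projects(data)
```

Here P1 is the only violation: P2 has the required approval, and P3 has an external funder, so the constraint does not apply to it.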

ICV Explanations


  • If you are using ICV
    • You may not understand why a violation occurred
    • Or want to communicate it to the user
  • Explanations
    • Tells you why the violation occurred
      • Shows exactly the data that caused the violation
      • Gives you the proof used to derive the violation

ICV Explanation Example

Every Supervisor should have at least one Employee

Supervisor subClassOf supervises some Employee
Alice a Supervisor 
VIOLATED Supervisor subClassOf (supervises some Employee)
   ASSERTED     Alice a Supervisor
   NOT_INFERRED x a Employee
                Alice supervises x 

What is reasoning?


  • Make implicit information explicit
    • Implicit in the schema, or data, or both
    • Represent domain knowledge in a formal declarative model
      • Called an ontology
        • Like UML, but with formal semantics
      • W3C specification called OWL, Web Ontology Language
  • Reasoners consume ontologies to derive new information
    • Answer queries, find inconsistencies
  • Complex, but manageable
    • OWL divided into profiles with less expressivity, but better computational properties 
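"Making implicit information explicit" can be illustrated with the simplest possible inference: propagating types along subClassOf. The sketch below is a naive fixpoint loop over tuples, written only to show the idea; it is not a description of how any particular reasoner works, and the class names are made up.

```python
def infer_types(triples):
    """Add (x, a, Super) whenever x a Sub and Sub subClassOf Super."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triple is derived
        changed = False
        subclass_edges = [(s, o) for s, p, o in inferred if p == "subClassOf"]
        for s, p, o in list(inferred):
            if p != "a":
                continue
            for sub, sup in subclass_edges:
                if sub == o and (s, "a", sup) not in inferred:
                    inferred.add((s, "a", sup))
                    changed = True
    return inferred


schema_and_data = {
    ("Supervisor", "subClassOf", "Employee"),
    ("Employee", "subClassOf", "Person"),
    ("Alice", "a", "Supervisor"),
}
closure = infer_types(schema_and_data)
new_facts = closure - schema_and_data
```

Only one type was asserted for Alice; the other two follow from the schema.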

Reasoning


  • Unmatched OWL support
    • All OWL2 profiles (RL, EL, QL, DL) and Stardog profile (SL)
    • Caveats: no equality reasoning, no datatype reasoning, no DL reasoning over your ABox
  • Query time reasoning
    • No write performance penalty
    • Pay for what you use
  • Explanations
    • Inference you don't understand?
    • Reasoner will give you the proof used to derive it!
  • Reasoning Services
      • Consistency checking, satisfiability

    Stardog Rules


    • Stardog supports SWRL
      • Part of the SL profile
      • You cannot write it by hand; SWRL/RDF is unusable
      • Much easier to use Stardog Rules
        • If-Then style rules based on SPARQL syntax:
     
    PREFIX :
    PREFIX math: 
    IF {
        ?c a :Circle ;
             :radius ?r
        BIND (math:pi() * math:pow(?r, 2) AS ?area)
    }
    THEN {
        ?c :area ?area
    } 
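What the rule above computes can be sketched in plain Python. This is an illustration of the rule's effect over tuples, under the assumption that radii are stored as floats; it does not show how Stardog evaluates rules.

```python
import math


def apply_area_rule(triples):
    """For every :Circle with a :radius, derive its :area (pi * r^2)."""
    circles = {s for s, p, o in triples if p == "a" and o == ":Circle"}
    derived = set()
    for s, p, o in triples:
        # The IF part: ?c a :Circle ; :radius ?r
        if p == ":radius" and s in circles:
            # The BIND and THEN parts: ?c :area (pi * r^2)
            derived.add((s, ":area", math.pi * math.pow(o, 2)))
    return derived


data = {
    ("c1", "a", ":Circle"),
    ("c1", ":radius", 2.0),
    ("s1", ":radius", 3.0),  # not typed :Circle, so the rule does not fire
}
derived = apply_area_rule(data)
```

Only `c1` gets an area, because `s1` does not match the type pattern in the IF part.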

    Query


    • SPARQL 1.1
      • Update, query, graph protocol
    • Custom query planner, optimized for complex queries
      • Targets BI/analytic queries
      • And also reasoning
      • But does not sacrifice performance at low scales or with simple queries
    • Scalable query answering
      • Intermediate results can get big, and fast
      • Runtime will automatically flow results off-heap, and then to disk as needed
    • Query management  

    Full Text Search


    • Embeds Lucene
      • Automatically managed by database as if another RDF index
    • Enables full-text searches over your RDF
      • Literals are indexed by Lucene
      • Lucene query language used to search data
    • Seamless integration via SPARQL
      • Join results of full-text searches with regular SPARQL query
    • Also available via SNARL Java API

    Enterprise Features


    • JMX server monitoring
    • High Availability
    • Hot Backup & Restore
    • Access/Audit logging
    • Web console built on Stardog Web Framework
    • PROV and SKOS support
    • ACID Transactions
    • Rich Security model

    What's Next?


    • Graph analytics
    • Model versioning
    • Named graph security
    • Stored Procedures
    • GeoSPARQL
    • Materialized views
    • Equality reasoning
    • Administrative Web Console
    • And as always, faster & more scalable

    Stardog Web


    • Focus on the Web part of Semantic Web
      • Organizations don't always have experts in semtech available
    • Provide a framework that abstracts away these details
      • Stick to well-known web technologies
        • HTML, CSS, Javascript, JSON as data
        • backbone.js as a model layer, SPARQL Routes middleware
    • Goal is to provide good out of the box capabilities
      • Faceted browsing, semantic search, REST, CRUD, etc.
      • Minimal programming or configuration required, just add data
      • Provide basis for quickly building web apps based on semtech
        • Aimed at data discovery/exploration use cases
    • Stardog Web Console built on this technology




    Demo





    Questions?




    Thanks!

    Transactions & Security


    • Transactions
      • ACID
      • Guarded (optionally) by ICV
      • 2 Phase Commit over all database components
        • RDF Index, Lucene, KB, etc.
        • Automatically managed by the database
    • Security
      • RBAC model
        • Based on Apache Shiro
        • R/W ACLs for access to individual databases
        • Administrative controls for actions against DBMS
          • Online/offline a database, modify security settings, etc.
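The two-phase commit bullet above follows the usual pattern: every component must agree to prepare before any of them commits. A generic sketch of that pattern, with a hypothetical `Component` class standing in for the RDF index, Lucene, and so on (this is not Stardog's internal code):

```python
class Component:
    """Toy transactional component; `can_prepare` simulates a failure."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self):
        self.state = "prepared" if self.can_prepare else "failed"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(components):
    """Phase 1: prepare all. Phase 2: commit all, or roll back on failure."""
    prepared = []
    for component in components:
        if component.prepare():
            prepared.append(component)
        else:
            for done in prepared:
                done.rollback()
            component.rollback()
            return False
    for component in components:
        component.commit()
    return True


healthy = [Component("rdf_index"), Component("lucene")]
ok = two_phase_commit(healthy)

broken = [Component("rdf_index"), Component("lucene", can_prepare=False)]
failed = two_phase_commit(broken)
```

In the second run the RDF index had already prepared, so it is rolled back along with the failing component.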

    Graph Analytics


    • Coming in Stardog 2.2
    • RDF graphs are still just graphs
    • Graph measures: in-degree, out-degree, PageRank, betweenness centrality
    • Clustering: weakly/strongly connected components, clique finding
    • Path finding: BFS and shortest path
    • Seamless SPARQL integration
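Since an RDF graph is still just a graph, standard algorithms apply once triples are read as edges. A minimal sketch of BFS shortest path, treating each triple as a directed edge from subject to object and ignoring the predicate (an assumption made here for simplicity; node names are made up):

```python
from collections import deque


def shortest_path(triples, start, goal):
    """Breadth-first search; returns the node list of a shortest path or None."""
    neighbors = {}
    for s, _, o in triples:
        neighbors.setdefault(s, []).append(o)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorted so the result is deterministic when several paths tie.
        for nxt in sorted(neighbors.get(node, [])):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


edges = {
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("c", "knows", "d"),
    ("a", "worksWith", "c"),
}
path = shortest_path(edges, "a", "d")
no_path = shortest_path(edges, "d", "a")
```

The direct `a` to `c` edge makes the two-hop route shorter than the chain through `b`; there is no path back from `d` because edges are directed.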

    Reasoning Example


    • For example, enforcing security (ACLs)
    • Can Bob access Resource1?
    Bob is-a Admin OR Bob created Resource1 OR (Bob hasRole ?r AND ?r canAccess Resource1) OR ... 
    • Hard to maintain, encoded domain knowledge into the query
    • Can leverage reasoning to simplify
    Bob canAccess Resource1 
    • More concise and maintainable
      • Reasoner handles the implementing logic transparently
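The contrast above can be sketched in Python: the access rules are applied once to derive `canAccess` facts, and the question itself becomes a single lookup. The function below is a stand-in for the reasoner, covering only the three conditions spelled out in the long query; the names and sample data are hypothetical.

```python
def derive_can_access(triples):
    """Derive canAccess facts from Admin type, authorship, and roles."""
    facts = set(triples)
    resources = {s for s, p, o in facts if p == "a" and o == "Resource"}
    derived = set()
    for s, p, o in facts:
        if p == "a" and o == "Admin":
            # Admins can access every resource.
            derived |= {(s, "canAccess", r) for r in resources}
        elif p == "created" and o in resources:
            # Creators can access what they created.
            derived.add((s, "canAccess", o))
        elif p == "hasRole":
            # Users inherit access from their roles.
            derived |= {
                (s, "canAccess", r)
                for r in resources
                if (o, "canAccess", r) in facts
            }
    return facts | derived


data = {
    ("Resource1", "a", "Resource"),
    ("Carol", "a", "Admin"),
    ("Dave", "created", "Resource1"),
    ("Bob", "hasRole", "Editor"),
    ("Editor", "canAccess", "Resource1"),
    ("Eve", "a", "Guest"),
}
closure = derive_can_access(data)

# The query is now the short form from the slide.
bob_allowed = ("Bob", "canAccess", "Resource1") in closure
eve_allowed = ("Eve", "canAccess", "Resource1") in closure
```

Adding a new access condition means changing the rules, not every query that asks about access.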

    An overview of the features and performance of Stardog