Trey is SVP of Engineering @ Lucidworks, co-author of Solr in Action, founder of Celiaccess.com, and a researcher/public speaker on search, analytics, recommendation systems, and natural language processing.

Today, I was in Montreal, Quebec, Canada, presenting a research paper entitled “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. I presented the paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics, where it was accepted for publication.

Published Resources:
Research Paper (arXiv)
Source Code (GitHub)
Solr Contribution: SOLR-9480 (Apache JIRA)


I originally conceived of the core idea underlying the Semantic Knowledge Graph in 2010 and built the first pieces of it in early 2011 as part of a content-based recommendation algorithm I was developing at the time for document-to-document matching (supporting a single-level graph traversal and edge scoring). Over the next few years, I was involved in the development of distributed pivot facets in Apache Solr, and ultimately came to realize that a multi-level graph traversal was possible and could yield very promising results for discovering and scoring the strength of relationships between any entities within a corpus of documents. In 2015, CareerBuilder needed such a capability to drive an ontology learning and reasoning system, a critical component of the semantic search system we were building. That gave me the opportunity to work with the search team to develop (and later open source) this capability, which we called the “Semantic Knowledge Graph”.

Paper Abstract:
This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
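To make the mechanism in the abstract more concrete, here is a minimal Python sketch of the idea: terms are nodes, and an edge between nodes materializes as the set of documents in the intersection of their postings lists. This is only an illustration, not the open-sourced implementation or the Solr code. The function names (build_indexes, materialize_node, related_terms), the toy corpus, the whitespace tokenization, and the z-score-style relatedness measure are all assumptions made for this example.

```python
import math
from collections import defaultdict


def build_indexes(docs):
    """Build an inverted index (term -> doc ids) and an uninverted
    index (doc id -> terms) from a dict of doc_id -> text."""
    inverted = defaultdict(set)
    uninverted = {}
    for doc_id, text in docs.items():
        terms = set(text.lower().split())  # naive tokenization (assumption)
        uninverted[doc_id] = terms
        for term in terms:
            inverted[term].add(doc_id)
    return inverted, uninverted


def materialize_node(inverted, terms):
    """Materialize a node for any combination of terms as the
    intersection of their postings lists."""
    postings = [inverted.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def related_terms(inverted, uninverted, terms, total_docs):
    """Traverse one level out from a node and score each edge.

    The score is an illustrative z-score: how much more often a candidate
    term appears in the foreground set than its corpus-wide rate predicts.
    Returns (term, score, edge_size) tuples, highest score first.
    """
    foreground = materialize_node(inverted, terms)
    n = len(foreground)

    # The uninverted index tells us which terms occur in the foreground docs.
    candidates = set()
    for doc_id in foreground:
        candidates |= uninverted[doc_id]
    candidates -= set(terms)

    scored = []
    for cand in candidates:
        edge = foreground & inverted[cand]      # documents forming the edge
        p = len(inverted[cand]) / total_docs    # background probability
        if p >= 1.0:
            score = 0.0  # a term found in every document carries no signal
        else:
            score = (len(edge) - n * p) / math.sqrt(n * p * (1 - p))
        scored.append((cand, score, len(edge)))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy corpus, invented for this example.
    docs = {
        1: "java developer spring hibernate",
        2: "java engineer spring",
        3: "nurse hospital patient care",
        4: "registered nurse patient",
        5: "java barista coffee",
    }
    inverted, uninverted = build_indexes(docs)
    for term, score, size in related_terms(inverted, uninverted, ["java"], len(docs)):
        print(f"{term}: score={score:.3f}, edge_size={size}")
```

Because nothing about an edge is stored ahead of time, the same functions work for any combination of starting terms, and calling related_terms again on a result gives a multi-level traversal. A real deployment would rely on the search engine's index structures rather than in-memory Python sets.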
