About Trey

Founder @ Searchkernel, author of AI-Powered Search and Solr in Action, startup advisor, and researcher/public speaker on search, relevance and ranking, recommendation systems, and natural language processing.

The Semantic Knowledge Graph

October 18th, 2016

Today I was in Montreal, Quebec, Canada, presenting a research paper entitled “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. I presented the paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics, where it was accepted for publication.

Published Resources:
Research Paper (Arxiv)
Source Code (Github)
Solr Contribution: SOLR-9480 (Apache JIRA)

Slides:

Background:
I originally conceived of the core idea underlying the Semantic Knowledge Graph in 2010 and built the first pieces of it in early 2011 as part of a content-based recommendation algorithm I was developing at the time for document-to-document matching (supporting a single-level graph traversal and edge scoring). Over the next few years, I was involved in the development of distributed pivot facets in Apache Solr, and ultimately came to the realization that a multi-level graph traversal would be possible and could yield some very promising results in terms of discovering and scoring the strength of relationships between any entities within a corpus of documents. In 2015, CareerBuilder needed such a capability to drive an ontology learning and reasoning system that was a critical component of the semantic search system we were building, so I had the opportunity to work with the search team to develop (and later open source) this capability, which we called the “Semantic Knowledge Graph”.

Paper Abstract:
This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes – along with their combined edges – can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems. The main contribution of this paper is the introduction of a novel system – the Semantic Knowledge Graph – which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
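
As a rough illustration of the core idea (not the paper's actual implementation or scoring formula), the following Python sketch builds a toy inverted index, materializes an edge between two term nodes as the intersection of their postings lists, and scores it with a simplified relatedness measure:

```python
# Minimal sketch of the Semantic Knowledge Graph idea: terms are nodes, and an
# edge between two terms materializes from the documents in which both occur
# (the intersection of their postings lists). The score below is a simplified
# illustration, not the exact relatedness formula from the paper.
from collections import defaultdict
import math

corpus = {
    1: "java developer hadoop spark",
    2: "java developer spring",
    3: "registered nurse hospital",
    4: "hadoop spark data engineer",
}

# Build a toy inverted index: term -> set of document ids (postings list).
postings = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.split():
        postings[term].add(doc_id)

def edge(term_a, term_b):
    """Materialize the edge between two nodes as their intersecting documents."""
    return postings[term_a] & postings[term_b]

def relatedness(term_a, term_b, n_docs=len(corpus)):
    """Simplified strength score: how much more often the terms co-occur than
    expected if they were independent (a stand-in for the paper's scoring)."""
    both = len(edge(term_a, term_b))
    if both == 0:
        return 0.0
    expected = len(postings[term_a]) * len(postings[term_b]) / n_docs
    return math.log(both / expected)

print(edge("java", "hadoop"))          # {1} -> the edge materializes from doc 1
print(relatedness("hadoop", "spark"))  # positive: strongly related
print(relatedness("java", "nurse"))    # 0.0: no shared documents
```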

Lucene/Solr Revolution is always my favorite conference of the year, and I was invited to present again this year in the great city of Boston, MA. It was a fun presentation, representing a whirlwind tour through numerous open-source extensions within the larger Apache Solr ecosystem (the Semantic Knowledge Graph, the Solr Text Tagger, a Probabilistic Query Parser, Dice’s Conceptual Search Plugin, Solr’s Learning to Rank capability, SolrRDF, etc.) that can be combined, along with some query log mining, to build an end-to-end self-learning data system powered by Apache Solr. Videos will be posted soon, but here are the slides from my talk in the meantime!

Slides:

http://www.slideshare.net/treygrainger/reflected-intelligence-lucenesolr-as-a-selflearning-data-system

Video:

Talk Abstract:
What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?

In this presentation, you’ll learn how to do just that – how to evolve Lucene/Solr implementations into self-learning data systems which are able to accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.

Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.
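
As a deliberately simplified illustration of that feedback loop, the sketch below aggregates raw click signals into per-query document boosts; the log format and field names are hypothetical, and a production system would also need to correct for position bias and decay old signals:

```python
# Hedged sketch: aggregate click signals into per-query document boosts.
# The log format (query, doc_id, clicked) is a hypothetical example, not any
# specific product's schema.
from collections import defaultdict

click_log = [
    {"query": "java developer", "doc_id": "job-123", "clicked": True},
    {"query": "java developer", "doc_id": "job-123", "clicked": True},
    {"query": "java developer", "doc_id": "job-456", "clicked": False},
    {"query": "registered nurse", "doc_id": "job-789", "clicked": True},
]

# Count clicks and impressions for each (query, document) pair.
clicks = defaultdict(int)
impressions = defaultdict(int)
for event in click_log:
    key = (event["query"], event["doc_id"])
    impressions[key] += 1
    if event["clicked"]:
        clicks[key] += 1

# Click-through rate becomes a simple relevance boost for that query.
boosts = {key: clicks[key] / impressions[key] for key in impressions}
print(boosts[("java developer", "job-123")])  # 1.0 -> strong boost candidate
```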

I was honored to be invited to present today, along with Khalifeh AlJadda, Lead Data Scientist at CareerBuilder, to a group of nearly 200 Georgia Tech graduate students and other members of the larger Georgia Tech community. We thank Dr. Polo Chau and his Data and Visual Analytics class for inviting us and sponsoring the event, and we also appreciate the many other folks who heard about and attended the presentation!

Khalifeh and I worked closely together while I was at CareerBuilder evolving their semantic search engine and recommendation engine into self-learning data systems. It was great being able to present some of the similar work I am now doing at Lucidworks, along with Khalifeh, who presented much of the work we had done at CareerBuilder, as well as some of the newer techniques they are now applying.

Slides:

http://www.slideshare.net/treygrainger/reflected-intelligence-evolving-selflearning-data-systems

Talk Abstract:
In the big data era, search and recommendation engines have become the primary mechanisms through which users both actively find and passively discover useful information. As such, it has never been more critical for these data systems to be able to deliver targeted, relevant results that fully match a user’s intent.

In this presentation, we’ll talk about evolving self-learning search and recommendation systems which are able to accept user queries, deliver relevance-ranked results, and iteratively learn from the users’ subsequent interactions to continually deliver a more relevant experience. Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the collective feedback from all prior user interactions with the system. Through iterative feedback loops, such a system can leverage user interactions to learn the meaning of important phrases and topics within a domain, identify alternate spellings and disambiguate multiple meanings of those phrases, learn the conceptual relationships between phrases, and even learn the relative importance of features to automatically optimize its own ranking algorithms on a per-query, per-category, or per-user/group basis.

We will cover some of the core technologies that enable such a system to be built (Apache Lucene/Solr, Apache Spark, Apache Hadoop, cloud computing), and will walk through some practical examples of how such a reflected intelligence system has been built and is being leveraged in a real-world implementation.
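
Building on the click-signal sketch earlier on this page, here is a hedged example of applying such learned boosts at query time through Solr's edismax boost query (bq) parameter; the Solr URL, collection, and field names are assumptions for illustration:

```python
# Hedged sketch: apply learned per-query boosts at query time via Solr's
# edismax "bq" (boost query) parameter. The Solr URL, collection name, and
# field names are illustrative assumptions, not a specific deployment.
import requests

def search_with_boosts(query, boosts, solr_url="http://localhost:8983/solr/jobs/select"):
    # Turn each (query, doc_id) boost learned from click signals into a
    # boost query clause, e.g. id:"job-123"^11.0.
    bq = [
        f'id:"{doc_id}"^{1.0 + 10.0 * weight}'
        for (q, doc_id), weight in boosts.items()
        if q == query and weight > 0
    ]
    params = {
        "q": query,
        "defType": "edismax",
        "qf": "title description",
        "bq": bq,           # repeated parameter: one clause per boosted document
        "wt": "json",
    }
    return requests.get(solr_url, params=params).json()

# Example usage with the boosts dictionary from the previous sketch:
# results = search_with_boosts("java developer", boosts)
```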

After 8.5 years at CareerBuilder (most recently as Director of Engineering over search, recommendations, and data analytics), I’m making an exciting transition to join Lucidworks as their new Senior Vice President of Engineering. While it’s bittersweet to be saying goodbye to my amazing team and colleagues at CareerBuilder, I also couldn’t be more excited to be joining such an industry leader in the Search and Information Retrieval space.

What attracted me to Lucidworks is the opportunity to work with visionaries in the search space building search technology that will help the masses derive intelligence from their data both at scale and in the tail. Search is a really hard problem, and I’m excited to be in great company trying to solve that problem well.

For more details about this move, check out my introduction interview with Lucidworks:

  • When did you first get started working with Apache Lucene?
  • How has search evolved over the past couple years? Where do you think it’ll be in the next 10?
  • What do you find most exciting in the current search technology landscape?
  • Where are the biggest challenges in the search space?
  • What attracted you to Lucidworks?
  • What will you be working on at Lucidworks?

Thanks for the warm welcome, Lucidworks! I’m incredibly excited to be working with such a top-notch team at Lucidworks, and am looking forward to building out what will be the most scalable, dependable, easy to use, and highly relevant search product on the market.

I was invited to speak on 2015.11.10 at the Bay Area Search Meetup in San Jose, CA. With over 175 people marked as attending (and several more on the waitlist who showed up), we had a very exciting and lively discussion for almost two hours (about half was my presentation, with the other half being Q&A mixed in throughout). Thanks again to eBay for hosting the event and providing pizza and beverages, and to everyone who attended for the warm welcome and great discussions.

Slides:

http://www.slideshare.net/treygrainger/searching-on-intent-knowledge-graphs-personalization-and-contextual-disambiguation

Video:

Talk Summary:
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning “intent engine”, able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder’s semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.

As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that “hadoop” is the skill “Apache Hadoop”, which is also related to other terms like “hbase”, “hive”, and “map/reduce”. We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
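
For illustration only (this is not CareerBuilder's actual internal format), a structured interpretation of that query might look something like the following:

```python
# Illustrative only: one possible structured interpretation of the raw query
# "Senior Java Developer Portland, OR Hadoop" after entity resolution.
# The schema and expansion terms are assumptions, not CareerBuilder's format.
raw_query = "Senior Java Developer Portland, OR Hadoop"

parsed_query = [
    {"span": "Senior",         "type": "experience_level", "value": "senior"},
    {"span": "Java Developer", "type": "job_title",
     "related": ["software engineer", "java engineer"]},
    {"span": "Portland, OR",   "type": "location",
     "value": {"city": "Portland", "state": "OR"}},   # not a boolean OR
    {"span": "Hadoop",         "type": "skill", "value": "Apache Hadoop",
     "related": ["hbase", "hive", "map/reduce"]},
]

# Such a structure can then be translated into a search-engine query (e.g. a
# geo filter for the location plus boosted clauses for the related skills)
# instead of treating every token as a bare keyword.
for entity in parsed_query:
    print(entity["type"], "->", entity.get("value", entity["span"]))
```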

Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs

Paper Abstract:
As the ability to store and process massive amounts of user behavioral data increases, new approaches continue to arise for leveraging the wisdom of the crowds to gain insights that were previously very challenging to discover by text mining alone. For example, through collaborative filtering, we can learn previously hidden relationships between items based upon users’ interactions with them, and we can also perform ontology mining to learn which keywords are semantically-related to other keywords based upon how they are used together by similar users as recorded in search engine query logs. The biggest challenge to this collaborative filtering approach is the variety of noise and outliers present in the underlying user behavioral data. In this paper we propose a novel approach to improve the quality of semantic relationships extracted from user behavioral data. Our approach utilizes millions of documents indexed into an inverted index in order to detect and remove noise and outliers.
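
The following sketch captures the general intuition, though not the paper's exact method or thresholds: candidate relationships mined from behavioral data are kept only when the phrases also co-occur in the document corpus, checked via intersecting postings lists:

```python
# Hedged sketch of corpus-based noise filtering: keep a candidate relationship
# mined from user behavior only if the two terms co-occur in enough corpus
# documents. The threshold and scoring are illustrative, not the paper's.
from collections import defaultdict

documents = {
    1: "java developer with hadoop and spark experience",
    2: "senior java developer spring boot",
    3: "registered nurse for hospital icu",
    4: "hadoop spark data engineer",
}

postings = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        postings[term].add(doc_id)

# Candidate related-term pairs extracted from query logs (behavioral data).
candidates = [("hadoop", "spark"), ("java", "nurse"), ("java", "developer")]

MIN_CO_OCCURRENCE = 1  # illustrative threshold

validated = [
    (a, b) for a, b in candidates
    if len(postings[a] & postings[b]) > MIN_CO_OCCURRENCE
]
print(validated)  # [('hadoop', 'spark'), ('java', 'developer')] -- java/nurse filtered out as noise
```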

Published in the 2015 IEEE International Conference on Big Data (IEEE BigData 2015)

Paper:
https://www.researchgate.net/publication/283980991_Improving_the_Quality_of_Semantic_Relationships_Extracted_from_Massive_User_Behavioral_Data

I was excited to be selected again this year to present at Lucene/Solr Revolution 2015 in Austin, Texas. My talk today focused on one of my main areas of focus over the last year – building out a highly relevant and intelligent semantic search system. While I described and provided demos on the capabilities of the entire system (and many of the technical details for how someone could implement a similar system), I spent the majority of the time on the core Knowledge Graph we’ve built using Apache Solr to dynamically understand the meaning of any query or document that is provided as search input. This Solr-based Knowledge Graph – combined with a probabilistic, entity-based query parser, a sophisticated type-ahead prediction mechanism, spell checking, and a query-augmentation stage – is core to the Intent Engine we’ve built to be able to search on “things, not strings”, and to truly understand and match based upon the intent behind the user’s search.

Video:

Slides:
http://www.slideshare.net/treygrainger/leveraging-lucenesolr-as-a-knowledge-graph-and-intent-engine

Talk Summary:
Search engines frequently miss the mark when it comes to understanding user intent. This talk will describe how to overcome this by leveraging Lucene/Solr to power a knowledge graph that can extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships. For example, if a user types in (Senior Java Developer Portland, OR Hadoop), you or I know that the term “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary, and that “hadoop” is a technology related to terms like “hbase”, “hive”, and “map/reduce”. Out of the box, however, most search engines just parse this query as text:((senior AND java AND developer AND portland) OR (hadoop)), which is not at all what the user intended. We will discuss how to train the search engine to parse the query into this intended understanding, and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
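
For readers who want to experiment with this style of knowledge-graph traversal directly in Solr, the sketch below issues a JSON Facet request using the relatedness() aggregation later contributed back through SOLR-9480 (linked above); the collection and field names are assumptions, and the exact syntax may vary by Solr version:

```python
# Hedged sketch: traversing term relationships with Solr's JSON Facet API and
# the relatedness() aggregation (the Semantic Knowledge Graph contribution,
# SOLR-9480, available in later Solr versions). The collection and field names
# are illustrative assumptions; check your Solr version for exact syntax.
import requests

SOLR = "http://localhost:8983/solr/jobs/query"  # assumed local collection

request_body = {
    "params": {
        "fore": "skills:hadoop",   # foreground: documents matching the starting node
        "back": "*:*",             # background: the whole corpus
    },
    "query": "skills:hadoop",
    "limit": 0,
    "facet": {
        "related_skills": {
            "type": "terms",
            "field": "skills",
            "limit": 10,
            "sort": {"r": "desc"},
            "facet": {"r": "relatedness($fore,$back)"},
        }
    },
}

response = requests.post(SOLR, json=request_body).json()
for bucket in response["facets"]["related_skills"]["buckets"]:
    print(bucket["val"], bucket["r"]["relatedness"])
```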

Topics: Semantic Search, Finite State Transducers, Probabilistic Parsing, Bayes Theorem, Augmented Search, Recommendations, NLP, Knowledge Graphs

I was invited to present again this year at Lucene/Solr Revolution 2014 in Washington, D.C. My presentation took place this afternoon and covered the topic of “Semantic & Multilingual Strategies in Lucene/Solr”. The material was taken partially from the extensive Multilingual Search chapter (ch. 14) in Solr in Action and from some of the exciting semantic search work we’ve been doing recently at CareerBuilder.

Video:

Slides:
http://www.slideshare.net/treygrainger/semantic-multilingual-strategies-in-lucenesolr

Talk Summary: When searching on text, choosing the right CharFilters, Tokenizer, stemmers, and other TokenFilters for each supported language is critical. Additional tools of the trade include language detection through UpdateRequestProcessors, parts of speech analysis, entity extraction, stopword and synonym lists, relevancy differentiation for exact vs. stemmed vs. conceptual matches, and identification of statistically interesting phrases per language. For multilingual search, you also need to choose between several strategies such as: searching across multiple fields, using a separate collection per language combination, or combining multiple languages in a single field (custom code is required for this and will be open sourced). These all have their own strengths and weaknesses depending upon your use case.

This talk will provide a tutorial (with code examples) on how to pull off each of these strategies as well as compare and contrast the different kinds of stemmers, review the precision/recall impact of stemming vs. lemmatization, and describe some techniques for extracting meaningful relationships between terms to power a semantic search experience per-language. Come learn how to build an excellent semantic and multilingual search system using the best tools and techniques Lucene/Solr has to offer!
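
As a small, hedged illustration of the field-per-language strategy, the sketch below routes each document's text into a language-specific field on the client side; the talk covers doing this inside Solr via an UpdateRequestProcessor, and the detect_language function and field names here are placeholders:

```python
# Hedged sketch of the "field per language" strategy: route text into a
# language-specific field (text_en, text_fr, ...) so each field can use its own
# analyzer chain (tokenizer, stemmer, stopwords). The detect_language function
# is a placeholder; in practice language identification might run inside Solr
# via an UpdateRequestProcessor, as described in the talk.
def detect_language(text):
    # Placeholder: swap in a real language-identification step. Hard-coded
    # here only so the sketch is runnable.
    return "fr" if "développeur" in text.lower() else "en"

def to_solr_document(doc_id, text):
    lang = detect_language(text)
    field = f"text_{lang}"          # e.g. text_en, text_fr -- assumed field names
    return {"id": doc_id, field: text, "language": lang}

docs = [
    to_solr_document("1", "Senior Java developer with Hadoop experience"),
    to_solr_document("2", "Développeur Java senior avec expérience Hadoop"),
]
print(docs)
# Each document now targets the field whose analyzer matches its language,
# and queries can then search across text_en, text_fr, ... with per-field analysis.
```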

My team was fortunate to have two papers accepted for publication through the 2014 IEEE International Conference on Big Data, held last week in Washington, D.C. I presented one of the papers, titled “Crowdsourced Query Augmentation through the Semantic Discovery of Domain-specific Jargon.” The slides and video (coming soon) are posted below for anyone who could not make the presentation in person.

Slides:

Paper Abstract: Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor intensive and is hard to maintain, while the latter is prone to noise and may be hard for a human to understand or to interact with directly. We believe that the links between similar users’ queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.
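
The sketch below shows the general intuition (not the paper's full method or its noise-removal step): queries issued by the same user act as weak votes that two phrases are related, and only pairs supported by enough distinct users survive:

```python
# Hedged sketch of mining related phrases from search logs: queries issued by
# the same user are treated as weak evidence of relatedness, and only pairs
# supported by enough distinct users are kept. This is the intuition only,
# not the paper's full method.
from collections import defaultdict
from itertools import combinations

# (user_id, query) events from a hypothetical search log.
search_log = [
    ("u1", "hadoop"), ("u1", "big data"), ("u1", "java developer"),
    ("u2", "hadoop"), ("u2", "big data"),
    ("u3", "hadoop"), ("u3", "big data"),
    ("u4", "registered nurse"), ("u4", "hadoop"),
]

queries_by_user = defaultdict(set)
for user, query in search_log:
    queries_by_user[user].add(query)

# Count how many distinct users issued both queries.
pair_support = defaultdict(set)
for user, queries in queries_by_user.items():
    for a, b in combinations(sorted(queries), 2):
        pair_support[(a, b)].add(user)

MIN_USERS = 2  # illustrative threshold
related = {pair: len(users) for pair, users in pair_support.items() if len(users) >= MIN_USERS}
print(related)  # {('big data', 'hadoop'): 3} -- one-off pairs like nurse/hadoop drop out
```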

Published in the 2014 IEEE International Conference on Big Data (IEEE BigData 2014)

Paper:

Paper Abstract:
In the big data era, scalability has become a crucial requirement for any useful computational model. Probabilistic graphical models are very useful for mining and discovering data insights, but they are not scalable enough to be suitable for big data problems. Bayesian Networks particularly demonstrate this limitation when their data is represented using few random variables while each random variable has a massive set of values. With hierarchical data – data that is arranged in a treelike structure with several levels – one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian networks become infeasible for representing the probability distributions for the following reasons: i) Each level represents a single random variable with hundreds of thousands of values, ii) The number of levels is usually small, so there are also few random variables, and iii) The structure of the network is predefined since the dependency is modeled top-down from each parent to each of its child nodes, so the network would contain a single linear path for the random variables from each parent to each child node. In this paper we present a scalable probabilistic graphical model to overcome these limitations for massive hierarchical data. We believe the proposed model will lead to an easily-scalable, more readable, and expressive implementation for problems that require probabilistic-based solutions for massive amounts of hierarchical data. We successfully applied this model to solve two different challenging probabilistic-based problems on massive hierarchical data sets for different domains, namely, bioinformatics and latent semantic discovery over search logs.
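
To make the data shape concrete (this is not the model proposed in the paper), the sketch below estimates naive parent-to-child conditional probabilities from counts over a tiny two-level hierarchy; real data sets would have hundreds of thousands of values per level:

```python
# Hedged sketch of the data shape described above: a small two-level hierarchy
# where each level acts like one random variable with many values, and the
# dependency runs from parent to child. This only computes naive conditional
# probability estimates from counts; it is NOT the scalable model proposed in
# the paper.
from collections import Counter, defaultdict

# (level_1_value, level_2_value) observations, e.g. job category -> job title.
observations = [
    ("engineering", "java developer"),
    ("engineering", "java developer"),
    ("engineering", "data engineer"),
    ("healthcare", "registered nurse"),
    ("healthcare", "registered nurse"),
    ("healthcare", "physician assistant"),
]

parent_counts = Counter(parent for parent, _ in observations)
pair_counts = Counter(observations)

# P(child | parent) estimated directly from counts.
conditional = defaultdict(dict)
for (parent, child), count in pair_counts.items():
    conditional[parent][child] = count / parent_counts[parent]

print(conditional["engineering"])  # {'java developer': ~0.67, 'data engineer': ~0.33}
```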

Published in the 2014 IEEE International Conference on Big Data (IEEE BigData 2014)

Paper:
https://arxiv.org/abs/1407.5656