Trey is SVP of Engineering @ Lucidworks, co-author of Solr in Action, founder or Celiaccess.com, researcher/ public speaker on search, analytics, recommendation systems, and natural language processing.

I had a blast last night at the DFW Data Science Meetup presenting on “The Apache Solr Smart Data Ecosystem.” There’s so much going on in the Apache Lucene/Solr world around data intelligence and relevancy, and we had so many questions and great discussion along the way, that the presentation and discussion nearly 3.5 hours! It was great to have such a welcoming and actively engaged audience the whole way through and being able to dive in deep on topics with everyone - thanks @dfwdatascience for your hospitality and for hosting such a great event!



Talk Abstract:
Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.

Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions allowing you to very easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr’s free-text, geospatial, and other search capabilities with the a prominent query language already known by most developers (and which many external systems can use to query Solr directly).

Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.

We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a really good feel for how to build highly relevant “smart data” applications leveraging these key technologies.

I was a panelist today for the South Big Data Hub’s open panel on Text Data Analysis. The event was geared toward researchers and companies working on text mining in any sector. Topics discussed included web-scraping, semantic web, analysis tools in R and Python, the benefits of open source search engines such as Solr and elasticsearch as well as current industry search options.

In my presentation, I provided a quick introduction to Apache Solr, described how companies are using Solr to power relevant search in industry, and provided a glimpse at where the industry is heading with regard to implementing more intelligent and relevant semantic search. Slides are attached here for future reference.


The Semantic Knowledge Graph

October 18th, 2016

Today, I was in Montreal, Quebec, Canada presenting a research paper entitled “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain“. I presented the paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics, where the paper was accepted for publication.

Published Resources:
Research Paper (Arxiv)
Source Code (Github)
Solr Contribution: SOLR-9480 (Apache JIRA)


I originally conceived of the core idea underlying the Semantic Knowledge Graph in 2010 and built the first pieces of it in early 2011 as part of a content-based recommendation system algorithm I was developing at the time for document to document matching (supporting a single-level graph traversal and edge scoring). Over the next few years, I would be involved in the development of distributed pivot facets in Apache Solr, and would ultimately come to the realization that a multi-level graph traversal would be possible and could yield some very promising results in terms of discovering and scoring the strength of relationships between any entities within a corpus of documents. In 2015, CareerBuilder needed such a capability to drive an ontology learning and reasoning system that was a critical component of the semantic search system we were building, so I had the opportunity to work with the search team to develop (and later open source) this capability, which we called the “Semantic Knowledge Graph”.

Paper Abstract:
This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.

Lucene/Solr Revolution is always my favorite conference of the year, and I was invited to present again this year in the great city of Boston, MA. It was a fun presentation, representing a whirlwind tour through numerous open sourced extensions within the larger Apache Solr Ecosystem (The Semantic Knowledge Graph, the Solr Text Tagger, a Probabilistic Query Parser, Dice’s Conceptual Search Plugin, Solr’s Learning to Rank capability, SolrRDF, etc.) that can be combined, along with some query log mining, to build an end-to-end self-learning data system powered by Apache Solr. Videos will be posted soon, but here are the slides from my talk in the meantime!



Talk Abstract:
What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?

In this presentation, you’ll learn you how to do just that - to evolving Lucene/Solr implementations into self-learning data systems which are able to accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.

Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.

I was honored to be invited to present today, along with Khalifeh AlJadda, Lead Data Scientist at CareerBuilder, to a group of nearly 200 Georgia Tech graduate students and other members of the larger Georgia tech community. We thank Dr. Polo Chau and his Data and Visual Analytics class for inviting us and sponsoring the event, and we also appreciate the numerous other folks who also heard about and attended the presentation!

Khalifeh and I worked closely together while I was at CareerBuilder evolving their semantic search engine and recommendation engine into self-learning data systems. It was great being able to present some of the similar work I am now doing at Lucidworks, along with Khalifeh, who presented much of the work we had done at CareerBuilder, as well as some of the newer techniques they are now applying.


Talk Abstract:
In the big data era, search and recommendation engines have become the primary mechanisms through which users both actively find and passively discover useful information. As such, it has never been more critical for these data systems to be able to deliver targeted, relevant results that fully match a user’s intent.

In this presentation, we’ll talk about evolving self-learning search and recommendation systems which are able to accept user queries, deliver relevance-ranked results, and iteratively learn from the users’ subsequent interactions to continually deliver a more relevant experience. Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the collective feedback from all prior user interactions with the system. Through iterative feedback loops, such a system can leverage user interactions to learn the meaning of important phrases and topics within a domain, identify alternate spellings and disambiguate multiple meanings of those phrases, learn the conceptual relationships between phrases, and even learn the relative importance of features to automatically optimize its own ranking algorithms on a per-query, per-category, or per-user/group basis.

We will cover some of the core technologies that enable such a system to be built (Apache Lucene/Solr, Apache Spark, Apache Hadoop, cloud computing), and will walk through some practical examples of how such a reflected intelligence system has been built and is being leveraged in a real-world implementation.

After 8.5 years at CareerBuilder - most recently as Director of Engineering over search, recommendations, and data analytics, I’m making an exciting transition to join Lucidworks as their new Senior Vice President of Engineering. While it’s bittersweet to be saying goodbye to my amazing team and colleagues at CareerBuilder, I also couldn’t be more excited to be joining such an industry leader in the Search and Information Retrieval space.

What attracted me to Lucidworks is the opportunity to work with visionaries in the search space building search technology that will help the masses derive intelligence from their data both at scale and in the tail. Search is a really hard problem, and I’m excited to be in great company trying to solve that problem well.

For more details about this move, check out my introduction interview with Lucidworks:

  • When did you first get started working with Apache Lucene?
  • How has search evolved over the past couple years? Where do you think it’ll be in the next 10?
  • What do you find most exciting in the current search technology landscape?
  • Where are the biggest challenges in the search space?
  • What attracted you to Lucidworks?
  • What will you be working on at Lucidworks?

Thanks for the warm welcome, Lucidworks! I’m incredibly excited to be working with such a top-notch team at Lucidworks, and am looking forward to building out what will be the most scalable, dependable, easy to use, and highly relevant search product on the market.

I was invited to speak on 2015.11.10 at the Bay Area Search Meetup in San Jose, CA. With over 175 people marked as attending (and several more on the waitlist who showed up), we had a very exciting and lively discussion for almost 2 hours (about 1/2 was my presentation, with the other half being Q&A mixed in throughout). Thanks again to eBay for hosting the event and providing pizza and beverages, and to everyone who attended for the warm welcome and great discussions.



Talk Summary:
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning “intent engine”, able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder’s semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.

As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that “hadoop” is the skill “Apache Hadoop”, which is also related to other terms like “hbase”, “hive”, and “map/reduce”. We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.

Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs

I was excited to be selected again this year to present at Lucene/Solr Revolution 2015 in Austin, Texas. My talk today focused on one the main areas of focus for me over the last year - building out a highly relevant and intelligent semantic search system. While I described and provided demos on the capabilities of the entire system (and many of the technical details for how someone could implement a similar system), I spent the majority of the time on the core Knowledge Graph we’ve built using Apache Solr to dynamically understand the meaning of any query or document that is provided as search input. This Solr-based Knowledge Graph - combined with a probabilistic, entity-based query parser, a sophisticated type-ahead prediction mechanism, spell checking, and a query-augmentation stage - is core to the Intent Engine we’ve built to be able to search on “things, not strings”, and to truly understand and match based upon the intent behind the user’s search.



Talk Summary:
Search engines frequently miss the mark when it comes to understanding user intent. This talk will describe how to overcome this by leveraging Lucene/Solr to power a knowledge graph that can extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships. For example, if a user types in (Senior Java Developer Portland, OR Hadoop), you or I know that the term “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary, and that “hadoop” is a technology related to terms like “hbase”, “hive”, and “map/reduce”. Out of the box, however, most search engines just parse this query as text:((senior AND java AND developer AND portland) OR (hadoop)), which is not at all what the user intended. We will discuss how to train the search engine to parse the query into this intended understanding, and how to reflect this understanding to the end user to provide an insightful, augmented search experience.

Topics: Semantic Search, Finite State Transducers, Probabilistic Parsing, Bayes Theorem, Augmented Search, Recommendations, NLP, Knowledge Graphs

I was invited to present again this year at Lucene/Solr Revolution 2014 in Washington, D.C. My presentation took place this afternoon and covered the topic of “Semantic & Mulilingual Strategies in Lucene/Solr. The material was taken partially from the extensive Multilingual Search chapter (ch. 14) in Solr in Action and from some of the exciting semantic search work we’ve been doing recently at CareerBuilder.



Talk Summary: When searching on text, choosing the right CharFilters, Tokenizer, stemmers, and other TokenFilters for each supported language is critical. Additional tools of the trade include language detection through UpdateRequestProcessors, parts of speech analysis, entity extraction, stopword and synonym lists, relevancy differentiation for exact vs. stemmed vs. conceptual matches, and identification of statistically interesting phrases per language. For multilingual search, you also need to choose between several strategies such as: searching across multiple fields, using a separate collection per language combination, or combining multiple languages in a single field (custom code is required for this and will be open sourced). These all have their own strengths and weaknesses depending upon your use case.

This talk will provide a tutorial (with code examples) on how to pull off each of these strategies as well as compare and contrast the different kinds of stemmers, review the precision/recall impact of stemming vs. lemmatization, and describe some techniques for extracting meaningful relationships between terms to power a semantic search experience per-language. Come learn how to build an excellent semantic and multilingual search system using the best tools and techniques Lucene/Solr has to offer!

My team was fortunate to have 2 papers accepted for publication through the 2014 IEEE International Conference on Big Data, held last week in Washington, D.C. I presented one of the papers titled “Crowdsourced Query Augmentation through the Semantic Discovery of Domain-specific Jargon.” The slides and video (coming soon) are posted below for anyone who could not make the presentation in person.


Paper Abstract: Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor intensive and is hard to maintain, while the latter is prone to noise and may be hard for a human to understand or to interact with directly. We believe that the links between similar user’s queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic
relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.