Trey is SVP of Engineering @ Lucidworks, co-author of Solr in Action, founder of Celiaccess.com, researcher/public speaker on search, analytics, recommendation systems, and natural language processing.

Last night I had the opportunity to speak at the Greenville Data Science & Analytics Meetup on “Building Search & Recommendation Engines”. It was a great opportunity to present a general introduction to Apache Solr, Search Engines, Relevancy, Recommendations, and generally building intelligent information retrieval systems. I appreciated the level of interest and insightful questions from everyone who attended, and I look forward to more great events from this group in the future!



Talk Abstract:
In this talk, you’ll learn how to build your own search and recommendation engine based on the open source Apache Lucene/Solr project. We’ll dive into some of the data science behind how search engines work, covering multi-lingual text analysis, natural language processing, relevancy ranking algorithms, knowledge graphs, reflected intelligence, collaborative filtering, and other machine learning techniques used to drive relevant results for free-text queries. We’ll also demonstrate how to build a recommendation engine leveraging the same platform and techniques that power search for most of the world’s top companies. You’ll walk away from this presentation with the toolbox you need to go and implement your very own search-based product using your own data.

I had a blast at the Southern Data Science Conference yesterday in Atlanta, GA, where I presented a talk titled “Intent Algorithms: The Data Science of Smart Information Retrieval Systems”. This was the first year the conference was held, and it’s already clear that it is going to hold the title as the preeminent Data Science conference in the Southeast United States. Top speakers, authors, and industry and academic practitioners were represented from the likes of Google, Lucidworks, NASA, Microsoft, Allen Institute for AI, Skymind, CareerBuilder, Glassdoor, Distil Networks, Takt, Elephant Scale, AT&T, Macy’s Technology, Los Alamos National Laboratory, Georgia Tech, The University of Georgia, and the South Big Data Hub. I had a lot to cover on the topic of “intent algorithms”, so the talk went at quite a rapid pace (due to the 30-minute time limit) to be sure everyone walked away with a solid understanding of the topic. There’s a lot of good material and demos in the presentation, though, so it’s definitely worth checking out the video or slides below!




Talk Abstract:
Search engines, recommendation systems, advertising networks, and even data analytics tools all share the same end goal - to deliver the most relevant information possible to meet a given information need (usually in real-time). Perfecting these systems requires algorithms which can build a deep understanding of the domains represented by the underlying data, understand the nuanced ways in which words and phrases should be parsed and interpreted within different contexts, score the relationships between arbitrary phrases and concepts, continually learn from users’ context and interactions to make the system smarter, and generate custom models of personalized tastes for each user of the system.

In this talk, we’ll dive into both the philosophical questions associated with such systems (“How do you accurately represent and interpret the meaning of words?”, “How do you prevent filter bubbles?”, etc.), as well as look at practical examples of how these systems have been successfully implemented in production systems combining a variety of available commercial and open source components (inverted indexes, entity extraction, similarity scoring and machine-learned ranking, auto-generated knowledge graphs, phrase interpretation and concept expansion, etc.).

I had a blast last night at the DFW Data Science Meetup presenting on “The Apache Solr Smart Data Ecosystem.” There’s so much going on in the Apache Lucene/Solr world around data intelligence and relevancy, and we had so many questions and great discussion along the way, that the presentation and discussion lasted nearly 3.5 hours! It was great to have such a welcoming and actively engaged audience the whole way through and to be able to dive in deep on topics with everyone - thanks @dfwdatascience for your hospitality and for hosting such a great event!




Talk Abstract:
Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.

Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions allowing you to very easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr’s free-text, geospatial, and other search capabilities with a prominent query language already known by most developers (and which many external systems can use to query Solr directly).

Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.

We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a really good feel for how to build highly relevant “smart data” applications leveraging these key technologies.

I was a panelist today for the South Big Data Hub’s open panel on Text Data Analysis. The event was geared toward researchers and companies working on text mining in any sector. Topics discussed included web scraping, the semantic web, analysis tools in R and Python, and the benefits of open source search engines such as Solr and Elasticsearch, as well as current industry search options.

In my presentation, I provided a quick introduction to Apache Solr, described how companies are using Solr to power relevant search in industry, and provided a glimpse at where the industry is heading with regard to implementing more intelligent and relevant semantic search. Slides are attached here for future reference.


The Semantic Knowledge Graph

October 18th, 2016

Today, I was in Montreal, Quebec, Canada presenting a research paper entitled “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. I presented the paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics, where the paper was accepted for publication.

Published Resources:
Research Paper (arXiv)
Source Code (GitHub)
Solr Contribution: SOLR-9480 (Apache JIRA)


I originally conceived of the core idea underlying the Semantic Knowledge Graph in 2010 and built the first pieces of it in early 2011 as part of a content-based recommendation system algorithm I was developing at the time for document to document matching (supporting a single-level graph traversal and edge scoring). Over the next few years, I would be involved in the development of distributed pivot facets in Apache Solr, and would ultimately come to the realization that a multi-level graph traversal would be possible and could yield some very promising results in terms of discovering and scoring the strength of relationships between any entities within a corpus of documents. In 2015, CareerBuilder needed such a capability to drive an ontology learning and reasoning system that was a critical component of the semantic search system we were building, so I had the opportunity to work with the search team to develop (and later open source) this capability, which we called the “Semantic Knowledge Graph”.

Paper Abstract:
This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
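To make the idea of edges "materializing" from intersecting postings lists more concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not the paper's implementation: the corpus, the function names, and the simple lift-style relatedness score are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text
DOCS = {
    1: "java developer hadoop hive",
    2: "java developer spring",
    3: "nurse hospital care",
    4: "hadoop hive hbase",
    5: "nurse care patient",
}

def build_inverted_index(docs):
    """Map each term (node) to the set of document ids containing it (its postings list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def build_forward_index(docs):
    """Uninverted view: map each document id to the set of terms it contains."""
    return {doc_id: set(text.split()) for doc_id, text in docs.items()}

def edge(index, term_a, term_b):
    """Materialize an edge as the intersection of two postings lists."""
    return index.get(term_a, set()) & index.get(term_b, set())

def relatedness(index, term_a, term_b, total_docs):
    """Illustrative score (an assumption): observed co-occurrence relative to
    what would be expected if the two terms were independent."""
    a, b = index.get(term_a, set()), index.get(term_b, set())
    if not a or not b:
        return 0.0
    return len(a & b) * total_docs / (len(a) * len(b))

def related_terms(index, forward, term, total_docs):
    """Traverse node -> documents -> candidate nodes, then score each edge."""
    candidates = set()
    for doc_id in index.get(term, set()):
        candidates |= forward[doc_id]
    candidates.discard(term)
    scored = [(other, relatedness(index, term, other, total_docs)) for other in candidates]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

INDEX = build_inverted_index(DOCS)
FORWARD = build_forward_index(DOCS)
print(sorted(edge(INDEX, "hadoop", "hive")))
print(related_terms(INDEX, FORWARD, "hadoop", len(DOCS)))
```

Nothing about the relationship between "hadoop" and "hive" is stored explicitly in this sketch; the edge only exists at query time as a set intersection, which is the layer of indirection the abstract describes.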

Lucene/Solr Revolution is always my favorite conference of the year, and I was invited to present again this year in the great city of Boston, MA. It was a fun presentation, representing a whirlwind tour through numerous open sourced extensions within the larger Apache Solr Ecosystem (The Semantic Knowledge Graph, the Solr Text Tagger, a Probabilistic Query Parser, Dice’s Conceptual Search Plugin, Solr’s Learning to Rank capability, SolrRDF, etc.) that can be combined, along with some query log mining, to build an end-to-end self-learning data system powered by Apache Solr. Videos will be posted soon, but here are the slides from my talk in the meantime!




Talk Abstract:
What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?

In this presentation, you’ll learn how to do just that - to evolve Lucene/Solr implementations into self-learning data systems which are able to accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.

Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.
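One simple form of this kind of feedback loop can be sketched in a few lines of Python. This is only an illustration under assumptions, not the system from the talk: the click log, the document ids, the `weight` parameter, and the additive boost are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical interaction log: (query, id of the document the user clicked)
CLICKS = [
    ("java developer", "job_java_senior"),
    ("java developer", "job_java_senior"),
    ("java developer", "job_java_senior"),
    ("java developer", "job_barista"),
    ("nurse", "job_rn"),
]

def aggregate_signals(clicks):
    """Count clicks per (query, document) pair."""
    signals = defaultdict(Counter)
    for query, doc_id in clicks:
        signals[query.lower()][doc_id] += 1
    return signals

def rerank(query, base_results, signals, weight=0.1):
    """Add a click-based boost to each text relevance score and re-sort.

    base_results is a list of (doc_id, text_relevance_score) pairs.
    """
    boosts = signals.get(query.lower(), Counter())
    rescored = [(doc_id, score + weight * boosts[doc_id]) for doc_id, score in base_results]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

SIGNALS = aggregate_signals(CLICKS)
BASE = [("job_barista", 1.0), ("job_java_senior", 0.9), ("job_java_intern", 0.8)]
for doc_id, score in rerank("Java Developer", BASE, SIGNALS):
    print(doc_id, round(score, 2))
```

Here the keyword-only ranking puts an irrelevant document first, and the accumulated clicks from prior users move the document they actually chose to the top. A production system would also need to handle position bias, signal decay, and spam, which this sketch ignores.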

I was honored to be invited to present today, along with Khalifeh AlJadda, Lead Data Scientist at CareerBuilder, to a group of nearly 200 Georgia Tech graduate students and other members of the larger Georgia Tech community. We thank Dr. Polo Chau and his Data and Visual Analytics class for inviting us and sponsoring the event, and we also appreciate the numerous other folks who heard about and attended the presentation!

Khalifeh and I worked closely together while I was at CareerBuilder evolving their semantic search engine and recommendation engine into self-learning data systems. It was great being able to present some of the similar work I am now doing at Lucidworks, along with Khalifeh, who presented much of the work we had done at CareerBuilder, as well as some of the newer techniques they are now applying.



Talk Abstract:
In the big data era, search and recommendation engines have become the primary mechanisms through which users both actively find and passively discover useful information. As such, it has never been more critical for these data systems to be able to deliver targeted, relevant results that fully match a user’s intent.

In this presentation, we’ll talk about evolving self-learning search and recommendation systems which are able to accept user queries, deliver relevance-ranked results, and iteratively learn from the users’ subsequent interactions to continually deliver a more relevant experience. Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the collective feedback from all prior user interactions with the system. Through iterative feedback loops, such a system can leverage user interactions to learn the meaning of important phrases and topics within a domain, identify alternate spellings and disambiguate multiple meanings of those phrases, learn the conceptual relationships between phrases, and even learn the relative importance of features to automatically optimize its own ranking algorithms on a per-query, per-category, or per-user/group basis.

We will cover some of the core technologies that enable such a system to be built (Apache Lucene/Solr, Apache Spark, Apache Hadoop, cloud computing), and will walk through some practical examples of how such a reflected intelligence system has been built and is being leveraged in a real-world implementation.


After 8.5 years at CareerBuilder, most recently as Director of Engineering over search, recommendations, and data analytics, I’m making an exciting transition to join Lucidworks as their new Senior Vice President of Engineering. While it’s bittersweet to be saying goodbye to my amazing team and colleagues at CareerBuilder, I also couldn’t be more excited to be joining such an industry leader in the Search and Information Retrieval space.

What attracted me to Lucidworks is the opportunity to work with visionaries in the search space building search technology that will help the masses derive intelligence from their data both at scale and in the tail. Search is a really hard problem, and I’m excited to be in great company trying to solve that problem well.

For more details about this move, check out my introduction interview with Lucidworks:

  • When did you first get started working with Apache Lucene?
  • How has search evolved over the past couple years? Where do you think it’ll be in the next 10?
  • What do you find most exciting in the current search technology landscape?
  • Where are the biggest challenges in the search space?
  • What attracted you to Lucidworks?
  • What will you be working on at Lucidworks?

Thanks for the warm welcome, Lucidworks! I’m incredibly excited to be working with such a top-notch team at Lucidworks, and am looking forward to building out what will be the most scalable, dependable, easy to use, and highly relevant search product on the market.

I was invited to speak on 2015.11.10 at the Bay Area Search Meetup in San Jose, CA. With over 175 people marked as attending (and several more on the waitlist who showed up), we had a very exciting and lively discussion for almost 2 hours (about half was my presentation, with the other half being Q&A mixed in throughout). Thanks again to eBay for hosting the event and providing pizza and beverages, and to everyone who attended for the warm welcome and great discussions.




Talk Summary:
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning “intent engine”, able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder’s semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.

As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that “hadoop” is the skill “Apache Hadoop”, which is also related to other terms like “hbase”, “hive”, and “map/reduce”. We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
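The example query above can be used to sketch the basic idea of entity-aware query parsing. The snippet below is a deliberately simplified stand-in, assuming a small hand-written entity dictionary and greedy longest-match tagging; the talk itself covers finite state transducers and probabilistic query parsing, which this sketch does not implement. All names and dictionary entries are hypothetical.

```python
# Hypothetical entity dictionary: known phrase -> (entity type, normalized value)
ENTITIES = {
    "senior": ("experience_level", "Senior"),
    "java developer": ("job_title", "Java Developer"),
    "portland, or": ("location", "Portland, OR"),
    "hadoop": ("skill", "Apache Hadoop"),
}
MAX_PHRASE_TOKENS = 3  # longest phrase (in tokens) we try to match

def tag_query(query):
    """Greedy longest-match tagging of known phrases; everything else is a keyword."""
    # Commas stay attached to tokens so "portland, or" can match as one phrase
    # instead of being read as a keyword followed by a boolean operator.
    tokens = query.lower().split()
    tagged, i = [], 0
    while i < len(tokens):
        for size in range(min(MAX_PHRASE_TOKENS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            if phrase in ENTITIES:
                entity_type, value = ENTITIES[phrase]
                tagged.append((phrase, entity_type, value))
                i += size
                break
        else:
            # No known phrase starts at this token; fall back to a plain keyword.
            tagged.append((tokens[i], "keyword", tokens[i]))
            i += 1
    return tagged

for phrase, entity_type, value in tag_query("Senior Java Developer Portland, OR Hadoop"):
    print(f"{phrase!r:18} -> {entity_type}: {value}")
```

Greedy matching is the weakest part of this sketch: it cannot weigh competing interpretations of the same tokens, which is the gap a probabilistic parser is meant to fill.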

Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs

Paper Abstract:
As the ability to store and process massive amounts of user behavioral data increases, new approaches continue to arise for leveraging the wisdom of the crowds to gain insights that were previously very challenging to discover by text mining alone. For example, through collaborative filtering, we can learn previously hidden relationships between items based upon users’ interactions with them, and we can also perform ontology mining to learn which keywords are semantically-related to other keywords based upon how they are used together by similar users as recorded in search engine query logs. The biggest challenge to this collaborative filtering approach is the variety of noise and outliers present in the underlying user behavioral data. In this paper we propose a novel approach to improve the quality of semantic relationships extracted from user behavioral data. Our approach utilizes millions of documents indexed into an inverted index in order to detect and remove noise and outliers.

Published in the 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
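The general shape of the approach in this abstract can be illustrated with a small Python sketch: mine candidate keyword relationships from user search behavior, then use co-occurrence in an inverted index of documents to discard pairs that look like noise. This is not the paper's algorithm; the data, function names, and the `min_users` and `min_docs` thresholds are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical query log: user id -> keywords that user searched for
USER_SEARCHES = {
    "u1": {"hadoop", "hive", "pizza"},
    "u2": {"hadoop", "hive"},
    "u3": {"hadoop", "pizza"},
}

# Hypothetical document collection
DOCUMENTS = [
    "hadoop hive engineer",
    "hive hadoop spark",
    "pizza delivery driver",
]

def candidate_pairs(user_searches, min_users=2):
    """Keyword pairs searched by at least min_users of the same users."""
    counts = defaultdict(int)
    for terms in user_searches.values():
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_users}

def build_index(documents):
    """Inverted index: term -> set of positions of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.split():
            index[term].add(doc_id)
    return index

def filter_noise(pairs, index, min_docs=1):
    """Keep only pairs whose terms co-occur in at least min_docs documents."""
    return {
        (a, b)
        for a, b in pairs
        if len(index.get(a, set()) & index.get(b, set())) >= min_docs
    }

CANDIDATES = candidate_pairs(USER_SEARCHES)
CLEANED = filter_noise(CANDIDATES, build_index(DOCUMENTS))
print(sorted(CANDIDATES))
print(sorted(CLEANED))
```

In this toy data, "hadoop" and "pizza" look related from user behavior alone, but the pair is dropped because the two terms never appear together in any document.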