Trey is SVP of Engineering @ Lucidworks, co-author of Solr in Action, founder of Celiaccess.com, and a researcher and public speaker on search, analytics, recommendation systems, and natural language processing.

I was excited to be selected again this year to present at Lucene/Solr Revolution 2015 in Austin, Texas. My talk focused on one of my main areas of work over the last year: building out a highly relevant and intelligent semantic search system. While I described and demoed the capabilities of the entire system (and many of the technical details for how someone could implement a similar system), I spent the majority of the time on the core Knowledge Graph we’ve built using Apache Solr to dynamically understand the meaning of any query or document provided as search input. This Solr-based Knowledge Graph, combined with a probabilistic, entity-based query parser, a sophisticated type-ahead prediction mechanism, spell checking, and a query-augmentation stage, forms the core of the Intent Engine we’ve built to search on “things, not strings” and to truly understand and match based upon the intent behind the user’s search.



Talk Summary:
Search engines frequently miss the mark when it comes to understanding user intent. This talk will describe how to overcome this by leveraging Lucene/Solr to power a knowledge graph that can extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships. For example, if a user types in (Senior Java Developer Portland, OR Hadoop), you or I know that the term “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary, and that “hadoop” is a technology related to terms like “hbase”, “hive”, and “map/reduce”. Out of the box, however, most search engines just parse this query as text:((senior AND java AND developer AND portland) OR (hadoop)), which is not at all what the user intended. We will discuss how to train the search engine to parse the query into this intended understanding, and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
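To make the example above concrete, here is a minimal sketch of entity-based query parsing. The tiny hard-coded lexicon is a hypothetical stand-in for the knowledge graph, and the greedy longest-match loop stands in for the probabilistic parser described in the talk; it is an illustration of the idea, not the production implementation.

```python
# Toy entity lexicon: phrase -> (entity type, related concepts).
# In the real system this knowledge comes from a Solr-backed knowledge graph.
KNOWN_ENTITIES = {
    "senior": ("experience_level", []),
    "java developer": ("job_title", ["software engineer"]),
    "portland, or": ("city", []),  # would map to a geo filter in practice
    "hadoop": ("technology", ["hbase", "hive", "map/reduce"]),
}

def parse_query(query: str):
    """Greedy longest-match phrase extraction over a known-entity lexicon."""
    tokens = query.lower().split()
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest token span starting at position i first.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in KNOWN_ENTITIES:
                etype, related = KNOWN_ENTITIES[phrase]
                entities.append({"phrase": phrase, "type": etype, "expand": related})
                i = j
                break
        else:
            # Unknown token: fall back to treating it as a plain keyword.
            entities.append({"phrase": tokens[i], "type": "keyword", "expand": []})
            i += 1
    return entities
```

Running this on "Senior Java Developer Portland, OR Hadoop" yields typed entities (experience level, job title, city, technology) instead of a bag of ANDed keywords, which is the structured understanding the downstream query builder needs.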

Topics: Semantic Search, Finite State Transducers, Probabilistic Parsing, Bayes Theorem, Augmented Search, Recommendations, NLP, Knowledge Graphs

I was invited to present again this year at Lucene/Solr Revolution 2014 in Washington, D.C. My presentation took place this afternoon and covered the topic of “Semantic & Multilingual Strategies in Lucene/Solr.” The material was drawn partially from the extensive Multilingual Search chapter (ch. 14) in Solr in Action and from some of the exciting semantic search work we’ve been doing recently at CareerBuilder.



Talk Summary: When searching on text, choosing the right CharFilters, Tokenizer, stemmers, and other TokenFilters for each supported language is critical. Additional tools of the trade include language detection through UpdateRequestProcessors, part-of-speech analysis, entity extraction, stopword and synonym lists, relevancy differentiation for exact vs. stemmed vs. conceptual matches, and identification of statistically interesting phrases per language. For multilingual search, you also need to choose between several strategies, such as searching across multiple fields, using a separate collection per language combination, or combining multiple languages in a single field (custom code is required for this and will be open sourced). Each of these has its own strengths and weaknesses depending upon your use case.

This talk will provide a tutorial (with code examples) on how to pull off each of these strategies as well as compare and contrast the different kinds of stemmers, review the precision/recall impact of stemming vs. lemmatization, and describe some techniques for extracting meaningful relationships between terms to power a semantic search experience per-language. Come learn how to build an excellent semantic and multilingual search system using the best tools and techniques Lucene/Solr has to offer!
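As a small illustration of the first strategy mentioned above (one field per language), here is a sketch of routing a document's text into a language-specific Solr field so each field can get the right analysis chain. The field-naming convention (`text_en`, `text_de`, etc.) and the toy fallback logic are assumptions for illustration; Solr itself provides real language detection via UpdateRequestProcessors.

```python
# Hypothetical set of languages this index supports with dedicated fields.
SUPPORTED_LANGS = {"en", "de", "fr"}

def route_fields(doc_id: str, text: str, detected_lang: str) -> dict:
    """Build a Solr document dict placing text in a per-language field
    (e.g. text_de), so the field's analyzer matches the language."""
    lang = detected_lang if detected_lang in SUPPORTED_LANGS else "en"
    return {"id": doc_id, f"text_{lang}": text, "language": lang}
```

At query time the same detection would be applied to the user's query, which is then issued against the matching per-language field (or across all of them with per-field boosts).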

My team was fortunate to have two papers accepted for publication at the 2014 IEEE International Conference on Big Data, held last week in Washington, D.C. I presented one of the papers, titled “Crowdsourced Query Augmentation through the Semantic Discovery of Domain-specific Jargon.” The slides and video (coming soon) are posted below for anyone who could not make the presentation in person.


Paper Abstract: Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or to interact with directly. We believe that the links between similar users’ queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.

Published in the 2014 IEEE International Conference on Big Data (IEEE BigData 2014)
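The core intuition of the paper can be sketched in a few lines: queries issued by the same user tend to be semantically related, so co-occurrence counts aggregated across many users' search logs surface related key phrases. This toy version omits the normalization and noise filtering the actual pipeline performs.

```python
from collections import Counter
from itertools import combinations

def related_terms(search_logs):
    """search_logs: list of per-user query lists.
    Returns co-occurrence counts over query pairs seen for the same user."""
    pair_counts = Counter()
    for user_queries in search_logs:
        # Deduplicate and sort so each pair is counted in a canonical order.
        for a, b in combinations(sorted(set(user_queries)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical miniature search log, one list of queries per user:
logs = [
    ["java developer", "hadoop", "software engineer"],
    ["java developer", "software engineer"],
    ["registered nurse", "rn"],
]
```

With many users, highly co-occurring pairs like ("java developer", "software engineer") bubble up as candidate semantic relationships, while rare pairings wash out as noise.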


Paper Abstract:
In the big data era, scalability has become a crucial requirement for any useful computational model. Probabilistic graphical models are very useful for mining and discovering data insights, but they are not scalable enough to be suitable for big data problems. Bayesian Networks particularly demonstrate this limitation when their data is represented using few random variables while each random variable has a massive set of values. With hierarchical data - data that is arranged in a treelike structure with several levels - one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian networks become infeasible for representing the probability distributions for the following reasons: i) Each level represents a single random variable with hundreds of thousands of values, ii) The number of levels is usually small, so there are also few random variables, and iii) The structure of the network is predefined since the dependency is modeled top-down from each parent to each of its child nodes, so the network would contain a single linear path for the random variables from each parent to each child node. In this paper we present a scalable probabilistic graphical model to overcome these limitations for massive hierarchical data. We believe the proposed model will lead to an easily-scalable, more readable, and expressive implementation for problems that require probabilistic-based solutions for massive amounts of hierarchical data. We successfully applied this model to solve two different challenging probabilistic-based problems on massive hierarchical data sets for different domains, namely, bioinformatics and latent semantic discovery over search logs.

Published in the 2014 IEEE International Conference on Big Data (IEEE BigData 2014)
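To illustrate the top-down dependency structure the abstract describes, here is a toy sketch of estimating the conditional probabilities for hierarchical data, where each level depends only on its parent, so P(path) = P(l1) · P(l2 | l1) · …. This is just the probability model being represented, not the paper's scalable implementation.

```python
from collections import Counter

def conditional_probs(paths):
    """paths: list of root-to-leaf tuples through the hierarchy.
    Returns P(node | its parent prefix) estimated from path counts."""
    prefix_counts, node_counts = Counter(), Counter()
    for path in paths:
        for depth in range(1, len(path) + 1):
            node_counts[path[:depth]] += 1       # count this prefix as a node
            prefix_counts[path[:depth - 1]] += 1  # count its parent prefix
    return {p: node_counts[p] / prefix_counts[p[:-1]] for p in node_counts}
```

For example, over hypothetical two-level category paths, P(("health",)) is the fraction of all paths rooted at "health", and P(("health", "nursing")) is the fraction of "health" paths that continue to "nursing".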


I was fortunate to be able to speak last week (along with Joe Streeky, my Search Infrastructure Development Manager) at the very first Atlanta Solr Meetup, held at Atlanta Tech Village. The talk covered how we scale Solr at CareerBuilder to power our recommendation engine, semantic search platform, and big data analytics products. Thanks to everyone who came out for a great event, and to LucidWorks, who sponsored the venue, pizza, and drinks.



Talk Summary: CareerBuilder uses Solr to power their recommendation engine, semantic search, and data analytics products. They maintain an infrastructure of hundreds of Solr servers, holding over a billion documents and serving over a million queries an hour across thousands of unique search indexes. Come learn how CareerBuilder has integrated Solr into their technology platform (with assistance from Hadoop, Cassandra, and RabbitMQ) and walk through API and code examples to see how you can use Solr to implement your own real-time recommendation engine, semantic search, and data analytics solutions.

Paper Abstract:
Common difficulties like the cold-start problem and a lack of sufficient information about users due to their limited interactions have been major challenges for most recommender systems (RS). To overcome these challenges and many similar ones that result in low accuracy (precision and recall) recommendations, we propose a novel system that extracts semantically-related search keywords based on the aggregate behavioral data of many users. These semantically-related search keywords can be used to substantially increase the amount of knowledge about a specific user’s interests based upon even a few searches and thus improve the accuracy of the RS. The proposed system is capable of mining aggregate user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free. These semantically related keywords are obtained by looking at the links between queries of similar users which, we believe, represent a largely untapped source for discovering latent semantic relationships between search terms.

Published in the Proceedings of the 8th ACM Conference on Recommender Systems (ACM RecSys 2014)


Timothy Potter and I were recently interviewed about the launch of our new book, Solr in Action, which was published last month. If you want to learn more about the book or just hear about our two-year journey to bring what critics are calling the “definitive guide” to Solr to market, please check out the podcast below:

This week, the SolrCluster team is joined by Trey Grainger, Director of Engineering for Search at CareerBuilder, and Timothy Potter, Lucene/Solr Committer and senior engineer at LucidWorks, to discuss their recently released co-authored book, Solr in Action. Solr in Action is a comprehensive guide to implementing scalable search with Lucene/Solr, based on the real-world applications that Tim and Trey have worked on throughout the course of their careers in Solr. Tim and Trey share with us the challenges they faced, accomplishments they achieved, and what they learned in the process of co-authoring their first book.

SolrCluster is hosted by Yann Yu and Adam Johnson. Questions? Email us at solrcluster@lucidworks.com or reach out on Twitter @LucidWorks #solrcluster.

Solr in Action is Published!

March 26th, 2014

After nearly two years of writing, editing, and coding up examples, I’m excited to announce that Solr in Action has finally been published! We released our first “early access” version back in October of 2012 and have since been working tirelessly to round out this comprehensive (664 pages!) guide, which covers Solr through version 4.7.

Solr in Action cover

Solr in Action is an essential resource for implementing fast and scalable search using Apache Solr. It uses well-documented examples ranging from basic keyword searching to scaling a system for billions of documents and queries. With this book, you’ll gain a deep understanding of how to implement core Solr capabilities such as faceted navigation through search results, matched snippet highlighting, field collapsing and search results grouping, spell-checking, query autocomplete, querying by functions, and more. You’ll also see how to take Solr to the next level, with deep coverage of large-scale production use cases, sophisticated multilingual search, complex query operations, and advanced relevancy tuning strategies.

Solr in Action is intentionally designed to be a learning guide as opposed to a reference manual. It builds from an initial introduction to Solr all the way to advanced topics such as implementing a predictive search experience, writing your own Solr plugins for function queries and multilingual text analysis, using Solr for big data analytics, and even building your own Solr-based recommendation engine.

The book uses fun real-world examples, including analyzing the text of tweets, searching and faceting on restaurants, grouping similar items in an ecommerce application, highlighting interesting keywords in UFO sighting reports, and even building a personalized job search experience. Executable code for all examples is included with the book, and several chapters are available for free at the publisher’s website.

I just got back from a fantastic trip to Dublin, Ireland for last week’s Lucene/Solr Revolution EU. I was privileged this year to present a deep dive (75 minute) session on “Enhancing Relevancy through Personalization & Semantic Search.” I appreciate all the great questions and feedback from everyone who attended.



Talk Summary: Matching keywords is just step one in the effort to maximize the relevancy of your search platform. In this talk, you’ll learn how to implement advanced relevancy techniques which enable your search platform to “learn” from your content and users’ behavior.

Topics will include automatic synonym discovery, latent semantic indexing, payload scoring, document-to-document searching, foreground vs. background corpus analysis for interesting term extraction, collaborative filtering, and mining user behavior to drive geographically and conceptually personalized search results.

You’ll learn how CareerBuilder has enhanced Solr (also utilizing Hadoop) to dynamically discover relationships between data and behavior, and how you can implement similar techniques to greatly enhance the relevancy of your search platform.
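One of the techniques listed above, foreground vs. background corpus analysis, can be sketched briefly: a term is "interesting" for a subset of documents if it occurs much more frequently there (the foreground) than in the corpus as a whole (the background). The simple smoothed ratio below is an illustrative stand-in; real implementations typically use proper statistical significance tests.

```python
from collections import Counter

def interesting_terms(foreground_docs, background_docs, min_count=2):
    """Rank terms by how over-represented they are in the foreground
    relative to the background corpus (higher ratio = more interesting)."""
    fg = Counter(t for d in foreground_docs for t in d.split())
    bg = Counter(t for d in background_docs for t in d.split())
    total_fg, total_bg = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, count in fg.items():
        if count < min_count:
            continue  # too rare in the foreground to trust
        fg_rate = count / total_fg
        bg_rate = (bg[term] + 1) / (total_bg + 1)  # add-one smoothing
        scores[term] = fg_rate / bg_rate
    return sorted(scores, key=scores.get, reverse=True)
```

Applied to, say, the documents a particular user interacts with (foreground) versus the whole index (background), the top-ranked terms characterize that user's interests and can drive personalized boosting.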

I just made it back from the beautiful, sunny city of San Diego where LucidWorks hosted another fantastic Lucene/Solr Revolution conference this week. I was invited back this year to present on “Building a Real-time, Big Data Analytics Platform with Solr.” Thank you to everyone who came and packed out the room, especially those who provided great feedback afterward and asked all of the terrific questions!


Slides: http://www.slideshare.net/treygrainger/building-a-real-time-big-data-analytics-platform-with-solr

Talk Summary: Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.

At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You’ll also get a sneak peek at some new faceting capabilities just wrapping up development, including distributed pivot facets and percentile/stats faceting, which will be open-sourced.

The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you’ll never see Solr as just a text search engine again.
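As a small taste of the faceting capabilities mentioned above, here is a sketch of the request parameters for an analytics-style Solr query combining a pivot facet with time-series (range) faceting. The field names (`city`, `job_title`, `posted_date`) are hypothetical; the parameter names are standard Solr faceting syntax.

```python
def analytics_facet_params(query="*:*"):
    """Assemble Solr query parameters for facet-driven analytics."""
    return {
        "q": query,
        "rows": 0,                        # analytics only; skip returning docs
        "facet": "true",
        "facet.pivot": "city,job_title",  # pivot: job-title counts within each city
        "facet.range": "posted_date",     # time-series faceting over a date field
        "facet.range.start": "NOW/MONTH-12MONTHS",
        "facet.range.end": "NOW/MONTH",
        "facet.range.gap": "+1MONTH",
    }
```

Because `rows` is 0, the engine does no document retrieval at all; the response is pure aggregate counts, which is what makes Solr viable as a real-time analytics engine rather than just a text search engine.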