Last month I presented a talk at Spark Summit Europe 2015 about a system I have been working on for a while. The system provides a Dictionary based Entity Recognition Microservice based on Solr, SolrTextTagger and OpenNLP. You can find the Abstract, Slides and Video for the talk here. In this post, I describe why I built it and what we are using it for.
My employer, the Reed-Elsevier (RELX) Group, is the world's leading provider of Science and Technology Information. Our charter is to build data and information solutions that help our users (usually STM researchers) achieve better results. Our group at Elsevier Labs is building a Machine Reading Pipeline to distill information from our books and journals into rich domain-specific Knowledge Graphs, that could hopefully be used to make new inferences about the state of our world.
Knowledge graphs (like any other graph) consist of vertices and edges. The vertices represent concepts in the STM universe, and the edges represent the relationships between those concepts. The concepts at the nodes may be generic, such as "surgeon", or may be specific entities such as "Dr. Jane Doe". In order to build knowledge graphs, we need a way to recognize and extract concepts and entities from the text, a process known as entity recognition.
The easiest way to get started with entity recognition is to use pre-trained statistical Named Entity Recognizers (NERs) available in off-the-shelf Natural Language Processing (NLP) libraries. However, these NERs are trained to recognize a very small and general class of entities such as names of people and places, organizations, etc. While there is value in recognizing these classes, we are typically interested in finding more specific subclasses of these classes (such as universities rather than just any organization) or completely different classes (such as protein names).
Further, STM content is very diverse. While there may be some overlap, entities of interest in one subject (say math) are typically very different from entities of interest in another (say biology). Fortunately, well-curated vocabularies exist for most STM disciplines, which we can leverage in our entity recognition efforts.
Because of this, our approach to NER is dictionary based. Dictionary-based entity matching is a process where snippets of text are matched against a dictionary of terms that represent entities. While this approach may not be as resilient to previously unseen entities as the statistical approach described earlier, it requires no manual tagging, and given enough data, achieves comparable coverage. Dictionary-based matching can also be used to create training data to build custom statistical NERs tailored for different domains, thus achieving the best of both worlds.
Dictionary-based matching techniques are usually based on the Aho-Corasick algorithm, in which the dictionary is held in a compact in-memory data structure against which input text is streamed, matching all dictionary entries simultaneously. The problem with this technique is that it breaks down for large dictionaries, since the corresponding memory requirements also become large. Duplicating the dictionary on all nodes of a Spark cluster could be difficult because of its size.
Our solution is called the Solr Dictionary Annotator (SoDA). It is a HTTP REST micro-service that allows a client to post a block of text and get back a list of annotations. Annotations are structured objects that contain the entity identifier, the matched text, the beginning and ending character offsets of the matched text in the input text block, and the confidence of the match. Clients can specify how accurate the match should be.
For exact and case-insensitive matching, SoDA piggybacks on a recent development from the Lucene community. Michael McCandless, a Lucene/Solr committer, figured out a way to build finite-state transducers (FST) with Lucene in a very memory-efficient manner, taking advantage of the fact that the index already stores terms in a sorted manner. David Smiley, another Solr committer, realized that FSTs could be used for text tagging, and built the SolrTextTagger plugin for Solr. In keeping with Lucene’s tradition of memory-efficiency and speed, he introduced some more strategies to keep the memory footprint low without significantly impacting the retrieval speed. The original dictionary used a GATE based implementation of the Aho-Corasick algorithm that needed 80GB of RAM to store the dictionary, while SolrTextTagger version consumed only 198MB.
For fuzzy matching, SoDA uses OpenNLP, another open source project, to chunk incoming text into phrases. Depending on the fuzziness of the matching desired, different analysis chains are applied to the incoming phrases, and they are matched against pre-normalized dictionary entries stored in the index. We borrow several ideas from the Python library FuzzyWuzzy from SeatGeek.
SoDA exposes a JSON over HTTP interface, so its language and platform agnostic. You compose a JSON request document containing the text to be linked and the type of matching required, and send it to the REST endpoint URL via HTTP POST (some parameterless services like the status service are accessible over HTTP GET). The server responds with another JSON document containing the entities found in the text and metadata around these entities.
At its very core, SoDA is a Spring/Scala based web application that exposes a JSON over HTTP interface on the front end and communicates with a Solr index on the back end. A variety of matching strategies are supported, from exact and case-insensitive matching to completely fuzzy matching. The diagram below shows the components that make up the SoDA application. The client is a Spark Notebook in the Databricks cloud, where the rest of our NLP pipeline is also.
SolrTextTagger is used to serve the exact case-sensitive and case-insensitive entity matches, and OpenNLP is used to chunk incoming text to match against the underlying Solr index for the fuzzy matches. Horizontal scalability (with linear increase in throughput) is achieved by duplicating the component and putting them behind a load balancer.
Our experiments indicate that we can achieve a sustained annotation rate of 30-35 docs/second against a dictionary with 8M+ entries, where each document is about 100MB on average, with SoDA and Solr running on 2 r3.2xlarge machines behind a load balancer. We have been using SoDA for a few months now, and it has already proven itself as a useful component in our pipeline.
My employer has been kind enough to allow me to release SoDA to the open source community. Its available at GitHub here under an Apache 2.0 license. If you are looking for Dictionary based Entity Recognition functionality and you liked what you read so far, I encourage you to download it and give it a try. I look forward to hearing your feedback.