Saturday, March 26, 2011

Smart Query Parsing with UIMA

Lately, I have been looking at Named Entity Recognition (NER). As I see it, NER can be used to improve the search experience in various ways. First, NER can be incorporated into a custom Lucene analyzer, so "known" entities are protected from stemming, both during indexing and search. Second, NER can be used to parse a query string into an intelligent boolean multi-field query.

The two frameworks I looked at for this were the General Architecture for Text Engineering (GATE) and the Apache Unstructured Information Management Architecture (UIMA). GATE is a huge and comprehensive framework; it took me a while to get my head around it, and I still don't think I got it all. During this time I happened to stumble across UIMA, and I liked the fact that it is all Java (compared to GATE's JAPE, as powerful as it is) and the way it is built (small components roll up nicely into larger components, compared to GATE's Language/Processing Resources approach). Maybe it's just me, but I felt that GATE is aimed more towards linguists (many prebuilt components, but relatively harder to build your own) and UIMA towards programmers (relatively fewer components, but a well-defined API for people to build their own fairly easily).

Anyway, I decided to get familiar with the UIMA API by solving a toy problem. Assume a website which allows searching for names of people and organizations, with optional (and partial) addresses to narrow the search. Behind the scenes, assume an index which stores city, state and zip code as separate indexed fields. The query string is parsed using a UIMA aggregate analysis engine (AE) composed of a pipeline of three primitive AEs, which parse the zip code, state and city respectively. The end result of the analysis is the term, along with token offset information, for each of these entities. I haven't gone as far as the query parser (a CAS Consumer in UIMA), so in this post I show the various descriptors and the annotator code that parse the query string and extract the entities from it.

UIMA Background

For those not familiar with UIMA, it is a framework originally developed by IBM and donated to Apache, where it was incubated and is now a top-level project. For details, you should refer to the UIMA Tutorial and Developer's Guide, but if you want a really quick (and possibly incomplete) tour, here it is. The basic building block is a primitive Analysis Engine (AE). Each primitive AE needs an annotation type and an annotator. The type is defined in an XML file, and a tool called JCasGen is used to generate the POJOs representing the type and annotation. The annotator is written next, and an XML descriptor is created for it; the framework instantiates the annotator from this AE descriptor. Aggregate AEs are also defined as XML files, and chain primitive AEs together.

UIMA comes with an Eclipse plug-in, which provides tools to build the XML using fill-in forms. It's probably advisable to use it, because the XML is quite complex, at least initially.

Zip Code Annotator

The Zip Code Annotator uses a regular expression to find zip codes in the input text. As mentioned before, it needs a ZipCode type, which is defined by the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/zipcode/ZipCode.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>ZipCode</name>
  <description>Defines the zipcode type</description>
  <version>1.0</version>
  <vendor>MyCompany, Inc.</vendor>
  <types>
    <typeDescription>
      <name>com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotation</name>
      <description>ZipCode</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>

Running JCasGen creates ZipCodeAnnotation_Type.java and ZipCodeAnnotation.java (the annotation class name is specified in the XML file). We then write the annotator, which looks like this:

// Source: src/main/java/com/mycompany/myapp/uima/annotators/zipcode/ZipCodeAnnotator.java
package com.mycompany.myapp.uima.annotators.zipcode;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

public class ZipCodeAnnotator extends JCasAnnotator_ImplBase {

  // match a 5-digit zip code, optionally followed by a +4 extension
  private Pattern zipCodePattern = Pattern.compile("\\d{5}(-\\d{4})?");
  
  @Override
  public void process(JCas jCAS) throws AnalysisEngineProcessException {
    String text = jCAS.getDocumentText();
    Matcher matcher = zipCodePattern.matcher(text);
    int pos = 0;
    while (matcher.find(pos)) {
      ZipCodeAnnotation annotation = new ZipCodeAnnotation(jCAS);
      annotation.setBegin(matcher.start());
      annotation.setEnd(matcher.end());
      annotation.addToIndexes();
      pos = matcher.end();
    }
  }
}

Finally, we build a component descriptor for the annotator as shown below:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/zipcode/ZipCodeAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>Zip Code Annotator</name>
    <description>Recognize and annotate zip code in text</description>
    <version>1.0</version>
    <vendor>MyCompany, Inc.</vendor>
    <configurationParameters></configurationParameters>
    <configurationParameterSettings></configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="ZipCode.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities></typePriorities>
    <fsIndexCollection></fsIndexCollection>
    <capabilities>
      <capability>
        <inputs></inputs>
        <outputs>
          <type>com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotation</type>
        </outputs>
        <languagesSupported></languagesSupported>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies></externalResourceDependencies>
  <resourceManagerConfiguration></resourceManagerConfiguration>
</analysisEngineDescription>

For each annotator, I build a unit test to make sure it functions properly. To keep the size of the post down, I will show the unit test for only the aggregate AE I create out of these primitives. The beauty of UIMA is that the Java code to call and run an aggregate AE is the same as that for a primitive AE.
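
Just to illustrate (since the primitive tests themselves are not shown), a test for the Zip Code AE on its own might look something like the sketch below. The test string is made up, and the same UIMA calls show up again later in the TestUtils class.

// Sketch of a unit test for the primitive Zip Code AE. The test string
// is illustrative.
package com.mycompany.myapp.uima.annotators.zipcode;

import java.util.Iterator;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.util.XMLInputSource;
import org.junit.Test;

public class ZipCodeAnnotatorTest {

  @Test
  public void testZipCodeAE() throws Exception {
    // parse the AE descriptor and instantiate the engine
    XMLInputSource in = new XMLInputSource(
      "src/main/java/com/mycompany/myapp/uima/annotators/zipcode/ZipCodeAE.xml");
    AnalysisEngineDescription desc = 
      UIMAFramework.getXMLParser().parseAnalysisEngineDescription(in);
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc);
    // run it over a test string
    JCas jcas = ae.newJCas();
    jcas.setDocumentText("Cupertino, CA 95014");
    ae.process(jcas);
    // print whatever was annotated
    FSIndex index = jcas.getAnnotationIndex(ZipCodeAnnotation.type);
    for (Iterator it = index.iterator(); it.hasNext(); ) {
      Annotation annotation = (Annotation) it.next();
      System.out.println(annotation.getCoveredText() + " " +
        annotation.getBegin() + ":" + annotation.getEnd());
    }
  }
}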

City Annotator

The city annotator follows a slightly different approach. Rather than use a regular expression, it uses a list of US cities stored in a database table. When the AE is instantiated, UIMA calls its initialize() method, which loads the list into an in-memory Set. The text is passed through a Lucene ShingleFilter, and the generated tokens are matched against the contents of the set. There is an additional tweak to remove city tokens that are subsumed within longer city tokens: for example, if both "Brunswick" and "South Brunswick" are recognized and the first lies within the second, the first token is removed.
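
To make the shingle-based lookup concrete, the tiny standalone sketch below (using the same Lucene 2.x analysis classes as the annotator that follows) prints the shingles generated for a sample string; each of these would be probed against the city set.

// Sketch: print the shingles (up to 3 words) produced for a sample string.
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ShingleDemo {

  public static void main(String[] args) throws Exception {
    TokenStream stream = new ShingleFilter(
      new LowerCaseFilter(new WhitespaceTokenizer(
        new StringReader("ann arbor mi"))), 3);
    TermAttribute term = stream.getAttribute(TermAttribute.class);
    OffsetAttribute offset = stream.getAttribute(OffsetAttribute.class);
    while (stream.incrementToken()) {
      // prints shingles such as "ann", "ann arbor", "ann arbor mi",
      // "arbor", "arbor mi" and "mi", with their character offsets
      System.out.println(term.term() + " " +
        offset.startOffset() + ":" + offset.endOffset());
    }
  }
}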

As before, we need an annotation type and an annotator. The XML descriptor for the type is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/city/City.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>City</name>
  <description>US Cities</description>
  <version>1.0</version>
  <vendor>MyCompany, Inc.</vendor>
  <types>
    <typeDescription>
      <name>com.mycompany.myapp.uima.annotators.city.CityAnnotation</name>
      <description>US City</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>

We then run JCasGen to generate the Type and Annotation classes, and write the City Annotator, the code for which is shown below:

// Source: src/main/java/com/mycompany/myapp/uima/annotators/city/CityAnnotator.java
package com.mycompany.myapp.uima.annotators.city;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.myapp.utils.DbUtils;

public class CityAnnotator extends JCasAnnotator_ImplBase {

  private static final int MAX_SHINGLE_SIZE = 3;
  
  private Set<String> cityNames = new HashSet<String>();
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    try {
      List<Map<String,Object>> rows = DbUtils.queryForList(
        "select name from us_cities", null);
      for (Map<String,Object> row : rows) {
        cityNames.add(StringUtils.lowerCase((String) row.get("name")));
      }
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @SuppressWarnings("unchecked")
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    text = text.replaceAll("\\p{Punct}", " ");
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(
      new StringReader(text));
    TokenStream tokenStream = new LowerCaseFilter(tokenizer);
    tokenStream = new ShingleFilter(tokenStream, MAX_SHINGLE_SIZE);
    try {
      while (tokenStream.incrementToken()) {
        TermAttribute term = tokenStream.getAttribute(TermAttribute.class);
        OffsetAttribute offset = tokenStream.getAttribute(OffsetAttribute.class);
        String shingle = term.term();
        if (cityNames.contains(shingle)) {
          CityAnnotation annotation = new CityAnnotation(jcas);
          annotation.setBegin(offset.startOffset());
          annotation.setEnd(offset.endOffset());
          annotation.addToIndexes();
        }
      }
      // remove city entities that are subsumed within other
      // city entities (such as Concord => North Concord, we
      // should prefer the longer match).
      // NOTE: this is an O(n**2) operation! If there are large
      // number of annotations, then this can be expensive
      FSIndex index = jcas.getAnnotationIndex(CityAnnotation.type);
      for (Iterator<CityAnnotation> it1 = index.iterator(); it1.hasNext(); ) {
        CityAnnotation ca1 = it1.next();
        Range r1 = new IntRange(ca1.getBegin(), ca1.getEnd());
        for (Iterator<CityAnnotation> it2 = index.iterator(); it2.hasNext(); ) {
          CityAnnotation ca2 = it2.next();
          if (ca1.getAddress() == ca2.getAddress()) {
            continue;
          }
          Range r2 = new IntRange(ca2.getBegin(), ca2.getEnd());
          if (r1.containsRange(r2)) {
            ca2.removeFromIndexes();
          }
        }
      }
    } catch (IOException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }
}
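
The DbUtils helper referenced above is not part of this post. In case you are wondering what it does, here is a minimal sketch of a queryForList() built on plain JDBC; the JDBC URL and credentials are placeholders, and the real helper may well be implemented differently (for example, on top of Spring's JdbcTemplate).

// Sketch of the (not shown) DbUtils.queryForList() helper, using plain
// JDBC. The JDBC URL and credentials below are placeholders.
package com.mycompany.myapp.utils;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DbUtils {

  private static final String JDBC_URL = "jdbc:mysql://localhost:3306/mydb";
  private static final String JDBC_USER = "dbuser";
  private static final String JDBC_PASS = "dbpass";

  public static List<Map<String,Object>> queryForList(
      String sql, Object[] params) throws Exception {
    List<Map<String,Object>> rows = new ArrayList<Map<String,Object>>();
    Connection conn = DriverManager.getConnection(
      JDBC_URL, JDBC_USER, JDBC_PASS);
    try {
      PreparedStatement ps = conn.prepareStatement(sql);
      if (params != null) {
        for (int i = 0; i < params.length; i++) {
          ps.setObject(i + 1, params[i]);
        }
      }
      ResultSet rs = ps.executeQuery();
      ResultSetMetaData meta = rs.getMetaData();
      while (rs.next()) {
        // key each column by its lowercased name, so callers can do
        // row.get("name"), row.get("abbr"), etc.
        Map<String,Object> row = new HashMap<String,Object>();
        for (int col = 1; col <= meta.getColumnCount(); col++) {
          row.put(meta.getColumnName(col).toLowerCase(), rs.getObject(col));
        }
        rows.add(row);
      }
      rs.close();
      ps.close();
    } finally {
      conn.close();
    }
    return rows;
  }
}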

And finally, there is the XML descriptor for the City AE, which is shown below:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/city/CityAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.myapp.uima.annotators.city.CityAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>City Annotator</name>
    <description>Recognize and annotate city names in text</description>
    <version>1.0</version>
    <vendor>MyCompany, Inc.</vendor>
    <configurationParameters></configurationParameters>
    <configurationParameterSettings></configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="City.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities></typePriorities>
    <fsIndexCollection></fsIndexCollection>
    <capabilities>
      <capability>
        <inputs></inputs>
        <outputs>
          <type>com.mycompany.myapp.uima.annotators.city.CityAnnotation</type>
        </outputs>
        <languagesSupported></languagesSupported>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies></externalResourceDependencies>
  <resourceManagerConfiguration></resourceManagerConfiguration>
</analysisEngineDescription>

State Annotator

The state annotator uses a combination of pattern matching and dictionary lookup to handle both state abbreviations and full state names. Since the addresses in our (hypothetical) index contain states as abbreviations, we add the abbreviation as an attribute of the annotated state names. The code first searches for two-letter uppercase patterns (CA, OR, etc) and looks them up against a set of state abbreviations. It then shingles the input and looks up the shingles against a map of state names. Both lookup structures are built from data in a database table that is loaded into memory in the initialize() method.

Here is the XML descriptor for the State type. We have defined the "abbreviation" feature here, which triggers creation of getters and setters in the StateAnnotation POJO.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/state/State.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>State</name>
  <description>US State</description>
  <version>1.0</version>
  <vendor>MyCompany, Inc.</vendor>
  <types>
    <typeDescription>
      <name>com.mycompany.myapp.uima.annotators.state.StateAnnotation</name>
      <description>US State</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>abbreviation</name>
          <description>State Abbreviation</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>

The code for the State Annotator is shown below.

// Source: src/main/java/com/mycompany/myapp/uima/annotators/state/StateAnnotator.java
package com.mycompany.myapp.uima.annotators.state;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.myapp.utils.DbUtils;

public class StateAnnotator extends JCasAnnotator_ImplBase {

  // match standalone two-letter uppercase tokens (CA, OR, etc)
  private static final Pattern STATE_PATTERN = 
    Pattern.compile("\\b[A-Z]{2}\\b");
  private static final int SHINGLE_SIZE = 2;
  
  private Set<String> stateAbbrs = new HashSet<String>();
  private Map<String,String> stateNames = new HashMap<String,String>();
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    try {
      List<Map<String,Object>> rows = DbUtils.queryForList(
        "select abbr, name from us_states", null);
      for (Map<String,Object> row : rows) {
        stateAbbrs.add((String) row.get("abbr"));
        stateNames.put(
          StringUtils.lowerCase((String) row.get("name")),
          (String) row.get("abbr"));
      }
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    // look for words with two uppercase letters, then check
    // against the abbreviation set
    Matcher stateAbbrMatcher = STATE_PATTERN.matcher(text);
    int pos = 0;
    while (stateAbbrMatcher.find(pos)) {
      int start = stateAbbrMatcher.start();
      int end = stateAbbrMatcher.end();
      String abbr = text.substring(start, end);
      if (stateAbbrs.contains(abbr)) {
        StateAnnotation annotation = new StateAnnotation(jcas);
        annotation.setBegin(start);
        annotation.setEnd(end);
        annotation.setAbbreviation(abbr);
        annotation.addToIndexes();
      }
      pos = stateAbbrMatcher.end();
    }
    // now look for one and two word shingles, looking for a match
    // against the state names
    // preprocess the text so we remove punctuation
    text = text.replaceAll("\\p{Punct}", " ");
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(
      new StringReader(text));
    TokenStream tokenStream = new LowerCaseFilter(tokenizer);
    tokenStream = new ShingleFilter(tokenStream, SHINGLE_SIZE);
    try {
      while (tokenStream.incrementToken()) {
        TermAttribute term = tokenStream.getAttribute(TermAttribute.class);
        OffsetAttribute offset = tokenStream.getAttribute(OffsetAttribute.class);
        String shingle = term.term();
        if (stateNames.containsKey(shingle)) {
          StateAnnotation annotation = new StateAnnotation(jcas);
          annotation.setBegin(offset.startOffset());
          annotation.setEnd(offset.endOffset());
          annotation.setAbbreviation(stateNames.get(shingle));
          annotation.addToIndexes();
        }
      }
    } catch (IOException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }
}

And finally, the XML descriptor for the State AE. The abbreviation feature has to be declared in this descriptor as well.

<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/state/StateAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.myapp.uima.annotators.state.StateAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>US States Annotator</name>
    <description>Annotate state abbreviations and names in text</description>
    <version>1.0</version>
    <vendor>MyCompany, Inc.</vendor>
    <configurationParameters></configurationParameters>
    <configurationParameterSettings></configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="State.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities></typePriorities>
    <fsIndexCollection></fsIndexCollection>
    <capabilities>
      <capability>
        <inputs></inputs>
        <outputs>
          <type>com.mycompany.myapp.uima.annotators.state.StateAnnotation</type>
          <feature>com.mycompany.myapp.uima.annotators.state.StateAnnotation:abbreviation</feature>
        </outputs>
        <languagesSupported></languagesSupported>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies></externalResourceDependencies>
  <resourceManagerConfiguration></resourceManagerConfiguration>
</analysisEngineDescription>

Putting it all together - Unit test

As mentioned before, each AE has its own unit tests to make sure it works. Unit tests are especially important in this kind of setup, because a real-life aggregate AE pipeline will consist of a set of co-operating primitive and aggregate AEs. Since there are likely to be inter-dependencies, unit tests are a way to ensure that new functionality does not break something that used to work before the change. Of course, you should use Assert.assertXXX() calls instead of the System.out.println() calls I am using here; a sketch of what that might look like is shown below.
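
As an illustration of the assertion-based style, here is a sketch of a test against the aggregate AE described next. The class name is made up, and the expected values are simply taken from the Cupertino test string used further down.

// Sketch of an assertion-based test for the aggregate AddressAE,
// using the TestUtils helpers shown later. The class name is made up;
// the expected values come from the Cupertino test string below.
package com.mycompany.myapp.uima.annotators.aggregates;

import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.junit.Assert;
import org.junit.Test;

import com.mycompany.myapp.uima.TestUtils;
import com.mycompany.myapp.uima.annotators.city.CityAnnotation;
import com.mycompany.myapp.uima.annotators.state.StateAnnotation;
import com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotation;

public class AddressAEAssertionTest {

  @Test
  public void testCupertinoAddress() throws Exception {
    AnalysisEngine ae = TestUtils.getAE(
      "src/main/java/com/mycompany/myapp/uima/annotators/aggregates/AddressAE.xml",
      null);
    JCas jcas = TestUtils.runAE(ae, 
      "Apple, 1 Infinite Loop, Cupertino, CA 95014");
    // exactly one city, one state and one zip code should be annotated
    FSIndex cities = jcas.getAnnotationIndex(CityAnnotation.type);
    Assert.assertEquals(1, cities.size());
    Assert.assertEquals("Cupertino",
      ((CityAnnotation) cities.iterator().next()).getCoveredText());
    FSIndex states = jcas.getAnnotationIndex(StateAnnotation.type);
    Assert.assertEquals(1, states.size());
    Assert.assertEquals("CA",
      ((StateAnnotation) states.iterator().next()).getAbbreviation());
    FSIndex zips = jcas.getAnnotationIndex(ZipCodeAnnotation.type);
    Assert.assertEquals(1, zips.size());
    Assert.assertEquals("95014",
      ((ZipCodeAnnotation) zips.iterator().next()).getCoveredText());
  }
}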

So I created an aggregate AE which recognizes tokens in address snippets, and I call it the AddressAE. There is no Java code for this AE, only an XML descriptor that chains the previous primitive AEs together. Here it is:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/aggregates/AddressAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="ZipCode">
      <import location="../zipcode/ZipCodeAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="State">
      <import location="../state/StateAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="City">
      <import location="../city/CityAE.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>AddressAE</name>
    <description>Runs the delegate AEs together</description>
    <version>1.0</version>
    <vendor>MyCompany, Inc.</vendor>
    <flowConstraints>
      <fixedFlow>
        <node>ZipCode</node>
        <node>State</node>
        <node>City</node>
      </fixedFlow>
    </flowConstraints>
    <configurationParameters></configurationParameters>
    <configurationParameterSettings></configurationParameterSettings>
    <fsIndexCollection></fsIndexCollection>
    <capabilities>
      <capability>
        <inputs></inputs>
        <outputs>
          <type allAnnotatorFeatures="true">
            com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotation
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.myapp.uima.annotators.state.StateAnnotation
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.myapp.uima.annotators.city.CityAnnotation
          </type>
        </outputs>
        <languagesSupported>en</languagesSupported>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies></externalResourceDependencies>
  <resourceManagerConfiguration></resourceManagerConfiguration>
</analysisEngineDescription>

I create a TestUtils class which exposes static methods to get an AE from the UIMA framework given its XML descriptor, to run the AE over a piece of text, and to print the results. This code was derived from JUnit test code in the AlchemyAPI UIMA sandbox component.

// Source: src/test/java/com/mycompany/myapp/uima/TestUtils.java
package com.mycompany.myapp.uima;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.cas.Feature;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.InvalidXMLException;
import org.apache.uima.util.ProcessTrace;
import org.apache.uima.util.ProcessTraceEvent;
import org.apache.uima.util.XMLInputSource;

public class TestUtils {

  public static AnalysisEngine getAE(
      String descriptor, Map<String,Object> params) 
      throws IOException, InvalidXMLException, 
      ResourceInitializationException {
    AnalysisEngine ae = null;
    try {
      XMLInputSource in = new XMLInputSource(descriptor);
      AnalysisEngineDescription desc = 
        UIMAFramework.getXMLParser().
        parseAnalysisEngineDescription(in);
      if (params != null) {
        for (String key : params.keySet()) {
          desc.getAnalysisEngineMetaData().
          getConfigurationParameterSettings().
          setParameterValue(key, params.get(key));
        }
      }
      ae = UIMAFramework.produceAnalysisEngine(desc);
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
    return ae;
  }
  
  public static JCas runAE(AnalysisEngine ae, String text)
      throws AnalysisEngineProcessException, 
      ResourceInitializationException {
    JCas jcas = ae.newJCas();
    jcas.setDocumentText(text);
    ProcessTrace trace = ae.process(jcas);
    for (ProcessTraceEvent evt : trace.getEvents()) {
      if (evt != null && evt.getResultMessage() != null &&
          evt.getResultMessage().contains("error")) {
        throw new AnalysisEngineProcessException(
          new Exception(evt.getResultMessage()));
      }
    }
    return jcas;
  }
  
  public static void printResults(JCas jcas) {
    FSIndex index = jcas.getAnnotationIndex();
    for (Iterator<Annotation> it = index.iterator(); it.hasNext(); ) {
      Annotation annotation = it.next();
      List<Feature> features = new ArrayList<Feature>();
      if (annotation.getType().getName().contains("com.mycompany")) {
        features = annotation.getType().getFeatures();
      }
      List<String> fasl = new ArrayList<String>();
      for (Feature feature : features) {
        if (feature.getName().contains("com.mycompany")) {
          String name = feature.getShortName();
          String value = annotation.getStringValue(feature);
          fasl.add(name + "=\"" + value + "\"");
        }
      }
      System.out.println(
        annotation.getType().getShortName() + ": " +
        annotation.getCoveredText() + " " +
        (fasl.size() > 0 ? StringUtils.join(fasl.iterator(), ",") : "") + " " +
        annotation.getBegin() + ":" + annotation.getEnd());
    }
    System.out.println("==");
  }
}

The JUnit test for the AddressAE is simple (and follows the same pattern as the JUnit test cases for the primitive AEs). Here it is:

// Source: src/test/java/com/mycompany/myapp/uima/annotators/aggregates/AddressAnnotatorTest.java

package com.mycompany.myapp.uima.annotators.aggregates;

import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.jcas.JCas;
import org.junit.Test;

import com.mycompany.myapp.uima.TestUtils;

public class AddressAnnotatorTest {

  private final String[] TEST_STRINGS = new String[] {
    "Dr Goldwater, University of Michigan, Ann Arbor, MI 01234",
    "Microsoft, 1 Microsoft Way, Redmond, WA",
    "Apple, 1 Infinite Loop, Cupertino, CA 95014",
    "IBM, 1 New Orchard Road, Armonk, NY 10504",
    "Google, 1600 Amphitheater Parkway, Mountain View, CA 94043",
    "Healthline, 600 3rd Street, San Francisco, CA 94107",
    "Jane Doe, Lake Tahoe, California",
    "Miss Liberty, Empire State Building, New York, NY"
  };
  
  @Test
  public void testAddressAE() throws Exception {
    AnalysisEngine ae = TestUtils.getAE(
      "src/main/java/com/mycompany/myapp/uima/annotators/aggregates/AddressAE.xml", 
      null);
    for (String text : TEST_STRINGS) {
      JCas jcas = TestUtils.runAE(ae, text);
      TestUtils.printResults(jcas);
    }
  }
}

And here are the results of this test. Each test string is treated as a Document by UIMA, so that is the first line of each block. Below it are the annotations produced by each of the primitive AEs described above. I also report the begin and end offsets along with the annotated text, in case I ever want to produce a Lucene tokenizer out of this. The next step is to create multi-field Lucene queries that query individual fields in the index; a rough sketch of what that could look like follows the results below.

DocumentAnnotation: Dr Goldwater, University of Michigan, Ann Arbor, MI 01234  0:57
StateAnnotation: Michigan abbreviation="MI" 28:36
CityAnnotation: Ann Arbor  38:47
StateAnnotation: MI abbreviation="MI" 49:51
ZipCodeAnnotation: 01234  52:57

DocumentAnnotation: Microsoft, 1 Microsoft Way, Redmond, WA  0:39
CityAnnotation: Redmond  28:35
StateAnnotation: WA abbreviation="WA" 37:39

DocumentAnnotation: Apple, 1 Infinite Loop, Cupertino, CA 95014  0:43
CityAnnotation: Cupertino  24:33
StateAnnotation: CA abbreviation="CA" 35:37
ZipCodeAnnotation: 95014  38:43

DocumentAnnotation: IBM, 1 New Orchard Road, Armonk, NY 10504  0:41
CityAnnotation: Armonk  25:31
StateAnnotation: NY abbreviation="NY" 33:35
ZipCodeAnnotation: 10504  36:41

DocumentAnnotation: Google, 1600 Amphitheater Parkway, Mountain View, CA 94043  0:58
StateAnnotation: CA abbreviation="CA" 50:52
ZipCodeAnnotation: 94043  53:58

DocumentAnnotation: Healthline, 600 3rd Street, San Francisco, CA 94107  0:51
CityAnnotation: San Francisco  28:41
StateAnnotation: CA abbreviation="CA" 43:45
ZipCodeAnnotation: 94107  46:51

DocumentAnnotation: Jane Doe, Lake Tahoe, California  0:32
StateAnnotation: California abbreviation="CA" 22:32

DocumentAnnotation: Miss Liberty, Empire State Building, New York, NY  0:49
StateAnnotation: New York abbreviation="NY" 37:45
CityAnnotation: New York  37:45
StateAnnotation: NY abbreviation="NY" 47:49
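
As a rough illustration of that next step, here is a sketch of how the annotations above might be mapped onto a multi-field Lucene BooleanQuery. The field names (city, state, zipcode) are hypothetical, the city field is assumed to be indexed lowercased, and the leftover text (names of people and organizations) is not handled here.

// Sketch: map the annotations in a parsed query CAS onto a multi-field
// Lucene BooleanQuery. The field names are hypothetical, and the
// remaining (non-address) text is not handled here.
package com.mycompany.myapp.uima.consumers;

import java.util.Iterator;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

import com.mycompany.myapp.uima.annotators.city.CityAnnotation;
import com.mycompany.myapp.uima.annotators.state.StateAnnotation;
import com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotation;

public class AddressQueryBuilder {

  public BooleanQuery buildQuery(JCas jcas) {
    BooleanQuery query = new BooleanQuery();
    // each recognized city narrows the search on the (lowercased) city field
    FSIndex cities = jcas.getAnnotationIndex(CityAnnotation.type);
    for (Iterator it = cities.iterator(); it.hasNext(); ) {
      Annotation city = (Annotation) it.next();
      query.add(new TermQuery(new Term("city",
        city.getCoveredText().toLowerCase())), Occur.MUST);
    }
    // states are indexed as abbreviations, so use the abbreviation feature
    FSIndex states = jcas.getAnnotationIndex(StateAnnotation.type);
    for (Iterator it = states.iterator(); it.hasNext(); ) {
      StateAnnotation state = (StateAnnotation) it.next();
      query.add(new TermQuery(new Term("state",
        state.getAbbreviation())), Occur.MUST);
    }
    FSIndex zips = jcas.getAnnotationIndex(ZipCodeAnnotation.type);
    for (Iterator it = zips.iterator(); it.hasNext(); ) {
      Annotation zip = (Annotation) it.next();
      query.add(new TermQuery(new Term("zipcode",
        zip.getCoveredText())), Occur.MUST);
    }
    return query;
  }
}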

Conclusion

As you can see, UIMA provides a nice framework for NER, allowing you to manually specify tokens that should be protected. All the programmer has to do is specify the algorithms by which the tokens should be recognized. If you look at the results, though, there is still quite a lot of room for improvement. For example, Michigan in "University of Michigan" is recognized as a state, which points to the need to recognize university names as entities in their own right. Also, "New York" is recognized both as a city and a state, which points to the need for the city and state annotators to be aware of each other (i.e., a city and a state are usually collocated).

There is obviously much more to UIMA than this. I plan on taking a look at the UIMA sandbox components, either using some of them as-is, or leveraging the ideas in there to make my code smarter.

Friday, March 11, 2011

Using Lucene's new QueryParser framework in Solr

Some time back, I described how I built (among other things) a custom Solr QParser plugin to handle Payload Term Queries. Looking back on this recently, I realized how limited it was - all it could handle were single Payload Term Queries, and one-level-deep AND and OR combinations of these queries. More to the point, I discovered that I had to support queries of this form:

+concepts:123456 +(concepts:234567 concepts:345678 ...)

The original parsing code simply split the query on whitespace, then on the colon, and depending on whether the term was preceded by a "+" sign, added it to the BooleanQuery as an Occur.MUST or Occur.SHOULD clause. Obviously, this approach cannot parse a query of the form shown above.

Coincidentally, a few days ago, I was hunting around for something completely different on my laptop, and I came across the QueryParser Lucene contrib module, which replaces the original JavaCC-based Lucene QueryParser with a nice little framework that splits query parsing into three phases - syntax parsing, query node processing and query building. It has been available since Lucene 2.9.0, and in the version I am using (Lucene 2.9.3/Solr 1.4.1) both QueryParser implementations are supported.

In my case, the Payload Query syntax is identical to the Term Query syntax, so all I really needed to do was return a PayloadTermQuery instead of a TermQuery in the query building phase. In other words, to get a robust Payload QueryParser, I just had to implement a custom query builder and plug it into this framework.

There is not much documentation available on how to use the framework, though, apart from the Javadocs, and the advice there is to take a look at the StandardQueryParser and use it as a template to design your own. So that is what I did. I ended up building a few more classes in order to integrate it into my custom QParser plugin, but it was really quite simple.

Here is the updated code for my QParser plugin. Apart from this code change, all I had to do was add lucene-queryparser-2.9.3.jar to the Solr classpath. There is no change to its configuration or to the associated Solr request handler I use it from - both are described in the previous post referred to above.

// $Source: src/java/org/apache/solr/search/ext/PayloadQParserPlugin.java
package org.apache.solr.search.ext;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.core.QueryNodeException;
import org.apache.lucene.queryParser.core.QueryParserHelper;
import org.apache.lucene.queryParser.core.nodes.FieldQueryNode;
import org.apache.lucene.queryParser.core.nodes.QueryNode;
import org.apache.lucene.queryParser.standard.builders.StandardQueryBuilder;
import org.apache.lucene.queryParser.standard.builders.StandardQueryTreeBuilder;
import org.apache.lucene.queryParser.standard.config.StandardQueryConfigHandler;
import org.apache.lucene.queryParser.standard.parser.StandardSyntaxParser;
import org.apache.lucene.queryParser.standard.processors.StandardQueryNodeProcessorPipeline;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

/**
 * Parser plugin to parse payload queries.
 */
public class PayloadQParserPlugin extends QParserPlugin {

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    return new PayloadQParser(qstr, localParams, params, req);
  }

  public void init(NamedList args) {
    // do nothing
  }
}

class PayloadQParser extends QParser {

  public PayloadQParser(String qstr, SolrParams localParams, 
      SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

  @Override
  public Query parse() throws ParseException {
    PayloadQueryParser parser = new PayloadQueryParser();
    try {
      Query q = (Query) parser.parse(qstr, "concepts");
      return q;
    } catch (QueryNodeException e) {
      throw new ParseException(e.getMessage());
    }
  }
}

class PayloadQueryParser extends QueryParserHelper {
  
  public PayloadQueryParser() {
    super(new StandardQueryConfigHandler(), new StandardSyntaxParser(),
      new StandardQueryNodeProcessorPipeline(null),
      new PayloadQueryTreeBuilder());
  }
}

class PayloadQueryTreeBuilder extends StandardQueryTreeBuilder {
  
  public PayloadQueryTreeBuilder() {
    super();
    setBuilder(FieldQueryNode.class, new PayloadQueryNodeBuilder());
  }
}

class PayloadQueryNodeBuilder implements StandardQueryBuilder {
  
  @Override
  public PayloadTermQuery build(QueryNode queryNode) throws QueryNodeException {
    FieldQueryNode node = (FieldQueryNode) queryNode;
    return new PayloadTermQuery(
      new Term(node.getFieldAsString(), node.getTextAsString()),
      new AveragePayloadFunction(), false);
  }
}

As you can see, in my QParser.parse() method I instantiate a PayloadQueryParser, which is a subclass of QueryParserHelper. I reuse the same constructor arguments as StandardQueryParser (another subclass of QueryParserHelper, and my template), except that I pass in a custom query tree builder - the PayloadQueryTreeBuilder. PayloadQueryTreeBuilder subclasses StandardQueryTreeBuilder, but redefines which builder to use for FieldQueryNode types; StandardQueryTreeBuilder is essentially a factory that delegates to the appropriate QueryBuilder depending on the type of the node. Finally, PayloadQueryNodeBuilder implements the StandardQueryBuilder interface (similar to FieldQueryNodeBuilder), and its build() method produces a PayloadTermQuery instead of the TermQuery that FieldQueryNodeBuilder produces.

And that's pretty much it. I tested this by hitting the /concept-search URL and verifying that the queries are correctly parsed, by printing the parsed queries in the log.
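
If you want a quick sanity check outside of Solr, the parser can also be exercised directly, along the lines of the sketch below; the query string and default field are just examples.

// Sketch: exercise PayloadQueryParser outside Solr and print the
// resulting query. The query string and default field are examples.
package org.apache.solr.search.ext;

import org.apache.lucene.search.Query;

public class PayloadQueryParserDemo {

  public static void main(String[] args) throws Exception {
    PayloadQueryParser parser = new PayloadQueryParser();
    Query q = (Query) parser.parse(
      "+concepts:123456 +(concepts:234567 concepts:345678)", "concepts");
    // expect a BooleanQuery whose term clauses are PayloadTermQuery objects
    System.out.println(q.getClass().getName() + ": " + q);
  }
}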

Hopefully this post was useful, if for nothing else than to make people aware of the new QueryParser framework so they can begin to use it. The customization I did here is pretty trivial in terms of code, but it saved me a lot of work.