Saturday, June 25, 2011

Running a UIMA Analysis Engine in a Lucene Analyzer Chain

Last week, I wrote about a UIMA Aggregate Analysis Engine (AE) that annotates keywords in a body of text, optionally inserting synonyms, using a combination of pattern matching and dictionary lookups. The idea is that this analysis will be done on text on its way into a Lucene index. So this week, I describe the Lucene analyzer chain that I built around that AE.

A picture is worth a thousand words, so here is one that shows what I am (or will be soon, in much greater detail) talking about.


As you can imagine, most of the work happens in the UimaAETokenizer. The tokenizer is a buffering (non-streaming) Tokenizer, i.e., the entire text is read from the Reader and analyzed by the UIMA AE up front, and individual tokens are then returned on successive calls to its incrementToken() method. I decided to use the (new to me) AttributeSource.State object to keep track of the tokenizer's state between calls to incrementToken() (I found out about it while working through the synonym filter example in the LIA2 book).

After (UIMA) analysis, the annotated tokens are marked as keywords, and any transformed values for the annotations are set into the SynonymMap (for use by the SynonymFilter, next in the chain). Text that is not annotated is split up (by punctuation and whitespace) and returned as plain Lucene Term (or CharTerm, since Lucene 3.x) tokens. Here is the code for the Tokenizer class.

// Source: src/main/java/com/mycompany/tgni/lucene/UimaAETokenizer.java
package com.mycompany.tgni.lucene;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation;
import com.mycompany.tgni.uima.utils.UimaUtils;

/**
 * Tokenizes a block of text from the passed in reader and
 * annotates it with the specified UIMA Analysis Engine. Terms
 * in the text that are not annotated by the Analysis Engine
 * are split on whitespace and punctuation. Attributes available:
 * CharTermAttribute, OffsetAttribute, PositionIncrementAttribute
 * and KeywordAttribute. 
 */
public final class UimaAETokenizer extends Tokenizer {

  private final CharTermAttribute termAttr;
  private final OffsetAttribute offsetAttr;
  private final PositionIncrementAttribute posIncAttr;
  private final KeywordAttribute keywordAttr;

  private AttributeSource.State current;
  private AnalysisEngine ae;
  private SynonymMap synmap;
  private LinkedList<IntRange> rangeList;
  private Map<IntRange,Object> rangeMap;
  private Reader reader = null;
  private boolean eof = false;

  private static final Pattern PUNCT_OR_SPACE_PATTERN = 
    Pattern.compile("[\\p{Punct}\\s+]");
  private static final String SYN_DELIMITER = "__";
  
  public UimaAETokenizer(Reader input, 
      String aeDescriptor, Map<String,Object> aeParams,
      SynonymMap synonymMap) {
    super(input);
    // validate inputs
    try {
      ae = UimaUtils.getAE(aeDescriptor, aeParams);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    if (synonymMap == null) {
      throw new RuntimeException(
        "Need valid (non-null) reference to a SynonymMap");
    }
    synmap = synonymMap;
    reader = new BufferedReader(input);
    // set available attributes
    termAttr = addAttribute(CharTermAttribute.class);
    offsetAttr = addAttribute(OffsetAttribute.class);
    posIncAttr = addAttribute(PositionIncrementAttribute.class);
    keywordAttr = addAttribute(KeywordAttribute.class);
    // initialize variables
    rangeList = new LinkedList<IntRange>();
    rangeMap = new HashMap<IntRange,Object>();
  }
  
  @Override
  public boolean incrementToken() throws IOException {
    if (rangeList.size() > 0) {
      populateAttributes();
      current = captureState();
      restoreState(current);
      if (rangeList.size() == 0) {
        eof = true;
      }
      return true;
    }
    // if no more tokens, return
    if (eof) {
      return false;
    }
    // analyze input and buffer tokens
    clearAttributes();
    rangeList.clear();
    rangeMap.clear();
    try {
      List<String> texts = IOUtils.readLines(reader);
      for (String text : texts) {
        JCas jcas = UimaUtils.runAE(ae, text);
        FSIndex<? extends Annotation> fsindex = 
          jcas.getAnnotationIndex(KeywordAnnotation.type);
        int pos = 0;
        for (Iterator<? extends Annotation> it = fsindex.iterator();
            it.hasNext(); ) {
          KeywordAnnotation annotation = (KeywordAnnotation) it.next();
          int begin = annotation.getBegin();
          int end = annotation.getEnd();
          if (pos < begin) {
            // this is plain text, split this up by whitespace
            // into individual terms
            addNonAnnotatedTerms(pos, text.substring(pos, begin));
          }
          IntRange range = new IntRange(begin, end);
          mergeAnnotationInfo(range, annotation);     
          pos = end;
        }
        if (pos < text.length()) {
          addNonAnnotatedTerms(pos, text.substring(pos));
        }
        current = captureState();
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
    // return the first term from rangeList
    populateAttributes();
    return true;
  }
  
  private void populateAttributes() {
    if (rangeList.size() == 0) {
      return;
    }
    // return buffered tokens one by one. If current
    // token has an associated UimaAnnotationAttribute,
    // then set the attribute in addition to term
    IntRange range = rangeList.removeFirst();
    if (rangeMap.containsKey(range)) {
      Object rangeValue = rangeMap.get(range);
      if (rangeValue instanceof KeywordAnnotation) {
        // this is a UIMA Keyword annotation
        KeywordAnnotation annotation = (KeywordAnnotation) rangeValue;
        String term = annotation.getCoveredText();
        String transformedValue = annotation.getTransformedValue();
        if (StringUtils.isNotEmpty(transformedValue)) {
          List<Token> tokens = SynonymMap.makeTokens(
            Arrays.asList(StringUtils.split(
            transformedValue, SYN_DELIMITER)));
          // rather than add all the synonym tokens in a single
          // add, we have to do this separately to ensure that
          // the position increment attribute is set to 0 for
          // all the synonyms, not just the first one
          for (Token token : tokens) {
            synmap.add(Arrays.asList(term), Arrays.asList(token), true, true);
          }
        }
        offsetAttr.setOffset(annotation.getBegin(), 
          annotation.getEnd());
        termAttr.copyBuffer(term.toCharArray(), 0, term.length());
        termAttr.setLength(term.length());
        keywordAttr.setKeyword(true);
        posIncAttr.setPositionIncrement(1);
      } else {
        // this is a plain text term
        String term = (String) rangeValue;
        termAttr.copyBuffer(term.toCharArray(), 0, term.length());
        termAttr.setLength(term.length());
        offsetAttr.setOffset(range.getMinimumInteger(), 
          range.getMaximumInteger());
        keywordAttr.setKeyword(false);
        posIncAttr.setPositionIncrement(1);
      }
    }
  }

  private void addNonAnnotatedTerms(int pos, String snippet) {
    int start = 0;
    Matcher m = PUNCT_OR_SPACE_PATTERN.matcher(snippet);
    while (m.find(start)) {
      int begin = m.start();
      int end = m.end();
      if (start == begin) {
        // this is a punctuation character, skip it
        start = end;
        continue;
      }
      IntRange range = new IntRange(pos + start, pos + begin);
      rangeList.add(range);
      rangeMap.put(range, snippet.substring(start, begin));
      start = end; 
    }
    // take care of trailing string in snippet
    if (start < snippet.length()) {
      IntRange range = new IntRange(pos + start, pos + snippet.length());
      rangeList.add(range);
      rangeMap.put(range, snippet.substring(start));
    }
  }

  private void mergeAnnotationInfo(IntRange range, 
      KeywordAnnotation annotation) {
    // verify if the range has not already been recognized.
    // this is possible if multiple AEs recognize and act
    // on the same pattern/dictionary entry
    if (rangeMap.containsKey(range) &&
        rangeMap.get(range) instanceof KeywordAnnotation) {
      KeywordAnnotation prevAnnotation = 
        (KeywordAnnotation) rangeMap.get(range);
      Set<String> synonyms = new HashSet<String>();
      if (StringUtils.isNotEmpty(
          prevAnnotation.getTransformedValue())) {
        synonyms.addAll(Arrays.asList(StringUtils.split(
          prevAnnotation.getTransformedValue(), SYN_DELIMITER)));
      }
      if (StringUtils.isNotEmpty(annotation.getTransformedValue())) {
        synonyms.addAll(Arrays.asList(StringUtils.split(
          annotation.getTransformedValue(), SYN_DELIMITER)));
      }
      annotation.setTransformedValue(StringUtils.join(
        synonyms.iterator(), SYN_DELIMITER));
      rangeMap.put(range, annotation);
    } else {
      rangeList.add(range);
      rangeMap.put(range, annotation);
    }
  }
}
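The splitting logic in addNonAnnotatedTerms() is easy to try in isolation. The sketch below mirrors it with nothing but JDK regex (a simplified stand-alone version, not the class above). Note that the '+' inside the character class is a literal '+' (already covered by \p{Punct}), not a quantifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitDemo {

  // same character class as in the tokenizer above
  private static final Pattern PUNCT_OR_SPACE_PATTERN =
    Pattern.compile("[\\p{Punct}\\s+]");

  // collect the non-empty runs between punctuation/whitespace matches
  static List<String> split(String snippet) {
    List<String> terms = new ArrayList<String>();
    int start = 0;
    Matcher m = PUNCT_OR_SPACE_PATTERN.matcher(snippet);
    while (m.find(start)) {
      if (m.start() > start) {
        terms.add(snippet.substring(start, m.start()));
      }
      start = m.end();
    }
    if (start < snippet.length()) {
      terms.add(snippet.substring(start)); // trailing term
    }
    return terms;
  }

  public static void main(String[] args) {
    System.out.println(split("CSC and IBM are Fortune 500 companies."));
    // [CSC, and, IBM, are, Fortune, 500, companies]
  }
}
```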

The UimaUtils class is a simple utilities class that wraps common UIMA operations such as building an Analysis Engine from a descriptor, running an Analysis Engine, etc. Its code is shown below:

// Source: ./src/main/java/com/mycompany/tgni/uima/utils/UimaUtils.java
package com.mycompany.tgni.uima.utils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.cas.Feature;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.InvalidXMLException;
import org.apache.uima.util.ProcessTrace;
import org.apache.uima.util.ProcessTraceEvent;
import org.apache.uima.util.XMLInputSource;

/**
 * Largely copied from the TestUtils class in UIMA Sandbox component
 * AlchemyAPIAnnotator.
 */
public class UimaUtils {

  public static AnalysisEngine getAE(
      String descriptor, Map<String,Object> params) 
      throws IOException, InvalidXMLException, 
      ResourceInitializationException {
    AnalysisEngine ae = null;
    try {
      XMLInputSource in = new XMLInputSource(descriptor);
      AnalysisEngineDescription desc = 
        UIMAFramework.getXMLParser().
        parseAnalysisEngineDescription(in);
      if (params != null) {
        for (String key : params.keySet()) {
          desc.getAnalysisEngineMetaData().
            getConfigurationParameterSettings().
            setParameterValue(key, params.get(key));
        }
      }
      ae = UIMAFramework.produceAnalysisEngine(desc);
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
    return ae;
  }
  
  public static JCas runAE(AnalysisEngine ae, String text)
      throws AnalysisEngineProcessException,
      ResourceInitializationException {
    JCas jcas = ae.newJCas();
    jcas.setDocumentText(text);
    ProcessTrace trace = ae.process(jcas);
    for (ProcessTraceEvent evt : trace.getEvents()) {
      if (evt != null && evt.getResultMessage() != null &&
          evt.getResultMessage().contains("error")) {
        throw new AnalysisEngineProcessException(
          new Exception(evt.getResultMessage()));
      }
    }
    return jcas;
  }
  
  public static void printResults(JCas jcas) {
    FSIndex<Annotation> index = jcas.getAnnotationIndex();
    for (Iterator<Annotation> it = index.iterator(); it.hasNext(); ) {
      Annotation annotation = it.next();
      List<Feature> features = new ArrayList<Feature>();
      if (annotation.getType().getName().contains("com.mycompany")) {
        features = annotation.getType().getFeatures();
      }
      List<String> fasl = new ArrayList<String>();
      for (Feature feature : features) {
        if (feature.getName().contains("com.mycompany")) {
          String name = feature.getShortName();
          String value = annotation.getStringValue(feature);
          fasl.add(name + "=\"" + value + "\"");
        }
      }
      System.out.println(
        annotation.getType().getShortName() + ": " +
        annotation.getCoveredText() + " " +
        (fasl.size() > 0 ? StringUtils.join(fasl.iterator(), ",") : "") + " " +
        annotation.getBegin() + ":" + annotation.getEnd());
    }
    System.out.println("==");
  }
}

The next filter in the chain is the (Lucene-provided, since 3.0 I think) SynonymFilter. It needs a reference to a SynonymMap. An empty SynonymMap was provided to the UimaAETokenizer, which populated it, and it is now available for use by the SynonymFilter. And yes, I do realize that this sort of pass-by-reference style is frowned upon in the Java world, but at least in this case it keeps the code simple and easy to understand.

At the end of this step, the SynonymFilter will emit the synonym terms at the same offset as the original term, with their position increments set to 0.
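To see why the 0 increment matters: a token's absolute position is the running sum of increments, so a synonym with increment 0 occupies the same position as the term before it, and phrase queries match through either spelling. A toy illustration with made-up tokens (plain Java, no Lucene involved):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PosIncDemo {

  // compute absolute positions from (term, positionIncrement) pairs
  static Map<String, Integer> positions(String[] terms, int[] increments) {
    Map<String, Integer> result = new LinkedHashMap<String, Integer>();
    int pos = -1; // so the first increment of 1 lands the token at position 0
    for (int i = 0; i < terms.length; i++) {
      pos += increments[i];
      result.put(terms[i], pos);
    }
    return result;
  }

  public static void main(String[] args) {
    // "U.S.A." with its synonym "USA" injected at increment 0
    String[] terms = {"born", "U.S.A.", "USA", "i"};
    int[] incs = {1, 3, 0, 1};
    System.out.println(positions(terms, incs));
    // {born=0, U.S.A.=3, USA=3, i=4} -- the synonym shares position 3
  }
}
```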

The next two filters are the LowerCaseFilter and StopFilter, which lowercase the tokens and remove stopwords, respectively. I wanted them to skip tokens generated from the UIMA annotations in my UimaAETokenizer, similar to how the PorterStemFilter behaves in Lucene 4.0. Specifically, with PorterStemFilter, it is possible to mark certain terms as keywords using KeywordAttribute.setKeyword(true), and these terms will be skipped during stemming.

However, this functionality does not exist in Lucene (yet), so I have opened a JIRA issue (LUCENE-3236) with the necessary patches; hopefully it will be incorporated into Lucene at some point. In the interim, you can use the versions below, which are functionally identical to the patched versions I provided in the JIRA.

// Source: src/main/java/com/mycompany/tgni/lucene/LowerCaseFilter.java
package com.mycompany.tgni.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.util.CharacterUtils;
import org.apache.lucene.util.Version;

public final class LowerCaseFilter extends TokenFilter {
  private final CharacterUtils charUtils;
  private final CharTermAttribute termAtt = 
    addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = 
    addAttribute(KeywordAttribute.class);
  
  private boolean ignoreKeyword = false;

  /**
   * Extra constructor to trigger new keyword-aware behavior.
   */
  public LowerCaseFilter(Version matchVersion, TokenStream in, 
      boolean ignoreKeyword) {
    super(in);
    charUtils = CharacterUtils.getInstance(matchVersion);
    this.ignoreKeyword = ignoreKeyword;
  }

  /**
   * Old ctor.
   */
  public LowerCaseFilter(Version matchVersion, TokenStream in) {
    this(matchVersion, in, false);
  }
  
  @Override
  public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      if (ignoreKeyword && keywordAtt.isKeyword()) {
        // do nothing
        return true;
      }
      final char[] buffer = termAtt.buffer();
      final int length = termAtt.length();
      for (int i = 0; i < length;) {
       i += Character.toChars(
         Character.toLowerCase(charUtils.codePointAt(buffer, i)), buffer, i);
      }
      return true;
    } else
      return false;
  }
}

The only real changes are an extra constructor to trigger the keyword-aware behavior, the addition of the KeywordAttribute (so this filter is now keyword-aware), and a small if condition in the incrementToken() method to short-circuit the lowercasing in case the term is marked as a keyword.
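The lowercasing loop itself is worth a second look: it advances by code point, not by char, so characters outside the BMP (surrogate pairs) are lowercased correctly. The same idiom works with only the JDK, substituting Character.codePointAt for Lucene's version-aware CharacterUtils:

```java
public class LowerCaseDemo {

  // lowercase a char buffer in place, one code point at a time,
  // so surrogate pairs are never split down the middle
  static String lowerCase(String input) {
    char[] buffer = input.toCharArray();
    int length = buffer.length;
    for (int i = 0; i < length; ) {
      i += Character.toChars(
        Character.toLowerCase(Character.codePointAt(buffer, i)), buffer, i);
    }
    return new String(buffer, 0, length);
  }

  public static void main(String[] args) {
    System.out.println(lowerCase("Fortune 500")); // fortune 500
  }
}
```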

Similarly, the StopFilter below is also almost identical to the stock Lucene StopFilter. Like the custom version of the LowerCaseFilter, the only changes are the extra constructor (to trigger keyword-aware behavior), the addition of a KeywordAttribute to its list of recognized attributes and an extra condition in the (protected) accept() method.

// Source: src/main/java/com/mycompany/tgni/lucene/StopFilter.java
package com.mycompany.tgni.lucene;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

public final class StopFilter extends FilteringTokenFilter {

  private final CharArraySet stopWords;
  private final CharTermAttribute termAtt = 
    addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt =
    addAttribute(KeywordAttribute.class);

  private boolean ignoreKeyword = false;
  
  /**
   * New ctor to trigger keyword-aware behavior.
   */
  public StopFilter(Version matchVersion, TokenStream input, Set<?> stopWords, 
      boolean ignoreCase, boolean ignoreKeyword) {
    super(true, input);
    this.stopWords = stopWords instanceof CharArraySet ? 
      (CharArraySet) stopWords : 
      new CharArraySet(matchVersion, stopWords, ignoreCase);
    this.ignoreKeyword = ignoreKeyword;
  }

  /**
   * Old ctor for current behavior.
   */
  public StopFilter(Version matchVersion, TokenStream input, Set<?> stopWords, 
      boolean ignoreCase) {
    this(matchVersion, input, stopWords, ignoreCase, false);
  }
  
  public StopFilter(Version matchVersion, TokenStream in, Set<?> stopWords) {
    this(matchVersion, in, stopWords, false);
  }

  public static Set<Object> makeStopSet(Version matchVersion, 
      String... stopWords) {
    return makeStopSet(matchVersion, stopWords, false);
  }
  
  public static Set<Object> makeStopSet(Version matchVersion, 
      List<?> stopWords) {
    return makeStopSet(matchVersion, stopWords, false);
  }
    
  public static Set<Object> makeStopSet(Version matchVersion, 
      String[] stopWords, boolean ignoreCase) {
    CharArraySet stopSet = new CharArraySet(matchVersion, stopWords.length, 
      ignoreCase);
    stopSet.addAll(Arrays.asList(stopWords));
    return stopSet;
  }
  
  public static Set<Object> makeStopSet(Version matchVersion, List<?> stopWords,
       boolean ignoreCase){
    CharArraySet stopSet = new CharArraySet(matchVersion, stopWords.size(),
      ignoreCase);
    stopSet.addAll(stopWords);
    return stopSet;
  }
  
  @Override
  protected boolean accept() throws IOException {
    return (ignoreKeyword && keywordAtt.isKeyword()) || 
      !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
  }
}
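The keyword-aware accept() logic boils down to: keep the token if it is a protected keyword, or if it is not a stopword. Here is a sketch of that decision with a plain java.util.Set standing in for Lucene's CharArraySet (an approximation; CharArraySet matches char buffers directly without allocating Strings):

```java
import java.util.Set;

public class StopDemo {

  // mirrors the accept() method: keywords bypass the stop set entirely
  static boolean accept(String term, boolean isKeyword,
      boolean ignoreKeyword, Set<String> stopWords) {
    return (ignoreKeyword && isKeyword) || !stopWords.contains(term);
  }

  public static void main(String[] args) {
    Set<String> stops = Set.of("the", "a", "is");
    // "the" as plain text is dropped...
    System.out.println(accept("the", false, true, stops)); // false
    // ...but if an annotator marked it a keyword, it survives
    System.out.println(accept("the", true, true, stops));  // true
  }
}
```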

And finally, my analyzer contains the PorterStemFilter, which already recognizes keywords, so no changes are needed there.

To test this analyzer, I wrote a little JUnit test that runs it over the snippets of text I had used earlier to test my UIMA AEs.

// Source: src/test/java/com/mycompany/tgni/lucene/UimaAETokenizerTest.java
package com.mycompany.tgni.lucene;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;

import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotatorsTest;

public class UimaAETokenizerTest {

  private Analyzer analyzer;
  
  @Test
  public void testUimaKeywordTokenizer() throws Exception {
    analyzer = getAnalyzer();
    for (String s : KeywordAnnotatorsTest.TEST_STRINGS) {
      System.out.println("input=" + s);
      TokenStream tokenStream = analyzer.tokenStream("f", new StringReader(s));
      while (tokenStream.incrementToken()) {
        CharTermAttribute termAttr = 
          tokenStream.getAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttr = 
          tokenStream.getAttribute(OffsetAttribute.class);
        System.out.print("output term=" + 
          new String(termAttr.buffer(), 0, termAttr.length()) +
          ", offset=" + offsetAttr.startOffset() + "," + 
          offsetAttr.endOffset());
        KeywordAttribute keywordAttr = 
          tokenStream.getAttribute(KeywordAttribute.class);
        System.out.print(", keyword?" + keywordAttr.isKeyword());
        PositionIncrementAttribute posIncAttr = 
          tokenStream.getAttribute(PositionIncrementAttribute.class);
        System.out.print(", posinc=" + posIncAttr.getPositionIncrement());
        System.out.println();
      }
    }
  }

  private Analyzer getAnalyzer() throws Exception {
    if (analyzer == null) {
      List<String> stopwords = new ArrayList<String>();
      BufferedReader swreader = new BufferedReader(
        new FileReader(new File(
        "src/main/resources/stopwords.txt")));
      String line;
      while ((line = swreader.readLine()) != null) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        stopwords.add(StringUtils.trim(line));
      }
      swreader.close();
      final Set<?> stopset = StopFilter.makeStopSet(
        Version.LUCENE_40, stopwords);
      analyzer = new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
          SynonymMap synonymsMap = new SynonymMap();
          TokenStream input = new UimaAETokenizer(reader,
            "src/main/resources/descriptors/TaxonomyMappingAE.xml",
            null, synonymsMap);
          input = new SynonymFilter(input, synonymsMap);
          input = new LowerCaseFilter(Version.LUCENE_40, input, true);
          input = new StopFilter(Version.LUCENE_40, input, stopset, false, true);
          input = new PorterStemFilter(input);
          return input;
        }
      };
    }
    return analyzer;
  }
}

The output (edited for readability) of the test shows that the analyzer works as expected. You can see the effects of each of the filters in our analyzer in the different examples below.

input=Born in the USA I was...
  output term=born, offset=0,4, keyword?false, posinc=1
  output term=USA, offset=12,15, keyword?true, posinc=3
  output term=i, offset=16,17, keyword?false, posinc=1
input=CSC and IBM are Fortune 500 companies.
  output term=CSC, offset=0,3, keyword?true, posinc=1
  output term=IBM, offset=8,11, keyword?true, posinc=2
  output term=fortun, offset=16,23, keyword?false, posinc=2
  output term=500, offset=24,27, keyword?false, posinc=1
  output term=compani, offset=28,37, keyword?false, posinc=1
input=Linux is embraced by the Oracles and IBMs of the world
  output term=linux, offset=0,5, keyword?false, posinc=1
  output term=embrac, offset=9,17, keyword?false, posinc=2
  output term=oracl, offset=25,32, keyword?false, posinc=3
  output term=IBMs, offset=37,41, keyword?true, posinc=2
    output term=IBM, offset=37,41, keyword?true, posinc=0
  output term=world, offset=49,54, keyword?false, posinc=3
input=PET scans are uncomfortable.
  output term=PET, offset=0,3, keyword?true, posinc=1
  output term=scan, offset=4,9, keyword?false, posinc=1
  output term=uncomfort, offset=14,27, keyword?false, posinc=2
input=The HIV-1 virus is an AIDS carrier
  output term=HIV-1, offset=4,9, keyword?true, posinc=2
    output term=HIV 1, offset=4,9, keyword?true, posinc=0
    output term=HIV1, offset=4,9, keyword?true, posinc=0
  output term=viru, offset=10,15, keyword?false, posinc=1
  output term=AIDS, offset=22,26, keyword?true, posinc=3
    output term=Acquired Immunity Deficiency Syndrome, offset=22,26, keyword?true, posinc=0
  output term=carrier, offset=27,34, keyword?false, posinc=1
input=Unstructured Information Management Application (UIMA) is fantastic!
  output term=unstructur, offset=0,12, keyword?false, posinc=1
  output term=inform, offset=13,24, keyword?false, posinc=1
  output term=manag, offset=25,35, keyword?false, posinc=1
  output term=applic, offset=36,47, keyword?false, posinc=1
  output term=UIMA, offset=49,53, keyword?true, posinc=1
  output term=fantast, offset=58,67, keyword?false, posinc=2
input=Born in the U.S.A., I was...
  output term=born, offset=0,4, keyword?false, posinc=1
  output term=U.S.A., offset=12,18, keyword?true, posinc=3
    output term=USA, offset=12,18, keyword?true, posinc=0
  output term=i, offset=20,21, keyword?false, posinc=1
input=He is a free-wheeling kind of guy.
  output term=he, offset=0,2, keyword?false, posinc=1
  output term=free-wheeling, offset=8,21, keyword?true, posinc=3
    output term=freewheeling, offset=8,21, keyword?true, posinc=0
    output term=free wheeling, offset=8,21, keyword?true, posinc=0
  output term=kind, offset=22,26, keyword?false, posinc=1
  output term=gui, offset=30,33, keyword?false, posinc=2
input=Magellan was one of our great mariners
  output term=magellan, offset=0,8, keyword?false, posinc=1
  output term=on, offset=13,16, keyword?false, posinc=2
  output term=our, offset=20,23, keyword?false, posinc=2
  output term=great, offset=24,29, keyword?false, posinc=1
  output term=mariners, offset=30,38, keyword?true, posinc=1
input=Get your daily dose of Vitamin A here!
  output term=get, offset=0,3, keyword?false, posinc=1
  output term=your, offset=4,8, keyword?false, posinc=1
  output term=daili, offset=9,14, keyword?false, posinc=1
  output term=dose, offset=15,19, keyword?false, posinc=1
  output term=Vitamin A, offset=23,32, keyword?true, posinc=2
  output term=here, offset=33,37, keyword?false, posinc=1

So anyway, that's about it for today. This information is probably not all that useful unless you are trying to do something along similar lines, but hopefully it was interesting :-). Next week, I hope to incorporate this analyzer into Neo4j's Lucene-based IndexService (for looking up nodes in a graph).

Saturday, June 18, 2011

UIMA Analysis Engine for Keyword Recognition and Transformation

You have probably noticed that I've been playing with UIMA lately, perhaps a bit aimlessly. One of my goals with UIMA is to create an Analysis Engine (AE) that I can plug into the front of the Lucene analyzer chain for one of my applications. The AE would detect and mark keywords in the input stream so they would be exempt from stemming by downstream Lucene analyzers.

So a couple of weeks ago, I picked up the bits and pieces of UIMA code that I had written and started refactoring them into a sequence of primitive AEs that detect keywords in text using pattern and dictionary recognition. Each primitive AE places new KeywordAnnotation objects into an annotation index.

The primitive AEs I came up with are pretty basic, but offer a surprising amount of bang for the buck. Just two annotators - the PatternAnnotator and the DictionaryAnnotator - do the processing for the primitive AEs listed below. Obviously, more can be added (and will be, eventually) as required.

  • Pattern based keyword recognition
  • Pattern based keyword recognition and transformation
  • Dictionary based keyword recognition, case sensitive
  • Dictionary based keyword recognition and transformation, case sensitive
  • Dictionary based keyword recognition, case insensitive
  • Dictionary based keyword recognition and transformation, case insensitive

These AEs are arranged linearly in a fixed-flow chain to form the aggregate AE as shown in the diagram below:


That's it for background; let's look at some code (and, since it's UIMA, lots of XML descriptors).

The Keyword Annotation

The Keyword annotation is described to UIMA using the following XML. As you can see, the only extra thing we add to the standard annotation object is the transformed value property, which allows us to store transformations (synonyms) that are returned by some of the AEs listed above.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/resources/descriptors/Keyword.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Keyword</name>
  <description>
    Represents character sequence patterns in text.
  </description>
  <version>1.0</version>
  <vendor>MyCompany Inc.</vendor>
  <types>
    <typeDescription>
      <name>com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>transformedValue</name>
          <description>The transformed value (can be empty)</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>

UIMA provides a code generator (JCasGen) that you run on this XML file to produce a pair of Java classes (not shown) called KeywordAnnotation.java and KeywordAnnotation_Type.java. From a UIMA application programmer's perspective, the KeywordAnnotation class provides getters and setters for the properties defined in the XML above.

Pattern Annotator and AEs

The PatternAnnotator uses regular expressions defined in an external text file. I started out using database tables for configuration, but this got a bit cumbersome, so I switched to using property files under git control instead.

The PatternAnnotator operates in one of two modes: preserve or transform. In preserve mode, it simply recognizes patterns listed in a text file, as shown in the example below.

# Source: src/main/resources/pattern_preservations.txt
# Format of this file:
# pattern # optional comment
#
[A-Z]{2}[A-Za-z0-9-]* # abbreviation: first 2 uppercase followed by any

In transform mode, it recognizes patterns and sets the transformedValue property of the resulting annotation to the result of applying the specified transform to the matched text. Here is an example of its configuration:

# Source: src/main/resources/pattern_transformations.txt
# Format of this file:
# pattern transform
# Inline comments not permitted. Transform is supplied as s/src/repl/
#
# abbreviation with embedded periods eg. U.S.A. Transform to USA
([A-Z]\.)+ s/\.//
# hyphenated words should convert to single and multiple words, eg. 
# free-wheeling should convert to freewheeling, free wheeling
(\w+)-(\w+) s/(\w+)-(\w+)/$1$2, $1 $2/
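To make the s/src/repl/ syntax concrete, here is a stdlib-only sketch of how such an entry can be parsed and applied. The class and method names here are mine; the actual annotator (shown below) does the same thing with Commons Lang's splitPreserveAllTokens.

```java
import java.util.regex.Pattern;

public class TransformSketch {

  // Parse a transform of the form "s/src/repl/" and apply src -> repl
  // to the token. Splitting with limit -1 preserves trailing empty
  // fields, so a well-formed transform always yields four fields.
  public static String applyTransform(String token, String transform) {
    String[] tcols = transform.split("/", -1);
    if (tcols.length == 4) {
      return Pattern.compile(tcols[1]).matcher(token).replaceAll(tcols[2]);
    }
    // malformed transform: leave the token unchanged
    return token;
  }
}
```

Applying s/\.// to U.S.A. yields USA, and the hyphenation transform turns free-wheeling into "freewheeling, free wheeling".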

The code for the PatternAnnotator is not too complex; it is based in part on the examples provided in the UIMA distribution. Here it is:

// Source: src/main/java/com/mycompany/tgni/uima/annotators/keyword/PatternAnnotator.java
package com.mycompany.tgni.uima.annotators.keyword;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceAccessException;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.tgni.uima.conf.SharedMapResource;
import com.mycompany.tgni.uima.conf.SharedSetResource;

/**
 * Annotates patterns found in input text. Operates in preserve
 * or transform mode. In preserve mode, recognizes and annotates
 * a set of supplied regex patterns. In transform mode, recognizes
 * and annotates a map of regex patterns which have associated
 * transforms, and additionally applies the transformation and
 * stores it in its transformedValue feature.
 */
public class PatternAnnotator extends JCasAnnotator_ImplBase {

  private String preserveOrTransform;
  private Set<Pattern> patternSet;
  private Map<Pattern,String> patternMap;
  
  private final static String PRESERVE = "preserve";
  private final static String TRANSFORM = "transform";
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    preserveOrTransform = 
      (String) ctx.getConfigParameterValue("preserveOrTransform");
    try {
      if (PRESERVE.equals(preserveOrTransform)) {
        SharedSetResource res = (SharedSetResource) 
          ctx.getResourceObject("patternAnnotatorProperties");
        patternSet = new HashSet<Pattern>();
        for (String patternString : res.getConfig()) {
          patternSet.add(Pattern.compile(patternString));
        }
      } else if (TRANSFORM.equals(preserveOrTransform)) {
        SharedMapResource res = (SharedMapResource)
          ctx.getResourceObject("patternAnnotatorProperties");
        patternMap = new HashMap<Pattern,String>();
        Map<String,String> confMap = res.getConfig();
        for (String patternString : confMap.keySet()) {
          patternMap.put(Pattern.compile(patternString), 
            confMap.get(patternString));
        }
      } else {
        throw new ResourceInitializationException(
          new IllegalArgumentException(
          "Configuration parameter preserveOrTransform " +
          "must be either 'preserve' or 'transform'"));
      }
    } catch (ResourceAccessException e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) 
      throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    int pcnt = 0;
    Set<Pattern> patterns = PRESERVE.equals(preserveOrTransform) ?
      patternSet : patternMap.keySet();
    for (Pattern pattern : patterns) {
      Matcher matcher = pattern.matcher(text);
      int pos = 0;
      while (matcher.find(pos)) {
        pos = matcher.end();
        KeywordAnnotation annotation = new KeywordAnnotation(jcas);
        annotation.setBegin(matcher.start());
        annotation.setEnd(pos);
        if (TRANSFORM.equals(preserveOrTransform)) {
          String token = StringUtils.substring(
            text, annotation.getBegin(), annotation.getEnd());
          String transform = patternMap.get(pattern);
          String transformedValue = applyTransform(token, transform);
          annotation.setTransformedValue(transformedValue);
        }
        annotation.addToIndexes();
      }
      pcnt++;
    }
  }

  private String applyTransform(String token, String transform) {
    String[] tcols = 
      StringUtils.splitPreserveAllTokens(transform, "/");
    if (tcols.length == 4) {
      Pattern p = Pattern.compile(tcols[1]);
      Matcher m = p.matcher(token);
      return m.replaceAll(tcols[2]);
    } else {
      return token;
    }
  }
}

To read configuration files, UIMA provides a redirection mechanism that is quite neat: in the XML descriptor you declare a named external resource that binds a file name to a SharedResourceObject implementation. My annotators so far need to read a list of patterns and a map of patterns with associated transformations, so I built two simple implementations, SharedSetResource and SharedMapResource. The code for these is shown below; they are also used by the DictionaryAnnotator described later.

// Source: src/main/java/com/mycompany/tgni/uima/conf/SharedSetResource.java
package com.mycompany.tgni.uima.conf;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.uima.resource.DataResource;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.SharedResourceObject;

/**
 * Converts the specified text file of property values into a Set.
 * Values must start at the first character of a line and be 
 * terminated by tab or newline.
 */
public class SharedSetResource implements SharedResourceObject {

  private final Set<String> configs = new HashSet<String>();
  
  @Override
  public void load(DataResource res) 
      throws ResourceInitializationException {
    InputStream istream = null;
    try {
      istream = res.getInputStream();
      BufferedReader reader = new BufferedReader(
        new InputStreamReader(istream));
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        if (line.indexOf('\t') > 0) {
          String[] cols = StringUtils.split(line, "\t");
          configs.add(StringUtils.trim(cols[0]));
        } else {
          configs.add(StringUtils.trim(line));
        }
      }
      reader.close();
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    } finally {
      IOUtils.closeQuietly(istream);
    }
  }
  
  public Set<String> getConfig() {
    return Collections.unmodifiableSet(configs);
  }
}

// Source: src/main/java/com/mycompany/tgni/uima/conf/SharedMapResource.java
package com.mycompany.tgni.uima.conf;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.uima.resource.DataResource;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.SharedResourceObject;

/**
 * Converts the specified properties file into a Map. Key and
 * value must be tab separated.
 */
public class SharedMapResource implements SharedResourceObject {

  private Map<String,String> configs = new HashMap<String,String>();
  
  @Override
  public void load(DataResource res) 
      throws ResourceInitializationException {
    InputStream istream = null;
    try {
      istream = res.getInputStream();
      BufferedReader reader = new BufferedReader(
        new InputStreamReader(istream));
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line) ||
            line.startsWith("#")) {
          continue;
        }
        String[] kv = StringUtils.split(line, "\t");
        configs.put(kv[0], kv[1]);
      }
      reader.close();
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    } finally {
      IOUtils.closeQuietly(istream);
    }
  }
  
  public Map<String,String> getConfig() {
    return Collections.unmodifiableMap(configs);
  }
  
  public List<String> asList(String value) {
    if (value == null) {
      return Collections.emptyList();
    } else {
      String[] vals = value.split("\\s*,\\s*");
      return Arrays.asList(vals);
    }
  }
}

The two flavors of the Pattern Annotator (ie, one for preserve mode and one for transform mode) are defined using XML descriptors. Here are the respective XML definitions:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/resources/descriptors/PatternPreserveAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.PatternAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>PatternPreserveAE</name>
    <description>Recognize and preserve patterns.</description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Whether to preserve or transform patterns</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>preserve</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>patternSet</name>
        <description>Set of patterns to preserve</description>
        <fileResourceSpecifier>
          <fileUrl>file:src/main/resources/pattern_preservations.txt</fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedSetResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>patternAnnotatorProperties</key>
        <resourceName>patternSet</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/resources/descriptors/PatternTransformAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.PatternAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>PatternTransformAE</name>
    <description>Recognize and transform patterns.</description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Whether to preserve or transform patterns</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>transform</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>patternMap</name>
        <description>Map of patterns to transform</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/pattern_transformations.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedMapResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>patternAnnotatorProperties</key>
        <resourceName>patternMap</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

As you can see, the two are largely similar; the differences are in the configurationParameterSettings and resourceManagerConfiguration sections of the XML.

Dictionary Annotator and AEs

The DictionaryAnnotator relies on exact matches of words or phrases against a dictionary. Like the PatternAnnotator, it can work in either preserve or transform mode. In preserve mode, it operates against a set of known words or phrases. In transform mode, it operates against a map of key-value pairs, where the key is a word or phrase and the value is its synonym.

Since it supports multi-word phrases, the matching is done using a Lucene ShingleFilter with a maximum shingle size of 5. Here is the code for the DictionaryAnnotator.

// Source: src/main/java/com/mycompany/tgni/uima/annotators/keyword/DictionaryAnnotator.java
package com.mycompany.tgni.uima.annotators.keyword;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceAccessException;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.tgni.uima.conf.SharedMapResource;
import com.mycompany.tgni.uima.conf.SharedSetResource;
import com.mycompany.tgni.uima.utils.AnnotatorUtils;

/**
 * Annotates patterns found in input text. Operates in preserve
 * or transform mode. In preserve mode, recognizes and annotates
 * a set of supplied dictionary words. In transform mode, the
 * recognized words are annotated and the transformed value is
 * set into the annotation. Default matching is case-insensitive
 * but can be overridden using the ignoreCase config parameter.
 * Multi-word phrases can be specified in the dictionaries, up to
 * a maximum size of maxShingleSize words (default 5).
 */
public class DictionaryAnnotator extends JCasAnnotator_ImplBase {

  private String preserveOrTransform;
  private boolean ignoreCase;
  private int maxShingleSize = 5;

  private Set<String> dictSet;
  private Map<String,String> dictMap;
  
  private final static String PRESERVE = "preserve";
  private final static String TRANSFORM = "transform";

  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    preserveOrTransform = 
      (String) ctx.getConfigParameterValue("preserveOrTransform");
    ignoreCase = (Boolean) ctx.getConfigParameterValue("ignoreCase");
    maxShingleSize = (Integer) ctx.getConfigParameterValue("maxShingleSize");
    try {
      if (PRESERVE.equals(preserveOrTransform)) {
        SharedSetResource res = (SharedSetResource) 
          ctx.getResourceObject("dictAnnotatorProperties");
        dictSet = new HashSet<String>();
        for (String dictPhrase : res.getConfig()) {
          if (ignoreCase) {
            dictSet.add(StringUtils.lowerCase(dictPhrase));
          } else {
            dictSet.add(dictPhrase);
          }
        }
      } else if (TRANSFORM.equals(preserveOrTransform)) {
        SharedMapResource res = (SharedMapResource) 
          ctx.getResourceObject("dictAnnotatorProperties");
        Map<String,String> confMap = res.getConfig();
        dictMap = new HashMap<String,String>();
        for (String dictPhrase : confMap.keySet()) {
          if (ignoreCase) {
            dictMap.put(StringUtils.lowerCase(dictPhrase),
              confMap.get(dictPhrase));
          } else {
            dictMap.put(dictPhrase, confMap.get(dictPhrase));
          }
        }
      } else {
        throw new ResourceInitializationException(
          new IllegalArgumentException(
          "Configuration parameter preserveOrTransform " +
          "must be either 'preserve' or 'transform'"));
      }
    } catch (ResourceAccessException e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) 
      throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    // replace punctuation in working copy of text so the presence
    // of punctuation does not throw off the matching process
    text = text.replaceAll("\\p{Punct}", " ");
    // for HTML text fragments, replace tagged span with spaces
    text = AnnotatorUtils.whiteout(text);
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(
      Version.LUCENE_40, new StringReader(text));
    TokenStream tokenStream;
    if (ignoreCase) {
      tokenStream = new LowerCaseFilter(
        Version.LUCENE_40, tokenizer);
      tokenStream = new ShingleFilter(tokenStream, maxShingleSize);
    } else {
      tokenStream = new ShingleFilter(tokenizer, maxShingleSize);
    }
    try {
      while (tokenStream.incrementToken()) {
        CharTermAttribute term = 
          tokenStream.getAttribute(CharTermAttribute.class);
        OffsetAttribute offset = 
          tokenStream.getAttribute(OffsetAttribute.class);
        String shingle = new String(term.buffer(), 0, term.length());
        boolean foundToken = false;
        if (PRESERVE.equals(preserveOrTransform)) {
          if (dictSet.contains(shingle)) {
            foundToken = true;
          }
        } else {
          if (dictMap.containsKey(shingle)) {
            foundToken = true;
          }
        }
        if (foundToken) {
          KeywordAnnotation annotation = new KeywordAnnotation(jcas);
          annotation.setBegin(offset.startOffset());
          annotation.setEnd(offset.endOffset());
          if (TRANSFORM.equals(preserveOrTransform)) {
            // replace with the specified phrase
            annotation.setTransformedValue(dictMap.get(shingle));
          }
          annotation.addToIndexes();
        }
      }
    } catch (IOException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }
}
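To make the shingle-based matching above concrete, here is a stdlib-only sketch of what the shingling step produces: every run of up to maxShingleSize consecutive whitespace-separated tokens becomes a candidate phrase for the dictionary lookup. The names here are mine; the real code uses Lucene's WhitespaceTokenizer and ShingleFilter and also tracks character offsets for the annotations.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

  // Return all word n-grams of size 1..maxShingleSize, in order of
  // starting position. Each one is a candidate for dictionary lookup.
  public static List<String> shingles(String text, int maxShingleSize) {
    String[] tokens = text.trim().split("\\s+");
    List<String> result = new ArrayList<String>();
    for (int start = 0; start < tokens.length; start++) {
      StringBuilder buf = new StringBuilder();
      for (int size = 1;
           size <= maxShingleSize && start + size <= tokens.length;
           size++) {
        if (size > 1) buf.append(' ');
        buf.append(tokens[start + size - 1]);
        result.add(buf.toString());
      }
    }
    return result;
  }
}
```

For "New York City" with a maximum shingle size of 2, this yields the candidates New, New York, York, York City, and City, which is why a multi-word dictionary entry like "New York" can match.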

The configuration file structures are very similar to those shown for the pattern annotator. For preserve mode, it's just a list of words or phrases to be recognized; for transform mode, a tab-separated list of key-value pairs. Nothing much to see there, so I am not showing them.

We build four primitive AEs out of this annotator: one pair for case-sensitive matching and one pair for case-insensitive matching. Here are the XML descriptors for each of the four.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/resources/DictionaryPreserveMatchCaseAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>DictionaryPreserveMatchCaseAE</name>
    <description>
      Dictionary based annotator. Detects phrases to preserve.
      Case matters.
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>preserve/transform matched dictionary entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ignoreCase</name>
        <description>Whether to ignore case when matching</description>
        <type>Boolean</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>maxShingleSize</name>
        <description>Max number of words in phrase shingles</description>
        <type>Integer</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>preserve</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ignoreCase</name>
        <value>
          <boolean>false</boolean>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>maxShingleSize</name>
        <value>
          <integer>5</integer>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>dictCaseSensitiveSet</name>
        <description>Set of dictionary phrases to preserve</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/dict_preservations_matchcase.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedSetResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>dictAnnotatorProperties</key>
        <resourceName>dictCaseSensitiveSet</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/resources/descriptors/DictionaryTransformMatchCaseAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>DictionaryTransformMatchCaseAE</name>
    <description>
      Dictionary based annotator. Detects phrases to transform.
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Preserve/transform matched dictionary entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ignoreCase</name>
        <description>Whether to ignore case when matching</description>
        <type>Boolean</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>maxShingleSize</name>
        <description>Max number of words in phrase shingles</description>
        <type>Integer</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>transform</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ignoreCase</name>
        <value>
          <boolean>false</boolean>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>maxShingleSize</name>
        <value>
          <integer>5</integer>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>dictCaseSensitiveMap</name>
        <description>Map of dictionary phrases to transform</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/dict_transformations_matchcase.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedMapResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>dictAnnotatorProperties</key>
        <resourceName>dictCaseSensitiveMap</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

<!-- Source: src/main/resources/descriptors/DictionaryPreserveIgnoreCaseAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>DictionaryPreserveIgnoreCaseAE</name>
    <description>
      Dictionary based annotator. Detects phrases to preserve. Case ignored.
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Preserve/transform matched dictionary entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ignoreCase</name>
        <description>Whether to ignore case when matching</description>
        <type>Boolean</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>maxShingleSize</name>
        <description>Max number of words in phrase shingles</description>
        <type>Integer</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>preserve</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ignoreCase</name>
        <value>
          <boolean>true</boolean>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>maxShingleSize</name>
        <value>
          <integer>5</integer>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>dictCaseInsensitiveSet</name>
        <description>Set of dictionary phrases to preserve</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/dict_preservations_ignorecase.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedSetResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>dictAnnotatorProperties</key>
        <resourceName>dictCaseInsensitiveSet</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

<!-- Source: src/main/resources/descriptors/DictionaryTransformIgnoreCaseAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>DictionaryTransformIgnoreCaseAE</name>
    <description>
      Dictionary based annotator. Detects phrases to transform. Case ignored.
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>preserveOrTransform</name>
        <description>Preserve/transform matched dictionary entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ignoreCase</name>
        <description>Whether to ignore case when matching</description>
        <type>Boolean</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>maxShingleSize</name>
        <description>Max number of words in phrase shingles</description>
        <type>Integer</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>preserveOrTransform</name>
        <value>
          <string>transform</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ignoreCase</name>
        <value>
          <boolean>true</boolean>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>maxShingleSize</name>
        <value>
          <integer>5</integer>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="Keyword.xml"/>
      </imports>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation
          </type>
          <feature>
            com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation:transformedValue
          </feature>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>dictCaseInsensitiveMap</name>
        <description>Map of dictionary phrases to transform</description>
        <fileResourceSpecifier>
          <fileUrl>
            file:src/main/resources/dict_transformations_ignorecase.txt
          </fileUrl>
        </fileResourceSpecifier>
        <implementationName>
          com.mycompany.tgni.uima.conf.SharedMapResource
        </implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>dictAnnotatorProperties</key>
        <resourceName>dictCaseInsensitiveMap</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

As before, the XML descriptors are largely similar (and, quite frankly, rather boringly repetitive; I only include them here because some of you like to see things explicitly :-)). The only differences are in the configurationParameterSettings and resourceManagerConfiguration sections.
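To make those sections concrete: a primitive annotator reads the configurationParameterSettings and the bound external resource through its UimaContext at initialization time. The actual DictionaryAnnotator is not shown in this post, so the class below is only a sketch of that wiring; everything except the standard UIMA API calls (field names, the process() stub) is an assumption.

```java
// Sketch (not the actual DictionaryAnnotator): how a UIMA primitive
// annotator reads the parameters and the external resource bound under
// the key "dictAnnotatorProperties" in the descriptors above.
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

public class DictionaryAnnotatorSketch extends JCasAnnotator_ImplBase {

  private String preserveOrTransform;
  private boolean ignoreCase;
  private int maxShingleSize;
  private Object dictResource; // a SharedMapResource or SharedSetResource

  @Override
  public void initialize(UimaContext ctx)
      throws ResourceInitializationException {
    super.initialize(ctx);
    // values come from configurationParameterSettings in the descriptor
    preserveOrTransform =
        (String) ctx.getConfigParameterValue("preserveOrTransform");
    ignoreCase = (Boolean) ctx.getConfigParameterValue("ignoreCase");
    maxShingleSize = (Integer) ctx.getConfigParameterValue("maxShingleSize");
    try {
      // externalResourceBindings: same key, different resource per descriptor
      dictResource = ctx.getResourceObject("dictAnnotatorProperties");
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // shingle matching and KeywordAnnotation creation elided
  }
}
```

Because each descriptor binds a different resource to the same key, one annotator class can serve all four dictionary configurations.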

Putting it together: the aggregate AE

Hooking this all up into a single aggregate AE means building yet another XML file to hold this information. And yes, XML files with UIMA get real old real fast, although, admittedly, UIMA comes with Eclipse-based tooling to generate them via component descriptor wizards. Anyway, here it is:

<!-- Source: src/main/resources/descriptors/TaxonomyMappingAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="PatternPreserveAE">
      <import location="PatternPreserveAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="PatternTransformAE">
      <import location="PatternTransformAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DictionaryPreserveMatchCaseAE">
      <import location="DictionaryPreserveMatchCaseAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DictionaryTransformMatchCaseAE">
      <import location="DictionaryTransformMatchCaseAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DictionaryPreserveIgnoreCaseAE">
      <import location="DictionaryPreserveIgnoreCaseAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DictionaryTransformIgnoreCaseAE">
      <import location="DictionaryTransformIgnoreCaseAE.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>TaxonomyMappingAE</name>
    <description>
      Chain of UIMA Annotators to pre-process taxonomy concepts for storage 
      into Neo4J's Lucene Index
    </description>
    <version>1.0</version>
    <vendor>MyCompany Inc.</vendor>
    <configurationParameters/>
    <configurationParameterSettings/>
    <flowConstraints>
      <fixedFlow>
        <node>PatternPreserveAE</node>
        <node>PatternTransformAE</node>
        <node>DictionaryPreserveMatchCaseAE</node>
        <node>DictionaryTransformMatchCaseAE</node>
        <node>DictionaryPreserveIgnoreCaseAE</node>
        <node>DictionaryTransformIgnoreCaseAE</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.keyword.PatternAnnotator
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.keyword.PatternAnnotator
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.tgni.uima.annotators.keyword.DictionaryAnnotator
          </type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>
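Instantiating this aggregate descriptor uses the standard UIMA XML parsing APIs; the UimaUtils.getAE() helper used in the test below (whose source is not shown in this post) presumably wraps something like this sketch:

```java
// Sketch: load an (aggregate) Analysis Engine from an XML descriptor
// using the stock UIMA framework APIs.
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class AggregateAELoader {

  public static AnalysisEngine load(String descriptorPath) throws Exception {
    // parse the XML descriptor into a resource specifier...
    XMLInputSource in = new XMLInputSource(descriptorPath);
    ResourceSpecifier spec =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);
    // ...and produce the analysis engine (UIMA resolves the delegate
    // imports and external resources at this point)
    return UIMAFramework.produceAnalysisEngine(spec);
  }
}
```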

To test this, I ran the following JUnit test. I had been testing the individual primitive AEs as I built them, so I didn't expect any big issues with the aggregate AE. The only problem I ran into was cleanly partitioning the properties between the delegates (which was one of the reasons for going with the SharedResourceObject implementations I mentioned above).
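For reference, a SharedResourceObject-backed dictionary resource such as com.mycompany.tgni.uima.conf.SharedMapResource (the actual class is not shown in this post) would look roughly like the sketch below; the pipe-delimited file format is an assumption for illustration.

```java
// Sketch of a SharedResourceObject that loads a phrase-to-transform map
// from the file named in the descriptor's fileResourceSpecifier.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.uima.resource.DataResource;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.SharedResourceObject;

public class SharedMapResourceSketch implements SharedResourceObject {

  private final Map<String,String> map = new HashMap<String,String>();

  // called once by UIMA when the external resource is first created
  public void load(DataResource data) throws ResourceInitializationException {
    try {
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(data.getInputStream()));
      String line;
      while ((line = reader.readLine()) != null) {
        // assumed format: phrase|transformedValue(s)
        String[] kv = line.split("\\|");
        if (kv.length == 2) map.put(kv[0].trim(), kv[1].trim());
      }
      reader.close();
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }

  public Map<String,String> getMap() { return map; }
}
```

Since the resource is shared, every annotator bound to it sees the same loaded map, which is what makes partitioning the properties per descriptor important.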

// Source: src/test/java/com/mycompany/tgni/uima/annotators/aggregates/TaxonomyMappingAETest.java
package com.mycompany.tgni.uima.annotators.aggregates;

import java.util.Iterator;

import org.apache.commons.lang.StringUtils;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.junit.Test;

import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation;
import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotatorsTest;
import com.mycompany.tgni.uima.utils.UimaUtils;

public class TaxonomyMappingAETest {

  @Test
  public void testConceptMappingPipeline() throws Exception {
    AnalysisEngine ae = UimaUtils.getAE(
      "src/main/resources/descriptors/TaxonomyMappingAE.xml", null);
    for (String testString : KeywordAnnotatorsTest.TEST_STRINGS) {
      JCas jcas = UimaUtils.runAE(ae, testString);
      System.out.println("input=" + testString);
      FSIndex<? extends Annotation> index = 
        jcas.getAnnotationIndex(KeywordAnnotation.type);
      for (Iterator<? extends Annotation> it = index.iterator(); 
          it.hasNext(); ) {
        KeywordAnnotation annotation = (KeywordAnnotation) it.next();
        System.out.println("(" + annotation.getBegin() + "," + 
          annotation.getEnd() + "): " + 
          annotation.getCoveredText() + 
          (StringUtils.isEmpty(annotation.getTransformedValue()) ?
          "" : " => " + annotation.getTransformedValue()));
      }
    }
  }
}
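The UimaUtils.runAE() helper called in the test is also not shown; with the standard UIMA APIs it likely amounts to the following sketch:

```java
// Sketch: run an Analysis Engine over a string and return the JCas.
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.jcas.JCas;

public class RunAESketch {

  public static JCas runAE(AnalysisEngine ae, String text) throws Exception {
    JCas jcas = ae.newJCas();    // CAS built from the AE's type system
    jcas.setDocumentText(text);  // the document to analyze
    ae.process(jcas);            // runs all six delegate AEs in fixed order
    return jcas;
  }
}
```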

And as expected, the test produces the following results. The last two, "mariners" and "Vitamin A", come from the dictionary annotator configurations.

input=Born in the USA I was...
(12,15): USA
input=CSC and IBM are Fortune 500 companies.
(0,3): CSC
(8,11): IBM
input=Linux is embraced by the Oracles and IBMs of the world
(37,41): IBMs
input=PET scans are uncomfortable.
(0,3): PET
input=The HIV-1 virus is an AIDS carrier
(4,9): HIV-1
(4,9): HIV-1 => HIV1, HIV 1
(22,26): AIDS
(22,26): AIDS => Acquired Immunity Deficiency Syndrome
input=Unstructured Information Management Application (UIMA) is fantastic!
(49,53): UIMA
input=Born in the U.S.A., I was...
(12,18): U.S.A. => USA
input=He is a free-wheeling kind of guy.
(8,21): free-wheeling => freewheeling, free wheeling
input=Magellan was one of our great mariners
(30,38): mariners
input=Get your daily dose of Vitamin A here!
(23,32): Vitamin A

So anyway, the next step is to hook this all up into a Lucene analyzer chain, which is what I am working on currently. More on that (hopefully) next week.