Saturday, August 27, 2011

A UIMA Noun Phrase POS Annotator using OpenNLP

Some time ago, I described a UIMA Sentence Annotator that parses a block of plain text using OpenNLP's Sentence Detector and its prebuilt Maximum Entropy based sentence model.

This Sentence Annotator is used in my TGNI application to split a block of text into sentences, which are then further split up into shingles and matched against the Lucene/Neo4j representation of the taxonomy. This approach works in general, but yields a fair amount of false positives.

One class of false positives consists of words such as "Be" or "As", which you would normally expect to be stopworded out but which match chemical name synonyms for the elements Beryllium and Arsenic respectively. Another class consists of words used in an incorrect context, for example "lead" in the sense of "lead a team" rather than the metal. One approach to solving both classes would be to use only the noun phrases from the sentences - the non-noun portions generally contain the ambiguous usages described above.

I decided to investigate whether I could do this with OpenNLP, since I was already using it. The last time I used it was only for sentence detection, and the documentation was quite sparse at that time. Fortunately, this time round, I stumbled on two posts on Davelog: Getting started with OpenNLP 1.5.0 - Sentence Detection and Tokenizing, and Part of Speech (POS) Tagging with OpenNLP 1.5.0, both of which were enormously useful in getting me started.

So I decided to replace my SentenceAnnotator (which annotated the text with sentence annotation markers) with a NounPhraseAnnotator. This one also first splits the input text into sentences using the SentenceDetector, then tokenizes each sentence into words using the Tokenizer, then finds POS tags for each token using the POSTagger. Using the tokens and the associated tags, it then uses the Chunker to break up the sentence into phrase chunks. For each chunk, it checks the chunk type, and only noun phrases (NP) are annotated. The SentenceDetector, Tokenizer, POSTagger and Chunker are all OpenNLP components, each backed by its own maximum entropy based model. Pre-built versions of these models are available for download from here.
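
To make the sequence of OpenNLP calls concrete, here is a minimal standalone sketch of the pipeline outside UIMA, using the same OpenNLP 1.5 classes the annotator below uses. The model file paths and the class name here are my own assumptions for illustration; point them at wherever you downloaded the pre-built en-sent.bin, en-token.bin, en-pos-maxent.bin and en-chunker.bin files.

// Sketch only: OpenNLP 1.5 pipeline (sentences -> tokens -> POS tags -> chunks).
// Model file paths are assumptions; adjust to your local download location.
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class OpenNlpPipelineSketch {

  public static void main(String[] args) throws Exception {
    SentenceDetectorME sentenceDetector =
      new SentenceDetectorME(new SentenceModel(open("conf/models/en-sent.bin")));
    TokenizerME tokenizer =
      new TokenizerME(new TokenizerModel(open("conf/models/en-token.bin")));
    POSTaggerME posTagger =
      new POSTaggerME(new POSModel(open("conf/models/en-pos-maxent.bin")));
    ChunkerME chunker =
      new ChunkerME(new ChunkerModel(open("conf/models/en-chunker.bin")));
    String text = "Dr Johnson will lead the team. " +
      "As I was telling you, he will not attend the meeting.";
    // sentences -> tokens -> POS tags -> chunks, keeping only noun phrases (NP)
    for (String sentence : sentenceDetector.sentDetect(text)) {
      String[] tokens = tokenizer.tokenize(sentence);
      String[] tags = posTagger.tag(tokens);
      Span[] chunks = chunker.chunkAsSpans(tokens, tags);
      for (Span chunk : chunks) {
        if ("NP".equals(chunk.getType())) {
          StringBuilder np = new StringBuilder();
          for (int i = chunk.getStart(); i < chunk.getEnd(); i++) {
            np.append(tokens[i]).append(" ");
          }
          System.out.println("NP: " + np.toString().trim());
        }
      }
    }
  }

  private static InputStream open(String path) throws Exception {
    return new FileInputStream(path);
  }
}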

A UIMA primitive Analysis Engine (AE) consists of an annotation descriptor (specified as XML), an annotator (specified as a Java class) and its associated AE descriptor (also specified as XML).

Annotation XML Descriptor

There is nothing to this annotation, really. It's just a regular Annotation without any extra properties. Here it is.

<?xml version="1.0" encoding="UTF-8"?>
<!-- src/main/resources/descriptors/NounPhrase.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>NounPhrase</name>
  <description>Annotation to represent Noun Phrase sequences in a body of text.</description>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>

Annotator

I have already described what the annotator does above. Ultimately, it will replace the SentenceAnnotator, so it should consume TextAnnotation objects placed by the upstream TextAnnotator. For now, for quick development and testing, I have modeled it as a primitive AE which consumes text blocks. Here is the code for the NounPhraseAnnotator.

// Source: src/main/java/com/mycompany/tgni/uima/annotators/nlp/NounPhraseAnnotator.java
package com.mycompany.tgni.uima.annotators.nlp;

import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

import org.apache.commons.io.IOUtils;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

/**
 * Annotate noun phrases in sentences from within blocks of
 * text (marked up with TextAnnotation) from either HTML or
 * plain text documents. Using the OpenNLP library and models,
 * the incoming text is tokenized into sentences, then each 
 * sentence is tokenized to words and POS tagged, and finally
 * tokens are grouped together into chunks. Of these chunks,
 * only the noun phrases are annotated. 
 */
public class NounPhraseAnnotator extends JCasAnnotator_ImplBase {

  private SentenceDetectorME sentenceDetector;
  private TokenizerME tokenizer;
  private POSTaggerME posTagger;
  private ChunkerME chunker;
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    InputStream smis = null;
    InputStream tmis = null;
    InputStream pmis = null;
    InputStream cmis = null;
    try {
      smis = getContext().getResourceAsStream("SentenceModel");
      SentenceModel smodel = new SentenceModel(smis);
      sentenceDetector = new SentenceDetectorME(smodel);
      smis.close();
      tmis = getContext().getResourceAsStream("TokenizerModel");
      TokenizerModel tmodel = new TokenizerModel(tmis);
      tokenizer = new TokenizerME(tmodel);
      tmis.close();
      pmis = getContext().getResourceAsStream("POSModel");
      POSModel pmodel = new POSModel(pmis);
      posTagger = new POSTaggerME(pmodel);
      pmis.close();
      cmis = getContext().getResourceAsStream("ChunkerModel");
      ChunkerModel cmodel = new ChunkerModel(cmis);
      chunker = new ChunkerME(cmodel);
      cmis.close();
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    } finally {
      IOUtils.closeQuietly(cmis);
      IOUtils.closeQuietly(pmis);
      IOUtils.closeQuietly(tmis);
      IOUtils.closeQuietly(smis);
    }
  }
  
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    Span[] sentSpans = sentenceDetector.sentPosDetect(jcas.getDocumentText());
    for (Span sentSpan : sentSpans) {
      String sentence = sentSpan.getCoveredText(text).toString();
      int start = sentSpan.getStart();
      Span[] tokSpans = tokenizer.tokenizePos(sentence);
      String[] tokens = new String[tokSpans.length];
      for (int i = 0; i < tokens.length; i++) {
        tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
      }
      String[] tags = posTagger.tag(tokens);
      Span[] chunks = chunker.chunkAsSpans(tokens, tags);
      for (Span chunk : chunks) {
        if ("NP".equals(chunk.getType())) {
          NounPhraseAnnotation annotation = new NounPhraseAnnotation(jcas);
          annotation.setBegin(start + 
            tokSpans[chunk.getStart()].getStart());
          annotation.setEnd(
            start + tokSpans[chunk.getEnd() - 1].getEnd());
          annotation.addToIndexes(jcas);
        }
      }
    }
  }
}

The various OpenNLP components are all initialized (once at startup) in the initialize() method - as you can see, the code pattern is quite repetitive. The process() method splits the text into sentences, splits sentences into tokens, POS tags the tokens, then uses the tokens and tags to chunk each sentence. Only noun-phrase chunks are annotated. One important thing to note is that the chunk spans report their start and end offsets in terms of token positions (not character positions). Another thing to note is that the NounPhraseAnnotation begin and end offsets are character offsets relative to the start of the incoming block of text.
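
To make the offset bookkeeping concrete, the conversion performed inline in process() boils down to the following small helper (a sketch with my own names): chunk spans index into the token array, token spans hold character offsets relative to the sentence, and the sentence itself starts sentStart characters into the text block.

// Sketch: converting a chunker Span (token indexes) into character offsets.
import opennlp.tools.util.Span;

public class ChunkOffsets {

  /** Returns {begin, end} character offsets of a chunk relative to the full text block. */
  public static int[] toCharacterOffsets(Span chunk, Span[] tokSpans, int sentStart) {
    int begin = sentStart + tokSpans[chunk.getStart()].getStart(); // first token in chunk
    int end = sentStart + tokSpans[chunk.getEnd() - 1].getEnd();   // last token in chunk (exclusive end)
    return new int[] {begin, end};
  }
}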

AE Descriptor

Finally, here is the AE descriptor for the NounPhraseAnnotator. This is also pretty vanilla; the only non-standard block is the resource manager configuration, which maps the OpenNLP model files to the symbolic names used by the annotator. One other thing you may notice is the reference to the TextAnnotation type - as mentioned above, the ultimate goal is to replace the SentenceAnnotator, which consumes TextAnnotations - that's why it's here.

<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>NounPhraseAE</name>
    <description>Annotates Noun Phrases in sentences</description>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <typeSystemDescription>
      <types>
        <typeDescription>
          <name>com.mycompany.tgni.uima.annotators.text.TextAnnotation</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
        <typeDescription>
          <name>com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
      </types>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs>
          <type allAnnotatorFeatures="true">com.mycompany.tgni.uima.annotators.text.TextAnnotation</type>
          <feature>com.mycompany.tgni.uima.annotators.text.TextAnnotation:tagname</feature>
        </inputs>
        <outputs>
          <type allAnnotatorFeatures="true">com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies>
    <externalResourceDependency>
      <key>SentenceModel</key>
      <description>OpenNLP Sentence Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>TokenizerModel</key>
      <description>OpenNLP Tokenizer Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>POSModel</key>
      <description>OpenNLP POS Tagging Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>ChunkerModel</key>
      <description>OpenNLP Chunker Model</description>
      <optional>false</optional>
    </externalResourceDependency>
  </externalResourceDependencies>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>SentenceModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/models/en-sent.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>TokenizerModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/models/en-token.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>POSModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/models/en-pos-maxent.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>ChunkerModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/models/en-chunker.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>SentenceModel</key>
        <resourceName>SentenceModelSerFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>TokenizerModel</key>
        <resourceName>TokenizerModelSerFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>POSModel</key>
        <resourceName>POSModelSerFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>ChunkerModel</key>
        <resourceName>ChunkerModelSerFile</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

Testing Code and Results

The following JUnit test runs the NounPhraseAnnotator primitive AE against some input sentences.

// Source: src/test/java/com/mycompany/tgni/uima/annotators/nlp/NounPhraseAnnotatorTest.java
package com.mycompany.tgni.uima.annotators.nlp;

import java.util.Iterator;

import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.junit.Test;

import com.mycompany.tgni.uima.utils.UimaUtils;

/**
 * Tests for the Noun Phrase Annotator.
 */
public class NounPhraseAnnotatorTest {

  private static final String[] INPUTS = new String[] { ... };

  @Test
  public void testNounPhraseAnnotation() throws Exception {
    AnalysisEngine ae = UimaUtils.getAE(
      "conf/descriptors/NounPhraseAE.xml", null);
    for (String input : INPUTS) {
      System.out.println("text: " + input);
      JCas jcas = UimaUtils.runAE(ae, input, UimaUtils.MIMETYPE_TEXT);
      FSIndex index = jcas.getAnnotationIndex(NounPhraseAnnotation.type);
      for (Iterator<NounPhraseAnnotation> it = index.iterator(); it.hasNext();) {
        NounPhraseAnnotation annotation = it.next();
        System.out.println("...(" + annotation.getBegin() + "," + 
          annotation.getEnd() + "): " + 
          annotation.getCoveredText());
      }
    }
  }
}

And here are some test inputs and the associated noun phrases that were annotated. The annotation consists of the start and end character positions and the actual string covered.

    [junit] text: Be that as it may, the show must go on.
    [junit] ...(11,13): it
    [junit] ...(19,27): the show
    [junit] text: As I was telling you, he will not attend the meeting.
    [junit] ...(3,4): I
    [junit] ...(17,20): you
    [junit] ...(22,24): he
    [junit] ...(41,52): the meeting
    [junit] text: Dr Johnson will lead the team
    [junit] ...(0,10): Dr Johnson
    [junit] ...(21,29): the team
    [junit] text: Lead is the lead cause of lead poisoning.
    [junit] ...(0,4): Lead
    [junit] ...(8,22): the lead cause
    [junit] ...(26,40): lead poisoning

As you can see, "Be" and "As" are no longer in the set of strings to be matched. The verb "lead" in the third example is also taken care of. The fourth example does return "the lead cause", which will still need to be handled somehow.

Tuesday, August 23, 2011

Implementing Concordance with Lucene Span Queries

I recently needed to build a Named Entity Recognizer (NER) for our proprietary concept mapping/indexing platform to recognize and extract age group data from our document corpus. The approach I envisioned was to match specific age related patterns in the data and map them into specific age brackets.

I have also been reading the NLTK Book (free online version available here) lately, and came across a concept called concordance, which is basically a list of the occurrences of a particular keyword along with their surrounding context in the document corpus. It occurred to me that running a concordance on the document corpus for selected keywords would help me extract the patterns I needed.

Thinking through this some more, I remembered reading Accessing words around a positional match in Lucene by Grant Ingersoll, where he demonstrates the use of Span Queries to find collocated terms.

Since I already had an index whose body was indexed with term vectors, positions and offsets, I figured it would be easier to adapt Grant's code than set up NLTK to find the concordances for a few key terms. So this is what I did - the JUnit test below shows my version, which generates output very similar to that generated by NLTK's concordance() method.

// Source: src/test/java/com/mycompany/tgni/utils/ConcordanceGeneratorTest.java
package com.mycompany.tgni.utils;

import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class ConcordanceGeneratorTest {

  private static final int NUM_CONTEXT_CHARS = 25;
  private static final String[] AGE_TERMS = new String[] {
    "aged", "age", "year"
  };
  
  @Test
  public void testGenerateConcordance() throws Exception {
    IndexSearcher searcher = new IndexSearcher(FSDirectory.open(
      new File("/path/to/my/index")));
    IndexReader reader = searcher.getIndexReader();
    PrintWriter writer = new PrintWriter(new FileWriter(
      new File("/tmp/concordance.txt")), true);
    for (String ageTerm : AGE_TERMS) {
      writer.println("==== concordance for term='" + ageTerm + "' ===="); 
      SpanTermQuery spt = new SpanTermQuery(new Term("body", ageTerm));
      Spans spans = spt.getSpans(reader);
      OffsetTermVectorMapper tvm = new OffsetTermVectorMapper();
      while (spans.next()) {
        Document doc = reader.document(spans.doc());
        String body = doc.get("body");
        tvm.start = spans.start();
        tvm.end = spans.end();
        reader.getTermFreqVector(spans.doc(), "body", tvm);
        String conc = StringUtils.substring(body, 
          tvm.range.getMinimumInteger() - NUM_CONTEXT_CHARS, 
          tvm.range.getMaximumInteger() + NUM_CONTEXT_CHARS);
        if (StringUtils.isNotEmpty(conc)) {
          writer.println(StringUtils.join(new String[] {
            "...", conc, "..."
          }));
        }
      }
    }
    searcher.close();
    writer.flush();
    writer.close();
  }
  
  private class OffsetTermVectorMapper extends TermVectorMapper {

    public int start;
    public int end;
    public IntRange range;
    
    @Override
    public void map(String term, int frequency, TermVectorOffsetInfo[] offsets,
        int[] positions) {
      for (int i = 0; i < positions.length; i++) {
        if (positions[i] >= start && positions[i] < end) {
          TermVectorOffsetInfo offset = offsets[i];
          range = new IntRange(offset.getStartOffset(), offset.getEndOffset());
        }
      }
    }

    @Override
    public void setExpectations(String field, int numTerms,
        boolean storeOffsets, boolean storePositions) {
      // NOOP
    }
  }
}

The code scans the index for spans containing the keywords "age", "aged" and "year", finds the character offsets of these spans, then returns substrings with 25 characters of context on either side. Here is some sample output (truncated to the top 20 results per term for brevity).

==== concordance for term='aged' ====
...y feel trapped within an aged body. Grief, a sense of ...
...ated deaths among people aged 65 years or older and ar...
...tal falls involve people aged 75 years or older-only 4...
...population. Among people aged 65 to 69 years, 1 of eve...
...racture, and among those aged 85 years or older, 1 fal...
...g Medicare beneficiaries aged &gt; or = 65 yearsUnite...
...or, ME) at 68 weeks and aged in our colony room at Ru...
...nalyzed with Age (middle-aged and young)  Treatment (...
...ntly a disease of middle-aged white males, with a medi...
..., but also occurs in the aged. It was the first human ...
...es. Among the population aged five years and above, HI...
... 121 consecutive adults (aged &gt;16 years) who underw...
...c abnormality. In middle-aged and older adults, in who...
...es more common in middle-aged patients withchronic act...
...ctures are mostly middle-aged and actively employed. I...
...d 426 non-twin siblings, aged 12-18 years, was recruit...
...y much involved. Edwina, aged 77, is a young mother an...
...eening for 94% of people aged 55 years and over....
... criteria (), were those aged 18 or older (male or fem...
...g 301 toddlers (children aged 1 to 2 years) with upper...
...
==== concordance for term='age' ====
...ric Depression Scale Old age isn't so bad when you co...
...n individual 65 years of age or older. Among such ind...
... social integration. The age range of this rapidly gr...
...er of people 65 years of age or older is projected to...
...llion people 65 years of age and older. This group wi...
...n older than 85 years of age represents the fastest-g...
...e older than 65 years of age. These statistics are ex...
... individuals 95 years of age and older. African Ameri...
...but only 8% of the older age groups. In 1986, most ol...
...n older than 65 years of age continue to work; 25% ar...
...ans will be 100 years of age or older by the year 205...
...y 2 million will be that age by the year 2080. Falls ...
... individuals 65 years of age and older. Two thirds of...
... in patients 85 years of age and older are caused by ...
...s older than 65 years of age have this fear. Structur...
...rises progressively with age, whereas diastolic press...
...d to half by 75 years of age. Vascular changes may al...
...he case in the geriatric age group. It is well docume...
...s older than 70 years of age do not have chest pain w...
...s older than 65 years of age use about 25% of all pre...
...
==== concordance for term='year' ====
...ntegrity, for at least 1 year after nerve injury. Irre...
... often has occurred by 2 years. Sensory cross-reinnerva...
...ed state for a period of years. Thus, even late reinner...
..., on the order of 2 to 3 years, may be able to restore ...
...ocols were used over the years the data was gathered. E...
...ient is an individual 65 years of age or older. Among s...
...ation spans more than 40 years. The world's geriatric p...
...ng at a rate of 2.5% per yearsignificantly faster tha...
... Americans older than 65 years increased from a little ...
... the number of people 65 years of age or older is proje...
...re 146 million people 65 years of age and older. This g...
...se to 232 million by the year 2020. A decreasing rate ...
...population older than 85 years of age represents the fa...
...cted to continue. By the year 2020, one fifth of the p...
...on will be older than 65 years of age. These statistics...
...3:1 among individuals 95 years of age and older. Africa...
...population older than 65 years. Approximately 12% of th...
...population older than 65 years of age continue to work;...
...on Americans will be 100 years of age or older by the y...
...s of age or older by the year 2050 and that nearly 2 m...
...

As you can see, this list is a good way to find common patterns that need to be extracted from the corpus. All you need is a bit of imagination to think of some good representative terms that cover most patterns you are likely to encounter in the corpus. You also need to scan the list manually to weed out irrelevant matches. It is now relatively simple to craft a number of regular expressions that capture the lower and upper bounds (where available) of the age ranges and assign them to predefined age group blocks.
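
As an illustration of that last step, here is a hedged sketch of the kind of patterns and brackets one could derive from the concordance output above. The regular expressions, bracket names and cutoffs are my own examples, not the ones used in the actual Age-Group NER.

// Sketch: map "aged NN to MM years" / "NN years of age or older" style phrases
// (seen in the concordance output) into coarse age brackets. Patterns and
// bracket cutoffs here are illustrative only.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AgePatternSketch {

  private static final Pattern AGED_RANGE = Pattern.compile(
    "aged\\s+(\\d{1,3})\\s*(?:to|-)\\s*(\\d{1,3})\\s+years", Pattern.CASE_INSENSITIVE);
  private static final Pattern YEARS_OF_AGE = Pattern.compile(
    "(\\d{1,3})\\s+years\\s+of\\s+age(\\s+(?:or|and)\\s+older)?", Pattern.CASE_INSENSITIVE);

  public static String bracket(int lo, int hi) {
    if (hi <= 2) return "infant";
    if (hi <= 12) return "child";
    if (hi <= 17) return "adolescent";
    if (lo >= 65) return "senior";
    return "adult";
  }

  public static void main(String[] args) {
    String[] snippets = {
      "deaths among people aged 65 to 69 years",
      "an individual 65 years of age or older"
    };
    for (String s : snippets) {
      Matcher m = AGED_RANGE.matcher(s);
      if (m.find()) {
        System.out.println(s + " -> " + bracket(
          Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))));
        continue;
      }
      m = YEARS_OF_AGE.matcher(s);
      if (m.find()) {
        int lo = Integer.parseInt(m.group(1));
        // use a large sentinel upper bound for open-ended ranges ("or older")
        System.out.println(s + " -> " + bracket(lo, m.group(2) != null ? 200 : lo));
      }
    }
  }
}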

Of course, the downside is that it kind of puts the cart before the horse. The Age-Group NER is part of the indexing pipeline, but we need an index built without this filter first in order to get the data to build the filter. The right way would probably be to generate the concordance data with something like NLTK. But it is relatively cheap, resource-wise, to build a plain old Lucene index from your corpus, so perhaps it's not quite so bad.

Saturday, August 20, 2011

Implementing Concept Subsumption with Bitsets

According to this page, Concept Subsumption is defined as follows:

Given an ontology O and two classes A, B, verify whether the interpretation of A is a subset of the interpretation of B in every model of O.

For example, concept mapping the string "Lung Cancer Symptom" would yield the following concepts - "lung cancer symptoms", "lung", "cancer", "lung cancer" and "cancer symptoms". Using the definition above, we can say that the concept for "lung cancer symptoms" subsumes the other concepts listed, since the meaning of the words in this concept is a subset of the meaning of the same words in the set of other concepts.

However, if you look at the same problem from a pure text processing point of view, it's just an application of the Longest Match or Maximal Munch rule - i.e., map the longest substring(s) that can be mapped to one or more concepts.

My TGNI application reports all concepts found (along with their start and end positions) in a body of text. While some clients may prefer to filter the concepts based on custom (usually semantic) criteria, subsumption is a common enough requirement to be the default expected behavior. So I need to modify TGNI to only return the longest matches (unless all concepts are specifically requested).

A naive implementation of such a filter would be to sort the concepts by match length (longest first), then for each matched text, check for subsumption against the concepts before it in the list - basically an O(n²) operation. Checking for subsumption involves either checking whether the current matched text is a substring of any of the previous ones, or whether the range defined by its start and end positions is contained in any of the previous ranges.
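
For comparison, here is a minimal sketch of that naive approach using a (start, end) range containment check. The Range class and method names are mine, and the real filter would operate on ConceptAnnotation objects instead.

// Sketch: naive O(n^2) subsumption filter - sort longest-first, keep an
// annotation only if no already-kept annotation's range contains it.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class NaiveSubsumptionSketch {

  public static class Range {
    public final int start, end;
    public final String text;
    public Range(int start, int end, String text) {
      this.start = start; this.end = end; this.text = text;
    }
  }

  public static List<Range> filter(List<Range> ranges) {
    List<Range> sorted = new ArrayList<Range>(ranges);
    Collections.sort(sorted, new Comparator<Range>() {
      @Override
      public int compare(Range r1, Range r2) {
        return (r2.end - r2.start) - (r1.end - r1.start); // longest first
      }
    });
    List<Range> kept = new ArrayList<Range>();
    for (Range cand : sorted) {
      boolean subsumed = false;
      for (Range keep : kept) {               // pairwise containment check
        if (keep.start <= cand.start && cand.end <= keep.end) {
          subsumed = true;
          break;
        }
      }
      if (!subsumed) {
        kept.add(cand);
      }
    }
    return kept;
  }
}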

Another interesting way, which I thought about recently, would be to compute intersections of bitsets representing the input text and the matched text, and drop a concept if the intersection causes no change in cardinality. It's probably easier to describe with an example:

input   : patient with lung cancer symptoms 111111111111111111111111111111111 
(13,32) : lung cancer symptoms              111111111111100000000000000000000 
(18,32) : cancer symptoms                   111111111111100000000000000000000 
(14,23) : lung cancer                       111111111111100000000000000000000
(18,23) : cancer                            111111111111100000000000000000000
(13,17) : lung                              111111111111100000000000000000000

Notice how the only time the cardinality (number of 1s) of the input bitset changes is when the bitset for "lung cancer symptoms" is ANDed with the input bitset. Using our test described above, the only concept that would survive the subsumption filtering is this one.

Here is some JUnit prototype code I wrote to prove to myself that this will work as described above. Folding this code into the ConceptAnnotator is fairly trivial (it will be called as a post-processing step to remove ConceptAnnotations from the FSIndex after all the concepts are mapped), so I will not describe that code on the blog unless I make some major changes to that class in the future.

// Source: src/test/java/com/mycompany/tgni/uima/annotators/concept/MaximalMunchTest.java
package com.mycompany.tgni.uima.annotators.concept;

import java.util.Arrays;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.util.OpenBitSet;
import org.junit.Test;

public class MaximalMunchTest {

  private static class Annot {
    public int start;
    public int end;
    public String text;
    public Annot(int start, int end, String text) {
      this.start = start;
      this.end = end;
      this.text = text;
    }
  };

  private static final String INPUT = 
    "lung cancer symptoms and kidney cancer";
  
  // 0         1         2         3
  // 0123456789012345678901234567890123456789
  // lung cancer symptoms and kidney cancer
  public static final Annot[] ANNOTS = new Annot[] {
    new Annot(0, 3, "lung"),
    new Annot(0, 10, "lung cancer"),
    new Annot(0, 19, "lung cancer symptoms"),
    new Annot(5, 10, "cancer"),
    new Annot(5, 19, "cancer symptoms"),
    new Annot(5, 10, "cancer"),
    new Annot(25, 30, "kidney"),
    new Annot(25, 37, "kidney cancer"),
    new Annot(32, 37, "cancer")
  };
  
  @SuppressWarnings("unchecked")
  @Test
  public void testMaximalMunch() throws Exception {
    OpenBitSet tset = new OpenBitSet(INPUT.length());
    tset.set(0, tset.length());
    List<Annot> annots = Arrays.asList(ANNOTS);
    // sort the annotations, longest first
    Collections.sort(annots, new Comparator<Annot>() {
      @Override
      public int compare(Annot annot1, Annot annot2) {
        Integer len1 = annot1.end - annot1.start;
        Integer len2 = annot2.end - annot2.start;
        return len2.compareTo(len1);
      }
    });
    List<Annot> maxMunchAnnots = new ArrayList<Annot>();
    long prevCardinality = tset.cardinality();
    for (Annot annot : annots) {
      OpenBitSet aset = new OpenBitSet(tset.length());
      aset.set(0, tset.length());
      aset.flip(annot.start, annot.end);
      tset.intersect(aset);
      long cardinality = tset.cardinality();
      if (cardinality == prevCardinality) {
        // complete overlap, skip
        continue;
      }
      maxMunchAnnots.add(annot);
      prevCardinality = cardinality;
    }
    for (Annot annot : maxMunchAnnots) {
      System.out.println("(" + annot.start + "," + 
        annot.end + "): " + annot.text);
    }
  }
}

As expected, the code successfully identifies the longest matched concepts in the list.

    [junit] (0,19): lung cancer symptoms
    [junit] (25,37): kidney cancer

The nice thing about this algorithm is that the filtering pass is O(n), i.e., linear (the initial sort is still O(n log n)), compared to the O(n²) naive approach I described above, at the cost of slightly more memory to hold the bitsets. What do you think? If you have suggestions for a different/better algorithm or data structure to achieve this, I would appreciate you letting me know.

Wednesday, August 17, 2011

A Spring Web Interface for TGNI

Having successfully run some unit tests for the various use cases I wanted to cover (see previous post), it was time to build a web interface for TGNI. The web interface I envisioned would allow someone to check out the concept mapping functionality from a web browser, as well as provide an (application-centric) means of navigating the graph database.

As you would expect, there is not much to building a web interface once you have your JUnit tests working, especially if you use something like Spring. Of course, web interface development is slower, because I am not as fluent in JSTL/HTML/CSS as I am in Java, so I end up having to RTFM more often. But when you want to have other people look at and use your software, a web interface is an essential (and often the cheapest) tool to provide.

There are some things I ended up changing in order to make the web interface work the way I wanted it to. In this post, I will list these changes and provide the code and screenshots for the interesting components.

Overview

The central class in the integration is a Spring multi-action controller (written with Spring 3 annotations). It wraps the NodeService to provide the navigation interface, and the aggregate UIMA Analysis Engine (AE) to do concept mapping. Here is the code for the ConceptMappingController.

// Source: src/main/java/com/mycompany/tgni/spring/ConceptMappingController.java
package com.mycompany.tgni.spring;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import javax.annotation.PostConstruct;

import org.apache.commons.collections15.Bag;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.NumberUtils;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.servlet.ModelAndView;

import com.mycompany.tgni.beans.TConcept;
import com.mycompany.tgni.beans.TRelTypes;
import com.mycompany.tgni.beans.TRelation;
import com.mycompany.tgni.neo4j.NodeService;
import com.mycompany.tgni.neo4j.NodeServiceFactory;
import com.mycompany.tgni.uima.annotators.concept.ConceptAnnotation;
import com.mycompany.tgni.uima.utils.UimaUtils;

/**
 * Controller to expose TGNI functionality via a web application.
 */
@Controller
public class ConceptMappingController {

  private String conceptMappingAEDescriptor;
  
  private AnalysisEngine conceptMappingAE;
  private NodeService nodeService;
  
  public void setConceptMappingAEDescriptor(
      String conceptMappingAEDescriptor) {
    this.conceptMappingAEDescriptor = conceptMappingAEDescriptor;
  }
  
  @PostConstruct
  public void init() throws Exception {
    conceptMappingAE = UimaUtils.getAE(conceptMappingAEDescriptor, null);
    nodeService = NodeServiceFactory.getInstance();
  }

  @RequestMapping(value="/find.html")
  public ModelAndView find(
      @RequestParam(value="q", required=false) String q) {
    ModelAndView mav = new ModelAndView();
    mav.addObject("operation", "find");
    if (StringUtils.isEmpty(q)) {
      mav.setViewName("find");
      return mav;
    }
    try {
      if (NumberUtils.isNumber(q) && 
          StringUtils.length(q) == 7) {
        return show(Integer.valueOf(q));
      } else {
        long startTs = System.currentTimeMillis();
        List<TConcept> concepts = nodeService.findConcepts(q);
        mav.addObject("concepts", concepts);
        long endTs = System.currentTimeMillis();
        mav.addObject("q", q);
        mav.addObject("elapsed", new Long(endTs - startTs));
      }
    } catch (Exception e) {
      mav.addObject("error", e.getMessage());
    }
    mav.setViewName("find");
    return mav;
  }
  
  @RequestMapping(value="/map.html")
  public ModelAndView map(
      @RequestParam(value="q1", required=false) String q1,
      @RequestParam(value="q2", required=false) String q2,
      @RequestParam(value="q3", required=false) String q3,
      @RequestParam(value="if", required=false, 
        defaultValue=UimaUtils.MIMETYPE_STRING) String inputFormat,
      @RequestParam(value="of", required=true, 
        defaultValue="html") String outputFormat) {

    ModelAndView mav = new ModelAndView();
    mav.addObject("operation", "map");
    // validate parameters (at least one of q1, q2 or q3 must
    // be supplied, otherwise show the input form)
    mav.addObject("q1", StringUtils.isEmpty(q1) ? "" : q1);
    mav.addObject("q2", StringUtils.isEmpty(q2) ? "" : q2);
    mav.addObject("q3", StringUtils.isEmpty(q3) ? "" : q3);
    String q = StringUtils.isNotEmpty(q1) ? q1 : 
      StringUtils.isNotEmpty(q2) ? q2 : 
      StringUtils.isNotEmpty(q3) ? q3 : null;
    if (StringUtils.isEmpty(q)) {
      setViewName(mav, outputFormat);
      return mav;
    }
    try {
      if (NumberUtils.isNumber(q) && 
          StringUtils.length(q) == 7) {
        return show(Integer.valueOf(q));
      } else {
        // show list of concepts
        String text = q;
        if ((q.startsWith("http://") && 
            UimaUtils.MIMETYPE_HTML.equals(inputFormat))) {
          URL u = new URL(q);
          BufferedReader br = new BufferedReader(
            new InputStreamReader(u.openStream()));
          StringBuilder tbuf = new StringBuilder();
          String line = null;
          while ((line = br.readLine()) != null) {
            tbuf.append(line).append("\n");
          }
          br.close();
          text = tbuf.toString();
        }
        List<ConceptAnnotation> annotations = 
          new ArrayList<ConceptAnnotation>();
        long startTs = System.currentTimeMillis();
        JCas jcas = UimaUtils.runAE(conceptMappingAE, text, inputFormat);
        FSIndex fsindex = jcas.getAnnotationIndex(ConceptAnnotation.type);
        for (Iterator<ConceptAnnotation> it = fsindex.iterator(); it.hasNext(); ) {
          ConceptAnnotation annotation = it.next();
          annotations.add(annotation);
        }
        if (annotations.size() == 0) {
          mav.addObject("error", "No concepts found");
        } else {
          mav.addObject("text", text);
          mav.addObject("annotations", annotations);
          long endTs = System.currentTimeMillis();
          mav.addObject("elapsed", new Long(endTs - startTs));
        }
        setViewName(mav, outputFormat);
      }
    } catch (Exception e) {
      mav.addObject("error", e.getMessage());
      setViewName(mav, outputFormat);
    }
    return mav;
  }
  
  @RequestMapping(value="/show.html", method=RequestMethod.GET)
  public ModelAndView show(
      @RequestParam(value="q", required=true) int q) {
    ModelAndView mav = new ModelAndView();
    mav.addObject("operation", "show");
    try {
      long startTs = System.currentTimeMillis();
      // show all details about the concept
      TConcept concept = nodeService.getConcept(q);
      Bag<TRelTypes> relCounts = nodeService.getRelationCounts(concept);
      Map<String,List<TRelation>> relmap = 
        new HashMap<String,List<TRelation>>();
      Map<Integer,String> oidmap = new HashMap<Integer,String>();
      for (TRelTypes reltype : relCounts.uniqueSet()) {
        List<TRelation> rels = nodeService.getRelatedConcepts(
          concept, reltype);
        for (TRelation rel : rels) {
          TConcept toConcept = nodeService.getConcept(rel.getToOid());
          oidmap.put(rel.getToOid(), toConcept.getPname());
        }
        relmap.put(reltype.name(), rels);
      }
      mav.addObject("concept", concept);
      mav.addObject("relmap", relmap);
      mav.addObject("oidmap", oidmap);
      long endTs = System.currentTimeMillis();
      mav.addObject("elapsed", new Long(endTs - startTs));
    } catch (Exception e) {
      mav.addObject("error", e.getMessage());
    }
    mav.setViewName("show");
    return mav;
  }
  
  private void setViewName(ModelAndView mav, String format) {
    if ("html".equals(format)) {
      mav.setViewName("map");
    } else if ("xml".equals(format)) {
      mav.setViewName("map-xml");
    } else if ("json".equals(format)) {
      mav.setViewName("map-json");
    } else if ("jsonp".equals(format)) {
      mav.addObject("mapPrefix", "map");
      mav.setViewName("map-json");
    }
  }
}

As you can see, it exposes the map(), find() and show() methods. The map() method is the one that exposes the Concept Mapping AE via the web. It takes its input either as a single short string, an OID, a block of plain text or HTML copy-pasted into its textbox, or a URL from which it will pull in the content. It analyzes the string or text and provides a list of the concepts it found. The map() method can also output results in XML, JSON or JSON-P.

The find() method allows you to quickly find concepts by name. Its filtering criteria are not as strict as those used by the ConceptAnnotator; you can think of it as a basic search interface into the Lucene index, allowing you to quickly find which concepts in your database match your search string.
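
To give an idea of how lightweight such a lookup can be, here is a purely hypothetical sketch of a name search against a Lucene 3.x index. The field names ("name", "oid"), analyzer, version constant and index path are all assumptions; the actual NodeService.findConcepts() implementation may filter and rank results quite differently.

// Hypothetical sketch of a loose concept-name lookup against a Lucene 3.x index.
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class FindConceptsSketch {

  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher(
      FSDirectory.open(new File("/path/to/concept/index")));
    QueryParser parser = new QueryParser(
      Version.LUCENE_30, "name", new StandardAnalyzer(Version.LUCENE_30));
    Query query = parser.parse("heart attack");
    // print the top 10 matching concept names and their OIDs
    for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
      Document doc = searcher.doc(sd.doc);
      System.out.println(doc.get("oid") + ": " + doc.get("name"));
    }
    searcher.close();
  }
}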

The show() method, on the other hand, can be thought of as an interface into the graph database, allowing you to look up a node and all its details by OID, including references to nodes immediately adjacent to it. Of course, all of these are interlinked via JSP references. Here are some screenshots to make things clear.

Some concepts mapped off a block of plain text copy-pasted into the text box.
The same output as above but in XML, for a remote client to consume. Other formats supported are JSON and JSON-P.
Clicking on one of these concepts leads to the Node view screen, with details about the concept and its immediate neighbors.
A list of concept nodes that match "Heart Attack".

I have been curious about how to build tabbed navigation, since it looks nicer than a list of links across the top of the page, so I built one based on the advice provided here. The image for the logo was snagged from this blog post.

Configuration and Data need to be centrally located

The Concept Mapping AE in the controller described above is an aggregate AE, composed of a Boilerplate removal AE, a Sentence Annotating AE, and the Concept Annotator AE. Each of these primitive AEs, as well as the aggregate AE, is instantiated by the UIMA Framework using its XML descriptor files. In addition, some of these AEs have external references to their own properties files.

I wasn't very confident that I would be able to load everything from the classpath, and I needed to move the data (the Lucene index, the Neo4j database and the EHCache cache files) to a central location anyway, so I decided to do something like Solr does with its SOLR_HOME: move everything to a central location and have it all accessed as files.

To do this, I replaced the absolute path names in the various XML and properties files with an @tgni.home@ placeholder followed by a relative path. I then added an Ant target (using the Ant replace task) that allows me to "push" the latest configuration changes to $TGNI_HOME/conf. Here is the XML snippet for the target.

  <!-- default location for TGNI_HOME (if not specified) -->
  <property name="tgni.home" value="/prod/web/data/tgni"/>
  <target name="copyconf" description="Create data/config instance">
    <echo message="Copying config and data files to ${tgni.home}"/>
    <mkdir dir="${tgni.home}/conf"/>
    <copy todir="${tgni.home}/conf">
      <fileset dir="src/main/resources"/>
    </copy>
    <replace dir="${tgni.home}/conf" value="${tgni.home}">
      <include name="**/*.properties"/>
      <include name="**/*.xml"/>
      <replacetoken>@tgni.home@</replacetoken>
    </replace>
  </target>

In the code, I added an extra method to UimaUtils that provides the value of the TGNI_HOME environment variable to the rest of the code. If no TGNI_HOME is defined in the environment, then a hardcoded default value is used.

  public static String getTgniHome() {
    if (tgniHome == null) {
      Map<String,String> env = System.getenv();
      if (env.containsKey("TGNI_HOME")) {
        tgniHome = env.get("TGNI_HOME");
      } else {
        tgniHome = DEFAULT_TGNI_HOME;
      }
    }
    return tgniHome;
  }

Any files accessed by the UIMA framework through the XML descriptors already have their paths expanded to absolute paths. Files accessed from the code have their supplied paths (relative to TGNI_HOME) prefixed with the value returned by the method above, along the lines of the sketch below. This approach resulted in very minimal changes to the code.
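
The prefixing itself is trivial; it amounts to something like the following (a sketch, the helper name is mine):

// Sketch: resolve a path relative to TGNI_HOME, as NodeServiceFactory below
// does for nodeservice.properties.
import java.io.File;

public class TgniPaths {

  public static File resolve(String tgniHome, String... relativeParts) {
    File f = new File(tgniHome);
    for (String part : relativeParts) {
      f = new File(f, part);   // e.g. $TGNI_HOME/conf/nodeservice.properties
    }
    return f;
  }
}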

The data is stored under $TGNI_HOME/data with separate subdirectories for the Neo4j database, the Lucene index, and the EHCache cache files.

NodeService needs to be a singleton

The other major change to the code was driven by a Neo4j limitation (or feature): a JVM can hold only a single open reference to an embedded Neo4j GraphDatabaseService for a given store. In my controller, the Concept Mapping AE needs one reference (used by the ConceptAnnotator) and the controller itself needs another to support the navigation interface, so I was getting errors advising me of this every time the webapp started.

The solution was to have all the components reuse the same instance of the NodeService (the combo service interface that hides a reference to a Neo4j database and a Lucene index behind it). For this, I needed a factory that would instantiate a single instance of a NodeService on (Spring) container startup and destroy it on shutdown. Here is the code for this factory, named (appropriately enough) NodeServiceFactory.

// Source: src/main/java/com/mycompany/tgni/neo4j/NodeServiceFactory.java
package com.mycompany.tgni.neo4j;

import java.io.File;
import java.io.FileInputStream;
import java.util.Properties;

import org.apache.commons.lang.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.mycompany.tgni.uima.utils.UimaUtils;

/**
 * Factory for NodeService to ensure that a single instance
 * of NodeService is created (Neo4j allows a single reference
 * to it per JVM). Clients requiring a NodeService reference
 * call NodeServiceFactory.getInstance().
 */
public class NodeServiceFactory {

  private static final Logger logger = LoggerFactory.getLogger(
    NodeServiceFactory.class);
  
  private static NodeServiceFactory factory = new NodeServiceFactory();
  
  private static NodeService instance;
  
  private NodeServiceFactory() {
    init();
  }
  
  private void init() {
    try {
      Properties props = new Properties();
      props.load(new FileInputStream(new File(
        StringUtils.join(new String[] {
        UimaUtils.getTgniHome(),
        "conf",
        "nodeservice.properties"}, File.separator))));
      instance = new NodeService();
      instance.setGraphDir(props.getProperty("graphDir"));
      instance.setIndexDir(props.getProperty("indexDir"));
      instance.setStopwordsFile(props.getProperty("stopwordsFile"));
      instance.setTaxonomyMappingAEDescriptor(
        props.getProperty("taxonomyMappingAEDescriptor"));
      instance.setCacheDescriptor(
        props.getProperty("cacheDescriptor"));
      instance.init();
    } catch (Exception e) {
      logger.error("Can't initialize NodeService", e);
    }
  }
  
  public static NodeService getInstance() {
    return instance;
  }
  
  public static void destroy() {
    if (instance != null) {
      try {
        instance.destroy();
        instance = null;
      } catch (Exception e) {
        logger.error("Can't destroy NodeService", e);
      }
    }
  }
}

The Spring XML configuration snippet to declare the nodeService singleton is as follows.

  <bean id="nodeService" class="com.mycompany.tgni.neo4j.NodeServiceFactory" 
    factory-method="getInstance" destroy-method="destroy"/>

This reference can either be injected into the controller, or the controller can get the common NodeService instance using NodeServiceFactory.getInstance(). I chose the latter approach because that's how I had to do it in the ConceptAnnotator (there is no possibility of setter injection there; this is UIMA land), and I wanted to keep things consistent.
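
For example, an annotator can pick up the shared instance in its initialize() method, roughly as sketched below; the class name is mine, and the real ConceptAnnotator does considerably more in its initialize().

// Sketch: a UIMA annotator base class that looks up the shared NodeService
// instead of relying on setter injection.
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.tgni.neo4j.NodeService;
import com.mycompany.tgni.neo4j.NodeServiceFactory;

public abstract class NodeServiceAwareAnnotator extends JCasAnnotator_ImplBase {

  protected NodeService nodeService;

  @Override
  public void initialize(UimaContext ctx) throws ResourceInitializationException {
    super.initialize(ctx);
    // same singleton the Spring controller uses, so only one Neo4j reference exists per JVM
    nodeService = NodeServiceFactory.getInstance();
  }
}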