Saturday, September 27, 2014

Predictive Language Model using Neural Networks


Notwithstanding the success stories and popularity of Artificial Neural Networks (ANNs), I have, until recently, thought of them as a tool by which you randomly torture data until it confesses its secrets. It's only lately that I have begun to (partially) understand the method behind the madness, as I have started going through the archives of the Neural Networks for Machine Learning course on Coursera.

This week I describe an ANN that is trained on a set of 4-grams generated from a corpus of ~97k sentences, and which, when given the first 3 words of a 4-gram sequence, predicts the fourth word. The entire vocabulary (with words lowercased and punctuation removed) comes out to slightly under 250 unique words. The data comes from Programming Assignment II of the course, which asks the student to build a similar network. However, this is not the solution to the assignment, for reasons I enumerate below, and is likely to give you completely bogus results, so caveat emptor.

Since neurons need to be fed floating-point numbers as input, and our words are categorical variables, each word can be converted into a sparse 250-element array using one-hot encoding. Alternatively, we can project the word onto a denser, smaller array (of size 50 in our case). This process is called word embedding, and if done correctly (i.e., the way the assignment requires), these smaller word vectors encode not only information about the word itself, but also context information relative to other words in the corpus, as this article explains.
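
To make the distinction concrete, here is a minimal sketch of the two representations for a single word. The vocabulary and embedding sizes match the ones used in this post, but the dense values shown are random placeholders rather than trained embeddings.

// Sketch: one-hot vs dense representation of a single word.
// Assumes a vocabulary of 250 words and an embedding size of 50; the dense
// values below are random stand-ins, not the trained embeddings.
object WordRepresentations extends App {
  val vocabSize = 250
  val embeddingSize = 50

  // one-hot: sparse 250-element array with a single 1.0 at the word's index
  def oneHot(wordId: Int): Array[Double] = {
    val v = Array.fill(vocabSize)(0.0D)
    v(wordId) = 1.0D
    v
  }

  // dense: 50-element array of reals per word; in the real setup these
  // values come out of training rather than a random number generator
  val random = new scala.util.Random(42)
  def denseEmbedding(): Array[Double] =
    Array.fill(embeddingSize)(random.nextDouble - 0.5D)

  println(oneHot(42).count(_ > 0.0D)) // 1 non-zero element out of 250
  println(denseEmbedding().length)    // 50 real-valued elements
}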

One of the things I try to do when learning something new is to find and adopt toolkits that allow me to build things myself without having to drop down to first principles. In this case I chose Encog for my toolkit. I also looked at DeepLearning4j (DL4j), but found the documentation a bit lacking - I'll probably come back to it once I know a bit more. Both libraries expect (quite a bit of) knowledge of ANNs, and you have to hunt through the Internet, mailing lists and source code of both projects to get answers, but of the two, Encog is a bit easier to get started with. It helps that the author of Encog has also written a number of books, two of which I bought and found very useful, namely Programming Neural Networks with Encog3 in Java and Introduction to the Math of Neural Networks. There is also a free book on his site that will get you started with Encog.

A library can help speed you up in some cases, but can also hold you back in others. In this case, neither library provides a way to build the word embedding into the network itself, i.e., a network with an input layer of 750 neurons (3 words, 250 elements per word) projected into an embedded word space of 150 neurons (3 words, 50 elements each), followed by a hidden layer of 200 neurons and an output layer of 250 neurons (one word, one-hot encoded). In such a network, the word embedding weights start out random but are updated during training, thus inheriting context information from the other words in the 4-gram and the corpus.
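
For reference, the shape of the network described above (750 → 150 → 200 → 250) could be declared in Encog roughly as in the sketch below. This is only an illustration of the layer sizes: Encog treats the 750-to-150 connection as one dense weight matrix, so the three per-word 250-to-50 projections are not forced to share weights, which is precisely the missing piece.

// Sketch only: the desired layer sizes expressed as a plain Encog
// feed-forward network. Note that this does NOT tie the embedding weights
// across the three word positions, which is the limitation discussed above.
import org.encog.neural.networks.BasicNetwork
import org.encog.neural.networks.layers.BasicLayer
import org.encog.engine.network.activation.{ActivationSigmoid, ActivationTANH}

val desired = new BasicNetwork()
desired.addLayer(new BasicLayer(null, true, 250 * 3))                 // 3 one-hot encoded words
desired.addLayer(new BasicLayer(new ActivationTANH(), true, 50 * 3))  // "embedding" layer
desired.addLayer(new BasicLayer(new ActivationSigmoid(), true, 200))  // hidden layer
desired.addLayer(new BasicLayer(new ActivationSigmoid(), false, 250)) // one-hot encoded output
desired.getStructure().finalizeStructure()
desired.reset()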

I ended up training a secondary ANN with the vocabulary words to project the one-hot encoding of each word onto the denser 50-dimensional space. I used tanh as the activation function based on the advice on DL4j's word2vec page. I then used the results of this network to feed my main network with 150 input neurons (3 words, 50 dimensions each), 200 hidden neurons and 250 output neurons. This allows me to predict the 4th word pretty well, but does not have the generalization power I could have achieved if the word-embedding weights were built into the ANN. Here is the code.

// Source: src/main/scala/com/mycompany/scalcium/langmodel/LangModelNetwork.scala
package com.mycompany.scalcium.langmodel

import java.io.File

import scala.Array._
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import scala.util.Random

import org.encog.Encog
import org.encog.engine.network.activation.ActivationSigmoid
import org.encog.engine.network.activation.ActivationTANH
import org.encog.mathutil.matrices.Matrix
import org.encog.ml.data.basic.BasicMLData
import org.encog.ml.data.basic.BasicMLDataSet
import org.encog.neural.networks.BasicNetwork
import org.encog.neural.networks.layers.BasicLayer
import org.encog.neural.networks.training.propagation.back.Backpropagation
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation

class LangModelNetwork(infile: File) {

  val sentences = Source.fromFile(infile).getLines.toList
  val vocab = buildVocab(sentences)
  val encoder = new OneHotEncoder(vocab.size)
  
  // embed each word into a dense 50D representation
  val embeddings = computeWordEmbeddings(vocab, 50, false)

  val quadgrams = buildQuadGrams(sentences, vocab)
  val trainDataset = new BasicMLDataSet()
  val testDataset = new BasicMLDataSet()
  quadgrams.map(quadgram => {
    val (ingrams, outgram) = quadgram.splitAt(3)
    val inputs = ingrams.flatMap(ingram => embeddings(vocab(ingram))).toArray
    val output = encoder.encode(vocab(outgram.head))
    // split train/test 75/25
    if (Random.nextDouble < 0.75D)
      trainDataset.add(new BasicMLData(inputs), new BasicMLData(output))
    else testDataset.add(new BasicMLData(inputs), new BasicMLData(output))
  })
  val network = new BasicNetwork()
  network.addLayer(new BasicLayer(null, true, 50 * 3)) // 50 neurons per word
  network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 200))
  network.addLayer(new BasicLayer(new ActivationSigmoid(), false, vocab.size))
  network.getStructure().finalizeStructure()
  network.reset()
  val trainer = new Backpropagation(network, trainDataset, 0.1, 0.9)
  trainer.setBatchSize(100)
  trainer.setThreadCount(Runtime.getRuntime.availableProcessors())
  var epoch = 0
  do {
    trainer.iteration()
    Console.println("Epoch: %d, error: %5.3f".format(epoch, trainer.getError()))
    epoch += 1
  } while (trainer.getError() > 0.01)
  trainer.finishTraining()
  // measure performance on test set
  var numOk = 0.0D
  var numTests = 0.0D
  testDataset.map(pair => {
    val predicted = network.compute(pair.getInput()).getData()
    val actual = encoder.decode(pair.getIdeal().getData())
    if (actual == predicted.indexOf(predicted.max)) numOk += 1.0D
    numTests += 1.0D
  })
  Console.println("Test set error: %5.3f".format(numOk / numTests))
  Encog.getInstance().shutdown()
  
  def computeWordEmbeddings(vocab: Map[String,Int], targetSize: Int,
      trace: Boolean): Map[Int,Array[Double]] = {
    val inputs = (0 until vocab.size)
      .map(i => encoder.encode(i))
      .toArray
    val dataset = new BasicMLDataSet(inputs, inputs) 
    val network = new BasicNetwork()
    network.addLayer(new BasicLayer(null, true, vocab.size))
    network.addLayer(new BasicLayer(new ActivationTANH(), false, targetSize))
    network.getStructure().finalizeStructure()
    network.reset()
    val trainer = new ResilientPropagation(network, dataset)
    var epoch = 0
    do {
      trainer.iteration()
      if (trace) Console.println("Epoch: %d, error: %5.3f"
        .format(epoch, trainer.getError()))
      epoch += 1
    } while (trainer.getError() > 0.001)
    trainer.finishTraining()
    val embeddings = dataset.zipWithIndex.map(pz => {
        val predicted = network.compute(pz._1.getInput())
        (pz._2, predicted.getData())
      })
      .toMap
    Encog.getInstance().shutdown()
    embeddings
  }
  
  def getWords(sentence: String): List[String] = 
    sentence.toLowerCase.split("\\s+")
      .filter(word => ! word.matches("\\p{Punct}")) // remove puncts
      .filter(word => word.trim().length > 0)       // remove empty words
      .toList
    
  def buildVocab(sentences: List[String]): Map[String,Int] =
    sentences.flatMap(sentence => getWords(sentence))
      .toSet        // remove duplicates
      .zipWithIndex // word => word_id
      .toMap

  def buildQuadGrams(sentences: List[String], vocab: Map[String,Int]): 
   List[List[String]] = 
    sentences.map(sentence => getWords(sentence))
      .map(words => words.sliding(4))
      .flatten
      .filter(quad => quad.size == 4)

  def printSparse(matrix: Matrix, numRows: Int): String = {
    val buffer = ArrayBuffer[(Int,Int,Double)]()
    (0 until numRows).foreach(rownum => {
      (0 until matrix.getCols()).foreach(colnum => {
        if (matrix.get(rownum, colnum) > 0.0D) buffer += ((rownum, colnum, 1.0D))
      })
    })
    buffer.mkString(",")
  }
}

class OneHotEncoder(val arraySize: Int) {
  
  def encode(idx: Int): Array[Double] = {
    val encoded = Array.fill[Double](arraySize)(0.0D)
    encoded(idx) = 1.0
    encoded
  }
  
  def decode(activations: Array[Double]): Int = {
    val idxs = activations.zipWithIndex
      .filter(a => a._1 > 0.0D)
    if (idxs.size == 1) idxs.head._2 else -1
  }
}

It's called from a small JUnit test that passes the raw_sentences.txt file into the constructor. Here is the code for the JUnit test.

// Source: src/test/scala/com/mycompany/scalcium/langmodel/LangModelNetworkTest.scala
package com.mycompany.scalcium.langmodel

import org.junit.Test
import java.io.File

class LangModelNetworkTest {

  @Test
  def testReadSentences(): Unit = {
    val infile = new File("/path/to/raw_sentences.txt")
    val lmn = new LangModelNetwork(infile)
  }
}

Note that you will have to give sbt more heap space than it allocates by default. I was able to do this by passing in -mem 4096 to my sbt command.

The error on the training set is around 0.4% and on the test set around 4.2%. I tried various values of the learning rate and momentum, and the numbers did not change too much. Here they are if you are interested.

Learning Rate    Momentum    Training Error    Test Error
0.1              0.9         0.004             0.042
0.001            0.9         0.007             0.046
10               0.9         0.005             0.009
0.1              0.0         0.004             0.048
0.1              0.5         0.004             0.048

Similarly, I experimented with a few values of the Input and Hidden layer sizes, but did not see too much change in the overall errors.

Input Layer Size    Hidden Layer Size    Training Error    Test Error
50                  200                  0.004             0.042
5                   100                  0.004             0.041
50                  10                   0.004             0.041
100                 5                    0.004             0.041

Finally, in order to see the word embedding stuff in action, I decided to use the word2vec NN in gensim. This is a Python (Cython if available) port of Google's word2vec implementation described here. Here is the code to build a Gensim Word2Vec model with the sentences supplied in the programming assignment.

// Source: nltk-examples/src/topicmodel/gensim_word2vec.py
import string
import nltk
from gensim.models import word2vec
import logging
logging.basicConfig(format="%(asctime)s: %(levelname)s : %(message)s", 
                    level=logging.INFO)

# load data
fin = open("/path/to/raw_sentences.txt", 'rb')
puncts = set([c for c in string.punctuation])
sentences = []
for line in fin:
    # each sentence is a list of words, we lowercase and remove punctuations
    # same as the Scala code
    sentences.append([w for w in nltk.word_tokenize(line.strip().lower()) 
            if w not in puncts])
fin.close()

# train word2vec with sentences
model = word2vec.Word2Vec(sentences, size=100, window=4, min_count=1, workers=4)
model.init_sims(replace=True)

# find 10 words closest to "day"
print "words most similar to 'day':"
print model.most_similar(positive=["day"], topn=10)

# find closest word to "he"
print "words most similar to 'he':"
print model.most_similar(positive=["he"], topn=1)

Here are the results of running this code. As you can see, the vectors assigned to each word by word2vec display some knowledge of semantics inherited from the context. Essentially, similar words are words that could be dropped into the quadgram without making it nonsensical.

words most similar to 'day':
  [('night', 0.7594585418701172), ('week', 0.728765606880188), 
    ('year', 0.6336678266525269), ('season', 0.5495686531066895), 
    ('game', 0.5045607089996338), ('group', 0.5027363896369934), 
    ('after', 0.4796753525733948), ('off', 0.45759227871894836), 
    ('time', 0.45758330821990967), ('university', 0.44466036558151245)]

words most similar to 'he':
  [('she', 0.9050558805465698)]

That's all I have for today. ANNs seem to be a bit more on the bleeding edge than some of the other ML tools I use, so perhaps I will have to start writing code from first principles. But there are a lot of things to learn, and I will probably spend the next few weeks exploring ANNs further.

Saturday, September 20, 2014

Coreference Resolution with Stanford CoreNLP and LingPipe


Up until recently, I had no use for Entity Recognition apart from the entities in our medical ontology. As we move away from published literature and into the realm of patient records, recognizing names, locations, etc. is becoming a necessity. And once you get into Entity Recognition, the need for Coreference Resolution can't be very far behind. Accordingly, I decided to investigate what kind of support was available for Coreference Resolution among the three NLP toolkits I am reasonably familiar with - Apache OpenNLP, LingPipe and Stanford CoreNLP.

Having recently experimented with Stanford CoreNLP, I was aware that it supported Coreference Resolution. I was also aware that OpenNLP offered limited support, based on this blog post by D P Dearing. I didn't know about LingPipe, but this post on the LingPipe blog also indicated some sort of support. So I decided to investigate and build implementations using each of the three toolkits - this post describes the results of that effort.

Coreference Resolution initially seemed (to me) to be something of a black art involving linguistics and regular expression hackery, but this quote from the LingPipe blog post (referenced above) gave me a rough mental model of how one can go about doing it. Hopefully it helps you too.

LingPipe's heuristic coreference package is based on Breck's thesis work (called CogNIAC). If you troll through the code or read Breck's paper, you'll see that it's essentially a greedy online algorithm that visits the entity mentions in a document in order, and for each mention either links it to a previous linked chain of mentions, or it starts a new chain consisting only of the current mention. Essentially, it's a greedy online clustering algorithm.

The resolution of a mention is guided by matchers and anti-matchers that score candidate antecedent mention chains based on properties such as the closest matching alias (using a complex comparison allowing for missing tokens), known alias associations, discourse proximity (how far away the last mention in a chain is and how many are intervening), entity type (person, location, and organization, and for persons, gender).
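
My (simplified) reading of that description is the loop sketched below. This is only an illustration of greedy online clustering over mentions, under my own assumptions: the Mention case class and the score function are stand-ins rather than LingPipe classes, and the real package uses much richer matchers and anti-matchers than a single threshold.

// Illustrative sketch of greedy online coreference clustering: visit mentions
// in document order, attach each one to the best-scoring existing chain, or
// start a new chain if nothing scores above a threshold.
case class Mention(text: String, begin: Int, end: Int)

def cluster(mentions: List[Mention],
            score: (List[Mention], Mention) => Double,
            threshold: Double): List[List[Mention]] =
  mentions.foldLeft(List.empty[List[Mention]]) { (chains, mention) =>
    if (chains.isEmpty) List(List(mention))
    else {
      // find the chain that best matches the current mention
      val (bestChain, bestScore) = chains.map(chain => (chain, score(chain, mention)))
                                         .maxBy(_._2)
      if (bestScore >= threshold)
        // link the mention to the best existing chain
        chains.map(chain => if (chain eq bestChain) chain :+ mention else chain)
      else
        // otherwise start a new chain with just this mention
        chains :+ List(mention)
    }
  }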

Anyway, on to my implementation. Like the Tokenizer and NameFinder, I built a CorefResolver trait that all my implementations must extend, and a factory from which I can get an implementation by name. I also implemented a case class to hold a coreference mention (the string itself, and its start and end character offsets in the input text).

// Source: src/main/scala/com/mycompany/scalcium/coref/CorefResolver.scala
package com.mycompany.scalcium.coref

object CorefResolver {
  def getResolver(name: String): CorefResolver = {
    name.toLowerCase() match {
      case "stanford" => new StanfordCorefResolver()
      case "lingpipe" => new LingPipeCorefResolver()
      case "opennlp" => new OpenNLPCorefResolver()
    }
  }
}

trait CorefResolver {
  def resolve(text: String): List[(CorefTriple,List[CorefTriple])]
}

case class CorefTriple(text: String, begin: Int, end: Int)

The Stanford CoreNLP toolkit has the best API support for Coreference Resolution among the three (in my opinion). The CoreNLP API implements a pipeline of named annotators, and getting the coreferences is simply a matter of reading the appropriate annotations the pipeline has placed on the text. This StackOverflow discussion provided me with most of the pointers for my implementation.

// Source: src/main/scala/com/mycompany/scalcium/coref/StanfordCorefResolver.scala
package com.mycompany.scalcium.coref

import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.util.IntPair
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation

class StanfordCorefResolver extends CorefResolver {

  val props = new Properties()
  props("annotators") = "tokenize, ssplit, pos, lemma, ner, parse, dcoref"
  val pipeline = new StanfordCoreNLP(props)
  
  override def resolve(text: String): List[(CorefTriple,List[CorefTriple])] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val sentences = doc.get(classOf[SentencesAnnotation])
      .map(coremap => coremap.get(classOf[TextAnnotation]))
      .toList
    val sentenceOffsets = buildSentenceOffsets(sentences)
    val graph = doc.get(classOf[CorefChainAnnotation])
    graph.values.map(chain => {
      val mention = chain.getRepresentativeMention()
      val ref = toTuple(mention, doc, sentenceOffsets)
      val comentions = chain.getMentionsInTextualOrder()
      val corefs = comentions.map(coref => toTuple(coref, doc, sentenceOffsets))
                             .filter(triple => ! ref.text.equals(triple.text))
                             .toList
      (ref, corefs)
    })
    .filter(tuple => tuple._2.size > 0)
    .toList
  }

  def max(a: Int, b: Int) = if (a > b) a else b
  def min(a: Int, b: Int) = if (a < b) a else b

  def toTuple(coref: CorefMention, doc: Annotation, soffsets: Map[Int,Int]): 
   CorefTriple = {
    val sbegin = soffsets(coref.sentNum - 1)
    val mtriple = doc.get(classOf[SentencesAnnotation])
      // get sentence being analyzed
      .get(coref.sentNum - 1)
      // get all tokens in sentence with character offsets
      .get(classOf[TokensAnnotation])
      .map(token => ((token.originalText().toString(), 
          sbegin + token.beginPosition(), sbegin + token.endPosition())))
      // sublist the coreference part
      .subList(coref.startIndex - 1, coref.endIndex - 1)
      // join adjacent tokens into a single mention triple
      .foldLeft(("", Int.MaxValue, 0))((a, b) => 
        (List(a._1, b._1).mkString(" "), min(a._2, b._2), max(a._3, b._3)))
    CorefTriple(mtriple._1.trim(), mtriple._2, mtriple._3)
  }
  
  def buildSentenceOffsets(sentences: List[String]): Map[Int,Int] = {
    val slengths = sentences
      .zipWithIndex
      .map(si => (si._2, si._1.length()))
      .sortWith(_._1 > _._1)
    val soffsets = ArrayBuffer[(Int,Int)]()
    for (sindex <- slengths.map(_._1)) {
      val offset = if (sindex == 0) ((sindex, 0))
      else {
        val rest = slengths.drop(slengths.size - sindex)
        val offset = rest.map(_._2).foldLeft(0)(_ + _)
        ((sindex, offset))
      }
      soffsets += offset
    }
    soffsets.toMap[Int,Int]
  }
}

The JUnit test for this runs through seven input sentence groups (some of them single sentences) and prints the coreferences in a human-readable format. Here is the code to test the Stanford CoreNLP based CorefResolver; the others are similar.

// Source: src/test/scala/com/mycompany/scalcium/coref/CorefResolverTest.scala
package com.mycompany.scalcium.coref

import org.junit.Test
import scala.io.Source
import java.io.File

class CorefResolverTest {

  val texts = List(
    "The atom is a basic unit of matter, it consists of a dense central " +
      "nucleus surrounded by a cloud of negatively charged electrons.",
    "The Revolutionary War occurred during the 1700s and it was the first " +
      "war in the United States.",
    "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.",
    "The project leader is refusing to help. The jerk thinks only of himself.",
    "A final resting place for another legend, Anna Pavlova, the Russian " +
      "ballerina who spent her final years in London, may be less than secure. " +
      "For 65 years, Ms. Pavlova's ashes have been in a white urn at Golder's " +
      "Green cemetery, where they are likely to remain according to a director " +
      "of the crematorium.",
    "Another icon of the '60s, Che Guevara, has been turned into a capitalist " +
      "tool 28 years after he was gunned down in Bolivia.",
    "I am Sam. Sam I am. I like green eggs and ham."
  )
      
  @Test
  def testStanfordCorefResolver(): Unit = {
    val scr = CorefResolver.getResolver("stanford")
    texts.foreach(text => {
      val x = scr.resolve(text)
      prettyPrint(text, x)
    })
  }
  
  def prettyPrint(text: String,
      result: List[(CorefTriple,List[CorefTriple])]): Unit = {
    Console.println(text)
    result.foreach(refcorefs => {
      val ref = refcorefs._1
      val corefs = refcorefs._2
      Console.println("(%d,%d): %s".format(ref.begin, ref.end, ref.text))
      corefs.foreach(coref => 
        Console.println("  (%d,%d): %s".format(coref.begin, coref.end, 
          coref.text)))
    })
    Console.println()
  }
}

The results from this implementation are quite good. I used a set of seven sentence groups (most of them single sentences) to test it out, and it identified coreferences that seemed valid, although it didn't get all of them. The downside is the tremendous amount of system resources it consumes, but that may not be a huge factor given adequate ability to parallelize operations. I have manually highlighted the mentions that it found in each input sentence group.

The atom is a basic unit of matter, it consists of a dense central nucleus 
surrounded by a cloud of negatively charged electrons.
(12,34): a basic unit of matter
  (0,8): The atom
  (36,38): it

The Revolutionary War occurred during the 1700s and it was the first war in 
the United States.
(0,21): The Revolutionary War
  (52,54): it
  (59,93): the first war in the United States

Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
(26,39): Elsevier N.V.
  (41,67): the Dutch publishing group
(0,10): Mr. Vinken
  (14,67): chairman of Elsevier N.V. , the Dutch publishing group

The project leader is refusing to help. The jerk thinks only of himself.
(79,87): The jerk
  (103,110): himself

A final resting place for another legend, Anna Pavlova, the Russian ballerina 
who spent her final years in London, may be less than secure. For 65 years, 
Ms. Pavlova's ashes have been in a white urn at Golder's Green cemetery, 
where they are likely to remain according to a director of the crematorium.
(293,306): Ms. Pavlova 's
  (42,54): Anna Pavlova
  (88,91): her

Another icon of the '60s, Che Guevara, has been turned into a capitalist tool 
28 years after he was gunned down in Bolivia.
(26,37): Che Guevara
  (93,95): he

I am Sam. Sam I am. I like green eggs and ham.
(19,24): Sam I
  (0,1): I
  (5,8): Sam
  (23,24): I
  (38,39): I

I wasn't able to make my OpenNLP implementation work, although I probably didn't try hard enough. While writing the code (heavily influenced by D P Dearing's blog post), I realized that support for Coreference Resolution in OpenNLP was a bit half-baked, and even if I got it working, it probably wouldn't be too useful for me.

While I was able to make the LingPipe implementation work using the CorefDemo.java in the source distribution as a guide, it seems to need larger chunks of input text to be able to find coreferences in them, and it misses quite a few. Here is the code:

// Source: src/main/scala/com/mycompany/scalcium/coref/LingPipeCorefResolver.scala
package com.mycompany.scalcium.coref

import java.io.File
import java.io.FileInputStream
import java.io.ObjectInputStream

import scala.collection.JavaConversions._
import scala.util.matching.Regex

import com.aliasi.chunk.Chunk
import com.aliasi.chunk.ChunkFactory
import com.aliasi.chunk.Chunker
import com.aliasi.coref.EnglishMentionFactory
import com.aliasi.coref.WithinDocCoref
import com.aliasi.sentences.MedlineSentenceModel
import com.aliasi.sentences.SentenceChunker
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
import com.aliasi.util.Streams
import com.aliasi.coref.Tags

class LingPipeCorefResolver extends CorefResolver {

  val ModelDir = "src/main/resources/lingpipe/models"
  val ChunkerModelFile = "ne-en-news-muc6.AbstractCharLmRescoringChunker"
  val MalePronouns = "(?i)\\b(he|him|his)\\b".r
  val FemalePronouns = "(?i)\\b(she|her|hers)\\b".r
  val NeuterPronouns = "(?i)\\b(it)\\b".r
    
  val tokenizerFactory = new IndoEuropeanTokenizerFactory()
  val sentenceModel = new MedlineSentenceModel()
  val sentenceChunker = new SentenceChunker(tokenizerFactory, sentenceModel)
  val entityChunker = readObject(new File(ModelDir, ChunkerModelFile))
    .asInstanceOf[Chunker]
    
  override def resolve(text: String): List[(CorefTriple,List[CorefTriple])] = {
    val mentionFactory = new EnglishMentionFactory()
    val coref = new WithinDocCoref(mentionFactory)
    val sentenceChunking = sentenceChunker.chunk(text.toCharArray, 0, text.length)
    val mentions = sentenceChunking.chunkSet()
      .zipWithIndex
      .map(chunkIndexPair => {
        val schunk = chunkIndexPair._1
        val sentence = text.substring(schunk.start, schunk.end)  
        // find entities in sentence
        val mentionChunking = entityChunker.chunk(sentence)
        val mentions = mentionChunking.chunkSet().toSet
        // add different types of pronoun entities
        val malePronouns = buildPronounMentions(
          MalePronouns, sentence, "MALE_PRONOUN", mentions)
        val femalePronouns = buildPronounMentions(
          FemalePronouns, sentence, "FEMALE_PRONOUN", mentions)
        val neuterPronouns = buildPronounMentions(
          NeuterPronouns, sentence, "NEUTER_PRONOUN", mentions)
        val allMentions = 
          mentions ++ malePronouns ++ femalePronouns ++ neuterPronouns
        // resolve coreferences
        allMentions.map(chunk => {
          val chstart = chunk.start
          val chend = chunk.end
          val chtext = sentence.substring(chstart, chend)
          val chtype = chunk.`type`
          val mention = mentionFactory.create(chtext, chtype)
          val mentionId = coref.resolveMention(mention, chunkIndexPair._2)
          (mentionId, (schunk.start + chstart, schunk.start + chend, chtext))
        })
      })
      .flatten
      .groupBy(pair => pair._1) // {mentionId => Set((mentionId, (chunk))
      .filter(kv => kv._2.size > 1) // filter out single mentions
      .map(kv => kv._2.map(x => CorefTriple(x._2._3, x._2._1, x._2._2)).toList)
      .toList                   // List[List[CorefTriple]]
    mentions.map(mention => {
      val head = mention.head
      val rest = mention.tail
      (head, rest)
    })
  }
  
  def readObject(f: File): Object = {
    val oistream = new ObjectInputStream(new FileInputStream(f))
    val obj = oistream.readObject
    Streams.closeQuietly(oistream)
    obj
  }
  
  def buildPronounMentions(regex: Regex, sentence: String, tag: String, 
      mentions: Set[Chunk]): Set[Chunk] =
    regex.findAllMatchIn(sentence)
      .map(m => ChunkFactory.createChunk(m.start, m.end, tag))
      .filter(pronoun => ! overlaps(mentions, pronoun))
      .toSet
  
  def overlaps(mentions: Set[Chunk], pronoun: Chunk): Boolean = {
    val pstart = pronoun.start
    val pend = pronoun.end
    mentions.filter(mention => {
      val maxStart = if (mention.start < pstart) pstart else mention.start
      val minEnd = if (mention.end < pend) mention.end else pend
      maxStart < minEnd
    })
    .size > 0
  }
}

In my test set, the only group in which it caught any coreferences was the Anna Pavlova one, where it reported fewer coreferences than the Stanford CoreNLP implementation. It is, however, orders of magnitude faster.

The atom is a basic unit of matter, it consists of a dense central nucleus 
surrounded by a cloud of negatively charged electrons.

The Revolutionary War occurred during the 1700s and it was the first war in 
the United States.

Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.

The project leader is refusing to help. The jerk thinks only of himself.

A final resting place for another legend, Anna Pavlova, the Russian ballerina 
who spent her final years in London, may be less than secure. For 65 years, 
Ms. Pavlova's ashes have been in a white urn at Golder's Green cemetery, 
where they are likely to remain according to a director of the crematorium.
(42,54): Anna Pavlova
  (88,91): her

Another icon of the '60s, Che Guevara, has been turned into a capitalist tool 
28 years after he was gunned down in Bolivia.

I am Sam. Sam I am. I like green eggs and ham.

That's all I have for today. I hope you found it useful. Overall, my initial reaction is that neither implementation (Stanford or LingPipe) would work out too well for me. But if the feature were really required, I would probably pay the performance price and go with Stanford CoreNLP.

Friday, September 05, 2014

Tokenizing and Named Entity Recognition with Stanford CoreNLP



I got into NLP using Java, but I was already using Python at the time, and soon came across the Natural Language Toolkit (NLTK) and just fell in love with the elegance of its API. So much so that when I started working with Scala, I figured it would be a good idea to build an NLP toolkit with an API similar to NLTK's, primarily as a way to learn NLP and Scala, but also to build something that would be as enjoyable to work with as NLTK and have the benefit of Java's rich ecosystem.

The project is perennially under construction, and serves as a test bed for my NLP experiments. In the past, I have used OpenNLP and LingPipe to build Tokenizer implementations that expose an API similar to NLTK's. More recently, I built a Named Entity Recognizer (NER) with OpenNLP's NameFinder. At the recommendation of one of my readers, I decided to take a look at Stanford CoreNLP, with which I ended up building a Tokenizer and an NER implementation. This post describes that work.

The code for the Tokenizer is shown below. The appropriate implementation can be invoked by calling Tokenizer.getTokenizer("stanford") using a factory pattern on the Tokenizer trait.

// Source: src/main/scala/com/mycompany/scalcium/tokenizers/StanfordTokenizer.scala
package com.mycompany.scalcium.tokenizers

import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.trees.Tree
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation

class StanfordTokenizer extends Tokenizer {

  val props = new Properties()
  props("annotators") = "tokenize, ssplit, pos, parse"
  val pipeline = new StanfordCoreNLP(props)

  override def sentTokenize(para: String): List[String] = {
    val doc = new Annotation(para)
    pipeline.annotate(doc)
    doc.get(classOf[SentencesAnnotation])
      .map(coremap => coremap.get(classOf[TextAnnotation]))
      .toList
  }
  
  override def wordTokenize(sentence: String): List[String] = {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TokensAnnotation])
      .map(corelabel => corelabel.get(classOf[TextAnnotation]))
      .toList
  }
  
  override def posTag(sentence: String): List[(String,String)]= {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TokensAnnotation])
      .map(corelabel => {
        val word = corelabel.get(classOf[TextAnnotation])
        val tag = corelabel.get(classOf[PartOfSpeechAnnotation])
        (word, tag)
      })
      .toList
  }
  
  override def phraseChunk(sentence: String): List[(String,String)] = {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    val tree = sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TreeAnnotation])
    val chunks = ArrayBuffer[(String,String)]()
    extractChunks(tree, chunks)
    chunks.toList
  }
  
  def extractChunks(tree: Tree, chunks: ArrayBuffer[(String,String)]): Unit = {
    tree.children().map(child => {
      val tag = child.value()
      if (child.isPhrasal() && hasOnlyLeaves(child)) {
        // concatenate words into phrase if the children of this
        // phrase are leaves (not phrases themselves)
        val phrase = child.getLeaves[Tree]()
          .flatMap(leaf => leaf.yieldWords())
          .map(word => word.word())
          .mkString(" ")
        chunks += ((phrase, tag))
      } else {
        // dig deeper
        extractChunks(child, chunks)
      }
    })
  }
  
  def hasOnlyLeaves(tree: Tree): Boolean = 
    tree.children().filter(child => child.isPhrasal()).size == 0
}

Most of the calls are straightforward. The only exception is the phraseChunk() method, which was originally built as a wrapper around OpenNLP's shallow phrase chunking. The Stanford parser only does deep parsing into a Tree, from which my code extracts a list of phrases and phrase types.

>>> text = """
Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group
based at Amsterdam. 
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields 
PLC, was named a nonexecutive director of this British industrial conglomerate.
"""
>>> tokenizer = Tokenizer.getTokenizer("stanford")
>>> sentences = tokenizer.sentTokenize(text)
List(Pierre Vinken, 61 years old, will join the board as a nonexecutive 
  director Nov. 29.,
  Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group
  based at Amsterdam.,
  Rudolph Agnew, 55 years old and former chairman of Consolidated Gold 
  Fields PLC, was named a nonexecutive director of this British industrial 
  conglomerate.)
>>> words = tokenizer.wordTokenize(sentences(0))
List(Pierre, Vinken, ,, 61, years, old, ,, will, join, the, board, as,
  a, nonexecutive, director, Nov., 29, .)
>>> postags = tokenizer.posTag(sentences(0))
List((Pierre,NNP), (Vinken,NNP), (,,,), (61,CD), (years,NNS), (old,JJ),
  (,,,), (will,MD), (join,VB), (the,DT), (board,NN), (as,IN), (a,DT),
  (nonexecutive,JJ), (director,NN), (Nov.,NNP), (29,CD), (.,.))
>>> phrases = tokenizer.phraseTokenize(sentences(0))
List(Pierre Vinken, 61 years, the board, a nonexecutive director, Nov. 29)
>>> chunks = tokenizer.phraseChunk(sentences(0))
List((Pierre Vinken,NP), (61 years,NP), (the board,NP), 
  (a nonexecutive director,NP), (Nov. 29,NP-TMP))
>>>

The Stanford CoreNLP based NER follows a similar approach to the Tokenizer, being instantiated by calling NameFinder.getNameFinder("stanford") using a factory pattern on the NameFinder trait. Here is the code.

// Source: src/main/scala/com/mycompany/scalcium/names/StanfordNameFinder.scala
package com.mycompany.scalcium.names

import java.io.File
import scala.collection.JavaConversions._
import com.mycompany.scalcium.tokenizers.Tokenizer
import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.NormalizedNamedEntityTagAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation

class StanfordNameFinder extends NameFinder {

  val props = new Properties()
  props("annotators") = "tokenize, ssplit, pos, lemma, ner"
  props("ssplit.isOneSentence") = "true"
  val pipeline = new StanfordCoreNLP(props)

  override def find(sentences: List[String]): List[List[(String,Int,Int)]] = {
    sentences.map(sentence => {
      val sent = new Annotation(sentence)
      pipeline.annotate(sent)
      sent.get(classOf[SentencesAnnotation])
        .head
        .get(classOf[TokensAnnotation])
        .map(corelabel => (corelabel.ner(), corelabel.beginPosition(), 
          corelabel.endPosition()))
        .filter(triple => (! "O".equals(triple._1)))
        .groupBy(triple => triple._1)
        .map(kv => {
          val key = kv._1
          val list = kv._2
          val begin = list.sortBy(x => x._2).head._2
          val end = list.sortBy(x => x._3).reverse.head._3
          (key, begin, end)
        })
        .toList
    })
    .toList
  }
}

The previous version of my Stanford-based NER used the Stanford NER library and the 4-class classifier model directly. This was definitely an improvement over the OpenNLP NameFinder as described here (scroll down to the end). The code above creates an NER that can recognize 7 classes and uses code very similar to the Tokenizer (although arguably I could have created a 7-class NER by using the appropriate classifier model directly, as sketched after the output below). Here is some output from the NER.

>>> namefinder = NameFinder.getNameFinder("stanford")
>>> entities = namefinder.find(sentences)
List(List((PERSON,0,13), (DURATION,15,27), (DATE,76,83)),
  List((PERSON,4,10), (MISC,45,50), (LOCATION,77,86), (ORGANIZATION,26,39)),
  List((PERSON,0,13), (MISC,111,118), (DURATION,16,28), (ORGANIZATION,52,80)))
>>> prettyPrint(sentences, entities)
Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29.
  (0,13): Pierre Vinken / PERSON
  (15,27): 61 years old / DURATION
  (76,83): Nov. 29 / DATE
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group based 
at Amsterdam.
  (4,10): Vinken / PERSON
  (45,50): Dutch / MISC
  (77,86): Amsterdam / LOCATION
  (26,39): Elsevier N.V. / ORGANIZATION
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields 
PLC, was named a director of this British industrial conglomerate.
  (0,13): Rudolph Agnew / PERSON
  (111,118): British / MISC
  (16,28): 55 years old / DURATION
  (52,80): Consolidated Gold Fields PLC / ORGANIZATION
>>>
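
As an aside, a direct 7-class NER (without the full CoreNLP pipeline) would look roughly like the sketch below, which uses the Stanford NER CRFClassifier. The classifier file name is an assumption based on the standard Stanford NER distribution; substitute the path of whatever model file you actually have.

// Sketch: calling the Stanford NER classifier directly with a 7-class model,
// bypassing the CoreNLP pipeline. The model path is an assumed example.
import scala.collection.JavaConversions._
import edu.stanford.nlp.ie.crf.CRFClassifier

val classifier = CRFClassifier.getClassifierNoExceptions(
  "/path/to/classifiers/english.muc.7class.distsim.crf.ser.gz")
val text = "Pierre Vinken, 61 years old, will join the board " +
  "as a nonexecutive director Nov. 29."
// classifyToCharacterOffsets returns (label, beginOffset, endOffset) triples
classifier.classifyToCharacterOffsets(text).foreach(triple =>
  Console.println("(%d,%d): %s / %s".format(triple.second, triple.third,
    text.substring(triple.second, triple.third), triple.first)))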

When I last looked at the Stanford parser, I found the API somewhat hard to understand. The CoreNLP API is much simpler and seems more unified, possibly at the cost of some compile-time type checking.

Overall, I was quite impressed by Stanford CoreNLP's accuracy. However, performance-wise, Stanford CoreNLP seems to be uniformly slower than either OpenNLP or LingPipe, although not by much (based on my limited set of examples). In all fairness, though, Stanford CoreNLP is designed to work in batch mode, where you run the pipeline over the text once and then walk through the annotations generated by the annotators specified in the Properties object passed to the constructor.