Saturday, July 26, 2014

Handwritten Digit Recognition with PyBrain


I am taking Andrew Ng's ML class on Coursera again. The last time I took it was approximately 3 years ago, when I was just starting out learning about Machine Learning. This time round, I am not submitting any of the programming assignments because I am doing them in Python rather than in Octave.

Last week's (and this week's) topic was Neural Networks. Instead of building a Neural Network from first principles as the Programming Assignment requires, I decided to use this opportunity to explore PyBrain, a Python machine learning library for building Neural Networks.

The task is to classify images of handwritten digits into the numbers 0-9. The data is a subset of the MNIST Database. It consists of 5,000 images of single handwritten digits, each 20x20 pixels flattened into a 1x400 array of grayscale values 0-127, along with the actual value of the digit. The data is provided as a MATLAB .mat file for the assignment. Here is a sample of the data (visualization code is included below).


To do the classification, I used PyBrain to build a 3-layer FeedForward Neural Network. The input layer has 400 units, each corresponding to a single feature. The output layer has 10 units, each corresponding to one of the possible numeric values. The hidden layer has 25 units, following the guidelines in the programming assignment. I then split the data into 75/25 training/test sets, used a Backpropagation trainer to train the network on the training set, and computed accuracy on the test set.

PyBrain has its own routines for splitting a dataset into training and test sets, computing accuracy, etc., but since I am more familiar with the utility classes in Scikit-Learn, I used these instead where possible. Here is the code - it's heavily documented, so a narrative is probably unnecessary.

# Source: src/digit_recognition/neural_network.py
from __future__ import division

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math

from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer

def load_dataset(dataset, X, y):
    enc = OneHotEncoder(n_values=10)
    yenc = enc.fit_transform(np.matrix(y)).todense()
    for i in range(y.shape[0]):
        dataset.addSample(X[i, :], yenc[i][0])

NUM_EPOCHS = 50
NUM_HIDDEN_UNITS = 25

print "Loading MATLAB data..."    
data = scipy.io.loadmat("../../data/digit_recognition/ex3data1.mat")
X = data["X"]
y = data["y"]
y[y == 10] = 0 # '0' is encoded as '10' in data, fix it
n_features = X.shape[1]
n_classes = len(np.unique(y))

# visualize data
# get 100 rows of the input at random
print "Visualize data..."
idxs = np.random.randint(X.shape[0], size=100)
fig, ax = plt.subplots(10, 10)
img_size = int(math.sqrt(n_features))
for i in range(10):
    for j in range(10):
        Xi = X[idxs[i * 10 + j], :].reshape(img_size, img_size).T
        ax[i, j].set_axis_off()
        ax[i, j].imshow(Xi, aspect="auto", cmap="gray")
plt.show()

# split up training data for cross validation
print "Split data into training and test sets..."
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, 
                                                random_state=42)
ds_train = ClassificationDataSet(X.shape[1], 10)
load_dataset(ds_train, Xtrain, ytrain)

# build a 400 x 25 x 10 Neural Network
print "Building %d x %d x %d neural network..." % (n_features, 
                                                   NUM_HIDDEN_UNITS, n_classes)
fnn = buildNetwork(n_features, NUM_HIDDEN_UNITS, n_classes, bias=True, 
                   outclass=SoftmaxLayer)
print fnn

# train network
print "Training network..."
trainer = BackpropTrainer(fnn, ds_train)
for i in range(NUM_EPOCHS):
    error = trainer.train()
    print "Epoch: %d, Error: %7.4f" % (i, error)
    
# predict using test data
print "Making predictions..."
ypreds = []
ytrues = []
for i in range(Xtest.shape[0]):
    pred = fnn.activate(Xtest[i, :])
    ypreds.append(pred.argmax())
    ytrues.append(ytest[i])
print "Accuracy on test set: %7.4f" % accuracy_score(ytrues, ypreds, 
                                                     normalize=True)

The highest test accuracy, approximately 91.5%, was achieved with a neural network trained for 50 epochs. The corresponding accuracy on the training set was 99.8% (error of 0.0023). The learning curve for the neural network is shown below.
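
The plot itself is easy to generate; here is a minimal sketch (a variant of the training loop above, not part of the listing) that collects the per-epoch error returned by trainer.train() and plots it with matplotlib:

# Sketch: plot training error against epoch number.
# Assumes `trainer` and NUM_EPOCHS are defined as in the code above.
import matplotlib.pyplot as plt

errors = []
for i in range(NUM_EPOCHS):
    errors.append(trainer.train())

plt.plot(range(NUM_EPOCHS), errors)
plt.xlabel("Epoch")
plt.ylabel("Training error")
plt.title("Learning curve")
plt.show()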


The corresponding output of the code above (truncated for readability) is shown below:

Loading MATLAB data...
Visualize data...
Split data into training and test sets...
Building 400 x 25 x 10 neural network...
FeedForwardNetwork-8
   Modules:
    [<BiasUnit 'bias'>, <LinearLayer 'in'>, <SigmoidLayer 'hidden0'>, 
     <SoftmaxLayer 'out'>]
   Connections:
    [<FullConnection 'FullConnection-4': 'in' -> 'hidden0'>, 
     <FullConnection 'FullConnection-5': 'bias' -> 'out'>, 
     <FullConnection 'FullConnection-6': 'bias' -> 'hidden0'>, 
     <FullConnection 'FullConnection-7': 'hidden0' -> 'out'>]

Training network...
Epoch: 0, Error:  0.0394
Epoch: 1, Error:  0.0241
Epoch: 2, Error:  0.0191
Epoch: 3, Error:  0.0163
Epoch: 4, Error:  0.0143
Epoch: 5, Error:  0.0129
...
Epoch: 45, Error:  0.0025
Epoch: 46, Error:  0.0025
Epoch: 47, Error:  0.0024
Epoch: 48, Error:  0.0024
Epoch: 49, Error:  0.0023
Making predictions...
Accuracy on test set:  0.9148

Kaggle has a Digit Recognizer competition (for knowledge) which offers a larger dataset of 42,000 training rows and 28,000 unlabelled rows. The digits in this dataset are represented as 28x28 pixel images (flattened to 1x784 arrays of numbers in the range 0-128). I ran the code above on this dataset (with some obvious changes: reading CSV files instead of MATLAB files, additional code to predict values for the submission set, etc.). With 250 epochs and 100 hidden units, the accuracy on the held-out data was 69.86%, and the accuracy on the submission set was only slightly higher at 69.87%.
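
The change to the data loading is small; a sketch of the kind of change I mean is shown below (the file names are hypothetical; the Kaggle CSVs have a header row, with the training file containing a label column followed by 784 pixel columns):

# Sketch: load the Kaggle digit data from CSV instead of a MATLAB file.
# File paths are hypothetical; train.csv has a label column followed by
# 784 pixel columns, test.csv (the submission set) has only pixel columns.
import numpy as np

train = np.loadtxt("../../data/digit_recognition/train.csv",
                   delimiter=",", skiprows=1)
y = train[:, 0].astype("int")   # digit labels
X = train[:, 1:]                # 784 pixel values per row
Xsubmit = np.loadtxt("../../data/digit_recognition/test.csv",
                     delimiter=",", skiprows=1)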

Friday, July 18, 2014

Clustering Medical Procedure Codes with Scalding


A colleague pointed out that there is an inverse use-case to finding outliers in medical claims: grouping procedure codes into clusters, or what Health Insurance companies call Episode Treatment Groups (ETGs). Essentially, an ETG is a way to cluster a group of services (procedures) into a medically relevant unit.

The CMS.gov dataset provides slightly under 16 million anonymized outpatient claims for Medicare/Medicaid patients. Each outpatient record can have up to 6 ICD-9 procedure codes, up to 10 ICD-9 diagnosis codes and up to 45 HCPCS codes. So, just as in the outlier case, we can derive a measure of similarity between a pair of codes from how often they co-occur within claims across the dataset.

I decided to use a variant of the DBSCAN clustering algorithm. This post provides some tips on how to implement DBSCAN in a distributed manner, and I used its ideas to develop my implementation. The intuition behind my clustering algorithm goes something like this.

We calculate the similarity sAB between a pair of codes A and B as the number of times they co-occur in the corpus. Clustering algorithms need a distance measure, so we treat the distance dAB as the reciprocal of their similarity, i.e. 1/sAB. The DBSCAN clustering algorithm works by selecting, for each point, the other points that lie within a specified distance ε of it. Candidate cluster centroids are those codes that have at least MinPoints codes within this distance ε. My algorithm deviates from DBSCAN at this point: instead of finding density-reachable codes, I just find the Top-N densest clusters, where density is the number of codes within ε over the circular area of the mean radius, i.e. N² / (π · Σᵢ dᵢ²). These top N densest code clusters are our derived ETGs.
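
To make the intuition concrete, here is a small single-machine Python sketch of the same idea on toy data (illustration only, not the distributed implementation):

# Illustration only: similarity = co-occurrence count, distance = 1/similarity,
# density of a candidate code = N^2 / sum of squared distances to its
# N neighbors within Epsilon (constants like pi dropped).
from itertools import combinations
from collections import defaultdict

claims = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"], ["A", "B"]]
EPSILON = 0.75
MIN_POINTS = 1

sims = defaultdict(int)
for codes in claims:
    for a, b in combinations(sorted(set(codes)), 2):
        sims[(a, b)] += 1

neighbors = defaultdict(list)
for (a, b), sim in sims.items():
    dist = 1.0 / sim
    if dist < EPSILON:
        neighbors[a].append(dist)
        neighbors[b].append(dist)

densities = dict(
    (code, float(len(dists) ** 2) / sum(d * d for d in dists))
    for code, dists in neighbors.items()
    if len(dists) >= MIN_POINTS)

for code, density in sorted(densities.items(), key=lambda x: -x[1]):
    print code, density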

The Scalding code below does just this. We simplify a bit by dropping constants such as π, but otherwise the code is quite faithful to the algorithm described above.

// Source: src/main/scala/com/mycompany/cmspp/cluster/CodeCluster.scala
package com.mycompany.cmspp.clusters

import com.twitter.scalding.Job
import com.twitter.scalding.Args
import com.twitter.scalding.TextLine
import com.twitter.scalding.Tsv
import scala.io.Source

class CodeCluster(args: Args) extends Job(args) {

  def extractPairs(line: String): List[(String,String)] = {
    val cols = line.split(",").toList
    val codes = (cols.slice(22, 27)  // ICD9 procedure code cols
      .map(x => if (x.isEmpty) x else "ICD9:" + x) 
      ::: cols.slice(31, 75)         // HCPCS (CPT4) procedure cols
      .map(x => if (x.isEmpty) x else "HCPCS:" + x))          
      .filter(x => (! x.isEmpty))
    val cjoin = for {codeA <- codes; codeB <- codes} yield (codeA, codeB)
    cjoin.filter(x => x._1 < x._2)
  }

  val Epsilon = args("epsilon").toDouble
  val MinPoints = args("minpoints").toInt
  val NumClusters = args("nclusters").toInt
  
  val output = Tsv(args("output"))

  val dists = TextLine(args("input"))
    .read
    // compute pair-wise distances between procedure codes
    .flatMapTo('line -> ('codeA, 'codeB)) { line: String => extractPairs(line) }
    .groupBy('codeA, 'codeB) { group => group.size('sim) }
    .map('sim -> 'radius) { x: Int => (1.0D / x) }
    .discard('sim)
    // group by codeA and retain only records which are within epsilon distance
    .groupBy('codeA) { group => group.sortBy('radius).reverse }
    .filter('radius) { x: Double => x < Epsilon }
    
  val codeCounts = dists
    .groupBy('codeA) { group => 
      group.sizeAveStdev('radius -> ('count, 'mean, 'std)) 
    }
    // only retain codes that have at least MinPoints points within Epsilon
    .filter('count) { x: Int => x > MinPoints }
    .discard('std)

  val densities = dists.joinWithSmaller(('codeA -> 'codeA), codeCounts)
    .map(('mean, 'count) -> 'density) { x: (Double,Int) => 
      1.0D * Math.pow(x._2, 2) / Math.pow(x._1, 2)
    }
    .discard('radius, 'count)
    
  // sort the result by density descending and find the top N clusters
  val densestCodes = densities.groupAll { group => 
      group.sortBy('density).reverse }
    .unique('codeA)
    .limit(NumClusters)
    
  // join code densities with densest codes to find final clusters
  densities.joinWithTiny(('codeA -> 'codeA), densestCodes)
    .groupBy('codeA) { group => group.mkString('codeB, ",")}
    .write(output)
}

object CodeCluster {
  def main(args: Array[String]): Unit = {
    // run the clustering job locally and dump the resulting clusters
    new CodeCluster(Args(List(
      "--local", "",
      "--epsilon", "0.3",
      "--minpoints", "10",
      "--nclusters", "10",
      "--input", "data/outpatient_claims.csv",
      "--output", "data/clusters.csv"
    ))).run
    Source.fromFile("data/clusters.csv")
      .getLines()
      .foreach(Console.println(_))
  }
}

I ran this locally with 1 million claims (out of the 16 million claims in my dataset) and got results like this:

HCPCS:00142    HCPCS:85025,  HCPCS:36415, HCPCS:93005, HCPCS:80053, ...
HCPCS:00300    HCPCS:85025,  HCPCS:80053, HCPCS:36415, HCPCS:99284, ...
HCPCS:00400    HCPCS:85025,  HCPCS:93005, HCPCS:80053, HCPCS:36415, ...
HCPCS:00532    HCPCS:85025,  HCPCS:36415, HCPCS:80048, HCPCS:G0378, ...
HCPCS:0073T    HCPCS:77417,  HCPCS:77336, HCPCS:77427, HCPCS:77280, ...
HCPCS:00740    HCPCS:85025,  HCPCS:36415, HCPCS:93005, HCPCS:85610, ...
HCPCS:00750    HCPCS:36415,  HCPCS:85025, HCPCS:85610, HCPCS:J3010, ...
HCPCS:00790    HCPCS:36415,  HCPCS:85025, HCPCS:80048, HCPCS:80053, ...
HCPCS:00810    HCPCS:36415,  HCPCS:85025, HCPCS:93005, HCPCS:80053, ...
HCPCS:00830    HCPCS:36415,  HCPCS:85025, HCPCS:80048, HCPCS:93005, ...

[Edit: 07/22/2014: This approach does not really produce clusters. Notice in the data above that the same code HCPCS:85025 is part of the first 3 clusters, which is obviously not wanted. I will implement the last part of the DBSCAN algorithm and update this page when I am done.]

And that's all I have for today. I'd like to point out a new book on Scalding, Programming MapReduce with Scalding by Antonios Chalkiopoulos. I was quite impressed by this book; you can read my review on Amazon if you are interested.

Thursday, July 03, 2014

A uimaScala Annotator for Named Entity Recognition


My last post was a little over a month ago, a record for me - I generally try to post every week or at least every other week. The reason for the delay is that I got stuck on an idea which turned out to be not very workable. The problem with these situations is that it kind of eats at me until I am able to resolve it, or realize it's completely unworkable and abandon it. I haven't completely given up hope on the idea yet, but I couldn't think of any ways to solve it either, so I decided to put it aside and catch up on my reading[1] instead.

In the meantime, at work we have started using UIMAFit for a new NLP pipeline we are building. I had experimented with UIMA in the past, but gave up because its heavy dependence on XML became a pain after a while. UIMAFit does not completely get rid of XML (you still need to define the types in XML and generate the code using JCasGen), but the Analysis Engines don't need to be described in XML anymore.

Generally, I try to experiment with tools before proposing them at work, and since I do all my (JVM based) personal projects with Scala nowadays, I initially thought of using UIMAFit with Scala. However, using UIMAFit would make my (personal) project a mixture of Java and Scala (JCasGen would generate Java classes for the XML types), something I wanted to avoid if possible. Luckily, I came across the uimaScala project, which provides a Scala interface to UIMAFit and, as an added bonus, eliminates XML altogether (it uses a Scala DSL instead to specify the types).

Unfortunately, the project had been written using Scala 2.9 and built with SBT 0.12, while I was using Scala 2.10 and SBT 0.13. My attempts to just use the project based on the instructions in its README.md failed, as did my attempts to build it locally. So I contacted the author, who was kind enough to make the necessary changes so it worked with Scala 2.11. I am currently using Scala 2.11 for this project; there are still quite a few Scala 2.10 based projects like Spark and Scalding that I use, so I can't do a wholesale upgrade. This post describes an annotator built using uimaScala that marks up a text with PERSON and ORGANIZATION tags using OpenNLP's Named Entity Recognizer.

[Edit (2014-07-07): the uimaScala project now also offers a JAR built with Scala 2.10. I was able to compile and run my project by updating my scalaVersion to 2.10.2 and removing the dependency on scala-xml (split out into its own library in 2.11) from my build.sbt file.]

First, the Name Finder. My pipeline doesn't actually need a NER that recognizes PERSON and ORGANIZATION, but I've been meaning to figure out how to do this with OpenNLP for a while, so I built it anyway. Here's the code:

// Source: src/main/scala/com/mycompany/scalcium/utils/NameFinder.scala
package com.mycompany.scalcium.utils

import java.io.File
import java.io.FileInputStream

import org.apache.commons.io.IOUtils

import opennlp.tools.namefind.NameFinderME
import opennlp.tools.namefind.TokenNameFinderModel
import opennlp.tools.util.Span

class NameFinder {

  val ModelDir = "src/main/resources/opennlp/models"
  
  val tokenizer = Tokenizer.getTokenizer("opennlp")
  val personME = buildME("en_ner_person.bin")
  val orgME = buildME("en_ner_organization.bin")
  
  def find(finder: NameFinderME, doc: List[String]): 
      List[List[(String,Int,Int)]] = {
    try {
      doc.map(sent => find(finder, sent))
    } finally {
      clear(finder)
    }
  }
  
  def find(finder: NameFinderME, sent: String): 
    List[(String,Int,Int)] = {
    val words = tokenizer.wordTokenize(sent)
                         .toArray
    finder.find(words).map(span => {
      val start = span.getStart()
      val end = span.getEnd()
      val text = words.slice(start, end).mkString(" ")
      (text, start, end)
    }).toList
  }
  
  def clear(finder: NameFinderME): Unit = finder.clearAdaptiveData()
  
  def buildME(model: String): NameFinderME = {
    var pfin: FileInputStream = null
    try {
      pfin = new FileInputStream(new File(ModelDir, model))
      new NameFinderME(new TokenNameFinderModel(pfin))
    } finally {
      IOUtils.closeQuietly(pfin)
    }
  }
}

The Annotator uses the NameFinder and a previously written Tokenizer (which I haven't shown here; it's a thin wrapper on top of OpenNLP's tokenizers) that provides methods that work like NLTK's text tokenizer methods. Note that this is generally not the way I would structure my annotator - I would prefer to have a pipeline with a Sentence tokenizer ahead of this and make the NameFinderAnnotator work on sentences instead - but in the interests of time and space I decided to make it accept the full text and tokenize it inside the process method.

// Source: src/main/scala/com/mycompany/scalcium/pipeline/NameFinderAnnotator.scala
package com.mycompany.scalcium.pipeline

import org.apache.uima.jcas.JCas
import com.github.jenshaase.uimascala.core.SCasAnnotator_ImplBase
import com.mycompany.scalcium.utils.NameFinder
import com.mycompany.scalcium.utils.Tokenizer

class NameFinderAnnotator extends SCasAnnotator_ImplBase {

  val tokenizer = Tokenizer.getTokenizer("opennlp")
  val namefinder = new NameFinder()
  
  override def process(jcas: JCas): Unit = {
    val text = jcas.getDocumentText()
    val sentences = tokenizer.sentTokenize(text)
    val soffsets = sentences.map(sentence => sentence.length())
                            .scanLeft(0)(_ + _)
    // people annotations
    val allPersons = namefinder.find(namefinder.personME, sentences)
    applyAnnotations(jcas, allPersons, sentences, soffsets, "PER")
    // organization annotations
    val allOrgs = namefinder.find(namefinder.orgME, sentences)
    applyAnnotations(jcas, allOrgs, sentences, soffsets, "ORG")
  }
  
  def applyAnnotations(jcas: JCas, 
      allEnts: List[List[(String,Int,Int)]], sentences: List[String], 
      soffsets: List[Int], tag: String): Unit = {
    var sindex = 0
    allEnts.map(ents => { // all entities in each sentence
      ents.map(ent => {   // entity
        val coffset = charOffset(soffsets(sindex) + sindex,
          sentences(sindex), ent)
        val entity = new Entity(jcas, coffset._1, coffset._2)
        entity.setEntityType(tag)
        entity.addToIndexes()
      })
      sindex += 1
    })
  }

  def charOffset(soffset: Int, sentence: String, ent: (String,Int,Int)): 
      (Int,Int) = {
    val estring = tokenizer.wordTokenize(sentence)
      .slice(ent._2, ent._3)
      .mkString(" ")
    val cbegin = soffset + sentence.indexOf(estring)
    val cend = cbegin + estring.length()
    (cbegin, cend)
  }
}

The Entity annotation is described using the following Scala DSL. It defines an annotation that has the standard fields (begin, end) and an additional property, entityType. Unfortunately, my Scala-IDE (customized Eclipse) is not able to recognize it as valid Scala. However, it all compiles and runs fine from SBT on the command line. Very likely I have to let Scala-IDE know about the paradise compiler plugin (see the uimaScala README.md for setting up the compiler plugin in your build.sbt). But hey, it's better than having to write the types in XML!

// Source: src/main/scala/com/mycompany/scalcium/pipeline/TypeSystem.scala
package com.mycompany.scalcium.pipeline

import com.github.jenshaase.uimascala.core.description._
import org.apache.uima.jcas.tcas.Annotation
import org.apache.uima.cas.Feature

@TypeSystemDescription
object TypeSystem {

  val Entity = Annotation {
    val entityType = Feature[String]
  }
}

The uimaScala README recommends using its scalaz-stream based DSL to construct and execute pipelines. I haven't tried that yet; my JUnit test is based on patterns similar to the Java JUnit tests for my UIMAFit based pipeline at work. The JUnit test below takes a block of text and outputs the Entity annotations produced by the NameFinderAnnotator.

// Source: src/test/scala/com/mycompany/scalcium/pipeline/NameFinderAnnotatorTest.scala
package com.mycompany.scalcium.pipeline

import org.junit.Test
import org.apache.uima.fit.factory.AnalysisEngineFactory
import org.apache.uima.fit.util.JCasUtil
import scala.collection.JavaConversions._

class NameFinderAnnotatorTest {

  val text = """
    Pierre Vinken , 61 years old , will join the board as a nonexecutive 
    director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. , the Dutch 
    publishing group . Rudolph Agnew , 55 years old and former chairman of 
    Consolidated Gold Fields PLC , was named a director of this British 
    industrial conglomerate ."""

  @Test
  def testPipeline(): Unit = {
    val ae = AnalysisEngineFactory.createEngine(classOf[NameFinderAnnotator])
    val jcas = ae.newJCas()
    jcas.setDocumentText(text)
    ae.process(jcas)
    JCasUtil.select(jcas, classOf[Entity]).foreach(entity => {
      Console.println("(%d, %d): %s/%s".format(
        entity.getBegin(), entity.getEnd(),
        text.substring(entity.getBegin(), entity.getEnd()),
        entity.getEntityType()))
    })
  }
}

The output of this test is shown below. It seems to have missed "Mr. Vinken" and "Elsevier N.V." as PERSON and ORGANIZATION respectively, but this seems to be a limitation of the OpenNLP NameFinder (or maybe not even a limitation; it's a model-based recognizer after all, and its output depends on what it was trained with).

(0, 13): Pierre Vinken/PER
(159, 172): Rudolph Agnew/PER
(211, 239): Consolidated Gold Fields PLC/ORG

And that's all I have for today. Hopefully it was worth the wait :-).

[1]: In case you are curious about what I read while I was not posting articles, here is the list of books I read over the last month. The last one was specifically so I could learn how to make the uimaScala code compile under Scala 2.10, but it turned out to be unnecessary; many thanks to Jens Haase (author of uimaScala) for that.


Update (2014-09-02): I recently tried the Stanford NER because I heard good things about it, and I am happy to say it vastly outperforms OpenNLP in terms of tagging quality, at the expense of a very slight increase in processing time (3755ms for Stanford NER vs 3746ms for OpenNLP on my 3-sentence test above). OpenNLP has pre-trained models for PERSON and ORGANIZATION entity detection, while Stanford NER can recognize PERSON, LOCATION, ORGANIZATION and MISC. The results from OpenNLP and Stanford for my 3 sentences are shown below for comparison.

==== OpenNLP ====
Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29.
  (0,13): Pierre Vinken / PERSON
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group based 
at Amsterdam.
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields 
PLC, was named a director of this British industrial conglomerate.
  (0,13): Rudolph Agnew / PERSON
  (52,80): Consolidated Gold Fields PLC / ORGANIZATION

==== Stanford ====
Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29.
  (0,13): Pierre Vinken / PERSON
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group based 
at Amsterdam.
  (0,10): Mr. Vinken / PERSON
  (26,39): Elsevier N.V. / ORGANIZATION
  (45,50): Dutch / MISC
  (77,86): Amsterdam / LOCATION
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields 
PLC, was named a director of this British industrial conglomerate.
  (0,13): Rudolph Agnew / PERSON
  (52,80): Consolidated Gold Fields PLC / ORGANIZATION
  (111,118): British / MISC

My code to call the Stanford NER and extract entities from it is shown below. It takes in a sentence and returns a List of triples containing the entity tag and the start and end character offsets.

package com.mycompany.scalcium.names

import java.io.File

import scala.collection.JavaConversions._

import com.mycompany.scalcium.tokenizers.Tokenizer

import edu.stanford.nlp.ie.AbstractSequenceClassifier
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreLabel

class StanfordNameFinder extends NameFinder {

  val ModelDir = "src/main/resources/stanford"

  val tokenizer = Tokenizer.getTokenizer("opennlp")
  val classifier = buildClassifier(
    "english.conll.4class.distsim.crf.ser.gz")
  
  override def find(sentences: List[String]): 
      List[List[(String,Int,Int)]] = {
    sentences.map(sentence => 
      classifier.classifyToCharacterOffsets(sentence)
        .map(triple => (triple.first, 
          triple.second.toInt, triple.third.toInt))
        .toList)
  }
  
  def buildClassifier(model: String): 
      AbstractSequenceClassifier[CoreLabel] = {
    val modelfile = new File(ModelDir, model)
    CRFClassifier.getClassifier(modelfile)
  }
}