The small script I describe here came about as a result of a suggestion from a colleague. Some time ago I had built a Lucene analyzer that converted British spellings to American spellings (for example, "colour" to "color"), based on a set of prefix, suffix and infix regular expressions. Unfortunately, the same regex that converts "colour" correctly also converts "four" to "for". Since our search is backed by a taxonomy, we can treat the synonyms defined in it as a controlled vocabulary. My colleague therefore suggested running the transformer against all the words in the (possibly multi-word) synonyms, and then, for each word matched by a regex, checking against a dictionary that the original and transformed words mean the same thing.
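To make the problem concrete, here is a minimal sketch of how such a transformer might produce candidate word pairs from multi-word synonyms. The rules in RULES, the function names to_american() and candidate_pairs(), and the sample synonyms are all assumptions made up for illustration; the analyzer's actual regular expressions are not shown in this post.

```python
import re

# Hypothetical rewrite rules, assumed for illustration only; the real
# analyzer's prefix/suffix/infix regexes are not reproduced here.
RULES = [
    (re.compile(r"our$"), "or"),        # colour -> color, but also four -> for
    (re.compile(r"ourite$"), "orite"),  # favourite -> favorite
]

def to_american(word):
    """Apply the first matching rule; return the word unchanged otherwise."""
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

def candidate_pairs(synonyms):
    """Return sorted (original, converted) pairs for words that a rule changed."""
    pairs = set()
    for synonym in synonyms:
        # Synonyms may be multi-word, so check each word separately.
        for word in synonym.lower().split():
            converted = to_american(word)
            if converted != word:
                pairs.add((word, converted))
    return sorted(pairs)

if __name__ == "__main__":
    # Made-up synonyms standing in for taxonomy entries.
    for original, converted in candidate_pairs(["four colour print", "favourite colour"]):
        print(original + "\t" + converted)
```

Each printed line is a tab-separated pair in which the rule fired, which is exactly the kind of pair that needs a dictionary check before the conversion can be trusted.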
When I built the Lucene analyzer, I was not as handy with NLTK as I am now, so this time around, I almost immediately thought about NLTK's WordNet interface. The idea is to pass the two words to WordNet. Each word can result in one or more synsets, depending on its part of speech (POS). We can conclude that they are the same word if at least one pair of synsets has a path_similarity of 1 (the path_similarity varies from 0 to 1, with 1 meaning identical). Here is the code:
    from __future__ import division
    from nltk.corpus import wordnet as wn
    import sys

    def similarity(w1, w2, sim=wn.path_similarity):
        # Look up every synset (across all parts of speech) for each word.
        synsets1 = wn.synsets(w1)
        synsets2 = wn.synsets(w2)
        sim_scores = []
        for synset1 in synsets1:
            for synset2 in synsets2:
                score = sim(synset1, synset2)
                # path_similarity returns None when no path connects the synsets.
                if score is not None:
                    sim_scores.append(score)
        if len(sim_scores) == 0:
            return 0
        else:
            return max(sim_scores)

    def main():
        # Input file: one tab-separated (original, converted) pair per line.
        f = open(sys.argv[1], 'r')
        for line in f:
            (word1, word2) = line.strip().split("\t")
            if similarity(word1, word2) != 1.0:
                print(word1)
        f.close()

    if __name__ == "__main__":
        main()
The similarity() function returns the maximum similarity over all pairs of synsets for the two words. The main() function just reads a file of tab-separated word pairs and prints the original words whose converted form is not an equivalent word. For example, the following input file:
    favour	favor
    favourite	favorite
    four	for
    colour	color
results in "four" being printed to the console, indicating that "four" should be treated as an exclusion for the analyzer.
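One way the printed words might be fed back into the analyzer is as an exclusion set. The sketch below is only an illustration: collect_exclusions() and stub_is_equivalent() are hypothetical names, and the stub simply hard-codes the outcome reported above so the example runs without WordNet installed.

```python
def collect_exclusions(pairs, is_equivalent):
    """Return the original words whose converted form is not an equivalent word."""
    return {original for original, converted in pairs
            if not is_equivalent(original, converted)}

# Stand-in for the WordNet check. It hard-codes the result described in
# the post and makes no dictionary lookup of its own.
KNOWN_EQUIVALENTS = {
    ("favour", "favor"),
    ("favourite", "favorite"),
    ("colour", "color"),
}

def stub_is_equivalent(w1, w2):
    return (w1, w2) in KNOWN_EQUIVALENTS

pairs = [
    ("favour", "favor"),
    ("favourite", "favorite"),
    ("four", "for"),
    ("colour", "color"),
]
exclusions = collect_exclusions(pairs, stub_is_equivalent)
print(sorted(exclusions))  # ['four']
```

With NLTK and the WordNet corpus available, the stub could be swapped for a small wrapper such as lambda a, b: similarity(a, b) == 1.0, reusing the similarity() function from the script.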
I was also using the analyzer to normalize some Greek and Latin plurals to their corresponding singulars. These words are quite common in (English) medical text and normal stemming does not handle them correctly, so I used a set of (suffix) rules to do the conversion in the same analyzer. As with the British to American conversion, there are exceptions here too, and they can be handled using the same code. It turns out that the plural and singular words map to the same node in WordNet, so the task is even simpler. In any case, an input file like this:
    humeri	humerus
    femora	femur
    insomnia	insomnium
    media	medium
results in "insomnia" being printed on the console, once more indicating that "insomnia" should be treated as an exclusion.
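For illustration, suffix rules of the following shape would produce exactly the pairs in the input file above, including the bad "insomnium". This is a sketch under stated assumptions: PLURAL_RULES and singularize() are hypothetical and inferred from the example pairs, not taken from the actual analyzer.

```python
import re

# Hypothetical suffix rules, most specific first; inferred from the
# example pairs above and assumed for illustration only.
PLURAL_RULES = [
    (re.compile(r"ora$"), "ur"),  # femora -> femur
    (re.compile(r"i$"), "us"),    # humeri -> humerus
    (re.compile(r"a$"), "um"),    # media -> medium, but also insomnia -> insomnium
]

def singularize(word, exclusions=frozenset()):
    """Apply the first matching suffix rule unless the word is excluded."""
    if word in exclusions:
        return word
    for pattern, replacement in PLURAL_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

if __name__ == "__main__":
    for word in ["humeri", "femora", "insomnia", "media"]:
        print(word + "\t" + singularize(word, exclusions={"insomnia"}))
```

Passing the flagged words in as exclusions leaves "insomnia" untouched while the other plurals are still converted.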
And that's all I have for today. It just goes to show how simple some problems can become when you have the right tools.