Saturday, July 30, 2011

Lucene: A Token Concatenating TokenFilter

In my previous post, I described my Neo4j/Lucene combo service that allows you to look up concepts in the Neo4j graph database by name or ID. The lookup happens using Lucene. My plan is to build an entity-recognition system (using an interface similar to OpenCalais - pass in a body of text, and the service returns a list of entities in the text that match a given vocabulary).

I hadn't thought this through completely before, but for the name lookup, I need exact match just like the ID lookup. So for example, if my text contains "Heart Attack", I want to recognize the entity "Heart Attack", not "Heart Attack Prevention". So my initial approach of stuffing all names into the same Lucene record as a multi-field had to change.

Based on the advice found in this solr-user mailing list discussion, I changed the code to write each name and synonym in a separate Lucene record, changed the analysis to use omitNorms(), and introduced an un-analyzed field for exact match boosting.

On the query side, I changed the query (analyzed with the analyzer chain described in my previous post) to add an optional exact-match clause against the un-analyzed field syn_s, to boost the record whose name matches the query exactly. Here is the updated code for both methods.

  public void addNode(TConcept concept, Long nid) 
      throws IOException {
    Set<String> syns = new HashSet<String>();
    syns.add(concept.getPname());
    syns.add(concept.getQname());
    syns.addAll(concept.getSynonyms());
    for (String syn : syns) {
      Document doc = new Document();
      doc.add(new Field("oid", String.valueOf(concept.getOid()), 
        Store.YES, Index.ANALYZED));
      doc.add(new Field("syn", syn, Store.YES, Index.ANALYZED_NO_NORMS));
      doc.add(new Field("syn_s", StringUtils.lowerCase(syn), Store.YES, 
       Index.NOT_ANALYZED));
      doc.add(new Field("nid", String.valueOf(nid), 
        Store.YES, Index.NO));
      writer.addDocument(doc);
    }
    writer.commit();
  }

  ...

  public List<Long> getNids(String name) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_40, null, analyzer);
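    // required phrase clause on the analyzed syn field, plus an optional
    // exact-match clause on syn_s boosted to float exact matches to the top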
    Query q = parser.parse("+syn:\"" + name + 
        "\" syn_s:\"" + StringUtils.lowerCase(name) + "\"^100");
    ScoreDoc[] hits = searcher.search(q, Integer.MAX_VALUE).scoreDocs;
    List<Long> nodeIds = new ArrayList<Long>();
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      nodeIds.add(Long.valueOf(doc.get("nid")));
    }
    return nodeIds;
  }

Essentially, I changed the query from syn:"foo" to +syn:"foo" syn_s:"foo"^100. The first (required) clause ensures that records containing "foo" in the name are matched, and the second (optional) clause boosts the record whose name is exactly "foo" to the top of the results. Of course, this is just part of the solution - since my client expects to see only exact matches (and there can be more than one exact match), getNids() needs some post-processing code to remove the non-exact matches.
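
A minimal sketch of that post-processing might look like the following - note that getNidsExactOnly is a hypothetical variant, not code from my service. Since syn_s is stored, we can simply compare its stored value against the lowercased query string and drop non-exact hits.

  public List<Long> getNidsExactOnly(String name) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_40, null, analyzer);
    Query q = parser.parse("+syn:\"" + name + 
        "\" syn_s:\"" + StringUtils.lowerCase(name) + "\"^100");
    ScoreDoc[] hits = searcher.search(q, Integer.MAX_VALUE).scoreDocs;
    List<Long> nodeIds = new ArrayList<Long>();
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      // keep only hits whose stored un-analyzed name matches exactly
      if (StringUtils.lowerCase(name).equals(doc.get("syn_s"))) {
        nodeIds.add(Long.valueOf(doc.get("nid")));
      }
    }
    return nodeIds;
  }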

Of course, as Chris Hostetter points out in the discussion, if you want exact match, then you shouldn't tokenize in the first place. However, in my case, I do need to tokenize as part of my normalization process, which injects synonyms via dictionary and pattern replacement, removes stopwords, and selectively stems the tokens. The solution for me, then, is to join the tokens back into a single term before writing it out to the index.

This post describes a TokenFilter that I wrote to put at the end of my Tokenizer/TokenFilter chain, which takes the tokens produced by upstream tokenizers and creates a set of phrase tokens out of them.

My code is based on a similar component written by Robert Gründler. My code differs from his in that it uses a slightly newer Lucene API (in the trunk version, the Token next() method is no longer available for overriding), and it generates multiple phrase tokens when synonyms are encountered among the tokens (signaled by a position increment of 0). For example, if the upstream filters produce the tokens [heart, cardiac(0), attack], the filter emits both "heart attack" and "cardiac attack". Here is the code:

// Source: src/main/java/com/mycompany/tgni/lucene/TokenConcatenatingTokenFilter.java

package com.mycompany.tgni.lucene;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class TokenConcatenatingTokenFilter extends TokenFilter {

  private CharTermAttribute termAttr;
  private PositionIncrementAttribute posIncAttr;
  
  private AttributeSource.State current;
  private LinkedList<List<String>> words;
  private LinkedList<String> phrases;

  private boolean concat = false;
  
  public TokenConcatenatingTokenFilter(TokenStream input) {
    super(input);
    this.termAttr = addAttribute(CharTermAttribute.class);
    this.posIncAttr = addAttribute(PositionIncrementAttribute.class);
    this.words = new LinkedList<List<String>>();
    this.phrases = new LinkedList<String>();
  }

  @Override
  public boolean incrementToken() throws IOException {
    // consume the entire input stream, grouping synonym tokens
    // (position increment == 0) into the same word slot
    while (input.incrementToken()) {
      String term = new String(termAttr.buffer(), 0, termAttr.length());
      List<String> word = posIncAttr.getPositionIncrement() > 0 ?
        new ArrayList<String>() : words.removeLast();
      word.add(term);
      words.add(word);
    }
    // now write out the accumulated words as phrase tokens
    if (! concat) {
      makePhrases(words, phrases, 0);
      words.clear();
      concat = true;
    }
    while (phrases.size() > 0) {
      String phrase = phrases.removeFirst();
      restoreState(current);
      clearAttributes();
      termAttr.copyBuffer(phrase.toCharArray(), 0, phrase.length());
      termAttr.setLength(phrase.length());
      current = captureState();
      return true;
    }
    concat = false;
    return false;
  }

  @Override
  public void reset() throws IOException {
    // clear accumulated state so the filter can be safely reused
    super.reset();
    words.clear();
    phrases.clear();
    concat = false;
    current = null;
  }
  
  private void makePhrases(List<List<String>> words, 
      List<String> phrases, int currPos) {
    if (currPos == words.size()) {
      return;
    }
    if (phrases.size() == 0) {
      phrases.addAll(words.get(currPos));
    } else {
      List<String> newPhrases = new ArrayList<String>();
      for (String phrase : phrases) {
        for (String word : words.get(currPos)) {
          newPhrases.add(StringUtils.join(new String[] {phrase, word}, " "));
        }
      }
      phrases.clear();
      phrases.addAll(newPhrases);
    }
    makePhrases(words, phrases, currPos + 1);
  }
}
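
Here is a quick way to see the filter in action - a minimal sketch that drives it with a plain WhitespaceTokenizer rather than my UIMA-based tokenizer, so no synonyms are injected and a single concatenated token comes out. With a synonym-injecting chain like the one below, the loop would print multiple phrase tokens instead of one.

  TokenStream ts = new TokenConcatenatingTokenFilter(
    new WhitespaceTokenizer(Version.LUCENE_40, 
      new StringReader("heart attack prevention")));
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    // prints the single phrase token: "heart attack prevention"
    System.out.println(new String(term.buffer(), 0, term.length()));
  }
  ts.close();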

I just stick it at the end of the analyzer chain I've been using and rebuild my index. My revised analyzer chain looks like this (see the last line in the tokenStream() method):

public class QueryMappingAnalyzer extends Analyzer {

  private String aeDescriptor;
  private Set<?> stopset;
  
  public QueryMappingAnalyzer(String stopwordsFile, String aeDescriptor) 
      throws IOException {
    this.stopset = StopFilter.makeStopSet(Version.LUCENE_40, 
      new File(stopwordsFile));
    this.aeDescriptor = aeDescriptor;
  }
  
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    SynonymMap synonymMap = new SynonymMap();
    TokenStream input = new UimaAETokenizer(reader, aeDescriptor, 
      null, synonymMap);
    input = new SynonymFilter(input, synonymMap);
    input = new LowerCaseFilter(Version.LUCENE_40, input, true);
    input = new StopFilter(Version.LUCENE_40, input, stopset, false);
    input = new PorterStemFilter(input);
    // concatenate tokens produced by upstream analysis into phrase token
    input = new TokenConcatenatingTokenFilter(input);
    return input;
  }
}

Using this updated analyzer chain, I am able to revert to my previous indexing structure of storing synonyms together in one record (since each phrase or phrase synonym is stored as a complete unit), and revert my query to just use the single syn field, like so:

  public void addNode(TConcept concept, Long nid) 
      throws IOException {
    Set<String> syns = new HashSet<String>();
    syns.add(concept.getPname());
    syns.add(concept.getQname());
    syns.addAll(concept.getSynonyms());
    Document doc = new Document();
    doc.add(new Field("oid", String.valueOf(concept.getOid()),
      Store.YES, Index.ANALYZED));
    for (String syn : syns) {
      doc.add(new Field("syn", syn, Store.YES, Index.ANALYZED));
    }
    doc.add(new Field("nid", String.valueOf(nid), Store.YES, Index.NO));
    writer.addDocument(doc);
//    for (String syn : syns) {
//      Document doc = new Document();
//      doc.add(new Field("oid", String.valueOf(concept.getOid()), 
//        Store.YES, Index.ANALYZED));
//      doc.add(new Field("syn", syn, Store.YES, Index.ANALYZED_NO_NORMS));
//      doc.add(new Field("syn_s", StringUtils.lowerCase(syn), Store.YES, Index.NOT_ANALYZED));
//      doc.add(new Field("nid", String.valueOf(nid), 
//        Store.YES, Index.NO));
//      writer.addDocument(doc);
//    }
    writer.commit();
  }

  ...

  public List<Long> getNids(String name) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_40, null, analyzer);
//    Query q = parser.parse("+syn:\"" + name + 
//        "\" syn_s:\"" + StringUtils.lowerCase(name) + "\"^100");
    Query q = parser.parse("syn:\"" + name + "\"");
    ScoreDoc[] hits = searcher.search(q, Integer.MAX_VALUE).scoreDocs;
    List<Long> nodeIds = new ArrayList<Long>();
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      nodeIds.add(Long.valueOf(doc.get("nid")));
    }
    return nodeIds;
  }

Of course, this does bring up the question of whether I really need Lucene for lookup at all, since my requirements seem to be exact-match lookup by both OID and name. It seems to me that an embedded database such as HSQLDB or SQLite may be a better choice. I would still use the Lucene analysis API to do the normalization before writing to the database and before reading from it, but the data would not be stored in a Lucene index. I have to think this through a bit before changing the lookup mechanism, because I want to make sure that I will never need Lucene's search capabilities in this project.

Tuesday, July 19, 2011

A Homegrown Lucene Integration with Neo4j

In my previous post, I took a quick look at Neo4j version 1.4M4. My goal is to build a graph-based view into our taxonomy, which currently resides in an Oracle database and has two major entities - concepts and relationships. Concepts are related to each other via named and weighted relationships. As you can imagine, a graph database such as Neo4j is a natural fit for such a structure.

For this graph-based view, I need not only to be able to navigate from one concept to another using their connecting relationships, but also to look up a node using either a numeric ID or a name (including any of its synonyms). The last time I used Neo4j, it supported an IndexService, which has since been deprecated and replaced with a more feature-rich but also much more tightly coupled Indexing Framework.

The indexing framework is nice, but it looked like too much work to integrate my stuff (using Lucene 4.0 from trunk) into it. Waiting for the Lucene team to release 4.0 and the Neo4j team to integrate it did not seem that great an option to me either.

However, while reading the Guidelines for Building a Neo4j Application, I had a bit of an epiphany. What if I used Lucene to do the lookup, extracted the (Neo4j) node ID from the matched record(s), then used Neo4j's getNodeById(Long) to get the reference into Neo4j? The nice thing about this approach is that I am no longer dependent on Neo4j's support for a specific Lucene version - I can use my existing Lucene/UIMA code for lookup and Neo4j for traversal.
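
In outline, the lookup path looks something like this (a sketch, where index and graphDb are the Lucene and Neo4j service objects described later in this post):

  // Lucene resolves a name to Neo4j node ids...
  List<Long> nids = index.getNids("heart attack", 10, 0.5F);
  for (Long nid : nids) {
    // ...and Neo4j resolves each node id to a graph node
    Node node = graphDb.getNodeById(nid);
    // convert node properties back into a domain bean here
  }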

The rest of this post describes my first cut at a domain model and the API into this domain model, along with the services which power this API. It's very application-dependent, so it's very likely that you would be bored out of your mind while reading this. There ... you have been warned!

The Domain Model

The domain model is very simple. It consists of three beans - two classes and an enum. The two classes model the concept and the relation, and are called TConcept and TRelation respectively. They are POJOs; I have omitted the getters and setters for brevity.

// Source: src/main/java/com/mycompany/tgni/beans/TConcept.java
package com.mycompany.tgni.beans;

import java.util.List;
import java.util.Map;

/**
 * Models single concept.
 */
public class TConcept {

  private Integer oid;
  private String pname;
  private String qname;
  private List<String> synonyms;
  private Map<String,String> stycodes;
  private String stygrp;
  private Long mrank;
  private Long arank;
  private Integer tid;
  
  // ... getters and setters omitted ...

}

The important properties here are the OID (Oracle ID), which is the unique ID assigned by Oracle when the concept is imported into it. The pname, qname and synonyms fields are used for lookup by name. The other fields are for classification and ranking and are not important for this discussion.

// Source: src/main/java/com/mycompany/tgni/beans/TRelation.java
package com.mycompany.tgni.beans;

/**
 * Models relation between two TConcept objects.
 */
public class TRelation {

  private Integer fromOid;
  private TRelTypes relType;
  private Integer toOid;
  private Long mrank;
  private Long arank;
  private boolean mstip;
  private Long rmrank;
  private Long rarank;
  
  // ... getters and setters omitted ...

}

As before, the fields that uniquely identify the relationship are the two concepts at either end (fromOid and toOid), the relationship type (relType), and the weight of the relationship (a combination of mstip, mrank and arank). The other fields are for reverse relationships, which are fairly trivial to support but which I haven't implemented so far.

Finally, there is the TRelTypes enum, which implements Neo4j's RelationshipType interface to define relationship types that are unique to my application. The actual names are not important, so I have replaced them with dummy names. Since the relationship types are uniquely identified in the database by a numeric ID, we need a way to get the TRelTypes enum from its database ID; we also need the lookup by name in the NodeService class described below. Here is the code:

// Source: src/main/java/com/mycompany/tgni/beans/TRelTypes.java
package com.mycompany.tgni.beans;

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.RelationshipType;

/**
 * Enumeration of all relationship types.
 */
public enum TRelTypes implements RelationshipType {
  
  REL_1 (1),
  REL_2 (2),
  // ... more relationship types, omitted ...
  REL_20 (20)
  ;
  
  private Integer oid;
  
  private TRelTypes(Integer oid) {
    this.oid = oid;
  }

  private static Map<Integer,TRelTypes> oidMap = null;
  private static Map<String,TRelTypes> nameMap = null;
  static {
    oidMap = new HashMap<Integer,TRelTypes>();
    nameMap = new HashMap<String,TRelTypes>();
    for (TRelTypes type : TRelTypes.values()) {
      oidMap.put(type.oid, type);
      nameMap.put(type.name(), type);
    }
  }
  
  public static TRelTypes fromOid(Integer oid) {
    return oidMap.get(oid);
  }
  
  public static TRelTypes fromName(String name) {
    return nameMap.get(name);
  }
}
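
With the two static maps in place, both lookups are single map gets:

  TRelTypes t1 = TRelTypes.fromOid(2);         // returns REL_2
  TRelTypes t2 = TRelTypes.fromName("REL_1");  // returns REL_1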

API Usage

The API consists of a single service class that exposes lookup and navigation operations on the graph in terms of TConcept and TRelation objects. The client of the API does not ever have a reference to any Neo4j or Lucene object.

In addition, there are some methods that allow insertion and updating of TConcept and TRelation objects. These are for internal use when loading from the database, so the Neo4j node ID has to be exposed here. These methods are not part of the public API, and I will remove them from a future version of NodeService.

The sample code below (copy-pasted from one of my JUnit tests) illustrates the usage of the public methods of the API.

// to set up the NodeService
NodeService nodeService = new NodeService();
nodeService.setGraphDir("data/graphdb");
nodeService.setIndexDir("data/index");
nodeService.setStopwordsFile("src/main/resources/stopwords.txt");
nodeService.setTaxonomyMappingAEDescriptor(
  "src/main/resources/descriptors/TaxonomyMappingAE.xml");
nodeService.init();

// look up a concept by OID
TConcept concept = nodeService.getConcept(123456);

// look up a concept by name
// the second parameter is the maximum number of results to return,
// and the third parameter is the minimum (Lucene) score to allow
List<TConcept> concepts = nodeService.getConcepts("foo", 10, 0.5F);

// get count of related concepts by relation type
Bag<TRelTypes> counts = nodeService.getRelationCounts(concept);

// get pointers to related concepts for a given relationship type
// if the (optional) sort parameter is not supplied, the List of
// TRelation objects is sorted using the default comparator.
List<TRelation> rels = nodeService.getRelatedConcepts(concept, TRelTypes.REL_1);

// to shut down the NodeService
nodeService.destroy();

Node Service

The client interacts directly with the NodeService, which hides the details of the underlying Neo4j and Lucene stores. I may also introduce (EHCache-based) caching in this layer in the future, because this application is going to have to compete (in terms of performance) with a system which currently models the graph as in-memory maps, but which we want to phase out because of its rather large memory requirements. Anyway, here is the code. As mentioned before, it has several add/update/delete methods, which I will remove in the future.

// Source: src/main/java/com/mycompany/tgni/neo4j/NodeService.java
package com.mycompany.tgni.neo4j;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.lucene.search.Query;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

import com.mycompany.tgni.beans.TConcept;
import com.mycompany.tgni.beans.TRelTypes;
import com.mycompany.tgni.beans.TRelation;
import com.mycompany.tgni.lucene.LuceneIndexService;

public class NodeService {

  private String graphDir;
  private String indexDir;
  private String stopwordsFile;
  private String taxonomyMappingAEDescriptor;
  
  private GraphDatabaseService graphDb;
  private LuceneIndexService index;

  public void setGraphDir(String graphDir) {
    this.graphDir = graphDir;
  }

  public void setIndexDir(String indexDir) {
    this.indexDir = indexDir;
  }

  public void setStopwordsFile(String stopwordsFile) {
    this.stopwordsFile = stopwordsFile;
  }

  public void setTaxonomyMappingAEDescriptor(String aeDescriptor) {
    this.taxonomyMappingAEDescriptor = aeDescriptor;
  }

  public void init() throws Exception {
    this.graphDb = new EmbeddedGraphDatabase(graphDir);
    this.index = new LuceneIndexService();
    this.index.setIndexDirPath(indexDir);
    this.index.setStopwordsFile(stopwordsFile);
    this.index.setTaxonomyMappingAEDescriptor(taxonomyMappingAEDescriptor);
    index.init();
  }
  
  public void destroy() throws Exception {
    index.destroy();
    graphDb.shutdown();
  }
  
  public Long addConcept(TConcept concept) throws Exception {
    Transaction tx = graphDb.beginTx();
    Long nodeId = -1L;
    try {
      Node node = graphDb.createNode();
      nodeId = toNode(node, concept);
      index.addNode(concept, nodeId);
      tx.success();
    } catch (Exception e) {
      tx.failure();
      throw e;
    } finally {
      tx.finish();
    }
    return nodeId;
  }
  
  public Long updateConcept(TConcept concept) throws Exception {
    Long nodeId = index.getNid(concept.getOid());
    if (nodeId > 0L) {
      Transaction tx = graphDb.beginTx();
      try {
        Node node = graphDb.getNodeById(nodeId);
        toNode(node, concept);
        index.updateNode(concept);
        tx.success();
      } catch (Exception e) {
        tx.failure();
        throw e;
      } finally {
        tx.finish();
      }
    }
    return nodeId;
  }
  
  public Long removeConcept(TConcept concept) throws Exception {
    Long nodeId = index.getNid(concept.getOid());
    if (nodeId > 0L) {
      Transaction tx = graphDb.beginTx();
      try {
        Node node = graphDb.getNodeById(nodeId);
        if (node.hasRelationship()) {
          throw new Exception("Node cannot be deleted. Remove it first!");
        }
        node.delete();
        index.removeNode(concept);
        tx.success();
      } catch (Exception e) {
        tx.failure();
        throw e;
      } finally {
        tx.finish();
      }
    }
    return nodeId;
  }
  
  public void addRelation(TRelation rel) throws Exception {
    Long fromNodeId = index.getNid(rel.getFromOid());
    Long toNodeId = index.getNid(rel.getToOid());
    if (!fromNodeId.equals(toNodeId) &&
        (fromNodeId > 0L && toNodeId > 0L)) {
      Transaction tx = graphDb.beginTx();
      try {
        Node fromNode = graphDb.getNodeById(fromNodeId);
        Node toNode = graphDb.getNodeById(toNodeId);
        TRelTypes relType = rel.getRelType();
        Relationship relationship = 
          fromNode.createRelationshipTo(toNode, relType);
        relationship.setProperty("mrank", rel.getMrank());
        relationship.setProperty("arank", rel.getArank());
        relationship.setProperty("mstip", rel.getMstip());
        // TODO: handle reverse relationships in future
        tx.success();
      } catch (Exception e) {
        tx.failure();
        throw e;
      } finally {
        tx.finish();
      }
    }
  }
  
  public void removeRelation(TRelation rel) throws Exception {
    Long fromNodeId = index.getNid(rel.getFromOid());
    Long toNodeId = index.getNid(rel.getToOid());
    if (!fromNodeId.equals(toNodeId) &&
        (fromNodeId > 0L && toNodeId > 0L)) {
      Transaction tx = graphDb.beginTx();
      try {
        Node fromNode = graphDb.getNodeById(fromNodeId);
        Relationship relationshipToDelete = null;
        for (Relationship relationship : 
            fromNode.getRelationships(rel.getRelType(), Direction.OUTGOING)) {
          Node endNode = relationship.getEndNode();
          if (endNode.getId() == toNodeId) {
            relationshipToDelete = relationship;
            break;
          }
        }
        if (relationshipToDelete != null) {
          relationshipToDelete.delete();
        }
        tx.success();
      } catch (Exception e) {
        tx.failure();
        throw e;
      } finally {
        tx.finish();
      }
    }
  }
  
  public TConcept getConcept(Integer oid) throws Exception {
    Long nid = index.getNid(oid);
    Node node = graphDb.getNodeById(nid);
    return toConcept(node); 
  }
  
  public List<TConcept> getConcepts(String name, int maxDocs, 
      float minScore) throws Exception {
    List<Long> nids = index.getNids(name, maxDocs, minScore);
    List<TConcept> concepts = new ArrayList<TConcept>();
    for (Long nid : nids) {
      Node node = graphDb.getNodeById(nid);
      concepts.add(toConcept(node));
    }
    return concepts;
  }
  
  public List<TConcept> getConcepts(Query query, int maxDocs, float minScore) 
      throws Exception {
    List<Long> nids = index.getNids(query, maxDocs, minScore);
    List<TConcept> concepts = new ArrayList<TConcept>();
    for (Long nid : nids) {
      Node node = graphDb.getNodeById(nid);
      concepts.add(toConcept(node));
    }
    return concepts;
  }

  public Bag<TRelTypes> getRelationCounts(TConcept concept) 
      throws Exception {
    Bag<TRelTypes> counts = new HashBag<TRelTypes>();
    Long nid = index.getNid(concept.getOid());
    Node node = graphDb.getNodeById(nid);
    for (Relationship relationship : 
        node.getRelationships(Direction.OUTGOING)) {
      TRelTypes type = TRelTypes.fromName(
        relationship.getType().name()); 
      if (type != null) {
        counts.add(type);
      }
    }
    return counts;
  }
  
  private static final Comparator<TRelation> DEFAULT_SORT = 
    new Comparator<TRelation>() {
      @Override public int compare(TRelation r1, TRelation r2) {
        if (r1.getMstip() != r2.getMstip()) {
          return r1.getMstip() ? -1 : 1;
        } else {
          Long mrank1 = r1.getMrank();
          Long mrank2 = r2.getMrank();
          if (!mrank1.equals(mrank2)) {
            return mrank2.compareTo(mrank1);
          } else {
            Long arank1 = r1.getArank();
            Long arank2 = r2.getArank();
            return arank2.compareTo(arank1);
          }
        }
      }
  };
  
  public List<TRelation> getRelatedConcepts(TConcept concept,
      TRelTypes type) throws Exception {
    return getRelatedConcepts(concept, type, DEFAULT_SORT);
  }
  
  public List<TRelation> getRelatedConcepts(TConcept concept, 
      TRelTypes type, Comparator<TRelation> sort) 
      throws Exception {
    Long nid = index.getNid(concept.getOid());
    Node node = graphDb.getNodeById(nid);
    List<TRelation> rels = new ArrayList<TRelation>();
    if (node != null) {
      for (Relationship relationship : 
          node.getRelationships(type, Direction.OUTGOING)) {
        RelationshipType relationshipType = relationship.getType();
        if (TRelTypes.fromName(relationshipType.name()) != null) {
          Node relatedNode = relationship.getEndNode();
          Integer relatedConceptOid = (Integer) relatedNode.getProperty("oid");
          TRelation rel = new TRelation();
          rel.setFromOid(concept.getOid());
          rel.setToOid(relatedConceptOid);
          rel.setMstip((Boolean) relationship.getProperty("mstip"));
          rel.setMrank((Long) relationship.getProperty("mrank"));
          rel.setArank((Long) relationship.getProperty("arank"));
          rel.setRelType(TRelTypes.fromName(relationshipType.name()));
          rels.add(rel);
        }
      }
      Collections.sort(rels, sort);
      return rels;
    }
    return Collections.emptyList();
  }
  
  private Long toNode(Node node, TConcept concept) {
    node.setProperty("oid", concept.getOid());
    node.setProperty("pname", concept.getPname());
    node.setProperty("qname", concept.getQname());
    node.setProperty("synonyms", 
      JsonUtils.listToString(concept.getSynonyms())); 
    node.setProperty("stycodes", 
      JsonUtils.mapToString(concept.getStycodes())); 
    node.setProperty("stygrp", concept.getStygrp());
    node.setProperty("mrank", concept.getMrank());
    node.setProperty("arank", concept.getArank());
    return node.getId();
  }
  
  @SuppressWarnings("unchecked")
  private TConcept toConcept(Node node) {
    TConcept concept = new TConcept();
    concept.setOid((Integer) node.getProperty("oid"));
    concept.setPname((String) node.getProperty("pname"));
    concept.setQname((String) node.getProperty("qname"));
    concept.setSynonyms(JsonUtils.stringToList(
      (String) node.getProperty("synonyms")));
    concept.setStycodes(JsonUtils.stringToMap(
      (String) node.getProperty("stycodes")));
    concept.setStygrp((String) node.getProperty("stygrp"));
    concept.setMrank((Long) node.getProperty("mrank"));
    concept.setArank((Long) node.getProperty("arank"));
    return concept;
  }
}

Lucene Index Service

The Lucene Index Service provides methods to look up a concept by ID or by name. To do this, it uses a PerFieldAnalyzerWrapper that exposes the KeywordAnalyzer as its default Analyzer, and for its "syns" (synonym) field it uses the TaxonomyNameMappingAnalyzer (which builds out a Tokenizer/TokenFilter chain identical to the one described here).

Additionally, it provides some persistence methods to write/update and delete TConcept objects from the Lucene index.

// Source: src/main/java/com/mycompany/tgni/lucene/LuceneIndexService.java
package com.mycompany.tgni.lucene;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.util.StringUtils;

import com.mycompany.tgni.beans.TConcept;

public class LuceneIndexService {

  private final Logger logger = LoggerFactory.getLogger(getClass());
  
  private String stopwordsFile;
  private String taxonomyMappingAEDescriptor;
  private String indexDirPath;

  public void setStopwordsFile(String stopwordsFile) {
    this.stopwordsFile = stopwordsFile;
  }

  public void setTaxonomyMappingAEDescriptor(String taxonomyMappingAEDescriptor) {
    this.taxonomyMappingAEDescriptor = taxonomyMappingAEDescriptor;
  }

  public void setIndexDirPath(String indexDirPath) {
    this.indexDirPath = indexDirPath;
  }

  private Analyzer analyzer;
  private IndexWriter writer;
  private IndexSearcher searcher;

  public void init() throws IOException {
    Map<String,Analyzer> otherAnalyzers = 
      new HashMap<String,Analyzer>();
    otherAnalyzers.put("syns", new TaxonomyNameMappingAnalyzer(
      stopwordsFile, taxonomyMappingAEDescriptor));
    analyzer = new PerFieldAnalyzerWrapper(
      new KeywordAnalyzer(), otherAnalyzers);
    IndexWriterConfig iwconf = new IndexWriterConfig(
      Version.LUCENE_40, analyzer);
    iwconf.setOpenMode(OpenMode.CREATE_OR_APPEND);
    Directory indexDir = FSDirectory.open(new File(indexDirPath));
    writer = new IndexWriter(indexDir, iwconf);
    writer.commit();
    searcher = new IndexSearcher(indexDir, true);
  }
  
  public void destroy() throws IOException {
    if (writer != null) {
      writer.commit();
      writer.optimize();
      writer.close();
    }
    if (searcher != null) {
      searcher.close();
    }
  }
  
  /**
   * Adds the relevant fields from a TConcept object into the 
   * Lucene index.
   * @param concept a TConcept object.
   * @param nid the node id from Neo4j.
   * @throws IOException if thrown.
   */
  public void addNode(TConcept concept, Long nid) 
      throws IOException {
    logger.debug("Adding concept=" + concept);
    Document doc = new Document();
    doc.add(new Field("oid", String.valueOf(concept.getOid()), 
      Store.YES, Index.ANALYZED));
    doc.add(new Field("syns", concept.getPname(), Store.YES, Index.ANALYZED));
    doc.add(new Field("syns", concept.getQname(), Store.YES, Index.ANALYZED));
    for (String syn : concept.getSynonyms()) {
      doc.add(new Field("syns", syn, Store.YES, Index.ANALYZED));
    }
    doc.add(new Field("nid", String.valueOf(nid), 
      Store.YES, Index.NO));
    writer.addDocument(doc);
    writer.commit();
  }
  
  /**
   * Removes a TConcept entry from the Lucene index. The caller is
   * responsible for checking whether the corresponding node is still
   * connected to some other node in the graph. We remove the
   * record by OID (which is guaranteed to be unique).
   * @param concept a TConcept object.
   * @throws IOException if thrown.
   */
  public void removeNode(TConcept concept) throws IOException {
    writer.deleteDocuments(new Term("oid", 
      String.valueOf(concept.getOid())));
    writer.commit();
  }
  
  /**
   * Update node information in place.
   * @param concept the concept to update.
   * @throws IOException if thrown.
   */
  public void updateNode(TConcept concept) 
      throws IOException {
    Long nid = getNid(concept.getOid());
    if (nid != -1L) {
      removeNode(concept);
      addNode(concept, nid);
    }
  }

  /**
   * Returns the node id given the unique ID of a TConcept object. 
   * @param oid the unique id of the TConcept object.
   * @return the corresponding Neo4j node id.
   * @throws IOException if thrown.
   */
  public Long getNid(Integer oid) throws IOException {
    Query q = new TermQuery(new Term(
      "oid", String.valueOf(oid)));
    ScoreDoc[] hits = searcher.search(q, 1).scoreDocs; 
    if (hits.length == 0) {
      // nothing to update, leave
      return -1L;
    }
    Document doc = searcher.doc(hits[0].doc);
    return Long.valueOf(doc.get("nid"));
  }
  
  /**
   * Gets a list of Neo4j node ids given a string to match against.
   * At most maxNodes node ids are returned; iteration stops at the
   * first hit whose score falls below minScore.
   * @param name the string to match against.
   * @param maxNodes the maximum number of node ids to return.
   * @param minScore the minimum score to allow.
   * @return a List of Neo4j node ids.
   * @throws Exception if thrown.
   */
  public List<Long> getNids(String name, int maxNodes, 
      float minScore) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_40, "syns", analyzer);
    Query q = parser.parse("syns:" + StringUtils.quote(name));
    return getNids(q, maxNodes, minScore);
  }
  
  /**
   * Returns a list of Neo4j node ids that match a given Lucene
   * query. At most maxNodes node ids are returned; iteration stops
   * at the first hit whose score falls below minScore.
   * @param query the Lucene query to match against.
   * @param maxNodes the maximum number of node ids to return.
   * @param minScore the minimum score to allow.
   * @return a List of Neo4j node ids.
   * @throws Exception if thrown.
   */
  public List<Long> getNids(Query query, int maxNodes,
      float minScore) throws Exception {
    ScoreDoc[] hits = searcher.search(query, maxNodes).scoreDocs;
    List<Long> nodeIds = new ArrayList<Long>();
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      if (hits[i].score < minScore) {
        break;
      }
      nodeIds.add(Long.valueOf(doc.get("nid")));
    }
    return nodeIds;
  }
}

That's pretty much it. If you have used Lucene and Neo4j together, I would appreciate your thoughts, in case you see some obvious gotchas in the approach described above.