Saturday, September 29, 2007

Using Lucene with Jython

In a previous post, I described a workaround for using Lucene BooleanQueries with PyLucene. Basically, all it involved was building the query string programmatically, using the AND and OR boolean operators from Lucene's Query Parser syntax, before passing it to the PyLucene.QueryParser object.
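
As a reminder of what that workaround looks like, here is a minimal sketch of assembling such a query string. The helper names (quote_term, join_clauses) and the search terms are made up for illustration and are not taken from the earlier post; the resulting string is what would be handed to the query parser.

```python
def quote_term(term):
    """Wrap multi-word terms in double quotes so they are parsed as phrases."""
    term = term.strip()
    if " " in term:
        return '"%s"' % term.replace('"', '\\"')
    return term

def join_clauses(clauses, operator):
    """Join clauses with AND or OR, parenthesizing so the group can be nested."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    clauses = [c for c in clauses if c]
    if not clauses:
        return ""
    if len(clauses) == 1:
        return clauses[0]
    return "(" + (" %s " % operator).join(clauses) + ")"

# Hypothetical example: a phrase that must match, plus either of two words
query_string = join_clauses([
    quote_term("heart disease"),
    join_clauses([quote_term("treatment"), quote_term("therapy")], "OR"),
], "AND")
print(query_string)
```

The parentheses keep the nested OR group from being regrouped by the parser, at the cost of a slightly noisier query string.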

However, I now faced a slightly different problem. My task was to quality check an index built using a custom Lucene Analyzer (written in Java). The base queries that users were expected to type into our search page were available as a flat file. The quality check involved converting each input query into a custom Lucene Query object, applying a set of standard facets to the Query using a QueryFilter, and writing the results of each IndexSearcher.search(Query,QueryFilter) call into another flat file.

Of course, the most logical solution would have been to write a Java JUnit test that did this. But this was a one-off, and writing Java code for it seemed wasteful. I had experimented with Jython once before, when I was looking for a way to call some standalone Java programs from the command line. So I decided to try the same approach of adding the JAR files I needed to Jython's sys.path.

So here is my code, which should be pretty much self-explanatory. The script takes three input arguments: the path to the Lucene index, the path to the input file of query strings, and the path to the file where the report should be written.

#!/opt/jython2.2/jython
import sys

def usage():
  print " ".join([sys.argv[0], "/path/to/index/to/read", "/path/to/input/file", \
    "/path/to/output/file"])
  sys.exit(-1)

def main():
  # Command line processing
  if (len(sys.argv) != 4):
    usage()

  # Set up constants for reporting
  facetValues = ["value1", "value2", "value3", "value4", "value5"]

  # Add jars to classpath
  jars = [
    "/full/path/to/lucene.jar",
    "/full/path/to/our/custom/analyzer.jar",
    # ... other dependency jars
    ]
  for jar in jars:
    sys.path.append(jar)

  # Import references
  from org.apache.lucene.index import Term
  from org.apache.lucene.queryParser import QueryParser
  from org.apache.lucene.search import IndexSearcher
  from org.apache.lucene.search import TermQuery
  from org.apache.lucene.search import QueryFilter
  from org.apache.lucene.store import FSDirectory
  from com.mycompany.analyzer import MyCustomAnalyzer

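  # load up an array with the input query strings, skipping blank lines
  querystrings = []
  infile = open(sys.argv[2], 'r')
  outfile = open(sys.argv[3], 'w')
  for line in infile.readlines():
    # strip() removes the newline without eating the last character
    # of a final line that has no trailing newline
    line = line.strip()
    if (line == ''):
      continue
    querystrings.append(line)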
  
  # search for the query and facet
  dir = FSDirectory.getDirectory(sys.argv[1], False)
  analyzer = MyCustomAnalyzer()
  searcher = IndexSearcher(dir)
  for querystring in querystrings:
    for facetValue in facetValues:
      luceneQuery = buildCustomLuceneQuery(querystring)
      query = QueryParser("body", analyzer).parse(luceneQuery)
      queryfilter = QueryFilter(TermQuery(Term("facet", facetValue)))
      hits = searcher.search(query, queryfilter)
      numHits = hits.length()
      # if we found nothing for this query and facet, we report it
      if (numHits == 0):
        outfile.write("|".join([querystring, facetValue, 'No Title', 'No URL', '0.0']) + "\n")
        continue
      # show up to the top 3 results for the query and facet
      for i in range(0, min(numHits, 3)):
        doc = hits.doc(i)
        score = hits.score(i)
        title = doc.get("title")
        url = doc.get("url")
        outfile.write("|".join([querystring, facetValue, title, url, str(score)]) + "\n")

  # clean up
  searcher.close()
  infile.close()
  outfile.close()

def buildCustomLuceneQuery(querystring):
  """ do some custom query building here """
  # placeholder: pass the input through unchanged until custom logic is added
  return querystring
  
if __name__ == "__main__":
  main()
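
Once the report exists, the rows of interest are the ones where a query returned nothing for a facet. Below is a minimal sketch of pulling those out with plain Python. It assumes each record sits on its own line in the five-field, pipe-delimited layout the script writes (query, facet, title, URL, score) and that no title or URL contains a pipe character. The function names and the sample rows are invented for illustration.

```python
def parse_report_line(line):
    """Split one report line into its five fields; the score becomes a float."""
    querystring, facet, title, url, score = line.rstrip("\n").split("|")
    return querystring, facet, title, url, float(score)

def find_empty_facets(lines):
    """Return the (querystring, facet) pairs that were reported with no hits."""
    empty = []
    for line in lines:
        if not line.strip():
            continue
        querystring, facet, title, url, score = parse_report_line(line)
        # The script marks a miss with these two placeholder values
        if title == "No Title" and url == "No URL":
            empty.append((querystring, facet))
    return empty

# Made-up sample rows in the report layout
sample = [
    "asthma|value1|Asthma overview|http://example.com/asthma|0.83\n",
    "asthma|value2|No Title|No URL|0.0\n",
]
for querystring, facet in find_empty_facets(sample):
    print("%s has no results for facet %s" % (querystring, facet))
```

With a real report, the list passed in would simply be the open file object, for example find_empty_facets(open("/path/to/output/file")).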

Why is this so cool? As you can see, the Python code is quite simple. However, it allows me to access functionality embedded in our custom Lucene Analyzer written in Java, as well as the newer features of Lucene 2.1 (PyLucene is based on Lucene 1.4) if I need them. So basically, I can now write what is essentially Java client code in the much more compact Python language. Also, if I had written a Java program, I would have had to either call Java with a rather long -classpath parameter, or build a shell script or Ant target to wrap it. With Jython, the script can be called directly from the command line.

There are some obvious downsides as well. Since I mostly use Python for scripting, I end up downloading and installing many custom Python modules that I don't necessarily install on my Jython installation. For example, for database access, I have modules installed for Oracle, MySQL and PostgreSQL. However, with Jython, we could probably just use JDBC for database access, as described in Andy Todd's blog post here. Overall, I think having access to Java code from within Python using Jython is quite useful.

2 comments (moderated to prevent spam):

Anonymous said...

Hi sujit

I have read many of your developer-alluring posts. It's great to have solutions regarding the current technologies... thanks dude

Sujit Pal said...

Hi, you're welcome, and I am glad you like the content. You should probably post as anonymous or under your own handle, otherwise it could be misconstrued as your company's viewpoint :-).