Saturday, February 17, 2007

Python script to convert CSV files to Excel

I spent much of my last weekend generating large flat files of denormalized data from various data sources, and then converting them to Excel spreadsheets for human consumption. Although the process is quite simple, thanks to OpenOffice (I am a Linux user, in case you haven't guessed already), it's a pain to have to navigate the GUI repeatedly to do this.

Looking around the web, I did not find a ready-made script that would do what I wanted, but it seemed fairly simple to write, so I decided to do that rather than suffer through yet another GUI invocation just to convert my CSV file to Excel. To read and parse the file, I used Python's built-in csv module, and to write out the Excel spreadsheets, I downloaded and installed the pyExcelerator module. In addition, since my data files were rather large, I put in a splitting mechanism that allows me to split the output into multiple Excel files, each holding a specified maximum number of lines. Here is the script.

#!/usr/local/bin/python
# Tool to convert CSV files (with configurable delimiter and text wrap
# character) to Excel spreadsheets.
import string
import sys
import getopt
import re
import os
import os.path
import csv
from pyExcelerator import *

def usage():
  """ Display the usage """
  print "Usage: " + sys.argv[0] + " [OPTIONS] csvfile"
  print "OPTIONS:"
  print "--title|-t: If set, the first line is the title line"
  print "--lines|-l n: Split output into files of n lines or less each"
  print "--sep|-s c [def:,] : The character to use for field delimiter"
  print "--output|-o : output file name/pattern"
  print "--help|-h : print this information"
  sys.exit(2)

def openExcelSheet(outputFileName):
  """ Opens a reference to an Excel WorkBook and Worksheet objects """
  workbook = Workbook()
  worksheet = workbook.add_sheet("Sheet 1")
  return workbook, worksheet

def writeExcelHeader(worksheet, titleCols):
  """ Write the header line into the worksheet """
  cno = 0
  for titleCol in titleCols:
    worksheet.write(0, cno, titleCol)
    cno = cno + 1

def writeExcelRow(worksheet, lno, columns):
  """ Write a non-header row into the worksheet """
  cno = 0
  for column in columns:
    worksheet.write(lno, cno, column)
    cno = cno + 1

def closeExcelSheet(workbook, outputFileName):
  """ Saves the in-memory WorkBook object into the specified file """
  workbook.save(outputFileName)

def getDefaultOutputFileName(inputFileName):
  """ Returns the name of the default output file based on the value
      of the input file. The default output file is always created in
      the current working directory. This can be overriden using the
      -o or --output option to explicitly specify an output file """
  baseName = os.path.basename(inputFileName)
  rootName = os.path.splitext(baseName)[0]
  return string.join([rootName, "xls"], '.')

def renameOutputFile(outputFileName, fno):
  """ Renames the output file name by appending the current file number
      to it """
  dirName, baseName = os.path.split(outputFileName)
  rootName, extName = os.path.splitext(baseName)
  backupFileBaseName = string.join([string.join([rootName, str(fno)], '-'), extName], '')
  backupFileName = os.path.join(dirName, backupFileBaseName)
  try:
    os.rename(outputFileName, backupFileName)
  except OSError:
    print "Error renaming output file:", outputFileName, "to", backupFileName, "...aborting"
    sys.exit(-1)

def validateOpts(opts):
  """ Returns option values specified, or the default if none """
  titlePresent = False
  linesPerFile = -1
  outputFileName = ""
  sepChar = ","
  for option, argval in opts:
    if (option in ("-t", "--title")):
      titlePresent = True
    if (option in ("-l", "--lines")):
      linesPerFile = int(argval)
    if (option in ("-s", "--sep")):
      sepChar = argval
    if (option in ("-o", "--output")):
      outputFileName = argval
    if (option in ("-h", "--help")):
      usage()
  return titlePresent, linesPerFile, sepChar, outputFileName

def main():
  """ This is how we are called """
  try:
    opts,args = getopt.getopt(sys.argv[1:], "tl:s:o:h", ["title", "lines=", "sep=", "output=", "help"])
  except getopt.GetoptError:
    usage()
  if (len(args) != 1):
    usage()
  inputFileName = args[0]
  try:
    inputFile = open(inputFileName, 'r')
  except IOError:
    print "File not found:", inputFileName, "...aborting"
    sys.exit(-1)
  titlePresent, linesPerFile, sepChar, outputFileName = validateOpts(opts)
  if (outputFileName == ""):
    outputFileName = getDefaultOutputFileName(inputFileName)
  workbook, worksheet = openExcelSheet(outputFileName)
  fno = 0
  lno = 0
  titleCols = []
  reader = csv.reader(inputFile, delimiter=sepChar)
  for line in reader:
    if (lno == 0 and titlePresent):
      if (len(titleCols) == 0):
        # first file: the current line is the title line itself
        titleCols = line
      else:
        # at a split boundary the current line is data, not a title, so
        # write it below the header to avoid losing it
        writeExcelRow(worksheet, 1, line)
        lno = 1
      writeExcelHeader(worksheet, titleCols)
    else:
      writeExcelRow(worksheet, lno, line)
    lno = lno + 1
    if (linesPerFile != -1 and lno >= linesPerFile):
      closeExcelSheet(workbook, outputFileName)
      renameOutputFile(outputFileName, fno)
      fno = fno + 1
      lno = 0
      workbook, worksheet = openExcelSheet(outputFileName)
  inputFile.close()
  closeExcelSheet(workbook, outputFileName)
  if (fno > 0):
    renameOutputFile(outputFileName, fno)

if __name__ == "__main__":
  main()
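For reference, a typical invocation looks like this (I saved the script as csv2xls.py; the name is my own, so substitute whatever you called it):

```
python csv2xls.py --title --lines 5000 --output report.xls data.csv
```

Since --lines is given, each 5000-line chunk is saved and renamed to report-0.xls, report-1.xls, and so on as it fills up.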

Since I could not find complete examples of this kind of thing on the web, I can only assume that the script is too trivial for most seasoned programmers to consider contributing back to the community. However, based on the few questions I saw about whether such a thing exists, I would conclude that there are some programmers (like me) for whom this is sufficiently non-trivial to warrant looking around before writing one themselves. If you are in the latter category, I hope you find this script useful.

Saturday, February 10, 2007

Embedded Databases (for Java) smackdown

I have been looking for opportunities to use embedded databases in my Java programs, but so far I have not been able to justify doing this. If the data volume was large, I would end up using a fast relational database such as MySQL with application caching, or use in-memory data structures if the data volume was small.

In my previous post, I compared three auto-completion implementations, two of which used in-memory data structures. The third one used an in-memory HSQLDB implementation and the performance was an order of magnitude worse than the other two. However, in many cases, it may be acceptable to have slightly lower performance, if we have cleaner, more maintainable code as a result, and can compensate for the performance by using faster hardware.

For this post, I compare three popular embedded database solutions on a set of simple exact-match lookups. To provide a sense of scale, I also built an implementation using a simple HashMap on one end, and an implementation against a local MySQL database on the other. I will begin with my test harness code, then give a short description and the code for each implementation, and finish with my results.

Test Harness

The test harness is a simple JUnit class, with StopWatch calls to measure the times and a MemoryMXBean to measure the change in JVM heap usage before and after the data is loaded. All the implementations implement the IEmbeddedDb interface, and each test exercises one implementation in turn, passing a reference to it into the doPerformanceTest() method. The data consists of about 2800 name-value pairs, read in from a flat file.
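The flat file itself is not shown here, but from the loader code it is a simple name=value listing, one pair per line, with #-prefixed comment lines; the sample below is purely illustrative:

```
# sample pairs file (illustrative data only)
aardvark=nocturnal burrowing mammal
abacus=bead frame for arithmetic
```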

public class EmbeddedDbTest extends TestCase {

  File inputFile = new File("...");
  String[] terms = {...};

  public void testInMemoryDb() throws Exception {
    IEmbeddedDb embeddedDb = new InMemoryEmbeddedDb();
    doPerformanceTest(embeddedDb, null);
  }

  public void testHsqlDb() throws Exception {
    IEmbeddedDb embeddedDb = new HsqlEmbeddedDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("dbPropsFile", "/tmp/hpairdb.properties");
    props.put("dbLogFile", "/tmp/hpair.log");
    props.put("dbScriptFile", "/tmp/hpairdb.script");
    props.put("driverClassName", "org.hsqldb.jdbcDriver");
    props.put("jdbcUrl", "jdbc:hsqldb:file:/tmp/hpairdb");
    props.put("username", "sa");
    props.put("password", "");
    doPerformanceTest(embeddedDb, props);
  }

  public void testHsqlCachedDb() throws Exception {
    IEmbeddedDb embeddedDb = new HsqlEmbeddedDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("dbPropsFile", "/tmp/hpaircdb.properties");
    props.put("dbLogFile", "/tmp/hpairc.log");
    props.put("dbScriptFile", "/tmp/hpaircdb.script");
    props.put("driverClassName", "org.hsqldb.jdbcDriver");
    props.put("jdbcUrl", "jdbc:hsqldb:file:/tmp/hpaircdb");
    props.put("username", "sa");
    props.put("password", "");
    props.put("useCachedTable", "true");
    doPerformanceTest(embeddedDb, props);
  }

  public void testDerbyDb() throws Exception {
    IEmbeddedDb embeddedDb = new DerbyEmbeddedDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("dbDir", "/tmp/dpairdb");
    props.put("driverClassName", "org.apache.derby.jdbc.EmbeddedDriver");
    props.put("jdbcUrl", "jdbc:derby:/tmp/dpairdb;create=true");
    doPerformanceTest(embeddedDb, props);
  }

  public void testBerkeleyDb() throws Exception {
    IEmbeddedDb embeddedDb = new BdbEmbeddedDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("dirName", "/tmp");
    props.put("dbName", "bpairdb");
    doPerformanceTest(embeddedDb, props);
  }

  public void testMySqlDb() throws Exception {
    IEmbeddedDb embeddedDb = new MySqlDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("driverClassName", "com.mysql.jdbc.Driver");
    props.put("jdbcUrl", "jdbc:mysql://localhost:3306/test");
    props.put("username", "root");
    props.put("password", "mysql");
    doPerformanceTest(embeddedDb, props);
  }

  private void doPerformanceTest(IEmbeddedDb embeddedDb, Map<String,String> props)
      throws Exception {
    MemoryMXBean memoryMxBean = ManagementFactory.getMemoryMXBean();
    String testName = ClassUtils.getShortClassName(embeddedDb.getClass());
    if (props != null && "true".equals(props.get("useCachedTable"))) {
      testName += " [cached]";
    }
    embeddedDb.init(props);
    StopWatch watch = new StopWatch(testName);
    long startHeapUsage = memoryMxBean.getHeapMemoryUsage().getUsed();
    watch.start("load " + testName);
    embeddedDb.load(inputFile);
    watch.stop();
    long endHeapUsage = memoryMxBean.getHeapMemoryUsage().getUsed();
    long heapUsedByDb = endHeapUsage - startHeapUsage;
    watch.start("get " + testName);
    for (int i = 0; i < 1000; i++) {
      for (String term : terms) {
        String value = embeddedDb.get(term);
        assertNotNull(value);
      }
    }
    watch.stop();
    embeddedDb.destroy();
    System.out.println(watch.prettyPrint());
    System.out.println("Heap usage (bytes):" + heapUsedByDb);
  }
}

Baseline 1: In-memory HashMap

We first created an IEmbeddedDb.java interface, which all the implementations implement. Here it is:

// IEmbeddedDb.java
public interface IEmbeddedDb {
  public void init(Map<String,String> props) throws Exception;
  public void load(File file) throws Exception;
  public String get(String key);
  public void destroy() throws Exception;
}

Our HashMap implementation is trivial, and exists only to provide a lower bound for our results. Ideally, nothing should perform better than this, unless we are doing something terribly wrong. The code for InMemoryEmbeddedDb is here:

// InMemoryEmbeddedDb.java
public class InMemoryEmbeddedDb implements IEmbeddedDb {

  private HashMap<String,String> hashMap;

  public void init(Map<String, String> props) throws Exception {
    hashMap = new HashMap<String,String>();
  }

  public void load(File file) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String line = null;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#")) {
        // skip comment lines, as the other implementations do
        continue;
      }
      String[] nvp = line.split("=");
      hashMap.put(nvp[0], nvp[1]);
    }
    reader.close();
  }

  public String get(String key) {
    return hashMap.get(key);
  }

  public void destroy() throws Exception {}
}

HSQLDB : Memory table mode

Tables created with CREATE TABLE in HSQLDB are MEMORY tables by default, which means that the entire data set is held in memory. Obviously this does not buy us much, other than an SQL interface for reading and writing data in memory, and arguably that is not worth buying in our simple case. We abstract the common boilerplate JDBC code into an abstract superclass; each SQL implementation then only needs to pass its Connection object to the superclass.

// AbstractSqlDb.java
public class AbstractSqlDb {

  protected String createTableSql = "create table pairs(" +
      "keycol varchar(32) not null, " +
      "valcol varchar(32) not null, " +
      "primary key (keycol))";

  private Connection conn = null;
  private PreparedStatement psIns = null;
  private PreparedStatement psGet = null;

  protected void init(Connection conn) throws Exception {
    this.conn = conn;
    Statement ddlStmt = conn.createStatement();
    ddlStmt.execute(createTableSql);
    psIns = conn.prepareStatement("insert into pairs (keycol, valcol) values (?, ?)");
    psGet = conn.prepareStatement("select valcol from pairs where keycol = ?");
  }

  public void load(File file) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String line = null;
    conn.setAutoCommit(false);
    try {
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("#")) {
          continue;
        }
        String[] nvp = line.split("=");
        psIns.setString(1, nvp[0]);
        psIns.setString(2, nvp[1]);
        psIns.executeUpdate();
      }
      conn.commit();
    } catch (SQLException e) {
      conn.rollback();
    } finally {
      psIns.close();
      conn.setAutoCommit(true);
    }
    reader.close();
  }

  public String get(String key) {
    String value = null;
    try {
      psGet.setString(1, key);
      ResultSet rs = psGet.executeQuery();
      while (rs.next()) {
        value = rs.getString(1);
        break;
      }
      rs.close();
      return value;
    } catch (SQLException e) {
      return null;
    }
  }

  public void destroy() throws Exception {
    psGet.close();
    conn.close();
  }
}

// HsqlEmbeddedDb.java
public class HsqlEmbeddedDb extends AbstractSqlDb implements IEmbeddedDb {

  public void init(Map<String,String> props) throws Exception {
    new File(props.get("dbPropsFile")).delete();
    new File(props.get("dbLogFile")).delete();
    new File(props.get("dbScriptFile")).delete();
    Class.forName(props.get("driverClassName"));
    Connection conn = DriverManager.getConnection(props.get("jdbcUrl"),
      props.get("username"), props.get("password"));
    if ("true".equals(props.get("useCachedTable"))) {
      super.createTableSql = "create cached table pairs(" +
        "keycol varchar(32) not null, " +
        "valcol varchar(32) not null, " +
        "primary key (keycol))";
    }
    super.init(conn);
  }
}

HSQLDB : Cached table mode

The code for this implementation is the same as the previous one, except that we override the protected createTableSql variable to create a CACHED table explicitly. The documentation states that this may result in a loss of performance, but surprisingly, in my case it actually performed better.

Apache Derby : Embedded mode

According to the Derby FAQ, Derby does not support an in-memory mode. So we just implement a standard SQL implementation of our IEmbeddedDb interface.

// DerbyEmbeddedDb.java
public class DerbyEmbeddedDb extends AbstractSqlDb implements IEmbeddedDb {

  public void init(Map<String,String> props) throws Exception {
    try {
      FileUtils.cleanDirectory(new File(props.get("dbDir")));
      new File(props.get("dbDir")).delete();
    } catch (IllegalArgumentException e) {
      // no directory exists, nothing to do here
    }
    Class.forName(props.get("driverClassName"));
    Connection conn = DriverManager.getConnection(props.get("jdbcUrl"));
    super.init(conn);
  }
}

Berkeley DB

Berkeley DB has a proprietary interface to store and get data. William Grosso provides a nice tutorial in his "Berkeley DB, Java Edition 1: the basics" article on OnJava.

// BdbEmbeddedDb.java
public class BdbEmbeddedDb implements IEmbeddedDb {

  private Environment environment;
  private Database bdb;

  public void init(Map<String, String> props) throws Exception {
    EnvironmentConfig environmentConfig = new EnvironmentConfig();
    environmentConfig.setAllowCreate(true);
    File file = new File(props.get("dirName"));
    environment = new Environment(file, environmentConfig);
    DatabaseConfig databaseConfig = new DatabaseConfig();
    databaseConfig.setAllowCreate(true);
    bdb = environment.openDatabase(null, props.get("dbName"), databaseConfig);
  }

  public void load(File file) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String line = null;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#")) {
        continue;
      }
      String[] pair = line.split("=");
      DatabaseEntry key = new DatabaseEntry(pair[0].getBytes());
      DatabaseEntry value = new DatabaseEntry(pair[1].getBytes());
      bdb.put(null, key, value);
    }
    reader.close();
  }

  public String get(String key) {
    DatabaseEntry keyEntry = new DatabaseEntry(key.getBytes());
    DatabaseEntry valueEntry = new DatabaseEntry();
    try {
      OperationStatus status = bdb.get(null, keyEntry, valueEntry, LockMode.DEFAULT);
      if (status == OperationStatus.SUCCESS) {
        return new String(valueEntry.getData());
      } else {
        return null;
      }
    } catch (DatabaseException e) {
      return null;
    }
  }

  public void destroy() throws Exception {
    bdb.close();
    environment.close();
  }
}
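For readers who want a quick feel for this style of byte-oriented put/get API without setting up BDB JE, Python's standard dbm module behaves loosely like it; this sketch is my own analogy, not Berkeley DB itself:

```python
import dbm
import os
import tempfile

# Open (creating if necessary) a small key-value store in a temp directory;
# keys and values are raw bytes, loosely like BDB's DatabaseEntry pairs.
path = os.path.join(tempfile.mkdtemp(), "bpairdb")
db = dbm.open(path, "c")
db[b"hello"] = b"world"
value = db[b"hello"]
print(value)  # b'world'
db.close()
```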

Baseline 2: MySQL

My final implementation is based on MySQL, in order to provide a baseline at the other end. MySQL is widely regarded as one of the fastest relational databases around, so any embedded solution that performs worse than MySQL is probably not worth considering.

// MySqlDb.java
public class MySqlDb extends AbstractSqlDb implements IEmbeddedDb {

  public void init(Map<String, String> props) throws Exception {
    Class.forName(props.get("driverClassName"));
    Connection conn = DriverManager.getConnection(props.get("jdbcUrl"),
      props.get("username"), props.get("password"));
    Statement dropTableStmt = conn.createStatement();
    // IF EXISTS prevents an error when the table is absent on the first run
    dropTableStmt.execute("drop table if exists pairs");
    super.init(conn);
  }
}

Results

The results of running our EmbeddedDbTest against each of the implementations are shown below. The query time is the total time to run 1000 rounds of lookups against each database.

Implementation      Version    Load time (ms)   Query time (ms)   Approx. memory (bytes)
In-memory HashMap   N/A                    89                 4                   567944
HSQLDB (memory)     1.8.0.7               167               156                   460984
HSQLDB (cached)     1.8.0.7               155                81                   485096
Berkeley DB JE      2.1.30                762               205                   255680
Apache Derby        10.2.2.0             1034              2169                   255624
MySQL               4.1.12                704              2755                   152096

Conclusion

I had originally thought that Berkeley DB would come in first, followed by Apache Derby, followed by HSQLDB. In some respects, this assumption was based on this blog entry by David Van Couvering. I mean, if Oracle benchmarks its product against Derby, Derby must be a worthy competitor, right? Derby turned out to be a dog, though, only narrowly beating out a standard MySQL database.

Based on my results, however, the best performer turned out to be HSQLDB. The first result is easy to understand: with an in-memory table (all data in memory), HSQLDB compares very unfavorably with a HashMap, as expected, because of the overhead of an SQL engine. In cached table mode, however, the behavior is similar to Berkeley DB: only part of the data is in memory, and more is pulled in from disk as needed. I had expected Berkeley DB to perform better here because it does not have the overhead of an SQL engine, but my results proved otherwise. I do notice that the memory usage for Berkeley DB is lower than HSQLDB's, so the results may be different for larger data sets.

Obviously, there are some important caveats. I am not using the latest versions of MySQL and Berkeley DB JE. There is no sinister reason for this other than pure laziness: I already happened to have MySQL 4.1.12 installed on my laptop, and Berkeley DB JE 2.1.30 was the latest version in the Maven2 repository. It is quite possible that the latest versions of these databases would perform much better, but this was a quick and dirty evaluation I did for my own knowledge; feel free to run the provided code against later versions and send me updates. The second caveat is that I don't know much about tuning Apache Derby, so it's possible that the out-of-the-box configuration is not tuned for performance. For what it is worth, though, I know nothing about tuning HSQLDB either, and its numbers are also based on what came out of the box.

Saturday, February 03, 2007

Three Autocomplete implementations compared

Many web sites are now offering forms which suggest completions as you type. For example, in a form to hold the name of a US state, typing in "C" will pop up a list box containing ["California", "Colorado"]. Subsequently typing in an "a" will decrease the options in the list box to only ["California"]. Hitting the ENTER key will populate the field with "California". One of the most popular (although probably not the first) implementations is Google Suggest.

This feature is certainly very helpful from the user's perspective, since it saves keystrokes and enables him to get his job done faster. A side effect is that the list of completions aids in the process of discovery. For the user, it could mean that he gets to pages which he would not have looked at otherwise. For the site owner, it means that the site is more "sticky", thus translating into more page views and advertising dollars for sites that depend on advertising.

I have been curious about how auto-complete works, although the curiosity did not translate into actual code till recently. Obviously, AJAX is part of the equation, since each keystroke event in the form needs to be captured and sent back to the server and the possible completions returned and displayed in the scope of a single request. I was more interested, however, in how the server-side component can be built to efficiently return the results it needed to.

Over the last week, I came up with three possible implementations to do auto-completions on the file names in my ~/tmp directory. There are about 280 files in there, so this is nothing compared to what production quality auto-completion components will have to serve on real websites, but it could be a starting point for better ideas. I enumerate them here, with code, and some relative performance numbers.

In-Memory Trie

Tries are specialized tree data structures in which a word is stored as a sequence of characters, one per node; reading a word back involves traversing down a branch of the tree. At each node, the possible completions of the partial word can be found by following all possible paths down to the leaf level. This seemed ideal for modeling auto-completion, which is why I chose it. A Trie is modeled as a collection of TrieNode objects; a TrieNode is basically the current character plus a Map of child TrieNodes. Here is the code:

// Trie.java
public class Trie {

  private TrieNode rootNode;

  public Trie() {
    super();
    rootNode = new TrieNode(' ');
  }

  public void load(String phrase) {
    loadRecursive(rootNode, phrase + "$");
  }

  private void loadRecursive(TrieNode node, String phrase) {
    if (StringUtils.isBlank(phrase)) {
      return;
    }
    char firstChar = phrase.charAt(0);
    node.add(firstChar);
    TrieNode childNode = node.getChildNode(firstChar);
    if (childNode != null) {
      loadRecursive(childNode, phrase.substring(1));
    }
  }

  public boolean matchPrefix(String prefix) {
    TrieNode matchedNode = matchPrefixRecursive(rootNode, prefix);
    return (matchedNode != null);
  }

  private TrieNode matchPrefixRecursive(TrieNode node, String prefix) {
    if (StringUtils.isBlank(prefix)) {
      return node;
    }
    char firstChar = prefix.charAt(0);
    TrieNode childNode = node.getChildNode(firstChar);
    if (childNode == null) {
      // no match at this char, exit
      return null;
    } else {
      // go deeper
      return matchPrefixRecursive(childNode, prefix.substring(1));
    }
  }

  public List<String> findCompletions(String prefix) {
    TrieNode matchedNode = matchPrefixRecursive(rootNode, prefix);
    List<String> completions = new ArrayList<String>();
    findCompletionsRecursive(matchedNode, prefix, completions);
    return completions;
  }

  private void findCompletionsRecursive(TrieNode node, String prefix, List<String> completions) {
    if (node == null) {
      // our prefix did not match anything, just return
      return;
    }
    if (node.getNodeValue() == '$') {
      // end reached, append prefix into completions list. Do not append
      // the trailing $, that is only to distinguish words like ann and anne
      // into separate branches of the tree.
      completions.add(prefix.substring(0, prefix.length() - 1));
      return;
    }
    Collection<TrieNode> childNodes = node.getChildren();
    for (TrieNode childNode : childNodes) {
      char childChar = childNode.getNodeValue();
      findCompletionsRecursive(childNode, prefix + childChar, completions);
    }
  }

  public String toString() {
    return "Trie:" + rootNode.toString();
  }
}

// TrieNode.java
public class TrieNode {

  private Character character;
  private HashMap<Character,TrieNode> children;

  public TrieNode(char c) {
    super();
    this.character = new Character(c);
    children = new HashMap<Character,TrieNode>();
  }

  public char getNodeValue() {
    return character.charValue();
  }

  public Collection<TrieNode> getChildren() {
    return children.values();
  }

  public Set<Character> getChildrenNodeValues() {
    return children.keySet();
  }

  public void add(char c) {
    if (children.get(new Character(c)) == null) {
      // children does not contain c, add a TrieNode
      children.put(new Character(c), new TrieNode(c));
    }
  }

  public TrieNode getChildNode(char c) {
    return children.get(new Character(c));
  }

  public boolean contains(char c) {
    return (children.get(new Character(c)) != null);
  }

  public int hashCode() {
    return character.hashCode();
  }

  public boolean equals(Object obj) {
    if (!(obj instanceof TrieNode)) {
      return false;
    }
    TrieNode that = (TrieNode) obj;
    return (this.getNodeValue() == that.getNodeValue());
  }

  public String toString() {
    return ReflectionToStringBuilder.reflectionToString(this, ToStringStyle.DEFAULT_STYLE);
  }
}
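To make the shape of the structure concrete, here is the same idea as a throwaway dict-of-dicts sketch in Python (my own, separate from the Java code above); it shows how the '$' terminator lets "ann" and "anne" live on the same branch:

```python
# Dict-of-dicts trie sketch; each node maps a character to a child node.
END = "$"  # terminator, so "ann" survives as a word inside "anne"

def load(root, phrase):
    node = root
    for ch in phrase + END:
        node = node.setdefault(ch, {})

def find_completions(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node:
            return []  # prefix not present in the trie
        node = node[ch]
    completions = []
    def walk(node, sofar):
        # depth-first walk below the matched node, collecting finished words
        for ch, child in node.items():
            if ch == END:
                completions.append(prefix + sofar)
            else:
                walk(child, sofar + ch)
    walk(node, "")
    return completions

root = {}
for word in ("ann", "anne", "anna"):
    load(root, word)
print(sorted(find_completions(root, "ann")))  # ['ann', 'anna', 'anne']
```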

In-Memory Relational Database

Although the previous implementation works fine, it is not very readable. This got me thinking: all we are doing is prefix queries on the passed-in word, so a database-backed implementation is possible. We use an in-memory database such as HSQLDB to keep the performance comparable to the Trie implementation. Here is the code for this implementation.

// DbTrie.java
public class DbTrie {

  private static final String DB_NAME = "/tmp/lsdb";

  private static final String LOAD_SQL = "insert into ls(name) values (?)";
  private static final String MATCH_SQL = "select count(*) from ls where name like '%prefix%%'";
  private static final String FIND_SQL = "select name from ls where name like '%prefix%%'";

  private JdbcTemplate jdbcTemplate;
  private boolean loadData = false;

  public DbTrie() throws Exception {
    DriverManagerDataSource dataSource = new DriverManagerDataSource();
    dataSource.setDriverClassName("org.hsqldb.jdbcDriver");
    dataSource.setUrl("jdbc:hsqldb:file:" + DB_NAME);
    dataSource.setUsername("sa");
    dataSource.setPassword("");
    jdbcTemplate = new JdbcTemplate(dataSource);
    if (! (new File(DB_NAME + ".properties").exists())) {
      jdbcTemplate.execute("create table ls(name varchar(64) not null, primary key(name));");
      loadData = true;
    }
  }

  public void load(String line) {
    if (loadData) {
      jdbcTemplate.update(LOAD_SQL, new String[] {line});
    }
  }

  public boolean matchPrefix(String prefix) {
    int numMatches = jdbcTemplate.queryForInt(MATCH_SQL.replaceAll("%prefix%", prefix));
    return (numMatches > 0);
  }

  @SuppressWarnings("unchecked")
  public List<String> findCompletions(String prefix) {
    List<String> completions = new ArrayList<String>();
    List<Map<String,String>> rows = jdbcTemplate.queryForList(
      FIND_SQL.replaceAll("%prefix%", prefix));
    for (Map<String,String> row : rows) {
      completions.add(row.get("NAME"));
    }
    return completions;
  }
}
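The same prefix LIKE query can be tried out quickly against Python's built-in sqlite3 module; this is my own stand-in for illustration, not HSQLDB, and it binds the prefix as a parameter instead of the replaceAll() substitution used above:

```python
import sqlite3

# In-memory table of names, queried with a prefix LIKE, as in DbTrie.
conn = sqlite3.connect(":memory:")
conn.execute("create table ls (name varchar(64) not null primary key)")
conn.executemany("insert into ls(name) values (?)",
                 [("apple.txt",), ("apricot.txt",), ("banana.txt",)])

def find_completions(prefix):
    # bind prefix + '%' as a parameter rather than splicing it into the SQL
    cur = conn.execute("select name from ls where name like ? order by name",
                       (prefix + "%",))
    return [row[0] for row in cur]

print(find_completions("ap"))  # ['apple.txt', 'apricot.txt']
```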

Java Set

My final implementation is a plain old java.util.TreeSet. We need a way to jump quickly to the position where the remaining entries are lexicographically greater than or equal to our input, and then iterate until the entries no longer start with our input; a TreeSet seemed to be the ideal data structure for this. Here is the code:

// SetTrie.java
public class SetTrie {

  private TreeSet<String> lines;

  public SetTrie() {
    lines = new TreeSet<String>();
  }

  public void load(String line) {
    lines.add(line);
  }

  public boolean matchPrefix(String prefix) {
    Set<String> tailSet = lines.tailSet(prefix);
    for (String tail : tailSet) {
      if (tail.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }

  public List<String> findCompletions(String prefix) {
    List<String> completions = new ArrayList<String>();
    Set<String> tailSet = lines.tailSet(prefix);
    for (String tail : tailSet) {
      if (tail.startsWith(prefix)) {
        completions.add(tail);
      } else {
        break;
      }
    }
    return completions;
  }
}
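The tailSet()-and-scan idea translates almost directly into a sorted list plus binary search; here is a Python rendering (mine, for illustration) using the bisect module:

```python
import bisect

def find_completions(sorted_words, prefix):
    # jump to the first entry >= prefix, like TreeSet.tailSet(prefix)
    start = bisect.bisect_left(sorted_words, prefix)
    completions = []
    for word in sorted_words[start:]:
        if word.startswith(prefix):
            completions.append(word)
        else:
            break  # lexicographically past the prefix block, stop scanning
    return completions

words = sorted(["california", "colorado", "connecticut", "delaware"])
print(find_completions(words, "co"))  # ['colorado', 'connecticut']
```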

Performance

Each implementation was tested with the standard call sequence and the timing information captured. The call sequence is shown below:

  Trie trie = new Trie();
  //DbTrie trie = new DbTrie();
  //SetTrie trie = new SetTrie();
  for (String line : lines) { // contains filenames from ~/tmp/
    trie.load(line);
  }
  for (int i = 1; i < TEST_STRING_LENGTH; i++) {
    String prefix = TEST_STRING.substring(0, i);
    List<String> completions = trie.findCompletions(prefix);
  }

Here are some performance numbers for each of the three implementations. As expected, the Trie implementation (being a specialized data structure) has the fastest lookups, and the HSQLDB implementation has the slowest (because of the overhead of an SQL engine). The Set-based implementation is the easiest to understand, but its lookups are not quite as quick as the Trie's.

Implementation   Setup (ms)   Average lookup time (ms)
Trie                     82                       0.03
HSQLDB                  192                       1.87
Set                       6                       0.06

Conclusion

All the implementations above are in-memory implementations. For large data sets, loading the data into memory could impact startup times significantly, and there is also the concern of a huge memory footprint in the application servers. While these may be workable implementations for small sites, they would probably not be suitable for large sites, even though they do offer response times measured in fractions of a millisecond.