Salmon Run: June 2009

Sometime back, I wrote about writing my first Hadoop program, where I was trying to compute word co-occurrence frequencies in the queries submitted to our search engine by parsing out the query terms from our HTTP access.log files. Recently, a reader sent me a private email asking if I had something more robust - what she was after was parsing out and computing browser/version frequencies. Unfortunately, my parser was limited to parsing the NCSA common log format, and was not very easy to extend to do what she needed.

That set me thinking, and I decided to rebuild the parser so it is more general. In the process, I came up with three different versions, each of which is slightly different. I thought it may be interesting to describe them all here, so here they are.

Based on StringTokenizer.
Based on Scanner.
Based on looping through a char array.

Version 1: StringTokenizer

This version is a small step up from the zeroth version that I described in my earlier blog post. Instead of hard coding a bunch of nextToken() calls in sequence, I loop through the tokens using whitespace as a delimiter. If a token starts with a quote mark (") or an opening square bracket ([), then I continue to accumulate tokens till I find one which ends in a quote mark (") or a closing square bracket (]) respectively. Here's the code:

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParser.java
package net.sf.jtmt.concurrent.hadoop;

import java.util.EnumMap;
import java.util.StringTokenizer;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * This class parses the NCSA common and combined access.log formats
 * and returns a Map of {Field, Value}. Some of the fields are already
 * separated, such as the components of the [date:time timezone] field,
 * and the components of the request field. However, other fields, such
 * as the user_agent field is not because I was not sure if the format
 * is a standard. If it is, then it can be separated out similar to the
 * date:time timezone and the request group fields.
 * 
 * The format is a predefined array of Field objects, so if the common
 * or combined format parsing is desired, then these formats can be 
 * passed to the parser by name. If trailing fields are specified in
 * the format but are missing in the data, they will be silently 
 * ignored. You can also specify a custom field format if your data
 * has missing fields or additional fields compared to either the
 * common or combined log format. If you don't care about the value
 * to be parsed, specify Field.IGNORE at that position. If you want the
 * values to be parsed out, then there are four fields currently, called
 * Field.CUSTOM_1, ..., Field.CUSTOM_4, that can be used.
 * 
 * Format of NCSA common log file format (access_log):
 *   host rfc931 username date:time timezone request statuscode bytes
 * Example:
 *   125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] \
 *     "GET /index.html HTTP/1.0" 200 1043
 * Format of NCSA combined log file format (access_log):
 *   host rfc931 username date:time timezone request statuscode bytes \
 *     referrer useragent cookie
 * Example:
 *   125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] \
 *     "GET /index.html HTTP/1.0" 200 1043 \
 *     "-" "Mozilla/5.0 (X11; ...)" "USER_ID:..."
 * We can also parse a custom file format if the array of fields are 
 * specified in the constructor.
 */
public class NcsaLogParser {
  
  private static final Log log = LogFactory.getLog(NcsaLogParser.class);
  
  public enum Field {
    HOST,
    PROTOCOL,
    USERNAME,
    DATE,
    TIME,
    TIMEZONE,
    REQUEST_METHOD,
    REQUEST_URL,
    REQUEST_PROTOCOL,
    STATUS_CODE,
    BYTE_COUNT,
    REFERER,
    USER_AGENT,
    COOKIE,
    // upto 4 slots for holding custom fields
    CUSTOM_1,
    CUSTOM_2,
    CUSTOM_3,
    CUSTOM_4,
    // a common slot for ignorable fields, can be applied
    // multiple times
    IGNORE
  };
  public static final Field[] COMMON_LOG_FORMAT = new Field[] {
    Field.HOST,
    Field.PROTOCOL,
    Field.USERNAME,
    Field.DATE,
    Field.TIME,
    Field.TIMEZONE,
    Field.REQUEST_METHOD,
    Field.REQUEST_URL,
    Field.REQUEST_PROTOCOL,
    Field.STATUS_CODE,
    Field.BYTE_COUNT
  };
  public static final Field[] COMBINED_LOG_FORMAT = new Field[] {
    Field.HOST,
    Field.PROTOCOL,
    Field.USERNAME,
    Field.DATE,
    Field.TIME,
    Field.TIMEZONE,
    Field.REQUEST_METHOD,
    Field.REQUEST_URL,
    Field.REQUEST_PROTOCOL,
    Field.STATUS_CODE,
    Field.BYTE_COUNT,
    Field.REFERER,
    Field.USER_AGENT,
    Field.COOKIE,
  };
  
  public static EnumMap<Field,String> parse(String logline, 
      Field[] format) {
    EnumMap<Field,String> values = 
      new EnumMap<Field, String>(Field.class);
    StringTokenizer tokenizer = new StringTokenizer(logline, " ");
    int fieldIdx = 0;
    StringBuilder buf = new StringBuilder();
    boolean inCompoundField = false;
    boolean inDateCompoundField = false;
    boolean inRequestCompoundField = false;
    while (tokenizer.hasMoreTokens()) {
      String token = tokenizer.nextToken();
      if (token.startsWith("[") && 
          format[fieldIdx] == Field.DATE 
          && (! inCompoundField)) {
        // if token starts with "[", this is part of the date:time timezone
        // triple. Keep accumulating until we find a token that ends with
        // "]".
        inDateCompoundField = true;
        buf.append(token).append(" ");
        continue;
      } else if (token.endsWith("]") && inDateCompoundField) {
        // ending token for "[xxx]" date:time timezone field, parse out
        // date time and timezone and post into the enum map.
        buf.append(token);
        String chopped = chopEnds(buf.toString(), 1, 1);
        int firstColonPos = chopped.indexOf(':');
        int firstSpacePos = chopped.indexOf(' ');
        values.put(Field.DATE, chopped.substring(0, firstColonPos));
        values.put(Field.TIME, 
          chopped.substring(firstColonPos + 1, firstSpacePos));
        values.put(Field.TIMEZONE, chopped.substring(firstSpacePos + 1));
        fieldIdx += 3;
        inDateCompoundField = false;
        buf = new StringBuilder();
        continue;
      } else if (token.startsWith("\"") && token.endsWith("\"")) {
        // this is a complete token, so no need to accumulate in this case
        values.put(format[fieldIdx], chopEnds(token, 1, 1));
      } else if (token.startsWith("\"") && 
          format[fieldIdx] == Field.REQUEST_METHOD && 
          (! inRequestCompoundField)) {
        // start of request group "method uri protocol", accumulate
        inRequestCompoundField = true;
        buf.append(token).append(" ");
        continue;
      } else if (token.endsWith("\"") && inRequestCompoundField) {
        // end of request group, parse and post into the enum map.
        buf.append(token);
        String[] subtokens = 
          StringUtils.split(chopEnds(buf.toString(), 1, 1), " ");
        values.put(Field.REQUEST_METHOD, subtokens[0]);
        values.put(Field.REQUEST_URL, subtokens[1]);
        values.put(Field.REQUEST_PROTOCOL, subtokens[2]);
        fieldIdx += 3;
        inRequestCompoundField = false;
        buf = new StringBuilder();
        continue;
      } else if (token.startsWith("\"") && (! inCompoundField)) {
        // if token starts with "\"", then accumulate into buffer till
        // we find a token which ends with "\"", then parse and post 
        // into enum map.
        inCompoundField = true;
        buf.append(token).append(" ");
        continue;
      } else if (token.endsWith("\"") && inCompoundField) {
        // parse and post into enum map
        buf.append(token);
        values.put(format[fieldIdx], chopEnds(buf.toString(), 1, 1));
        inCompoundField = false;
        buf = new StringBuilder();
      } else if (inCompoundField || inDateCompoundField || 
          inRequestCompoundField) {
        // we are inside a compound field, just accumulate
        buf.append(token).append(" ");
        continue;
      } else {
        // non-compound token is found, post into enum map
        values.put(format[fieldIdx], token);
      }
      fieldIdx++;
    }
    return values;
  }
  
  private static String chopEnds(String buf, 
      int numLeadingCharsToChop, int numTrailingCharsToChop) {
    return buf.substring(numLeadingCharsToChop, 
      buf.length() - numTrailingCharsToChop);
  }
}

As you can see, there are two predefined formats available in the parser. For custom formats, you can specify four custom fields (Field.CUSTOM_1..4) and any number of ignorable fields (Field.IGNORE). Here is a test case which exercises the code.

// Source: src/test/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParserTest.java
package net.sf.jtmt.concurrent.hadoop;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.EnumMap;
import java.util.Map;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

import net.sf.jtmt.concurrent.hadoop.NcsaLogParser.Field;;

/**
 * Test for NCSA Log parser.
 */
public class NcsaLogParserTest {

  private final Log log = LogFactory.getLog(getClass());

  private Field[] myCustomLogFormat = new Field[] {
    Field.HOST,
    Field.PROTOCOL,
    Field.USERNAME,
    Field.DATE,
    Field.TIME,
    Field.TIMEZONE,
    Field.REQUEST_METHOD,
    Field.REQUEST_URL,
    Field.REQUEST_PROTOCOL,
    Field.STATUS_CODE,
    Field.BYTE_COUNT,
    Field.REFERER,
    Field.CUSTOM_1, // we have a custom field in here
    Field.USER_AGENT,
    Field.COOKIE
  };

  @Test
  public void testParsing() throws Exception {
    File[] accesslogs = (new File(
      "src/test/resources/access_logs")).listFiles();
    for (File accesslog : accesslogs) {
      String line;
      BufferedReader reader = new BufferedReader(new FileReader(accesslog));
      while ((line = reader.readLine()) != null) {
        EnumMap<Field,String> elements = 
          NcsaLogParser.parse(line, myCustomLogFormat);
        String url = elements.get(Field.REQUEST_URL);
        if (url.contains("/search")) {
          printResult(line, "CUSTOM_LOG_FORMAT", elements);
        }
      }
      reader.close();
    }
  }
  
  @Test
  public void testParseCommonLogFormat() throws Exception {
    String line = "192.168.123.1 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, NcsaLogParser.COMMON_LOG_FORMAT);
    printResult(line, "COMMON_LOG_FORMAT", values);
  }
  
  @Test
  public void testParseCombinedLogFormatNoCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, NcsaLogParser.COMBINED_LOG_FORMAT);
    printResult(line, "COMBINED_LOG_FORMAT(1)", values);
  }
  
  @Test
  public void testParseCombinedLogFormatWithCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\" " + 
      "\"USER_ID=12345,jsession_id=3BFY342211\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, NcsaLogParser.COMBINED_LOG_FORMAT);
    printResult(line, "COMBINED_LOG_FORMAT(2)", values);
  }
  
  @Test
  public void testParseCustomLogFormat() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" 34567 \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, myCustomLogFormat);
    printResult(line, "CUSTOM_LOG_FORMAT", values);
  }
  
  private void printResult(String line, String format,
       EnumMap<Field,String> result) {
    log.info(">>> " + line);
    log.info("format = " + format);
    log.info("components = {");
    for (Field key : result.keySet()) {
      log.info("  " + key.name() + " => " + result.get(key));
    }
    log.info("}");
  }
}

The output of the tests above is shown below:

>>> 192.168.123.1 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323
format = COMMON_LOG_FORMAT
components = {
  HOST => 192.168.123.1
  PROTOCOL => -
  USERNAME => -
  DATE => 19/Oct/2008
  TIME => 19:45:38
  TIMEZONE => -0700
  REQUEST_METHOD => GET
  REQUEST_URL => /search?q1=foo&st=bar
  REQUEST_PROTOCOL => HTTP/1.1
  STATUS_CODE => 200
  BYTE_COUNT => 323
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" "Mozilla/5.0 (X11; U; \
Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 \
Firefox/2
format = COMBINED_LOG_FORMAT(1)
components = {
  HOST => 192.168.123.12
  PROTOCOL => -
  USERNAME => -
  DATE => 19/Oct/2008
  TIME => 19:45:38
  TIMEZONE => -0700
  REQUEST_METHOD => GET
  REQUEST_URL => /search?q1=foo&st=bar
  REQUEST_PROTOCOL => HTTP/1.1
  STATUS_CODE => 200
  BYTE_COUNT => 323
  REFERER => -
  USER_AGENT => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \
Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" "Mozilla/5.0 (X11; U; Linux \
i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2
format = COMBINED_LOG_FORMAT(2)
components = {
  HOST => 192.168.123.12
  PROTOCOL => -
  USERNAME => -
  DATE => 19/Oct/2008
  TIME => 19:45:38
  TIMEZONE => -0700
  REQUEST_METHOD => GET
  REQUEST_URL => /search?q1=foo&st=bar
  REQUEST_PROTOCOL => HTTP/1.1
  STATUS_CODE => 200
  BYTE_COUNT => 323
  REFERER => -
  USER_AGENT => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \
Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
  COOKIE => USER_ID=12345,jsession_id=3BFY342211
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" 34567 "Mozilla/5.0 (X11; U; \
Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 \
Firefox/2.0.0.14
format = CUSTOM_LOG_FORMAT
components = {
  HOST => 192.168.123.12
  PROTOCOL => -
  USERNAME => -
  DATE => 19/Oct/2008
  TIME => 19:45:38
  TIMEZONE => -0700
  REQUEST_METHOD => GET
  REQUEST_URL => /search?q1=foo&st=bar
  REQUEST_PROTOCOL => HTTP/1.1
  STATUS_CODE => 200
  BYTE_COUNT => 323
  REFERER => -
  USER_AGENT => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \
Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
  CUSTOM_1 => 34567
}

Version 2: Scanner

As you can see, the code in the first version does the job, but there is a bit of code in the while(tokenizer.hasMoreTokens()) loop which seemed a bit gnarly to me, so I decided to see if it could be improved. I hit upon the idea of using java.util.Scanner and so my second version is based on that. The parser code does become much less complex, but the format is specified in terms of regular expressions, which can make the parser hard to use if you are using a custom format.

The predefined formats may not work all the time either - I've tested it on my corpus of access.log files, and I did uncover some edge cases for which the initial regexps were failing, so I updated them - but I'm not sure I got them all. Anyway, here is the code:

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParser.java
package net.sf.jtmt.concurrent.hadoop;

import java.util.EnumMap;
import java.util.Scanner;
import java.util.regex.MatchResult;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * This class parses the NCSA common and combined access.log formats
 * and returns a Map of {Field, Value}. Some of the fields are already
 * separated, such as the components of the [date:time timezone] field,
 * and the components of the request field. However, other fields, such
 * as the user_agent field is not because I was not sure if the format
 * is a standard. If it is, then it can be separated out similar to the
 * date:time timezone and the request group fields.
 */
public class NcsaLogParser {
  
  private static final Log log = LogFactory.getLog(NcsaLogParser.class);
  
  public static EnumMap<Format.Field,String> parse(
      String logline, Format format) {
    EnumMap<Format.Field,String> tokens = 
      new EnumMap<Format.Field,String>(Format.Field.class); 
    Scanner scanner = new Scanner(logline);
    scanner.findInLine(format.getPattern());
    MatchResult result = scanner.match();
    for (int i = 0; i < format.getFieldNames().length; i++) {
      tokens.put(format.getFieldNames()[i], result.group(i + 1));
    }
    scanner.close();
    return tokens;
  }
}

As you can see, the code is much shorter - it works by matching the entire line as a single regular expression, so much of the complexity (for the predefined formats) is now in the Format class, which is shown below:

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/Format.java
package net.sf.jtmt.concurrent.hadoop;

import java.util.regex.Pattern;

/**
 * A holder class to hold predefined or user-defined access log
 * formats. Formats are defined in terms of regular expression 
 * snippets. Predefined formats include the NCSA common log format
 * and the NCSA combined log format.
 * 
 * Format of NCSA common log file format (access_log):
 *   host rfc931 username date:time timezone request statuscode bytes
 * Example:
 *   125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] \
 *     "GET /index.html HTTP/1.0" 200 1043
 * 
 * Format of NCSA combined log file format (access_log):
 *   host rfc931 username date:time timezone request statuscode bytes \
 *     referrer useragent cookie
 * Example:
 *   125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] \
 *     "GET /index.html HTTP/1.0" 200 1043 \
 *     "-" "Mozilla/5.0 (X11; ...)" "USER_ID:..."
 *
 * The Field enum contains all available field names from the Apache
 * mod_log_config page. A custom log format object can be built up
 * using these names and the appropriate regular expressions. It does 
 * require a fair bit of work to build a custom format, but once built
 * the implementation is pretty stable.
 */
public class Format {

  public enum Field {
    remote_host,
    remote_port,
    local_host,
    rfc931,
    username,
    date,
    time,
    timezone,
    request_method,
    request_uri,
    request_protocol,
    request_query_string,
    response_status_code,
    connection_status,
    bytes_recd,
    bytes_sent,
    response_time_micros,
    referer,
    user_agent,
    cookie,
    pid,
    tid,
  };

  public static final Format COMMON_LOG_FORMAT = new Format(
    "([a-z0-9.]+) " +                 // remote_host, dotted quad or resolved
    "([a-z0-9_-]+) " +                // rfc921 usually -
    "([a-z0-9_-]+) " +                // username if identified, else -
    "\\[(\\d{1,2}/\\w{3}/\\d{4}):" +  // date part
    "(\\d{1,2}:\\d{1,2}:\\d{1,2}) " + // time part
    "([+-]\\d{4})\\] " +              // timezone part
    "\"([A-Z]{3,5}) " +               // request method
    "(/[^ ]+) " +                     // request uri
    "([A-Z]+/\\d\\.\\d)\" " +         // request protocol
    "(\\d+) " +                       // response status code
    "(-|\\d+)",                       // bytes received
    new Field[] {
      Field.remote_host,
      Field.rfc931,
      Field.username,
      Field.date,
      Field.time,
      Field.timezone,
      Field.request_method,
      Field.request_uri,
      Field.request_protocol,
      Field.response_status_code,
      Field.bytes_recd
  });
  public static final Format COMBINED_LOG_FORMAT = new Format(
      "([a-z0-9.]+) " +                 // remote_host, dotted quad or
                                        // resolved
      "([a-z0-9_-]+) " +                // rfc921 usually -
      "([a-z0-9_-]+) " +                // username if identified, else -
      "\\[(\\d{1,2}/\\w{3}/\\d{4}):" +  // date part
      "(\\d{1,2}:\\d{1,2}:\\d{1,2}) " + // time part
      "([+-]\\d{4})\\] " +              // timezone part
      "\"([A-Z]{3,5}) " +               // request method
      "(/[^ ]+) " +                     // request uri
      "([A-Z]+/\\d\\.\\d)\" " +         // request protocol
      "(\\d+) " +                       // response status code
      "(-|\\d+) " +                     // bytes received
      "\"([^\"]+)\" " +                 // referer, dotted quad or resolved
                                        // or '-'
      "\"([^\"]+)\" " +                 // user-agent string
      "\"([^\"]+)\"",                   // cookie nvps
      new Field[] {
        Field.remote_host,
        Field.rfc931,
        Field.username,
        Field.date,
        Field.time,
        Field.timezone,
        Field.request_method,
        Field.request_uri,
        Field.request_protocol,
        Field.response_status_code,
        Field.bytes_recd,
        Field.referer,
        Field.user_agent,
        Field.cookie
  });
  
  private Pattern pattern;
  private Field[] fieldNames;
  
  public Format(String pattern, Field[] fieldNames) {
    this.pattern = Pattern.compile(pattern);
    this.fieldNames = fieldNames;
  }
  
  public Pattern getPattern() {
    return pattern;
  }
  
  public void setPattern(Pattern pattern) {
    this.pattern = pattern;
  }
  
  public Field[] getFieldNames() {
    return fieldNames;
  }

  public void setFieldNames(Field[] fieldNames) {
    this.fieldNames = fieldNames;
  }
}

I moved the Field enum into the Format class. This time the enum contains an exhaustive list of field names, so there is no need to specify the custom classes. The NcsaLogParser.parse() signature has changed to include the Format class instead of an array of Field[] enums. The unit test is similar to the previous version, except for the signature change.

// Source: src/test/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParserTest.java
package net.sf.jtmt.concurrent.hadoop;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.EnumMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.sf.jtmt.concurrent.hadoop.Format.Field;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

/**
 * Test for NCSA Log parser.
 */
public class NcsaLogParserTest {

  private final Log log = LogFactory.getLog(getClass());

  private Format myCustomLogFormat = new Format(
    "([a-z0-9.]+) " +                 // remote_host, dotted quad or resolved
    "([a-z0-9_-]+) " +                // rfc921 usually -
    "([a-z0-9_-]+) " +                // username if identified, else -
    "\\[(\\d{1,2}/\\w{3}/\\d{4}):" +  // date part
    "(\\d{1,2}:\\d{1,2}:\\d{1,2}) " + // time part
    "(-\\d{4})\\] " +                 // timezone part
    "\"([A-Z]{3,5}) " +               // request method
    "(/[^ ]+) " +                     // request uri
    "([A-Z]+/\\d\\.\\d)\" " +         // request protocol
    "(\\d+) " +                       // response status code
    "(-|\\d+) " +                     // bytes received
    "\"([^\"]+)\" " +                 // referer, dotted quad or resolved 
                                      // or '-'
    "(\\d+) " +                       // custom: response time in microsecs
    "\"([^\"]+)\"",                   // user-agent string
    new Field[] {
      Field.remote_host,
      Field.rfc931,
      Field.username,
      Field.date,
      Field.time,
      Field.timezone,
      Field.request_method,
      Field.request_uri,
      Field.request_protocol,
      Field.response_status_code,
      Field.bytes_recd,
      Field.referer,
      Field.response_time_micros, // our custom field here
      Field.user_agent,
  });

  @Test
  public void testParsing() throws Exception {
    File[] accesslogs = 
      (new File("src/test/resources/access_logs")).listFiles();
    for (File accesslog : accesslogs) {
      String line;
      BufferedReader reader = new BufferedReader(new FileReader(accesslog));
      while ((line = reader.readLine()) != null) {
        EnumMap<Field,String> elements = 
          NcsaLogParser.parse(line, myCustomLogFormat);
        String url = elements.get(Field.request_uri);
        if (url.contains("/search")) {
          printResult(line, "CUSTOM_LOG_FORMAT", elements);
        }
      }
      reader.close();
    }
  }

  @Test
  public void testParseCommonLogFormatNoCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, Format.COMMON_LOG_FORMAT);
    printResult(line, "COMMON_LOG_FORMAT", values);
  }

  // this version needs an exact match - so cookie field is no longer
  // optional. if the log file does not contain the cookie field, 
  // a custom format is needed which is basically the combined log 
  // format minus the cookie field.
  @Test
  public void testParseCombinedLogFormatWithCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\" " + 
      "\"USER_ID=12345,jsession_id=3BFY342211\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, Format.COMBINED_LOG_FORMAT);
    printResult(line, "COMBINED_LOG_FORMAT", values);
  }
  
  @Test
  public void testParseCustomLogFormat() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" 34567 \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\"";
    EnumMap<Field,String> values = 
      NcsaLogParser.parse(line, myCustomLogFormat);
    printResult(line, "CUSTOM_LOG_FORMAT", values);
  }
  
  private void printResult(
      String line, String format, EnumMap<Field,String> result) {
    log.info(">>> " + line);
    log.info("format = " + format);
    log.info("components = {");
    for (Field key : result.keySet()) {
      log.info("  " + key.name() + " => " + result.get(key));
    }
    log.info("}");
  }
}

The results are identical to the previous version, except the keys are in lowercase.

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323
format = COMMON_LOG_FORMAT
components = {
  remote_host => 192.168.123.12
  rfc931 => -
  username => -
  date => 19/Oct/2008
  time => 19:45:38
  timezone => -0700
  request_method => GET
  request_uri => /search?q1=foo&st=bar
  request_protocol => HTTP/1.1
  response_status_code => 200
  bytes_recd => 323
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" "Mozilla/5.0 (X11; U; Linux \
i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2
format = COMBINED_LOG_FORMAT
components = {
  remote_host => 192.168.123.12
  rfc931 => -
  username => -
  date => 19/Oct/2008
  time => 19:45:38
  timezone => -0700
  request_method => GET
  request_uri => /search?q1=foo&st=bar
  request_protocol => HTTP/1.1
  response_status_code => 200
  bytes_recd => 323
  referer => -
  user_agent => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \ Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
  cookie => USER_ID=12345,jsession_id=3BFY342211
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" 34567 "Mozilla/5.0 (X11; U; \
Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 Fir
format = CUSTOM_LOG_FORMAT
components = {
  remote_host => 192.168.123.12
  rfc931 => -
  username => -
  date => 19/Oct/2008
  time => 19:45:38
  timezone => -0700
  request_method => GET
  request_uri => /search?q1=foo&st=bar
  request_protocol => HTTP/1.1
  response_status_code => 200
  bytes_recd => 323
  response_time_micros => 34567
  referer => -
  user_agent => Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) \
Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
}

Even though the Scanner based code works fine, setting it up is a pain. The regular expressions themselves are not the problem, it is the fact that you may have missed some edge case that will blow up in your face with a IllegalStateException from the Scanner because it could not match the line. Besides, the complexity of having to specify is not really warranted since the problem is actually quite simple, as I found after a bit of thought.

Version 3: char[]

So, as I mentioned above, the problem of parsing a log file is really quite simple - tokenize on space, except when you encounter a quote or open square bracket. In these cases, continue to accumulate until you see another quote or closing square bracket respectively. Essentially similar logic to my string tokenizer version, but cleaner since I don't have to mess with startsWith() and endsWith() calls.

I also figured that if the caller wanted the contents of the datetime or request or user-agent groups, then he could apply client specific logic to do this - after the initial tokenization, extracting the sub-tokens can be as simple as a String.split() call. That way, we don't spend cycles on stuff we are not going to use. In the spirit of minimalism, I also got rid of the EnumMap and just return a List of tokens - the caller is responsible for getting at the token by name. Here is the code:

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParser.java
package net.sf.jtmt.concurrent.hadoop;

import java.util.ArrayList;
import java.util.List;

/**
 * A stripped down version of the NCSA Log parser. There is no attempt
 * to identify the fields here. The log line is tokenized by whitespace,
 * unless the current token begins with a quote (ie multi word fields
 * such as user_agent) or with a square bracket (ie datetime group). 
 * Caller is responsible for identifying which token to use. Using this
 * approach simplifies the logic a lot, and also ensures that only the
 * processing that is absolutely necessary get done, ie we are not 
 * spending cycles to parse out the contents of the datetime or the 
 * request group unless we want to. If we do, then the parsing is done
 * in the caller code - at this point, the extra parsing can be as simple 
 * as calling String.split() with the appropriate delimiters.
 */
public class NcsaLogParser {
  
  public static List<String> parse(String logline) {
    List<String> tokens = new ArrayList<String>();
    StringBuilder buf = new StringBuilder();
    char[] lc = logline.toCharArray();
    boolean inQuotes = false;
    boolean inBrackets = false;
    for (int i = 0; i < lc.length; i++) {
      if (lc[i] == '"') {
        inQuotes = inQuotes ? false : true;
      } else if (lc[i] == '[') {
        inBrackets = true;
      } else if (lc[i] == ']') {
        if (inBrackets) {
          inBrackets = false;
        }
      } else if (lc[i] == ' ' && (! inQuotes) && (! inBrackets)) {
        tokens.add(buf.toString());
        buf = new StringBuilder();
      } else {
        buf.append(lc[i]);
      }
    }
    if (buf.length() > 0) {
      tokens.add(buf.toString());
    }
    return tokens;
  }
}

As before, the unit test is similar to the previous versions, except to take into account the signature change of the parse() method.

// Source: src/test/java/net/sf/jtmt/concurrent/hadoop/NcsaLogParserTest.java
package net.sf.jtmt.concurrent.hadoop;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.EnumMap;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.sf.jtmt.concurrent.hadoop.Format.Field;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

/**
 * Test for NCSA Log parser.
 */
public class NcsaLogParserTest {

  private final Log log = LogFactory.getLog(getClass());

  @Test
  public void testParsing() throws Exception {
    File[] accesslogs = 
      (new File("src/test/resources/access_logs")).listFiles();
    for (File accesslog : accesslogs) {
      String line;
      BufferedReader reader = new BufferedReader(new FileReader(accesslog));
      while ((line = reader.readLine()) != null) {
        List<String> tokens = NcsaLogParser.parse(line);
        String url = tokens.get(4);
        if (url.contains("/search")) {
          printResult(line, tokens);
        }
      }
      reader.close();
    }
  }

  @Test
  public void testParseCommonLogFormatNoCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323";
    List<String> tokens = NcsaLogParser.parse(line);
    printResult(line, tokens);
  }

  @Test
  public void testParseCombinedLogFormatWithCookie() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\" " + 
      "\"USER_ID=12345,jsession_id=3BFY342211\"";
    List<String> tokens = NcsaLogParser.parse(line);
    printResult(line, tokens);
  }

  @Test
  public void testParseCustomLogFormat() throws Exception {
    String line = "192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] \"" +
      "GET /search?q1=foo&st=bar HTTP/1.1\" 200 323 " +
      "\"-\" 34567 \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) " +
      "Gecko/20080416 Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14\"";
    List<String> tokens = NcsaLogParser.parse(line);
    printResult(line, tokens);
  }

  private void printResult(String line, List<String> tokens) {
    log.info(">>> " + line);
    log.info("tokens={");
    int i = 0;
    for (String token : tokens) {
      log.info("  [" + i + "] = " + token);
      i++;
    }
    log.info("}");
  }
}

The results of this run are shown below. As you can see, it does the job well without the extra complexity of either StringTokenizer or regular expressions.

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323
tokens={
  [0] = 192.168.123.12
  [1] = -
  [2] = -
  [3] = 19/Oct/2008:19:45:38 -0700
  [4] = GET /search?q1=foo&st=bar HTTP/1.1
  [5] = 200
  [6] = 323
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" "Mozilla/5.0 (X11; U; \
Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 \
Firefox/2
tokens={
  [0] = 192.168.123.12
  [1] = -
  [2] = -
  [3] = 19/Oct/2008:19:45:38 -0700
  [4] = GET /search?q1=foo&st=bar HTTP/1.1
  [5] = 200
  [6] = 323
  [7] = -
  [8] = Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 \
Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
  [9] = USER_ID=12345,jsession_id=3BFY342211
}

>>> 192.168.123.12 - - [19/Oct/2008:19:45:38 -0700] "GET \
/search?q1=foo&st=bar HTTP/1.1" 200 323 "-" 34567 "Mozilla/5.0 (X11; \
U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 Fedora/2.0.0.14-1.fc7 \
Firefox/2
tokens={
  [0] = 192.168.123.12
  [1] = -
  [2] = -
  [3] = 19/Oct/2008:19:45:38 -0700
  [4] = GET /search?q1=foo&st=bar HTTP/1.1
  [5] = 200
  [6] = 323
  [7] = -
  [8] = 34567
  [9] = Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080416 \
Fedora/2.0.0.14-1.fc7 Firefox/2.0.0.14
}

Strangely, given that a large amount of data analysis for web based systems is based on HTTP access log data, I couldn't find a single free/open-source parser available to do this. I think its probably because log parsers are somewhere in the no-man's land of the complexity continuum - too simple to share for those who can build it, and cheap enough to buy from a commercial vendor (as part of a larger solution) for those who can't. So anyway, if you fall into the latter category and find that buying is not an option, feel free to use these, either by themselves, or as a starting point for your own implementation.