The basic format is one or more field names followed by a colon, followed by one or more actions. Some actions take an optional or required parameter.
Since Omega 1.4.6, the parameter value can be enclosed in double quotes. This is necessary if the value contains whitespace. It is also needed for a value containing a comma when the action supports multiple parameters (such as split), because unquoted commas there are interpreted as separating parameters.
Since Omega 1.4.8, the following C-like escape sequences are supported for parameter values enclosed in double quotes: \\, \", \0, \t, \n, \r, and \x followed by two hex digits.
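As an illustration of the escape sequences listed above, here is a minimal Python sketch of how a quoted parameter value could be decoded. It is not Omega's implementation: the function name unescape_param is hypothetical, and raising an error for an unknown or truncated escape is an assumption, since the text above does not say how those cases are handled.

```python
import string


def unescape_param(s: str) -> str:
    """Decode \\\\, \\", \\0, \\t, \\n, \\r and \\xHH in a quoted parameter value."""
    simple = {"\\": "\\", '"': '"', "0": "\0", "t": "\t", "n": "\n", "r": "\r"}
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch != "\\":
            # Ordinary character: copy it through unchanged.
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(s):
            # Assumption: a lone trailing backslash is treated as an error.
            raise ValueError("trailing backslash")
        esc = s[i + 1]
        if esc in simple:
            out.append(simple[esc])
            i += 2
        elif esc == "x":
            digits = s[i + 2:i + 4]
            if len(digits) != 2 or any(c not in string.hexdigits for c in digits):
                raise ValueError("\\x must be followed by two hex digits")
            out.append(chr(int(digits, 16)))
            i += 4
        else:
            # Assumption: escapes not in the documented list are rejected.
            raise ValueError(f"unknown escape \\{esc}")
    return "".join(out)
```

For example, the quoted value "a\tb\x41" would decode to the letter a, a tab, the letter b, and the letter A.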
The actions are applied in the specified order to each field listed, and a field can be listed on several lines.
Here's an example:
desc1 : unhtml index truncate=200 field=sample
desc2 desc3 desc4 : unhtml index
name : field=caption weight=3 index
ref : field=ref boolean=Q unique=Q
type : field=type boolean=XT
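To make the rule layout concrete, the following Python sketch splits one rule line into its field names and its actions. It is only an illustration under simplifying assumptions: parse_rule is a hypothetical helper, and it handles unquoted parameter values only (no double quotes, whitespace, or escape sequences).

```python
def parse_rule(line: str):
    """Split 'field1 field2 : action1 action2=param' into fields and actions."""
    fields_part, sep, actions_part = line.partition(":")
    if not sep:
        raise ValueError("rule has no colon")
    fields = fields_part.split()
    if not fields:
        raise ValueError("rule has no field names")
    actions = []
    for token in actions_part.split():
        # An action is either a bare name or name=parameter.
        name, eq, param = token.partition("=")
        actions.append((name, param if eq else None))
    return fields, actions
```

Applied to "desc2 desc3 desc4 : unhtml index", this yields the three field names and the two actions unhtml and index, each with no parameter.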
Don't put spaces around the = separating an action and its argument. Current versions allow spaces here (though this was never documented as supported), but then a missing argument quietly swallows the next action rather than using an empty value or giving an error. For example, this takes hash as the field name, which is unlikely to be what was intended:
url : field= hash boolean=Q unique=Q
Since 1.4.6 a deprecation warning is emitted for spaces before or after the =.
The actions are:
hextobin
Converts pairs of hex digits to binary byte values (providing a way to specify arbitrary binary strings, e.g. for use in a document value slot). The input should have an even length and be composed entirely of hex digits (if it isn't, an error is reported and the value is unchanged).
hextobin was added in Omega 1.4.6.
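The behaviour described for hextobin can be sketched in Python as follows. This is an illustration rather than Omega's code: the function name and the wording of the error message are assumptions, and the error is simply printed to standard error.

```python
import sys


def hextobin(value: bytes) -> bytes:
    """Convert pairs of hex digits to bytes; return the input unchanged if invalid."""
    hexdigits = b"0123456789abcdefABCDEF"
    if len(value) % 2 != 0 or any(b not in hexdigits for b in value):
        # Report the problem and leave the value as it was.
        print("hextobin: input must be an even number of hex digits", file=sys.stderr)
        return value
    return bytes.fromhex(value.decode("ascii"))
```

With this sketch, the input 48656c6c6f becomes the five bytes of "Hello", while an odd-length input such as abc is returned unchanged.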
parsedate=FORMAT
Parse the text as a date string using strptime() with the format specified by FORMAT, and set the text to the result as a Unix time_t (seconds since 1970), which can then be fed into date or valuepacked, for example:
last_update : parsedate="%Y%m%d %T" field=lastmod valuepacked=0
parsedate was added in Omega 1.4.6.
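A rough Python equivalent of the parsedate step is sketched below. Two assumptions are made that the text above does not confirm: the parsed time is interpreted as UTC, and the result is returned as a decimal string. The %T conversion may not be accepted by Python's strptime, so the sketch spells it out as %H:%M:%S.

```python
import calendar
import time


def parsedate(text: str, fmt: str) -> str:
    """Parse text with strptime and return seconds since 1970 as a string."""
    parsed = time.strptime(text, fmt)
    # Assumption: the broken-down time is treated as UTC.
    return str(calendar.timegm(parsed))


# The format from the example rule, with %T written out in full.
LASTMOD_FORMAT = "%Y%m%d %H:%M:%S"
```

For instance, parsedate("19700102 00:00:00", LASTMOD_FORMAT) gives "86400" under the UTC assumption.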
split=DELIMITER[,OPERATION]
Split the text at each occurrence of DELIMITER, discard any empty strings, perform OPERATION on the resulting list, and then for each entry perform all the actions which follow split in the current rule.
OPERATION can be dedup (remove the second and subsequent occurrences of any value from the list), sort (sort the list), or none (leave the list as it is). The default is none.
To use a comma as the delimiter, you need to quote it, e.g. split=",",dedup.
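The list handling performed by split can be sketched in Python like this. The function name split_values is hypothetical, and the sort order used here (Python's default string comparison) is an assumption, since the text does not specify how entries are compared.

```python
def split_values(text: str, delimiter: str, operation: str = "none") -> list:
    """Split text at delimiter, drop empty strings, then apply the operation."""
    items = [item for item in text.split(delimiter) if item]
    if operation == "dedup":
        # Keep the first occurrence of each value, preserving order.
        items = list(dict.fromkeys(items))
    elif operation == "sort":
        items.sort()
    elif operation != "none":
        raise ValueError(f"unknown operation: {operation}")
    return items
```

Each entry of the returned list would then be passed through the actions that follow split in the rule. For example, splitting "b,a,,b,c" on a comma with dedup leaves b, a, c.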
valuepacked=VALUESLOT
Like value=VALUESLOT, this adds the text as a Xapian document value in slot VALUESLOT, but it first encodes it as a 4-byte big-endian binary string. If the input is a Unix time_t value, the resulting slot can be used for date range filtering and to sort the MSet by date. It can be used in combination with parsedate, for example:
last_update : parsedate="%Y%m%d %T" field=lastmod valuepacked=0
valuepacked was added in Omega 1.4.6.
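The encoding step can be illustrated with Python's struct module. This is a sketch of the 4-byte big-endian packing only; the function name pack_value is hypothetical, and how Omega treats input outside the unsigned 32-bit range is not stated above (here struct.pack raises an error).

```python
import struct


def pack_value(seconds: int) -> bytes:
    """Encode an integer as a 4-byte big-endian unsigned binary string."""
    return struct.pack(">I", seconds)
```

Because the most significant byte comes first, comparing the packed strings byte by byte gives the same order as comparing the numbers, which is what makes the slot usable for sorting and range filtering.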
The data to be indexed is read in from one or more files. Each file has records separated by a blank line. Each record contains one or more fields of the form "name=value". If value contains newlines, these must be escaped by inserting an equals sign ('=') after each newline. Here's an example record:
id=ghq147
title=Sample Record
value=This is a multi-line
=value. Note how each newline
=is escaped.
format=HTML
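The record format described above is simple enough to generate and read back with a few lines of Python. The sketch below is illustrative only: format_record and parse_records are hypothetical helpers, and rejecting a line that has no equals sign is an assumption.

```python
def format_record(fields) -> str:
    """Render (name, value) pairs as one record, escaping newlines in values."""
    lines = []
    for name, value in fields:
        # Insert '=' after each newline so continuation lines start with '='.
        lines.append(f"{name}=" + value.replace("\n", "\n="))
    return "\n".join(lines) + "\n"


def parse_records(text: str) -> list:
    """Parse a dump into a list of records, each a list of (name, value) pairs."""
    records, current = [], []
    for line in text.split("\n"):
        if line == "":
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = []
        elif line.startswith("="):
            if not current:
                raise ValueError("continuation line without a field")
            name, value = current[-1]
            current[-1] = (name, value + "\n" + line[1:])
        else:
            name, sep, value = line.partition("=")
            if not sep:
                raise ValueError(f"line is not name=value: {line!r}")
            current.append((name, value))
    if current:
        records.append(current)
    return records
```

Records written with format_record and joined with a blank line between them can be read back unchanged by parse_records.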
See mbox2omega and mbox2omega.script for an example of how you can generate a dump file from an external source and write an index script to be used with it. Try "mbox2omega --help" for more information.