ODS support for Octave

Copyright © 2009 - 2016 Philip Nienhuis <prnienhuis at users.sf.net>

This version October 23, 2016

(ODS = Open Document Format spreadsheet data format, used by e.g., OpenOffice.org.)

Files content

odsread.m
No-hassle read script for reading from an ODS file and parsing the numeric and text data into separate arrays.

odswrite.m
No-hassle write script for writing to an ODS file.

odsopen.m 
Get a file pointer to an ODS spreadsheet file.

ods2oct.m
Read raw data from an ODS spreadsheet file using the file pointer handed by odsopen.

oct2ods.m
Write data to an ODS spreadsheet file using the file pointer handed by odsopen.

odsclose.m
Close file handle made by odsopen and -if data have been transfered to a spreadsheet- save data.

odsfinfo.m
Explore sheet names and optionally estimated data size of ods files with unknown content.

calccelladdress.m
Utility function, can be used to compute a spreadsheet-type cell adress from 1-based row and column numbers.

parsecell.m
(contained in Excel xlsread scripts, but works also for ods support) parse raw data (cell array) into separate numeric array and text (cell) array.)

chk_spreadsheet_support.m
Internal function for (1) checking, (2) setting up, (3) debugging spreadsheet support. While not specifically meant for direct invocation from the Octave prompt (it is more useful during initialization of Octave itself) it can be very helpful when hunting down issues with spreadsheet support in Octave.

test_spsh.m, io_ods_testscript.m
Undocumented scripts for testing basic operation of ODS spreadsheet functions. Meant for testers and developers, but I don't mind if mere mortal users give it a try as well ;-)


REQUIRED SUPPORT SOFTWARE

For the native Octave interface (OCT)

(read/write support for ODS 1.2 (LibreOffice/OpenOffice.org Calc), OOXML (Excel 2007+), and Gnumeric)
NO external support software is required!

Octave >= 3.8.0 will do just fine but maybe a bit slow. If you want faster I/O, Java support need to be compiled in, a Java JRE > 1.6.0 must be installed, and one or more of the following is required:

  • odfdom.jar (for the OTK interface, currently the preferred option)

  • and/or

  • jopendocument<version>.jar (for the JOD interface)


  • and/or

  • OpenOffice.org (or clones like LibreOffice, Go-Office, ...) (for the UNO interface)

    Whatever Java option, these class libs must be referenced with full pathnames in your javaclasspath.
    When the io package gets loaded, a utility function (PKG_ADD) tries to automatically find the Java class libs and adds the ones it found to the javaclasspath; When the io package gets unloaded, these same class libs will be removed from the javaclasspath.

    Except for the UNO (OOo) classes, on MinGW the jar files had best be put in /<libdir>/java where <libdir> on MinGW it is usually /lib; on Linux system supplied Java class libs usually reside in /usr/share/java. Alternatively, you can put them in your HOME directory in a subdirectory java (mind case!) - on *nix that would be ~./java, on Windows %USERPROFILE%/java (same level as My Documents). The PKG_ADD routine, that gets run each time the io package is loaded, expects the class libs there; if they are elsewhere, add them in ./share/octave/<version>/m/startup/octaverc using appropriate javaaddpath statements or a chk_spreadsheet_support() call.
    In addition, you can specify a subdirectory using the environment variable OCTAVE_IO_JAVALIBS.
    Once a particular Java class lib has been added to the javaclasspath, it won't be searched anymore nor reloaded from the next search location. The search order is:

    1. Specified by the environment variable OCTAVE_IO_JAVALIBS
    2. <HOME_DIR>/java
    3. /usr/share/java (*nix) or /lib/java (MinGW)
    If you do not want PKG_ADD to load the Java class libs, specify a value of "no", "false" or "0" for the OCTAVE_IO_JAVALIBS environment variable before starting Octave.


    USAGE

    (see “help ods<function_filename>” in octave terminal.)

    odsread is a sort of analog to xlsread and works more or less the same. odsread is a mere wrapper for the functions odsopen, ods2oct, and odsclose that do file access and the actual reading, plus parsecell for post-processing.

    odswrite works similar to xlswrite. It too is a wrapper for scripts which do the actual work and invoke other scripts, a.o. oct2ods.

    odsfinfo can be used to explore odsfiles with unknown content for sheet names and to get an impression of the data content sizes.

    When you need data from just one sheet, odsread is for you. But when you need data from multiple sheets in the same spreadsheet file, or if you want to process spreadsheet data by limited-size chunks at a time, odsopen / ods2oct [/parsecell] / … / odsclose sequences provides for much more speed and flexibility as the spreadsheet needs to be read just once rather than repeatedly for each call to odsread.

    Same reasoning goes for odswrite.

    Also, if you use odsopen / …../, you can process multiple spreadsheets simultaneously – just use odsopen repeatedly to get multiple spreadsheet file pointers.

    Moreover, after adding data to an existing spreadsheet file, you can fiddle with the filename in the ods file pointer struct to save the data into another, possibly new spreadsheet file.

    If you use odsopen / ods2oct / … / oct2ods / …. / odsclose, DO NOT FORGET to invoke odsclose in the end. The file pointers can contain an enormous amount of data and may needlessly keep precious memory allocated. In case of the UNO interface, the hidden OpenOffice.org invocation (soffice.bin) can even block proper closing of Octave.


    SPREADSHEET FORMULA SUPPORT, STRIPPING AND ENCODING

    When using the OTK, UNO, and/or OCT interface you can:
    In short, you can enter spreadsheet formulas and in a later stage read them back, change them and re-enter them in the worksheet.
    The behaviour is controlled by an option structure options which has some fields ("flags") that can be set to TRUE or FALSE:

    Note that when invoking odsread.m and odswrite.m, these options are not available. To use these options you need to invoke ods2oct.m / oct2ods.m as outlined above under USAGE

    GOTCHAS

    I know of one big gotcha: i.e. reading dates (& time). A less obvious one is Java memory pool allocation size.

    Date and time in ODS

    Octave (as does Matlab) stores dates as a number representing the number of days since January 1, 0 (and as an aside ignores a.o. Pope Gregorius' intervention in 1582 when 10 days were simply skipped).

    OpenOffice.org stores dates as text strings like “yyyy-mm-dd”.

    MS-Excel stores dates as a number representing the number of days since January 1, 1900 (and as an aside, erroneously assumes 1900 to be a leap year).

    Now, converting OpenOffice.org date cell values (actually, character strings flagged by “date” attributes) into Octave looks pretty straightforward. But when the ODS spreadsheet was originally an Excel spreadsheet converted by OpenOffice.org, the date cells can either be OOo date values (i.e.,strings) OR old numerical values from the Excel spreadsheet.

    So: you should carefully check what happens to date cells.

    As octave has no ”date” or “time” data type, octave date values (usually numerical data) are simply transferred as “floats” to ODS spreadsheets. You'll have to convert the values into dates yourself from within OpenOffice.org.

    While adding data and time values has been implemented in the write scripts, the wait is for clever solutions to distinguish dates from floats in octave cell arrays.

    Java memory pool allocation size

    The Java virtual machine (JVM) initializes one big chunk of your computer's RAM in which all Java classes and methods etc. are to be loaded: the Java memory pool. It does this because Java has a very sophisticated “garbage collection” system. At least on Windows, the initial size is 2MB and the maximum size is 64MB. On Linux this allocated size is much bigger. This part of memory is where the Java-based ODS octave routines (and the Java-based ods routines) live and keep their variables etc.

    For transferring large pieces of information to and from spreadsheets you might hit the limits of this pool. E.g. to be able to handle I/O of an array of around 50,000 cells I needed a memory pool size of 512 MB.

    The memory size can be increased by inserting a file called “java.opts” (without quotes) in the directory ./share/octave/packages/java-<version> (where the script file javaclasspath.m is located), containing just the following lines:

    -Xms16m
    -Xmx512m

    (where 16 = initial size, 512 = maximum size (in this example), m stands for Megabyte. This number is system-dependent).

    After processing a large chunk of spreadsheet information you might notice that octave's memory footprint does not shrink so it looks like Java's memory pool does not shrink back; but rest assured, the memory footprint is the allocated (reserved) memory size, not the actual used size. After the JVM has done its garbage collection, only the so-called “working set” of the memory allocation is really in use and that is a trimmed-down part of the memory allocation pool. On Windows systems it often suffices to minimize the octave terminal for a few seconds to get a more reasonable memory footprint.

    Reading cells containing errors

    Spreadsheet cells containing erroneous stuff are transferred to Octave as NaNs. But not all errors can be catched. Cells showing #Value# in OpenOffice.org Calc often contain invalid formulas but may have a 0 (null) value stored in the value fields. It is impossible to catch this as there is no run-time formula evaluator (yet) in ODF Toolkit nor jOpenDocument (like there is in Apache POI for Excel).

    Smaller gotcha's (only with jOpenDocument 1.2b2, fixed in 1.2b3+ and 1.2 final):

    • While reading, empty cells are sometimes not skipped but interpreted with numerical value 0 (zero).

    • A valid range MUST be specified, I haven't found a way to discover the actual occupied rows and columns (jOpenDocument can give the physical ones (= capacity) but that doesn't help).

    NOT fixed in version 1.2 final nor 1.3b1:

    • jOpenDocument doesn't set the so-called <office:value-type='string'> attribute in cells containing text; as a consequence ODF Toolkit will treat them as empty cells. Ooo will read them OK.


    MATLAB COMPATIBILITY

    Depending on the MS-Excel version on the same computer, Matlab may read/write ODS files using xlsread/xlswrite. Note that decent ODS 1.2 support only started with Excel 2013.
    odsread is fairly function-compatible to xlsread, however.

    Same goes for odswrite, odsfinfo and xlsfinfo – however odsfinfo has better functionality IMO.


    COMPARISON OF INTERFACES

    The ODFtoolkit is the one that gives the best (but slow) results at present. However, parsing xml trees into rectangular arrays is not quite straightforward and the other way round is a real nightmare; odftoolkit up til 0.7.5. did little to hide the gory details for the developers.

    While reading ODS is still OK, writing implies checking whether cells already exist explicitly (in table:table-cells) or implicitly (in number-columns-repeated or number-rows-repeated nodes) or not at all yet in which case you'll need to add various types of parent nodes. Inserting new cells (“nodes”) or deleting nodes implies rebuilding possibly large parts of the tree in memory - nothing for the faint-of-heart. Only with ODFToolkit (odfdom) 0.8.6, 0.8.7 and 0.8.8 things have been simplified for developers.
    Unfortunately, with odftoolkit-0.6.0-incubating and odftoolkit-0.6.1-incubating (corresponding to odfdom-0.8.9 and 0.8.10) unresolved dependencies ("jenasin") have been introduced that break their functionality for Octave.

    The jOpenDocument interface is more promising, as it does shield the xml tree details and presents developers something which looks like a spreadsheet model.
    However, unfortunately the developers decided to shield essential methods by making them 'protected' (e.g. the vital getCellType). JopenDocument does support writing. But OTOH many obvious methods are still lacking and formula support is absent (although announced for future version 1.4).
    And last (but not least) the jOpenDocument developers state that their development is primarily driven by requests from customers who pay for support. I do sympathize with this business model but for Octave needs this may hamper progress for a while.
    In addition, jOpenDocument 1.2 and 1.3b1 still have bugs here and there. For one, it doesn't write appropriate OfficeValueType attributes to the cells, so there's no way to reliably read and distinguish boolean, string and integer values.

    The (still experimental) UNO interface, based on a Java/UNO bridge linking a hidden OpenOffice.org invocation to Octave, is the most promising:

    • Admittedly OOo needs some tens of seconds to start for the first time, but once OOo is in the operating system's disk cache, it operates much faster than ODF or JOD;
    • It has built-in formula validator and evaluator;
    • It has a much more reliable data parser;
    • It can read much more spreadsheet formats than just ODS; .sxc (older OOo and StarOffice), but also .xls, .xlsx (Excel), .wk1 (Lotus 123), dbf, etc.
    • It consumes only a fraction of the JVM heap memory that the other Java ODS spreadsheet solutions need because OOo reads the spreadsheet in its own memory chunk in RAM. The other solutions read, expand, parse and manipulate all data in the JVM. In addition, OOo's code is outside the JVM (and Octave) while the ODF Toolkit and jOpenDocument classes also reside in the JVM.
    However, UNO is not stable yet (see below). As stated above, the arch type (32-bit or 64-bit) must match that of Octave..

    The OCT (native Octave) interface is also promising as it is completely under control of Octave (-Forge) developers. Currently it only offers read and -experimental- write support for ODS (relatively slow), gnumeric (faster) and OOXML (very fast). An immense advantage is that no other external software is required. Write support has not extensively tested yet, however.

    TROUBLESHOOTING

    Some hints for troubleshooting ODS support are given here.
    Since April 2011 the function chk_spreadsheet_support() has been included in the io package. Calling it with arguments ('', 3) (empty string and debug level 3) will echo a lot of diagnostics to the screen. Large parts of the steps outlined below have been automated in this script.
    Problems with UNO are too complicated to treat them here; most of the troubleshooting has been implemented in chk_spreadsheet_support.m, only some general guidelines are given below.

    1. Check if Java works.
      NOTE: As of Octave 3.8, Java support is usually built-in in Octave and the below description of the Java package does not apply anymore!
      Do a pkg list and see

      a. If there's a Java package mentioned (then it's installed). If not, install it. Don't do this in Octave 3.8 and higher!!

      b. If there's an asterisk on the java package line (then the package is loaded). If not, do a pkg rebuild-auto java Don't do this in Octave 3.8 and higher!!

    1. Check Java memory settings. Try javamem

      a. If it works, check if it reports sufficiently large max memory (had better be 200 MiB, the bigger the better)

      b. If it doesn't work, do:

        rt = java_invoke ('java.lang.Runtime', 'getRuntime')

        rt.gc

        rt.maxMemory ().doubleValue () / 1024 / 1024

        The last command will show MaxMemory in MiB.

      c. In case you have insufficient memory, see in “GOTCHAS”, “Java memory pool allocation size”, how to increase java's memory pre-reservation.

    2. Check if all classes (.jarfiles) are in class path. Do a 'jcp = javaclasspath (-all)'  (under unix/linux, do 'jcp = javaclasspath; strsplit (jcp,”:”)' (w/o quotes). See above under “REQUIRED SUPPORT SOFTWARE” what classes should be mentioned.

      If classes (.jar files) are missing, download and put them somewhere and add them to the javaclass path with their fully qualified pathname (in quotes) using javaaddpath().

    Once all classes are present and in the javaclasspath, the ods interfaces should just work. The only remaining showstoppers are insufficient write privileges for the working directory, a wrecked up octave or some other problems outside octave.

    1. Try opening an ods file:

      ods1 = odsopen ('test.ods', 1, 'otk'). If this works and ods1 is a struct with various fields containing objects, ODF toolkit interface (OTK) works. Do an ods1 = odsclose (ods1) to close the file.

      ods2 = odsopen ('test.ods', 1, 'jod'). If this works and ods2 is a struct with various fields containing objects, jOpenDocument interface (JOD) works as well. Do ods2 = odsclose (ods2) to close the file.

    2. For the UNO interface, at least version 1.2.8 of the Java package is needed plus the following Java class libs (jars) and directory:
      * unoil.jar (usually found in subdirectory Basis<version>/program/classes/ or the like of the OpenOffice.org (<OOo>) installation directory;
      * juh.jar, jurt.jar, unoloader.jar and ridl.jar, usually found in the subdirectory URE/share/java/ (or the like) of OOo's installation directory;
      * The subdirectory program/ (where soffice[.exe] (or ooffice) resides).
      The exact case (URE or ure, Basis or basis), name ("Basis3.2" or just "basis") and subdirectory tree (URE/java or URE/share/java) varies across OOo versions and -clones, so chk_spreadsheet_support.m can have a hard time finding all needed classes. In particularly bad cases, when chk_spreadsheet_support cannot find them, you might need to add one or more of these these classes manually to the javaclasspath.


    DEVELOPMENT

    As with the Excel r/w stuff, adding new interfaces should be easy and straightforward. Add relevant stanzas for your new interface INTF in odsopen, odsclose, odsfinfo, oct2ods, ods2oct, getusedrange and add new subfunctions (for the real work) in subdir ./private; you'll need a total of six interface-dependent private functions (see the various examples for each interface in subdir ./private).

    Suggestions for future development:

    • Speeding up (ODS is 10 X slower than e.g. OOXML !!!). jOpenDocument is much faster but still immature.
      For large spreadsheets, UNO *is* MUCH faster than jOpenDocument but starting up OpenOffice.org for the first time can take tens of seconds...
      Note that UNO is still experimental. The issue is that odsclose() will simply kill ALL other OpenOffice.org invocations, also those that were not opened through Octave! This is related to UNO-Java limitations.
      The underlying issue is that when Octave starts an OpenOffice.org invocation, OpenOffice.org must be closed for Octave to be able to exit; otherwise Octave will wait for OOo to shut down before it can terminate itself. So Octave must kill OOo to be able to terminate.
      A way out hasn't been found yet.

    • Passing function handle” a la Matlab's xlsread

    • Adding styles (borders, cell lay-out, font, etc.)

    Some notes on the choice for Java (becoming less relevant as the OCT interface gets more mature):
    1. It saves a LOT of development time to use ready-baked Java classes rather than developing your own routines and thus effectively reinvent the wheel.
    2. A BIG advantage is that a Java-based solution is platform-independent (“portable”).
    3. The Java classes offer much more options than just reading and writing. Formatting, recalculation options, hiding/merging cell ranges, etc.
    4. But Java is known to be not very conservative with resources, especially not when processing XML-based formats.
    So Java is a compromise between portability and rapid development time versus capacity (and speed).
    But IMO data sets larger than 5.105 cells should not be kept in spreadsheets anyway. Use real databases for such data sets.

    ODFDOM versions

    I have tried various odfdom versions. As to 0.8 & 0.8.5, while the API has been simplified enormously (finally one can address cells by spreadsheet address rather than find out yourself by parsing the table-column/-row/-cell structure), many irrecoverable bugs have been introduced :-((
    In addition processing ODS files became significantly slower (up to 7 times!).

    End of August 2010 there's implemented support for odfdom-0.8.6.jar – that version is at last sufficiently reliable to use. The few remaining bugs and limitations could easily be worked around by diving in the older TableTable API. Later on (early 2011) version 0.8.7 has been tested too - this needed a few adjustments. Early 2012 odfdom-0.8.8 (from odfdom-0.5-incubator) was accepted. In March 2015 odfdom-0.8.10 (odfdom-0.6.1-incubator) was tested but alas, it doesn't work (needs extraneous dependencies); clearly the odfdom API (currently at main version 0) is not stable yet.
    So at the moment (August 2012 = last I looked) only odfdom versions 0.7.5, 0.8.6, 0.8.7 and 0.8.8(-incubator) are supported. 0.7.5 is deprecated, however.

    If you want to experiment with odfdom 0.8 & 0.8.5, you can try:
    • odsopen.m (revision 7157)
    • ods2oct.m (revision 7158)
    • oct2ods.m (revision 7159)

    Enjoy!

    Philip Nienhuis, October 23, 2016