Data.toml

Data collections are represented on-disk as Data.toml files. While DataToolkit can be used at a basic level without any knowledge of the structure of the file, a little knowledge goes a long way (for instance when editing a dataset).

Overall structure

See the TOML refresher below if you're a bit rusty, then come back to this.

A Data.toml file is broadly composed of three sections:

  1. Global setup information
  2. Configuration
  3. Datasets

Here's what that structure looks like in practice:

data_config_version=0

name="data collection name"
uuid="a UUIDv4"
plugins=["plugin1", "plugin2", ...]

[config]
# [Properties of the data collection itself]

[[mydataset]]
uuid="a UUIDv4"
# other properties...

[[mydataset.TRANSFORMER]]
driver="transformer driver"
type=["a QualifiedType", ...]
priority=1 # (optional)
# other properties...

[[mydataset]]
# There may be multiple data sets by the same name,
# but they must be uniquely identifyable by their properties

[[exampledata]]
# Another data set

Global setup

There are four top-level non-table properties currently recognised:

  • The Data.toml format version (data_config_version)
  • The name (name) and UUID (uuid) of the data collection:
    • name is a string that provides a nice human-readable label for the data collection. It may contain any character but :, however sticking to [A-Za-z0-9_] is recommended.
    • uuid is a UUIDv4 that canonically identifies the data collection
  • The plugins used by the data collection (plugins): a list of strings.

This should all be set as a matter of course.

Configuration

The config TOML table is special, and is used to hold custom attributes of the data collection, for example:

[config]
mykey="value"

[config.defaults]
description="Ooops, somebody forgot to describe this."

[config.defaults.storage.filesystem]
priority=2

As a consequence of this, no dataset may be named "config".

Reserved configuration attributes

The config.locked attribute is used by DataToolkitCore to indicate that the Data.toml file should not be modified, and to override it the attribute must be changed within the Data.toml file. By setting config.locked = true, you protect yourself from accidental modifications to the data file.

Notes

This functionality is part of DataToolkitCore because it is taken into account in the implementation of Base.iswritable(::DataCollection).

Other attributes may be used by plugins to affect their behaviour. For instance, the defaults plugin uses the config.defaults attribute, and both the store and cache plugins use the config.store attribute.

Datasets

All datasets are represented using an array of tables. This allows multiple datasets to have the same name, and be distinguished by other attributes (e.g. version information). The dataset name can be any string that does not contain a colon (:), other than config.

[[mydataset]]
uuid="a UUIDv4"
# other properties...

Besides uuid, the attributes storage, loader, and writer are reserved for specifying transformers. All datasets must have a uuid key, this is important for providing a canonical unique reference to a particular dataset. All other attributes are free to be used by plugins or to add arbitrary metadata.

Data transformers

The transformers of a dataset (storage, loader, and writer) are specified using sub-tables of the dataset. Any number of transformers can be specified, and the first applicable (sorted by priority, lowest winning) will be used.

[[mydataset.TRANSFORMER]]
driver="transformer driver"
type = ["list", "of", "types"] # (optional)
priority=1 # (optional)
# other properties...

The type attribute (either a single type, or a list) is given a default value inferred from loaded code relating to the transformer. In situations where this is unsatisfactory, it can be explicitly specified.

All transformers must set the driver key. All attributes other than driver, type, and priority are free to be used by the transformer and plugins.

TOML refresher

TOML files are already widely used with Julia (for example, Project.toml and Manifest.toml) files, as they strike a good compromise between capability and complexity. See the TOML documentation for a full description of the format, but here are the components most relevant to Data.toml files.

Key-value pairs

key = "value"

This represents a "key" dictionary key having the value "value". Strings, numbers, booleans, and date/time stamps are all appropriate value forms.

a = "value"
b = 2
c = 3.1e+12
d = true
e = 1979-05-27T07:32:00Z

Arrays are written using [ ] syntax, and can spread across multiple lines.

key = [1, 2, 3]

Tables (Dictionaries)

A collection of key-value pairs within a certain scope form a Julia Dict when parsed. TOML allows for nested dictionaries using tables. A new table is created with a bracketed header line, like so:

[new_table]

All key-value entries after such a table header, up to the next table header, belong to that table. For example:

[mytable]
a = 1
b = 2

this is parsed as

Dict("mytable" => Dict("a" => 1, "b" => 2))

It is also possible to represent this using dotted keys, e.g.

mytable.a = 1
mytable.b = 2

These two styles can mixed to form nested tables.

[mytable.innertable.deeply_nested]
key = "value"

Arrays of tables

A list of dictionaries (array of tables in TOML terminology) can be formed using double-bracketed headers, e.g.

[[table_array]]

All double-bracketed tables will be collected together into an array, for example:

[[table_array]]
key = 1

[[table_array]]
key = 2

will be parsed as

Dict("table_array" => [Dict("key" => 1),
                       Dict("key" => 2)])