Data.toml
Data collections are represented on-disk as Data.toml
files. While DataToolkit
can be used at a basic level without any knowledge of the structure of the file, a little knowledge goes a long way (for instance when editing a dataset).
Overall structure
See the TOML refresher below if you're a bit rusty, then come back to this.
A Data.toml
file is broadly composed of three sections:
- Global setup information
- Configuration
- Datasets
Here's what that structure looks like in practice:
data_config_version=0
name="data collection name"
uuid="a UUIDv4"
plugins=["plugin1", "plugin2", ...]
[config]
# [Properties of the data collection itself]
[[mydataset]]
uuid="a UUIDv4"
# other properties...
[[mydataset.TRANSFORMER]]
driver="transformer driver"
type=["a QualifiedType", ...]
priority=1 # (optional)
# other properties...
[[mydataset]]
# There may be multiple data sets by the same name,
# but they must be uniquely identifyable by their properties
[[exampledata]]
# Another data set
Global setup
There are four top-level non-table properties currently recognised:
- The
Data.toml
format version (data_config_version
) - The name (
name
) and UUID (uuid
) of the data collection:name
is a string that provides a nice human-readable label for the data collection. It may contain any character but:
, however sticking to[A-Za-z0-9_]
is recommended.uuid
is a UUIDv4 that canonically identifies the data collection
- The plugins used by the data collection (
plugins
): a list of strings.
This should all be set as a matter of course.
Configuration
The config
TOML table is special, and is used to hold custom attributes of the data collection, for example:
[config]
mykey="value"
[config.defaults]
description="Ooops, somebody forgot to describe this."
[config.defaults.storage.filesystem]
priority=2
As a consequence of this, no dataset may be named "config"
.
Reserved configuration attributes
The config.locked
attribute is used by DataToolkitCore
to indicate that the Data.toml
file should not be modified, and to override it the attribute must be changed within the Data.toml
file. By setting config.locked = true
, you protect yourself from accidental modifications to the data file.
This functionality is part of DataToolkitCore
because it is taken into account in the implementation of Base.iswritable(::DataCollection)
.
Other attributes may be used by plugins to affect their behaviour. For instance, the defaults
plugin uses the config.defaults
attribute, and both the store
and cache
plugins use the config.store
attribute.
Datasets
All datasets are represented using an array of tables. This allows multiple datasets to have the same name, and be distinguished by other attributes (e.g. version information). The dataset name can be any string that does not contain a colon (:
), other than config
.
[[mydataset]]
uuid="a UUIDv4"
# other properties...
Besides uuid
, the attributes storage
, loader
, and writer
are reserved for specifying transformers. All datasets must have a uuid
key, this is important for providing a canonical unique reference to a particular dataset. All other attributes are free to be used by plugins or to add arbitrary metadata.
Data transformers
The transformers of a dataset (storage
, loader
, and writer
) are specified using sub-tables of the dataset. Any number of transformers can be specified, and the first applicable (sorted by priority, lowest winning) will be used.
[[mydataset.TRANSFORMER]]
driver="transformer driver"
type = ["list", "of", "types"] # (optional)
priority=1 # (optional)
# other properties...
The type
attribute (either a single type, or a list) is given a default value inferred from loaded code relating to the transformer. In situations where this is unsatisfactory, it can be explicitly specified.
All transformers must set the driver
key. All attributes other than driver
, type
, and priority
are free to be used by the transformer and plugins.
TOML refresher
TOML files are already widely used with Julia (for example, Project.toml
and Manifest.toml
) files, as they strike a good compromise between capability and complexity. See the TOML documentation for a full description of the format, but here are the components most relevant to Data.toml
files.
Key-value pairs
key = "value"
This represents a "key"
dictionary key having the value "value"
. Strings, numbers, booleans, and date/time stamps are all appropriate value forms.
a = "value"
b = 2
c = 3.1e+12
d = true
e = 1979-05-27T07:32:00Z
Arrays are written using [ ]
syntax, and can spread across multiple lines.
key = [1, 2, 3]
Tables (Dictionaries)
A collection of key-value pairs within a certain scope form a Julia Dict
when parsed. TOML allows for nested dictionaries using tables. A new table is created with a bracketed header line, like so:
[new_table]
All key-value entries after such a table header, up to the next table header, belong to that table. For example:
[mytable]
a = 1
b = 2
this is parsed as
Dict("mytable" => Dict("a" => 1, "b" => 2))
It is also possible to represent this using dotted keys, e.g.
mytable.a = 1
mytable.b = 2
These two styles can mixed to form nested tables.
[mytable.innertable.deeply_nested]
key = "value"
Arrays of tables
A list of dictionaries (array of tables in TOML terminology) can be formed using double-bracketed headers, e.g.
[[table_array]]
All double-bracketed tables will be collected together into an array, for example:
[[table_array]]
key = 1
[[table_array]]
key = 2
will be parsed as
Dict("table_array" => [Dict("key" => 1),
Dict("key" => 2)])