Transformers

The way DataToolkit actually interacts with data sets is through transformers.

Applying a category theory lens

If we consider storage locations (storage), raw data (data), and informative representations of the data (information) to be categories, then, if you squint a bit, the storage, loader, and writer transformers can be viewed as functors. This doesn't have much practical impact, but I think it's a fun way of framing things.

Transformer types

DataToolkitCore.DataTransformer — Type
DataTransformer{kind, driver}

The parent type for structures producing or consuming data.

                 ╭────loader─────╮
                 ╵               ▼
Storage ◀────▶ Data          Information
                 ▲               ╷
                 ╰────writer─────╯

There are three kinds of specialised DataTransformers:

  • DataStorage, which retrieves data from, and/or stores data in, some source
  • DataLoader, which interprets data into a useful form
  • DataWriter, which writes a representation of some information to a source

Each transformer takes a Symbol type parameter designating the driver which should be used to perform the data operation.

In addition, each transformer has the following fields:

  • dataset::DataSet, the data set the method operates on
  • type::Vector{QualifiedType}, the Julia types the method supports
  • priority::Int, the priority with which this method should be used, compared to alternatives. Lower values have higher priority.
  • parameters::Dict{String, Any}, any parameters applied to the method.

See also: DataStorage, DataLoader, DataWriter, supportedtypes.

DataToolkitCore.DataStorage — Type
DataStorage <: DataTransformer

A DataTransformer that can retrieve data from a source, and/or store data in a source.

  Storage ◀────▶ Data

Typically a DataStorage will have methods implemented to provide storage as a FilePath or IO, and potentially writable IO or a FilePath that can be written to.

Data of a certain form retrieved from a storage backend of a DataSet can be accessed by calling open on the dataset.
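
For instance, with a hypothetical data set named "mydata", the raw content could be fetched like so (a minimal sketch, assuming a data collection containing "mydata" is currently loaded):

using DataToolkit

# Fetch the raw content of "mydata" as an IO handle, then read it.
io = open(dataset("mydata"), IO)
content = read(io, String)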

See also: storage, getstorage, putstorage.

Implementing a DataStorage backend

There are two ways a new DataStorage backend can be implemented:

  • implementing a single storage method, which will be used for both reading and writing
  • implementing separate getstorage and/or putstorage methods, for reading and writing respectively

This split approach allows backends with very similar read/write code to be implemented with a single storage method, while backends with very different read/write behaviour, or that support only reading or only writing, can implement just the relevant method.
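
As an illustration of the split approach, here is a minimal sketch of a read-only backend. The :web driver name and its "url" parameter are made up for this example; only getstorage is implemented, so the backend supports reading exclusively.

using Downloads
import DataToolkitCore: DataStorage, getstorage, @getparam

function getstorage(storage::DataStorage{:web}, ::Type{IO})
    url = @getparam storage."url"::String ""
    isempty(url) && return nothing # nothing to provide without a URL
    io = IOBuffer()
    Downloads.download(url, io)    # fetch the resource into memory
    seekstart(io)
    io
end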

Optionally, the following extra methods can be implemented:

  • supportedtypes, when the storage can be read from or written to in multiple forms, to give preference to certain types and help DataToolkit make reasonable assumptions (this does nothing when only a single concrete type is supported)
  • createauto and/or createinteractive to improve the user experience when creating instances of the storage backend.
  • createpriority, when you want automatic creation using this storage backend to be tried earlier or later than the default by DataToolkit (see the sketch below)
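
For example, continuing with the hypothetical :web driver (a sketch, assuming that, as with transformer priorities, lower values are tried earlier):

import DataToolkitCore: DataStorage, createpriority

# Have automatic creation consider the :web backend before the default.
createpriority(::Type{DataStorage{:web}}) = -20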

Example storage backend implementation

For simple cases, implementing a storage backend can take only a few lines.

This is the actual implementation of the :filesystem backend from DataToolkitCommon:

function storage(storage::DataStorage{:filesystem}, ::Type{FilePath}; write::Bool)
    path = getpath(storage)
    if @advise storage isfile(path)
        FilePath(path)
    end
end

function storage(storage::DataStorage{:filesystem}, ::Type{DirPath}; write::Bool)
    path = getpath(storage)
    if @advise storage isdir(path)
        DirPath(path)
    end
end

This provides support for both files and directories, assisted by the helper function getpath, which retrieves the "path" parameter using @getparam and then normalises it.

The isfile/isdir calls are wrapped in @advise so that plugins can dynamically perform additional checks, or even potentially instantiate the file on-demand.

DataToolkitCore.DataLoader — Type
DataLoader <: DataTransformer

A DataTransformer that interprets data into a useful form.

    ╭────loader─────╮
    ╵               ▼
  Data          Information

Typically a DataLoader will have methods implemented to interpret a raw data stream such as IO or a FilePath to a richer, more informative form (such as a DataFrame).

A particular form can be loaded from a DataSet by calling read on the dataset.
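
For instance, to load a hypothetical data set "mydata" as a DataFrame (assuming a loader supporting DataFrame output is declared for it):

using DataToolkit, DataFrames

# Read "mydata" in DataFrame form, via whichever loader supports it.
df = read(dataset("mydata"), DataFrame)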

See also: load, supportedtypes.

Implementing a DataLoader backend

To provide a new DataLoader backend, you need to implement a load method that will provide the data in the requested form:

load(::DataLoader{:name}, source, as::Type)

Often the load implementation will make use of a helpful package. To avoid eagerly loading the package, you can make use of @require and the lazy loading system. In DataToolkitCommon this is combined with the package extension system, resulting in loader implementations that look something like this:

function load(loader::DataLoader{:name}, from::IO, as::Type{Vector{String}})
    @require SomePkg
    param = @getparam loader."param"::Int 0
    invokelatest(_load_somepkg, from, param)
end

function _load_somepkg end # Implemented in a package extension

Depending on the number of loaders and other details this may be overkill in some situations.

In order to matchmake DataLoaders and DataStorages, DataToolkit engages in what is essentially custom dispatch using reflection and method table interrogation. In order for this to work well, the source and as arguments should avoid using parametric types beyond the most simple case:

load(::DataLoader{:name}, source::T, as::Type{T}) where {T}

In cases where a given DataLoader can provide multiple types, or Any/parametric types, you can hint which types are most preferred by implementing supportedtypes for the loader.
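
A sketch of such a hint for the hypothetical :name loader, listing the most preferred type first (this assumes the QualifiedType(module, typename) convenience constructor):

import DataToolkitCore: DataLoader, QualifiedType, supportedtypes

# Prefer DataFrames, falling back to a plain String.
supportedtypes(::Type{DataLoader{:name}}) =
    [QualifiedType(:DataFrames, :DataFrame), QualifiedType(:Core, :String)]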

DataToolkitCore.DataWriter — Type
DataWriter <: DataTransformer

A DataTransformer that writes a representation of some information to a source.

  Data          Information
    ▲               ╷
    ╰────writer─────╯

Typically a DataWriter will have methods implemented to write a structured form of the information to a more basic data format such as IO or a FilePath.

A compatible value can be written to a DataSet by calling write on the dataset.
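
For instance, with a hypothetical data set "mydata" whose writer accepts a DataFrame:

using DataToolkit, DataFrames

df = DataFrame(a = 1:3, b = 4:6)
# Write `df` out through the dataset's writer.
write(dataset("mydata"), df)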

Implementing a DataWriter backend

To provide a new DataWriter backend, you need to implement a save method that writes a value to a given destination:

save(::DataWriter{:name}, destination, info)

As with DataLoaders, the save implementation will often make use of a helpful package. To avoid eagerly loading the package, you can make use of @require and the lazy loading system. In DataToolkitCommon this is combined with the package extension system, resulting in save implementations that look something like this:

function save(writer::DataWriter{:name}, dest::IO, info::Vector{String})
    @require SomePkg
    invokelatest(_save_somepkg, dest, info)
end

function _save_somepkg end # Implemented in a package extension

Depending on the number of writers and other details this may be overkill in some situations.

In cases where a given DataWriter can write multiple types, or Any/parametric types, you can hint which types are most preferred by implementing supportedtypes for the writer.


Implementation API

DataToolkitCore.storage — Function
storage(storer::DataStorage, as::Type; write::Bool=false)

Fetch the as from storer, appropriate for reading data from or writing data to (depending on write).

By default, this just calls getstorage or putstorage (depending on write).

This executes the following component of the overall data flow:

Storage ◀────▶ Data
DataToolkitCore.getstorage — Function
getstorage(storer::DataStorage, as::Type)

Fetch the as form of storer, for reading data from.

This executes the following component of the overall data flow:

Storage ─────▶ Data

See also: storage, putstorage.

DataToolkitCore.putstorage — Function
putstorage(storer::DataStorage, as::Type)

Fetch a handle in the form as from storer, that data can be written to.

This executes the following component of the overall data flow:

Storage ◀───── Data

See also: storage, getstorage.

DataToolkitCore.load — Function
load(loader::DataLoader{driver}, source::Any, as::Type)

Using a certain loader, obtain information in the form of as from the data given by source.

This fulfils this component of the overall data flow:

  ╭────loader─────╮
  ╵               ▼
Data          Information

When the loader produces nothing this is taken to indicate that it was unable to load the data for some reason, and that another loader should be tried if possible. This can be considered a soft failure. Any other value is considered valid information.
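
As a sketch, a hypothetical :lines loader might treat an empty stream as a soft failure:

import DataToolkitCore: DataLoader, load

function load(::DataLoader{:lines}, from::IO, ::Type{Vector{String}})
    content = read(from, String)
    # Returning `nothing` is a soft failure: another loader may be tried.
    isempty(content) && return nothing
    String.(split(content, '\n'))
end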

DataToolkitCore.save — Function
save(writer::DataWriter{driver}, destination::Any, information::Any)

Using a certain writer, save the information to the destination.

This fulfils this component of the overall data flow:

Data          Information
  ▲               ╷
  ╰────writer─────╯

Extras

DataToolkitCore.supportedtypes — Function
supportedtypes(DT::Type{<:DataTransformer}, [spec::Dict{String, Any}, dataset::DataSet]) -> Vector{QualifiedType}

Return a list of types supported by the data transformer DT.

This is used as the default value for the type key in the Data TOML. The list of types is dynamically generated based on the available methods for the data transformer.

In some cases, it makes sense for this to be explicitly defined for a particular transformer, optionally taking into account information in the spec and/or parent dataset.

See also: QualifiedType, DataTransformer.

DataToolkitCore.createauto — Function
createauto([dataset::DataSet], T::Type{<:DataTransformer}, source::String)

Automatically attempts to create a data transformer of type T using source and optionally dataset, without requiring user interaction. Returns either a specification for the transformer as a Dict{String, Any}, true to indicate that an empty (no parameters) transformer should be created, or false/nothing if the transformer cannot be created automatically.

Specific transformers should implement specialised forms of this function, either returning nothing if automatic creation is not possible, or a "create spec form" as a list of key::String => value pairs. For example:

["foo" => "bar",
 "baz" => 2]

Use this function when the creation process should be handled programmatically without user input. If user interaction is required to gather additional information use createinteractive.
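
For instance, a sketch for the hypothetical :web storage driver used earlier (assuming the two-argument method form, without the optional dataset argument):

import DataToolkitCore: DataStorage, createauto

# Recognise URLs, and fill in the "url" parameter without user interaction.
function createauto(::Type{DataStorage{:web}}, source::String)
    if startswith(source, "http://") || startswith(source, "https://")
        ["url" => source]
    end # implicitly returns `nothing` otherwise
end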

DataToolkitCore.createinteractive — Function
createinteractive([dataset::DataSet], T::Type{<:DataTransformer}, source::String)

Attempts to create a data transformer of type T with user interaction, using source and dataset. Prompts the user for additional information if required. Returns either a specification for the transformer as a dictionary, true to indicate that an empty (no parameters) transformer should be created, or nothing if the transformer cannot be created interactively.

Specific transformers should implement specialised forms of this function, either returning nothing if creation is not applicable, or a "create spec form" as a list of key::String => value pairs. For example:

["foo" => "bar", "baz" => 2]

In addition to accepting TOML-representable values, a NamedTuple can be used to define the interactive prompt with fields like:

(; prompt::String = "key",
   type::Type{String or Bool or <:Number} = String,
   default::type = false or "",
   optional::Bool = false,
   skipvalue::Any = nothing,
   post::Function = identity)

The function can also accept a Function that takes the current specification as an argument and returns a TOML-representable value or NamedTuple.
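
Putting this together, here is a sketch for the hypothetical :name loader, mixing a plain TOML-representable value with an interactive prompt (again assuming the two-argument method form):

import DataToolkitCore: DataLoader, createinteractive

# Take the file path from `source`, but ask the user for the delimiter.
function createinteractive(::Type{DataLoader{:name}}, source::String)
    ["file" => source,
     "delim" => (; prompt = "Delimiter", type = String, default = ",")]
end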

Use this function when user interaction is necessary for the creation process. For cases where the creation can be handled programmatically without user input, consider using createauto.

DataToolkitCore.createpriority — Function
createpriority(T::Type{<:DataTransformer})

The priority with which a transformer of type T should be created. This can be any integer, but try to keep it within -100 to 100 (see create).
