Transformers
The way DataToolkit actually interacts with datasets is through transformers.
Applying a category theory lens
If we consider storage locations (storage), raw data (data), and informative representations of the data (information) to be categories, then if you squint a bit the storage, loader, and writer transformers can be viewed as functors. This doesn't make much of a practical impact, but I think it's a fun way of framing things.
Transformer types
DataToolkitCore.DataTransformer — Type

DataTransformer{kind, driver}

The parent type for structures producing or consuming data.
                  ╭────loader─────╮
                  ╵               ▼
Storage ◀────▶ Data          Information
                  ▲               ╷
                  ╰────writer─────╯
There are three kinds of specialised DataTransformers:
- DataStorage
- DataLoader
- DataWriter
Each transformer takes a Symbol type parameter designating the driver which should be used to perform the data operation.
In addition, each transformer has the following fields:
- dataset::DataSet, the data set the method operates on
- type::Vector{QualifiedType}, the Julia types the method supports
- priority::Int, the priority with which this method should be used, compared to alternatives (lower values have higher priority)
- parameters::Dict{String, Any}, any parameters applied to the method
See also: DataStorage, DataLoader, DataWriter, supportedtypes.
DataToolkitCore.DataStorage — Type

DataStorage <: DataTransformer

A DataTransformer that can retrieve data from a source, and/or store data in a source.
Storage ◀────▶ Data
Typically a DataStorage will have methods implemented to provide storage as a FilePath or IO, and potentially a writable IO or a FilePath that can be written to.
Data of a certain form retrieved from a storage backend of a DataSet can be accessed by calling open on the dataset.
See also: storage, getstorage, putstorage.
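As a brief usage sketch, assuming a loaded data collection containing a dataset named "mydata" (a hypothetical name), and the open(dataset, as) form described above:

using DataToolkit
ds = dataset("mydata")  # look up the DataSet by name
io = open(ds, IO)       # fetch an IO handle from its storage backend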
Implementing a DataStorage backend
There are two ways a new DataStorage
backend can be implemented:
- Implementing a single
storage(ds::DataStorage{:name}, as::Type; write::Bool)
method, that will provide anas
handle fords
, in either read or write mode. - Implement one or both of the following methods:
This split approach allows backends with very similar read/write cases to be implemented with a single storage method, while also allowing backends with very different read/write behaviour, or that support only reading or only writing, to implement just the relevant method.
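For instance, here is a minimal sketch of the second approach for a hypothetical :rawfile driver with a "path" parameter (the driver name and parameter are illustrative, not part of DataToolkitCommon):

import DataToolkitCore: getstorage, putstorage, DataStorage, @getparam

function getstorage(storage::DataStorage{:rawfile}, ::Type{IO})
    # Provide a read-only handle, or nothing (a soft failure)
    # if the file does not exist.
    path = @getparam storage."path"::String ""
    isfile(path) ? open(path, "r") : nothing
end

function putstorage(storage::DataStorage{:rawfile}, ::Type{IO})
    # Provide a writable handle, creating the file if need be.
    path = @getparam storage."path"::String ""
    open(path, "w")
end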
Optionally, the following extra methods can be implemented:
- supportedtypes, when the storage can be read/written in multiple forms, to give preference to certain types and help DataToolkit make reasonable assumptions (this does nothing when only a single concrete type is supported)
- createauto and/or createinteractive, to improve the user experience when creating instances of the storage backend
- createpriority, when you want automatic creation using this storage backend to be tried earlier or later than the default by DataToolkit
Example storage backend implementation
For simple cases, it can take only a few lines to implement a storage backend. This is the actual implementation of the :filesystem backend from DataToolkitCommon:
function storage(storage::DataStorage{:filesystem}, ::Type{FilePath}; write::Bool)
path = getpath(storage)
if @advise storage isfile(path)
FilePath(path)
end
end
function storage(storage::DataStorage{:filesystem}, ::Type{DirPath}; write::Bool)
path = getpath(storage)
if @advise storage isdir(path)
DirPath(path)
end
end
This provides support for both files and directories, assisted by the helper function getpath, which retrieves the "path" parameter using @getparam and then normalises it.
The isfile/isdir calls are wrapped in @advise to allow plugins to dynamically perform additional checks, or even potentially instantiate a file on demand.
DataToolkitCore.DataLoader — Type

DataLoader <: DataTransformer

A DataTransformer that interprets data into a useful form.
  ╭────loader─────╮
  ╵               ▼
Data          Information
Typically a DataLoader will have methods implemented to interpret a raw data stream such as IO or a FilePath into a richer, more informative form (such as a DataFrame).
A particular form can be loaded from a DataSet by calling read on the dataset.
See also: load, supportedtypes.
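As a usage sketch, assuming a dataset named "mydata" whose loader supports DataFrame:

using DataToolkit, DataFrames
df = read(dataset("mydata"), DataFrame)  # the loader interprets the storage as a DataFrame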
Implementing a DataLoader backend
To provide a new DataLoader backend, you need to implement a load method that will provide the data in the requested form:

load(::DataLoader{:name}, source, as::Type)
Often the load implementation will make use of a helpful package. To avoid eagerly loading the package, you can make use of @require and the lazy loading system. In DataToolkitCommon this is combined with the package extension system, resulting in loader implementations that look something like this:
function load(loader::DataLoader{:name}, from::IO, as::Vector{String})
@require SomePkg
param = @getparam loader."param"::Int 0
invokelatest(_load_somepkg, from, param)
end
function _load_somepkg end # Implemented in a package extension
Depending on the number of loaders and other details, this may be overkill in some situations.
In order to matchmake DataLoaders and DataStorages, DataToolkit engages in what is essentially custom dispatch using reflection and method table interrogation. In order for this to work well, the source and as arguments should avoid using parametric types beyond the most simple case:

load(::DataLoader{:name}, source::T, as::Type{T}) where {T}
In cases where a given DataLoader can provide multiple types, or Any/parametric types, you can hint which types are most preferred by implementing supportedtypes for the loader.
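For example, a loader for a hypothetical :lines driver that can produce either a Vector{String} or a plain String might declare its preference like so (a sketch, assuming the QualifiedType(::Type) constructor):

import DataToolkitCore: supportedtypes, DataLoader, QualifiedType

# Prefer Vector{String}; String is also supported but less favoured.
supportedtypes(::Type{DataLoader{:lines}}) =
    QualifiedType.([Vector{String}, String])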
DataToolkitCore.DataWriter — Type

DataWriter <: DataTransformer

A DataTransformer that writes a representation of some information to a source.
Data          Information
  ▲               ╷
  ╰────writer─────╯
Typically a DataWriter will have methods implemented to write a structured form of the information to a more basic data format such as IO or a FilePath.
A compatible value can be written to a DataSet by calling write on the dataset.
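A brief usage sketch, assuming a dataset named "mydata" with a writer that accepts a Vector{String}:

using DataToolkit
write(dataset("mydata"), ["some", "lines"])  # handled by the dataset's writer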
Implementing a DataWriter backend
To provide a new DataWriter backend, you need to implement a save method that can write a value to a certain form:

save(::DataWriter{:name}, destination, info)
As with DataLoaders, DataWriters can make use of the lazy loading system and package extensions to avoid eagerly loading packages. Often the save implementation will make use of a helpful package; in DataToolkitCommon, @require and the package extension system are combined, resulting in writer implementations that look something like this:
function save(writer::DataWriter{:name}, dest::IO, info::Vector{String})
    @require SomePkg
    invokelatest(_save_somepkg, dest, info)
end
function _save_somepkg end # Implemented in a package extension
Depending on the number of writers and other details, this may be overkill in some situations.
In cases where a given DataWriter can accept multiple types, or Any/parametric types, you can hint which types are most preferred by implementing supportedtypes for the writer.
Implementation API
DataToolkitCore.storage — Function

storage(storer::DataStorage, as::Type; write::Bool=false)
Fetch the as form of storer, appropriate for reading data from or writing data to (depending on write).

By default, this just calls getstorage or putstorage (depending on write).
This executes the following component of the overall data flow:
Storage ◀────▶ Data
DataToolkitCore.getstorage — Function

getstorage(storer::DataStorage, as::Type)

Fetch the as form of storer, for reading data from.
This executes the following component of the overall data flow:
Storage ─────▶ Data
See also: storage, putstorage.
DataToolkitCore.putstorage — Function

putstorage(storer::DataStorage, as::Type)

Fetch a handle in the form as from storer, that data can be written to.
This executes the following component of the overall data flow:
Storage ◀───── Data
See also: storage, getstorage.
DataToolkitCore.load — Function

load(loader::DataLoader{driver}, source::Any, as::Type)

Using a certain loader, obtain information in the form of as from the data given by source.
This fulfils this component of the overall data flow:
  ╭────loader─────╮
  ╵               ▼
Data          Information
When the loader produces nothing, this is taken to indicate that it was unable to load the data for some reason, and that another loader should be tried if possible. This can be considered a soft failure. Any other value is considered valid information.
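A minimal sketch of this soft-failure convention, using a hypothetical :lines driver:

import DataToolkitCore: load, DataLoader

function load(::DataLoader{:lines}, from::IO, ::Type{Vector{String}})
    lines = readlines(from)
    # Returning nothing is a soft failure: another loader (if any)
    # will be tried instead.
    isempty(lines) ? nothing : lines
end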
DataToolkitCore.save — Function

save(writer::DataWriter{driver}, destination::Any, information::Any)

Using a certain writer, save the information to the destination.
This fulfils this component of the overall data flow:
Data          Information
  ▲               ╷
  ╰────writer─────╯
Extras
DataToolkitCore.supportedtypes — Function

supportedtypes(DT::Type{<:DataTransformer}, [spec::Dict{String, Any}, dataset::DataSet]) -> Vector{QualifiedType}

Return a list of types supported by the data transformer DT.
This is used as the default value for the type key in the Data TOML. The list of types is dynamically generated based on the available methods for the data transformer.
In some cases, it makes sense for this to be explicitly defined for a particular transformer, optionally taking into account information in the spec and/or parent dataset.
See also: QualifiedType, DataTransformer.
DataToolkitCore.createauto — Function

createauto([dataset::DataSet], T::Type{<:DataTransformer}, source::String)

Automatically attempts to create a data transformer of type T using source and optionally dataset, without requiring user interaction. Returns either a specification for the transformer as a Dict{String, Any}, true to indicate that an empty (no parameters) transformer should be created, or false/nothing if the transformer cannot be created automatically.
Specific transformers should implement specialised forms of this function, either returning nothing if automatic creation is not possible, or a "create spec form" as a list of key::String => value pairs. For example:
["foo" => "bar",
"baz" => 2]
Use this function when the creation process should be handled programmatically without user input. If user interaction is required to gather additional information, use createinteractive.
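As an illustrative sketch (the :localfile driver name is hypothetical), automatic creation of a file-backed storage might check that the source actually exists:

import DataToolkitCore: createauto, DataStorage

function createauto(::Type{DataStorage{:localfile}}, source::String)
    # Decline (returning nothing) unless `source` names an existing file.
    isfile(abspath(source)) || return nothing
    ["path" => abspath(source)]
end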
DataToolkitCore.createinteractive — Function

createinteractive([dataset::DataSet], T::Type{<:DataTransformer}, source::String)

Attempts to create a data transformer of type T with user interaction, using source and dataset. Prompts the user for additional information if required. Returns either a specification for the transformer as a dictionary, true to indicate that an empty (no parameters) transformer should be created, or nothing if the transformer cannot be created interactively.
Specific transformers should implement specialised forms of this function, either returning nothing if creation is not applicable, or a "create spec form" as a list of key::String => value pairs. For example:
["foo" => "bar", "baz" => 2]
In addition to accepting TOML-representable values, a NamedTuple can be used to define the interactive prompt, with fields like:

(; prompt::String = "key",
   type::Type{String or Bool or <:Number} = String,
   default::type = false or "",
   optional::Bool = false,
   skipvalue::Any = nothing,
   post::Function = identity)
The function can also accept a Function that takes the current specification as an argument and returns a TOML-representable value or NamedTuple.
Use this function when user interaction is necessary for the creation process. For cases where creation can be handled programmatically without user input, consider using createauto.
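For instance, a hypothetical :delim loader might prompt for its delimiter (a sketch, not an actual DataToolkitCommon backend):

import DataToolkitCore: createinteractive, DataLoader

function createinteractive(::Type{DataLoader{:delim}}, source::String)
    # Ask the user for the delimiter, defaulting to a comma.
    ["delim" => (; prompt = "Delimiter: ", type = String, default = ",")]
end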
DataToolkitCore.createpriority — Function

createpriority(T::Type{<:DataTransformer})

The priority with which a transformer of type T should be created. This can be any integer, but try to keep it within -100–100 (see create).
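For example, a sketch for a hypothetical network-backed storage, assuming the same convention as transformer priority (lower values are preferred first):

import DataToolkitCore: createpriority, DataStorage

# Nudge this hypothetical backend to be tried later than default backends.
createpriority(::Type{DataStorage{:web}}) = 30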