Datasets

DataToolkitCore.DataSet — Type

DataSet

A named collection of data, along with the means to retrieve the source and interpret it into a useful form.

╭╴DataSet(name, UUID) ─▶ DataCollection╶─╮
│ ├╴Loaders: DataLoader,  […]            │
│ │  ╰╌◁╌╮                               │
│ ├╴Storage: DataStorage, […]            │
│ │  ╰╌◁╌╮                               │
│ └╴Writers: DataWriter,  […]            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Parameters(…)                          │
╰────────────────────────────────────────╯

More concretely, a DataSet:

Belongs to a DataCollection
Is identified by its name and a UUID
Holds any number of key-value parameters
Contains any number of DataStorage, DataLoader, and DataWriter transformers

The name or UUID of a DataSet can be used (optionally with a given DataCollection) to create a serializable Identifier that can be resolved back to the DataSet in question.

The storage of a DataSet can be accessed with open(::DataSet, ::Type), and loaded with read(::DataSet, ::Type).

A DataSet can be directly instantiated using the method

DataSet(collection::DataCollection, name::String, uuid::UUID,
        parameter::Dict{String, Any}, storage::Vector{DataStorage},
        loaders::Vector{DataLoader}, writers::Vector{DataWriter})

but it is generally more convenient to use create or create!, depending on whether you want the created dataset to be registered in the DataCollection passed (create! registers it, create does not).

A DataSet can also be constructed from a TOML specification using fromspec, and a TOML spec created with tospec.

Transformers can be added to a DataSet with create! or the dedicated methods storage!, loader!, and writer!.

source

Working with datasets

DataToolkitCore.dataset — Function

dataset([collection::DataCollection], identstr::AbstractString, [parameters::Dict{String, Any}])
dataset([collection::DataCollection], identstr::AbstractString, [parameters::Pair{String, Any}...])

Return the data set identified by identstr, optionally specifying the collection the data set should be found in and any parameters that apply.

When run interactively, any changes to the underlying data collection file will be reflected in the returned data set.

source

Base.read — Function

read(filename::AbstractString, DataCollection; writer::Union{Function, Nothing})

Read the entire contents of a file as a DataCollection.

The default value of writer is self -> write(filename, self).

source

read(io::IO, DataCollection; path::Union{String, Nothing}=nothing, mod::Module=Base.Main)

Read the entirety of io as a DataCollection.

source

read(dataset::DataSet, as::Type)
read(dataset::DataSet, as::QualifiedType)
read(dataset::DataSet) # as default type

Obtain information from dataset in the form of as, with the appropriate loader and storage provider automatically determined.

This executes the following component of the overall data flow:

                 ╭────loader─────╮
                 ╵               ▼
Storage ◀────▶ Data          Information

The loader and storage provider are selected by identifying the highest priority loader that can be satisfied by a storage provider. What this looks like in practice is illustrated in the diagram below.

      read(dataset, Matrix) ⟶ ::Matrix ◀╮
         ╭───╯        ╰────────────▷┬───╯
╔═════╸dataset╺══════════════════╗  │
║ STORAGE      LOADERS           ║  │
║ (⟶ File)─┬─╮ (File ⟶ String)   ║  │
║ (⟶ IO)   ┊ ╰─(File ⟶ Matrix)─┬─╫──╯
║ (⟶ File)┄╯   (IO ⟶ String)   ┊ ║
║              (IO ⟶ Matrix)╌╌╌╯ ║
╚════════════════════════════════╝

  ─ the load path used
  ┄ an option not taken

The types that a DataSet can be loaded as are determined by the loaders, their declared types, and the implemented methods. If a method exists that can load dataset to a subtype of as, it will be used. Methods that produce a type declared in dataset's loaders are preferred.
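
The selection rule described above can be pictured with a toy model. This is only a sketch under stated assumptions, not the DataToolkitCore implementation: ToyStorage, ToyLoader, and pick are hypothetical names, forms of data are represented by plain Symbols, and both lists are assumed to be ordered from highest to lowest priority.

```julia
# Toy model of loader/storage selection; NOT the DataToolkitCore implementation.
struct ToyStorage
    name::String
    provides::Vector{Symbol}   # forms the storage can supply, e.g. :File or :IO
end

struct ToyLoader
    name::String
    accepts::Symbol   # form of data the loader consumes
    produces::Type    # type of information the loader returns
end

# Assumption: both vectors are ordered from highest to lowest priority.
function pick(loaders::Vector{ToyLoader}, storages::Vector{ToyStorage}, as::Type)
    for loader in loaders
        # Skip loaders that cannot produce (a subtype of) the requested type.
        loader.produces <: as || continue
        for storage in storages
            # The first storage able to feed this loader wins.
            loader.accepts in storage.provides && return (storage, loader)
        end
    end
    nothing   # no loader can be satisfied
end

storages = [ToyStorage("file-a", [:File]), ToyStorage("stream", [:IO])]
loaders = [ToyLoader("file-string", :File, String),
           ToyLoader("file-matrix", :File, Matrix{Float64}),
           ToyLoader("io-matrix", :IO, Matrix{Float64})]

storage, loader = pick(loaders, storages, Matrix)
```

As in the diagram, the File ⟶ Matrix loader is chosen ahead of the IO ⟶ Matrix loader because it comes first and a File-providing storage exists.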

source

Base.write — Function

write(dataset::DataSet, info::Any)

TODO write docstring

source

Base.open — Function

open(dataset::DataSet, as::Type; write::Bool=false)

Obtain the data of dataset in the form of as, with the appropriate storage provider automatically selected.

A write flag is also provided, to help the driver pick a more appropriate form of as.

This executes the following component of the overall data flow:

                 ╭────loader─────╮
                 ╵               ▼
Storage ◀────▶ Data          Information

source

DataToolkitCore.create — Function

create(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Dict{String, <:Any})
create(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Pair{String, <:Any}...)

Create a new DataSet that is a child of parent with a given name and specification.

Dataset identification

DataToolkitCore.Identifier — Type

Identifier

A description that can be used to uniquely identify a DataSet.

Four fields are used to describe the target DataSet:

collection, the name or UUID of the collection (optional).
dataset, the name or UUID of the dataset.
type, the type that should be loaded from the dataset.
parameters, any extra parameters of the dataset that should match.

Data collections

DataToolkitCore.DataCollection — Type

DataCollection

A collection of DataSets, with global configuration, Plugins, and a few other extras.

╭╴DataCollection(name, UUID, path, module)╶─╮
│ ├╴DataSet(…)                              │
│ ├╴DataSet                                 │
│ │ ├╴Loaders: DataLoader,  […]             │
│ │ │  ╰╌◁╌╮                                │
│ │ ├╴Storage: DataStorage, […]             │
│ │ │  ╰╌◁╌╮                                │
│ │ └╴Writers: DataWriter,  […]             │
│ ⋮                                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Plugins(…)                                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Parameters(…)                             │
╰───────────────────────────────────────────╯

Working with DataCollections

It is usual for non-transient DataCollections to be put onto the "STACK" (this is done automatically by loadcollection!). The STACK is the collection of globally known DataCollections.

Being on the STACK allows a DataCollection to be retrieved by its name or UUID using getlayer, and for a DataSet in one DataCollection to refer to a DataSet in another.

When using the data> REPL mode, the top collection on the stack is used as a default target for all operations.

Creating a DataCollection

A Data.toml file can be loaded as a DataCollection (and put on the STACK) with loadcollection!.

To programmatically create a DataCollection you can call the full constructor, but that is rather involved, so a more convenient constructor is also defined:

DataCollection(name::Union{String, Nothing}, [parameters::Dict{String, Any}];
               path::Union{String, Nothing} = nothing,
               uuid::UUID = uuid4(),
               plugins::Vector{String} = String[],
               mod::Module = Base.Main,
               parameters...) -> DataCollection

Note that parameters can either be provided as the second positional argument, or extra keyword arguments, but not both.

Once a DataCollection has been created, DataSets can be added to it with create!.

Examples

julia> DataCollection("test")
DataCollection: test
  Data sets:

julia> c1 = DataCollection(nothing, Dict("customparam" => 77))
DataCollection:
  Data sets:

julia> c2 = DataCollection("test2", plugins = ["defaults", "store"], customparam=77)
DataCollection: test2
  Plugins: defaults ✔, store ✔
  Data sets:

julia> c1.parameters
Dict{String, Any} with 1 entry:
  "customparam" => 77

julia> c2.parameters
Dict{String, Any} with 1 entry:
  "customparam" => 77

Saving a DataCollection

After modifying a file-backed DataCollection, the file can be updated by calling write(::DataCollection) (so long as iswritable(::DataCollection) is true).

Any DataCollection can also be written to a particular destination with write(dest, ::DataCollection).

Writing a DataCollection to plaintext is essentially performed by calling TOML.print on the result of convert(::Type{Dict}, ::DataCollection).
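
The TOML.print half of that step can be tried in isolation with Julia's TOML standard library. The dictionary below is a hypothetical stand-in: the real key layout produced by convert(::Type{Dict}, ::DataCollection) is decided by DataToolkitCore and is not reproduced here.

```julia
using TOML

# Hypothetical dictionary standing in for a converted collection.
spec = Dict{String, Any}(
    "name" => "demo",
    "plugins" => ["defaults", "store"],
    "customparam" => 77)

# Render the dictionary as TOML text, then parse it back to check the round trip.
text = sprint(TOML.print, spec)
roundtrip = TOML.parse(text)

print(text)
```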

Fields

version::Int
name::Union{String, Nothing}
uuid::UUID
plugins::Vector{String}
parameters::Dict{String, Any}
datasets::Vector{DataSet}
path::Union{String, Nothing}
advise::AdviceAmalgamation
mod::Module

source

DataToolkitCore.loadcollection! — Function

loadcollection!(source::Union{<:AbstractString, <:IO}, mod::Module=Base.Main;
                soft::Bool=false, index::Int=1)

Load a data collection from source and add it to the data stack at index. source must be accepted by read(source, DataCollection).

mod should be set to the Module within which loadcollection! is being invoked. This is important when code is run by the collection. As such, it is usually appropriate to call:

loadcollection!(source, @__MODULE__; soft)

When soft is set, should a data collection already exist with the same UUID, nothing will be done and nothing will be returned.

source

DataToolkitCore.STACK — Constant

The set of data collections currently available.

source

Base.iswritable — Method

iswritable(dc::DataCollection)

Check whether the data collection dc is backed by a writable file.

source

Manipulation API

DataToolkitCore.getlayer — Function

getlayer([stack])

Return the first DataCollection on the stack.

stack defaults to STACK, and must be a Vector{DataCollection}.

source

getlayer([stack], name::AbstractString)
getlayer([stack], uuid::UUID)

Find the DataCollection in stack with the given name or uuid.

stack defaults to STACK, and must be a Vector{DataCollection}.

source

DataToolkitCore.stack_index — Function

stack_index(ident::Union{Int, String, UUID, DataCollection}; quiet::Bool=false)

Obtain the index of the data collection identified by ident on the stack, if it is present. If it is not found, nothing is returned and, unless quiet is set, a warning is printed.

source

DataToolkitCore.stack_move — Function

stack_move(ident::Union{Int, String, UUID, DataCollection}, shift::Int; quiet::Bool=false)

Find ident in the data collection stack, and shift its position by shift, returning the new index. shift is clamped so that the new index lies within STACK.

If ident could not be resolved, then nothing is returned and, unless quiet is set, a warning is printed.
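
The clamping behaviour can be illustrated on a plain vector. This is a toy sketch, not the package's code: toy_move! is a hypothetical helper, it takes an index rather than an arbitrary ident, and it assumes the new index is simply the old index plus shift, clamped to the bounds of the vector.

```julia
# Toy model of a clamped shift within a vector; NOT the DataToolkitCore code.
function toy_move!(stack::Vector, index::Int, shift::Int)
    # Keep the destination within the bounds of the stack.
    newindex = clamp(index + shift, 1, length(stack))
    item = stack[index]
    deleteat!(stack, index)
    insert!(stack, newindex, item)
    newindex
end

stack = ["a", "b", "c", "d"]
first_result = toy_move!(stack, 2, -5)   # clamped to 1; stack is now ["b", "a", "c", "d"]
second_result = toy_move!(stack, 1, 2)   # moves to 3; stack is now ["a", "c", "b", "d"]
```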

source

DataToolkitCore.stack_remove! — Function

stack_remove!(ident::Union{Int, String, UUID, DataCollection}; quiet::Bool=false)

Find ident in the data collection stack and remove it from the stack, returning the index at which it was found.

If ident could not be resolved, then nothing is returned and, unless quiet is set, a warning is printed.

source

DataToolkitCore.plugin_add! — Function

plugin_add!([collection::DataCollection=first(STACK)], plugins::Vector{<:AbstractString};
           quiet::Bool=false)

Return a variation of collection in which every entry of plugins that is not already in use has been added to the plugin list.

Unless quiet is set, an informative message is printed.

Side effects

The new collection is written, if possible.

Should collection be part of STACK, the stack entry is updated in-place.

source

DataToolkitCore.plugin_remove! — Function

plugin_remove!([collection::DataCollection=first(STACK)], plugins::Vector{<:AbstractString};
              quiet::Bool=false)

Return a variation of collection in which every entry of plugins that is currently in use has been removed from the plugin list.

Unless quiet is set, an informative message is printed.

Side effects

The new collection is written, if possible.

Should collection be part of STACK, the stack entry is updated in-place.

source

DataToolkitCore.plugin_info — Function

plugin_info(plugin::AbstractString; quiet::Bool=false)

Fetch the documentation of plugin, or return nothing if documentation could not be fetched.

If quiet is not set, warning messages will be emitted when no documentation could be fetched.

source

DataToolkitCore.plugin_list — Function

plugin_list(collection::DataCollection=first(STACK); quiet::Bool=false)

Obtain a list of plugins used in collection.

quiet is unused but accepted as an argument for the sake of consistency.

source

DataToolkitCore.config_get — Function

config_get(propertypath::Vector{String};
           collection::DataCollection=first(STACK), quiet::Bool=false)

Obtain the configuration value at propertypath in collection.

When no value is set, nothing is returned instead and, if quiet is unset, "unset" is printed.
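
A property path can be read as a sequence of keys into nested dictionaries. The sketch below assumes that the configuration is held as nested Dicts, which is an assumption made for illustration; toy_get and the keys used are hypothetical, and the real function may resolve paths differently.

```julia
# Toy lookup of a property path in nested dictionaries; NOT config_get itself.
function toy_get(config::AbstractDict, propertypath::Vector{String})
    node = config
    for key in propertypath
        # Stop as soon as the path leaves the dictionary structure.
        node isa AbstractDict && haskey(node, key) || return nothing
        node = node[key]
    end
    node
end

config = Dict("store" => Dict("path" => "cache"))   # hypothetical configuration

found = toy_get(config, ["store", "path"])
missing_value = toy_get(config, ["store", "absent"])
```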

source

DataToolkitCore.config_set! — Function

config_set!([collection::DataCollection=first(STACK)], propertypath::Vector{String}, value::Any;
           quiet::Bool=false)

Return a variation of collection with the configuration at propertypath set to value.

Unless quiet is set, a success message is printed.

Side effects

The new collection is written, if possible.

Should collection be part of STACK, the stack entry is updated in-place.

source

DataToolkitCore.config_unset! — Function

config_unset!([collection::DataCollection=first(STACK)], propertypath::Vector{String};
              quiet::Bool=false)

Return a variation of collection with the configuration at propertypath removed.

Unless quiet is set, a success message is printed.

Side effects

The new collection is written, if possible.

Should collection be part of STACK, the stack entry is updated in-place.

source

Base.delete! — Function

delete!(dataset::DataSet)

Remove dataset from its parent collection.

source

Base.replace! — Function

replace!(dataset::DataSet; [name, uuid, parameters, storage, loaders, writers])

Perform an in-place update of dataset, optionally replacing any of the name, uuid, parameters, storage, loaders, or writers fields.

source

TOML serialization

DataToolkitCore.fromspec — Function

fromspec(DT::Type{<:DataTransformer}, dataset::DataSet, spec::Dict{String, Any})

Create a DT of dataset according to spec.

DT can either contain the driver name as a type parameter, or it will be read from the "driver" key in spec.

source

fromspec(::Type{DataCollection}, spec::Dict{String, Any};
         path::Union{String, Nothing}=nothing, mod::Module=Base.Main)

Create a DataCollection from spec.

The path and mod keywords are used as the values for the corresponding fields in the DataCollection.

source

fromspec(::Type{DataSet}, collection::DataCollection, name::String, spec::Dict{String, Any})

Create a DataSet for collection called name, according to spec.

source

DataToolkitCore.tospec — Function

tospec(thing::DataTransformer)
tospec(thing::DataSet)
tospec(thing::DataCollection)

Return a Dict representation of thing for writing as TOML.

source

Qualified types

DataToolkitCore.QualifiedType — Type

QualifiedType

A representation of a Julia type that does not need the type to be defined in the Julia session, and can be stored as a string. This is done by storing the type name and the module it belongs to as Symbols.
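
The idea of naming a type by Symbols can be sketched with a toy struct. ToyQualifiedType, toy_string, and toy_parse are hypothetical names and only model the description above (a module name and a type name); the fields and behaviour of the real QualifiedType are not shown here.

```julia
# Toy stand-in for the idea; NOT DataToolkitCore.QualifiedType.
struct ToyQualifiedType
    mod::Symbol    # name of the module the type belongs to
    name::Symbol   # name of the type itself
end

# Describe a type that is defined in the current session.
ToyQualifiedType(T::DataType) = ToyQualifiedType(nameof(parentmodule(T)), nameof(T))

# Store the description as a string.
toy_string(q::ToyQualifiedType) = string(q.mod, '.', q.name)

# Read the description back without the named type needing to be defined.
function toy_parse(s::AbstractString)
    mod, name = rsplit(s, '.'; limit = 2)
    ToyQualifiedType(Symbol(mod), Symbol(name))
end

q = ToyQualifiedType(Int64)
text = toy_string(q)
unloaded = toy_parse("SomePackage.SomeType")   # hypothetical, undefined type
```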

Warning

While QualifiedType is quite capable, it is not currently able to express the full gamut of Julia types. In the future this will be improved, but it will likely always be restricted to a certain subset.