Datasets

DataToolkitCore.DataSet (Type)
DataSet

A named collection of data, along with the means to retrieve the source and interpret it in a useful form.

╭╴DataSet(name, UUID) ─▶ DataCollection╶─╮
│ ├╴Loaders: DataLoader,  […]            │
│ │  ╰╌◁╌╮                               │
│ ├╴Storage: DataStorage, […]            │
│ │  ╰╌◁╌╮                               │
│ └╴Writers: DataWriter,  […]            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Parameters(…)                          │
╰────────────────────────────────────────╯

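More concretely, a DataSet:

  • belongs to a DataCollection,
  • is identified by a name and a UUID,
  • holds any number of key-value parameters,
  • contains any number of DataStorage, DataLoader, and DataWriter transformers.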

The name or UUID of a DataSet can be used (optionally with a given DataCollection) to create a serializable Identifier that can be resolved back to the DataSet in question.

The storage of a DataSet can be accessed with open(::DataSet, ::Type), and loaded with read(::DataSet, ::Type).

A DataSet can be directly instantiated using the method

DataSet(collection::DataCollection, name::String, uuid::UUID,
        parameter::Dict{String, Any}, storage::Vector{DataStorage},
        loaders::Vector{DataLoader}, writers::Vector{DataWriter})

but it is generally more convenient to use create or create!, depending on whether you want the created dataset to be registered in the DataCollection passed.

A DataSet can also be constructed from a TOML specification using fromspec, and a TOML spec can be created with tospec.

Transformers can be added to a DataSet with create! or the dedicated methods storage!, loader!, and writer!.

See also: DataCollection, DataStorage, DataLoader, DataWriter.

source

Working with datasets

DataToolkitCore.dataset (Function)
dataset([collection::DataCollection], identstr::AbstractString, [parameters::Dict{String, Any}])
dataset([collection::DataCollection], identstr::AbstractString, [parameters::Pair{String, Any}...])

Return the data set identified by identstr, optionally specifying the collection the data set should be found in and any parameters that apply.

source
Base.read (Function)
read(filename::AbstractString, DataCollection; writer::Union{Function, Nothing})

Read the entire contents of a file as a DataCollection.

The default value of writer is self -> write(filename, self).

source
read(io::IO, DataCollection; path::Union{String, Nothing}=nothing, mod::Module=Base.Main)

Read the entirety of io, as a DataCollection.

source
read(dataset::DataSet, as::Type)
read(dataset::DataSet) # as default type

Obtain information from dataset in the form of as, with the appropriate loader and storage provider automatically determined.

This executes the following component of the overall data flow:

                 ╭────loader─────╮
                 ╵               ▼
Storage ◀────▶ Data          Information

The loader and storage provider are selected by identifying the highest priority loader that can be satisfied by a storage provider. What this looks like in practice is illustrated in the diagram below.

      read(dataset, Matrix) ⟶ ::Matrix ◀╮
         ╭───╯        ╰────────────▷┬───╯
╔═════╸dataset╺══════════════════╗  │
║ STORAGE      LOADERS           ║  │
║ (⟶ File)─┬─╮ (File ⟶ String)   ║  │
║ (⟶ IO)   ┊ ╰─(File ⟶ Matrix)─┬─╫──╯
║ (⟶ File)┄╯   (IO ⟶ String)   ┊ ║
║              (IO ⟶ Matrix)╌╌╌╯ ║
╚════════════════════════════════╝

  ─ the load path used
  ┄ an option not taken

The types that a DataSet can be loaded as are determined by the loaders, their declared types, and the implemented methods. If a method exists that can load dataset to a subtype of as, it will be used. Methods that produce a type declared in dataset's loaders are preferred.
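
The selection rule described above can be pictured with a small, self-contained sketch. It is not DataToolkitCore's implementation: ToyStorage, ToyLoader, FilePath, and pick_path are hypothetical stand-ins, and the loaders are simply assumed to be listed in priority order.

```julia
# Toy stand-ins for the real transformers. These structs are illustrative
# assumptions and do not mirror DataToolkitCore's DataStorage or DataLoader.
struct ToyStorage
    name::String
    provides::Type   # the form this storage can supply, e.g. IO
end

struct ToyLoader
    name::String
    accepts::Type    # the form the loader consumes
    produces::Type   # the type the loader returns
end

struct FilePath end  # placeholder for the "File" form in the diagram

# Walk the loaders in order (assumed to already be sorted by priority) and
# return the first loader/storage pair that can produce a subtype of `as`.
function pick_path(storages::Vector{ToyStorage}, loaders::Vector{ToyLoader}, as::Type)
    for loader in loaders
        loader.produces <: as || continue
        for storage in storages
            storage.provides <: loader.accepts && return (loader, storage)
        end
    end
    nothing  # no loader can be satisfied by any storage
end

# The same arrangement as in the diagram above.
storages = [ToyStorage("s1", FilePath), ToyStorage("s2", IO), ToyStorage("s3", FilePath)]
loaders = [ToyLoader("file-to-string", FilePath, String),
           ToyLoader("file-to-matrix", FilePath, Matrix),
           ToyLoader("io-to-string", IO, String),
           ToyLoader("io-to-matrix", IO, Matrix)]

loader, storage = pick_path(storages, loaders, Matrix)
```

Under these assumptions the File ⟶ Matrix loader is chosen together with the first File storage, which matches the load path marked in the diagram.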

source
Base.write (Function)
write(dataset::DataSet, info::Any)

TODO write docstring

source
Base.open (Function)
open(dataset::DataSet, as::Type; write::Bool=false)

Obtain the data of dataset in the form of as, with the appropriate storage provider automatically selected.

A write flag is also provided, to help the driver pick a more appropriate form of as.

This executes the following component of the overall data flow:

                 ╭────loader─────╮
                 ╵               ▼
Storage ◀────▶ Data          Information
source
DataToolkitCore.create (Function)
create(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Dict{String, <:Any})
create(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Pair{String, <:Any}...)

Create a new DataSet that is a child of parent with a given name and specification.

See also: create!.

source
create(parent::DataSet, T::Type{<:DataTransformer}, spec::Dict{String, <:Any})
create(parent::DataSet, T::Type{<:DataTransformer}, driver::Symbol, spec::Dict{String, <:Any})
create(parent::DataSet, T::Type{<:DataTransformer}, driver::Symbol, specs::Pair{String, <:Any}...)

Create a new data transformer of type T that is a child of the parent dataset, with a given specification spec.

The driver argument may be explicitly specified as a symbol, or it may be included as part of spec.

See also: create!.

source
DataToolkitCore.create! (Function)
create!(::Type{DataCollection}, name::Union{String, Nothing}, path::Union{String, Nothing};
        uuid::UUID=uuid4(), plugins::Vector{String}=String[], mod::Module=Base.Main)

Create a new data collection.

This can be an in-memory data collection, when path is set to nothing, or a collection which corresponds to a Data TOML file, in which case path should be set to either a path to a .toml file or an existing directory in which a Data.toml file should be placed.

When a path is provided, the data collection will immediately be written, overwriting any existing file at the path.
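
As a rough illustration of the three path cases described above (nothing, a .toml file, or an existing directory), here is a minimal sketch. The helper toy_collection_file is hypothetical and is not part of DataToolkitCore; the library's own checks may differ.

```julia
# Hypothetical helper: work out which file a collection would correspond to,
# following the three cases in the prose above.
function toy_collection_file(path::Union{String, Nothing})
    path === nothing && return nothing                  # in-memory collection
    isdir(path) && return joinpath(path, "Data.toml")   # existing directory
    endswith(path, ".toml") && return path              # explicit .toml file
    throw(ArgumentError("expected a .toml file or an existing directory, got: $path"))
end

dir = mktempdir()
in_directory = toy_collection_file(dir)
explicit_file = toy_collection_file(joinpath(dir, "Custom.toml"))
in_memory = toy_collection_file(nothing)
```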

source
create!(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Dict{String, <:Any})
create!(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Pair{String, <:Any}...)

Create a new DataSet that is a child of parent with a given name and specification, and add it to the parent's list of datasets.

See also: create.

source
create!(parent::DataSet, T::Type{<:DataTransformer}, spec::Dict{String, <:Any})
create!(parent::DataSet, T::Type{<:DataTransformer}, driver::Symbol, spec::Dict{String, <:Any})
create!(parent::DataSet, T::Type{<:DataTransformer}, driver::Symbol, specs::Pair{String, <:Any}...)

Create a new data transformer of type T that is a child of the parent dataset, with a given specification spec, and add it to the appropriate list of transformers.

See also: create, loader!, storage!, writer!.

source
DataToolkitCore.loader! (Function)
loader!(dataset::DataSet, driver::Symbol, parameters::Dict{String, <:Any})
loader!(dataset::DataSet, driver::Symbol, parameters::Pair{String, <:Any}...)

Create a new data loader transformer that is a child of dataset, with the given driver and parameters specification, and add it to the dataset's list of loader transformers.

See also: create!, storage!, writer!.

source
DataToolkitCore.storage! (Function)
storage!(dataset::DataSet, driver::Symbol, parameters::Dict{String, <:Any})
storage!(dataset::DataSet, driver::Symbol, parameters::Pair{String, <:Any}...)

Create a new data storage transformer that is a child of dataset, with the given driver and parameters specification, and add it to the dataset's list of storage transformers.

See also: create!, loader!, writer!.

source
DataToolkitCore.writer! (Function)
writer!(dataset::DataSet, driver::Symbol, parameters::Dict{String, <:Any})
writer!(dataset::DataSet, driver::Symbol, parameters::Pair{String, <:Any}...)

Create a new data writer transformer that is a child of dataset, with the given driver and parameters specification, and add it to the dataset's list of writer transformers.

See also: create!, storage!, loader!.

source

Dataset identification

DataToolkitCore.Identifier (Type)
Identifier

A description that can be used to uniquely identify a DataSet.

Four fields are used to describe the target DataSet:

  • collection, the name or UUID of the collection (optional).
  • dataset, the name or UUID of the dataset.
  • type, the type that should be loaded from the dataset.
  • parameters, any extra parameters of the dataset that should match.

See also: resolve, refine.

Constructors

Identifier(collection::Union{AbstractString, UUID, Nothing},
           dataset::Union{AbstractString, UUID},
           type::Union{QualifiedType, Nothing},
           parameters::Dict{String, Any})

Parsing

An Identifier can be represented as a string with the following form, with the optional components enclosed by square brackets:

[COLLECTION:]DATASET[::TYPE]

Such forms can be parsed to an Identifier by calling the parse function, e.g. parse(Identifier, "mycollection:dataset").
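
To make the string form concrete, the sketch below splits such a string into its components with a regular expression. It is only an illustration of the [COLLECTION:]DATASET[::TYPE] layout: split_identifier is a hypothetical helper, not the parser used by DataToolkitCore, and it makes no attempt to handle UUIDs or parameters.

```julia
# Hypothetical helper, not part of DataToolkitCore: split an identifier
# string of the form [COLLECTION:]DATASET[::TYPE] into its three parts.
function split_identifier(str::AbstractString)
    # Group 1: optional collection, ended by a single ':' (not the '::' type marker).
    # Group 2: the dataset name. Group 3: optional type after '::'.
    m = match(r"^(?:([^:]+):(?!:))?([^:]+)(?:::(.+))?$", str)
    m === nothing && throw(ArgumentError("malformed identifier: $str"))
    (collection = m.captures[1], dataset = m.captures[2], type = m.captures[3])
end

plain = split_identifier("mycollection:dataset")
typed = split_identifier("dataset::Vector")
full = split_identifier("mycollection:dataset::Base.Vector")
```

Components that are absent from the string come back as nothing, mirroring the optional parts of the form above.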

source
DataToolkitCore.resolve (Function)
resolve(collection::DataCollection, ident::Identifier;
        resolvetype::Bool=true, requirematch::Bool=true)

Attempt to resolve an identifier (ident) to a particular data set. Matching data sets are searched for in collection.

When resolvetype is set and ident specifies a datatype, the identified data set will be read to that type.

When requirematch is set, an error is raised should no dataset match ident. Otherwise, nothing is returned.

source
resolve(ident::Identifier; resolvetype::Bool=true, stack=STACK)

Attempt to resolve ident using the specified data layer, if present, trying every layer of the data stack in turn otherwise.

source
resolve(identstr::AbstractString, parameters::Union{Dict{String, Any}, Nothing}=nothing;
        resolvetype::Bool=true, stack::Vector{DataCollection}=STACK)

Attempt to resolve the identifier given by identstr and parameters against each layer of the data stack in turn.

source
DataToolkitCore.refine (Function)
refine(collection::DataCollection, datasets::Vector{DataSet}, ident::Identifier)

Filter datasets (from collection) to data sets that match the identifier ident.

This function contains an advise entrypoint where plugins can apply further filtering, applied to the method refine(::Vector{DataSet}, ::Identifier, ::Vector{String}).

source
refine(datasets::Vector{DataSet}, ::Identifier, ignoreparams::Vector{String})

This is a stub function that exists solely as an advise point for data set filtering during resolution of an identifier.

source

Data collections

DataToolkitCore.DataCollection (Type)
DataCollection

A collection of DataSets, with global configuration, Plugins, and a few other extras.

╭╴DataCollection(name, UUID, path, module)╶─╮
│ ├╴DataSet(…)                              │
│ ├╴DataSet                                 │
│ │ ├╴Loaders: DataLoader,  […]             │
│ │ │  ╰╌◁╌╮                                │
│ │ ├╴Storage: DataStorage, […]             │
│ │ │  ╰╌◁╌╮                                │
│ │ └╴Writers: DataWriter,  […]             │
│ ⋮                                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Plugins(…)                                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Parameters(…)                             │
╰───────────────────────────────────────────╯

Working with DataCollections

It is usual for non-transient DataCollections to be put onto the STACK, a collection of globally known DataCollections (this is done automatically by loadcollection!).

Being on the STACK allows a DataCollection to be retrieved by its name or UUID using getlayer, and allows a DataSet in one DataCollection to refer to a DataSet in another.

When using the data> REPL mode, the top collection on the stack is used as a default target for all operations.

Creating a DataCollection

A Data.toml file can be loaded as a DataCollection (and put on the STACK) with loadcollection!.

To programmatically create a DataCollection you can call the full constructor, but that is rather involved, so a more convenient constructor is also defined:

DataCollection(name::Union{String, Nothing}, [parameters::Dict{String, Any}];
               path::Union{String, Nothing} = nothing,
               uuid::UUID = uuid4(),
               plugins::Vector{String} = String[],
               mod::Module = Base.Main,
               parameters...) -> DataCollection

Note that parameters can be provided either as the second positional argument or as extra keyword arguments, but not both.

Once a DataCollection has been created, DataSets can be added to it with create!.

Examples

julia> DataCollection("test")
DataCollection: test
  Data sets:

julia> c1 = DataCollection(nothing, Dict("customparam" => 77))
DataCollection:
  Data sets:

julia> c2 = DataCollection("test2", plugins = ["defaults", "store"], customparam=77)
DataCollection: test2
  Plugins: defaults ✔, store ✔
  Data sets:

julia> c1.parameters
Dict{String, Any} with 1 entry:
  "customparam" => 77

julia> c2.parameters
Dict{String, Any} with 1 entry:
  "customparam" => 77

Saving a DataCollection

After modifying a file-backed DataCollection, the file can be updated by calling write(::DataCollection) (so long as iswritable(::DataCollection) is true).

Any DataCollection can also be written to a particular destination with write(dest, ::DataCollection).

Writing a DataCollection to plaintext is essentially performed by calling TOML.print on the result of convert(::Type{Dict}, ::DataCollection).
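
The last step can be tried in isolation with the TOML standard library. In the sketch below the Dict is a hand-written stand-in for the result of the conversion; its keys echo a few of the documented fields and are an assumption, not the actual Data.toml schema.

```julia
using TOML

# Hypothetical stand-in for the Dict a collection might convert to. The keys
# below are illustrative and are not the actual Data.toml layout.
spec = Dict{String, Any}(
    "name" => "example",
    "plugins" => ["defaults", "store"],
    "parameters" => Dict{String, Any}("customparam" => 77),
)

# Serialise to TOML text, sorting keys so the result is stable.
text = sprint(io -> TOML.print(io, spec; sorted = true))

# Parsing the text recovers the same nested structure.
roundtrip = TOML.parse(text)
```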

Fields

version::Int
name::Union{String, Nothing}
uuid::UUID
plugins::Vector{String}
parameters::Dict{String, Any}
datasets::Vector{DataSet}
path::Union{String, Nothing}
advise::AdviceAmalgamation
mod::Module
source
DataToolkitCore.loadcollection! (Function)
loadcollection!(source::Union{<:AbstractString, <:IO}, mod::Module=Base.Main;
                soft::Bool=false, index::Int=1)

Load a data collection from source and add it to the data stack at index. source must be accepted by read(source, DataCollection).

mod should be set to the Module within which loadcollection! is being invoked. This is important when code is run by the collection. As such, it is usually appropriate to call:

loadcollection!(source, @__MODULE__; soft)

When soft is set, should a data collection with the same UUID already exist, nothing will be done and nothing will be returned.

source
Base.iswritable (Method)
iswritable(dc::DataCollection)

Check whether the data collection dc is backed by a writable file.

source

Manipulation API

DataToolkitCore.stack_index (Function)
stack_index(ident::Union{Int, String, UUID, DataCollection}; quiet::Bool=false)

Obtain the index of the data collection identified by ident on the stack, if it is present. If it is not found, nothing is returned and, unless quiet is set, a warning is printed.

source
DataToolkitCore.stack_move (Function)
stack_move(ident::Union{Int, String, UUID, DataCollection}, shift::Int; quiet::Bool=false)

Find ident in the data collection stack, and shift its position by shift, returning the new index. shift is clamped so that the new index lies within STACK.

If ident could not be resolved, then nothing is returned and, unless quiet is set, a warning is printed.
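
The clamping behaviour can be sketched on a plain vector. The helper toy_stack_move! is hypothetical and operates on any Vector rather than on STACK; in particular, treating a positive shift as a move towards the end of the vector is an assumption made for this illustration only.

```julia
# Hypothetical stand-in for the clamped move: relocate the entry at `index`
# by `shift` positions, keeping the new index within the bounds of `stack`.
function toy_stack_move!(stack::Vector, index::Int, shift::Int)
    newindex = clamp(index + shift, 1, length(stack))
    item = splice!(stack, index)      # remove the entry from its old position
    insert!(stack, newindex, item)    # reinsert it at the clamped position
    newindex
end

stack = ["a", "b", "c"]
first_move = toy_stack_move!(stack, 1, 5)    # shift overshoots, clamped to 3
after_first = copy(stack)
second_move = toy_stack_move!(stack, 3, -1)  # move the last entry up by one
```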

source
DataToolkitCore.stack_remove! (Function)
stack_remove!(ident::Union{Int, String, UUID, DataCollection}; quiet::Bool=false)

Find ident in the data collection stack and remove it from the stack, returning the index at which it was found.

If ident could not be resolved, then nothing is returned and, unless quiet is set, a warning is printed.

source
DataToolkitCore.plugin_add (Function)
plugin_add([collection::DataCollection=first(STACK)], plugins::Vector{<:AbstractString};
           quiet::Bool=false)

Return a variation of collection with every entry of plugins that is not already in use added to the plugin list.

Unless quiet is set, an informative message is printed.

Side effects

The new collection is written, if possible.

Should collection be part of STACK, the stack entry is updated in-place.

source
DataToolkitCore.plugin_remove (Function)
plugin_remove([collection::DataCollection=first(STACK)], plugins::Vector{<:AbstractString};
              quiet::Bool=false)

Return a variation of collection with every entry of plugins that is currently in use removed from the plugin list.

Unless quiet is set, an informative message is printed.

Side effects

The new collection is written, if possible.

Should collection be part of STACK, the stack entry is updated in-place.

source
DataToolkitCore.plugin_info (Function)
plugin_info(plugin::AbstractString; quiet::Bool=false)

Fetch the documentation of plugin, or return nothing if documentation could not be fetched.

If quiet is not set, a warning message is emitted when no documentation could be fetched.

source
DataToolkitCore.plugin_list (Function)
plugin_list(collection::DataCollection=first(STACK); quiet::Bool=false)

Obtain a list of plugins used in collection.

quiet is unused but accepted as an argument for the sake of consistency.

source
DataToolkitCore.config_get (Function)
config_get(propertypath::Vector{String};
           collection::DataCollection=first(STACK), quiet::Bool=false)

Obtain the configuration value at propertypath in collection.

When no value is set, nothing is returned instead and, if quiet is unset, "unset" is printed.

source
DataToolkitCore.config_set (Function)
config_set([collection::DataCollection=first(STACK)], propertypath::Vector{String}, value::Any;
           quiet::Bool=false)

Return a variation of collection with the configuration at propertypath set to value.

Unless quiet is set, a success message is printed.

Side effects

The new collection is written, if possible.

Should collection be part of STACK, the stack entry is updated in-place.

source
DataToolkitCore.config_unset (Function)
config_unset([collection::DataCollection=first(STACK)], propertypath::Vector{String};
              quiet::Bool=false)

Return a variation of collection with the configuration at propertypath removed.

Unless quiet is set, a success message is printed.

Side effects

The new collection is written, if possible.

Should collection be part of STACK, the stack entry is updated in-place.

source
Base.delete! (Function)
delete!(dataset::DataSet)

Remove dataset from its parent collection.

source
Base.replace! (Function)
replace!(dataset::DataSet; [name, uuid, parameters, storage, loaders, writers])

Perform an in-place update of dataset, optionally replacing any of the name, uuid, parameters, storage, loaders, or writers fields.

source

TOML serialization

DataToolkitCore.fromspec (Function)
fromspec(DT::Type{<:DataTransformer}, dataset::DataSet, spec::Dict{String, Any})

Create a DT of dataset according to spec.

DT can either contain the driver name as a type parameter, or it will be read from the "driver" key in spec.
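
The two ways of supplying the driver can be illustrated with a small sketch. The helper toy_driver is hypothetical: it takes the explicit driver as an optional Symbol instead of a type parameter, and only the "driver" key is taken from the description above.

```julia
# Hypothetical helper: prefer an explicitly given driver, otherwise fall back
# to the "driver" key of the specification.
function toy_driver(explicit::Union{Symbol, Nothing}, spec::Dict{String, <:Any})
    explicit !== nothing && return explicit
    haskey(spec, "driver") || throw(ArgumentError("no driver given and no \"driver\" key in spec"))
    Symbol(spec["driver"])
end

from_argument = toy_driver(:csv, Dict{String, Any}("driver" => "json"))
from_spec = toy_driver(nothing, Dict{String, Any}("driver" => "json"))
```

The driver names :csv and "json" are placeholders chosen for the example.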

source
fromspec(::Type{DataCollection}, spec::Dict{String, Any};
         path::Union{String, Nothing}=nothing, mod::Module=Base.Main)

Create a DataCollection from spec.

The path and mod keywords are used as the values for the corresponding fields in the DataCollection.

source
fromspec(::Type{DataSet}, collection::DataCollection, name::String, spec::Dict{String, Any})

Create a DataSet for collection called name, according to spec.

source
DataToolkitCore.tospec (Function)
tospec(thing::DataTransformer)
tospec(thing::DataSet)
tospec(thing::DataCollection)

Return a Dict representation of thing for writing as TOML.

source

Qualified types

DataToolkitCore.QualifiedType (Type)
QualifiedType

A representation of a Julia type that does not need the type to be defined in the Julia session, and can be stored as a string. This is done by storing the type name and the module it belongs to as Symbols.

Warning

While QualifiedType is currently quite capable, it is not currently able to express the full gamut of Julia types. In future this will be improved, but it will likely always be restricted to a certain subset.

See also: typeify.

Subtyping

While the subtype operator cannot work on QualifiedTypes (<: is a built-in), when the Julia types are defined the subset operator (⊆) can be used instead. This works by converting the QualifiedTypes to the corresponding Types and then applying the subtype operator.

julia> QualifiedType(:Base, :Vector) ⊆ QualifiedType(:Core, :Array)
true

julia> Matrix ⊆ QualifiedType(:Core, :Array)
true

julia> QualifiedType(:Base, :Vector) ⊆ AbstractVector
true

julia> QualifiedType(:Base, :Foobar) ⊆ AbstractVector
false
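
The "convert, then compare" idea can be sketched without the library. Everything below is a toy model: ToyQualifiedType, KNOWN_MODULES, toy_typeify, and toy_issubtype are hypothetical names, and only Core and Base are looked up, whereas the real conversion is done by typeify.

```julia
# Toy model of a qualified type: a module name and a type name, as Symbols.
struct ToyQualifiedType
    parentmodule::Symbol
    name::Symbol
end

# Modules this sketch knows how to look things up in (an assumption made to
# keep the example self-contained).
const KNOWN_MODULES = Dict(:Core => Core, :Base => Base)

# Resolve the toy qualified type to an actual Type, or nothing if undefined.
function toy_typeify(qt::ToyQualifiedType)
    mod = get(KNOWN_MODULES, qt.parentmodule, nothing)
    mod === nothing && return nothing
    isdefined(mod, qt.name) || return nothing
    value = getfield(mod, qt.name)
    value isa Type ? value : nothing
end

# Convert first, then apply the built-in subtype operator.
function toy_issubtype(qt::ToyQualifiedType, super::Type)
    resolved = toy_typeify(qt)
    resolved !== nothing && resolved <: super
end

defined = toy_issubtype(ToyQualifiedType(:Base, :Vector), AbstractVector)
undefined = toy_issubtype(ToyQualifiedType(:Base, :Foobar), AbstractVector)
```

As in the examples above, a name that does not resolve to a type simply compares as false.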

Constructors

QualifiedType(parentmodule::Symbol, typename::Symbol)
QualifiedType(t::Type)

Parsing

A QualifiedType can be expressed as a string as "$parentmodule.$typename". This can be easily parsed as a QualifiedType, e.g. parse(QualifiedType, "Core.IO").
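
For illustration, the string form can be split at its last dot to recover the two Symbols. The helper split_qualified is hypothetical and much simpler than a real parser; how nested modules or type parameters are handled by the library is not covered here.

```julia
# Hypothetical helper: split "Module.Name" at the last dot into two Symbols.
function split_qualified(str::AbstractString)
    parts = rsplit(str, '.'; limit = 2)
    length(parts) == 2 || throw(ArgumentError("expected \"parentmodule.typename\", got: $str"))
    (parentmodule = Symbol(parts[1]), name = Symbol(parts[2]))
end

core_io = split_qualified("Core.IO")
```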

source