Datasets
DataToolkitCore.DataSet
— Type
DataSet
A named collection of data, along with the means to retrieve the source and interpret it into a useful form.
╭╴DataSet(name, UUID) ─▶ DataCollection╶─╮
│ ├╴Loaders: DataLoader, […] │
│ │ ╰╌◁╌╮ │
│ ├╴Storage: DataStorage, […] │
│ │ ╰╌◁╌╮ │
│ └╴Writers: DataWriter, […] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Parameters(…) │
╰────────────────────────────────────────╯
More concretely, a DataSet:
- Belongs to a DataCollection
- Is identified by its name and a UUID
- Holds any number of key-value parameters
- Contains any number of DataStorage, DataLoader, and DataWriter transformers
The name or UUID of a DataSet
can be used (optionally with a given DataCollection
) to create a serializable Identifier
that can be resolved back to the DataSet
in question.
The storage of a DataSet
can be accessed with open(::DataSet, ::Type)
, and loaded with read(::DataSet, ::Type)
.
A DataSet
can be directly instantiated using the method
DataSet(collection::DataCollection, name::String, uuid::UUID,
parameter::Dict{String, Any}, storage::Vector{DataStorage},
loaders::Vector{DataLoader}, writers::Vector{DataWriter})
but it is generally going to be more convenient to use create
or create!
depending on whether you want the created dataset to be registered in the DataCollection
passed.
A DataSet
can also be constructed from a TOML specification using fromspec
, and a TOML spec created with tospec
.
Transformers can be added to a DataSet
with create!
or the dedicated methods storage!
, loader!
, and writer!
.
See also: DataCollection
, DataStorage
, DataLoader
, DataWriter
.
Working with datasets
DataToolkitCore.dataset
— Function
dataset([collection::DataCollection], identstr::AbstractString, [parameters::Dict{String, Any}])
dataset([collection::DataCollection], identstr::AbstractString, [parameters::Pair{String, Any}...])
Return the data set identified by identstr
, optionally specifying the collection
the data set should be found in and any parameters
that apply.
Base.read
— Function
read(filename::AbstractString, DataCollection; writer::Union{Function, Nothing})
Read the entire contents of a file as a DataCollection
.
The default value of writer is self -> write(filename, self)
.
read(io::IO, DataCollection; path::Union{String, Nothing}=nothing, mod::Module=Base.Main)
Read the entirety of io
, as a DataCollection
.
read(dataset::DataSet, as::Type)
read(dataset::DataSet) # as default type
Obtain information from dataset
in the form of as
, with the appropriate loader and storage provider automatically determined.
This executes the following component of the overall data flow:
╭────loader─────╮
╵ ▼
Storage ◀────▶ Data Information
The loader and storage provider are selected by identifying the highest priority loader that can be satisfied by a storage provider. What this looks like in practice is illustrated in the diagram below.
read(dataset, Matrix) ⟶ ::Matrix ◀╮
╭───╯ ╰────────────▷┬───╯
╔═════╸dataset╺══════════════════╗ │
║ STORAGE LOADERS ║ │
║ (⟶ File)─┬─╮ (File ⟶ String) ║ │
║ (⟶ IO) ┊ ╰─(File ⟶ Matrix)─┬─╫──╯
║ (⟶ File)┄╯ (IO ⟶ String) ┊ ║
║ (IO ⟶ Matrix)╌╌╌╯ ║
╚════════════════════════════════╝
─ the load path used
┄ an option not taken
The types that a DataSet
can be loaded as are determined by the loaders
, their declared types, and the implemented methods. If a method exists that can load dataset
to a subtype of as
, it will be used. Methods that produce a type declared in dataset
's loaders
are preferred.
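The selection rule described above can be illustrated with a small standalone sketch. It does not use DataToolkitCore itself: `FileRef`, `SketchStorage`, `SketchLoader`, and `pick_path` are hypothetical names, and the assumption that loaders arrive already sorted from highest to lowest priority is made here for simplicity rather than taken from the package.

```julia
# Hypothetical stand-ins for storage providers and loaders (not package types).
struct FileRef end                      # marker type standing in for "File"

struct SketchStorage
    provides::Type                      # the form this storage can supply
end

struct SketchLoader
    input::Type                         # the form this loader consumes
    output::Type                        # the type this loader produces
end

# Return the first loader (assumed sorted by priority, highest first) whose
# output is a subtype of `as` and whose input some storage can supply.
function pick_path(storages::Vector{SketchStorage}, loaders::Vector{SketchLoader}, as::Type)
    for loader in loaders
        loader.output <: as || continue
        idx = findfirst(s -> s.provides <: loader.input, storages)
        idx === nothing || return (storage = idx, loader = loader)
    end
    return nothing                      # no loader can be satisfied
end

# The same storage and loader layout as in the diagram above.
storages = [SketchStorage(FileRef), SketchStorage(IO), SketchStorage(FileRef)]
loaders = [SketchLoader(FileRef, String), SketchLoader(FileRef, Matrix),
           SketchLoader(IO, String), SketchLoader(IO, Matrix)]

path = pick_path(storages, loaders, Matrix)
```

With this layout, `path` names the first storage entry and the `FileRef`-to-`Matrix` loader, mirroring the load path marked in the diagram. The real implementation also considers declared types and implemented methods, which this sketch leaves out.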
Base.write
— Function
write(dataset::DataSet, info::Any)
TODO write docstring
Base.open
— Function
open(dataset::DataSet, as::Type; write::Bool=false)
Obtain the data of dataset
in the form of as
, with the appropriate storage provider automatically selected.
A write
flag is also provided, to help the driver pick a more appropriate form of as
.
This executes the following component of the overall data flow:
╭────loader─────╮
╵ ▼
Storage ◀────▶ Data Information
DataToolkitCore.create
— Function
create(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Dict{String, <:Any})
create(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Pair{String, <:Any}...)
Create a new DataSet
that is a child of parent
with a given name
and specification
.
See also: create!
.
create(parent::DataSet, T::Type{<:DataTransformer}, spec::Dict{String, <:Any})
create(parent::DataSet, T::Type{<:DataTransformer}, driver::Symbol, spec::Dict{String, <:Any})
create(parent::DataSet, T::Type{<:DataTransformer}, driver::Symbol, specs::Pair{String, <:Any}...)
Create a new data transformer of type T
that is a child of the parent
dataset, with a given specification spec
.
The driver
argument may be explicitly specified as a symbol, or it may be included as part of spec
.
See also: create!
.
DataToolkitCore.create!
— Function
create!(::Type{DataCollection}, name::Union{String, Nothing}, path::Union{String, Nothing};
uuid::UUID=uuid4(), plugins::Vector{String}=String[], mod::Module=Base.Main)
Create a new data collection.
This can be an in-memory data collection, when path
is set to nothing
, or a collection which corresponds to a Data TOML file, in which case path
should be set to either a path to a .toml
file or an existing directory in which a Data.toml
file should be placed.
When a path is provided, the data collection will immediately be written, overwriting any existing file at the path.
create!(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Dict{String, <:Any})
create!(parent::DataCollection, ::Type{DataSet}, name::AbstractString, specification::Pair{String, <:Any}...)
Create a new DataSet
that is a child of parent
with a given name
and specification
, and add it to the parent
's list of datasets.
See also: create
.
create!(parent::DataSet, T::Type{<:DataTransformer}, spec::Dict{String, <:Any})
create!(parent::DataSet, T::Type{<:DataTransformer}, driver::Symbol, spec::Dict{String, <:Any})
create!(parent::DataSet, T::Type{<:DataTransformer}, driver::Symbol, specs::Pair{String, <:Any}...)
Create a new data transformer of type T
that is a child of the parent
dataset, with a given specification spec
, and add it to the appropriate list of transformers.
DataToolkitCore.loader!
— Function
loader!(dataset::DataSet, driver::Symbol, parameters::Dict{String, <:Any})
loader!(dataset::DataSet, driver::Symbol, parameters::Pair{String, <:Any}...)
Create a new data loader transformer that is a child of the dataset
dataset, with a given driver driver
and specification parameters
, and add it to the dataset
's list of loader transformers.
DataToolkitCore.storage!
— Function
storage!(dataset::DataSet, driver::Symbol, parameters::Dict{String, <:Any})
storage!(dataset::DataSet, driver::Symbol, parameters::Pair{String, <:Any}...)
Create a new data storage transformer that is a child of the dataset
dataset, with a given driver driver
and specification parameters
, and add it to the dataset
's list of storage transformers.
DataToolkitCore.writer!
— Function
writer!(dataset::DataSet, driver::Symbol, parameters::Dict{String, <:Any})
writer!(dataset::DataSet, driver::Symbol, parameters::Pair{String, <:Any}...)
Create a new data writer transformer that is a child of the dataset
dataset, with a given driver driver
and specification parameters
, and add it to the dataset
's list of writer transformers.
Dataset identification
DataToolkitCore.Identifier
— Type
Identifier
A description that can be used to uniquely identify a DataSet.
Four fields are used to describe the target DataSet:
- collection, the name or UUID of the collection (optional).
- dataset, the name or UUID of the dataset.
- type, the type that should be loaded from the dataset.
- parameters, any extra parameters of the dataset that should match.
Constructors
Identifier(collection::Union{AbstractString, UUID, Nothing},
dataset::Union{AbstractString, UUID},
type::Union{QualifiedType, Nothing},
parameters::Dict{String, Any})
Parsing
An Identifier can be represented as a string with the following form, with the optional components enclosed by square brackets:
[COLLECTION:]DATASET[::TYPE]
Such forms can be parsed to an Identifier by simply calling the parse
function, i.e. parse(Identifier, "mycollection:dataset")
.
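To make the string form concrete, here is a minimal standalone sketch that splits an identifier string into its three components with a regular expression. `split_identifier` is a hypothetical helper written for illustration. It is not the package's parser, which may accept forms this sketch rejects (or vice versa).

```julia
# Hypothetical helper, not part of DataToolkitCore: split a string of the
# form [COLLECTION:]DATASET[::TYPE] into its components.
function split_identifier(identstr::AbstractString)
    # Group 1: optional collection, followed by a single ':' (not '::').
    # Group 2: the dataset name. Group 3: optional type after '::'.
    m = match(r"^(?:([^:]+):(?!:))?([^:]+)(?:::(.+))?$", identstr)
    m === nothing && throw(ArgumentError("unrecognised identifier string: $identstr"))
    collection, dataset, typestr = m.captures
    return (; collection, dataset, typestr)
end

a = split_identifier("mycollection:dataset")
b = split_identifier("dataset::DataFrame")
c = split_identifier("mycollection:dataset::Main.Foo")
```

Components that are absent from the string come back as `nothing`, matching the optional parts of the form shown above.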
DataToolkitCore.resolve
— Function
resolve(collection::DataCollection, ident::Identifier;
resolvetype::Bool=true, requirematch::Bool=true)
Attempt to resolve an identifier (ident
) to a particular data set. Matching data sets will be searched for in collection
.
When resolvetype
is set and ident
specifies a datatype, the identified data set will be read to that type.
When requirematch
is set, an error is raised should no dataset match ident
. Otherwise, nothing
is returned.
resolve(ident::Identifier; resolvetype::Bool=true, stack=STACK)
Attempt to resolve ident
using the specified data layer, if present, trying every layer of the data stack in turn otherwise.
resolve(identstr::AbstractString, parameters::Union{Dict{String, Any}, Nothing}=nothing;
resolvetype::Bool=true, stack::Vector{DataCollection}=STACK)
Attempt to resolve the identifier given by identstr
and parameters
against each layer of the data stack
in turn.
DataToolkitCore.refine
— Function
refine(collection::DataCollection, datasets::Vector{DataSet}, ident::Identifier)
Filter datasets
(from collection
) to data sets that match the identifier ident
.
This function contains an advise entrypoint where plugins can apply further filtering, applied to the method refine(::Vector{DataSet}, ::Identifier, ::Vector{String})
.
refine(datasets::Vector{DataSet}, ::Identifier, ignoreparams::Vector{String})
This is a stub function that exists solely as an advise point for data set filtering during resolution of an identifier.
Data collections
DataToolkitCore.DataCollection
— Type
DataCollection
A collection of DataSets, with global configuration, Plugins, and a few other extras.
╭╴DataCollection(name, UUID, path, module)╶─╮
│ ├╴DataSet(…) │
│ ├╴DataSet │
│ │ ├╴Loaders: DataLoader, […] │
│ │ │ ╰╌◁╌╮ │
│ │ ├╴Storage: DataStorage, […] │
│ │ │ ╰╌◁╌╮ │
│ │ └╴Writers: DataWriter, […] │
│ ⋮ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Plugins(…) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Parameters(…) │
╰───────────────────────────────────────────╯
Working with DataCollections
It is usual for non-transient DataCollections
to be put onto the "STACK
" (this is done automatically by loadcollection!
). The STACK is a collection of globally known DataCollections.
Being on the STACK
allows a DataCollection to be retrieved by its name or UUID using getlayer
, and for a DataSet
in one DataCollection
to refer to a DataSet
in another.
When using the data>
REPL mode, the top collection on the stack is used as a default target for all operations.
Creating a DataCollection
A Data.toml file can be loaded as a DataCollection
(and put on the STACK
) with loadcollection!
.
To programmatically create a DataCollection
you can call the full constructor, but that is rather involved, so a more convenient constructor is also defined:
DataCollection(name::Union{String, Nothing}, [parameters::Dict{String, Any}];
path::Union{String, Nothing} = nothing,
uuid::UUID = uuid4(),
plugins::Vector{String} = String[],
mod::Module = Base.Main,
parameters...) -> DataCollection
Note that parameters
can either be provided as the second positional argument, or extra keyword arguments, but not both.
Once a DataCollection
has been created, DataSets can be added to it with create!
.
Examples
julia> DataCollection("test")
DataCollection: test
Data sets:
julia> c1 = DataCollection(nothing, Dict("customparam" => 77))
DataCollection:
Data sets:
julia> c2 = DataCollection("test2", plugins = ["defaults", "store"], customparam=77)
DataCollection: test2
Plugins: defaults ✔, store ✔
Data sets:
julia> c1.parameters
Dict{String, Any} with 1 entry:
"customparam" => 77
julia> c2.parameters
Dict{String, Any} with 1 entry:
"customparam" => 77
Saving a DataCollection
After modifying a file-backed DataCollection
, the file can be updated by calling write(::DataCollection)
(so long as iswritable(::DataCollection)
is true
).
Any DataCollection
can also be written to a particular destination with write(dest, ::DataCollection)
.
Writing a DataCollection
to plaintext is essentially performed by calling TOML.print
on the result of convert(::Type{Dict}, ::DataCollection)
.
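The "Dict, then `TOML.print`" step can be sketched with Julia's TOML standard library alone. The dictionary below is a hypothetical stand-in for the result of converting a collection. Its keys are illustrative and should not be read as the actual Data.toml schema.

```julia
using TOML

# Hypothetical Dict standing in for a converted collection; the key names
# here are assumptions made for illustration only.
spec = Dict{String, Any}(
    "name" => "demo",
    "plugins" => ["defaults", "store"],
    "config" => Dict{String, Any}("customparam" => 77),
)

io = IOBuffer()
TOML.print(io, spec; sorted=true)   # sorted keys give deterministic output
text = String(take!(io))

# Parsing the text back recovers an equal dictionary.
roundtrip = TOML.parse(text)
```

Sorting the keys is a choice made here so that repeated writes produce the same text. Whether the package does the same is not stated in this section.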
Fields
version::Int
name::Union{String, Nothing}
uuid::UUID
plugins::Vector{String}
parameters::Dict{String, Any}
datasets::Vector{DataSet}
path::Union{String, Nothing}
advise::AdviceAmalgamation
mod::Module
DataToolkitCore.loadcollection!
— Function
loadcollection!(source::Union{<:AbstractString, <:IO}, mod::Module=Base.Main;
soft::Bool=false, index::Int=1)
Load a data collection from source
and add it to the data stack at index
. source
must be accepted by read(source, DataCollection)
.
mod
should be set to the Module within which loadcollection!
is being invoked. This is important when code is run by the collection. As such, it is usually appropriate to call:
loadcollection!(source, @__MODULE__; soft)
When soft
is set, should a data collection already exist with the same UUID, nothing will be done and nothing
will be returned.
DataToolkitCore.STACK
— Constant
The set of data collections currently available.
Base.iswritable
— Method
iswritable(dc::DataCollection)
Check whether the data collection dc
is backed by a writable file.
Manipulation API
DataToolkitCore.getlayer
— Function
getlayer([::Nothing])
Return the first DataCollection
on the STACK
.
getlayer(name::AbstractString)
getlayer(uuid::UUID)
Find the DataCollection
in STACK
with name
/uuid
.
DataToolkitCore.stack_index
— Function
stack_index(ident::Union{Int, String, UUID, DataCollection}; quiet::Bool=false)
Obtain the index of the data collection identified by ident
on the stack, if it is present. If it is not found, nothing
is returned and unless quiet
is set a warning is printed.
DataToolkitCore.stack_move
— Function
stack_move(ident::Union{Int, String, UUID, DataCollection}, shift::Int; quiet::Bool=false)
Find ident
in the data collection stack, and shift its position by shift
, returning the new index. shift
is clamped so that the new index lies within STACK.
If ident
could not be resolved, then nothing
is returned and unless quiet
is set a warning is printed.
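The clamping behaviour can be pictured with a plain vector. This is a minimal sketch using a hypothetical `move_clamped!` helper on an ordinary `Vector`. It does not touch the real STACK, and the direction a positive `shift` moves an entry in the package is not something this sketch asserts.

```julia
# Hypothetical helper: move the element at `index` by `shift` positions,
# clamping the destination so that it stays within the vector.
function move_clamped!(stack::Vector, index::Int, shift::Int)
    newindex = clamp(index + shift, 1, length(stack))
    item = splice!(stack, index)      # remove the element from its old position
    insert!(stack, newindex, item)    # and reinsert it at the clamped position
    return newindex
end

layers = ["first", "second", "third"]
newpos = move_clamped!(layers, 1, 5)  # a shift of 5 overshoots and is clamped
```

Here the overshooting shift leaves "first" at the last position instead of raising an out-of-bounds error.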
DataToolkitCore.stack_remove!
— Function
stack_remove!(ident::Union{Int, String, UUID, DataCollection}; quiet::Bool=false)
Find ident
in the data collection stack and remove it from the stack, returning the index at which it was found.
If ident
could not be resolved, then nothing
is returned and unless quiet
is set a warning is printed.
DataToolkitCore.plugin_add
— Function
plugin_add([collection::DataCollection=first(STACK)], plugins::Vector{<:AbstractString};
quiet::Bool=false)
Return a variation of collection
with all plugins
not currently used added to the plugin list.
Unless quiet
is set, an informative message is printed.
The new collection
is written, if possible.
Should collection
be part of STACK
, the stack entry is updated in-place.
DataToolkitCore.plugin_remove
— Function
plugin_remove([collection::DataCollection=first(STACK)], plugins::Vector{<:AbstractString};
quiet::Bool=false)
Return a variation of collection
with all plugins
currently used removed from the plugin list.
Unless quiet
is set, an informative message is printed.
The new collection
is written, if possible.
Should collection
be part of STACK
, the stack entry is updated in-place.
DataToolkitCore.plugin_info
— Function
plugin_info(plugin::AbstractString; quiet::Bool=false)
Fetch the documentation of plugin
, or return nothing
if documentation could not be fetched.
If quiet
is not set, warning messages will be emitted when no documentation could be fetched.
DataToolkitCore.plugin_list
— Function
plugin_list(collection::DataCollection=first(STACK); quiet::Bool=false)
Obtain a list of plugins used in collection
.
quiet
is unused but accepted as an argument for the sake of consistency.
DataToolkitCore.config_get
— Function
config_get(propertypath::Vector{String};
collection::DataCollection=first(STACK), quiet::Bool=false)
Obtain the configuration value at propertypath
in collection
.
When no value is set, nothing
is returned instead and if quiet
is unset "unset" is printed.
DataToolkitCore.config_set
— Function
config_set([collection::DataCollection=first(STACK)], propertypath::Vector{String}, value::Any;
quiet::Bool=false)
Return a variation of collection
with the configuration at propertypath
set to value
.
Unless quiet
is set, a success message is printed.
The new collection
is written, if possible.
Should collection
be part of STACK
, the stack entry is updated in-place.
DataToolkitCore.config_unset
— Function
config_unset([collection::DataCollection=first(STACK)], propertypath::Vector{String};
quiet::Bool=false)
Return a variation of collection
with the configuration at propertypath
removed.
Unless quiet
is set, a success message is printed.
The new collection
is written, if possible.
Should collection
be part of STACK
, the stack entry is updated in-place.
Base.delete!
— Function
delete!(dataset::DataSet)
Remove dataset
from its parent collection.
Base.replace!
— Function
replace!(dataset::DataSet; [name, uuid, parameters, storage, loaders, writers])
Perform an in-place update of dataset
, optionally replacing any of the name
, uuid
, parameters
, storage
, loaders
, or writers
fields.
TOML serialization
DataToolkitCore.fromspec
— Function
fromspec(DT::Type{<:DataTransformer}, dataset::DataSet, spec::Dict{String, Any})
Create a DT
of dataset
according to spec
.
DT
can either contain the driver name as a type parameter, or it will be read from the "driver"
key in spec
.
fromspec(::Type{DataCollection}, spec::Dict{String, Any};
path::Union{String, Nothing}=nothing, mod::Module=Base.Main)
Create a DataCollection
from spec
.
The path
and mod
keywords are used as the values for the corresponding fields in the DataCollection
.
fromspec(::Type{DataSet}, collection::DataCollection, name::String, spec::Dict{String, Any})
Create a DataSet
for collection
called name
, according to spec
.
DataToolkitCore.tospec
— Function
tospec(thing::DataTransformer)
tospec(thing::DataSet)
tospec(thing::DataCollection)
Return a Dict
representation of thing
for writing as TOML.
Qualified types
DataToolkitCore.QualifiedType
— Type
QualifiedType
A representation of a Julia type that does not need the type to be defined in the Julia session, and can be stored as a string. This is done by storing the type name and the module it belongs to as Symbols.
While QualifiedType
is currently quite capable, it is not currently able to express the full gamut of Julia types. In future this will be improved, but it will likely always be restricted to a certain subset.
See also: typeify
.
Subtyping
While the subtype operator cannot work on QualifiedTypes (<:
is a built-in), when the Julia types are defined the subset operator ⊆
can be used instead. This works by simply converting the QualifiedTypes to the corresponding Type and then applying the subtype operator.
julia> QualifiedType(:Base, :Vector) ⊆ QualifiedType(:Core, :Array)
true
julia> Matrix ⊆ QualifiedType(:Core, :Array)
true
julia> QualifiedType(:Base, :Vector) ⊆ AbstractVector
true
julia> QualifiedType(:Base, :Foobar) ⊆ AbstractVector
false
Constructors
QualifiedType(parentmodule::Symbol, typename::Symbol)
QualifiedType(t::Type)
Parsing
A QualifiedType can be expressed as a string as "$parentmodule.$typename"
. This can be easily parsed as a QualifiedType, e.g. parse(QualifiedType, "Core.IO")
.
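The idea of storing a type as a module name plus a type name, and only resolving it when the type is actually defined, can be sketched without the package. Everything below (`SketchQualifiedType`, `parse_qualified`, `resolve_type`, `is_subtype`) is a hypothetical stand-in written for illustration. The real QualifiedType is more capable, and its resolution logic may differ.

```julia
# Hypothetical stand-in for a qualified type: just two Symbols.
struct SketchQualifiedType
    parentmodule::Symbol
    name::Symbol
end

# Split "Module.Type" at the last dot.
function parse_qualified(str::AbstractString)
    parts = rsplit(str, '.'; limit=2)
    length(parts) == 2 || throw(ArgumentError("expected \"Module.Type\", got $str"))
    return SketchQualifiedType(Symbol(parts[1]), Symbol(parts[2]))
end

# Look the type up via Main; return `nothing` when it is not defined.
function resolve_type(qt::SketchQualifiedType)
    isdefined(Main, qt.parentmodule) || return nothing
    m = getfield(Main, qt.parentmodule)
    (m isa Module && isdefined(m, qt.name)) || return nothing
    T = getfield(m, qt.name)
    return T isa Type ? T : nothing
end

# Subtype check that is simply false when the type cannot be resolved.
function is_subtype(qt::SketchQualifiedType, super::Type)
    T = resolve_type(qt)
    return T !== nothing && T <: super
end

vec_ok = is_subtype(parse_qualified("Base.Vector"), AbstractVector)
missing_ok = is_subtype(parse_qualified("Base.Foobar"), AbstractVector)
```

As in the subtyping examples above, an undefined type such as `Base.Foobar` yields `false` rather than an error.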