Internals
Type transformations
DataToolkitCore.issubtype — Function
issubtype(X::Type, T::Union{Type, TypeVar})
issubtype(x::X, T::Union{Type, TypeVar})
Check if X is indeed a subtype of T.
This is a tweaked version of isa that can (mostly) handle TypeVar instances.
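An illustrative sketch of the described behaviour (hypothetical REPL session; the TypeVar case assumes comparison against the variable's upper bound):
julia> issubtype(Int, Number)
true
julia> issubtype(Int, TypeVar(:T, Number))
true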
DataToolkitCore.paramtypebound — Function
paramtypebound(T::Union{Type, TypeVar}, Tparam::Union{Type, TypeVar}, paramT::Type)
Return the Type that bounds T.
This is simply T when T isa Type, but T may also be a TypeVar that is parameterised by Tparam. In this case, the Type that T is parameterised by is returned, which is taken to be paramT.
For example, given a type T that may be parameterised according to Tparam:
julia> paramtypebound(String, IO, IO)
String
julia> T = TypeVar(:T)
T
julia> paramtypebound(T, Type{T}, Float64)
Float64
DataToolkitCore.targettypes — Function
targettypes(types::Vector{QualifiedType}, desired::Type) -> Vector{Type}
targettypes(transformer::DataTransformer, desired::Type) -> Vector{Type}
Return all Types that one might hope to produce from types or transformer.
More specifically, this will give all Types that can be produced which are a subtype of desired, and desired itself.
Priority order is preserved.
DataToolkitCore.ispreferredpath — Function
ispreferredpath(a, b)
Compares two "type paths" a and b, returning whether a is preferred.
Each "type path" is a tuple of the form:
(Tin::Type => Tout::Type, index::Int, transformer::Type{<:DataTransformer})
This operates on the following rules:
- The path with the lower index is preferred.
- If the indices are equal, the path with the more specific output type is preferred.
- If the output types are equally specific, the path with the more specific loader is preferred.
- If the loaders are equally specific, the more similar data transformation (Tin => Tout) is preferred.
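To illustrate the first rule with a hypothetical pair of type paths (the :io driver name is made up), the path with the lower index should be preferred regardless of the other components:
julia> ispreferredpath((IO => String, 1, DataLoader{:io}),
                       (IO => Any, 2, DataLoader{:io}))
true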
DataToolkitCore.transformersigs — Function
transformersigs(transformer::Type{<:DataTransformer}, desired::Type)
Return processed signatures of the transformation methods implemented for transformer that could produce/provide a subtype of desired.
- DataStorage produces tuples of (Type{<:DataStorage}, Type{out})
- DataLoader produces tuples of (Type{<:DataLoader}, Type{in}, Type{out})
- DataWriter produces tuples of (Type{<:DataWriter}, Type{in}, Type{data})
The DataStorage method takes a write::Bool keyword argument.
DataToolkitCore.typesteps — Function
typesteps(loader::DataLoader, desired::Type) -> Vector{Pair{Type, Type}}
Identify and order all uses of loader that may produce a subtype of desired.
More specifically, this finds all load methods that can produce a subtype of desired, checks what input and output types they work with, and orders them according to the declared types of loader and the specificity of the output types (more specific is interpreted as better).
The output vector gives the step-change in the type domain that each method performs.
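As a hedged sketch of the return shape: for a hypothetical loader implementing load methods from IO to String and from IO to Vector{UInt8}, one might see (the ordering shown is illustrative):
julia> typesteps(loader, Any)
2-element Vector{Pair{Type, Type}}:
 IO => String
 IO => Vector{UInt8}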
TOML related
DataToolkitCore.toml_safe — Function
toml_safe(value)
Recursively convert value to a form that DataToolkit can safely encode to TOML.
DataToolkitCore.tomlreformat! — Function
tomlreformat!(io::IO)
Consume io representing a TOML file, and reformat it to improve readability. Currently this takes the form of the following changes:
- Replace inline multi-line strings with multi-line TOML strings.
An IOBuffer containing the reformatted content is returned.
The processing assumes that io contains TOML.print-formatted content. Should this not be the case, mangled TOML may be emitted.
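A hedged sketch of the intended effect (the exact output formatting may differ):
julia> print(String(take!(tomlreformat!(IOBuffer("text = \"one\\ntwo\\n\"")))))
text = """
one
two
"""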
DataToolkitCore.dataset_parameters — Function
dataset_parameters(source::Union{DataCollection, DataSet, DataTransformer},
                   action::Val{:extract|:resolve|:encode}, value::Any)
Obtain a form (depending on action) of value, a property within source.
Actions
- :extract: Look for DataSet references ("📇DATASET<<…>>") within value, and turn them into Identifiers (the inverse of :encode).
- :resolve: Look for Identifiers in value, and resolve them to the referenced DataSet/value.
- :encode: Look for Identifiers in value, and turn them into DataSet references (the inverse of :extract).
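Since :extract and :encode are described as mutually inverse, one would expect the following round-trip property to hold for a DataSet ds and a parameter value v (a hedged sketch, not a tested doctest):
dataset_parameters(ds, Val(:encode), dataset_parameters(ds, Val(:extract), v)) == v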
String utils
DataToolkitCore.natkeygen — Function
natkeygen(key::String)
Generate a sorting key for key that when used with sort will put the collection in "natural order".
julia> natkeygen.(["A1", "A10", "A02", "A1.5"])
4-element Vector{Vector{String}}:
["a", "0\x01"]
["a", "0\n"]
["a", "0\x02"]
["a", "0\x015"]
julia> sort(["A1", "A10", "A02", "A1.5"], by=natkeygen)
4-element Vector{String}:
"A1"
"A1.5"
"A02"
"A10"DataToolkitCore.stringdist — Functionstringdist(a::AbstractString, b::AbstractString; halfcase::Bool=false)Calculate the Restricted Damerau-Levenshtein distance (aka. Optimal String Alignment) between a and b.
This is the minimum number of edits required to transform a to b, where each edit is a deletion, insertion, substitution, or transposition of a character, with the restriction that no substring is edited more than once.
When halfcase is true, substitutions that just switch the case of a character cost half as much.
Examples
julia> stringdist("The quick brown fox jumps over the lazy dog",
"The quack borwn fox leaps ovver the lzy dog")
7
julia> stringdist("typo", "tpyo")
1
julia> stringdist("frog", "cat")
4
julia> stringdist("Thing", "thing", halfcase=true)
0.5
DataToolkitCore.stringsimilarity — Function
stringsimilarity(a::AbstractString, b::AbstractString; halfcase::Bool=false)
Return the stringdist as a proportion of the maximum length of a and b, subtracted from one. When halfcase is true, case switches cost half as much.
Example
julia> stringsimilarity("same", "same")
1.0
julia> stringsimilarity("semi", "demi")
0.75
julia> stringsimilarity("Same", "same", halfcase=true)
0.875
DataToolkitCore.issubseq — Function
issubseq(a, b)
Return true if a is a subsequence of b, false otherwise.
Examples
julia> issubseq("abc", "abc")
true
julia> issubseq("adg", "abcdefg")
true
julia> issubseq("gda", "abcdefg")
false
DataToolkitCore.longest_common_subsequence — Function
longest_common_subsequence(a, b)
Find the longest common subsequence of b within a, returning the indices of a that comprise the subsequence.
This function is intended for strings, but will work for any indexable objects with == equality defined for their elements.
Example
julia> longest_common_subsequence("same", "same")
4-element Vector{Int64}:
1
2
3
4
julia> longest_common_subsequence("fooandbar", "foobar")
6-element Vector{Int64}:
1
2
3
7
8
9
DataToolkitCore.highlight_lcs — Function
highlight_lcs(io::IO, a::String, b::String;
              before::String="\e[1m", after::String="\e[22m",
              invert::Bool=false)
Print a, highlighting the longest common subsequence between a and b by inserting before prior to each subsequence region and after afterwards.
If invert is set, the before/after behaviour is switched.
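A small usage sketch with the default before/after escape codes (hypothetical REPL session):
julia> buf = IOBuffer();
julia> highlight_lcs(buf, "abc", "b");
julia> String(take!(buf))
"a\e[1mb\e[22mc"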
Small dicts
DataToolkitCore.newdict — Function
newdict(K::Type, V::Type, capacity::Int) -> Dict{K, V}
Create a new Dict{K, V} sized to hold capacity elements, hopefully without resizing. Depending on the particular value of capacity and the Julia version, this can result in substantial memory savings for small dictionaries.
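For instance, the result is simply an empty Dict with the requested key and value types (hypothetical REPL session):
julia> newdict(String, Int, 3)
Dict{String, Int64}()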
DataToolkitCore.shrinkdict — Function
shrinkdict(dict::Dict) -> Dict
If dict looks like it may be smaller if reconstructed using newdict, do so.
Logging
DataToolkitCore.should_log — Function
should_log(category::String) -> Bool
Determine whether a message should be logged based on its category.
The category string can contain any number of subcategories separated by colons. If any parent category is enabled, the subcategory is also enabled.
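For example, assuming the parent category "load" has been enabled (the mechanism for enabling categories is not covered here), one would expect:
julia> should_log("load:progress")
true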
DataToolkitCore.wait_maybe_log — Function
wait_maybe_log(category::String, message::AbstractString; mod::Module, file::String, line::Int) -> Timer
Wait for a delay before logging message with category if should_log(category).
The log is produced with metadata from mod, file, and line.
Miscellaneous
DataToolkitCore.typeify — Function
typeify(qt::QualifiedType; mod::Module=Main)
Convert qt to a Type available in mod, if possible.
If this cannot be done an ImpossibleTypeException is thrown.
DataToolkitCore.trytypeify — Function
trytypeify(qt::QualifiedType; mod::Module=Main)
Convert qt to a Type available in mod, if possible.
If this cannot be done the value nothing is returned instead.
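Combined with the QualifiedType(::Type) constructor mentioned under QUALIFIED_TYPE_CACHE below, a round trip should recover the original type (hypothetical REPL session):
julia> trytypeify(QualifiedType(String))
String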
DataToolkitCore.reinit! — Function
reinit!(dta::AdviceAmalgamation)
Check that dta is well initialised before using it.
This does nothing if dta.plugins_wanted is the same as dta.plugins_used.
When they differ, it re-builds the advisors function list based on the currently available plugins, and updates dta.plugins_used.
DataToolkitCore._dataadvisecall — Function
_dataadvisecall([noassert], func::Function, args...; kwargs...)
Identify the first data-like argument of args (i.e. a DataCollection, DataSet, or DataTransformer), obtain its advice, and perform an advised call of func(args...; kwargs...).
Unless an initial noassert argument of Val{:noassert} is provided, it is assumed (and asserted) that the advised function will have the same return type as the original.
DataToolkitCore.strip_stacktrace_advice! — Function
strip_stacktrace_advice!(st::Vector{Base.StackTraces.StackFrame})
Remove stack frames related to @advise and invokepkglatest from st.
DataToolkitCore.get_package — Function
get_package(pkg::Base.PkgId)
get_package(from::Module, name::Symbol)
Obtain a module specified by either pkg or identified by name and declared by from. Should the package not be currently loaded DataToolkit will attempt to lazy-load the package and return its module.
Failure to either locate name or require pkg will result in an exception being thrown.
See also: @require, @addpkg, try_install_pkg.
DataToolkitCore.try_install_pkg — Function
try_install_pkg(pkg::Base.PkgId)
Attempt to install the package identified by pkg if it is not currently installed.
This function is called automatically by get_package if the package is not currently loaded, and calls Pkg's try_prompt_pkg_add method from its REPLExt package extension. If the REPL has not been loaded, nothing will be done.
DataToolkitCore.read1 — Function
read1(dataset::DataSet, as::Type)
The advisable implementation of read(dataset::DataSet, as::Type), which see.
This is essentially an exercise in useful indirection.
Consts
DataToolkitCore.LATEST_DATA_CONFIG_VERSION — Constant
The DataCollection.version set on all created DataCollections, and assumed when reading any Data.toml files which do not set data_config_version.
DataToolkitCore.PLUGINS_DOCUMENTATION — Constant
A mapping from Plugin names to the documentation of said plugin.
DataToolkitCore.DEFAULT_PLUGINS — Constant
The set of plugins (by name) that should be used by default when creating a new data collection.
DataToolkitCore.PLUGINS — Constant
The set of plugins currently available.
DataToolkitCore.TRANSFORMER_DOCUMENTATION — Constant
List of (category::Symbol, name::Symbol) => docs::Any forms.
category can be :storage, :loader, or :writer.
DataToolkitCore.EXTRA_PACKAGES — Constant
The set of packages loaded by each module via @addpkg, for import with @require.
More specifically, when a module M invokes @addpkg pkg id then EXTRA_PACKAGES[M][pkg] = id is set, and then this information is used with @require to obtain the package from the root module.
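A hedged sketch of this registration flow, using an illustrative package and UUID:
module MyModule
using DataToolkitCore
function __init__()
    @addpkg CSV "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"  # illustrative: register CSV for later use via @require CSV
end
end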
DataToolkitCore.DEFAULT_DATATRANSFORMER_PRIORITY — Constant
The default priority field value for instances of DataTransformer.
DataToolkitCore.DEFAULT_DATA_ADVISOR_PRIORITY — Constant
The default priority field value for Advices.
DataToolkitCore.DATASET_REFERENCE_WRAPPER — Constant
A tuple of delimiters defining a dataset reference. For example, if set to ("{", "}") then {abc} would be recognised as a dataset reference for abc.
DataToolkitCore.DATASET_REFERENCE_REGEX — Constant
A regex which matches dataset references. This is constructed from DATASET_REFERENCE_WRAPPER.
DataToolkitCore.QUALIFIED_TYPE_CACHE — Constant
A cache of QualifiedType instances, indexed by the type they represent.
While one would hope that QualifiedType(::Type) calls would be constant-folded, in practice this is not the case, and so this cache is used to avoid an unfortunate large performance hit when constructing many QualifiedType instances.
DataToolkitCore.DATA_CONFIG_RESERVED_ATTRIBUTES — Constant
The data specification TOML format constructs a DataCollection, which itself contains DataSets, comprised of metadata and DataTransformers.
DataCollection
├─ DataSet
│ ├─ DataTransformer
│ └─ DataTransformer
├─ DataSet
⋮
Within each scope, there are certain reserved attributes. They are listed in this Dict under the following keys:
- :collection for DataCollection
- :dataset for DataSet
- :transformer for DataTransformer
DataToolkitCore.DATA_CONFIG_KEY_SORT_MAPPING — Constant
When writing a data configuration TOML file, the keys are (recursively) sorted. Some keys are particularly important though, and so to ensure they are placed higher a mapping from such keys to a higher sort priority string can be registered here.
For example, "config" => "\0x01" ensures that the special configuration section is placed before all of the data sets.
This can cause odd behaviour if somebody gives a dataset the same name as a special key, but frankly that would be a bit silly (given the key names, e.g. "uuid") and so this is of minimal concern.
DataToolkitCore.LINT_SEVERITY_MESSAGES — Constant
A mapping from severity numbers (see LINT_SEVERITY_MAPPING) to a tuple giving the color the message should be accented with and the severity title string.
DataToolkitCore.LINT_SEVERITY_MAPPING — Constant
A mapping from severity symbols to integers.
This is used to assist with more readable construction of LintItems.
See also: LINT_SEVERITY_MESSAGES for the reverse mapping of integer to severity title string.
DataToolkitCore.DEFAULT_LOG_DELAY — Constant
The delay in seconds before a log message is displayed.
When zero or less, log messages are displayed immediately.
DataToolkitCore.SIMPLIFY_STACKTRACES — Constant
Whether stacktraces should be simplified by removing likely incidental DataToolkit-internals frames when displayed.