Internals
Type transformations
DataToolkitCore.issubtype — Function

issubtype(X::Type, T::Union{Type, TypeVar})
issubtype(x::X, T::Union{Type, TypeVar})

Check if X is indeed a subtype of T.

This is a tweaked version of isa that can (mostly) handle TypeVar instances.
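The TypeVar handling can be sketched in a few lines of plain Julia (illustrative names only, not the package's actual code): a TypeVar is compared against its upper bound.

```julia
# Illustrative sketch: a TypeVar is handled by checking against its upper bound.
issubtype_sketch(X::Type, T::Type) = X <: T
issubtype_sketch(X::Type, T::TypeVar) = X <: T.ub

issubtype_sketch(Int, Number)                # true
issubtype_sketch(Int, TypeVar(:T, Real))     # true
issubtype_sketch(String, TypeVar(:T, Real))  # false
```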
DataToolkitCore.paramtypebound — Function

paramtypebound(T::Union{Type, TypeVar}, Tparam::Union{Type, TypeVar}, paramT::Type)

Return the Type that bounds T.

This is simply T when T isa Type, but T may also be a TypeVar parameterised by Tparam. In that case, the Type that T is parameterised by is returned, which is taken to be paramT.

Examples

julia> paramtypebound(String, IO, IO)
String

julia> T = TypeVar(:T)
T

julia> paramtypebound(T, Type{T}, Float64)
Float64
DataToolkitCore.targettypes — Function

targettypes(types::Vector{QualifiedType}, desired::Type) -> Vector{Type}
targettypes(transformer::DataTransformer, desired::Type) -> Vector{Type}

Return all Types that one might hope to produce from types or transformer.

More specifically, this gives all Types that can be produced which are a subtype of desired, plus desired itself. Priority order is preserved.
DataToolkitCore.ispreferredpath — Function

ispreferredpath(a, b)

Compare two "type paths" a and b, returning whether a is preferred.

Each "type path" is a tuple of the form:

(Tin::Type => Tout::Type, index::Int, transformer::Type{<:DataTransformer})

Preference is decided by the following rules:
- The path with the lower index is preferred.
- If the indices are equal, the path with the more specific output type is preferred.
- If the output types are equally specific, the path with the more specific loader is preferred.
- If the loaders are equally specific, the more similar data transformation (Tin => Tout) is preferred.
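As a rough sketch of this rule ordering (illustrative names only; the final "more similar transformation" tie-break is more involved and is elided here), the comparison cascades through the rules in order:

```julia
# Illustrative sketch of the preference cascade; not the package's actual code.
morespecific(a::Type, b::Type) = a <: b && a != b  # e.g. String vs AbstractString

function ispreferredpath_sketch(a, b)
    (apath, aidx, atrf) = a
    (bpath, bidx, btrf) = b
    aidx != bidx && return aidx < bidx                # rule 1: lower index wins
    last(apath) != last(bpath) &&
        return morespecific(last(apath), last(bpath)) # rule 2: more specific output
    atrf != btrf && return morespecific(atrf, btrf)   # rule 3: more specific loader
    false  # rule 4 (Tin => Tout similarity tie-break) elided in this sketch
end

ispreferredpath_sketch((Any => String, 1, Any), (Any => AbstractString, 1, Any))  # true
ispreferredpath_sketch((Any => String, 2, Any), (Any => AbstractString, 1, Any))  # false
```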
DataToolkitCore.transformersigs — Function

transformersigs(transformer::Type{<:DataTransformer}, desired::Type)

Return processed signatures of the transformation methods implemented for transformer that could produce/provide a subtype of desired.

- DataStorage produces tuples of (Type{<:DataStorage}, Type{out})
- DataLoader produces tuples of (Type{<:DataLoader}, Type{in}, Type{out})
- DataWriter produces tuples of (Type{<:DataWriter}, Type{in}, Type{data})

The DataStorage method takes a write::Bool keyword argument.
DataToolkitCore.typesteps — Function

typesteps(loader::DataLoader, desired::Type) -> Vector{Pair{Type, Type}}

Identify and order all uses of loader that may produce a subtype of desired.

More specifically, this finds all load methods that can produce a subtype of desired, checks which input and output types they work with, and orders them according to the declared types of loader and the specificity of the output types (more specific is considered better). The output vector gives the step-change in the type domain that each method performs.
TOML related
DataToolkitCore.toml_safe — Function

toml_safe(value)

Recursively convert value to a form that DataToolkit can safely encode to TOML.
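A hypothetical sketch of such a conversion (an assumption about the general shape, not the package's actual rules): TOML can directly encode strings, integers, floats, booleans, arrays, and tables, so anything else is stringified.

```julia
# Illustrative sketch only. TOML-native values pass through unchanged; containers
# are converted recursively; everything else falls back to its string form.
toml_safe_sketch(v::Union{AbstractString, Integer, AbstractFloat, Bool}) = v
toml_safe_sketch(v::AbstractVector) = map(toml_safe_sketch, v)
toml_safe_sketch(v::AbstractDict) =
    Dict{String, Any}(string(k) => toml_safe_sketch(x) for (k, x) in v)
toml_safe_sketch(v::Any) = string(v)  # fallback: stringify

toml_safe_sketch(Dict(:a => :b, :n => [1, 2]))  # Dict("a" => "b", "n" => [1, 2])
```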
DataToolkitCore.tomlreformat! — Function

tomlreformat!(io::IO)

Consume io representing a TOML file, and reformat it to improve readability. Currently this takes the form of the following changes:
- Replace inline multi-line strings with multi-line TOML strings.

An IOBuffer containing the reformatted content is returned.

The processing assumes that io contains TOML.print-formatted content. Should this not be the case, mangled TOML may be emitted.
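As an illustrative sketch of the multi-line string change (example key name is hypothetical), the two forms below encode the same value:

```toml
# As TOML.print emits it, with embedded escape sequences:
abstract = "First line.\nSecond line.\n"

# After reformatting, as an equivalent multi-line TOML string:
abstract = """
First line.
Second line.
"""
```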
DataToolkitCore.dataset_parameters — Function

dataset_parameters(source::Union{DataCollection, DataSet, DataTransformer},
                   action::Val{:extract|:resolve|:encode}, value::Any)

Obtain a form (depending on action) of value, a property within source.

Actions
- :extract — Look for DataSet references ("📇DATASET<<…>>") within value, and turn them into Identifiers (the inverse of :encode).
- :resolve — Look for Identifiers in value, and resolve them to the referenced DataSet/value.
- :encode — Look for Identifiers in value, and turn them into DataSet references (the inverse of :extract).
String utils
DataToolkitCore.natkeygen — Function

natkeygen(key::String)

Generate a sorting key for key that, when used with sort, will put the collection in "natural order".

julia> natkeygen.(["A1", "A10", "A02", "A1.5"])
4-element Vector{Vector{String}}:
 ["a", "0\x01"]
 ["a", "0\n"]
 ["a", "0\x02"]
 ["a", "0\x015"]

julia> sort(["A1", "A10", "A02", "A1.5"], by=natkeygen)
4-element Vector{String}:
 "A1"
 "A1.5"
 "A02"
 "A10"
DataToolkitCore.stringdist — Function

stringdist(a::AbstractString, b::AbstractString; halfcase::Bool=false)

Calculate the Restricted Damerau-Levenshtein distance (a.k.a. Optimal String Alignment) between a and b.

This is the minimum number of edits required to transform a into b, where each edit is a deletion, insertion, substitution, or transposition of a character, with the restriction that no substring is edited more than once.

When halfcase is true, substitutions that just switch the case of a character cost half as much.

Examples

julia> stringdist("The quick brown fox jumps over the lazy dog",
                  "The quack borwn fox leaps ovver the lzy dog")
7

julia> stringdist("typo", "tpyo")
1

julia> stringdist("frog", "cat")
4

julia> stringdist("Thing", "thing", halfcase=true)
0.5
DataToolkitCore.stringsimilarity — Function

stringsimilarity(a::AbstractString, b::AbstractString; halfcase::Bool=false)

Return the similarity of a and b, computed as one minus their stringdist divided by the maximum of their lengths. When halfcase is true, case switches cost half as much.

Examples

julia> stringsimilarity("same", "same")
1.0

julia> stringsimilarity("semi", "demi")
0.75

julia> stringsimilarity("Same", "same", halfcase=true)
0.875
DataToolkitCore.issubseq — Function

issubseq(a, b)

Return true if a is a subsequence of b, false otherwise.

Examples

julia> issubseq("abc", "abc")
true

julia> issubseq("adg", "abcdefg")
true

julia> issubseq("gda", "abcdefg")
false
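A minimal subsequence check can be sketched as a single left-to-right scan (illustrative, not the package's implementation): consume each character of a in order, searching forward through b.

```julia
# Illustrative sketch: each character of `a` must be found in `b`, in order,
# strictly after where the previous character was found.
function issubseq_sketch(a::AbstractString, b::AbstractString)
    i = firstindex(b)
    for ch in a
        found = findnext(==(ch), b, i)
        found === nothing && return false
        i = nextind(b, found)  # resume the scan after the match
    end
    true
end

issubseq_sketch("adg", "abcdefg")  # true
issubseq_sketch("gda", "abcdefg")  # false
```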
DataToolkitCore.longest_common_subsequence — Function

longest_common_subsequence(a, b)

Find the longest common subsequence of b within a, returning the indices of a that comprise the subsequence.

This function is intended for strings, but will work for any indexable objects with == equality defined for their elements.

Examples

julia> longest_common_subsequence("same", "same")
4-element Vector{Int64}:
 1
 2
 3
 4

julia> longest_common_subsequence("fooandbar", "foobar")
6-element Vector{Int64}:
 1
 2
 3
 7
 8
 9
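A self-contained dynamic-programming sketch that reproduces the example above (illustrative names; not the package's implementation, and it assumes ASCII strings for byte indexing):

```julia
# Illustrative O(mn) dynamic-programming LCS with index recovery (ASCII assumed).
function lcs_indices(a::String, b::String)
    m, n = length(a), length(b)
    dp = zeros(Int, m + 1, n + 1)  # dp[i+1, j+1] = LCS length of a[1:i] and b[1:j]
    for i in 1:m, j in 1:n
        dp[i+1, j+1] = a[i] == b[j] ? dp[i, j] + 1 : max(dp[i, j+1], dp[i+1, j])
    end
    indices = Int[]
    i, j = m, n
    while i > 0 && j > 0  # trace back through the table
        if a[i] == b[j]
            pushfirst!(indices, i)
            i -= 1; j -= 1
        elseif dp[i, j+1] >= dp[i+1, j]
            i -= 1
        else
            j -= 1
        end
    end
    indices
end

lcs_indices("fooandbar", "foobar")  # [1, 2, 3, 7, 8, 9]
```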
DataToolkitCore.highlight_lcs — Function

highlight_lcs(io::IO, a::String, b::String;
              before::String="\e[1m", after::String="\e[22m",
              invert::Bool=false)

Print a, highlighting the longest common subsequence between a and b by inserting before prior to each subsequence region and after afterwards.

If invert is set, the before/after behaviour is switched.
Small dicts

DataToolkitCore.newdict — Function

newdict(K::Type, V::Type, capacity::Int) -> Dict{K, V}

Create a new Dict{K, V} sized to hold capacity elements, hopefully without resizing. Depending on the particular value of capacity and the Julia version, this can result in substantial memory savings for small dictionaries.
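One plausible implementation strategy (an assumption about the approach, not the actual code) is to pre-size the Dict with Base's sizehint!, so that small dictionaries avoid later rehashing:

```julia
# Illustrative sketch: construct the Dict, then hint its expected capacity.
function newdict_sketch(K::Type, V::Type, capacity::Int)
    d = Dict{K, V}()
    sizehint!(d, capacity)
    return d
end

d = newdict_sketch(Symbol, Int, 3)
d[:a] = 1
```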
DataToolkitCore.shrinkdict — Function

shrinkdict(dict::Dict) -> Dict

If dict looks like it may be smaller when reconstructed using newdict, do so.
Logging

DataToolkitCore.should_log — Function

should_log(category::String) -> Bool

Determine whether a message should be logged based on its category.

The category string can contain any number of subcategories separated by colons. If any parent category is enabled, the subcategory is also enabled.
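The parent-category rule can be sketched as a prefix check over the colon-separated parts (assumed semantics, illustrative names):

```julia
# Illustrative sketch: a category is enabled if it, or any colon-separated
# parent prefix of it, appears in the enabled set.
function category_enabled(enabled, category::String)
    parts = split(category, ':')
    any(join(parts[1:i], ':') in enabled for i in 1:length(parts))
end

category_enabled(["load"], "load:progress")  # true: parent "load" is enabled
category_enabled(["load:progress"], "load")  # false: only the subcategory is enabled
```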
DataToolkitCore.wait_maybe_log — Function

wait_maybe_log(category::String, message::AbstractString; mod::Module, file::String, line::Int) -> Timer

Wait for a delay before logging message with category, if should_log(category).

The log is produced with metadata from mod, file, and line.
Miscellaneous

DataToolkitCore.typeify — Function

typeify(qt::QualifiedType; mod::Module=Main)

Convert qt to a Type available in mod, if possible. If this cannot be done, nothing is returned instead.
DataToolkitCore.reinit — Function

reinit(dta::AdviceAmalgamation)

Check that dta is well initialised before using it.

This does nothing if dta.plugins_wanted is the same as dta.plugins_used. When they differ, the advisor function list is rebuilt based on the currently available plugins, and dta.plugins_used is updated.
DataToolkitCore._dataadvisecall — Function

_dataadvisecall(func::Function, args...; kwargs...)

Identify the first data-like argument of args (i.e. a DataCollection, DataSet, or DataTransformer), obtain its advice, and perform an advised call of func(args...; kwargs...).
DataToolkitCore.strip_stacktrace_advice! — Function

strip_stacktrace_advice!(st::Vector{Base.StackTraces.StackFrame})

Remove stack frames related to @advise and invokepkglatest from st.
DataToolkitCore.get_package — Function

get_package(pkg::Base.PkgId)
get_package(from::Module, name::Symbol)

Obtain the module specified by pkg, or identified by name and declared by from. Should the package not be currently loaded, DataToolkit will attempt to lazy-load the package and return its module.

Failure to either locate name or require pkg will result in an exception being thrown.

See also: @require, @addpkg, try_install_pkg.
DataToolkitCore.try_install_pkg — Function

try_install_pkg(pkg::Base.PkgId)

Attempt to install the package identified by pkg if it is not currently installed.

This function is called automatically by get_package if the package is not currently loaded, and calls Pkg's try_prompt_pkg_add method from its REPLExt package extension. If the REPL has not been loaded, nothing will be done.
DataToolkitCore.read1 — Function

read1(dataset::DataSet, as::Type)

The advisable implementation of read(dataset::DataSet, as::Type), which see.

This is essentially an exercise in useful indirection.
Consts

DataToolkitCore.LATEST_DATA_CONFIG_VERSION — Constant

The DataCollection.version set on all created DataCollections, and assumed when reading any Data.toml files which do not set data_config_version.

DataToolkitCore.PLUGINS_DOCUMENTATION — Constant

A mapping from Plugin names to the documentation of said plugin.

DataToolkitCore.DEFAULT_PLUGINS — Constant

The set of plugins (by name) that should be used by default when creating a new data collection.

DataToolkitCore.PLUGINS — Constant

The set of plugins currently available.
DataToolkitCore.TRANSFORMER_DOCUMENTATION — Constant

List of (category::Symbol, named::Symbol) => docs::Any forms.

category can be :storage, :loader, or :writer.

DataToolkitCore.EXTRA_PACKAGES — Constant

The set of packages loaded by each module via @addpkg, for import with @require.

More specifically, when a module M invokes @addpkg pkg id then EXTRA_PACKAGES[M][pkg] = id is set, and this information is later used with @require to obtain the package from the root module.
DataToolkitCore.DEFAULT_DATATRANSFORMER_PRIORITY — Constant

The default priority field value for instances of DataTransformer.

DataToolkitCore.DEFAULT_DATA_ADVISOR_PRIORITY — Constant

The default priority field value for Advices.

DataToolkitCore.DATASET_REFERENCE_WRAPPER — Constant

A tuple of delimiters defining a dataset reference. For example, if set to ("{", "}") then {abc} would be recognised as a dataset reference for abc.

DataToolkitCore.DATASET_REFERENCE_REGEX — Constant

A regex which matches dataset references. This is constructed from DATASET_REFERENCE_WRAPPER.
DataToolkitCore.QUALIFIED_TYPE_CACHE — Constant

A cache of QualifiedType instances, indexed by the type they represent.

While one would hope that QualifiedType(::Type) calls would be constant-folded, in practice this is not the case, and so this cache is used to avoid an unfortunately large performance hit when constructing many QualifiedType instances.
DataToolkitCore.DATA_CONFIG_RESERVED_ATTRIBUTES — Constant

The data specification TOML format constructs a DataCollection, which itself contains DataSets, comprised of metadata and DataTransformers.

DataCollection
├─ DataSet
│  ├─ DataTransformer
│  └─ DataTransformer
├─ DataSet
⋮

Within each scope, there are certain reserved attributes. They are listed in this Dict under the following keys:
- :collection for DataCollection
- :dataset for DataSet
- :transformer for DataTransformer
DataToolkitCore.DATA_CONFIG_KEY_SORT_MAPPING — Constant

When writing a data configuration TOML file, the keys are (recursively) sorted. Some keys are particularly important though, and so to ensure they are placed higher, mappings from such keys to higher-sorting priority strings can be registered here.

For example, "config" => "\0x01" ensures that the special configuration section is placed before all of the data sets.

This can cause odd behaviour if somebody gives a dataset the same name as a special key, but frankly that would be a bit silly (given the key names, e.g. "uuid") and so this is of minimal concern.
DataToolkitCore.LINT_SEVERITY_MESSAGES — Constant

A mapping from severity numbers (see LINT_SEVERITY_MAPPING) to a tuple giving the color the message should be accented with and the severity title string.

DataToolkitCore.LINT_SEVERITY_MAPPING — Constant

A mapping from severity symbols to integers.

This is used to assist with more readable construction of LintItems.

See also: LINT_SEVERITY_MESSAGES for the reverse mapping of integer to severity title string.
DataToolkitCore.DEFAULT_LOG_DELAY — Constant

The delay in seconds before a log message is displayed. When zero or less, log messages are displayed immediately.

DataToolkitCore.SIMPLIFY_STACKTRACES — Constant

Whether stacktraces should be simplified by removing likely-incidental DataToolkit-internals frames when displayed.