Internals

Type transformations

DataToolkitCore.issubtype (Function)
issubtype(X::Type, T::Union{Type, TypeVar})
issubtype(x::X, T::Union{Type, TypeVar})

Check if X is indeed a subtype of T.

This is a tweaked version of isa that can (mostly) handle TypeVar instances.
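As a rough sketch of the behaviour (an illustrative reimplementation, not the package's own code), a TypeVar can be handled by comparing against its upper bound:

```julia
# Illustrative sketch only: a plain subtype test, extended so that a
# TypeVar is compared via its upper bound (its `ub` field).
issubtype_sketch(X::Type, T::Type) = X <: T
issubtype_sketch(X::Type, T::TypeVar) = X <: T.ub
issubtype_sketch(x, T::Union{Type, TypeVar}) = issubtype_sketch(typeof(x), T)
```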

source
DataToolkitCore.paramtypebound (Function)
paramtypebound(T::Union{Type, TypeVar}, Tparam::Union{Type, TypeVar}, paramT::Type)

Return the Type that bounds T.

This is simply T when T isa Type, but T may also be a TypeVar that is parameterised by Tparam. In this case, the Type that T is parameterised by is returned, which is taken to be paramT.

Given a type T that may be parameterised according to Tparam:

julia> paramtypebound(String, IO, IO)
String

julia> T = TypeVar(:T)
T

julia> paramtypebound(T, Type{T}, Float64)
Float64
source
DataToolkitCore.targettypes (Function)
targettypes(types::Vector{QualifiedType}, desired::Type) -> Vector{Type}
targettypes(transformer::DataTransformer, desired::Type) -> Vector{Type}

Return all Types that one might hope to produce from types or transformer.

More specifically, this will give all Types that can be produced which are a subtype of desired, and desired itself.

Priority order is preserved.

source
DataToolkitCore.ispreferredpath (Function)
ispreferredpath(a, b)

Compare two "type paths" a and b, returning whether a is preferred.

Each "type path" is a tuple of the form:

(Tin::Type => Tout::Type, index::Int, transformer::Type{<:DataTransformer})

This operates on the following rules:

  1. The path with the lower index is preferred.
  2. If the indices are equal, the path with the more specific output type is preferred.
  3. If the output types are equally specific, the path with the more specific loader is preferred.
  4. If the loaders are equally specific, the more similar data transformation (Tin => Tout) is preferred.
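The first three rules might be sketched like so (a hypothetical illustration, omitting rule 4, whose tie-breaking the actual implementation handles; the tuple elements follow the layout above):

```julia
# Hypothetical sketch of rules 1-3. `a` and `b` are
# (Tin => Tout, index, transformer-type) tuples.
function ispreferredpath_sketch(a, b)
    (pa, ia, Ta), (pb, ib, Tb) = a, b
    if ia != ib
        ia < ib                    # rule 1: lower index wins
    elseif pa.second != pb.second
        pa.second <: pb.second     # rule 2: more specific output type wins
    else
        Ta <: Tb                   # rule 3: more specific transformer wins
    end
end
```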
source
DataToolkitCore.transformersigs (Function)
transformersigs(transformer::Type{<:DataTransformer}, desired::Type)

Return processed signatures of the transformation methods implemented for transformer that could produce/provide a subtype of desired.

  • DataStorage produces tuples of (Type{<:DataStorage}, Type{out})
  • DataLoader produces tuples of (Type{<:DataLoader}, Type{in}, Type{out})
  • DataWriter produces tuples of (Type{<:DataWriter}, Type{in}, Type{data})

The DataStorage method takes a write::Bool keyword argument.

source
DataToolkitCore.typesteps (Function)
typesteps(loader::DataLoader, desired::Type) -> Vector{Pair{Type, Type}}

Identify and order all uses of loader that may produce a subtype of desired.

More specifically, this finds all load methods that can produce a subtype of desired, checks what input and output types they work with, and orders them according to the declared types of loader and the specificity of the output types (more specific is interpreted as better).

The output vector gives the step-change in the type domain that each method performs.

source
DataToolkitCore.tomlreformat! (Function)
tomlreformat!(io::IO)

Consume io representing a TOML file, and reformat it to improve readability. Currently this takes the form of the following changes:

  • Replace inline multi-line strings with multi-line TOML strings.
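As an illustrative sketch (not actual output of the function), a change of this kind would turn an inline entry like

```toml
abstract = "Line one.\nLine two.\n"
```

into the more readable multi-line form:

```toml
abstract = """
Line one.
Line two.
"""
```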

An IOBuffer containing the reformatted content is returned.

The processing assumes that io contains TOML.print-formatted content. Should this not be the case, mangled TOML may be emitted.

source
DataToolkitCore.dataset_parameters (Function)
dataset_parameters(source::Union{DataCollection, DataSet, DataTransformer},
                   action::Val{:extract|:resolve|:encode}, value::Any)

Obtain a form (depending on action) of value, a property within source.

Actions

:extract Look for DataSet references ("📇DATASET<<…>>") within value, and turn them into Identifiers (the inverse of :encode).

:resolve Look for Identifiers in value, and resolve them to the referenced DataSet/value.

:encode Look for Identifiers in value, and turn them into DataSet references (the inverse of :extract).

source

String utils

DataToolkitCore.natkeygen (Function)
natkeygen(key::String)

Generate a sorting key for key that, when used with sort, will put the collection in "natural order".

julia> natkeygen.(["A1", "A10", "A02", "A1.5"])
4-element Vector{Vector{String}}:
 ["a", "0\x01"]
 ["a", "0\n"]
 ["a", "0\x02"]
 ["a", "0\x015"]

julia> sort(["A1", "A10", "A02", "A1.5"], by=natkeygen)
4-element Vector{String}:
 "A1"
 "A1.5"
 "A02"
 "A10"
source
DataToolkitCore.stringdist (Function)
stringdist(a::AbstractString, b::AbstractString; halfcase::Bool=false)

Calculate the Restricted Damerau-Levenshtein distance (aka. Optimal String Alignment) between a and b.

This is the minimum number of edits required to transform a to b, where each edit is a deletion, insertion, substitution, or transposition of a character, with the restriction that no substring is edited more than once.

When halfcase is true, substitutions that just switch the case of a character cost half as much.

Examples

julia> stringdist("The quick brown fox jumps over the lazy dog",
                  "The quack borwn fox leaps ovver the lzy dog")
7

julia> stringdist("typo", "tpyo")
1

julia> stringdist("frog", "cat")
4

julia> stringdist("Thing", "thing", halfcase=true)
0.5
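The edit rules above correspond to the textbook dynamic-programming recurrence; a self-contained sketch (without the halfcase refinement or the package's optimisations) looks like:

```julia
# Textbook Optimal String Alignment distance, as an illustrative sketch.
function osa_sketch(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    d = Matrix{Int}(undef, m + 1, n + 1)
    d[:, 1] = 0:m  # distance from the empty prefix of b
    d[1, :] = 0:n  # distance from the empty prefix of a
    for i in 1:m, j in 1:n
        cost = s[i] == t[j] ? 0 : 1
        d[i+1, j+1] = min(d[i, j+1] + 1,   # deletion
                          d[i+1, j] + 1,   # insertion
                          d[i, j] + cost)  # substitution
        if i > 1 && j > 1 && s[i] == t[j-1] && s[i-1] == t[j]
            d[i+1, j+1] = min(d[i+1, j+1], d[i-1, j-1] + 1)  # transposition
        end
    end
    d[m+1, n+1]
end
```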
source
DataToolkitCore.stringsimilarity (Function)
stringsimilarity(a::AbstractString, b::AbstractString; halfcase::Bool=false)

Return one minus the stringdist of a and b as a proportion of the maximum length of the two, i.e. 1 - stringdist(a, b) / max(length(a), length(b)). When halfcase is true, case switches cost half as much.

Example

julia> stringsimilarity("same", "same")
1.0

julia> stringsimilarity("semi", "demi")
0.75

julia> stringsimilarity("Same", "same", halfcase=true)
0.875
source
DataToolkitCore.issubseq (Function)
issubseq(a, b)

Return true if a is a subsequence of b, false otherwise.

Examples

julia> issubseq("abc", "abc")
true

julia> issubseq("adg", "abcdefg")
true

julia> issubseq("gda", "abcdefg")
false
source
DataToolkitCore.longest_common_subsequence (Function)
longest_common_subsequence(a, b)

Find the longest common subsequence of b within a, returning the indices of a that comprise the subsequence.

This function is intended for strings, but will work for any indexable objects with == equality defined for their elements.

Example

julia> longest_common_subsequence("same", "same")
4-element Vector{Int64}:
 1
 2
 3
 4

julia> longest_common_subsequence("fooandbar", "foobar")
6-element Vector{Int64}:
 1
 2
 3
 7
 8
 9
source
DataToolkitCore.highlight_lcs (Function)
highlight_lcs(io::IO, a::String, b::String;
              before::String="\e[1m", after::String="\e[22m",
              invert::Bool=false)

Print a, highlighting the longest common subsequence between a and b by inserting before prior to each subsequence region and after afterwards.

If invert is set, the before/after behaviour is switched.
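A rough sketch of the effect, using a simple greedy in-order character match in place of the package's true longest-common-subsequence computation (names here are illustrative):

```julia
# Illustrative sketch: print `a`, wrapping runs of characters that match
# `b` in order with the `before`/`after` escape sequences.
function highlight_sketch(io::IO, a::String, b::String;
                          before::String = "\e[1m", after::String = "\e[22m")
    bchars = collect(b)
    j = 1           # next position in `b` to match against
    inrun = false   # currently inside a highlighted region?
    for c in a
        k = findnext(==(c), bchars, j)
        matched = !isnothing(k)
        matched && (j = k + 1)
        if matched && !inrun
            print(io, before); inrun = true
        elseif !matched && inrun
            print(io, after); inrun = false
        end
        print(io, c)
    end
    inrun && print(io, after)
end
```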

source

Small dicts

DataToolkitCore.newdict (Function)
newdict(K::Type, V::Type, capacity::Int) -> Dict{K, V}

Create a new Dict{K, V} sized to hold capacity elements, hopefully without resizing. Depending on the particular value of capacity and the Julia version, this can result in substantial memory savings for small dictionaries.
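Much the same effect can be approximated with Base alone via sizehint!; a hypothetical sketch (not the package's implementation, which is version-dependent):

```julia
# Hypothetical equivalent: presize the Dict's slot table up front.
function newdict_sketch(K::Type, V::Type, capacity::Int)
    d = Dict{K, V}()
    sizehint!(d, capacity)  # try to avoid rehashing as entries are added
    d
end

d = newdict_sketch(Symbol, Int, 3)
d[:a], d[:b], d[:c] = 1, 2, 3
```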

source

Logging

DataToolkitCore.should_log (Function)
should_log(category::String) -> Bool

Determine whether a message should be logged based on its category.

The category string can contain any number of subcategories separated by colons. If any parent category is enabled, the subcategory is also enabled.
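This parent-enables-child rule might be sketched as follows (illustrative only; enabled stands in for however the set of enabled categories is actually stored):

```julia
# Illustrative sketch: "a:b:c" is enabled if it, or any parent
# category ("a", "a:b"), appears in the enabled set.
function should_log_sketch(category::String, enabled::Set{String})
    parts = split(category, ':')
    any(join(parts[1:i], ':') in enabled for i in 1:length(parts))
end
```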

source
DataToolkitCore.wait_maybe_log (Function)
wait_maybe_log(category::String, message::AbstractString; mod::Module, file::String, line::Int) -> Timer

Wait for a delay before logging message with category if should_log(category).

The log is produced with metadata from mod, file, and line.

source

Miscellaneous

DataToolkitCore.typeify (Function)
typeify(qt::QualifiedType; mod::Module=Main)

Convert qt to a Type available in mod, if possible. If this cannot be done, nothing is returned instead.

source
DataToolkitCore.reinit (Function)
reinit(dta::AdviceAmalgamation)

Check that dta is well initialised before using it.

This does nothing if dta.plugins_wanted is the same as dta.plugins_used.

When they differ, it rebuilds the advisors function list based on the currently available plugins, and updates dta.plugins_used.

source
DataToolkitCore.get_package (Function)
get_package(pkg::Base.PkgId)
get_package(from::Module, name::Symbol)

Obtain a module specified by either pkg or identified by name and declared by from. Should the package not be currently loaded, DataToolkit will attempt to lazy-load the package and return its module.

Failure to either locate name or require pkg will result in an exception being thrown.

See also: @require, @addpkg, try_install_pkg.

source
DataToolkitCore.try_install_pkg (Function)
try_install_pkg(pkg::Base.PkgId)

Attempt to install the package identified by pkg if it is not currently installed.

This function is called automatically by get_package if the package is not currently loaded, and calls Pkg's try_prompt_pkg_add method from its REPLExt package extension. If the REPL has not been loaded, nothing will be done.

source
DataToolkitCore.read1 (Function)
read1(dataset::DataSet, as::Type)

The advisable implementation of read(dataset::DataSet, as::Type), which see.

This is essentially an exercise in useful indirection.

source

Consts

DataToolkitCore.EXTRA_PACKAGES (Constant)

The set of packages loaded by each module via @addpkg, for import with @require.

More specifically, when a module M invokes @addpkg pkg id, then EXTRA_PACKAGES[M][pkg] = id is set, and this information is later used by @require to obtain the package from the root module.

source
DataToolkitCore.QUALIFIED_TYPE_CACHE (Constant)
QUALIFIED_TYPE_CACHE

A cache of QualifiedType instances, indexed by the type they represent.

While one would hope that QualifiedType(::Type) calls would be constant-folded, in practice this is not the case, and so this cache is used to avoid an unfortunate large performance hit when constructing many QualifiedType instances.

source
DataToolkitCore.DATA_CONFIG_RESERVED_ATTRIBUTES (Constant)

The data specification TOML format constructs a DataCollection, which itself contains DataSets, comprised of metadata and DataTransformers.

DataCollection
├─ DataSet
│  ├─ DataTransformer
│  └─ DataTransformer
├─ DataSet
⋮

Within each scope, there are certain reserved attributes. They are listed in this Dict under the following keys:

source
DataToolkitCore.DATA_CONFIG_KEY_SORT_MAPPING (Constant)

When writing a data configuration TOML file, the keys are (recursively) sorted. Some keys are particularly important though, and so to ensure they are placed higher, a mapping from such keys to a higher-priority sort string can be registered here.

For example, "config" => "\0x01" ensures that the special configuration section is placed before all of the data sets.

This can cause odd behaviour if somebody gives a dataset the same name as a special key, but frankly that would be a bit silly (given the key names, e.g. "uuid") and so this is of minimal concern.

source