Internals
Type transformations
DataToolkitCore.issubtype — Function

issubtype(X::Type, T::Union{Type, TypeVar})
issubtype(x::X, T::Union{Type, TypeVar})

Check if X is indeed a subtype of T.

This is a tweaked version of isa that can (mostly) handle TypeVar instances.
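The TypeVar handling can be sketched in a few lines of plain Julia (illustrative names only, not the package's actual code): a TypeVar is compared against its upper bound.

```julia
# Illustrative sketch: a TypeVar is handled by checking against its upper bound.
issubtype_sketch(X::Type, T::Type) = X <: T
issubtype_sketch(X::Type, T::TypeVar) = X <: T.ub

issubtype_sketch(Int, Number)                # true
issubtype_sketch(Int, TypeVar(:T, Real))     # true
issubtype_sketch(String, TypeVar(:T, Real))  # false
```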
DataToolkitCore.paramtypebound — Function

paramtypebound(T::Union{Type, TypeVar}, Tparam::Union{Type, TypeVar}, paramT::Type)

Return the Type that bounds T.

This is simply T when T isa Type, but T may also be a TypeVar parameterised by Tparam. In that case, the Type that T is parameterised by is returned, which is taken to be paramT.

Examples

julia> paramtypebound(String, IO, IO)
String

julia> T = TypeVar(:T)
T

julia> paramtypebound(T, Type{T}, Float64)
Float64
DataToolkitCore.targettypes — Function

targettypes(types::Vector{QualifiedType}, desired::Type) -> Vector{Type}
targettypes(transformer::DataTransformer, desired::Type) -> Vector{Type}

Return all Types that one might hope to produce from types or transformer.

More specifically, this gives all Types that can be produced which are a subtype of desired, plus desired itself. Priority order is preserved.
DataToolkitCore.ispreferredpath — Function

ispreferredpath(a, b)

Compare two "type paths" a and b, returning whether a is preferred.

Each "type path" is a tuple of the form:

(Tin::Type => Tout::Type, index::Int, transformer::Type{<:DataTransformer})

Preference is decided by the following rules:
- The path with the lower index is preferred.
- If the indices are equal, the path with the more specific output type is preferred.
- If the output types are equally specific, the path with the more specific loader is preferred.
- If the loaders are equally specific, the more similar data transformation (Tin => Tout) is preferred.
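As a rough sketch of this rule ordering (illustrative names only; the final "more similar transformation" tie-break is more involved and is elided here), the comparison cascades through the rules in order:

```julia
# Illustrative sketch of the preference cascade; not the package's actual code.
morespecific(a::Type, b::Type) = a <: b && a != b  # e.g. String vs AbstractString

function ispreferredpath_sketch(a, b)
    (apath, aidx, atrf) = a
    (bpath, bidx, btrf) = b
    aidx != bidx && return aidx < bidx                # rule 1: lower index wins
    last(apath) != last(bpath) &&
        return morespecific(last(apath), last(bpath)) # rule 2: more specific output
    atrf != btrf && return morespecific(atrf, btrf)   # rule 3: more specific loader
    false  # rule 4 (Tin => Tout similarity tie-break) elided in this sketch
end

ispreferredpath_sketch((Any => String, 1, Any), (Any => AbstractString, 1, Any))  # true
ispreferredpath_sketch((Any => String, 2, Any), (Any => AbstractString, 1, Any))  # false
```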
DataToolkitCore.transformersigs — Function

transformersigs(transformer::Type{<:DataTransformer}, desired::Type)

Return processed signatures of the transformation methods implemented for transformer that could produce/provide a subtype of desired.

- DataStorage produces tuples of (Type{<:DataStorage}, Type{out})
- DataLoader produces tuples of (Type{<:DataLoader}, Type{in}, Type{out})
- DataWriter produces tuples of (Type{<:DataWriter}, Type{in}, Type{data})

The DataStorage method takes a write::Bool keyword argument.
DataToolkitCore.typesteps — Function

typesteps(loader::DataLoader, desired::Type) -> Vector{Pair{Type, Type}}

Identify and order all uses of loader that may produce a subtype of desired.

More specifically, this finds all load methods that can produce a subtype of desired, checks which input and output types they work with, and orders them according to the declared types of loader and the specificity of the output types (more specific is considered better). The output vector gives the step-change in the type domain that each method performs.
TOML related
DataToolkitCore.toml_safe — Function

toml_safe(value)

Recursively convert value to a form that DataToolkit can safely encode to TOML.
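A hypothetical sketch of such a conversion (an assumption about the general shape, not the package's actual rules): TOML can directly encode strings, integers, floats, booleans, arrays, and tables, so anything else is stringified.

```julia
# Illustrative sketch only. TOML-native values pass through unchanged; containers
# are converted recursively; everything else falls back to its string form.
toml_safe_sketch(v::Union{AbstractString, Integer, AbstractFloat, Bool}) = v
toml_safe_sketch(v::AbstractVector) = map(toml_safe_sketch, v)
toml_safe_sketch(v::AbstractDict) =
    Dict{String, Any}(string(k) => toml_safe_sketch(x) for (k, x) in v)
toml_safe_sketch(v::Any) = string(v)  # fallback: stringify

toml_safe_sketch(Dict(:a => :b, :n => [1, 2]))  # Dict("a" => "b", "n" => [1, 2])
```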
DataToolkitCore.tomlreformat! — Function

tomlreformat!(io::IO)

Consume io representing a TOML file, and reformat it to improve readability. Currently this takes the form of the following changes:
- Replace inline multi-line strings with multi-line TOML strings.

An IOBuffer containing the reformatted content is returned.

The processing assumes that io contains TOML.print-formatted content. Should this not be the case, mangled TOML may be emitted.
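As an illustrative sketch of the multi-line string change (example key name is hypothetical), the two forms below encode the same value:

```toml
# As TOML.print emits it, with embedded escape sequences:
abstract = "First line.\nSecond line.\n"

# After reformatting, as an equivalent multi-line TOML string:
abstract = """
First line.
Second line.
"""
```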
DataToolkitCore.dataset_parameters — Function

dataset_parameters(source::Union{DataCollection, DataSet, DataTransformer},
                   action::Val{:extract|:resolve|:encode}, value::Any)

Obtain a form (depending on action) of value, a property within source.

Actions
- :extract — Look for DataSet references ("📇DATASET<<…>>") within value, and turn them into Identifiers (the inverse of :encode).
- :resolve — Look for Identifiers in value, and resolve them to the referenced DataSet/value.
- :encode — Look for Identifiers in value, and turn them into DataSet references (the inverse of :extract).
String utils
DataToolkitCore.natkeygen — Function

natkeygen(key::String)

Generate a sorting key for key that, when used with sort, will put the collection in "natural order".

julia> natkeygen.(["A1", "A10", "A02", "A1.5"])
4-element Vector{Vector{String}}:
 ["a", "0\x01"]
 ["a", "0\n"]
 ["a", "0\x02"]
 ["a", "0\x015"]

julia> sort(["A1", "A10", "A02", "A1.5"], by=natkeygen)
4-element Vector{String}:
 "A1"
 "A1.5"
 "A02"
 "A10"
DataToolkitCore.stringdist — Function

stringdist(a::AbstractString, b::AbstractString; halfcase::Bool=false)

Calculate the Restricted Damerau-Levenshtein distance (a.k.a. Optimal String Alignment) between a and b.

This is the minimum number of edits required to transform a into b, where each edit is a deletion, insertion, substitution, or transposition of a character, with the restriction that no substring is edited more than once.

When halfcase is true, substitutions that just switch the case of a character cost half as much.

Examples

julia> stringdist("The quick brown fox jumps over the lazy dog",
                  "The quack borwn fox leaps ovver the lzy dog")
7

julia> stringdist("typo", "tpyo")
1

julia> stringdist("frog", "cat")
4

julia> stringdist("Thing", "thing", halfcase=true)
0.5
DataToolkitCore.stringsimilarity — Function

stringsimilarity(a::AbstractString, b::AbstractString; halfcase::Bool=false)

Return the similarity of a and b, computed as one minus their stringdist divided by the maximum of their lengths. When halfcase is true, case switches cost half as much.

Examples

julia> stringsimilarity("same", "same")
1.0

julia> stringsimilarity("semi", "demi")
0.75

julia> stringsimilarity("Same", "same", halfcase=true)
0.875
DataToolkitCore.issubseq — Function

issubseq(a, b)

Return true if a is a subsequence of b, false otherwise.

Examples

julia> issubseq("abc", "abc")
true

julia> issubseq("adg", "abcdefg")
true

julia> issubseq("gda", "abcdefg")
false
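A minimal subsequence check can be sketched as a single left-to-right scan (illustrative, not the package's implementation): consume each character of a in order, searching forward through b.

```julia
# Illustrative sketch: each character of `a` must be found in `b`, in order,
# strictly after where the previous character was found.
function issubseq_sketch(a::AbstractString, b::AbstractString)
    i = firstindex(b)
    for ch in a
        found = findnext(==(ch), b, i)
        found === nothing && return false
        i = nextind(b, found)  # resume the scan after the match
    end
    true
end

issubseq_sketch("adg", "abcdefg")  # true
issubseq_sketch("gda", "abcdefg")  # false
```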
DataToolkitCore.longest_common_subsequence — Function

longest_common_subsequence(a, b)

Find the longest common subsequence of b within a, returning the indices of a that comprise the subsequence.

This function is intended for strings, but will work for any indexable objects with == equality defined for their elements.

Examples

julia> longest_common_subsequence("same", "same")
4-element Vector{Int64}:
 1
 2
 3
 4

julia> longest_common_subsequence("fooandbar", "foobar")
6-element Vector{Int64}:
 1
 2
 3
 7
 8
 9
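A self-contained dynamic-programming sketch that reproduces the example above (illustrative names; not the package's implementation, and it assumes ASCII strings for byte indexing):

```julia
# Illustrative O(mn) dynamic-programming LCS with index recovery (ASCII assumed).
function lcs_indices(a::String, b::String)
    m, n = length(a), length(b)
    dp = zeros(Int, m + 1, n + 1)  # dp[i+1, j+1] = LCS length of a[1:i] and b[1:j]
    for i in 1:m, j in 1:n
        dp[i+1, j+1] = a[i] == b[j] ? dp[i, j] + 1 : max(dp[i, j+1], dp[i+1, j])
    end
    indices = Int[]
    i, j = m, n
    while i > 0 && j > 0  # trace back through the table
        if a[i] == b[j]
            pushfirst!(indices, i)
            i -= 1; j -= 1
        elseif dp[i, j+1] >= dp[i+1, j]
            i -= 1
        else
            j -= 1
        end
    end
    indices
end

lcs_indices("fooandbar", "foobar")  # [1, 2, 3, 7, 8, 9]
```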
DataToolkitCore.highlight_lcs — Function

highlight_lcs(io::IO, a::String, b::String;
              before::String="\e[1m", after::String="\e[22m",
              invert::Bool=false)

Print a, highlighting the longest common subsequence between a and b by inserting before prior to each subsequence region and after afterwards.

If invert is set, the before/after behaviour is switched.
Small dicts

DataToolkitCore.newdict — Function

newdict(K::Type, V::Type, capacity::Int) -> Dict{K, V}

Create a new Dict{K, V} sized to hold capacity elements, hopefully without resizing. Depending on the particular value of capacity and the Julia version, this can result in substantial memory savings for small dictionaries.
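One plausible implementation strategy (an assumption about the approach, not the actual code) is to pre-size the Dict with Base's sizehint!, so that small dictionaries avoid later rehashing:

```julia
# Illustrative sketch: construct the Dict, then hint its expected capacity.
function newdict_sketch(K::Type, V::Type, capacity::Int)
    d = Dict{K, V}()
    sizehint!(d, capacity)
    return d
end

d = newdict_sketch(Symbol, Int, 3)
d[:a] = 1
```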
DataToolkitCore.shrinkdict — Function

shrinkdict(dict::Dict) -> Dict

If dict looks like it may be smaller when reconstructed using newdict, do so.
Logging

DataToolkitCore.should_log — Function

should_log(category::String) -> Bool

Determine whether a message should be logged based on its category.

The category string can contain any number of subcategories separated by colons. If any parent category is enabled, the subcategory is also enabled.
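The parent-category rule can be sketched as a prefix check over the colon-separated parts (assumed semantics, illustrative names):

```julia
# Illustrative sketch: a category is enabled if it, or any colon-separated
# parent prefix of it, appears in the enabled set.
function category_enabled(enabled, category::String)
    parts = split(category, ':')
    any(join(parts[1:i], ':') in enabled for i in 1:length(parts))
end

category_enabled(["load"], "load:progress")  # true: parent "load" is enabled
category_enabled(["load:progress"], "load")  # false: only the subcategory is enabled
```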
DataToolkitCore.wait_maybe_log — Function

wait_maybe_log(category::String, message::AbstractString; mod::Module, file::String, line::Int) -> Timer

Wait for a delay before logging message with category, if should_log(category).

The log is produced with metadata from mod, file, and line.
Miscellaneous

DataToolkitCore.typeify — Function

typeify(qt::QualifiedType; mod::Module=Main)

Convert qt to a Type available in mod, if possible. If this cannot be done, nothing is returned instead.
DataToolkitCore.reinit — Function

reinit(dta::AdviceAmalgamation)

Check that dta is well initialised before using it.

This does nothing if dta.plugins_wanted is the same as dta.plugins_used. When they differ, the advisor function list is rebuilt based on the currently available plugins, and dta.plugins_used is updated.
DataToolkitCore._dataadvisecall — Function

_dataadvisecall(func::Function, args...; kwargs...)

Identify the first data-like argument of args (i.e. a DataCollection, DataSet, or DataTransformer), obtain its advice, and perform an advised call of func(args...; kwargs...).
DataToolkitCore.strip_stacktrace_advice! — Function

strip_stacktrace_advice!(st::Vector{Base.StackTraces.StackFrame})

Remove stack frames related to @advise and invokepkglatest from st.
DataToolkitCore.get_package — Function

get_package(pkg::Base.PkgId)
get_package(from::Module, name::Symbol)

Obtain the module specified by pkg, or identified by name and declared by from. Should the package not be currently loaded, DataToolkit will attempt to lazy-load the package and return its module.

Failure to either locate name or require pkg will result in an exception being thrown.

See also: @require, @addpkg, try_install_pkg.
DataToolkitCore.try_install_pkg — Function

try_install_pkg(pkg::Base.PkgId)

Attempt to install the package identified by pkg if it is not currently installed.

This function is called automatically by get_package if the package is not currently loaded, and calls Pkg's try_prompt_pkg_add method from its REPLExt package extension. If the REPL has not been loaded, nothing will be done.
DataToolkitCore.read1 — Function

read1(dataset::DataSet, as::Type)

The advisable implementation of read(dataset::DataSet, as::Type), which see.

This is essentially an exercise in useful indirection.
Consts

DataToolkitCore.LATEST_DATA_CONFIG_VERSION — Constant

The DataCollection.version set on all created DataCollections, and assumed when reading any Data.toml files which do not set data_config_version.

DataToolkitCore.PLUGINS_DOCUMENTATION — Constant

A mapping from Plugin names to the documentation of said plugin.

DataToolkitCore.DEFAULT_PLUGINS — Constant

The set of plugins (by name) that should be used by default when creating a new data collection.

DataToolkitCore.PLUGINS — Constant

The set of plugins currently available.
DataToolkitCore.TRANSFORMER_DOCUMENTATION — Constant

List of (category::Symbol, named::Symbol) => docs::Any forms.

category can be :storage, :loader, or :writer.

DataToolkitCore.EXTRA_PACKAGES — Constant

The set of packages loaded by each module via @addpkg, for import with @require.

More specifically, when a module M invokes @addpkg pkg id then EXTRA_PACKAGES[M][pkg] = id is set, and this information is later used with @require to obtain the package from the root module.
DataToolkitCore.DEFAULT_DATATRANSFORMER_PRIORITY — Constant

The default priority field value for instances of DataTransformer.

DataToolkitCore.DEFAULT_DATA_ADVISOR_PRIORITY — Constant

The default priority field value for Advices.

DataToolkitCore.DATASET_REFERENCE_WRAPPER — Constant

A tuple of delimiters defining a dataset reference. For example, if set to ("{", "}") then {abc} would be recognised as a dataset reference for abc.

DataToolkitCore.DATASET_REFERENCE_REGEX — Constant

A regex which matches dataset references. This is constructed from DATASET_REFERENCE_WRAPPER.
DataToolkitCore.QUALIFIED_TYPE_CACHE — Constant

A cache of QualifiedType instances, indexed by the type they represent.

While one would hope that QualifiedType(::Type) calls would be constant-folded, in practice this is not the case, and so this cache is used to avoid an unfortunately large performance hit when constructing many QualifiedType instances.
DataToolkitCore.DATA_CONFIG_RESERVED_ATTRIBUTES — Constant

The data specification TOML format constructs a DataCollection, which itself contains DataSets, comprised of metadata and DataTransformers.

DataCollection
├─ DataSet
│  ├─ DataTransformer
│  └─ DataTransformer
├─ DataSet
⋮

Within each scope, there are certain reserved attributes. They are listed in this Dict under the following keys:
- :collection for DataCollection
- :dataset for DataSet
- :transformer for DataTransformer
DataToolkitCore.DATA_CONFIG_KEY_SORT_MAPPING — Constant

When writing a data configuration TOML file, the keys are (recursively) sorted. Some keys are particularly important though, and so to ensure they are placed higher, mappings from such keys to higher-sorting priority strings can be registered here.

For example, "config" => "\0x01" ensures that the special configuration section is placed before all of the data sets.

This can cause odd behaviour if somebody gives a dataset the same name as a special key, but frankly that would be a bit silly (given the key names, e.g. "uuid") and so this is of minimal concern.
DataToolkitCore.LINT_SEVERITY_MESSAGES — Constant

A mapping from severity numbers (see LINT_SEVERITY_MAPPING) to a tuple giving the color the message should be accented with and the severity title string.

DataToolkitCore.LINT_SEVERITY_MAPPING — Constant

A mapping from severity symbols to integers.

This is used to assist with more readable construction of LintItems.

See also: LINT_SEVERITY_MESSAGES for the reverse mapping of integer to severity title string.
DataToolkitCore.DEFAULT_LOG_DELAY — Constant

The delay in seconds before a log message is displayed. When zero or less, log messages are displayed immediately.

DataToolkitCore.SIMPLIFY_STACKTRACES — Constant

Whether stacktraces should be simplified by removing likely-incidental DataToolkit-internals frames when displayed.