Cache

Cache the results of data loaders using the Serialization standard library. Cache keys are determined by the loader "recipe" and the type requested.

Not all data types can be cached effectively; an IOStream, for example, cannot.
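As a minimal sketch of the mechanism the cache builds on, the following round-trips a plain value through the Serialization standard library. The value and the in-memory buffer are purely illustrative; the plugin's actual file layout and internals are not shown here.

```julia
using Serialization

# An ordinary value made of plain data serializes and deserializes cleanly.
value = Dict("name" => "iris", "rows" => 150)

buf = IOBuffer()          # stand-in for a cache file on disk
serialize(buf, value)     # write the value in Julia's serialization format
seekstart(buf)            # rewind before reading it back
restored = deserialize(buf)

println(restored == value)
```

A value that wraps a live operating-system resource, such as an open IOStream, has no comparable self-contained representation, which is why such types are poor candidates for caching.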

Recipe hashing

The driver, parameters, and type(s) of a loader, together with the storage drivers of its data set, are all combined into the loader's "recipe hash".

╭─────────╮             ╭──────╮
│ Storage │             │ Type │
╰───┬─────╯             ╰───┬──╯
    │    ╭╌╌╌╌╌╌╌╌╌╮    ╭───┴────╮ ╭────────╮
    ├╌╌╌╌┤ DataSet ├╌╌╌╌┤ Loader ├─┤ Driver │
    │    ╰╌╌╌╌╌╌╌╌╌╯    ╰───┬────╯ ╰────────╯
╭───┴─────╮             ╭───┴────────╮
│ Storage ├─╼           │ Parameters ├─╼
╰─────┬───╯             ╰────────┬───╯
      ╽                          ╽

Since the parameters of the loader (and of each storage backend) can reference other data sets (indicated with ╼ and ╽), this hash is computed recursively, forming a Merkle tree. In this manner, the entire "recipe" leading to the final result is hashed.

                ╭───╮
                │ E │
        ╭───╮   ╰─┬─╯
        │ B ├──▶──┤
╭───╮   ╰─┬─╯   ╭─┴─╮
│ A ├──▶──┤     │ D │
╰───╯   ╭─┴─╮   ╰───╯
        │ C ├──▶──┐
        ╰───╯   ╭─┴─╮
                │ D │
                ╰───╯

In this example, the hash for a loader of data set "A" relies on the data sets "B" and "C", and so their hashes are calculated and included. "D" is required by both "B" and "C", and so is included in each. "E" is also used in "D".
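The recursion can be sketched as follows. This is an illustration only: the `Recipe` struct, the `recipe_hash` function, the example drivers, and the use of SHA-256 are all assumptions made for the sketch, not the plugin's actual types or hash function. The references follow the prose description above ("A" uses "B" and "C", both use "D", and "D" uses "E"), and the graph is assumed to be acyclic.

```julia
using SHA

# Hypothetical, simplified stand-in for the information that makes up a recipe.
struct Recipe
    driver::String
    datatype::String
    parameters::Dict{String,String}
    storage::Vector{String}       # storage driver names
    references::Vector{String}    # names of other data sets this one depends on
end

# Hash a data set's own recipe together with the hashes of everything it
# references, so a change anywhere upstream changes the final hash.
function recipe_hash(name::String, recipes::Dict{String,Recipe})
    r = recipes[name]
    parts = String[r.driver, r.datatype]
    # Sort parameter keys so the result does not depend on Dict ordering.
    for key in sort(collect(keys(r.parameters)))
        push!(parts, key * "=" * r.parameters[key])
    end
    append!(parts, r.storage)
    for ref in r.references
        push!(parts, recipe_hash(ref, recipes))   # recursive step
    end
    return bytes2hex(sha256(join(parts, "\n")))
end

recipes = Dict(
    "A" => Recipe("csv", "DataFrame", Dict("delim" => ","), ["filesystem"], ["B", "C"]),
    "B" => Recipe("json", "Dict", Dict{String,String}(), ["web"], ["D"]),
    "C" => Recipe("csv", "Matrix", Dict{String,String}(), ["filesystem"], ["D"]),
    "D" => Recipe("text", "String", Dict{String,String}(), ["filesystem"], ["E"]),
    "E" => Recipe("text", "String", Dict("encoding" => "utf8"), ["raw"], String[]),
)

hash_a = recipe_hash("A", recipes)

# Changing only "E" still changes the hash of "A", since "E" is reachable from it.
changed = copy(recipes)
changed["E"] = Recipe("text", "String", Dict("encoding" => "latin1"), ["raw"], String[])
hash_a_changed = recipe_hash("A", changed)

println(hash_a == hash_a_changed)
```

Because "D" is hashed once for each data set that references it, this naive version repeats work; memoising the per-data-set hashes would avoid that without changing the result.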

Configuration

Store path

The cache uses the same store.path configuration variable as the store plugin; see that plugin's documentation for details.

Disabling on a per-loader basis

Caching of individual loaders can be disabled by setting the loader's "cache" parameter to false, for example:

[[somedata.loader]]
cache = false
...

Store management

System-wide configuration can be set via the store config set REPL command, or by directly modifying the DataToolkitCommon.Store.getinventory().config struct.

A few (system-wide) settings determine garbage collection behaviour:

  • auto_gc (default 2): How often to automatically run garbage collection (in hours). Set to a non-positive value to disable.
  • max_age (default 30): The maximum number of days since a collection was last seen before it is removed from consideration.
  • max_size (default 53687091200): The maximum (total) size of the store, in bytes (the default corresponds to 50 GiB).
  • recency_beta (default 1): When removing items to avoid going over max_size, how much recency should be valued. Can be set to any value in (-∞, ∞). Larger (positive) values weight recency more, and negative values weight size more. -1 and 1 are equivalent.
  • store_dir (default store): The directory (either as an absolute path, or relative to the inventory file) that should be used for storage (IO) cache files.
  • cache_dir (default cache): The directory (either as an absolute path, or relative to the inventory file) that should be used for Julia cache files.
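To make the default values above concrete, here is a small sketch of how the max_age and max_size thresholds might be interpreted. The constant and function names are hypothetical, max_size is assumed to be expressed in bytes, and this is not the plugin's actual garbage-collection logic (in particular, the recency_beta weighting is not modelled).

```julia
using Dates

# Defaults copied from the settings list above.
const MAX_AGE_DAYS = 30
const MAX_SIZE_BYTES = 53687091200   # assumed to be a byte count

# The default size limit expressed in GiB (2^30 bytes).
max_size_gib = MAX_SIZE_BYTES / 2^30

# Hypothetical check: has a collection gone unseen for longer than max_age?
is_expired(last_seen::DateTime, current::DateTime; max_age_days = MAX_AGE_DAYS) =
    last_seen + Day(max_age_days) < current

# Hypothetical check: does the store exceed max_size?
over_budget(total_bytes::Integer; max_size = MAX_SIZE_BYTES) = total_bytes > max_size

current = DateTime(2024, 3, 1)
println(max_size_gib)
println(is_expired(DateTime(2024, 1, 1), current))    # last seen 60 days earlier
println(is_expired(DateTime(2024, 2, 20), current))   # last seen 10 days earlier
```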