Store

Cache IO from data storage backends by saving the contents to disk.

Configuration

Store path

The directory the store is maintained in can be set via the store.path configuration parameter.

config.store.path = "relative/to/datatoml"

The system default is ~/.cache/julia/datatoolkit, which can be overridden with the DATATOOLKIT_STORE environment variable.
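The lookup order described above can be sketched in a few lines of Python. This is an illustration only, not the package's (Julia) implementation: the function name resolve_store_path is hypothetical, and it assumes that an explicit store.path takes precedence over DATATOOLKIT_STORE and that a relative store.path is resolved against the directory holding the data TOML file.

```python
import os
from pathlib import Path

def resolve_store_path(configured=None, project_dir=".", env=None):
    """Pick the store directory (hypothetical helper, assumed precedence)."""
    env = os.environ if env is None else env
    # 1. An explicit store.path wins (assumption); relative paths are joined
    #    onto the directory of the data TOML file, absolute paths stay as-is.
    if configured is not None:
        return Path(project_dir) / configured
    # 2. Otherwise the environment variable overrides the system default.
    override = env.get("DATATOOLKIT_STORE")
    if override:
        return Path(override)
    # 3. Fall back to the documented system default.
    return Path.home() / ".cache" / "julia" / "datatoolkit"
```

For instance, resolve_store_path("relative/to/datatoml", project_dir="/proj") gives /proj/relative/to/datatoml under these assumptions.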

Disabling on a per-storage basis

Saving of an individual storage source can be disabled by setting its "save" parameter to false, for example:

[[somedata.storage]]
save = false

Checksums

To ensure data integrity, a checksum can be specified, and checked when saving to the store. For example,

[[iris.storage]]
checksum = "k12:cfb9a6a302f58e5a9b0c815bb7e8efb4"

If you do not have a checksum, but wish for one to be calculated upon accessing the data, the checksum parameter can be set to the special value "auto". When the data is first accessed, a checksum will be generated and replace the "auto" value.

Instead of "auto", a particular checksum algorithm can be specified by name, e.g. "sha256". The currently supported algorithms are: k12 (Kangaroo Twelve), sha512, sha384, sha256, sha224, sha1, md5, and crc32c.

To explicitly specify no checksum, set the parameter to false.
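To make the check concrete, here is a minimal Python sketch of verifying data against a checksum string. It assumes the value has the form "algorithm:hexdigest", as the k12 example above suggests, and it is not the package's own code. The helper name matches_checksum is hypothetical, and only the algorithms available in Python's hashlib are covered (k12 and crc32c are left out).

```python
import hashlib

# Algorithms from the documented list that Python's hashlib provides.
# k12 and crc32c are not in the standard library, so this sketch skips them.
HASHLIB_ALGORITHMS = {"sha512", "sha384", "sha256", "sha224", "sha1", "md5"}

def matches_checksum(data: bytes, checksum: str) -> bool:
    """Return True when `data` hashes to the digest in "algorithm:hexdigest"."""
    algorithm, _, expected = checksum.partition(":")
    if algorithm not in HASHLIB_ALGORITHMS:
        raise ValueError(f"algorithm not covered by this sketch: {algorithm!r}")
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected.lower()

# Usage: build a checksum string for some bytes, then verify it.
payload = b"sepal_length,sepal_width\n5.1,3.5\n"
recorded = "sha256:" + hashlib.sha256(payload).hexdigest()
ok = matches_checksum(payload, recorded)
```

A mismatch (for example after the upstream file changes) makes matches_checksum return False, which is the situation a stored checksum is meant to catch.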

Expiry/lifecycle

After a storage source is saved, the cache file can be made to expire after a certain period. This is done by setting the "lifetime" parameter of the storage, for example:

[[updatingdata.storage]]
lifetime = "3 days"

The lifetime parameter accepts a few formats, namely:

ISO 8601 periods (with whole numbers only), in both forms

  1. P[n]Y[n]M[n]DT[n]H[n]M[n]S, e.g.

    • P3Y6M4DT12H30M5S represents a duration of "3 years, 6 months, 4 days, 12 hours, 30 minutes, and 5 seconds"
    • P23DT23H represents a duration of "23 days, 23 hours"
    • P4Y represents a duration of "4 years"
  2. PYYYYMMDDThhmmss / P[YYYY]-[MM]-[DD]T[hh]:[mm]:[ss], e.g.

    • P0003-06-04T12:30:05
    • P00030604T123005

"Prose style" period strings, which are a repeated pattern of [number] [unit], where unit matches year|y|month|week|wk|w|day|d|hour|h|minute|min|second|sec| optionally followed by an "s", comma, or whitespace. E.g.

  • 3 years 6 months 4 days 12 hours 30 minutes 5 seconds
  • 23 days, 23 hours
  • 4d12h
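The accepted formats can be illustrated with a small Python parser. This is a sketch under stated assumptions rather than the package's parser: the name parse_lifetime is hypothetical, it handles only the first ISO 8601 form and the prose style, and it returns a dictionary of unit counts instead of a duration, because years and months have no fixed length.

```python
import re

# Canonical unit for every spelling listed in the prose-style description.
UNIT_ALIASES = {
    "year": "year", "y": "year",
    "month": "month",
    "week": "week", "wk": "week", "w": "week",
    "day": "day", "d": "day",
    "hour": "hour", "h": "hour",
    "minute": "minute", "min": "minute",
    "second": "second", "sec": "second",
}

# ISO 8601 form 1: P[n]Y[n]M[n]DT[n]H[n]M[n]S, whole numbers only.
ISO_DESIGNATOR = re.compile(
    r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?"
)
ISO_FIELDS = ("year", "month", "day", "hour", "minute", "second")

# One "[number] [unit]" item, optionally followed by "s", commas, or whitespace.
PROSE_ITEM = re.compile(
    r"(\d+)\s*(year|y|month|week|wk|w|day|d|hour|h|minute|min|second|sec)s?[,\s]*"
)

def parse_lifetime(text):
    """Parse a lifetime string into {unit: count} (hypothetical helper)."""
    text = text.strip()
    iso = ISO_DESIGNATOR.fullmatch(text)
    if iso and any(value is not None for value in iso.groups()):
        return {field: int(value)
                for field, value in zip(ISO_FIELDS, iso.groups())
                if value is not None}
    # Otherwise consume repeated prose items until the string is exhausted.
    parts, pos = {}, 0
    while pos < len(text):
        item = PROSE_ITEM.match(text, pos)
        if item is None:
            raise ValueError(f"unrecognised lifetime: {text!r}")
        unit = UNIT_ALIASES[item.group(2)]
        parts[unit] = parts.get(unit, 0) + int(item.group(1))
        pos = item.end()
    if not parts:
        raise ValueError("empty lifetime")
    return parts
```

With this sketch, "P23DT23H" and "23 days, 23 hours" both yield 23 days and 23 hours, and "4d12h" yields 4 days and 12 hours.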

By default, the first lifetime period begins at the Unix epoch. This means a daily lifetime will tick over at 00:00 UTC. The "lifetime_offset" parameter can be used to shift this. It can be set to a lifetime string, date/time-stamp, or number of seconds.

For example, to have the lifetime expire at 03:00 UTC instead, the lifetime offset could be set to three hours.

[[updatingdata.storage]]
lifetime = "1 day"
lifetime_offset = "3h"

We can produce the same effect by specifying a different reference point for the lifetime.

[[updatingdata.storage]]
lifetime = "1 day"
lifetime_offset = 1970-01-01T03:00:00
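The effect of the offset can be worked through in Python. This sketch assumes, following the description above, that lifetime periods are laid end to end starting from the Unix epoch plus the offset, and that a cached file expires when the current period ends. The function name next_rollover is hypothetical and the code is not taken from the package.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def next_rollover(now, lifetime, offset=timedelta(0)):
    """End of the lifetime period containing `now` (assumed semantics)."""
    start = EPOCH + offset
    # Whole periods elapsed since the (shifted) reference point.
    elapsed_periods = (now - start) // lifetime
    return start + (elapsed_periods + 1) * lifetime

now = datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)
one_day = timedelta(days=1)

# No offset: a daily lifetime ticks over at 00:00 UTC.
default_rollover = next_rollover(now, one_day)
# Offset of "3h": the same lifetime ticks over at 03:00 UTC.
shifted_rollover = next_rollover(now, one_day, timedelta(hours=3))
# A timestamp offset is the same shift expressed as a reference point.
timestamp_offset = datetime(1970, 1, 1, 3, tzinfo=timezone.utc) - EPOCH
```

Here default_rollover is 2024-05-02 00:00 UTC and shifted_rollover is 2024-05-01 03:00 UTC, and timestamp_offset equals three hours, which is why the two configurations above behave identically.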

Store management

System-wide configuration can be set via the store config set REPL command, or by directly modifying the DataToolkitCommon.Store.getinventory().config struct.

A few (system-wide) settings determine garbage collection behaviour:

  • auto_gc (default 2): How often to automatically run garbage collection (in hours). Set to a non-positive value to disable.
  • max_age (default 30): The maximum number of days since a collection was last seen before it is removed from consideration.
  • max_size (default 53687091200): The maximum (total) size of the store, in bytes. The default corresponds to 50 GiB.
  • recency_beta (default 1): When removing items to avoid going over max_size, how much recency should be valued. Can be set to any value in (-∞, ∞). Larger (positive) values weight recency more, and negative values weight size more. -1 and 1 are equivalent.
  • store_dir (default store): The directory (either as an absolute path, or relative to the inventory file) that should be used for storage (IO) cache files.
  • cache_dir (default cache): The directory (either as an absolute path, or relative to the inventory file) that should be used for Julia cache files.