The Inventory
The store needs a way of keeping track of which data sources are actively being used, and which are orphaned. To do this, we need to keep track of two things:
- The data sources being stored
- The data collections using those data sources
Essentially we need a "stock list" for our data sources, or a "data inventory", and this is what the Inventory.toml
file functions as.
Format of the Inventory file
The inventory file needs to contain three categories of information:
- Configuration (e.g. how long data should "expire" after):
config
- Known data collections:
collections
- Stored entries:
store
andcache
Configuration is stored in a simple TOML table, and currently recognises the following attributes:
auto_gc
max_age
max_size
recency_beta
store_dir
cache_dir
Known collections are also stored in a TOML table, using the collection UUIDs as keys. The name, collection source path, and a timestamp of the last access are recorded under the keys name
, path
, and seen
respectively.
Entries of the inventory are recorded in two arrays of tables: store
and cache
. These tables share three common attributes:
recipie
, a (fairly stable) hash of the inputs and parameters that lead to the object being storedaccessed
, a time stamp of the last access timereferences
, a list of collections that reference the object, by UUID
The store
tables extend the common trio with an optional checksum
attribute, as well as an optional file extension
attribute.
The cache
tables extend the common trio with a list of the types
involved, typehashes
of their structure, and packages
involved.
Example Inventory file
inventory_version = 1
[config]
max_age = 30
# ...
[collections.d0c50be9-f198-43c6-9411-63aaecb83ed0]
name = "known-collection"
path = "/absolute/path/to/Data.toml"
seen = 2023-08-12T17:34:56.816Z
# ...
[[store]]
recipe = "15edb3448b1460c8"
accessed = 2023-08-12T17:34:56.816Z
references = ["d0c50be9-f198-43c6-9411-63aaecb83ed0"]
checksum = "k12:d8e04889edd692e1f36dfac9cc7e73b0"
extension = "csv"
# ...
[[cache]]
recipe = "e41acbb05ac155d1"
accessed = 2024-06-04T19:35:57.074Z
references = ["d0c50be9-f198-43c6-9411-63aaecb83ed0"]
types = ["Array{Any,2}", "InlineStrings.String15"]
typehashes = ["5420e006bc4fe707", "ad994c6b8c7e2324"]
[[cache.packages]]
name = "InlineStrings"
uuid = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48"
# ...