Introduction
The problem with the current state of affairs
Data is beguiling. It can initially seem simple to deal with: "here I have a file, and that's it". However as soon as you do things with the data you're prone to be asked tricky questions like:
- where's the data?
- how did you process that data?
- how can I be sure I'm looking at the same data as you?
This is no small part of the replication crisis.
Further concerns arise as soon as you start dealing with large quantities of data, or computationally expensive derived data sets. For example:
- Have I already computed this data set somewhere else?
- Is my generated data up to date with its sources/dependencies?
Generic tools exist for many parts of this problem, but there are some benefits that can be realised by creating a Julia-specific system, namely:
- Having all pertinent environmental information in the data processing contained in a single
Project.toml
- Improved convenience in data loading and management, compared to a generic solution
- Allowing datasets to be easily shared with a Julia package
In addition, the Julia community seems to have a strong tendency to NIH[NIH] tools, so we may as well get ahead of this and try to make something good 😛.
Pre-existing solutions
DataLad
- Does a lot of things well
- Puts information on how to create data in git commit messages (bad)
- No data file specification
Kedro data catalog
- Has a file defining all the data (good)
- Has poor versioning
- https://kedro.readthedocs.io/en/stable/data/data_catalog.html
- Data Catalog CLI
Snakemake
- Workflow manager, with remote file support
- Snakemake Remote Files
- Good list of possible file locations to handle
- Drawback is that you have to specify the location you expect(S3, http, FTP, etc.)
- No data file specification
Nextflow
- Workflow manager, with remote file support
- Docs on files and IO
- Docs on S3
- You just call
file()
and nextflow figures out under the hood the protocol whether it should pull it from S3, http, FTP, or a local file. - No data file specification
- NIHNot Invented Here, a tendency to "reinvent the wheel" to avoid using tools from external origins — it would of course be better if you (re)made it.