Generate data points for later configuring the learning task generator and GARegressor's init: scripts/genkdata.jl
If you don't already have a sample available (long-term, there should be one shipped with the library), you should generate your own.
nix develop --impure
julia -p 20 --project=. scripts/genkdata.jl --n-iter 10000
This generates a CSV file such as 2023-11-23-12-23-35-578-Bym3-kdata.csv.
Let's assume that this file's name is $kdata_csv.
Note that you should probably adjust the number of processes (-p option to
julia) according to the hardware you use.
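One way to pick the -p value is to derive it from the machine's core count. This is only a sketch of one possible heuristic (the nproc-based sizing and the two-core reserve are assumptions, not a project convention):

```shell
# Sketch: size the worker pool from the core count, keeping two cores free
# for the coordinating Julia process and the OS. Falls back to 1 worker on
# very small machines.
nworkers=$(( $(nproc) > 2 ? $(nproc) - 2 : 1 ))
echo "julia -p $nworkers --project=. scripts/genkdata.jl --n-iter 10000"
```

The echo only prints the command so you can inspect it before running it for real.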
Since I'm running this on our Slurm cluster, see slurm/genkdata.sbatch for the
exact call I'm using.
If you have access to a Slurm cluster as well, consider starting many
medium-sized jobs (e.g. 1000 jobs with --n-iter 2000 each), each of which will
create a CSV file. Put the files generated into a new directory and use that
directory as $kdata_csv in the next step instead of a single file name.
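The many-medium-jobs approach can be sketched as a Slurm array job. The directives and resource numbers below are placeholders, not recommendations; adapt them to your cluster and compare with the real call in slurm/genkdata.sbatch:

```shell
#!/bin/sh
# Hypothetical Slurm array job (sketch): 1000 medium-sized jobs, each
# producing its own CSV file. Resource numbers are placeholders.
#SBATCH --array=1-1000
#SBATCH --cpus-per-task=4
#SBATCH --mem=100G
nix develop . --impure --command \
    julia -p 4 --project=. scripts/genkdata.jl --usemmap --n-iter=2000
```

Afterwards, move the generated CSV files into a new directory and pass that directory as $kdata_csv.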
A Slurm job corresponding to
nix develop . --impure --command julia -p 4 --project=. scripts/genkdata.jl --usemmap --n-iter=2000
running on an allocation of 4 cores of an AMD EPYC 7502 32-Core Processor and 100 GB RAM takes around 01:26:45 on average, as computed by
sacct --job 439387 -o Elapsed --state COMPLETED | awk -F: '{ sec += ($1 * 3600) + ($2 * 60) + $3; count++ } END { avg_sec = sec / count; printf "%02d:%02d:%02d\n", avg_sec/3600, (avg_sec%3600)/60, avg_sec%60 }'
Since learning task generation is probabilistic, it consists of three steps: Deriving hyperparameters for the generator, generating tasks and finally selecting a suitable subset of the generated tasks.
nix develop --impure
julia --project=. scripts/selectgendataparams.jl $kdata_csv
Replace the CSV file name with the name of the file generated by
scripts/genkdata.jl (or the folder name containing all the CSV files to use).
This generates a file sharing the same prefix as $kdata_csv but with the
suffix .paramselect.csv. Let's assume the resulting file name is
$kdata_paramselect_csv.
nix develop --impure
julia --project=. -e "import Pkg; Pkg.instantiate()"
julia -p 30 --project=. "scripts/gendata.jl" genall --startseed=0 --endseed=49 --prefix-fname="data" --usemmap $kdata_paramselect_csv
Remember to adjust the parameter to -p to the actual number of workers you
want to use.
Also, if you don't want to use mmap (because you have a lot of RAM), simply
omit the --usemmap flag. You probably want to keep it, though, especially for
data with more than ten dimensions (memory is only mapped to disk if RAM
doesn't suffice, so the flag should have low overhead if your RAM is large
enough).
Note that for each learning task generated, three files are created:
- ….data.npz: The training and test data of the task.
- ….task.jls: The full Julia Task object, serialized.
- ….stats.jls: A serialized Julia Dict with the Task parameters (such as DX) as well as some statistics such as the actual number of rules that make up the task.
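Since each task should come with all three companion files, a quick consistency check can be sketched as a small helper (the function name and directory layout here are assumptions; the directory is whatever you passed via --prefix-fname):

```shell
# Sketch: report tasks that are missing one of their three companion files.
# Usage: check_task_files <directory-with-generated-tasks>
check_task_files() {
    dir=$1
    for stats in "$dir"/*.stats.jls; do
        [ -e "$stats" ] || continue     # glob did not match: empty directory
        base=${stats%.stats.jls}
        for ext in data.npz task.jls; do
            [ -e "$base.$ext" ] || echo "missing: $base.$ext"
        done
    done
}
```

For example, check_task_files data prints one "missing: …" line per absent file and nothing if everything is in place.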
There is also a hashing scheme used with hashes stored in each generated file (task hashes change not only if data changes but also if a different Git commit is used etc.). This needs to be documented.
This receives the file with the selected parameters $kdata_paramselect_csv and
the directory that the data sets were written to (see gendata.jl's
--prefix-fname option).
nix develop --impure
julia --project=. scripts/drawdata.jl $kdata_paramselect_csv data
This tries to randomly select, for each entry in $kdata_paramselect_csv (i.e.
for each combination of DX, rate_coverage_min and K considered), a certain
number (currently 4) of data sets from the set of data sets generated by
gendata.jl. Note that it operates not on the data sets directly but on the
respective ….stats.jls files generated by gendata.jl.
It generates a fish script that can then be used to perform the actual copying
of tasks from the task folder. drawdata.jl does not perform copying itself so
that you can simply copy the ….stats.jls files (which are much smaller than the
possibly huge ….task.jls and ….data.npz files) to your local machine and run
drawdata.jl there instead of having to do it remotely.
In the process, drawdata.jl also plots some statistics to visualize the set
of tasks it considers. The last plot shows how many learning tasks it was able
to select per combination of DX, coverage rate bin and K.
Inspect and then run the fish script generated by drawdata.jl in order to copy
the selected tasks into a folder.
For now, the interface is to adjust, within the script, the path to the mlruns folder to check.
This serializes a DataFrame with all the mlflow data (including the dissimilarities) for the next step.
You probably want to run this on the server since computing the pairwise distances requires some compute.
- Load Julia and the script:
julia --project=.
include("scripts/analyse-runbest.jl")
To run all tests:
julia --project=.
] test
To run only the tests in test/select.jl and test/ga.jl, start julia --project=. and run:
import Pkg; Pkg.test(; test_args=["select", "ga"])
See CITATION.cff.