
RSLModels.jl

Generate data points for later configuring the learning task generator and GARegressor's init: scripts/genkdata.jl

If you don't already have a sample available (long-term, there should be one shipped with the library), you should generate your own.

nix develop --impure
julia -p 20 --project=. scripts/genkdata.jl --n-iter 10000

This generates a CSV file such as 2023-11-23-12-23-35-578-Bym3-kdata.csv. Let's assume that this file's name is $kdata_csv.
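For the commands below, it is convenient to store that name in a shell variable, e.g. in fish (using the example name from above):

set kdata_csv 2023-11-23-12-23-35-578-Bym3-kdata.csv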

Note that you should probably adjust the number of processes (the -p option to julia) according to the hardware you use.

Since I'm running this on our Slurm cluster, see slurm/genkdata.sbatch for the exact call I'm using.

If you have access to a Slurm cluster as well, consider starting many medium-sized jobs (e.g. 1000 jobs with --n-iter 2000 each), each of which will create a CSV file; see the sketch right below. Put the generated files into a new directory and use that directory as $kdata_csv in the next step instead of a single file name.
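slurm/genkdata.sbatch remains the authoritative script; a hypothetical job-array sketch (resource figures borrowed from the timing example below) might look like this:

#!/bin/bash
#SBATCH --array=1-1000
#SBATCH --cpus-per-task=4
#SBATCH --mem=100G

# Each array task writes its own uniquely named CSV file; afterwards, move
# the generated files into a new directory and pass that as $kdata_csv.
nix develop . --impure --command julia -p 4 --project=. scripts/genkdata.jl --usemmap --n-iter=2000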

A Slurm job corresponding to

nix develop . --impure --command julia -p 4 --project=. scripts/genkdata.jl --usemmap --n-iter=2000

running on an allocation of 4 cores of an AMD EPYC 7502 32-Core Processor and 100GB RAM takes around

sacct --job 439387 -o Elapsed --state COMPLETED | awk -F: '{ sec += ($1 * 3600) + ($2 * 60) + $3; count++ } END { avg_sec = sec / count; printf "%02d:%02d:%02d\n", avg_sec/3600, (avg_sec%3600)/60, avg_sec%60 }'
01:26:45

Generating learning tasks

Since learning task generation is probabilistic, it consists of three steps: deriving hyperparameters for the generator, generating tasks, and finally selecting a suitable subset of the generated tasks.

Extract learning task generator hyperparameters: scripts/selectgendataparams.jl

nix develop --impure
julia --project=. scripts/selectgendataparams.jl $kdata_csv

Replace the CSV file name with the name of the file generated by scripts/genkdata.jl (or the folder name containing all the CSV files to use).

This generates a file sharing the same prefix as $kdata_csv but with the suffix .paramselect.csv. Let's assume the resulting file name is $kdata_paramselect_csv.
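Continuing the hypothetical fish example from above, the two file names might then relate like this:

set kdata_csv 2023-11-23-12-23-35-578-Bym3-kdata.csv
set kdata_paramselect_csv 2023-11-23-12-23-35-578-Bym3-kdata.paramselect.csv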

Generate learning tasks: scripts/gendata.jl

nix develop --impure
julia --project=. -e "import Pkg; Pkg.instantiate()"
julia -p 30 --project=. "scripts/gendata.jl" genall --startseed=0 --endseed=49 --prefix-fname="data" --usemmap $kdata_paramselect_csv

Remember to adjust the -p parameter to the actual number of workers you want to use.

Also, if you don't want to use mmap (because you have a lot of RAM), don't add the --usemmap flag. You probably do want to use it, though, especially for higher-than-ten-dimensional data: memory is only mapped to disk if RAM doesn't suffice, so this should have low overhead if your RAM is large enough.

Note that for each learning task generated, three files are created:

  • ….data.npz: The training and test data of the task.
  • ….task.jls: The full Julia Task object, serialized.
  • ….stats.jls: A serialized Julia Dict with the Task parameters such as DX as well as some statistics such as the actual number of rules that make up the task.
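For manually inspecting a generated task, the three files can be loaded as follows (a minimal sketch: the file name prefix is hypothetical, the NPZ keys and Dict entries depend on the scripts' current version, and the session should be started with julia --project=. so that the serialized types resolve):

using NPZ, Serialization

# Hypothetical prefix of one generated task; adjust accordingly.
prefix = "data/sometask"

data = npzread("$prefix.data.npz")        # training/test data as a Dict of arrays
task = deserialize("$prefix.task.jls")    # the full Julia Task object
stats = deserialize("$prefix.stats.jls")  # Dict with parameters (e.g. DX) and statistics

println(keys(data))   # which arrays does the archive contain?
println(keys(stats))  # which parameters/statistics were recorded?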

There is also a hashing scheme; hashes are stored in each generated file (task hashes change not only if the data changes but also if a different Git commit is used, etc.). This still needs to be documented.

Selecting a subset of the generated tasks: scripts/drawdata.jl

This receives the file with the selected parameters, $kdata_paramselect_csv, and the directory that the data sets were written to (see gendata.jl's --prefix-fname option).

nix develop --impure
julia --project=. scripts/drawdata.jl $kdata_paramselect_csv data

For each entry in $kdata_paramselect_csv (i.e. for each considered combination of DX, rate_coverage_min and K), this tries to randomly select a certain number (currently 4) of data sets from the set of data sets generated by gendata.jl. Note that it operates not on the data sets directly but on the respective ….stats.jls files generated by gendata.jl.

It generates a fish script that can then be used to perform the actual copying of tasks from the task folder. drawdata.jl does not perform the copying itself so that you can simply copy the ….stats.jls files (which are much smaller than the possibly huge ….task.jls and ….data.npz files) to your local machine and run drawdata.jl there instead of having to do everything remotely.

In the process, drawdata.jl also plots some statistics to visualize the set of tasks it considers. The last plot shows how many learning tasks it was able to select per combination of DX, coverage rate bin and K.

Inspect and then run the fish script generated by drawdata.jl in order to copy the selected tasks into a folder.

Other older/possibly deprecated tools

Check ranges of tuned parameters (i.e. the results of run.py optparams): scripts/chkoptparams.jl

For now, the interface is to adjust the path of the mlruns folder to check directly in the script.

Compute dissimilarities (can take a long time, run this on a compute node): scripts/computesims.jl

This serializes a DataFrame with all the mlflow data (including the dissimilarities) for the next step.
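The resulting file can be read back like so (a minimal sketch; the file name is hypothetical, and the environment must provide DataFrames so the serialized type resolves):

using DataFrames, Serialization

# Hypothetical name of the file written by computesims.jl.
df = deserialize("sims.jls")
first(df, 5)  # peek at the first few rows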

Perform analysis of the run set: scripts/analyserunbest.jl

You probably want to run this on the server since computing the pairwise distances requires some compute.

  1. Load Julia and the script:
    julia --project=.
    include("scripts/analyse-runbest.jl")
    

Testing

To run all tests:

julia --project=.
] test
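Equivalently, non-interactively:

julia --project=. -e 'import Pkg; Pkg.test()'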

To run only the tests in test/select.jl and test/ga.jl:

import Pkg; Pkg.test(;test_args=["select", "ga"])

Citing

See CITATION.cff.
