A replication package for reproducing the results of the paper "Mata: A Fast and Simple Finite Automata Library"
The link to the Zenodo archive: 10.5281/zenodo.10040695
This repository contains data, tools, benchmarks, and scripts for
replicating (and processing) the results of the tool paper describing the libmata library.
The package is meant for running benchmarks on a set of
libraries that support various types of nondeterministic finite automata.
The available tools are listed in jobs/*.yaml (each file corresponds to
a group of similar experiments); each tool is installed in the tools directory
and takes as input automata in various formats (each library tends to use
a different format). We use pycobench (a Python script) to measure the time of
each library and to output a report in csv (delimited by ;) that lists
individual measurements, results, and some other metrics (e.g., the time of
mintermization or, optionally, the time of some preprocessing) for each
library. For each library we implemented dedicated source code that compiles
to an executable modelling our experiments.
Note that some tools/libraries are run either through a wrapper script (to simplify
their usage in benchmarking or to fine-tune the measurement) or as a specific binary
that either takes an input automaton (and processes it) or takes
an input program that specifies a sequence of automata operations (and executes
them in order; e.g., load automaton A; load automaton B; create the
intersection of A and B).
The package contains dependencies, tools, as well as some additional helper scripts. Each helper script is accompanied by its own (sometimes short) documentation.
For the sake of simplicity, we prepared several master scripts: run_all.sh
(to replicate, to a certain degree, the results presented in our paper),
run_smoke_test.sh (to test that the package and tools were initialized correctly),
run_subset.sh (to replicate a subset of the results presented in our paper), and
generate_figs.py (to replicate the figures and tables presented in the paper).
We make our replication package (and its individual parts) available through the following channels:
- 10.5281/zenodo.10040695: downloadable replication package.
- VM: the virtual machine in which one can run the experiments.
- The Package Repository: the repository with this package.
- Benchmarks: source benchmarks that were used in the comparison.
- Results: the measured results presented in the paper.
- libmata: our libmata library.
- automata.net: our fork of the automata.net library.
- nfa-program-parser: our interpreter for mata, vata, awali, fado, and automata.py.
- brics: our interpreter for brics and automatalib.
In the following, we describe the necessary steps for replicating (to a certain degree) the results presented in the tool paper.
Download tacas-artifact.tar.gz from 10.5281/zenodo.10040695, then unpack the artifact and change into its directory:
tar -xzvf tacas-artifact.tar.gz
cd tacas-artifact
We assume that everything is executed from the root (tacas-artifact) folder; executing from other folders might not work.
For offline setup, we prepared all necessary requirements in the packages and pip-packages directories.
You can run the following script to install all of them (this will take some time, between 20 and 30 minutes):
sudo ./install_requirements.sh
This will install the requirements for running Java applications, .NET applications, the libmata and awali
libraries, and the necessary Python packages (including the Python binding and the pyinterpret script that runs the Python automata libraries).
If you are running the experiments in the TACAS'23 VM, you might need to run the following command (if you are running
as a user other than tacas23, change the name to your username):
export PATH=$PATH:/home/tacas23/.local/bin
You can test that everything is set up correctly by running the pyinterpret command, which should output:
usage: python3 interpreter.py <ENGINE> <PROGRAM.emp> [AUT1.mata, ..., AUTn.mata]
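For example, an invocation following the usage above could look like this (a sketch only: the engine name and the file paths are illustrative and may differ in your checkout):
pyinterpret fado programs/example.emp nfa-bench/example/aut1.mata nfa-bench/example/aut2.mata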
To test that the environment is working correctly, run the following script:
./run_smoke_test.sh
This runs only the b-smt benchmark with a 60 s timeout and only its first two instances;
in total, 22 jobs should be run, which should take at most 10 minutes.
The results are stored in the ./results/data/smoke-test directory.
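As a quick sanity check (assuming only the directory layout described above), you can list the produced files:
ls ./results/data/smoke-test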
You can then test the replication of the figures as follows:
./generate_figs.py ./results/data/smoke-test
(If the command gives you an error, try running python3 generate_figs.py ./results/data/smoke-test instead.)
This will generate two figures (in .png and .pdf format) and five tables (in .tex format),
stored in the ./results/data/smoke-test/figs directory. The script also prints the five tables
in human-readable form to standard output.
Note that the graphs will not be interesting, as they will contain only about two data points per tool.
To run a selected subset of our experiments, run the following script:
./run_subset.sh
This runs only three selected benchmarks (armc-incl, lia-explicit, and lia-symbolic) with 4 parallel
jobs and a 60-second timeout. It should take between one and four hours, depending on the number of parallel jobs,
the timeout, and the machine specification. (We ran our experiments on an Intel Core 3.4 GHz processor with 20 GB RAM.)
Note that you might need to increase the RAM or the number of CPUs (if you are running the replication in a VM),
or lower the number of parallel jobs (you can modify run_subset.sh and change --jobs 4 to a lower number of jobs,
where --jobs 1 corresponds to sequential execution of the benchmarks, as sketched below).
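For example, a minimal way to switch to 2 parallel jobs (assuming the flag appears literally as --jobs 4 in run_subset.sh, as mentioned above):
sed -i 's/--jobs 4/--jobs 2/' run_subset.sh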
Again, you can generate the figures and tables as follows:
./generate_figs.py ./results/data/partial-experiments
This will generate two figures (in .png and .pdf format) and five tables (in .tex format),
stored in the ./results/data/partial-experiments/figs directory, as well as human-readable tables on standard output.
Note that the graphs should replicate parts of the paper's figures for the three benchmarks.
To run all of the benchmarks, simply execute the following script:
./run_all.sh --output-dir replication
This should replicate (to a certain degree) the results presented in our paper.
Feel free to make your own changes, rerun the benchmarks, or extend the script with other
benchmarks (by default, it runs about 11 benchmarks distributed over several directories that are described in the
paper or in nfa-bench/README.md). The script runs all tools on all experiments
with 6 parallel workers. The output is saved in the results/data/replication directory,
with one csv file generated per benchmark directory.
Finally, you can replicate the figures and tables from our paper as follows:
./generate_figs.py ./results/data/replication
This will generate the figures and tables as follows:
- Table 1: stats-per-bench-{1, 2}.tex contains the two parts of Table 1.
- Table 2: stats-relative.tex contains Table 2.
- Table 3: stats-per-op-{1, 2}.tex contains the two parts of the flipped Table 3.
- Figure 4: paper-cactus-plot-per-bench-sum-fairest-of-them-all-4-x-3.{png, pdf} contains the cactus plots of Figure 4.
- Figure 5: paper-cactus-plot-per-operation-sum.{png, pdf} contains the cactus plots of Figure 5.
The output will look something like:
Processing: ./results/data/partial-experiments
100% (6 of 6) |########################################################| Elapsed Time: 0:00:00 Time: 0:00:00
- Computing fair overall times
100% (5214 of 5214) |##################################################| Elapsed Time: 0:00:09 Time: 0:00:09
- Generating Figure IV.
100% (3 of 3) |########################################################| Elapsed Time: 0:00:00 Time: 0:00:00
Saving to ./results/data/partial-experiments/figs/paper-cactus-plot-per-bench-sum-fairest-of-them-all-4-x-3.png
Saving to ./results/data/partial-experiments/figs/paper-cactus-plot-per-bench-sum-fairest-of-them-all-4-x-3.pdf
- Generating Figure V.
100% (7 of 7) |########################################################| Elapsed Time: 0:00:00 Time: 0:00:00
Saving to ./results/data/partial-experiments/figs/paper-cactus-plot-per-operation-sum.png
Saving to ./results/data/partial-experiments/figs/paper-cactus-plot-per-operation-sum.pdf
- Generating Table I.
100% (33 of 33) |######################################################| Elapsed Time: 0:00:00 Time: 0:00:00
100% (33 of 33) |######################################################| Elapsed Time: 0:00:00 Time: 0:00:00
benchmark mata awali vata automata.net
armc-incl 0|211|3|1 s 7|1 s|23|5 s 0|408|72|702 9|664|79|2 s
lia-explicit 0|39|6|330 6|18|18|13 0|91|40|455 0|65|67|25
lia-symbolic 1|2|1|6 1|8|6|21 1|7|6|8 1|73|71|31
Saved to ./results/data/partial-experiments/figs/stats-per-bench-1.tex
100% (33 of 33) |######################################################| Elapsed Time: 0:00:00 Time: 0:00:00
benchmark mata brics automatalib fado automata.py
armc-incl 0|211|3|1 s 5|915|46|4 s 5|2 s|479|8 s 58|10 s|25 s|12 s 10|1 s|191|3 s
lia-explicit 0|39|6|330 0|46|46|48 0|310|223|866 1|857|424|2 s 1|229|105|543
lia-symbolic 1|2|1|6 0|42|42|35 0|127|131|49 1|114|104|253 1|34|28|64
Saved to ./results/data/partial-experiments/figs/stats-per-bench-2.tex
- Generating Table II.
100% (116 of 116) |####################################################| Elapsed Time: 0:00:00 Time: 0:00:00
100% (116 of 116) |####################################################| Elapsed Time: 0:00:00 Time: 0:00:00
operation mata awali vata automata.net
complement 20|2|234 9|5|16 37|7|298 62|62|28
emptiness ~0|~0|~0 1|1|1 5|1|14 ~0|~0|~0
inclusion 211|3|1 s 1 s|23|5 s 408|72|702 656|71|2 s
trim ~0|~0|~0 2|1|1 5|1|14 6|6|1
Saved to ./results/data/partial-experiments/figs/stats-per-op-1.tex
100% (116 of 116) |####################################################| Elapsed Time: 0:00:00 Time: 0:00:00
100% (116 of 116) |####################################################| Elapsed Time: 0:00:00 Time: 0:00:00
operation mata brics automatalib fado automata.py
complement 20|2|234 44|44|42 26|22|13 233|70|658 158|23|2 s
emptiness ~0|~0|~0 ~0|~0|~0 2|2|~0 16|6|22 ~0|~0|~0
inclusion 211|3|1 s 915|46|4 s 63|55|29 10 s|25 s|12 s 528|101|1 s
trim ~0|~0|~0 - - 3|1|4 -
Saved to ./results/data/partial-experiments/figs/stats-per-op-2.tex
- Generating Table III.
bench mata mata-sim awali vata automata.net brics automatalib fado automata.py (py)mata
armc-incl 1 1.04 29.77 1.93 14.46 18.37 38.09 4585.83 20.86 0.95
lia-explicit 1 0.61 2.12 2.33 1.68 1.18 7.94 62.4 16.72 0.89
lia-symbolic 1 0.96 4.03 3.68 35.44 20.75 61.63 55.32 16.91 1.46
Saved to ./results/data/partial-experiments/figs/stats-relative.tex
In results/data/tacas24 we provide the original data in .csv format that we used to generate our results.
You can inspect them yourself, or run:
./generate_figs.py ./results/data/tacas24
You can run ./run_all.sh --help to print a short help with the parameters that can be used to fine-tune the experiments.
You can run the benchmarks with more options by passing arguments to ./run_all.sh.
For example, to run the tools mata, vata, brics, and automata on the
b-smt and armc-incl benchmark directories with 7 workers and a
120 s timeout, and to store the results in the results/data/my_experiments directory,
you can run the following:
./run_all.sh -m "mata;vata;brics;automata" -j 7 -t 120s --b-smt --armc-incl --output-dir my_experiments
(Note that this command takes some time.) This should generate two files:
results/data/my_experiments/tacas-24-automata-inclusion-bench-double-automata-inclusion-DATE-timeout-120-jobs-7-m-mata;vata;brics;automata.csv
and
results/data/my_experiments/tacas-24-bool-comb-ere-bench-variadic-bool-comb-ere-DATE-timeout-120-jobs-7-m-mata;vata;brics;automata.csv,
where DATE is the time of starting the measurement.
For further fine-tuning of the process, see ./run_all.sh --help or nfa-bench/README.md.
We provide two ways of interpreting the results.
You can either run ./generate_figs.py <PATH_TO_DIRECTORY_CONTAINING_CSV>, which will generate figures and tables
into the figs directory, or you can use our Jupyter notebook by running:
cd results/
jupyter notebook
This will run an instance of Jupyter Notebook accessible through your web browser (the exact address will be printed
to standard output). You can then run the notebook to replicate the figures and tables in the same way as with
generate_figs.py.
All results are in .csv format delimited by ;. To manually inspect the data, we recommend using python3 and
the pandas library to further interact with the results:
import pandas
pandas.read_csv("results/data/<DIR>/<FILE>.csv", sep=";")
This loads a csv file into a DataFrame representation that can easily be used for
common grouping, statistics, and many other data operations.
See the pandas documentation for more information.
We estimate the runtime of each benchmark (if run sequentially, i.e., with --jobs 1) as follows (sorted by runtime):
- lia-explicit: around half an hour;
- lia-symbolic: around 1 hour;
- armc-incl: around 2 hours;
- noodler-compl: around 2 hours;
- email-filter: around 4 hours;
- noodler-concat: around 4 hours;
- b-smt: around 6 hours;
- noodler-inter: around 10 hours;
- param-union: around 12 hours;
- param-inter: around 30 hours.
Note that these estimates assume 8 cores and 20 GB of RAM; hence, if run in a VM, the benchmarks might take longer. Running instances in parallel will speed up the experiments (however, the results might be less stable).
- The awali tool takes much more time than the others/than in the reported results: when run for the first time, awali needs to create a certain context for each operation. Running the instance again should be fast.
- automata fails with an error that BoolCombinationEmptiness.dll does not exist: this might be caused by an incorrectly compiled automata library or by an issue with symbolic links.
- There is some weird error: we strongly advise not to unpack the package in a shared directory. There are known issues with creating symlinks in shared directories (i.e., directories shared between the virtual machine and the host). We advise moving the package directly to the virtual machine and rerunning the experiments.
- You found a bug in some tool: naturally, there might be some error that we missed.
- You found an inconsistency in the results: try rerunning the tools, decreasing the number of jobs, or decreasing/increasing the timeout. We cannot guarantee that the experiments will behave the same on all architectures and machines. The tools and experiments are quite complex, there might be contention for resources between the tools, and some tools are quite resource-hungry. While we are quite confident that the results should be consistent, there might be some nondeterministic errors that manifest under certain circumstances.
- The tool fails in benchmarking, but when run manually it simply takes too long: try lowering the number of jobs below 7. pycobench seems to have some internal problems when run with the maximal number of jobs, converting timeouts into errors.
- The tool performs better when run alone with a single job: this might happen, as some tools (e.g., mata) could compete for resources, such as memory, resulting in worse performance.
- The script that converts benchmarks to other formats prints some warning/error: some warnings are expected (because of symlinks). If you encounter anything that seems unexpected, please tell us.
- Your custom benchmark cannot be run: please send us a report with your benchmark and the output of the conversion script.
We ran our experiments on Debian GNU/Linux 11 with an Intel Core 3.4 GHz processor, 8 CPU cores, and 20 GB RAM. Such a specification should be enough to reproduce our results.
Increasing the amount of RAM or the number of cores might lead to better results. The virtual machine requires at least 35 GB of disk space; however, we recommend allocating at least 70 GB.
It is recommended to allow network access to the virtual machine, since we track our benchmarks in a remote repository; this way, one can pull the latest version of the benchmarks.
The virtual machine contains a single user tacas23 with admin rights, hence,
you can install additional packages, change configurations, etc. The password
for the user is tacas23.
⚠️ We strongly advise you to either back up the image or proceed carefully before making any major changes to the virtual machine. We are not responsible for any damage done by making major changes to the machine.
The package has the following files in the root directory:
- install_requirements.sh: prepares the environment.
- run_smoke_test.sh: runs a short (~1-10 min) reproducibility test; requires confirmation.
- run_subset.sh: runs moderate (1-4 hours) experiments; requires initial confirmation.
- run_all.sh: runs all experiments (13+ hours).
- generate_figs.py: generates figures and tables for a directory with .csv results.
The package has the following directory structure:
- bin: a directory containing precompiled binaries and commands; these can be used right after installing all the requirements (using install_requirements.sh).
- nfa-bench/: a directory that contains our benchmarks. See VeriFIT/automata-bench or ./nfa-bench/README.md for more information about each benchmark.
- results/: contains the Jupyter notebook for managing the results and the data directory containing the measured results in .csv format.
- tools/: a directory that contains the installed tools.
- scripts/: contains various helper scripts.
  - ./scripts/pycobench is a local, modified version of the pycobench script used for small benchmarking. ./scripts/pycobench can be used for running a list of benchmarks (one per line, specified in a text file or passed via the < operator from stdin) on a list of tools (specified in .yaml format; see tools.yaml to see how to add a new tool) and outputs a .output file containing the log of the benchmarking; ./scripts/pyco_proc takes the .output file and creates either a csv, html, or text report of the results. We generate results in a csv file, which lists times and other metrics for each tool on each benchmark, delimited by ;.
    - You can specify concrete tools delimited by ; using the -m parameter (e.g., ./pycobench/bin/pycobench -m "jaltimpact;mata-nfa").
    - You can specify a timeout for each tool by the -t parameter (e.g., ./scripts/pycobench -t 60).
    - We recommend using ./scripts/run_pyco.sh instead.
  - run_pyco.sh: the script takes a list of paths to .input files, which contain paths to benchmarks (see automata-bench/ for examples of .input files). The script names the generated files according to the used benchmark, timeout, measured tools, number of jobs, and time/date; see the example invocation after this list. The script supports the following options:
    - -c|--config conf.yaml to change the measured tools to a specification in the conf.yaml file.
    - -t|--timeout TIME to change the internal timeout of the benchmarking to TIME seconds.
    - -v|--verbose to print additional information about the benchmarking process (mostly for debugging).
    - -m|--methods a;b;c to change the measured tools to measure only the tools a, b, and c.
    - -j|--jobs JOBS to run the benchmarking with JOBS threads. We advise not to use all 8 CPUs; in our experience, it leads to some unexpected behaviour.
    - -s|--suffix STRING to append an additional string to the result files (this is mostly used for better tagging of the results).
  - process_pyco.sh: do not use this at all; use run_pyco.sh instead.
  - compare_profiles.py: run as python3 compare_profiles [file1.csv, file2.csv, ..., filen.csv]; prints a short summary of the results in the form of tables.
  - build_all.sh: a script that builds all the tools from the tools/ directory; this should not be needed at all, since the binaries in bin/ should be executable. Run this if you made any changes to the source code; we are not responsible for any screw-ups during the build.
  - massage_arguments.py: used to convert .mata files into .mata.minterm files, which are automata in the mata format after the mintermization algorithm; these are used as inputs for some tools.
  - pack_me.sh: creates an archive of this replication package.
  - run_alib.sh/run_automata.sh/run_brics.sh: wrapper scripts for running selected tools.
  - Other scripts are not needed.
- jobs/*.yaml: specification of the tools in the yaml format supported by pycobench. You can specify a tool as follows:
  tool:
    cmd: run_tool.sh $1.fmt
  This registers a tool under the name tool, which is run as run_tool.sh and takes *.fmt files as input. The $1 will be substituted by a concrete benchmark. We recommend writing wrapper shell scripts.
- inputs/*.input: lists of automata (one per line; multiple automata can be delimited with ;) that are fed to the compiled binaries/programs using pycobench.
- programs: contains programs that are interpreted by the binaries in bin or by Python commands; these correspond to sequences of automata operations (e.g., loading automata, complementing automata, intersecting automata).
- packages: contains the Debian packages necessary for replicating the results.
- pip-packages: contains the Python packages necessary for replicating the results.
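For instance, a run of run_pyco.sh over one of the prepared .input files could look as follows (a sketch only: the .input file name is illustrative, and the flags are those documented above):
./scripts/run_pyco.sh -m "mata;vata" -t 60 -j 4 inputs/example.input
The resulting report is named after the used benchmark, timeout, measured tools, number of jobs, and date, as described above.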
If you wish to add a new tool to this package, please follow these recommendations:
- Install any prerequisites you need (see Access Credentials if you need sudo). We strongly advise keeping it simple and preferring isolated environments (e.g., using a Python venv).
- Install your tool in the tools/ directory. Consider adding your tool to $PATH.
- We advise creating a wrapper script that takes at least one argument (the name of the benchmark) and internally handles errors and outputs. By default, pycobench times the runtime of the whole tool, but one can measure additional metrics by including key: value\n lines in the output. Such lines will be recorded in the final .csv as tool-key, with cells corresponding to value. (A sketch of such a wrapper is given after these lists.)
- Register your tool in jobs/*.yaml:
  tool:
    cmd: run_tool.sh $1.fmt
- Now you can run your tool.
If you wish to add a new benchmark:
- Copy your benchmark to the nfa-bench directory.
- Generate .input files that list the benchmarks, one per line.
- Add your benchmark to run_all.sh.
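As an illustration of the key: value convention mentioned above, a minimal wrapper could look as follows (a sketch only: the binary my_tool and its --input flag are hypothetical; only the output convention follows the description above):
#!/bin/bash
# Hypothetical wrapper: run_my_tool.sh BENCHMARK.fmt
benchmark="$1"
# Run the (hypothetical) tool; pycobench measures the wall-clock time of
# this whole script, so no extra work is needed for the overall runtime.
start=$(date +%s%N)
my_tool --input "$benchmark" > /dev/null
end=$(date +%s%N)
# Report an additional metric as a key: value line; pycobench records it
# in the final .csv as the column <tool>-runtime-ms.
echo "runtime-ms: $(( (end - start) / 1000000 ))"
The wrapper scripts shipped with the package (run_alib.sh, run_automata.sh, run_brics.sh) show how the installed tools are actually invoked.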
- Comparing the tools based on other metrics (e.g., peak memory) could further highlight the strengths of each tool. Nowadays, limiting peak memory in exchange for machine time can be useful for employing the tools in a cloud infrastructure. Metrics other than time could be measured to compare the tools and their methods better (e.g., explored state space); however, we argue that machine time is the most generic metric for comparing such a wide, diverse range of tools.
- We compared the tools in a virtualized environment, running in parallel. The virtualized environment could introduce some overhead for some of the tools; however, it should not be significant.
- Running the tools in parallel could potentially slow down some of the tools or their steps (e.g., by competing over resources or interfering with cache access).
- Due to time limitations, we did not run enough warm-ups; hence, the results could be less stable. However, due to the sheer size of the benchmarks, we believe the results are stable enough to provide a fair comparison.
- Increasing the timeout could help the tools solve more problems; however, in our experience, this did not lead to significant improvements.
- The selected benchmarks represent only a limited class of problems and, moreover, some contain only a selected range of inputs. There may be other suitable benchmarks for comparison. However, we believe the selected benchmarks are quite diverse: there are parametric benchmarks, benchmarks that come from practice, and benchmarks that were used to evaluate other approaches published in papers.
- Other kinds of parametric formulae (e.g., introducing alternation or other patterns that can potentially be hard for solvers) would yield further insight into the compared tools.
- The comparison of the tools might be unfair, since the tools (1) support different formats, (2) require hand-written parsers, and (3) are implemented in different languages.
- For example, the automata.net library could perform better on a native Windows system than on a Unix system employing the .NET architecture, which might be less efficient. The results, however, suggest that despite these challenges, each tool can perform well.
- Implementing the corresponding tools in the same language, with the same format, could indeed provide a fairer comparison.
- Implementing support for a uniform format could also reduce some of the parsing overhead, as well as mitigate the possible overhead of conversion between formats.
- Most of the tools can be run with different optimizations, heuristics, or arguments. Fine-tuning these parameters could potentially enhance the results of the tools. A more thorough evaluation of different parameters could be interesting; however, it is currently beyond the scope of this project. We argue that using the default settings of the tools is still fair enough.
- There might be tools or libraries that could potentially outperform the selected tools or provide a more meaningful comparison. However, the chosen tools represent the state of the art well.
- Some of the existing tools are either not well maintained and could not be built, or are proprietary.
- Lukáš Holík (kilohsakul): the supreme leader, the emperor of theory;
- Ondřej Lengál (ondrik): prototype developer and the world's tallest hobbit;
- Martin Hruška (martinhruska): library maintainer;
- Tomáš Fiedor (tfiedor): Python binding maintainer;
- David Chocholatý (Adda0): library and binding developer;
- Vojtěch Havlena (vhavlena): library developer;
- Juraj Síč (jurajsic): library developer.
This work has been supported by the Czech Ministry of Education, Youth and Sports ERC.CZ project LL1908, and the FIT BUT internal project FIT-S-20-6427.