Description
Hi @DonggeLiu @jonathanmetzman
Lately, I've been running a lot of local experiments on FuzzBench and noticed that after I added the --runners-cpus flag, reports were sometimes incomplete due to a race condition.
This is my config:
# The number of trials of a fuzzer-benchmark pair.
trials: 5
# The amount of time in seconds that each trial is run for.
# 1 day = 24 * 60 * 60 = 86400
max_total_time: 3600
# The location of the docker registry.
# FIXME: Support custom docker registry.
# See https://github.com/google/fuzzbench/issues/777
docker_registry: gcr.io/fuzzbench
# The local experiment folder that will store most of the experiment data.
# Please use an absolute path.
experiment_filestore: /home/zuka/hexhive/data/local-runs/experiment-data
# The local report folder where HTML reports and summary data will be stored.
# Please use an absolute path.
report_filestore: /home/zuka/hexhive/data/local-runs/report-data
# Flag that indicates this is a local experiment.
local_experiment: true
and I use this command to start the experiment:
PYTHONPATH=. python3 experiment/run_experiment.py \
--experiment-config experiment-config.yaml \
--benchmarks curl_curl_fuzzer_http freetype2_ftfuzzer bloaty_fuzz_target jsoncpp_jsoncpp_fuzzer libxml2_xml sqlite3_ossfuzz vorbis_decode_fuzzer \
--experiment-name libafl-1h-with-seeds \
--fuzzers libafl_default libafl_random libafl_weighted libafl_valprof libafl_covaccount \
--concurrent-builds 15 --runners-cpus 15 --measurers-cpus 1
Besides restricting the number of usable CPUs, --runners-cpus also adds CPU pinning to the docker command. Most of the time I get only the first cycle of trials in the report (for example, if I run with --runners-cpus 16, I get only 16 trials). For the other trials there are fuzzer logs and corpus archives, but no coverage archives.
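To give a sense of scale for the command above, here is a rough back-of-the-envelope calculation. It assumes one trial runs per pinned CPU at a time and that trials are started in batches of that size. That is my reading of the behaviour, not something I have confirmed in the scheduler code.

```python
import math

# Values taken from the config and command above.
benchmarks = 7         # entries passed to --benchmarks
fuzzers = 5            # entries passed to --fuzzers
trials_per_pair = 5    # "trials" in the experiment config
runner_cpus = 15       # --runners-cpus

total_trials = benchmarks * fuzzers * trials_per_pair
# Assumption: one trial per pinned CPU, started in batches of runner_cpus.
cycles = math.ceil(total_trials / runner_cpus)
missing_from_report = total_trials - runner_cpus

print(total_trials, cycles, missing_from_report)
```

Under those assumptions, a report that contains only the first cycle covers 15 of the 175 trials.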
The reason is that measurer_main_process ends before the next cycle of trials starts. I see "Finished measure loop." in the logs after the first cycle, and the loop is never restarted.
After some more debugging, I found the issue in this piece of code inside measure_manager_loop:
    while not scheduler.all_trials_ended(experiment):
        continue_inner_loop = measure_manager_inner_loop(
            experiment, max_cycle, request_queue, response_queue,
            queued_snapshots)
        if not continue_inner_loop:
            break
        time.sleep(MEASUREMENT_LOOP_WAIT)
After the first cycle ends, measure_manager_inner_loop returns False because there are no unmeasured snapshots in the database yet, so the loop breaks out even though the later trials have not started.
I don't really understand the need for this break, so to fix the issue for my runs I removed the break logic from the measurer loop and let it run until scheduler.all_trials_ended returns True. If you think this is an acceptable solution, I can create a PR.
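To make the behaviour concrete, here is a minimal, self-contained sketch of the two variants of the loop. It is not the FuzzBench code: run_measure_loop, make_stubs and the stop_when_idle flag are hypothetical names I made up for illustration, and the scheduler and inner loop are replaced by simple stubs.

```python
import time


def run_measure_loop(all_trials_ended, inner_loop, stop_when_idle, wait=0.0):
    """Poll until all trials have ended and return how many passes ran.

    stop_when_idle=True mimics the current break on an idle pass;
    stop_when_idle=False mimics the local fix (keep polling).
    """
    passes = 0
    while not all_trials_ended():
        passes += 1
        found_work = inner_loop()
        if stop_when_idle and not found_work:
            break
        time.sleep(wait)
    return passes


def make_stubs(inner_results):
    """Hypothetical stand-ins for the scheduler check and the inner loop."""
    pending = list(inner_results)

    def all_trials_ended():
        # Trials count as ended once every scripted pass has been consumed.
        return not pending

    def inner_loop():
        # True means the pass found snapshots to measure, False means idle.
        return pending.pop(0)

    return all_trials_ended, inner_loop


# Pass 2 is idle because the next cycle of trials has not started yet.
RESULTS = [True, False, True]
with_break = run_measure_loop(*make_stubs(RESULTS), stop_when_idle=True)
without_break = run_measure_loop(*make_stubs(RESULTS), stop_when_idle=False)
print(with_break, without_break)
```

With the break, the loop stops at the idle pass and the third pass, which has work, never runs. Without it, the loop keeps polling until the scheduler stub reports that all trials have ended.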