
LiveCodeBench Dataset and Evaluation

This directory contains the necessary files to run LiveCodeBench evaluations in a containerized environment. LiveCodeBench is a code generation benchmark that requires executing potentially untrusted LLM-generated code against test cases.

Requirements

To run LiveCodeBench in a container, you need:

System Requirements

  • Docker Engine: Use an up-to-date stable version of Docker. All scripts were tested with Docker engine v26+.
  • System Resources:
    • All scripts were tested with 24 cores and 16 workers.
    • Minimum 32GB RAM available to allocate to container
    • 10GB available disk space

Build Dependencies

  • Build Secrets: HuggingFace token for dataset access
    • Set as Docker build secret with ID HF_TOKEN
    • Required to download LiveCodeBench datasets during image build
  • Docker Hardened Images: Requires authentication; run docker login dhi.io before building.

Python Dependencies (Bundled in Container)

The following dependencies are automatically installed in the container:

  • datasets==3.6.0 - Dataset loading and management
  • pandas==2.3.3 - Data manipulation
  • fastapi==0.128.0 - Web server framework
  • uvicorn[standard]==0.40.0 - ASGI server
  • pydantic==2.12.5 - Data validation
  • tqdm==4.67.1 - Progress bars

Network Requirements

  • Exposed Port: 13835 (WebSocket server)
  • WebSocket Support: Long-lived connections (up to 2 hours)
  • Internet Access: Required during build for package installation and dataset downloads. No internet access is required during container execution.

Host-Side Dataset Generation

The LiveCodeBench.generate() method on the host creates an isolated Python virtual environment with datasets==3.6.0 and uses it to build the dataset. By default, the host-side generate method sets the --no-test-cases flag, since the host only needs to send inputs to the endpoint; outputs are evaluated inside the lcb-service container.

To enable local evaluation fallback (NOT RECOMMENDED), set save_test_cases to True by calling the generate method manually.
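The isolation step can be sketched with the standard library's venv module. This is a rough illustration only: the actual generate() implementation may differ, and it additionally installs datasets==3.6.0 into the environment. The function name is illustrative, not part of the real API.

```python
import os
import venv


def make_isolated_env(target_dir: str, with_pip: bool = True) -> str:
    """Create an isolated virtual environment and return the path to its
    Python interpreter.

    Illustrative sketch of the host-side isolation step; the real
    generate() additionally pip-installs datasets==3.6.0 into this
    environment before building the dataset.
    """
    venv.EnvBuilder(with_pip=with_pip, clear=True).create(target_dir)
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return os.path.join(target_dir, bin_dir, exe)
```

Keeping the dataset tooling in its own venv prevents the pinned datasets==3.6.0 from conflicting with whatever is installed in the host's main Python environment.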


Why Containerization is Required

LiveCodeBench must be run in a containerized environment for the following critical reasons:

1. Security Isolation

LiveCodeBench executes arbitrary code generated by language models against test cases. This code is:

  • Untrusted: Generated by AI, not human-reviewed. Since arbitrary endpoints can be benchmarked, it is also possible for the endpoint to intentionally respond with malicious code if it knows it is being benchmarked for code generation.
  • Potentially malicious: May contain bugs, infinite loops, resource exhaustion attacks, data exfiltration attacks, etc.
  • Unpredictable: Behavior cannot be guaranteed in advance

Running this code directly on a host system poses severe security risks including:

  • File system access and modification
  • Network access and data exfiltration
  • Resource exhaustion (CPU, memory, disk)
  • Privilege escalation attempts

2. Environment Consistency

  • Reproducible Builds: Containers ensure identical execution environments across different machines
  • Dependency Isolation: Prevents conflicts with host system packages and Python environments
  • Version Control: Locked dependency versions guarantee consistent behavior

3. Resource Management

  • CPU/Memory Limits: Containers allow strict resource constraints to prevent runaway processes
  • Process Isolation: Failed or hanging code executions don't affect host system stability
  • Easy Cleanup: Containers can be stopped and restarted to recover from errors

4. Network Isolation

  • Controlled Communication: Only the WebSocket port (13835) is exposed
  • Ingress/Egress Control: Network policies can restrict external communication
  • Service Discovery: Container networking simplifies service-to-service communication

⚠️ WARNING: Running LiveCodeBench code execution outside a container is strongly discouraged and may compromise system security.


Docker Security Hardening

To minimize risk and maximize security isolation when running LiveCodeBench containers:

  1. Keep Docker up to date with all security patches installed.
  2. Run Docker inside a virtual machine.
  3. Run the container on an isolated network that only allows communication with the host machine over the LCB-Service port (default: 13835). The service requires no outbound internet connections, so it is best to restrict traffic to host-container only, and only on the service port.
  • This specific configuration can be complicated to set up. An automated script is currently work-in-progress.
  4. Run the Docker daemon and container as an unprivileged user (See: Rootless Mode).
  5. Run the container with --rm and --read-only.
  6. Drop all unnecessary capabilities (See: Runtime Privileges and Capabilities).
  7. Use AppArmor (or another Linux security module) (See: AppArmor Security Profiles).
  8. Run the container with --security-opt=no-new-privileges:true to prevent most privilege-escalation attempts within the container.
  9. Enable container logging, and monitor the logs (especially if using LCB-Serve as a long-running microservice).

For resource limiting, there are a few caveats:

  • When using --cpus to limit the number of CPUs, you must set the number of workers manually via the LCB_N_WORKERS environment variable (see Environment Variables). This is because Python's os.cpu_count() reads /proc/cpuinfo, which reflects the host, and is unaware of cgroup or other CPU-limiting mechanisms.
  • If using --memory, the value should be set to at least 32g.
    • By default, all test suites will be loaded into memory and cached upon the launching of the service.
    • During loading, peak memory usage is around 21 GiB.
    • After loading, the service will consume around 17-18 GiB while idle.
    • During eval, memory usage of up to 31GiB is observed with the configuration provided in the sample command during a single /evaluate connection.
    • Setting the LCB_TEST_CACHE_SIZE environment variable to a lower value will greatly reduce the memory footprint at the cost of longer evaluate queries. See Environment Variables for more details.
      • The full dataset size consists of 1055 test suites. Reducing the cache size to 512 or 256 will reduce the memory capacity requirements.
    • If multiple /evaluate connections are expected or required, it is recommended to increase the --memory limit greatly (by 16GB per expected connection) and also reduce the cache size (see above).
      • Currently there is no programmatic limit on the number of concurrent /evaluate connections.
    • It is advised to set --memory-swap to the same value as --memory to disable swap memory unless your machine does not have enough system memory to support the workload, in which case you should increase the --memory-swap limit accordingly.
  • Process limits: For most cases, --pids-limit=4096 is fine, but it is recommended to increase this value by at least double if multiple concurrent /evaluate connections are expected.
  • As a best practice, to reduce the container's memory and PID footprint during evaluation, keep each /evaluate query to 1000 or fewer code generation samples. The sample command was tested with 3 repeats of LiveCodeBench Lite v6 (3165 code generation samples) in a single query, but due to variance in the generated code, this is not recommended and may not always work.
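Because os.cpu_count() is blind to cgroup limits, one way to derive a sensible LCB_N_WORKERS value is to parse the container's cgroup v2 cpu.max file directly. The sketch below is illustrative: the file path and quota format are standard cgroup v2, but the halving rule simply mirrors the half-the-cores default described in this document, and the function names are not part of the service.

```python
import os


def workers_from_cgroup(cpu_max: str, fallback_cpus: int) -> int:
    """Derive a worker count from a cgroup v2 cpu.max string.

    cpu_max looks like "<quota> <period>" (e.g. "2400000 100000" for
    --cpus=24) or "max <period>" when no CPU limit is set. Returns half
    the effective CPU count, matching the documented default.
    Illustrative sketch, not the service's actual implementation.
    """
    quota, _, period = cpu_max.partition(" ")
    if quota == "max":
        cpus = fallback_cpus
    else:
        cpus = max(1, int(quota) // int(period))
    return max(1, cpus // 2)


def suggested_lcb_n_workers(path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Read cpu.max if present, falling back to os.cpu_count()."""
    fallback = os.cpu_count() or 1
    try:
        with open(path) as f:
            return workers_from_cgroup(f.read().strip(), fallback)
    except OSError:
        return max(1, fallback // 2)
```

For example, with --cpus=24 the cpu.max contents would be "2400000 100000", yielding 12 workers under the default halving rule; the sample command in this document deliberately overrides that with LCB_N_WORKERS=16.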

Container Details

The lcb-service container is built on Docker's Python hardened image. The built image inherits the following features:

  • Contains no package manager
  • Runs as non-root user
  • No interactive shell
  • Minimal dependency list

Furthermore, the container will generate a read-only pristine copy at build time.


Running the Container

Build the Image

# Authenticate with Docker's Hardened Images hub
docker login dhi.io

# Create HF_TOKEN secrets file. Change the location to somewhere else if you would like it to persist.
echo "<your HuggingFace Token>" > /tmp/.hf_token

# From Inference Endpoint repository root:
docker build \
  -f src/inference_endpoint/evaluation/livecodebench/lcb_serve.dockerfile \
  --secret id=HF_TOKEN,src=/tmp/.hf_token \
  -t lcb-service \
  src/inference_endpoint/evaluation/livecodebench

# Clean up HF Token if you saved it in tmp
rm /tmp/.hf_token

(Only if using enroot) Generating a .sqsh file for enroot

Once the image has been built, it can be imported into enroot via the standard enroot import process.

First, add authentication for dhi.io to enroot via the $ENROOT_CONFIG_PATH/.credentials file. Create a read-only personal access token (PAT) for dhi.io and store it in an environment variable (e.g. export DHI_PAT_RO=<your PAT>). Then add this line to your enroot credentials file:

machine dhi.io login <docker username / email> password $DHI_PAT_RO

After doing so, you should be able to create the sqsh file with the pre-built Docker image:

enroot import --output lcb_service.sqsh dockerd://lcb-service:latest

You can then run the service via enroot. Note, however, that this is NOT RECOMMENDED and should only be done if your infrastructure supports enroot but not Docker: enroot is designed for high performance and lacks some of the extra security isolation features of Docker detailed in the Docker Security Hardening section.

Run the Container

Basic Run Command (FOR DEBUGGING ONLY)

docker run \
  --rm \
  -p 13835:13835 \
  lcb-service

Hardened Run Command

Automated creation of an isolated Docker network configuration is currently work-in-progress; if you are able to set one up, add your private network to the command below.

docker run \
  --name lcb-service \
  --rm \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=1g \
  -p 127.0.0.1:13835:13835 \
  --security-opt=no-new-privileges:true \
  --security-opt apparmor=docker-default \
  --memory=32g \
  --memory-swap=32g \
  --cpus=24 \
  -e LCB_N_WORKERS=16 \
  --pids-limit=4096 \
  --cap-drop=ALL \
  lcb-service:latest

Verify the Container

# Check container status
docker ps | grep "lcb-service"

# Check container logs
docker logs "lcb-service"

# Test WebSocket endpoint
curl localhost:13835/info

# Monitor resource usage
docker stats "lcb-service"

Stop and Cleanup

# Stop the container
docker stop "lcb-service"

# Remove the container
docker rm "lcb-service"

# Remove the image (if needed)
docker rmi lcb-service:latest

Troubleshooting

Common Issues

  1. Container fails to start

    • Check logs: docker logs lcb-service
    • Verify port 13835 is not already in use
  2. Dataset generation errors on host side

    • Ensure the host has internet access to download datasets from HuggingFace
    • If the error is a rate limit, ensure that the HF_TOKEN environment variable is set to your HuggingFace API key
    • Check that the virtual environment was created successfully at <datasets_dir>/livecodebench/venv
    • Verify that datasets==3.6.0 was installed correctly in the venv
  3. WebSocket connection issues

    • Verify firewall rules allow port 13835
    • Check network mode configuration
    • Ensure keep-alive settings are sufficient for long-running tests
  4. Resource exhaustion

    • Increase memory limits if OOM errors occur
    • Adjust CPU limits based on workload
    • Monitor with docker stats

Environment Variables

The following environment variables can be configured when running the LCB-Service container via the Docker -e flag.

  • LCB_N_WORKERS: Sets the number of worker processes to use when performing evaluation on LiveCodeBench. By default, this is half the number of available CPU cores.
  • LCB_DATASETS_DIR: If you would like to test a different version or custom dataset file, you can either mount a read-only volume to /opt/LiveCodeBench_Datasets, or you can mount it to a different directory within the container and set this variable to that directory.
  • LCB_VERSION_TAG: Specify the version of LiveCodeBench to use. By default this is release_v6.
  • LCB_AUTO_GENERATE_DATASET: If this is set to true and no dataset is located in the LCB_DATASETS_DIR, the dataset will automatically be generated when the service starts. This is useful in conjunction with LCB_VERSION_TAG and LCB_DATASETS_DIR. For instance, the following can be used to do a one-time test run of LiveCodeBench v4:
...
--tmpfs /opt/lcb_release_v4:rw,noexec,nosuid,size=8g \
-e LCB_VERSION_TAG=release_v4 \
-e LCB_AUTO_GENERATE_DATASET=true \
-e LCB_DATASETS_DIR=/opt/lcb_release_v4 \
...
  • LCB_SERVER_DEBUG: When set to any value (e.g., true, 1, yes), enables DEBUG level logging for the server. By default, only INFO level logs and above are shown. This is useful for troubleshooting and development purposes.
  • LCB_TEST_CACHE_SIZE: Controls the maximum number of problems to cache test suites for in memory. By default (if not set), there is no limit and all test cases will be cached. Set to a positive integer to limit the cache size (e.g., 100 to cache only test suites of the 100 most recently used problems). Set to none, inf, infinity, or unlimited to explicitly disable the limit. This is useful for memory-constrained environments where caching all test cases would consume too much memory.
  • LCB_PRELOAD_TESTS: If set to true (or 1, yes, on), all test cases will be preloaded into the cache during service startup. This can improve performance for the first evaluation run at the cost of longer startup time and higher initial memory usage. By default, this is true.
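Conceptually, the cache that LCB_TEST_CACHE_SIZE bounds behaves like a standard LRU map: once the limit is reached, the least recently used test suite is evicted and must be reloaded on its next use. A minimal sketch (illustrative only, not the service's actual implementation):

```python
from collections import OrderedDict


class TestSuiteCache:
    """Bounded LRU cache, conceptually similar to what LCB_TEST_CACHE_SIZE
    controls. Illustrative sketch, not the service's actual code."""

    def __init__(self, max_size=None):
        self.max_size = max_size  # None mimics the unlimited default
        self._data = OrderedDict()

    def get(self, problem_id, loader):
        if problem_id in self._data:
            self._data.move_to_end(problem_id)  # mark as recently used
            return self._data[problem_id]
        suite = loader(problem_id)  # cache miss: load (slow), then store
        self._data[problem_id] = suite
        if self.max_size is not None and len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
        return suite
```

This makes the documented trade-off concrete: a smaller max_size lowers resident memory, but any problem that falls out of the cache pays the load cost again on its next /evaluate query.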

Additional Resources


Support

For issues specific to this implementation:

  1. Check the container logs: docker logs lcb-service
  2. Review the troubleshooting section above
  3. Consult the main project documentation
  4. Open an issue in the project repository

Security Concerns: Report security vulnerabilities through responsible disclosure channels, not public issue trackers.