Flaky Tests in CI: Replication Package

Replication package for: "An Empirical Study of Detected and Undetected Flaky Test Failures in Real-World CI Pipelines"

Quick Start

Goal: Verify that the analysis scripts work using a small sample dataset (2 days of data).

Why only 2 days? The complete dataset is ~430 GB because we provide all collected data for better reuse (e.g., failure symptoms such as stack traces, and durations), including data not used in our paper. Downloading and analyzing the full dataset takes considerable time and has high system requirements. The 2-day sample allows quick validation of the analysis pipeline; it is intended only for testing the pipeline, not for drawing conclusions.

Recommended Setup: Docker installed, 6+ CPU cores, 20GB+ RAM, 15GB free disk space (See System Requirements section for details on minimum requirements)

Runtime Reference (sample dataset):

We observed the following runtimes on the sample dataset:

  • Server (50 cores, 270GB RAM): ~15 minutes
  • MacBook Pro (16GB RAM, Apple M1): ~3.5 hours

If performance is poor or the analysis fails due to resource constraints, adjust the Spark configuration in src/analysis/utils.py.

Note: All commands below are for Unix-based systems (Linux/macOS). Windows users should use WSL2 or adapt commands accordingly.

1. Clone the repository and build the Docker image:

# Clone from GitHub
git clone https://github.com/tum-i4/flaky-tests-in-ci.git
cd flaky-tests-in-ci

# Build Docker image
docker build -t flaky-tests-ci .

2. Download sample dataset (first 2 days + validation data):

This downloads ~13 GB of data and may take some time depending on your network speed.

# Create directory structure
mkdir -p data

# Download using rsync
# Set password as environment variable (only required during review period)
export RSYNC_PASSWORD=m1836976  # For bash/zsh; Fish users: set -x RSYNC_PASSWORD m1836976

rsync -avz --progress \
  --include="raw/" \
  --include="raw/google-chromium/" \
  --include="raw/google-chromium/**/" \
  --include="raw/google-chromium/**/2024-07-01*.parquet" \
  --include="raw/google-chromium/**/2024-07-02*.parquet" \
  --include="raw/microsoft-playwright/" \
  --include="raw/microsoft-playwright/2024-08-01*.parquet" \
  --include="raw/microsoft-playwright/2024-08-02*.parquet" \
  --include="raw/gitlab-org-gitlab/" \
  --include="raw/gitlab-org-gitlab/2024-08-01*.parquet" \
  --include="raw/gitlab-org-gitlab/2024-08-02*.parquet" \
  --include="raw/cqse-teamscale/" \
  --include="raw/cqse-teamscale/2024-02-01*.parquet" \
  --include="raw/cqse-teamscale/2024-02-02*.parquet" \
  --include="validate_failures/" \
  --include="validate_failures/**/" \
  --include="validate_failures/**/*.parquet" \
  --exclude="*" \
  rsync://m1836976@dataserv.ub.tum.de/m1836976/data/ data/

# Clean up password from environment
unset RSYNC_PASSWORD            # For bash/zsh; Fish users: set -e RSYNC_PASSWORD

# Make data directory writable for Docker container
chmod -R 777 data/
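
To test with a sample of a different size, the include filters above can be generated instead of typed by hand. The following Python sketch is not part of the repository. It rebuilds the same filter list for the first N days, assuming that each project's data starts on the first date used in the command above and that file names keep the `YYYY-MM-DD*.parquet` prefix. `PROJECTS` and `sample_include_args` are hypothetical names.

```python
from datetime import date, timedelta

# First sampled day per project, taken from the rsync command above.
# The boolean marks projects with per-platform subdirectories (Chromium only).
PROJECTS = {
    "google-chromium": (date(2024, 7, 1), True),
    "microsoft-playwright": (date(2024, 8, 1), False),
    "gitlab-org-gitlab": (date(2024, 8, 1), False),
    "cqse-teamscale": (date(2024, 2, 1), False),
}


def sample_include_args(days: int) -> list[str]:
    """Build rsync --include/--exclude arguments for the first `days` days."""
    args = ["--include=raw/"]
    for project, (start, nested) in PROJECTS.items():
        args.append(f"--include=raw/{project}/")
        if nested:
            # Let rsync descend into the platform subdirectories.
            args.append(f"--include=raw/{project}/**/")
        prefix = f"raw/{project}/**/" if nested else f"raw/{project}/"
        for offset in range(days):
            day = (start + timedelta(days=offset)).isoformat()
            args.append(f"--include={prefix}{day}*.parquet")
    # Validation data is always included in full, as in the command above.
    args += [
        "--include=validate_failures/",
        "--include=validate_failures/**/",
        "--include=validate_failures/**/*.parquet",
        "--exclude=*",
    ]
    return args


if __name__ == "__main__":
    # Print one argument per line, ready to paste into the rsync call.
    print(" \\\n  ".join(f'"{arg}"' for arg in sample_include_args(2)))
```

With `days=2` this yields the 18 filter arguments of the command above, in the same order.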

3. Run analysis pipeline in Docker:

This will write analysis results to analysis_output.log while also displaying them in the console.

docker run --rm -v "$(pwd)/data:/workspace/data" \
  -v "$(pwd)/generated_figures:/workspace/generated_figures" \
  flaky-tests-ci python /workspace/run_all_analyses.py 2>&1 | tee analysis_output.log

Note on caching: The downloaded data contains individual test executions (TEs). On first run, the scripts will create a cache in data/cache/ containing aggregated test execution batches (TEBs) and test cases. This caching process can take a while and should not be interrupted. If interrupted, delete the entire data/cache/ folder so it will be recreated. After the cache is created, subsequent runs will be faster (though still potentially slow depending on the analysis).
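
If cache creation was interrupted, the reset described above can be scripted. This is a minimal sketch that assumes the default layout with the cache in `data/cache/`. `reset_cache` is a hypothetical helper, not a function from this repository.

```python
import shutil
from pathlib import Path


def reset_cache(data_dir: str = "data") -> bool:
    """Delete <data_dir>/cache/ entirely; return True if a cache was removed."""
    cache_dir = Path(data_dir) / "cache"
    if not cache_dir.is_dir():
        return False
    # Remove the whole folder so the next run rebuilds TEBs and test cases.
    shutil.rmtree(cache_dir)
    return True


if __name__ == "__main__":
    print("cache removed" if reset_cache() else "no cache found")
```

Only the `cache/` folder is touched. The downloaded `raw/` and `validate_failures/` data stay in place.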

4. Check results:

Results are printed to the console during execution and saved to analysis_output.log. Generated figures are saved to:

ls -la generated_figures/*.pdf

For full replication with the complete dataset, see Full Replication below.

Full Replication

Goal: Reproduce all results from the paper using the complete dataset.

System Requirements

Minimum:

  • Python 3.12+
  • Java 17 (for PySpark)
  • 6+ CPU cores
  • 20GB RAM
  • 430GB free disk space

Runtime Reference (complete dataset):

For reference, on a system with 50 cores and 270GB RAM, the full analysis took ~8 hours (~5 hours for creating the cache).

Note: All commands in this document are for Unix-based systems (Linux/macOS). Windows users should use WSL2 or adapt commands for PowerShell/CMD.

Setup Instructions

1. Clone the repository and build the Docker image:

# Clone from GitHub
git clone https://github.com/tum-i4/flaky-tests-in-ci.git
cd flaky-tests-in-ci

# Build Docker image
docker build -t flaky-tests-ci .

2. Download the complete dataset (see Obtaining the Dataset)

3. Prepare directories for Docker:

# Make data directory writable for Docker container
chmod -R 777 data/

4. Run analysis pipeline in Docker:

# Run complete analysis pipeline with output logged to file
# To cap resources for your host system, add Docker's --cpus and --memory flags
docker run --rm \
  -v "$(pwd)/data:/workspace/data" \
  -v "$(pwd)/generated_figures:/workspace/generated_figures" \
  flaky-tests-ci python /workspace/run_all_analyses.py 2>&1 | tee analysis_output.log

Analysis Pipeline Details

On the first run, the cache (Test Execution Batches and test cases) is created automatically, which increases the runtime significantly. The run_all_analyses.py script runs the following analyses in sequence:

  1. Dataset overview
  2. RQ1: Pipeline-level flakiness
  3. RQ2: Time series spike detection
  4. RQ3: Pareto analysis
  5. RQ4: Flaky type overlap
  6. RQ5: Environment-specific patterns

Outputs:

  • Figures: generated_figures/*.pdf - All figures from the paper
  • Spike Reports: generated_figures/spike_reports/ - Detailed spike analysis reports (as a basis for manual labels in RQ2)
  • Console Output: Statistics, tables, and analysis results
  • Cache: data/cache/ - Preprocessed data (created automatically)

Performance Notes:

  • First run creates cache files
  • Subsequent runs are faster (cache is reused)

System Requirements

Hardware

Recommended:

  • RAM: 20GB+
  • Storage: 430GB free disk space (SSD preferred)
  • CPU: 6+ cores

Note: The analysis can run with fewer resources than recommended, but may fail with out-of-memory errors. Even with the recommended resources, processing the full dataset will be extremely slow. If you encounter issues, adjust the Spark configuration in src/analysis/utils.py.
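
As a starting point for such adjustments, the sketch below derives a few Spark settings from the available cores and RAM. The property names (`spark.master`, `spark.driver.memory`, `spark.sql.shuffle.partitions`) are standard Spark configuration keys. The sizing heuristic (one core and a quarter of the RAM left for the OS, four partitions per core) and the function name are assumptions. How src/analysis/utils.py actually builds its session is not shown here, so the values need to be transferred by hand.

```python
def suggest_spark_conf(cpu_cores: int, ram_gb: int) -> dict[str, str]:
    """Suggest local-mode Spark settings for a machine (heuristic, not tuned)."""
    usable_cores = max(1, cpu_cores - 1)    # leave one core for the OS
    driver_gb = max(1, int(ram_gb * 0.75))  # leave ~25% of RAM unallocated
    return {
        "spark.master": f"local[{usable_cores}]",
        "spark.driver.memory": f"{driver_gb}g",
        "spark.sql.shuffle.partitions": str(usable_cores * 4),
    }


if __name__ == "__main__":
    for key, value in suggest_spark_conf(cpu_cores=6, ram_gb=20).items():
        print(f"{key}={value}")
```

In PySpark, each pair can be applied with `SparkSession.builder.config(key, value)` before the session is created.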

Software

| Component | Version | Purpose |
| --- | --- | --- |
| Python | 3.12+ | Main runtime |
| Java (OpenJDK) | 17 | PySpark backend |
| Node.js | 18+ | Playwright connector only* |
| LaTeX | texlive | Figure generation |
| Python packages | See requirements.txt | Data processing, visualization, analysis |

*Only required for data collection, not for analysis

Obtaining the Dataset

Sample Dataset (Quick Start)

For quick validation and testing, download the first 2 days of raw data per project plus all validation data.

Download commands are provided in the Quick Start section above.

Complete Dataset (Full Replication)

The complete dataset contains test executions from four projects across their respective study periods and consists of 3,131 raw data files plus 112 validation files.

Download:

You can download the complete dataset using rsync as follows:

# Set password as environment variable (only required during review period)
export RSYNC_PASSWORD=m1836976  # For bash/zsh; Fish users: set -x RSYNC_PASSWORD m1836976
rsync -avz --progress rsync://m1836976@dataserv.ub.tum.de/m1836976/data/ data/
unset RSYNC_PASSWORD            # For bash/zsh; Fish users: set -e RSYNC_PASSWORD

Download via HTTP or FTP is also available on the data server. A username and password are required during the review phase (Username: reviewer-access-03, Password: DeNp36MnS%ZhaL!qW3).

Dataset Structure:

data/
├── raw/                                    # Raw CI test execution data
│   ├── google-chromium/
│   │   ├── Linux-Tests/                    # Platform-specific subdirectories for Chromium
│   │   ├── mac14-tests/
│   │   └── Win11-Tests-x64/
│   │       └── 2024-MM-DD_NNN.parquet      # Multiple parquet files per day
│   ├── gitlab-org-gitlab/
│   │   └── 2024-MM-DD_NNN.parquet
│   ├── microsoft-playwright/
│   │   └── 2024-MM-DD_NNN.parquet
│   └── cqse-teamscale/
│       └── 2024-MM-DD_NNN.parquet          # Anonymized
└── validate_failures/                      # Validation run data
    ├── gitlab-org-gitlab/
    │   └── 2024-MM-DD_NNN.parquet
    ├── microsoft-playwright/
    │   └── 2024-MM-DD_NNN.parquet
    └── cqse-teamscale/
        └── 2024-MM-DD_NNN.parquet          # Anonymized

Checksums

Verify data integrity after download:

# Download checksums file (if not already included in your rsync download)
export RSYNC_PASSWORD=m1836976
rsync -avz --progress rsync://m1836976@dataserv.ub.tum.de/m1836976/checksums.sha512 ./
unset RSYNC_PASSWORD

# Verify all downloaded files (run from repository root)
sha512sum -c checksums.sha512

# Or verify specific files only
grep "data/raw/microsoft-playwright" checksums.sha512 | sha512sum -c

The checksums file contains SHA-512 hashes for all 3,243 files in the dataset.
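
On systems without `sha512sum`, the same check can be done in Python. The sketch below assumes that checksums.sha512 uses the usual `sha512sum` output format (hex digest, separator, path relative to the repository root). `verify_checksums` is a hypothetical helper, not part of the repository.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquet files are not loaded into RAM."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksums_file: str, root: str = ".") -> list[str]:
    """Return the listed paths that are missing or whose hash differs."""
    failures = []
    for line in Path(checksums_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.lstrip(" *")  # separator is "  " (text) or " *" (binary)
        target = Path(root) / name
        if not target.is_file() or sha512_of(target) != expected.lower():
            failures.append(name)
    return failures


if __name__ == "__main__":
    bad = verify_checksums("checksums.sha512")
    print("all files OK" if not bad else f"{len(bad)} file(s) failed: {bad[:5]}")
```

Files that were not downloaded (for example when only the sample dataset is present) are reported as failures, so filter the checksums file first in that case.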

Repository Structure

.
├── src/
│   ├── analysis/                   # Analysis scripts
│   │   ├── dataset-data.py         # Dataset overview statistics
│   │   ├── RQ1.py                  # Prevalence of Flaky Failures in Pipelines
│   │   ├── RQ2.py                  # Causes of Temporal Spikes
│   │   ├── RQ3.py                  # Flaky Failures from Test Cases
│   │   ├── RQ4.py                  # Test Cases with Detected and Undetected Flaky Failures
│   │   ├── RQ5.py                  # Flaky Failures in Different Environments
│   │   ├── utils.py                # Data loading and processing utilities
│   │   └── utils_visualization.py  # Plotting and figure generation utilities
│   │
│   ├── connectors/                 # Data collection tools*
│   │   ├── main.py                 # Main data collection entry point
│   │   ├── chromium/               # Google Chromium connector
│   │   ├── gitlabProject/          # GitLab connector
│   │   ├── playwright/             # Microsoft Playwright connector
│   │   └── _commons/               # Shared connector utilities
│   │
│   ├── validate_failures/          # Failure validation framework*
│   │   ├── gitlab_org_gitlab/      # GitLab validation scripts
│   │   └── microsoft_playwright/   # Playwright validation scripts
│   │
│   └── prepare_for_publication/    # Data anonymization tools
│       └── hash_teamscale_data.py  # Teamscale data anonymization script
│
├── data/                           # Data directory (download separately)
│   ├── raw/                        # Raw CI test execution data
│   ├── cache/                      # Processed caches (auto-generated)
│   │   ├── teb/                    # Test Execution Batches
│   │   └── test_cases/             # Test case aggregations
│   └── validate_failures/          # Validation run data
│
├── generated_figures/              # Output: Generated figures (PDFs)
│   └── spike_reports/              # Detailed spike analysis reports (RQ2)
│
├── spikes_labels.csv               # Manual labeling of RQ2 spike root causes
├── .flake8                         # Python linting configuration
├── requirements.txt                # Python dependencies
└── README.md                       # This file

* Not required for replicating analysis results

Understanding the Data

Key Concepts

For detailed formal definitions, see Section II (Terminology) of the paper. Here's a brief overview:

Test Case: Defined by its name and environment (e.g., loginTest in Chrome vs. Firefox are different test cases).

Test Execution (TE): A single execution instance of a test case, uniquely identified by test case name, environment, job ID, and retry count. Raw data consists of individual TEs.

Test Execution Batch (TEB): All TEs of the same test case within a single job (including automatic retries). This is the primary unit of analysis, generated during data processing. A TEB is flaky if it contains both passing and failing TEs within the job, or if it failed in the job but passed when validated in a fresh environment.

Job: A collection of test executions run together within the same CI runner instance.

Pipeline: A collection of jobs targeting the same commit.
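
To make the relation between TEs and TEBs concrete, the following sketch groups raw TE records into batches keyed by test case name, environment, and job ID. It works on plain dictionaries using the column names from the Data Schema section. It illustrates the definition only and is not the Spark-based implementation used by the analysis scripts. `group_into_tebs` is a hypothetical name.

```python
from collections import defaultdict


def group_into_tebs(test_executions):
    """Group TE records into TEBs keyed by (name, environment, job ID)."""
    tebs = defaultdict(list)
    for te in test_executions:
        key = (te["test_case_name"], te["test_case_environment"], te["job_id"])
        tebs[key].append(te)
    # Order each batch by retry_count so the initial execution comes first.
    for batch in tebs.values():
        batch.sort(key=lambda te: te["retry_count"])
    return dict(tebs)


if __name__ == "__main__":
    tes = [
        {"test_case_name": "loginTest", "test_case_environment": "chrome",
         "job_id": "j1", "retry_count": 1},
        {"test_case_name": "loginTest", "test_case_environment": "chrome",
         "job_id": "j1", "retry_count": 0},
        {"test_case_name": "loginTest", "test_case_environment": "firefox",
         "job_id": "j1", "retry_count": 0},
    ]
    for key, batch in group_into_tebs(tes).items():
        print(key, [te["retry_count"] for te in batch])
```

The Chrome and Firefox runs end up in separate TEBs because the environment is part of the test case identity.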

Verdict Classification

Each Test Execution Batch is assigned one of these verdicts:

| Verdict | Description |
| --- | --- |
| passed | All executions passed |
| flaky (detected) | Initial failure + subsequent pass within the job |
| flaky (undetected survived immediate rerun) | Failed consistently with immediate reruns in job, but passed in validation |
| flaky (undetected unchallenged by immediate rerun) | Failed once without immediate reruns, but passed in validation |
| failed (with immediate reruns) | Failed consistently with immediate reruns in job and validation |
| failed (no immediate reruns) | Failed once without immediate reruns, and failed in validation |
| failed (unvalidated) | Failure not validated (rare cases) |
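
The verdict rules can be read as a small decision procedure. The sketch below is our reading of the descriptions above, not the code from the analysis scripts. It assumes that a TEB is given as a list of booleans (one per TE in the job, `True` for pass-like results) plus an optional validation outcome, and it treats any mix of passing and failing TEs as detected flakiness, following the TEB definition in Key Concepts. `classify_teb` is a hypothetical name.

```python
from typing import Optional, Sequence


def classify_teb(in_job_passed: Sequence[bool],
                 validation_passed: Optional[bool] = None) -> str:
    """Map in-job results and an optional validation result to a verdict."""
    if not in_job_passed:
        raise ValueError("a TEB needs at least one test execution")
    if all(in_job_passed):
        return "passed"
    if any(in_job_passed):
        return "flaky (detected)"
    # From here on, every execution within the job failed.
    had_reruns = len(in_job_passed) > 1
    if validation_passed is None:
        return "failed (unvalidated)"
    if validation_passed:
        if had_reruns:
            return "flaky (undetected survived immediate rerun)"
        return "flaky (undetected unchallenged by immediate rerun)"
    if had_reruns:
        return "failed (with immediate reruns)"
    return "failed (no immediate reruns)"


if __name__ == "__main__":
    print(classify_teb([False, True]))                  # fail, then pass on retry
    print(classify_teb([False, False, False], True))    # passed only in validation
    print(classify_teb([False], False))                 # failed everywhere
```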

Data Schema

Each parquet file contains test execution records with 23 columns across four levels:

Test Level (8 columns):

  • test_case_name - Test identifier (e.g., class, method, file path)
  • test_case_environment - Execution context (e.g., OS, browser, configuration)
  • retry_count - Rerun attempt number (0-based)
  • duration - Execution duration (seconds)
  • result - Project-specific result
  • generic_result - Normalized result (pass-like/fail-like)
  • failure_symptoms - Error messages, stack traces
  • metadata - Project-specific test metadata

Job Level (5 columns):

  • job_id - Unique identifier for the job
  • job_name - Name/label of the job
  • job_duration - Total execution time of the job (seconds)
  • job_result - Overall result of the job
  • job_trigger - What initiated the job (e.g., commit, manual, schedule)

Pipeline Level (9 columns):

  • pipeline_id - Unique identifier for the pipeline run
  • pipeline_result - Overall result of the pipeline
  • pipeline_created_at - Timestamp when pipeline was created
  • pipeline_duration - Total execution time of the pipeline (seconds)
  • pipeline_commit_sha - Git commit hash the pipeline targets
  • pipeline_branch_name - Git branch the pipeline runs on
  • pipeline_trigger - What initiated the pipeline
  • pipeline_metadata - Project-specific pipeline data
  • pipeline_name - Name/label of the pipeline

Project Level (1 column):

  • project_name - e.g., "microsoft/playwright"

Note: CQSE Teamscale data has anonymized test_case_name, job_name, and pipeline_branch_name fields (except "master").
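
When working with individual files outside the provided scripts, a quick column check helps catch truncated or foreign files early. The sketch below encodes the 23 columns listed above and reports which of them are absent from a given column list. `SCHEMA` and `missing_columns` are hypothetical names, and the check covers column names only, not data types.

```python
# Column names per level, as listed in the Data Schema section.
SCHEMA = {
    "test": ["test_case_name", "test_case_environment", "retry_count",
             "duration", "result", "generic_result", "failure_symptoms",
             "metadata"],
    "job": ["job_id", "job_name", "job_duration", "job_result", "job_trigger"],
    "pipeline": ["pipeline_id", "pipeline_result", "pipeline_created_at",
                 "pipeline_duration", "pipeline_commit_sha",
                 "pipeline_branch_name", "pipeline_trigger",
                 "pipeline_metadata", "pipeline_name"],
    "project": ["project_name"],
}
EXPECTED_COLUMNS = [name for names in SCHEMA.values() for name in names]


def missing_columns(actual_columns) -> list[str]:
    """Return expected columns that are absent, in schema order."""
    present = set(actual_columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


if __name__ == "__main__":
    # Example with a deliberately incomplete column list.
    print(missing_columns(["test_case_name", "job_id", "project_name"]))
```

The column list can come from any parquet reader, for example `pandas.read_parquet(path).columns` if pandas with a parquet engine is installed.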

Manual Spike Analysis (spikes_labels.csv)

The spikes_labels.csv file documents the manual labeling of temporal spikes in flaky test failures discovered during RQ2. It contains root cause analyses for spikes, with independent reviews from two authors each and final consensus decisions on technical root causes and temporal triggers.

Advanced Usage

Collecting New Data

Note: Not required for replication. Only use if you want to extend the dataset or collect data from new time periods. Due to the high variability of CI systems and APIs, the scripts might require updates to work with current versions.

Prerequisites:

  • Node.js 18+ (for Playwright connector - already installed in Docker)
  • API tokens (see below) - must be set as environment variables

Authentication Requirements:

| Project | Token | Scopes Required |
| --- | --- | --- |
| Google Chromium | None | Public LUCI API |
| GitLab | CONNECTORS_GITLAB_COM_TOKEN | read_api |
| Microsoft Playwright | GITHUB_TOKEN | public_repo |

Usage:

python src/connectors/main.py <project> <start_date> <end_date> <output_folder> [--workers N]

# Example:
python src/connectors/main.py microsoft/playwright 2024-08-01 2024-08-31 data/raw --workers 4

Supported projects: google/chromium, gitlab-org/gitlab, microsoft/playwright (the connector for cqse/teamscale has been removed since it's a proprietary system)
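
Before starting a long collection run, it can help to check that the token for the chosen project is set. The sketch below uses the environment variable names from the Authentication Requirements above. The helper `missing_token` is hypothetical and not part of src/connectors, and it only checks that the variable is non-empty, not that the token has the required scopes.

```python
import os

# Project identifier -> required environment variable (None: no token needed).
REQUIRED_TOKEN = {
    "google/chromium": None,
    "gitlab-org/gitlab": "CONNECTORS_GITLAB_COM_TOKEN",
    "microsoft/playwright": "GITHUB_TOKEN",
}


def missing_token(project: str, env=None):
    """Return the name of the unset token variable, or None if all is fine."""
    env = os.environ if env is None else env
    if project not in REQUIRED_TOKEN:
        raise ValueError(f"unsupported project: {project}")
    variable = REQUIRED_TOKEN[project]
    if variable is None or env.get(variable):
        return None
    return variable


if __name__ == "__main__":
    for project in REQUIRED_TOKEN:
        variable = missing_token(project)
        status = "ready" if variable is None else f"set {variable} first"
        print(f"{project}: {status}")
```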

Validating Failures

Purpose: Distinguish flaky tests from deterministic failures by rerunning failed tests in the original CI environment.

GitLab:

  • Enhanced API token (api + write_repository)
  • Local clone of repository with configured fork remote
  • Environment variables: CONNECTORS_GITLAB_COM_TOKEN, GITLAB_USER_NAME, GITLAB_LOCAL_REPO_PATH
  • Script modifies local .gitlab-ci.yml, commits to fork branches, triggers pipelines
  • Run: python -m validate_failures.gitlab_org_gitlab.main <input_parquet> [<input_parquet2> ...]

Playwright:

  • Enhanced GitHub token (public_repo + actions)
  • Fork repository to your account (no local clone needed)
  • Environment variable: VALIDATE_FAILURES_PLAYWRIGHT_GITHUB_TOKEN
  • Script uses GitHub API to create commits and trigger workflows
  • Run: python -m validate_failures.microsoft_playwright.rerun_failures <input_parquet> <output_folder>

Validation takes hours to days depending on failure count and CI queue times. Results are collected with the respective download_results.py scripts.

Citation

If you use this replication package or dataset in your research, please cite:

@inproceedings{TODO,
  title={An Empirical Study of Detected and Undetected Flaky Test Failures in Real-World CI Pipelines},
  author={TODO},
  booktitle={TODO},
  year={TODO},
  organization={TODO}
}

Paper: TODO (paper DOI/link) Dataset: TODO (mediaTUM DOI/link)

Acknowledgments: This research was conducted at the Technical University of Munich (TUM). We thank the maintainers of Chromium, GitLab, and Playwright for maintaining open CI systems, and CQSE for providing access to their industrial CI data.