Replication package for: "An Empirical Study of Detected and Undetected Flaky Test Failures in Real-World CI Pipelines"
- Quick Start
- Full Replication
- System Requirements
- Obtaining the Dataset
- Repository Structure
- Understanding the Data
- Advanced Usage
- Troubleshooting
- Citation
Goal: Verify that the analysis scripts work using a small sample dataset (2 days of data).
Why only 2 days? The complete dataset is ~430 GB because we provide all collected data to support reuse (e.g., failure symptoms like stack traces, durations, ...), including data not used in our paper. Downloading and analyzing the full dataset takes considerable time and has high system requirements. The 2-day sample allows quick validation of the analysis pipeline; it is intended only for testing the pipeline, not for drawing conclusions.
Recommended Setup: Docker installed, 6+ CPU cores, 20GB+ RAM, 15GB free disk space (See System Requirements section for details on minimum requirements)
Runtime Reference (sample dataset):
We observed the following runtimes:
- Server (50 cores, 270GB RAM): ~15 minutes
- MacBook Pro (16GB RAM, Apple M1): ~3.5 hours
If performance is poor or the analysis fails due to resource constraints, adjust the Spark configuration in src/analysis/utils.py.
Note: All commands below are for Unix-based systems (Linux/macOS). Windows users should use WSL2 or adapt commands accordingly.
1. Clone the repository and build the Docker image:
# Clone from GitHub
git clone https://github.com/tum-i4/flaky-tests-in-ci.git
cd flaky-tests-in-ci
# Build Docker image
docker build -t flaky-tests-ci .
2. Download sample dataset (first 2 days + validation data):
This downloads ~13 GB of data and may take some time depending on your network speed.
# Create directory structure
mkdir -p data
# Download using rsync
# Set password as environment variable (only required during review period)
export RSYNC_PASSWORD=m1836976 # For bash/zsh; Fish users: set -x RSYNC_PASSWORD m1836976
rsync -avz --progress \
--include="raw/" \
--include="raw/google-chromium/" \
--include="raw/google-chromium/**/" \
--include="raw/google-chromium/**/2024-07-01*.parquet" \
--include="raw/google-chromium/**/2024-07-02*.parquet" \
--include="raw/microsoft-playwright/" \
--include="raw/microsoft-playwright/2024-08-01*.parquet" \
--include="raw/microsoft-playwright/2024-08-02*.parquet" \
--include="raw/gitlab-org-gitlab/" \
--include="raw/gitlab-org-gitlab/2024-08-01*.parquet" \
--include="raw/gitlab-org-gitlab/2024-08-02*.parquet" \
--include="raw/cqse-teamscale/" \
--include="raw/cqse-teamscale/2024-02-01*.parquet" \
--include="raw/cqse-teamscale/2024-02-02*.parquet" \
--include="validate_failures/" \
--include="validate_failures/**/" \
--include="validate_failures/**/*.parquet" \
--exclude="*" \
rsync://m1836976@dataserv.ub.tum.de/m1836976/data/ data/
# Clean up password from environment
unset RSYNC_PASSWORD # For bash/zsh; Fish users: set -e RSYNC_PASSWORD
# Make data directory writable for Docker container
chmod -R 777 data/
3. Run analysis pipeline in Docker:
This will write analysis results to analysis_output.log while also displaying them in the console.
docker run --rm -v "$(pwd)/data:/workspace/data" \
-v "$(pwd)/generated_figures:/workspace/generated_figures" \
flaky-tests-ci python /workspace/run_all_analyses.py 2>&1 | tee analysis_output.log
Note on caching: The downloaded data contains individual test executions (TEs). On the first run, the scripts create a cache in data/cache/ containing aggregated test execution batches (TEBs) and test cases. This caching process can take a while and should not be interrupted. If it is interrupted, delete the entire data/cache/ folder so that it is recreated. After the cache is created, subsequent runs will be faster (though still potentially slow depending on the analysis).
4. Check results:
Results are printed to the console during execution and saved to analysis_output.log. Generated figures are saved to:
ls -la generated_figures/*.pdf
For full replication with the complete dataset, see Full Replication below.
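If a run was interrupted while the cache was being built, the cleanup described in the caching note can be sketched as follows. This is a minimal helper under stated assumptions: the function name `reset_cache` is hypothetical and not part of the repository, and it only assumes the `data/cache/` location mentioned above.

```python
import shutil
from pathlib import Path


def reset_cache(data_dir="data"):
    """Delete <data_dir>/cache so the next run rebuilds it from scratch.

    Hypothetical helper (not part of the repository). Returns True if a
    cache directory existed and was removed, False otherwise.
    """
    cache_dir = Path(data_dir) / "cache"
    if cache_dir.is_dir():
        # Remove the whole folder; a partially written cache is not reusable.
        shutil.rmtree(cache_dir)
        return True
    return False
```

Calling `reset_cache("data")` from the repository root leaves `data/raw/` and `data/validate_failures/` untouched. If the cache was written from inside the Docker container, removing it on the host may require matching file permissions.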
Goal: Reproduce all results from the paper using the complete dataset.
Minimum:
- Python 3.12+
- Java 17 (for PySpark)
- 6+ CPU cores
- 20GB RAM
- 430GB free disk space
Runtime Reference (full dataset):
For reference, on a system with 50 cores and 270GB RAM, the full analysis took ~8 hours (~5 hours for creating the cache).
Note: All commands in this document are for Unix-based systems (Linux/macOS). Windows users should use WSL2 or adapt commands for PowerShell/CMD.
1. Clone the repository and build the Docker image:
# Clone from GitHub
git clone https://github.com/tum-i4/flaky-tests-in-ci.git
cd flaky-tests-in-ci
# Build Docker image
docker build -t flaky-tests-ci .
2. Download the complete dataset (see Obtaining the Dataset)
3. Prepare directories for Docker:
# Make data directory writable for Docker container
chmod -R 777 data/
4. Run analysis pipeline in Docker:
# Run complete analysis pipeline with output logged to file
# Adjust CPU and memory limits based on your host system
docker run --rm \
-v "$(pwd)/data:/workspace/data" \
-v "$(pwd)/generated_figures:/workspace/generated_figures" \
flaky-tests-ci python /workspace/run_all_analyses.py 2>&1 | tee analysis_output.log
On the first run, the cache (Test Execution Batches and test cases) is created automatically, which increases the runtime significantly. The run_all_analyses.py script runs all analyses in sequence:
- Dataset overview
- RQ1: Pipeline-level flakiness
- RQ2: Time series spike detection
- RQ3: Pareto analysis
- RQ4: Flaky type overlap
- RQ5: Environment-specific patterns
Outputs:
- Figures: generated_figures/*.pdf - All figures from the paper
- Spike Reports: generated_figures/spike_reports/ - Detailed spike analysis reports (as a basis for manual labels in RQ2)
- Console Output: Statistics, tables, and analysis results
- Cache: data/cache/ - Preprocessed data (created automatically)
Performance Notes:
- First run creates cache files
- Subsequent runs are faster (cache is reused)
Recommended:
- RAM: 20GB+
- Storage: 430GB free disk space (SSD preferred)
- CPU: 6+ cores
Note: The analysis can run with fewer resources than recommended, but may fail with out-of-memory errors. Even with the recommended resources, processing the full dataset will be extremely slow. If you encounter issues, adjust the Spark configuration in src/analysis/utils.py.
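As a rough illustration of what adjusting the Spark configuration could involve, the sketch below derives a few standard Spark settings from the available cores and RAM. The function `suggest_spark_conf` and its heuristics (75% of RAM for the driver, four shuffle partitions per core) are assumptions made for this example; they are not taken from src/analysis/utils.py, whose actual settings may differ.

```python
import os


def suggest_spark_conf(total_ram_gb, cores=None, ram_fraction=0.75):
    """Return a dict of standard Spark properties for a local-mode session.

    Hypothetical helper: the heuristics are illustrative, not the values
    used by the replication package.
    """
    cores = cores or os.cpu_count() or 1
    # Leave some headroom for the OS and the Python process.
    driver_gb = max(1, int(total_ram_gb * ram_fraction))
    return {
        "spark.master": f"local[{cores}]",
        "spark.driver.memory": f"{driver_gb}g",
        "spark.sql.shuffle.partitions": str(cores * 4),
    }


if __name__ == "__main__":
    for key, value in suggest_spark_conf(total_ram_gb=16, cores=6).items():
        # Each pair could be passed to SparkSession.builder.config(key, value)
        # before the session is created.
        print(key, "=", value)
```

On a machine with less RAM than recommended, lowering `ram_fraction` or the core count trades speed for a smaller memory footprint.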
| Component | Version | Purpose |
|---|---|---|
| Python | 3.12+ | Main runtime |
| Java (OpenJDK) | 17 | PySpark backend |
| Node.js | 18+ | Playwright connector only* |
| LaTeX | texlive | Figure generation |
| Python packages | See requirements.txt | Data processing, visualization, analysis |
*Only required for data collection, not for analysis
For quick validation and testing, download the first 2 days of raw data per project plus all validation data.
Download commands are provided in the Quick Start section above.
The complete dataset contains test executions from four projects across their respective study periods and consists of 3,131 raw data files plus 112 validation files.
Download:
You can download the complete dataset using rsync as follows:
# Set password as environment variable (only required during review period)
export RSYNC_PASSWORD=m1836976 # For bash/zsh; Fish users: set -x RSYNC_PASSWORD m1836976
rsync -avz --progress rsync://m1836976@dataserv.ub.tum.de/m1836976/data/ data/
unset RSYNC_PASSWORD # For bash/zsh; Fish users: set -e RSYNC_PASSWORD
Download via HTTP or FTP is also available on the data server (username and password required during the review phase: Username: reviewer-access-03 Password: DeNp36MnS%ZhaL!qW3)
Dataset Structure:
data/
├── raw/ # Raw CI test execution data
│ ├── google-chromium/
│ │ ├── Linux-Tests/ # Platform-specific subdirectories for Chromium
│ │ ├── mac14-tests/
│ │ └── Win11-Tests-x64/
│ │ └── 2024-MM-DD_NNN.parquet # Multiple parquet files per day
│ ├── gitlab-org-gitlab/
│ │ └── 2024-MM-DD_NNN.parquet
│ ├── microsoft-playwright/
│ │ └── 2024-MM-DD_NNN.parquet
│ └── cqse-teamscale/
│ └── 2024-MM-DD_NNN.parquet # Anonymized
└── validate_failures/ # Validation run data
├── gitlab-org-gitlab/
│ └── 2024-MM-DD_NNN.parquet
├── microsoft-playwright/
│ └── 2024-MM-DD_NNN.parquet
└── cqse-teamscale/
└── 2024-MM-DD_NNN.parquet # Anonymized
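Because every parquet file name starts with its date, a subset of days can be selected from a local copy without opening any file. The sketch below assumes only the `YYYY-MM-DD_NNN.parquet` naming shown in the tree above; the function names are hypothetical and not part of the repository.

```python
import re
from datetime import date
from pathlib import Path

# File names follow <date>_<suffix>.parquet, as shown in the dataset structure.
FILE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})_.+\.parquet$")


def parse_file_date(path):
    """Return the date encoded in a data file name, or None if it does not match."""
    match = FILE_PATTERN.match(Path(path).name)
    if match is None:
        return None
    try:
        return date.fromisoformat(match.group(1))
    except ValueError:
        # Digits matched the pattern but do not form a valid calendar date.
        return None


def select_by_date(paths, start, end):
    """Keep the paths whose file date lies in [start, end], in input order."""
    selected = []
    for path in paths:
        file_date = parse_file_date(path)
        if file_date is not None and start <= file_date <= end:
            selected.append(path)
    return selected
```

For example, `select_by_date(Path("data/raw/gitlab-org-gitlab").glob("*.parquet"), date(2024, 8, 1), date(2024, 8, 2))` would pick the same two days as the sample download.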
Verify data integrity after download:
# Download checksums file (if not already included in your rsync download)
export RSYNC_PASSWORD=m1836976
rsync -avz --progress rsync://m1836976@dataserv.ub.tum.de/m1836976/checksums.sha512 ./
unset RSYNC_PASSWORD
# Verify all downloaded files (run from repository root)
sha512sum -c checksums.sha512
# Or verify specific files only
grep "data/raw/microsoft-playwright" checksums.sha512 | sha512sum -c
The checksums file contains SHA-512 hashes for all 3,243 files in the dataset.
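On systems without sha512sum, the same check can be approximated in Python. This is a minimal sketch that assumes checksums.sha512 uses the usual sha512sum line format (hex digest, whitespace, relative path); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, root=".", contains=""):
    """Return the listed paths that are missing or whose hash does not match.

    `contains` restricts the check to lines whose path includes that
    substring, similar to the grep filter in the shell example.
    """
    failures = []
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        # sha512sum separates hash and path by two spaces or by " *".
        name = name.lstrip(" *")
        if contains not in name:
            continue
        target = Path(root) / name
        if not target.is_file() or sha512_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty result from `verify_checksums("checksums.sha512", contains="data/raw/microsoft-playwright")` means all matching files were found and verified.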
.
├── src/
│ ├── analysis/ # Analysis script
│ │ ├── dataset-data.py # Dataset overview statistics
│ │ ├── RQ1.py # Prevalence of Flaky Failures in Pipelines
│ │ ├── RQ2.py # Causes of Temporal Spikes
│ │ ├── RQ3.py # Flaky Failures from Test Cases
│ │ ├── RQ4.py # Test Cases with Detected and Undetected Flaky Failures
│ │ ├── RQ5.py # Flaky Failures in Different Environments
│ │ ├── utils.py # Data loading and processing utilities
│ │ └── utils_visualization.py # Plotting and figure generation utilities
│ │
│ ├── connectors/ # Data collection tools*
│ │ ├── main.py # Main data collection entry point
│ │ ├── chromium/ # Google Chromium connector
│ │ ├── gitlabProject/ # GitLab connector
│ │ ├── playwright/ # Microsoft Playwright connector
│ │ └── _commons/ # Shared connector utilities
│ │
│ ├── validate_failures/ # Failure validation framework*
│ │ ├── gitlab_org_gitlab/ # GitLab validation scripts
│ │ └── microsoft_playwright/ # Playwright validation scripts
│ │
│ └── prepare_for_publication/ # Data anonymization tools
│ └── hash_teamscale_data.py # Teamscale data anonymization script
│
├── data/ # Data directory (download separately)
│ ├── raw/ # Raw CI test execution data
│ ├── cache/ # Processed caches (auto-generated)
│ │ ├── teb/ # Test Execution Batches
│ │ └── test_cases/ # Test case aggregations
│ └── validate_failures/ # Validation run data
│
├── generated_figures/ # Output: Generated figures (PDFs)
│ └── spike_reports/ # Detailed spike analysis reports (RQ2)
│
├── spikes_labels.csv # Manual labeling of RQ2 spike root causes
├── .flake8 # Python linting configuration
├── requirements.txt # Python dependencies
└── README.md # This file
* Not required for replicating analysis results
For detailed formal definitions, see Section II (Terminology) of the paper. Here's a brief overview:
Test Case: Defined by its name and environment (e.g., loginTest in Chrome vs. Firefox are different test cases).
Test Execution (TE): A single execution instance of a test case, uniquely identified by test case name, environment, job ID, and retry count. Raw data consists of individual TEs.
Test Execution Batch (TEB): All TEs of the same test case within a single job (including automatic retries). This is the primary unit of analysis, generated during data processing. A TEB is flaky if it contains both passing and failing TEs within the job, or if it failed in the job but passed when validated in a fresh environment.
Job: A collection of test executions run together within the same CI runner instance.
Pipeline: A collection of jobs targeting the same commit.
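To make the relationship between TEs and TEBs concrete, the following sketch groups test executions by test case name, environment, and job ID, as in the definitions above. It operates on plain dictionaries for illustration; the function name is hypothetical, and the actual pipeline performs this aggregation with Spark while building the cache.

```python
from collections import defaultdict


def group_into_tebs(test_executions):
    """Group TE records into TEBs keyed by (test case name, environment, job ID).

    Each TE is a dict with at least the keys test_case_name,
    test_case_environment, job_id, and retry_count.
    """
    tebs = defaultdict(list)
    for te in test_executions:
        key = (te["test_case_name"], te["test_case_environment"], te["job_id"])
        tebs[key].append(te)
    # Order the executions inside each batch by their retry attempt.
    for batch in tebs.values():
        batch.sort(key=lambda te: te["retry_count"])
    return dict(tebs)
```

With this grouping, loginTest in Chrome and loginTest in Firefox end up in different batches even when they run in the same job.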
Each Test Execution Batch is assigned one of these verdicts:
| Verdict | Description |
|---|---|
| passed | All executions passed |
| flaky (detected) | Initial failure + subsequent pass within the job |
| flaky (undetected survived immediate rerun) | Failed consistently with immediate reruns in job, but passed in validation |
| flaky (undetected unchallenged by immediate rerun) | Failed once without immediate reruns, but passed in validation |
| failed (with immediate reruns) | Failed consistently with immediate reruns in job and validation |
| failed (no immediate reruns) | Failed once without immediate reruns, and failed in validation |
| failed (unvalidated) | Failure not validated (rare cases) |
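The verdict rules can be read as a small decision procedure. The sketch below is one interpretation of the descriptions above, not the code from the analysis scripts: it represents each TE as a boolean (True for pass-like) and the validation outcome as True, False, or None when no validation took place. The function name is hypothetical.

```python
from typing import Optional, Sequence


def teb_verdict(te_passed: Sequence[bool], validation_passed: Optional[bool] = None) -> str:
    """Assign a verdict to one TEB from its in-job results and validation outcome."""
    if not te_passed:
        raise ValueError("a TEB contains at least one test execution")
    if all(te_passed):
        return "passed"
    if any(te_passed):
        # Mixed results within the job: the rerun exposed the flakiness.
        return "flaky (detected)"
    # From here on, every execution in the job failed.
    if validation_passed is None:
        return "failed (unvalidated)"
    had_reruns = len(te_passed) > 1
    if validation_passed:
        if had_reruns:
            return "flaky (undetected survived immediate rerun)"
        return "flaky (undetected unchallenged by immediate rerun)"
    if had_reruns:
        return "failed (with immediate reruns)"
    return "failed (no immediate reruns)"
```

For instance, a test that failed three times in its job but passed in validation is classified as "flaky (undetected survived immediate rerun)".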
Each parquet file contains test execution records with 23 columns across four levels:
Test Level (8 columns):
- test_case_name - Test identifier (e.g., class, method, file path)
- test_case_environment - Execution context (e.g., OS, browser, configuration)
- retry_count - Rerun attempt number (0-based)
- duration - Execution duration (seconds)
- result - Project-specific result
- generic_result - Normalized result (pass-like/fail-like)
- failure_symptoms - Error messages, stack traces
- metadata - Project-specific test metadata
Job Level (5 columns):
- job_id - Unique identifier for the job
- job_name - Name/label of the job
- job_duration - Total execution time of the job (seconds)
- job_result - Overall result of the job
- job_trigger - What initiated the job (e.g., commit, manual, schedule)
Pipeline Level (9 columns):
- pipeline_id - Unique identifier for the pipeline run
- pipeline_result - Overall result of the pipeline
- pipeline_created_at - Timestamp when pipeline was created
- pipeline_duration - Total execution time of the pipeline (seconds)
- pipeline_commit_sha - Git commit hash the pipeline targets
- pipeline_branch_name - Git branch the pipeline runs on
- pipeline_trigger - What initiated the pipeline
- pipeline_metadata - Project-specific pipeline data
- pipeline_name - Name/label of the pipeline
Project Level (1 column):
- project_name - e.g., "microsoft/playwright"
Note: CQSE Teamscale data has anonymized test_case_name, job_name, and pipeline_branch_name fields (except "master").
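When extending the dataset or loading files with other tools, it can help to check that a file exposes all 23 documented columns. The sketch below encodes the column list from this section; the names `EXPECTED_COLUMNS` and `missing_columns` are hypothetical and not part of the repository.

```python
# Column names as documented above, grouped by level.
EXPECTED_COLUMNS = {
    "test": [
        "test_case_name", "test_case_environment", "retry_count", "duration",
        "result", "generic_result", "failure_symptoms", "metadata",
    ],
    "job": ["job_id", "job_name", "job_duration", "job_result", "job_trigger"],
    "pipeline": [
        "pipeline_id", "pipeline_result", "pipeline_created_at",
        "pipeline_duration", "pipeline_commit_sha", "pipeline_branch_name",
        "pipeline_trigger", "pipeline_metadata", "pipeline_name",
    ],
    "project": ["project_name"],
}


def missing_columns(actual_columns):
    """Return the documented columns that are absent from actual_columns."""
    present = set(actual_columns)
    return [
        column
        for columns in EXPECTED_COLUMNS.values()
        for column in columns
        if column not in present
    ]
```

The actual column names of a file can be obtained, for example, with `pyarrow.parquet.read_schema(path).names`, assuming pyarrow is installed.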
The spikes_labels.csv file documents the manual labeling of temporal spikes in flaky test failures discovered during RQ2. It contains root cause analyses for spikes, with independent reviews from two authors each and final consensus decisions on technical root causes and temporal triggers.
Note: Not required for replication. Only use if you want to extend the dataset or collect data from new time periods. Due to the high variability of CI systems and APIs, the scripts might require updates to work with current versions.
Prerequisites:
- Node.js 18+ (for Playwright connector - already installed in Docker)
- API tokens (see below) - must be set as environment variables
Authentication Requirements:
| Project | Token | Scopes Required |
|---|---|---|
| Google Chromium | None | Public LUCI API |
| GitLab | CONNECTORS_GITLAB_COM_TOKEN | read_api |
| Microsoft Playwright | GITHUB_TOKEN | public_repo |
Usage:
python src/connectors/main.py <project> <start_date> <end_date> <output_folder> [--workers N]
# Example:
python src/connectors/main.py microsoft/playwright 2024-08-01 2024-08-31 data/raw --workers 4
Supported projects: google/chromium, gitlab-org/gitlab, microsoft/playwright (the connector for cqse/teamscale has been removed because it is a proprietary system)
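Since a collection run can take a long time, it may be worth checking the required token before starting. The sketch below maps each supported project to the environment variable from the authentication table; the helper `missing_token` is hypothetical, and the connectors may perform their own checks.

```python
import os

# Environment variable required per project (None: public API, no token).
REQUIRED_TOKENS = {
    "google/chromium": None,
    "gitlab-org/gitlab": "CONNECTORS_GITLAB_COM_TOKEN",
    "microsoft/playwright": "GITHUB_TOKEN",
}


def missing_token(project, env=None):
    """Return the name of the unset token variable for a project, or None."""
    env = os.environ if env is None else env
    if project not in REQUIRED_TOKENS:
        raise ValueError(f"unsupported project: {project}")
    variable = REQUIRED_TOKENS[project]
    if variable is None or env.get(variable):
        return None
    return variable
```

For example, `missing_token("microsoft/playwright")` returns "GITHUB_TOKEN" when that variable is unset or empty, and None once it is exported.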
Purpose: Distinguish flaky tests from deterministic failures by rerunning failed tests in the original CI environment.
GitLab:
- Enhanced API token (api + write_repository)
- Local clone of repository with configured fork remote
- Environment variables: CONNECTORS_GITLAB_COM_TOKEN, GITLAB_USER_NAME, GITLAB_LOCAL_REPO_PATH
- Script modifies local .gitlab-ci.yml, commits to fork branches, triggers pipelines
- Run: python -m validate_failures.gitlab_org_gitlab.main <input_parquet> [<input_parquet2> ...]
Playwright:
- Enhanced GitHub token (public_repo + actions)
- Fork repository to your account (no local clone needed)
- Environment variable: VALIDATE_FAILURES_PLAYWRIGHT_GITHUB_TOKEN
- Script uses GitHub API to create commits and trigger workflows
- Run: python -m validate_failures.microsoft_playwright.rerun_failures <input_parquet> <output_folder>
Validation takes hours to days depending on failure count and CI queue times. Results are collected with the respective download_results.py scripts.
If you use this replication package or dataset in your research, please cite:
@inproceedings{TODO,
title={An Empirical Study of Detected and Undetected Flaky Test Failures in Real-World CI Pipelines},
author={TODO},
booktitle={TODO},
year={TODO},
organization={TODO}
}
Paper: TODO (paper DOI/link)
Dataset: TODO (mediaTUM DOI/link)
Acknowledgments: This research was conducted at the Technical University of Munich (TUM). We thank the maintainers of Chromium, GitLab, and Playwright for maintaining open CI systems, and CQSE for providing access to their industrial CI data.