Flaky Tests in CI: Replication Package

Replication package for: "An Empirical Study of Detected and Undetected Flaky Test Failures in Real-World CI Pipelines"

Quick Start
Full Replication
System Requirements
Obtaining the Dataset
Repository Structure
Understanding the Data
Advanced Usage
Citation

Quick Start

Goal: Verify that the analysis scripts work using a small sample dataset (2 days of data).

Why only 2 days? The complete dataset is ~430 GB because we provide all collected data for better reuse, including data not used in our paper (e.g., failure symptoms such as stack traces, durations, ...). Downloading and processing the full dataset takes considerable time and has high system requirements. The 2-day sample allows quick validation of the analysis pipeline; it is intended only for testing the pipeline, not for drawing conclusions.

Recommended Setup: Docker installed, 6+ CPU cores, 20GB+ RAM, 15GB free disk space (See System Requirements section for details on minimum requirements)

Runtime Reference (sample dataset):

We observed the following runtimes:

Server (50 cores, 270GB RAM): ~15 minutes
MacBook Pro (16GB RAM, Apple M1): ~3.5 hours

If performance is poor or the analysis fails due to resource constraints, adjust the Spark configuration in src/analysis/utils.py.

Note: All commands below are for Unix-based systems (Linux/macOS). Windows users should use WSL2 or adapt commands accordingly.

1. Clone the repository and build the Docker image:

# Clone from GitHub
git clone https://github.com/tum-i4/flaky-tests-in-ci.git
cd flaky-tests-in-ci

# Build Docker image
docker build -t flaky-tests-ci .

2. Download sample dataset (first 2 days + validation data):

This downloads ~13 GB of data and may take some time depending on your network speed.

# Create directory structure
mkdir -p data

# Download using rsync
# Set password as environment variable (only required during review period)
export RSYNC_PASSWORD=m1836976  # For bash/zsh; Fish users: set -x RSYNC_PASSWORD m1836976

rsync -avz --progress \
  --include="raw/" \
  --include="raw/google-chromium/" \
  --include="raw/google-chromium/**/" \
  --include="raw/google-chromium/**/2024-07-01*.parquet" \
  --include="raw/google-chromium/**/2024-07-02*.parquet" \
  --include="raw/microsoft-playwright/" \
  --include="raw/microsoft-playwright/2024-08-01*.parquet" \
  --include="raw/microsoft-playwright/2024-08-02*.parquet" \
  --include="raw/gitlab-org-gitlab/" \
  --include="raw/gitlab-org-gitlab/2024-08-01*.parquet" \
  --include="raw/gitlab-org-gitlab/2024-08-02*.parquet" \
  --include="raw/cqse-teamscale/" \
  --include="raw/cqse-teamscale/2024-02-01*.parquet" \
  --include="raw/cqse-teamscale/2024-02-02*.parquet" \
  --include="validate_failures/" \
  --include="validate_failures/**/" \
  --include="validate_failures/**/*.parquet" \
  --exclude="*" \
  rsync://m1836976@dataserv.ub.tum.de/m1836976/data/ data/

# Clean up password from environment
unset RSYNC_PASSWORD            # For bash/zsh; Fish users: set -e RSYNC_PASSWORD

# Make data directory writable for Docker container
chmod -R 777 data/
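
To sample a different date window, the `--include` filters above can be generated rather than typed by hand. The following is a minimal sketch, not part of the repository: the helper name `sample_includes` and the `FLAT_PROJECTS` tuple are hypothetical, and the only facts it relies on are the directory layout and file naming shown in the rsync command (Chromium has one extra level of platform subdirectories, the other projects do not).

```python
from datetime import date, timedelta

# Projects whose parquet files sit directly in the project folder (taken from
# the rsync command above). Any other project is treated like Chromium, i.e.
# with one extra level of platform subdirectories.
FLAT_PROJECTS = ("microsoft-playwright", "gitlab-org-gitlab", "cqse-teamscale")


def sample_includes(project: str, start: date, days: int = 2) -> list:
    """Build rsync --include filters for `days` consecutive days of one project."""
    nested = project not in FLAT_PROJECTS
    filters = [f"--include=raw/{project}/"]
    if nested:
        # rsync must also be allowed to descend into the subdirectories.
        filters.append(f"--include=raw/{project}/**/")
    sub = "**/" if nested else ""
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        filters.append(f"--include=raw/{project}/{sub}{day}*.parquet")
    return filters


if __name__ == "__main__":
    for arg in sample_includes("google-chromium", date(2024, 7, 1)):
        print(arg)
```

The returned strings are meant to be passed as separate arguments (for example in a `subprocess.run` argument list), so they carry no shell quotes. The `--include=raw/` filter, the `validate_failures` filters, and the final `--exclude=*` from the command above are still needed.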

3. Run analysis pipeline in Docker:

This will write analysis results to analysis_output.log while also displaying them in the console.

docker run --rm -v "$(pwd)/data:/workspace/data" \
  -v "$(pwd)/generated_figures:/workspace/generated_figures" \
  flaky-tests-ci python /workspace/run_all_analyses.py 2>&1 | tee analysis_output.log

Note on caching: The downloaded data contains individual test executions (TEs). On first run, the scripts will create a cache in data/cache/ containing aggregated test execution batches (TEBs) and test cases. This caching process can take a while and should not be interrupted. If interrupted, delete the entire data/cache/ folder so it will be recreated. After the cache is created, subsequent runs will be faster (though still potentially slow depending on the analysis).
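
If cache creation was interrupted, the cache folder can be removed with a few lines of Python before rerunning the pipeline. This is a small sketch under the assumption that the cache lives in `data/cache/` as described above; the function name `reset_cache` is hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def reset_cache(data_dir: str = "data") -> bool:
    """Delete <data_dir>/cache if it exists; return True if it was removed."""
    cache = Path(data_dir) / "cache"
    if not cache.is_dir():
        return False
    # Remove the whole folder so the next run rebuilds TEBs and test cases.
    shutil.rmtree(cache)
    return True


if __name__ == "__main__":
    print("cache removed" if reset_cache() else "no cache found")
```

Only the `cache` subfolder is touched; `raw/` and `validate_failures/` stay in place.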

4. Check results:

Results are printed to the console during execution and saved to analysis_output.log. Generated figures are saved to:

ls -la generated_figures/*.pdf

For full replication with the complete dataset, see Full Replication below.

Full Replication

Goal: Reproduce all results from the paper using the complete dataset.

System Requirements

Minimum:

Python 3.12+
Java 17 (for PySpark)
6+ CPU cores
20GB RAM
430GB free disk space

Runtime Reference (complete dataset):

For reference, on a system with 50 cores and 270GB RAM, the full analysis took ~8 hours (~5 hours for creating the cache).

Note: All commands in this document are for Unix-based systems (Linux/macOS). Windows users should use WSL2 or adapt commands for PowerShell/CMD.

Setup Instructions

1. Clone the repository and build the Docker image:

# Clone from GitHub
git clone https://github.com/tum-i4/flaky-tests-in-ci.git
cd flaky-tests-in-ci

# Build Docker image
docker build -t flaky-tests-ci .

2. Download the complete dataset (see Obtaining the Dataset)

3. Prepare directories for Docker:

# Make data directory writable for Docker container
chmod -R 777 data/

4. Run analysis pipeline in Docker:

# Run complete analysis pipeline with output logged to file
# Optionally limit container resources with Docker's --cpus and --memory flags
docker run --rm \
  -v "$(pwd)/data:/workspace/data" \
  -v "$(pwd)/generated_figures:/workspace/generated_figures" \
  flaky-tests-ci python /workspace/run_all_analyses.py 2>&1 | tee analysis_output.log

Analysis Pipeline Details

The run_all_analyses.py script runs all analyses in sequence. On the first run, the cache (Test Execution Batches and test cases) will be created automatically, which increases the runtime significantly.

Dataset overview
RQ1: Pipeline-level flakiness
RQ2: Time series spike detection
RQ3: Pareto analysis
RQ4: Flaky type overlap
RQ5: Environment-specific patterns

Outputs:

Figures: generated_figures/*.pdf - All figures from the paper
Spike Reports: generated_figures/spike_reports/ - Detailed spike analysis reports (as a basis for manual labels in RQ2)
Console Output: Statistics, tables, and analysis results
Cache: data/cache/ - Preprocessed data (created automatically)

Performance Notes:

First run creates cache files
Subsequent runs are faster (cache is reused)

System Requirements

Hardware

Recommended:

RAM: 20GB+
Storage: 430GB free disk space (SSD preferred)
CPU: 6+ cores

Note: The analysis can run with fewer resources than recommended, but may fail with out-of-memory errors. Even with the recommended resources, processing the full dataset will be extremely slow. If you encounter issues, adjust the Spark configuration in src/analysis/utils.py.
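
As a starting point for tuning, the sketch below derives a few Spark settings from the host's resources. It makes no claim about how `src/analysis/utils.py` actually configures Spark: the function name `suggest_spark_conf` and the headroom heuristic (reserve about a quarter of the RAM, at least 2 GB) are assumptions. The property names themselves (`spark.master`, `spark.driver.memory`, `spark.sql.shuffle.partitions`) are standard Spark configuration keys.

```python
import os
from typing import Optional


def suggest_spark_conf(total_ram_gb: int, cores: Optional[int] = None) -> dict:
    """Suggest local-mode Spark settings for a host with the given resources."""
    cores = cores or os.cpu_count() or 1
    # Assumed heuristic: leave ~25% of RAM (at least 2 GB) for the OS and
    # the Python processes, and give the rest to the Spark driver JVM.
    headroom_gb = max(2, total_ram_gb // 4)
    driver_gb = max(1, total_ram_gb - headroom_gb)
    return {
        "spark.master": f"local[{cores}]",
        "spark.driver.memory": f"{driver_gb}g",
        # A few partitions per core keeps individual tasks small.
        "spark.sql.shuffle.partitions": str(cores * 4),
    }


if __name__ == "__main__":
    for key, value in suggest_spark_conf(total_ram_gb=16, cores=8).items():
        print(f"{key}={value}")
```

Each pair can then be applied with `SparkSession.builder.config(key, value)` wherever the session is created. If out-of-memory errors persist, lowering the core count reduces the number of concurrent tasks at the cost of runtime.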

Software

Component	Version	Purpose
Python	3.12+	Main runtime
Java (OpenJDK)	17	PySpark backend
Node.js	18+	Playwright connector only*
LaTeX	texlive	Figure generation
Python packages	See requirements.txt	Data processing, visualization, analysis

*Only required for data collection, not for analysis

Obtaining the Dataset

Sample Dataset (Quick Start)

For quick validation and testing, download the first 2 days of raw data per project plus all validation data.

Download commands are provided in the Quick Start section above.

Complete Dataset (Full Replication)

The complete dataset contains test executions from four projects across their respective study periods and consists of 3,131 raw data files plus 112 validation files.

Download:

You can download the complete dataset using rsync as follows:

# Set password as environment variable (only required during review period)
export RSYNC_PASSWORD=m1836976  # For bash/zsh; Fish users: set -x RSYNC_PASSWORD m1836976
rsync -avz --progress rsync://m1836976@dataserv.ub.tum.de/m1836976/data/ data/
unset RSYNC_PASSWORD            # For bash/zsh; Fish users: set -e RSYNC_PASSWORD

Download via HTTP or FTP is also available on the data server. A username and password are required during the review phase (Username: reviewer-access-03, Password: DeNp36MnS%ZhaL!qW3).

Dataset Structure:

data/
├── raw/                                    # Raw CI test execution data
│   ├── google-chromium/
│   │   ├── Linux-Tests/                    # Platform-specific subdirectories for Chromium
│   │   ├── mac14-tests/
│   │   └── Win11-Tests-x64/
│   │       └── 2024-MM-DD_NNN.parquet      # Multiple parquet files per day
│   ├── gitlab-org-gitlab/
│   │   └── 2024-MM-DD_NNN.parquet
│   ├── microsoft-playwright/
│   │   └── 2024-MM-DD_NNN.parquet
│   └── cqse-teamscale/
│       └── 2024-MM-DD_NNN.parquet          # Anonymized
└── validate_failures/                      # Validation run data
    ├── gitlab-org-gitlab/
    │   └── 2024-MM-DD_NNN.parquet
    ├── microsoft-playwright/
    │   └── 2024-MM-DD_NNN.parquet
    └── cqse-teamscale/
        └── 2024-MM-DD_NNN.parquet          # Anonymized
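
The naming scheme makes it easy to select files by project and day without opening them. The following sketch parses a path under `data/raw/` into its parts. It assumes that `NNN` is a run of digits and that the project is the folder directly below `raw/`, as in the tree above; the helper name `parse_raw_path` is hypothetical.

```python
import re
from datetime import date
from pathlib import PurePosixPath
from typing import Optional, Tuple

# YYYY-MM-DD_NNN.parquet, with NNN assumed to be one or more digits.
FILE_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})_(\d+)\.parquet$")


def parse_raw_path(path: str) -> Optional[Tuple[str, date, int]]:
    """Return (project, day, sequence number) for a raw data file, else None."""
    parts = PurePosixPath(path).parts
    match = FILE_PATTERN.match(parts[-1]) if parts else None
    if match is None or "raw" not in parts[:-1]:
        return None
    project_idx = parts.index("raw") + 1
    if project_idx >= len(parts) - 1:
        return None  # file sits directly in raw/, no project folder
    year, month, day, seq = (int(group) for group in match.groups())
    return parts[project_idx], date(year, month, day), seq


if __name__ == "__main__":
    print(parse_raw_path("data/raw/google-chromium/Linux-Tests/2024-07-01_003.parquet"))
```

Because the project is read from the folder below `raw/`, Chromium's platform subdirectories do not need special handling.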

Checksums

Verify data integrity after download:

# Download checksums file (if not already included in your rsync download)
export RSYNC_PASSWORD=m1836976
rsync -avz --progress rsync://m1836976@dataserv.ub.tum.de/m1836976/checksums.sha512 ./
unset RSYNC_PASSWORD

# Verify all downloaded files (run from repository root)
sha512sum -c checksums.sha512

# Or verify specific files only
grep "data/raw/microsoft-playwright" checksums.sha512 | sha512sum -c

The checksums file contains SHA-512 hashes for all 3,243 files in the dataset.
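
On systems without `sha512sum`, the same check can be done in Python. This is a minimal sketch that assumes `checksums.sha512` uses the usual `sha512sum` line format (hex digest, whitespace, path relative to the repository root); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquet files are not loaded into memory."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(checksum_file: str, root: str = ".") -> list:
    """Return the listed paths that are missing or whose hash does not match."""
    bad = []
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha512sum marks binary mode with a leading '*'
        target = Path(root) / name
        if not target.is_file() or sha512_of(target) != expected.lower():
            bad.append(name)
    return bad


if __name__ == "__main__":
    failures = verify("checksums.sha512")
    print("all files OK" if not failures else f"{len(failures)} files failed")
```

With only the sample dataset downloaded, most entries will be reported as missing; filtering the checksum lines by path first (as in the `grep` example above) avoids that.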

Repository Structure

.
├── src/
│   ├── analysis/                   # Analysis scripts
│   │   ├── dataset-data.py         # Dataset overview statistics
│   │   ├── RQ1.py                  # Prevalence of Flaky Failures in Pipelines
│   │   ├── RQ2.py                  # Causes of Temporal Spikes
│   │   ├── RQ3.py                  # Flaky Failures from Test Cases
│   │   ├── RQ4.py                  # Test Cases with Detected and Undetected Flaky Failures
│   │   ├── RQ5.py                  # Flaky Failures in Different Environments
│   │   ├── utils.py                # Data loading and processing utilities
│   │   └── utils_visualization.py  # Plotting and figure generation utilities
│   │
│   ├── connectors/                 # Data collection tools*
│   │   ├── main.py                 # Main data collection entry point
│   │   ├── chromium/               # Google Chromium connector
│   │   ├── gitlabProject/          # GitLab connector
│   │   ├── playwright/             # Microsoft Playwright connector
│   │   └── _commons/               # Shared connector utilities
│   │
│   ├── validate_failures/          # Failure validation framework*
│   │   ├── gitlab_org_gitlab/      # GitLab validation scripts
│   │   └── microsoft_playwright/   # Playwright validation scripts
│   │
│   └── prepare_for_publication/    # Data anonymization tools
│       └── hash_teamscale_data.py  # Teamscale data anonymization script
│
├── data/                           # Data directory (download separately)
│   ├── raw/                        # Raw CI test execution data
│   ├── cache/                      # Processed caches (auto-generated)
│   │   ├── teb/                    # Test Execution Batches
│   │   └── test_cases/             # Test case aggregations
│   └── validate_failures/          # Validation run data
│
├── generated_figures/              # Output: Generated figures (PDFs)
│   └── spike_reports/              # Detailed spike analysis reports (RQ2)
│
├── spikes_labels.csv               # Manual labeling of RQ2 spike root causes
├── .flake8                         # Python linting configuration
├── requirements.txt                # Python dependencies
└── README.md                       # This file

* Not required for replicating analysis results

Understanding the Data

Key Concepts

For detailed formal definitions, see Section II (Terminology) of the paper. Here's a brief overview:

Test Case: Defined by its name and environment (e.g., loginTest in Chrome vs. Firefox are different test cases).

Test Execution (TE): A single execution instance of a test case, uniquely identified by test case name, environment, job ID, and retry count. Raw data consists of individual TEs.

Test Execution Batch (TEB): All TEs of the same test case within a single job (including automatic retries). This is the primary unit of analysis, generated during data processing. A TEB is flaky if it contains both passing and failing TEs within the job, or if it failed in the job but passed when validated in a fresh environment.

Job: A collection of test executions run together within the same CI runner instance.

Pipeline: A collection of jobs targeting the same commit.
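
To make the relation between TEs and TEBs concrete, the sketch below groups TE records into batches in plain Python. It only illustrates the definition above and is not the repository's implementation (which builds the cache with PySpark); the function name `group_into_tebs` is hypothetical, and records are assumed to be dictionaries keyed by the raw-data column names.

```python
from collections import defaultdict


def group_into_tebs(test_executions):
    """Group TE records into TEBs keyed by (test case name, environment, job)."""
    batches = defaultdict(list)
    for te in test_executions:
        key = (te["test_case_name"], te["test_case_environment"], te["job_id"])
        batches[key].append(te)
    # Order each batch by retry_count so the initial execution comes first.
    return {
        key: sorted(batch, key=lambda te: te["retry_count"])
        for key, batch in batches.items()
    }


if __name__ == "__main__":
    tes = [
        {"test_case_name": "loginTest", "test_case_environment": "Chrome", "job_id": 1, "retry_count": 1},
        {"test_case_name": "loginTest", "test_case_environment": "Chrome", "job_id": 1, "retry_count": 0},
        {"test_case_name": "loginTest", "test_case_environment": "Firefox", "job_id": 1, "retry_count": 0},
    ]
    for key, batch in group_into_tebs(tes).items():
        print(key, [te["retry_count"] for te in batch])
```

The Chrome and Firefox executions end up in different batches because the environment is part of the test case identity.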

Verdict Classification

Each Test Execution Batch is assigned one of these verdicts:

Verdict	Description
`passed`	All executions passed
`flaky (detected)`	Initial failure + subsequent pass within the job
`flaky (undetected survived immediate rerun)`	Failed consistently with immediate reruns in job, but passed in validation
`flaky (undetected unchallenged by immediate rerun)`	Failed once without immediate reruns, but passed in validation
`failed (with immediate reruns)`	Failed consistently with immediate reruns in job and validation
`failed (no immediate reruns)`	Failed once without immediate reruns, and failed in validation
`failed (unvalidated)`	Failure not validated (rare cases)
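
The table can be read as a small decision procedure. The sketch below encodes it under stated assumptions: each execution result is given as the string "pass" or "fail" (the actual encoding of `generic_result` is not specified here), a batch counts as detected flaky when it mixes passes and failures (following the TEB definition in Key Concepts), and the validation outcome is `True`, `False`, or `None` when no validation run exists. The function name `classify_teb` is hypothetical.

```python
from typing import Optional, Sequence


def classify_teb(results: Sequence[str], validation_passed: Optional[bool]) -> str:
    """Assign a verdict to one TEB.

    results: "pass"/"fail" per execution within the job (assumed encoding).
    validation_passed: outcome of the validation run, None if not validated.
    """
    if not results:
        raise ValueError("a TEB contains at least one test execution")
    if all(result == "pass" for result in results):
        return "passed"
    if any(result == "pass" for result in results):
        return "flaky (detected)"
    # From here on, every execution in the job failed.
    had_reruns = len(results) > 1
    if validation_passed is None:
        return "failed (unvalidated)"
    if validation_passed:
        if had_reruns:
            return "flaky (undetected survived immediate rerun)"
        return "flaky (undetected unchallenged by immediate rerun)"
    if had_reruns:
        return "failed (with immediate reruns)"
    return "failed (no immediate reruns)"


if __name__ == "__main__":
    print(classify_teb(["fail", "pass"], None))
    print(classify_teb(["fail", "fail"], True))
```

The two undetected categories differ only in whether the job itself retried the test before giving up.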

Data Schema

Each parquet file contains test execution records with 23 columns across four levels:

Test Level (8 columns):

test_case_name - Test identifier (e.g., class, method, file path)
test_case_environment - Execution context (e.g., OS, browser, configuration)
retry_count - Rerun attempt number (0-based)
duration - Execution duration (seconds)
result - Project-specific result
generic_result - Normalized result (pass-like/fail-like)
failure_symptoms - Error messages, stack traces
metadata - Project-specific test metadata

Job Level (5 columns):

job_id - Unique identifier for the job
job_name - Name/label of the job
job_duration - Total execution time of the job (seconds)
job_result - Overall result of the job
job_trigger - What initiated the job (e.g., commit, manual, schedule)

Pipeline Level (9 columns):

pipeline_id - Unique identifier for the pipeline run
pipeline_result - Overall result of the pipeline
pipeline_created_at - Timestamp when pipeline was created
pipeline_duration - Total execution time of the pipeline (seconds)
pipeline_commit_sha - Git commit hash the pipeline targets
pipeline_branch_name - Git branch the pipeline runs on
pipeline_trigger - What initiated the pipeline
pipeline_metadata - Project-specific pipeline data
pipeline_name - Name/label of the pipeline

Project Level (1 column):

project_name - e.g., "microsoft/playwright"

Note: CQSE Teamscale data has anonymized test_case_name, job_name, and pipeline_branch_name fields (except "master").
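
When working with the files outside the provided scripts, a quick schema check helps catch truncated or foreign files early. The sketch below lists the 23 columns by level and reports which ones are absent from a given set of column names; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration.

```python
# Column names per level, copied from the schema description above.
EXPECTED_COLUMNS = {
    "test": [
        "test_case_name", "test_case_environment", "retry_count", "duration",
        "result", "generic_result", "failure_symptoms", "metadata",
    ],
    "job": ["job_id", "job_name", "job_duration", "job_result", "job_trigger"],
    "pipeline": [
        "pipeline_id", "pipeline_result", "pipeline_created_at",
        "pipeline_duration", "pipeline_commit_sha", "pipeline_branch_name",
        "pipeline_trigger", "pipeline_metadata", "pipeline_name",
    ],
    "project": ["project_name"],
}


def missing_columns(columns) -> list:
    """Return expected columns that are absent, in schema order."""
    present = set(columns)
    return [
        name
        for level in EXPECTED_COLUMNS.values()
        for name in level
        if name not in present
    ]


if __name__ == "__main__":
    print(missing_columns(["test_case_name", "job_id", "project_name"]))
```

The column names of a file can be obtained without loading its data, for example with `pyarrow.parquet.read_schema(path).names`, assuming pyarrow is available in your environment.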

Manual Spike Analysis (`spikes_labels.csv`)

The spikes_labels.csv file documents the manual labeling of temporal spikes in flaky test failures discovered during RQ2. It contains root cause analyses for spikes, with independent reviews from two authors each and final consensus decisions on technical root causes and temporal triggers.

Advanced Usage

Collecting New Data

Note: Not required for replication. Only use if you want to extend the dataset or collect data from new time periods. Due to the high variability of CI systems and APIs, the scripts might require updates to work with current versions.

Prerequisites:

Node.js 18+ (for Playwright connector - already installed in Docker)
API tokens (see below) - must be set as environment variables

Authentication Requirements:

Project	Token	Scopes Required
Google Chromium	None	Public LUCI API
GitLab	`CONNECTORS_GITLAB_COM_TOKEN`	`read_api`
Microsoft Playwright	`GITHUB_TOKEN`	`public_repo`

Usage:

python src/connectors/main.py <project> <start_date> <end_date> <output_folder> [--workers N]

# Example:
python src/connectors/main.py microsoft/playwright 2024-08-01 2024-08-31 data/raw --workers 4

Supported projects: google/chromium, gitlab-org/gitlab, microsoft/playwright (the connector for cqse/teamscale has been removed since it's a proprietary system)
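
Since a missing token typically surfaces only after the connector has started making API calls, it can be worth checking the environment first. The sketch below maps each supported project to the token variables from the table above; the mapping name `REQUIRED_TOKENS` and the function `missing_tokens` are hypothetical, and the connector itself may perform its own checks.

```python
import os
from typing import Mapping, Optional

# Environment variables per project, taken from the authentication table.
REQUIRED_TOKENS = {
    "google/chromium": [],
    "gitlab-org/gitlab": ["CONNECTORS_GITLAB_COM_TOKEN"],
    "microsoft/playwright": ["GITHUB_TOKEN"],
}


def missing_tokens(project: str, env: Optional[Mapping[str, str]] = None) -> list:
    """Return the token variables that are unset or empty for a project."""
    env = os.environ if env is None else env
    if project not in REQUIRED_TOKENS:
        raise ValueError(f"unsupported project: {project}")
    return [name for name in REQUIRED_TOKENS[project] if not env.get(name)]


if __name__ == "__main__":
    for project in REQUIRED_TOKENS:
        print(project, missing_tokens(project) or "ready")
```

Passing a dictionary as `env` makes the check easy to try without touching the real environment.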

Validating Failures

Purpose: Distinguish flaky tests from deterministic failures by rerunning failed tests in the original CI environment.

GitLab:

Enhanced API token (api + write_repository)
Local clone of repository with configured fork remote
Environment variables: CONNECTORS_GITLAB_COM_TOKEN, GITLAB_USER_NAME, GITLAB_LOCAL_REPO_PATH
Script modifies local .gitlab-ci.yml, commits to fork branches, triggers pipelines
Run: python -m validate_failures.gitlab_org_gitlab.main <input_parquet> [<input_parquet2> ...]

Playwright:

Enhanced GitHub token (public_repo + actions)
Fork repository to your account (no local clone needed)
Environment variable: VALIDATE_FAILURES_PLAYWRIGHT_GITHUB_TOKEN
Script uses GitHub API to create commits and trigger workflows
Run: python -m validate_failures.microsoft_playwright.rerun_failures <input_parquet> <output_folder>

Validation takes hours to days depending on failure count and CI queue times. Results are collected with the respective download_results.py scripts.

Citation

If you use this replication package or dataset in your research, please cite:

@inproceedings{TODO,
  title={An Empirical Study of Detected and Undetected Flaky Test Failures in Real-World CI Pipelines},
  author={TODO},
  booktitle={TODO},
  year={TODO},
  organization={TODO}
}

Paper: TODO (paper DOI/link)

Dataset: TODO (mediaTUM DOI/link)

Acknowledgments: This research was conducted at the Technical University of Munich (TUM). We thank the maintainers of Chromium, GitLab, and Playwright for maintaining open CI systems, and CQSE for providing access to their industrial CI data.
