WIP: net, infra, stuntime: Measure connectivity gap during VM live migration #4238

Closed
Anatw wants to merge 2 commits into RedHatQE:main from Anatw:stuntime-lib-helpers

@Anatw Anatw commented Mar 22, 2026

Short description:

Quantify downtime between pings for regression detection and baselines.

What this PR does / why we need it:

Add compute_stuntime helper to parse ping -D output and compute connectivity gap. To be used in the measurement scenarios.

Special notes for reviewer:

Based on the stuntime STP, not yet merged.

Verification

  • Ran with 2 VMs on Linux bridge and live migration.
  • Ping output captured from the static VM (/tmp/stuntime_ping.log) during migration of the other VM.
    Output:
    ...
    [1774185115.382314] 64 bytes from 172.16.2.2: icmp_seq=41 ttl=64 time=0.695 ms
    [1774185115.486496] 64 bytes from 172.16.2.2: icmp_seq=42 ttl=64 time=0.785 ms
    [1774185115.590352] 64 bytes from 172.16.2.2: icmp_seq=43 ttl=64 time=0.641 ms
    [1774185115.694331] 64 bytes from 172.16.2.2: icmp_seq=44 ttl=64 time=0.676 ms
    [1774185116.630529] 64 bytes from 172.16.2.2: icmp_seq=53 ttl=64 time=0.859 ms
    [1774185126.824966] 64 bytes from 172.16.2.2: icmp_seq=151 ttl=64 time=3.22 ms
    [1774185126.923577] 64 bytes from 172.16.2.2: icmp_seq=152 ttl=64 time=1.11 ms
    [1774185127.023927] 64 bytes from 172.16.2.2: icmp_seq=153 ttl=64 time=0.907 ms
    [1774185127.124168] 64 bytes from 172.16.2.2: icmp_seq=154 ttl=64 time=0.781 ms
    [1774185127.230467] 64 bytes from 172.16.2.2: icmp_seq=155 ttl=64 time=0.719 ms
    ...
  • Example measured stuntime (the connectivity gap during migration), taken from the test output:
    Measured stuntime: 10.194s
jira-ticket:

https://redhat.atlassian.net/browse/CNV-80581

Summary by CodeRabbit

  • Tests
    • Added test infrastructure for measuring network performance during virtual machine migration scenarios across different network configurations.

Anatw added 2 commits March 17, 2026 17:59
Introduce the STD for VM stuntime measurement during live migration
on Linux bridge and OVN localnet secondary networks, for both IPv4
and IPv6. These tests provide baseline measurements and a framework
for regression detection.

Baseline and Threshold:
- Per-scenario threshold - setting the threshold to the minimum of
  (max_observed × 4) and 5 seconds — i.e. 4× the worst observed value
  for that scenario, capped at 5 seconds.
- Thresholds will be hardcoded based on 10-run baselines per-scenario,
  on a BM cluster.
- A per-scenario approach is used instead of a global threshold to
  prevent slow-path scenarios from masking regressions in faster ones.
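The threshold rule above can be sketched as a small function. The function name, parameter names, and baseline values below are made up for illustration; only the formula min(max_observed × 4, 5 seconds) comes from the commit message.

```python
def scenario_threshold(baseline_runs, multiplier=4.0, cap_seconds=5.0):
    """Return min(max(baseline_runs) * multiplier, cap_seconds), in seconds."""
    return min(max(baseline_runs) * multiplier, cap_seconds)


# Hypothetical baselines (seconds); real values would come from the 10-run baselines.
fast_scenario = [0.25, 0.5, 0.375]
slow_scenario = [1.0, 1.5, 2.0]

print(scenario_threshold(fast_scenario))  # 2.0 (4 x 0.5, below the cap)
print(scenario_threshold(slow_scenario))  # 5.0 (4 x 2.0 = 8.0, capped at 5.0)
```

With these made-up numbers, a single global threshold of 5.0 would let the fast scenario regress tenfold before failing, which is the masking effect the per-scenario approach is meant to avoid.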

Measurement Methodology:
- Ping command: ICMP ping at 100ms intervals with UNIX timestamps
  (`ping -D -O -i 0.1` for IPv4, `ping -6 -D -O -i 0.1` for IPv6).
- Stuntime calculation: Stuntime is the largest gap between any two
  consecutive successful replies in the ping log - i.e., the maximum
  time difference between timestamps of successive successful packets.
  The boundaries (last success before loss, first success after
  recovery) are the pair of packets that define that gap.
  Stuntime = (Timestamp of first success after) - (Timestamp of last
  success before)
- Alternatives rejected: tcping (introducing a new dependency), iperf3
  (unnecessary complexity), and curl (unnecessary complexity - requires
  an active web server inside the VM).

IP Family:
- IPv4 and IPv6 are measured in separate migrations to avoid
  interactions between ARP (IPv4) and NDP (IPv6) recovery paths.
- ip_family is a parametrize dimension with pytest.mark.ipv4 and
  pytest.mark.ipv6 applied per value, allowing selective runs:
  IPv4-only (`pytest -m ipv4`), IPv6-only (`pytest -m ipv6`),
  or both (no `-m` flag needed).
- Total scenarios: 24 (2 CNI types × 3 migration paths × 2 ping
  initiators × 2 IP families).
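The scenario count can be checked by enumerating the matrix. This sketch uses plain itertools rather than pytest so that it runs standalone; the labels for migration paths and ping initiators are placeholders, since this commit message does not list them.

```python
from itertools import product

cni_types = ("linux_bridge", "ovn_localnet")
migration_paths = ("path_a", "path_b", "path_c")  # placeholder labels (hypothetical)
ping_initiators = ("initiator_a", "initiator_b")  # placeholder labels (hypothetical)
ip_families = ("ipv4", "ipv6")

# Every combination of the four dimensions is one scenario.
scenarios = list(product(cni_types, migration_paths, ping_initiators, ip_families))
print(len(scenarios))  # 24

# Selecting one IP family, as `pytest -m ipv4` would, keeps half of the matrix.
ipv4_only = [scenario for scenario in scenarios if scenario[3] == "ipv4"]
print(len(ipv4_only))  # 12
```

In pytest itself, a per-value marker is typically attached with `pytest.param("ipv4", marks=pytest.mark.ipv4)` inside the parametrize list; whether the PR does exactly that is not visible here.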

Both VMs start on the same node to align with the first scenario logic:
migrating from the same node to a different node (static_to_different).

Signed-off-by: Anat Wax <awax@redhat.com>
Assisted by: Cursor

Quantify downtime between pings for regression detection and baselines.

Signed-off-by: Anat Wax <awax@redhat.com>
Assisted by: Cursor

coderabbitai bot commented Mar 22, 2026

📝 Walkthrough

Two new files added for VM stuntime measurement during live migration: a helper module providing stuntime computation logic that parses ping output and calculates maximum gap between consecutive timestamps, and a test module containing test class skeletons for Linux bridge and OVN localnet network types.

Changes

Cohort / File(s) | Summary
Stuntime computation helper (tests/network/stuntime/lib_helpers.py) | New module with InsufficientStuntimeDataError exception class and compute_stuntime() function that parses raw ping -D output via regex to extract timestamps, validates that at least two timestamps exist, calculates the maximum gap between consecutive timestamps using itertools.pairwise, and logs results.
Stuntime measurement tests (tests/network/stuntime/test_stuntime_measurement.py) | Two new test classes (TestStuntimeLinuxBridge, TestStuntimeOvnLocalnet) with test_migration_stuntime() methods decorated with Polarion markers and explicitly disabled from discovery via __test__ = False. Methods include docstrings describing parametrized measurement steps across IPv4/IPv6 and migration scenarios but contain no implementation logic.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description check | ✅ Passed | The description covers the essential sections: short description, purpose, special reviewer notes, verification with concrete examples, and JIRA ticket reference. All required template sections are addressed.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
Title check | ✅ Passed | The title clearly and specifically describes the main change: introducing stuntime measurement capabilities for VM live migration connectivity analysis across network types.


@openshift-virtualization-qe-bot

Welcome! 🎉

This pull request will be automatically processed with the following features:

🔄 Automatic Actions

  • Reviewer Assignment: Reviewers are automatically assigned based on the OWNERS file in the repository root
  • Size Labeling: PR size labels (XS, S, M, L, XL, XXL) are automatically applied based on changes
  • Issue Creation: A tracking issue is created for this PR and will be closed when the PR is merged or closed
  • Branch Labeling: Branch-specific labels are applied to track the target branch
  • Auto-verification: Auto-verified users have their PRs automatically marked as verified
  • Labels: Enabled categories: branch, can-be-merged, cherry-pick, has-conflicts, hold, needs-rebase, size, verified, wip

📋 Available Commands

PR Status Management

  • /wip - Mark PR as work in progress (adds WIP: prefix to title)
  • /wip cancel - Remove work in progress status
  • /hold - Block PR merging (approvers only)
  • /hold cancel - Unblock PR merging
  • /verified - Mark PR as verified
  • /verified cancel - Remove verification status
  • /reprocess - Trigger complete PR workflow reprocessing (useful if webhook failed or configuration changed)
  • /regenerate-welcome - Regenerate this welcome message

Review & Approval

  • /lgtm - Approve changes (looks good to me)
  • /approve - Approve PR (approvers only)
  • /assign-reviewers - Assign reviewers based on OWNERS file
  • /assign-reviewer @username - Assign specific reviewer
  • /check-can-merge - Check if PR meets merge requirements

Testing & Validation

  • /retest tox - Run Python test suite with tox
  • /retest build-container - Rebuild and test container image
  • /retest verify-bugs-are-open - verify-bugs-are-open
  • /retest all - Run all available tests

Container Operations

  • /build-and-push-container - Build and push container image (tagged with PR number)
    • Supports additional build arguments: /build-and-push-container --build-arg KEY=value

Cherry-pick Operations

  • /cherry-pick <branch> - Schedule cherry-pick to target branch when PR is merged
    • Multiple branches: /cherry-pick branch1 branch2 branch3

Label Management

  • /<label-name> - Add a label to the PR
  • /<label-name> cancel - Remove a label from the PR

✅ Merge Requirements

This PR will be automatically approved when the following conditions are met:

  1. Approval: /approve from at least one approver
  2. LGTM Count: Minimum 2 /lgtm from reviewers
  3. Status Checks: All required status checks must pass
  4. No Blockers: No wip, hold, has-conflicts labels and PR must be mergeable (no conflicts)
  5. Verified: PR must be marked as verified

📊 Review Process

Approvers and Reviewers

Approvers:

  • EdDev

Reviewers:

  • Anatw
  • EdDev
  • azhivovk
  • servolkov
  • yossisegev

Available Labels

  • hold
  • verified
  • wip
  • lgtm
  • approve

💡 Tips

  • WIP Status: Use /wip when your PR is not ready for review
  • Verification: The verified label is automatically removed on each new commit
  • Cherry-picking: Cherry-pick labels are processed when the PR is merged
  • Container Builds: Container images are automatically tagged with the PR number
  • Permission Levels: Some commands require approver permissions
  • Auto-verified Users: Certain users have automatic verification and merge privileges

For more information, please refer to the project documentation or contact the maintainers.

@azhivovk azhivovk left a comment

I'm not sure we can merge this helper without any usage; it might also fail tox because of unused code.

"""Parse ping -D output and compute stuntime as the largest gap between successful replies.

Stuntime is the connectivity gap duration: the largest interval where no ICMP replies
were received. For example, with ping at 0.1s intervals, any gap > 0.1s indicates packet loss.

For example, with ping at 0.1s intervals, any gap > 0.1s indicates packet loss.

Please reconsider whether this sentence belongs here.
Why?

  1. It is meant to serve as an example for the preceding sentence, which explains what stuntime means, but it actually explains reply intervals.
  2. I am not sure it is true. AFAIK, ICMP replies may be delayed regardless of the interval between the echo requests and still arrive eventually, and that is not considered packet loss.

Stuntime in seconds (float).

Raises:
InsufficientStuntimeDataError: When ping log has fewer than 2 reply timestamps.

I think this explanation is not clear. I had to return to the class docstring to understand why fewer than 2 timestamps is a problem.


Preconditions:
- Running VM for migration on Linux bridge secondary network, running on worker1.
- Running peer VM on Linux bridge secondary network, running on worker1.

The class is defined as parametrized, but the preconditions describe one specific scenario from the matrix, in which both VMs are scheduled on the same node (the co_located_to_remote scenario).

Comment on lines +87 to +88
- Running VM for migration on OVN localnet secondary network, running on worker1.
- Running peer VM on OVN localnet secondary network, running on worker1.

same here


Anatw commented Mar 31, 2026

/wip

@openshift-virtualization-qe-bot-3 openshift-virtualization-qe-bot-3 changed the title net, infra, stuntime: Measure connectivity gap during VM live migration WIP: net, infra, stuntime: Measure connectivity gap during VM live migration Mar 31, 2026
@Anatw Anatw closed this Apr 13, 2026

6 participants