
Commit 1fe77d2

test: add AI generated RC manual testing plan (#27492)
## **Description**

This PR adds an AI-powered release testing plan generator that:

- Analyzes release PRs – Fetches PR metadata, changed files, and team sign-offs from GitHub
- Categorizes changes – Identifies high-impact files (app/, patches/, manifests)
- Uses LLMs – Uses Claude (default), GPT-5, or Gemini with automatic fallback
- Auto-detects feature flags – Excludes disabled features from test scenarios
- Produces a structured plan – Outputs JSON test plan with scenarios, steps, and risk levels
- Generates HTML viewer – Styled, readable test plan deployed to GitHub Pages

The test plan is generated automatically when Bitrise posts "RC Builds Ready for Testing" on release PRs. Each new build with cherry-picks can trigger an updated plan.

Test Plan Output

Summary includes:

- releaseRiskScore (0–100, formula: min(100, round(10 * sqrt(highRisk * 4 + mediumRisk))))
- totalFilesChanged, highImpactFiles
- highRiskScenarios, mediumRiskScenarios counts
- teamsNeedingSignOff

Executive Summary includes:

- releaseFocus – One-line release description
- keyChanges – 3-5 bullet points
- overallRisk – low/medium/high
- recommendation – Go/no-go guidance

Scenario groups:

- initialScenarios – Risky areas from initial release commits
- cherryPickScenarios – Risky areas from cherry-pick commits

Each scenario includes:

- area – Feature area (e.g., "Card", "Swaps", "Send Flow")
- riskLevel – high/medium
- preconditions – Setup required before testing
- testSteps – 5-8 detailed, automation-ready steps
- expectedOutcomes – What success looks like
- whyThisMatters – References specific code changes

CI Workflow:

- .github/workflows/generate-rc-test-plan.yml – Triggered by Bitrise comment, generates test plan, deploys to GitHub Pages

Test Plan Generation:

- modes/generate-test-plan/fast-analyzer.ts – Single-call LLM test plan generation with delta/combined modes
- modes/generate-test-plan/handlers.ts – Agentic mode handlers (legacy)
- modes/generate-test-plan/prompt.ts – System prompts for test plan generation

Utilities:

- utils/feature-flags.ts – Auto-detect disabled feature flags from remote API
- utils/github-client.ts – GitHub API for PR info, team sign-offs, build numbers
- utils/git-utils.ts – Cherry-pick detection between commits, commit validation

Provider:

- Provider priority: Claude → OpenAI → Gemini
- Added usage tracking to Opus streaming responses

CI Changes

- New workflow triggers on issue_comment for release PRs
- Generates test-plan-{version}.json and test-plan-{version}.html
- Deploys to GitHub Pages: metamask.github.io/metamask-mobile/test-plans/
- Updates Bitrise comment with test plan links
- Uses existing secrets: E2E_CLAUDE_API_KEY, E2E_OPENAI_API_KEY, E2E_GEMINI_API_KEY

## **Changelog**

CHANGELOG entry: null

## **Related issues**

Fixes: https://consensyssoftware.atlassian.net/browse/INFRA-3426?actionerId=6126045c1827d1006848bec4&sourceType=assign&atlOrigin=eyJpIjoiNzQxOTQ5ZDQ4NjExNGI1ZjgzYWFjYTZhYzhhN2JmMzYiLCJwIjoiaiJ9

## **Manual testing steps**

    # Export API key
    export E2E_CLAUDE_API_KEY=sk-...

    # Run locally against a release PR
    node -r esbuild-register tests/tools/e2e-ai-analyzer \
      --mode generate-test-plan \
      --pr 25900 \
      --auto-ff

    # Check output
    cat release-test-plan.json

Verify:

- release-test-plan.json includes scenarios with riskLevel, testSteps, whyThisMatters
- Executive summary has releaseFocus and recommendation
- Disabled feature flags are listed in excludedFeatures

```gherkin
Feature: my feature name

  Scenario: user [verb for user action]
    Given [describe expected initial app state]

    When user [verb for user action]
    Then [describe expected outcome]
```

## **Screenshots/Recordings**

https://github.com/user-attachments/assets/cbcdcf77-79be-469c-8056-9e1d85be2c36

### **Before**

### **After**

## **Pre-merge author checklist**

- [ ] I've followed [MetaMask Contributor Docs](https://github.com/MetaMask/contributor-docs) and [MetaMask Mobile Coding Standards](https://github.com/MetaMask/metamask-mobile/blob/main/.github/guidelines/CODING_GUIDELINES.md).
- [ ] I've completed the PR template to the best of my ability
- [ ] I've included tests if applicable
- [ ] I've documented my code using [JSDoc](https://jsdoc.app/) format if applicable
- [ ] I've applied the right labels on the PR (see [labeling guidelines](https://github.com/MetaMask/metamask-mobile/blob/main/.github/guidelines/LABELING_GUIDELINES.md)). Not required for external contributors.

## **Pre-merge reviewer checklist**

- [ ] I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
- [ ] I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and or screenshots.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adds a new GitHub Actions workflow that runs on PR comments, writes to `gh-pages`, and posts back to PR comments; plus expands the `e2e-ai-analyzer` to call external APIs/LLMs and `gh`/`git` commands, which increases CI and release-pipeline surface area despite some input sanitization.
>
> **Overview**
> Automates RC manual testing documentation by triggering a new workflow (`generate-rc-test-plan.yml`) when Bitrise posts the "RC Builds Ready for Testing" PR comment on `release/*` branches, running the `tests/tools/e2e-ai-analyzer` to generate `release-test-plan.json`, rendering an HTML viewer, publishing both to `gh-pages`, and appending links back onto the originating comment.
>
> Extends `tests/tools/e2e-ai-analyzer` with a new `generate-test-plan` mode (including new result types and finalize tool), a fast single-call LLM path that pulls PR metadata/files/sign-offs via `gh`, optionally computes cherry-pick deltas between commits/builds, and auto-excludes disabled remote feature flags via a remote-config API call. It also updates provider behavior (Claude-first failover and Anthropic Opus streaming) and ignores generated release artifacts via `.gitignore`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 36764f6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
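
The `releaseRiskScore` formula quoted in the description can be sketched as a small function. This is only an illustration of the stated formula; the function name and signature are assumptions and are not taken from the PR's code.

```typescript
// Hypothetical helper: only the formula itself comes from the PR description.
// score = min(100, round(10 * sqrt(highRisk * 4 + mediumRisk)))
function releaseRiskScore(highRisk: number, mediumRisk: number): number {
  // High-risk scenarios are weighted four times as heavily as medium-risk ones.
  const weighted = highRisk * 4 + mediumRisk;
  // The square root dampens growth; the result is capped at 100.
  return Math.min(100, Math.round(10 * Math.sqrt(weighted)));
}

console.log(releaseRiskScore(2, 1)); // sqrt(9) = 3, so 30
console.log(releaseRiskScore(30, 5)); // capped at 100
```

Because of the square root, each additional scenario raises the score less than the previous one, and 25 high-risk scenarios alone are enough to reach the cap.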
1 parent 0d6ecb7 commit 1fe77d2

20 files changed

Lines changed: 3920 additions & 16 deletions

.github/workflows/generate-rc-test-plan.yml

Lines changed: 367 additions & 0 deletions
Large diffs are not rendered by default.

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -185,3 +185,8 @@ temp/
 tests/coverage-systems/
 
 runway-artifacts/
+
+# E2E AI Analyzer output files
+release-test-plan.json
+release-delta.json
+release-signoffs.json
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+module.exports = {
+  rules: {
+    // Disable deprecated rule that doesn't exist in current ESLint version
+    '@typescript-eslint/no-parameter-properties': 'off',
+  },
+};
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+/**
+ * Finalize Test Plan Tool Handler
+ *
+ * Handles the finalization of the AI's test plan generation
+ */
+
+import { ToolInput } from '../../types';
+
+export function handleFinalizeTestPlan(input: ToolInput): string {
+  return JSON.stringify(input);
+}

tests/tools/e2e-ai-analyzer/ai-tools/tool-executor.ts

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ import { handleListDirectory } from './handlers/list-directory';
1212
import { handleGrepCodebase } from './handlers/grep-codebase';
1313
import { handleLoadSkill } from './handlers/load-skill';
1414
import { handleFinalizeTagSelection } from './handlers/finalize-tag-selection';
15+
import { handleFinalizeTestPlan } from './handlers/finalize-test-plan';
1516

1617
/**
1718
* Tool execution context
@@ -57,6 +58,9 @@ export async function executeTool(
5758
case 'finalize_tag_selection':
5859
return handleFinalizeTagSelection(input);
5960

61+
case 'finalize_test_plan_generation':
62+
return handleFinalizeTestPlan(input);
63+
6064
default:
6165
return `Unknown tool: ${toolName}`;
6266
}
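
The dispatch added in this hunk can be illustrated in isolation. The sketch below is a simplified, synchronous stand-in: `ToolInput` is assumed to be a plain record (the real type lives in `../types`), and `executeToolSketch` is a hypothetical name, not the actual `executeTool` signature.

```typescript
// Assumption: ToolInput is approximated as a plain record for this sketch.
type ToolInput = Record<string, unknown>;

// Mirrors the handler in the diff: the finalize step serializes the plan as-is.
function handleFinalizeTestPlan(input: ToolInput): string {
  return JSON.stringify(input);
}

// Hypothetical, reduced dispatcher showing only the new case and the fallback.
function executeToolSketch(toolName: string, input: ToolInput): string {
  switch (toolName) {
    case 'finalize_test_plan_generation':
      return handleFinalizeTestPlan(input);
    default:
      return `Unknown tool: ${toolName}`;
  }
}

console.log(executeToolSketch('finalize_test_plan_generation', { confidence: 80 }));
console.log(executeToolSketch('not_a_tool', {}));
```

The handler does no validation of its own, so any shape checking has to happen either in the tool schema or when the caller parses the returned JSON string.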

tests/tools/e2e-ai-analyzer/ai-tools/tool-registry.ts

Lines changed: 205 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -9,10 +9,10 @@ import { LLMTool } from '../providers';
99
import { TOOL_LIMITS } from '../config';
1010

1111
/**
12-
* Gets all tool definitions for the AI agent
12+
* Gets tool definitions for the AI agent
1313
*/
1414
export function getToolDefinitions(): LLMTool[] {
15-
return [
15+
const allTools: LLMTool[] = [
1616
{
1717
name: 'read_file',
1818
description:
@@ -194,5 +194,208 @@ export function getToolDefinitions(): LLMTool[] {
194194
],
195195
},
196196
},
197+
{
198+
name: 'finalize_test_plan_generation',
199+
description: 'Submit the final exploratory test plan for the release',
200+
input_schema: {
201+
type: 'object',
202+
properties: {
203+
summary: {
204+
type: 'object',
205+
description: 'High-level metrics for the test plan',
206+
properties: {
207+
total_changed_files: { type: 'number' },
208+
total_commits: { type: 'number' },
209+
critical_areas: { type: 'number' },
210+
high_risk_areas: { type: 'number' },
211+
medium_risk_areas: { type: 'number' },
212+
low_risk_areas: { type: 'number' },
213+
estimated_testing_hours: { type: 'string' },
214+
release_version: { type: 'string' },
215+
},
216+
required: [
217+
'total_changed_files',
218+
'critical_areas',
219+
'high_risk_areas',
220+
'estimated_testing_hours',
221+
],
222+
},
223+
feature_areas: {
224+
type: 'array',
225+
description:
226+
'Prioritized list of feature areas with test scenarios',
227+
items: {
228+
type: 'object',
229+
properties: {
230+
feature_area: { type: 'string' },
231+
risk_level: {
232+
type: 'string',
233+
enum: ['critical', 'high', 'medium', 'low'],
234+
},
235+
risk_justification: { type: 'string' },
236+
impacted_components: {
237+
type: 'array',
238+
items: { type: 'string' },
239+
},
240+
exploratory_scenarios: {
241+
type: 'array',
242+
items: {
243+
type: 'object',
244+
properties: {
245+
id: { type: 'string' },
246+
title: { type: 'string' },
247+
description: { type: 'string' },
248+
preconditions: {
249+
type: 'array',
250+
items: { type: 'string' },
251+
},
252+
exploration_guidance: {
253+
type: 'array',
254+
items: { type: 'string' },
255+
},
256+
risk_indicators: {
257+
type: 'array',
258+
items: { type: 'string' },
259+
},
260+
related_changes: {
261+
type: 'array',
262+
items: { type: 'string' },
263+
},
264+
},
265+
required: ['id', 'title', 'description'],
266+
},
267+
},
268+
platform_notes: {
269+
type: 'object',
270+
properties: {
271+
ios: { type: 'array', items: { type: 'string' } },
272+
android: { type: 'array', items: { type: 'string' } },
273+
shared: { type: 'array', items: { type: 'string' } },
274+
},
275+
},
276+
priority: { type: 'number' },
277+
exploratory_priority: {
278+
type: 'number',
279+
description:
280+
'Score 1-10 indicating how much this area needs exploratory testing',
281+
},
282+
exploration_charters: {
283+
type: 'array',
284+
description: 'Specific exploration missions for this area',
285+
items: {
286+
type: 'object',
287+
properties: {
288+
id: { type: 'string' },
289+
mission: {
290+
type: 'string',
291+
description: 'The exploration goal',
292+
},
293+
context: {
294+
type: 'string',
295+
description: 'Why this matters for this release',
296+
},
297+
what_ifs: {
298+
type: 'array',
299+
items: { type: 'string' },
300+
description: 'Specific questions to investigate',
301+
},
302+
time_box: {
303+
type: 'string',
304+
description: 'Suggested exploration time',
305+
},
306+
},
307+
required: ['id', 'mission', 'what_ifs'],
308+
},
309+
},
310+
},
311+
required: ['feature_area', 'risk_level', 'priority'],
312+
},
313+
},
314+
cross_cutting_concerns: {
315+
type: 'array',
316+
items: { type: 'string' },
317+
description: 'Issues that span multiple feature areas',
318+
},
319+
regression_focus_areas: {
320+
type: 'array',
321+
items: { type: 'string' },
322+
description: 'Areas requiring extra regression attention',
323+
},
324+
platform_specific_guidance: {
325+
type: 'object',
326+
properties: {
327+
ios: { type: 'array', items: { type: 'string' } },
328+
android: { type: 'array', items: { type: 'string' } },
329+
shared: { type: 'array', items: { type: 'string' } },
330+
},
331+
},
332+
exploration_themes: {
333+
type: 'array',
334+
description:
335+
'Cross-cutting exploration approaches that apply across features',
336+
items: {
337+
type: 'object',
338+
properties: {
339+
name: {
340+
type: 'string',
341+
description: 'Theme name (e.g., "Interruption Testing")',
342+
},
343+
description: {
344+
type: 'string',
345+
description: 'What this theme covers',
346+
},
347+
techniques: {
348+
type: 'array',
349+
items: { type: 'string' },
350+
description: 'Specific testing techniques for this theme',
351+
},
352+
applicable_areas: {
353+
type: 'array',
354+
items: { type: 'string' },
355+
description:
356+
'Feature areas where this theme is especially relevant',
357+
},
358+
},
359+
required: ['name', 'description', 'techniques'],
360+
},
361+
},
362+
exploratory_focus_areas: {
363+
type: 'array',
364+
description:
365+
'Top 3-5 areas most deserving of creative exploratory testing',
366+
items: {
367+
type: 'object',
368+
properties: {
369+
feature_area: { type: 'string' },
370+
exploratory_priority: {
371+
type: 'number',
372+
description: 'Score 1-10',
373+
},
374+
reason: {
375+
type: 'string',
376+
description: 'Why this area needs exploration',
377+
},
378+
suggested_time_box: {
379+
type: 'string',
380+
description: 'Recommended exploration time',
381+
},
382+
},
383+
required: ['feature_area', 'exploratory_priority', 'reason'],
384+
},
385+
},
386+
reasoning: {
387+
type: 'string',
388+
description: 'Explanation of analysis approach and key findings',
389+
},
390+
confidence: {
391+
type: 'number',
392+
description: 'Confidence score 0-100',
393+
},
394+
},
395+
required: ['summary', 'feature_areas', 'reasoning', 'confidence'],
396+
},
397+
},
197398
];
399+
400+
return allTools;
198401
}
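
To make the `required` lists in this schema concrete, here is a minimal sketch of a payload that would satisfy them, plus a tiny checker. The field names come from the schema above; the checker `missingKeys`, the constant names, and all payload values are illustrative assumptions rather than code or data from the PR.

```typescript
// Required keys copied from the finalize_test_plan_generation schema.
const TOP_LEVEL_REQUIRED = ['summary', 'feature_areas', 'reasoning', 'confidence'];
const SUMMARY_REQUIRED = [
  'total_changed_files',
  'critical_areas',
  'high_risk_areas',
  'estimated_testing_hours',
];
const FEATURE_AREA_REQUIRED = ['feature_area', 'risk_level', 'priority'];

// Hypothetical helper: returns the required keys that are absent from `value`.
function missingKeys(value: Record<string, unknown>, required: readonly string[]): string[] {
  return required.filter((key) => value[key] === undefined);
}

// Illustrative payload only; every value here is made up for the example.
const examplePlan = {
  summary: {
    total_changed_files: 20,
    critical_areas: 0,
    high_risk_areas: 1,
    estimated_testing_hours: '2-3',
  },
  feature_areas: [{ feature_area: 'Swaps', risk_level: 'high', priority: 1 }],
  reasoning: 'Illustrative reasoning text.',
  confidence: 70,
};

console.log(missingKeys(examplePlan, TOP_LEVEL_REQUIRED)); // []
console.log(missingKeys(examplePlan.summary, SUMMARY_REQUIRED)); // []
console.log(missingKeys(examplePlan.feature_areas[0], FEATURE_AREA_REQUIRED)); // []
```

Note that `estimated_testing_hours` is declared as a string in the schema, so a range such as `'2-3'` is allowed, while `confidence` must be a number.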

tests/tools/e2e-ai-analyzer/analysis/analyzer.ts

Lines changed: 23 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -33,6 +33,16 @@ import {
3333
outputAnalysis as outputSelectTagsAnalysis,
3434
checkHardRules as checkSelectTagsHardRules,
3535
} from '../modes/select-tags/handlers';
36+
import {
37+
buildSystemPrompt as buildTestPlanSystemPrompt,
38+
buildTaskPrompt as buildTestPlanTaskPrompt,
39+
} from '../modes/generate-test-plan/prompt';
40+
import {
41+
processAnalysis as processTestPlanAnalysis,
42+
createConservativeResult as createTestPlanConservativeResult,
43+
createEmptyResult as createTestPlanEmptyResult,
44+
outputAnalysis as outputTestPlanAnalysis,
45+
} from '../modes/generate-test-plan/handlers';
3646

3747
/**
3848
* Mode Registry — see ModeConfig in types/index.ts for the full interface.
@@ -56,6 +66,16 @@ export const MODES: {
5666
outputAnalysis: outputSelectTagsAnalysis,
5767
checkHardRules: checkSelectTagsHardRules,
5868
},
69+
'generate-test-plan': {
70+
description: 'Generate exploratory test plan for release testing',
71+
finalizeToolName: 'finalize_test_plan_generation',
72+
systemPromptBuilder: buildTestPlanSystemPrompt,
73+
taskPromptBuilder: buildTestPlanTaskPrompt,
74+
processAnalysis: processTestPlanAnalysis,
75+
createConservativeResult: createTestPlanConservativeResult,
76+
createEmptyResult: createTestPlanEmptyResult,
77+
outputAnalysis: outputTestPlanAnalysis,
78+
},
5979
};
6080

6181
// Type aliases for mode keys and analysis results
@@ -112,6 +132,7 @@ export async function analyzeWithAgent<M extends ModeKey>(
112132
const taskPrompt = modeConfig.taskPromptBuilder(
113133
allChangedFiles,
114134
criticalFiles,
135+
context,
115136
);
116137

117138
const tools = getToolDefinitions();
@@ -229,7 +250,7 @@ export async function analyzeWithAgent<M extends ModeKey>(
229250
return analysis as ModeAnalysisResult<M>;
230251
}
231252

232-
console.log('⚠️ Failed to parse finalize_tag_selection');
253+
console.log(`⚠️ Failed to parse ${modeConfig.finalizeToolName}`);
233254
printTokenReport();
234255
return modeConfig.createConservativeResult() as ModeAnalysisResult<M>;
235256
}
@@ -245,8 +266,7 @@ export async function analyzeWithAgent<M extends ModeKey>(
245266
// Update conversation history
246267
conversationHistory.push({
247268
role: 'user',
248-
content:
249-
typeof currentMessage === 'string' ? currentMessage : currentMessage,
269+
content: currentMessage,
250270
});
251271
conversationHistory.push({
252272
role: 'assistant',
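
The registry pattern in these hunks, and the reason the hard-coded log message had to change, can be sketched with a reduced config shape. `ModeConfigSketch`, `MODES_SKETCH`, and `parseFailureMessage` are hypothetical names, the `'select-tags'` key and its description are assumed from the import path, and the real `ModeConfig` in `types/index.ts` has more fields than shown here.

```typescript
// Reduced stand-in for ModeConfig: only the fields this sketch needs.
interface ModeConfigSketch {
  description: string;
  finalizeToolName: string;
  createConservativeResult: () => { fallback: boolean };
}

type ModeKeySketch = 'select-tags' | 'generate-test-plan';

const MODES_SKETCH: Record<ModeKeySketch, ModeConfigSketch> = {
  'select-tags': {
    description: 'Select tags (description assumed for this sketch)',
    finalizeToolName: 'finalize_tag_selection',
    createConservativeResult: () => ({ fallback: true }),
  },
  'generate-test-plan': {
    description: 'Generate exploratory test plan for release testing',
    finalizeToolName: 'finalize_test_plan_generation',
    createConservativeResult: () => ({ fallback: true }),
  },
};

// Derives the failure message from the active mode instead of hard-coding one tool name.
function parseFailureMessage(mode: ModeKeySketch): string {
  return `Failed to parse ${MODES_SKETCH[mode].finalizeToolName}`;
}

console.log(parseFailureMessage('generate-test-plan'));
```

With a second mode registered, a message that always named `finalize_tag_selection` would be misleading, which is presumably why the diff switches to `modeConfig.finalizeToolName`.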

tests/tools/e2e-ai-analyzer/config.ts

Lines changed: 2 additions & 1 deletion
@@ -28,8 +28,9 @@ export const LLM_CONFIG = {
   /**
    * Provider priority order for automatic fallback
    * The first available provider in this list will be used
+   * Order: Claude → OpenAI → Gemini (matching Extension team)
    */
-  providerPriority: ['openai', 'anthropic', 'google'] as ProviderType[],
+  providerPriority: ['anthropic', 'openai', 'google'] as ProviderType[],
 
   /**
    * Per-provider configuration
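
The "first available provider wins" rule described in the comment can be sketched as follows. This is an assumption-laden illustration: it treats a provider as available when its API key variable is set, and the pairing of each provider with a secret name in `API_KEY_ENV` is inferred from the secrets listed in the PR description, not confirmed by the diff. `pickProvider` is a hypothetical name.

```typescript
type ProviderType = 'anthropic' | 'openai' | 'google';

// Priority order after this change: Claude, then OpenAI, then Gemini.
const providerPriority: ProviderType[] = ['anthropic', 'openai', 'google'];

// Assumption: which secret belongs to which provider is inferred, not shown in the diff.
const API_KEY_ENV: Record<ProviderType, string> = {
  anthropic: 'E2E_CLAUDE_API_KEY',
  openai: 'E2E_OPENAI_API_KEY',
  google: 'E2E_GEMINI_API_KEY',
};

// Returns the first provider in priority order whose API key is present and non-empty.
function pickProvider(env: Record<string, string | undefined>): ProviderType | undefined {
  return providerPriority.find((provider) => Boolean(env[API_KEY_ENV[provider]]));
}

// With only the OpenAI and Gemini keys set, OpenAI is chosen.
console.log(pickProvider({ E2E_OPENAI_API_KEY: 'x', E2E_GEMINI_API_KEY: 'y' }));
```

Reordering the array is therefore enough to change the default provider; no selection logic needs to be touched.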
