Fix BI metrics misclassifying HTTP 500 errors as successful requests#560
Fix BI metrics misclassifying HTTP 500 errors as successful requests#560SasinduDilshara wants to merge 5 commits intowso2:mainfrom
Conversation
Derive http.status_code_group from http.status_code in the Fluent Bit Lua pipeline when the group field is absent. Ballerina emits http.status_code (integer) but not http.status_code_group (string), causing all BI requests to be classified as 'successful' regardless of their actual status code. Fixes wso2#154 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Issue Analysis — [Issue #154]: Metrics dashboard not capturing 500 responses as errorsClassification
Reproducibility
Root Cause AnalysisThe bug spans two layers, with the root cause in the Fluent Bit Lua script. Layer 1: Fluent Bit Lua script (root cause)File: local status = "successful" -- default: always "successful"
if record["http.status_code_group"] == "4xx" or record["http.status_code_group"] == "5xx" then
status = "failed"
endThe function defaults The field Layer 2: Frontend (downstream effect)File: const isFailed = m.tags.status === 'failed';The frontend correctly uses Why MI is unaffectedMI metrics use a completely different classification path in the Ballerina backend: FixIn -- In extract_bal_metrics_data, replace lines 58-62:
local status = "successful"
local status_code_group = record["http.status_code_group"]
if not status_code_group or status_code_group == "" then
local code = tonumber(record["http.status_code"])
if code then
status_code_group = math.floor(code / 100) .. "xx"
end
end
if status_code_group == "4xx" or status_code_group == "5xx" then
status = "failed"
end
-- Update status_code_group storage to use the derived value
record["status"] = status
record["status_code_group"] = status_code_group or ""Test Coverage Assessment
|
WalkthroughThese changes introduce comprehensive project documentation spanning AI agent instructions, system architecture, development contribution standards, deployment procedures, feature enumeration, and testing guidelines. A Lua observability script is enhanced to normalize missing status code groups from HTTP status codes, with corresponding unit tests added. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 4❌ Failed checks (3 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
specs/Architecture.md (1)
32-32: Clarify the Ballerina initialization sequence.The arrow notation
main.bal → init.bal → config.bal → webserver.balsuggests an explicit call chain, but Ballerina uses implicit module initialization. The actual sequence is:
- Ballerina runtime loads
config.balconfigurables (includingConfig.tomloverrides)- Module-level
init()function ininit.balruns automatically beforemain()main()inmain.balexecuteswebserver.ballisteners are started as part of initializationConsider rephrasing to clarify the implicit initialization behavior.
📝 Suggested clarification
-Init: `main.bal` → `init.bal` (DB, listeners) → `config.bal` (all configurables) → `webserver.bal` (SPA serving, updates `config.json`). +Init: Ballerina loads `config.bal` configurables → module `init()` in `init.bal` (DB, listeners) → `main()` in `main.bal` → `webserver.bal` listeners start (SPA serving, updates `config.json`).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@specs/Architecture.md` at line 32, The current arrow-based line implies an explicit call chain; rewrite it to describe Ballerina's implicit init order: state that module configurables in config.bal (and Config.toml overrides) are loaded first, then the module-level init() function in init.bal runs automatically before main(), then main() in main.bal executes, and webserver.bal listeners are started as part of module initialization—use the symbols config.bal configurables, init() in init.bal, main() in main.bal, and webserver.bal listeners in the revised sentence to make the implicit behavior clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@specs/Contributing.md`:
- Line 24: Update the Contributing.md guidance to explicitly state that
classifySqlError() in error_mapper.bal is declared as an isolated (module-local)
function and is intended for use only inside the storage repository
implementations under modules/storage (or, if cross-module usage is desired,
change the function signature in error_mapper.bal from isolated function
classifySqlError() to public isolated function classifySqlError()); mention that
repository code should use classifySqlError() (or the documented pattern) and
that other modules must implement their own mapping or rely on a public wrapper
if you choose to make it public.
In `@specs/Deployement.md`:
- Line 1: The file name is misspelled as Deployement.md; rename the file to
Deployment.md and update any references or links that point to "Deployement.md"
(e.g., documentation indexes, README links, or import paths) to use
"Deployment.md" so references remain valid.
- Line 56: Update the incorrect port in Deployement.md from 9460 to 9445 so the
URL becomes https://localhost:9445; this aligns Deployement.md with
specs/Architecture.md and specs/Features.md and with the serverPort setting in
icp_server/config.bal (serverPort = 9445).
In `@specs/Testing.md`:
- Line 37: Update the Environment line in Testing.md to accurately reflect that
the primary build (pnpm install and integration tests) runs on Node.js 20 while
Node.js 22.19.0 is only used for the Prettier formatting check; edit the
Environment entry in Testing.md to either state "Node.js 20 (primary build: pnpm
install, integration tests); Node.js 22.19.0 (Prettier formatting check)" or
simply change the single version to "Node.js 20" if you prefer a simpler
statement.
---
Nitpick comments:
In `@specs/Architecture.md`:
- Line 32: The current arrow-based line implies an explicit call chain; rewrite
it to describe Ballerina's implicit init order: state that module configurables
in config.bal (and Config.toml overrides) are loaded first, then the
module-level init() function in init.bal runs automatically before main(), then
main() in main.bal executes, and webserver.bal listeners are started as part of
module initialization—use the symbols config.bal configurables, init() in
init.bal, main() in main.bal, and webserver.bal listeners in the revised
sentence to make the implicit behavior clear.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a2af0b44-827f-4394-9cd7-5e1d4319846d
📒 Files selected for processing (8)
CLAUDE.mdicp_server/resources/observability/opensearch-observability-dashboard/config/fluent-bit/scripts/scripts.luaicp_server/resources/observability/opensearch-observability-dashboard/config/fluent-bit/scripts/test_scripts.luaspecs/Architecture.mdspecs/Contributing.mdspecs/Deployement.mdspecs/Features.mdspecs/Testing.md
| - One service per top-level `.bal` file bound to a port. One `*_repository.bal` per entity in `modules/storage/`. | ||
| - Types in `modules/types/` only. Auth checks via `modules/auth/permission_checker.bal`. | ||
| - All configurables in `config.bal` with defaults. Use `$secret{alias}` for encrypted values; code outside `config.bal` uses `resolved*` vars. | ||
| - Error handling: use `check` keyword, `error` type, `classifySqlError()` from `error_mapper.bal`. Never expose raw SQL errors. |
There was a problem hiding this comment.
Clarify classifySqlError() usage scope.
Line 24 states "use classifySqlError() from error_mapper.bal", but the function is defined as isolated function classifySqlError() (module-local) rather than public isolated function. This means it can only be called within the modules/storage/ module, not from other modules as the general guidance might suggest.
Consider clarifying that classifySqlError() is for use within storage repository implementations, or document the error handling pattern more precisely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@specs/Contributing.md` at line 24, Update the Contributing.md guidance to
explicitly state that classifySqlError() in error_mapper.bal is declared as an
isolated (module-local) function and is intended for use only inside the storage
repository implementations under modules/storage (or, if cross-module usage is
desired, change the function signature in error_mapper.bal from isolated
function classifySqlError() to public isolated function classifySqlError());
mention that repository code should use classifySqlError() (or the documented
pattern) and that other modules must implement their own mapping or rely on a
public wrapper if you choose to make it public.
| @@ -0,0 +1,59 @@ | |||
| # Deployment Guide | |||
There was a problem hiding this comment.
Critical: Filename is misspelled.
The filename Deployement.md should be Deployment.md (missing 'n').
📝 Proposed fix
Rename the file from specs/Deployement.md to specs/Deployment.md.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@specs/Deployement.md` at line 1, The file name is misspelled as
Deployement.md; rename the file to Deployment.md and update any references or
links that point to "Deployement.md" (e.g., documentation indexes, README links,
or import paths) to use "Deployment.md" so references remain valid.
|
|
||
| ## UI | ||
|
|
||
| `https://localhost:9460` admin:admin |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify the correct port for UI access by checking config and service bindings.
echo "=== Checking config for UI/SPA port ==="
rg -n "serverPort|9460|9445" icp_server/config.bal icp_server/Config.toml
echo -e "\n=== Checking webserver.bal for SPA serving ==="
rg -n -A5 -B5 "listener.*9445|listener.*9460" icp_server/webserver.bal
echo -e "\n=== Checking auth_service.bal for port binding ==="
rg -n -A3 "listener.*9445|listener.*9460" icp_server/auth_service.balRepository: wso2/integration-control-plane
Length of output: 640
🏁 Script executed:
#!/bin/bash
# Check the actual documentation files mentioned in the review
echo "=== specs/Deployement.md around line 56 ==="
sed -n '50,60p' specs/Deployement.md 2>/dev/null || echo "File not found or error reading"
echo -e "\n=== specs/Architecture.md around line 24 ==="
sed -n '20,30p' specs/Architecture.md 2>/dev/null || echo "File not found or error reading"
echo -e "\n=== specs/Features.md around line 17 ==="
sed -n '12,22p' specs/Features.md 2>/dev/null || echo "File not found or error reading"
echo -e "\n=== All files in specs/ directory ==="
ls -la specs/ 2>/dev/null || echo "specs directory not found"Repository: wso2/integration-control-plane
Length of output: 3544
🏁 Script executed:
#!/bin/bash
# Check if 9460 is used anywhere in the codebase
echo "=== Searching for 9460 in entire codebase ==="
rg "9460" --type-list | head -20
echo "---"
rg "9460" -i 2>/dev/null || echo "No matches found for 9460"
echo -e "\n=== Searching for 9460 in markdown/documentation ==="
rg "9460" -t md 2>/dev/null || echo "No matches in markdown files"
echo -e "\n=== Searching for references to service ports ==="
rg "localhost:[0-9]{4}" specs/ -o 2>/dev/null | sort | uniqRepository: wso2/integration-control-plane
Length of output: 897
Change port from 9460 to 9445 in Deployement.md
Port 9460 is incorrect. Per specs/Architecture.md and specs/Features.md, the SPA and auth services run on port 9445. This is also confirmed by icp_server/config.bal which sets serverPort = 9445. Update line 56 to reference https://localhost:9445 instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@specs/Deployement.md` at line 56, Update the incorrect port in Deployement.md
from 9460 to 9445 so the URL becomes https://localhost:9445; this aligns
Deployement.md with specs/Architecture.md and specs/Features.md and with the
serverPort setting in icp_server/config.bal (serverPort = 9445).
| 3. H2 database initialization test | ||
| 4. Docker Compose integration tests against MySQL | ||
|
|
||
| Environment: Node.js 22.19.0, JDK 17, Ballerina 2201.13.1. |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check all Node.js version references in the CI workflow.
echo "=== Checking pr-check.yml for Node.js version settings ==="
rg -n "node-version|Node" .github/workflows/pr-check.yml
echo -e "\n=== Checking for Node version in package.json ==="
fd -t f "package.json" -x cat {}Repository: wso2/integration-control-plane
Length of output: 1570
🏁 Script executed:
#!/bin/bash
# Get the full pr-check.yml workflow to understand what each step does
cat -n .github/workflows/pr-check.yml | head -70Repository: wso2/integration-control-plane
Length of output: 2303
🏁 Script executed:
#!/bin/bash
# Read specs/Testing.md to see the context around line 37
cat -n specs/Testing.md | head -50Repository: wso2/integration-control-plane
Length of output: 1514
Update Testing.md line 37 to reflect the correct Node.js version used for the main build.
The documentation states "Environment: Node.js 22.19.0", but the workflow uses Node.js 22.19.0 only for the Prettier formatting check. The main build, pnpm install, and integration tests use Node.js 20. Update line 37 to clarify that the primary build environment uses Node.js 20, or explicitly state the version breakdown if both are intentional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@specs/Testing.md` at line 37, Update the Environment line in Testing.md to
accurately reflect that the primary build (pnpm install and integration tests)
runs on Node.js 20 while Node.js 22.19.0 is only used for the Prettier
formatting check; edit the Environment entry in Testing.md to either state
"Node.js 20 (primary build: pnpm install, integration tests); Node.js 22.19.0
(Prettier formatting check)" or simply change the single version to "Node.js 20"
if you prefer a simpler statement.
Summary
extract_bal_metrics_datainscripts.lua) defaultedstatus = "successful"and only changed it to"failed"whenhttp.status_code_groupwas"4xx"or"5xx". However, the Ballerina observability library emitshttp.status_code(an integer, e.g.500) but does not emithttp.status_code_group(e.g."5xx"). Since the group field was alwaysnilfor BI runtimes, every request — including HTTP 500 errors — was classified as successful, making the error rate always 0% for BI integrations.http.status_code_groupis absent or empty, derive it fromhttp.status_codeusing integer division (math.floor(code / 100) .. "xx"). The derived group is then used for both status classification and stored inrecord["status_code_group"].Metrics.tsxalready correctly usestags.status === 'failed'; the bug was entirely in the Lua pipeline.opensearch_adapter_service.balthat explicitly setstags["status"].Files Changed
icp_server/resources/observability/opensearch-observability-dashboard/config/fluent-bit/scripts/scripts.lua— fix status derivation fallbackicp_server/resources/observability/opensearch-observability-dashboard/config/fluent-bit/scripts/test_scripts.lua— new Lua unit tests covering the fixed and edge-case scenariosFixes #154
Summary by CodeRabbit
Bug Fixes
Documentation
Tests