

@andrewdkim47 (Contributor) commented Aug 25, 2025

Signed-off-by: Andrew Kim [email protected]

🔄 Pull Request

📝 Description

This PR adds optimization and debugging steps for timeout scenarios when querying large Spark applications. The optimization is applied to the compare_job_performance tool, the most complex tool we have.

  1. Some of the default timeouts are too short.
  2. Large Spark applications that process large amounts of data can easily overflow the JVM heap space of the Spark History Server (SHS); mitigation instructions are added for this case.
  3. Some of the tools are not scaled to work with large Spark applications and may time out depending on the size of the application. For large Spark applications that process gigabytes or terabytes of data, the timeout and Java heap size may need to be increased so the tool has enough time and memory to produce an output.

Solution:

  • Added TROUBLESHOOTING.md to instruct users on setting up their environment to mitigate the issue.
  • Added a parallelized API caller with graceful degradation, which queries the SHS with parallel API calls instead of sequential ones to improve performance (see the sketch below).
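
As a reference for how such a parallelized caller with graceful degradation can work, here is a minimal sketch built on `concurrent.futures`, assuming a `parallel_execute(api_calls, max_workers, timeout)` helper shaped like the call shown in the review thread below; the names and error wording are illustrative, not necessarily the implementation merged in this PR.

```python
# A minimal sketch, not the merged implementation: run independent Spark History
# Server REST calls in parallel and return partial results plus per-call errors
# instead of failing the whole tool.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout, as_completed


def parallel_execute(api_calls, max_workers=6, timeout=300):
    """api_calls: mapping of name -> zero-argument callable issuing one SHS API call."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(call): name for name, call in api_calls.items()}
        try:
            for future in as_completed(futures, timeout=timeout):
                name = futures[future]
                try:
                    results[name] = future.result()
                except Exception as exc:  # graceful degradation: keep the other results
                    errors[name] = f"{name} failed: {exc}"
        except FuturesTimeout:
            # Overall deadline hit: cancel what we can and record the unfinished calls.
            pool.shutdown(wait=False, cancel_futures=True)
            for future, name in futures.items():
                if not future.done():
                    errors[name] = f"{name} timed out after {timeout}s"
    return {"results": results, "errors": errors}
```

Each API call either contributes a result or a per-call error entry, so one slow or failing endpoint does not discard the data that the other calls fetched successfully.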

🎯 Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • [x] 📖 Documentation update
  • 🧪 Test improvement
  • 🔧 Refactoring (no functional changes)

🧪 Testing

  • ✅ All existing tests pass (task test)
  • 🔬 Tested with MCP Inspector
  • 📊 Tested with sample Spark data
  • 🚀 Tested with real Spark History Server (if applicable)

🔬 Test Commands Run

can you compare job performance between application_a and application_b?

Based on the performance comparison, here's why application_a took much longer than application_b:

## Key Performance Differences

**Scale Mismatch:**

• App A (slower): 493 total executors, processed 856GB of input data
• App B (faster): 1,765 total executors, processed 20.7TB of input data

**The Problem:** App A was severely under-resourced for its workload. Despite processing similar amounts of data conceptually, it had only 28% of the executors that App B had.

## Resource Allocation Issues

**Executor Efficiency:**

• App A: 65,393 completed tasks across 493 executors = ~133 tasks per executor
• App B: 791,434 completed tasks across 1,765 executors = ~448 tasks per executor

**Duration Impact:**

• App A: Total executor duration of 2.48 billion ms with much less parallelism
• App B: Total executor duration of 23.1 billion ms but distributed across 3.6x more executors

## Root Cause Analysis

The main issue is resource allocation. App 0166 was running with insufficient parallelism:

• Under-provisioned cluster: Only 493 executors vs 1,765 needed
• Poor task distribution: Tasks were queued up waiting for available executors
• Serialized execution: Work that could run in parallel was forced to run sequentially

## Recommendations

To fix this performance issue:

• Increase executor count to match the workload size (aim for 1,500+ executors)
• Monitor cluster resource availability on different days
• Set dynamic allocation parameters to scale based on workload
• Check if cluster capacity was limited on the day App A ran

The data shows App A was essentially running the same job with 1/4 the resources, which explains the significantly longer runtime.

Revision 2

Using the history store hybrid approach to resolve the JVM issue, I was able to get a result from compare_job_performance after 236.157s on Q CLI.

> Based on the performance comparison between these two applications:

## Application Overview
• **App 0166**: awsdw_lakesmith-load_dim_aws_offering_pricing_-20250818999999 (Aug 18)
• **App 0131**: awsdw_lakesmith-load_dim_aws_offering_pricing_-20250817999999 (Aug 17)

## Key Performance Differences

### Resource Usage
• **App 0131** used 3.6x more executors (1,765 vs 493)
• **App 0131** processed 24x more data (20.7TB vs 856GB input)
• **App 0131** had 12x more completed tasks (791K vs 65K)

### Job Performance
• **App 0166** had slightly more jobs (67 vs 73)
• **App 0166** took longer per job on average (175s vs 137s)
• **App 0131** had higher total duration (9,966s vs 11,704s)

### Data Processing
• **App 0131** handled significantly more shuffle operations:
  • Shuffle read: 71.6TB vs 2.8TB
  • Shuffle write: 62.7TB vs 2.2TB

### Reliability
• **App 0131** had more failed tasks (112 vs 3), likely due to its much larger scale

## Summary
App 0131 (Aug 17) was a much larger-scale job processing ~25x more data with proportionally more resources.
App 0166 (Aug 18) was more efficient per job but processed significantly less data. The performance 
difference appears to be primarily due to data volume rather than efficiency issues.

✅ Checklist

  • 🔍 Code follows project style guidelines
  • 🧪 Added tests for new functionality
  • 📖 Updated documentation (README, TESTING.md, etc.)
  • 🔧 Pre-commit hooks pass
  • [ ] 📝 Added entry to CHANGELOG.md (if significant change)

🎉 Thank you for contributing! Your effort helps make Spark monitoring more intelligent.

Comment on lines 49 to 54
```python
elif (
    "Request failed" in error_text and "allexecutors" in error_text
):
    error_msg = f"{name} failed: Spark History Server likely out of memory - try: export SPARK_DAEMON_MEMORY=4g"
elif "Request failed" in error_text:
    error_msg = f"{name} failed: Spark History Server error - may need more memory (export SPARK_DAEMON_MEMORY=4g)"
```
Collaborator

Are these correct? "Request failed" could mean a lot of things, even with the text "allexecutors". Maybe a better approach is to handle it in the else branch and surface the error message? Also, SPARK_DAEMON_MEMORY=4g may not be a useful message for an LLM?

Contributor Author

I think it's too specific and the first case should be sufficient; will remove.

```python
    }
except Exception:
    return {
        "error": "Spark History Server unresponsive - likely out of memory. Set SPARK_DAEMON_MEMORY=4g and restart."
```
Collaborator

It can fail for any reason here, so I would rather not indicate OOM.

Contributor Author

Good point, will make it more generic.
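
For reference, one possible shape of that more generic fallback; the function name here is a hypothetical placeholder, not the merged code.

```python
# Hypothetical sketch only: surface the underlying error rather than guessing at OOM.
def call_shs_api(name, call):
    try:
        return {name: call()}
    except Exception as exc:
        # Pass the original exception text through so the caller (or LLM) sees the real cause.
        return {"error": f"{name} failed: {exc}"}
```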

```python
execution_result = parallel_execute(
    api_calls,
    max_workers=6,
    timeout=300,  # Apply generous timeout for large scale Spark applications
```
Collaborator

5 min enough for most use cases?

Contributor Author

I have tested with two Spark event logs (attached in the description); the largest one had 1,765 executors and processed 20.7TB of input data. The Spark event log itself was around 7.37 GB. Five minutes was more than enough for this use case, but I haven't tested with more data than this.

Contributor Author

I think 5 min is a good starting point


**Note:** SHS_SERVERS_<Replace with server name in config.yaml>_TIMEOUT

### 3: JVM Heap Exhaustion
Collaborator

Contributor Author

Thanks for the suggestion! Let me test this and add it as a troubleshooting step if it works.

…oting steps for timeouts on large applications

Signed-off-by: Andrew Kim <[email protected]>
@andrewdkim47 requested a review from nabuskey August 26, 2025 15:47
@andrewdkim47 marked this pull request as ready for review August 26, 2025 16:49
@nabuskey (Collaborator) left a comment

LGTM

@andrewdkim47 merged commit cdbb5da into main Aug 26, 2025
5 checks passed
@andrewdkim47 deleted the fix-timeouts branch August 26, 2025 17:05