

@andrewdkim47 (Contributor) commented Aug 25, 2025

Signed-off-by: Andrew Kim [email protected]

🔄 Pull Request

📝 Description

This PR adds optimization and debugging steps for timeout scenarios when querying large Spark applications. The optimization is applied to the compare_job_performance tool, the most complex tool we have.

  1. Some of the default timeouts are too short.
  2. Large Spark applications that process large amounts of data can easily overflow the JVM heap space of the Spark History Server (SHS); mitigation instructions are added for this case.
  3. Some of the tools are not scaled to work with large Spark applications and may time out depending on the size of the application. For large Spark applications that process gigabytes or terabytes of data, the timeout and Java heap size may need to be increased so the tool has enough time and memory to produce an output.

Solution:

  • Added TROUBLESHOOTING.md to instruct users on setting up their environment to mitigate the issue.
  • Added a parallelized API caller with graceful degradation, which queries the SHS with parallel API calls instead of sequential ones to improve performance (see the sketch below).
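
As a reference for how such a parallelized caller with graceful degradation can work, here is a minimal sketch built on `concurrent.futures`, assuming a `parallel_execute(api_calls, max_workers, timeout)` helper shaped like the call shown in the review thread below; the names and error wording are illustrative, not necessarily the implementation merged in this PR.

```python
# A minimal sketch, not the merged implementation: run independent Spark History
# Server REST calls in parallel and return partial results plus per-call errors
# instead of failing the whole tool.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout, as_completed


def parallel_execute(api_calls, max_workers=6, timeout=300):
    """api_calls: mapping of name -> zero-argument callable issuing one SHS API call."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(call): name for name, call in api_calls.items()}
        try:
            for future in as_completed(futures, timeout=timeout):
                name = futures[future]
                try:
                    results[name] = future.result()
                except Exception as exc:  # graceful degradation: keep the other results
                    errors[name] = f"{name} failed: {exc}"
        except FuturesTimeout:
            # Overall deadline hit: cancel what we can and record the unfinished calls.
            pool.shutdown(wait=False, cancel_futures=True)
            for future, name in futures.items():
                if not future.done():
                    errors[name] = f"{name} timed out after {timeout}s"
    return {"results": results, "errors": errors}
```

Each API call either contributes a result or a per-call error entry, so one slow or failing endpoint does not discard the data that the other calls fetched successfully.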

🎯 Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • [x] 📖 Documentation update
  • 🧪 Test improvement
  • 🔧 Refactoring (no functional changes)

🧪 Testing

  • ✅ All existing tests pass (task test)
  • 🔬 Tested with MCP Inspector
  • 📊 Tested with sample Spark data
  • 🚀 Tested with real Spark History Server (if applicable)

🔬 Test Commands Run

can you compare job performance between application_a and application_b?

Based on the performance comparison, here's why application_a took much longer than application_b:

## Key Performance Differences

**Scale Mismatch:**

• App A (slower): 493 total executors, processed 856GB of input data
• App B (faster): 1,765 total executors, processed 20.7TB of input data

**The Problem:** App A was severely under-resourced for its workload. Despite processing similar amounts of data conceptually, it had only 28% of the executors that App B had.

## Resource Allocation Issues

**Executor Efficiency:**

• App A: 65,393 completed tasks across 493 executors = ~133 tasks per executor
• App B: 791,434 completed tasks across 1,765 executors = ~448 tasks per executor

**Duration Impact:**

• App A: Total executor duration of 2.48 billion ms with much less parallelism
• App B: Total executor duration of 23.1 billion ms but distributed across 3.6x more executors

## Root Cause Analysis

The main issue is resource allocation. App 0166 was running with insufficient parallelism:

• Under-provisioned cluster: Only 493 executors vs 1,765 needed
• Poor task distribution: Tasks were queued up waiting for available executors
• Serialized execution: Work that could run in parallel was forced to run sequentially

## Recommendations

To fix this performance issue:

• Increase executor count to match the workload size (aim for 1,500+ executors)
• Monitor cluster resource availability on different days
• Set dynamic allocation parameters to scale based on workload
• Check if cluster capacity was limited on the day App A ran

The data shows App A was essentially running the same job with 1/4 the resources, which explains the significantly longer runtime.

Revision 2

Using the history store hybrid approach to resolve the JVM issue, I was able to get a result from compare_job_performance after 236.157s on Q CLI.

> Based on the performance comparison between these two applications:

## Application Overview
• **App 0166**: awsdw_lakesmith-load_dim_aws_offering_pricing_-20250818999999 (Aug 18)
• **App 0131**: awsdw_lakesmith-load_dim_aws_offering_pricing_-20250817999999 (Aug 17)

## Key Performance Differences

### Resource Usage
• **App 0131** used 3.6x more executors (1,765 vs 493)
• **App 0131** processed 24x more data (20.7TB vs 856GB input)
• **App 0131** had 12x more completed tasks (791K vs 65K)

### Job Performance
• **App 0166** had slightly more jobs (67 vs 73)
• **App 0166** took longer per job on average (175s vs 137s)
• **App 0131** had higher total duration (9,966s vs 11,704s)

### Data Processing
• **App 0131** handled significantly more shuffle operations:
  • Shuffle read: 71.6TB vs 2.8TB
  • Shuffle write: 62.7TB vs 2.2TB

### Reliability
• **App 0131** had more failed tasks (112 vs 3), likely due to its much larger scale

## Summary
App 0131 (Aug 17) was a much larger-scale job processing ~25x more data with proportionally more resources.
App 0166 (Aug 18) was more efficient per job but processed significantly less data. The performance 
difference appears to be primarily due to data volume rather than efficiency issues.

✅ Checklist

  • 🔍 Code follows project style guidelines
  • 🧪 Added tests for new functionality
  • 📖 Updated documentation (README, TESTING.md, etc.)
  • 🔧 Pre-commit hooks pass
  • [ ] 📝 Added entry to CHANGELOG.md (if significant change)

🎉 Thank you for contributing! Your effort helps make Spark monitoring more intelligent.

Comment on lines 49 to 54
```python
elif (
    "Request failed" in error_text and "allexecutors" in error_text
):
    error_msg = f"{name} failed: Spark History Server likely out of memory - try: export SPARK_DAEMON_MEMORY=4g"
elif "Request failed" in error_text:
    error_msg = f"{name} failed: Spark History Server error - may need more memory (export SPARK_DAEMON_MEMORY=4g)"
```
Collaborator

Are these correct? "Request failed" could mean a lot of things, even with the text "allexecutors". Maybe a better approach is to handle it in the else branch and surface the error message? Also, SPARK_DAEMON_MEMORY=4g may not be a useful message for an LLM?

Contributor Author

I think it's too specific and the first case should be sufficient; will remove.

```python
    }
except Exception:
    return {
        "error": "Spark History Server unresponsive - likely out of memory. Set SPARK_DAEMON_MEMORY=4g and restart."
```
Collaborator

It can fail for any reason here, so I would rather not indicate OOM.

Contributor Author

Good point, will make it more generic.
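
For reference, one possible shape of that more generic fallback; the function name here is a hypothetical placeholder, not the merged code.

```python
# Hypothetical sketch only: surface the underlying error rather than guessing at OOM.
def call_shs_api(name, call):
    try:
        return {name: call()}
    except Exception as exc:
        # Pass the original exception text through so the caller (or LLM) sees the real cause.
        return {"error": f"{name} failed: {exc}"}
```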

```python
execution_result = parallel_execute(
    api_calls,
    max_workers=6,
    timeout=300,  # Apply generous timeout for large scale Spark applications
```
Collaborator

5 min enough for most use cases?

Contributor Author

I have tested with two Spark event logs (attached in the description); the largest one had 1,765 executors and processed 20.7TB of input data. The Spark event log itself was around 7.37 GB. Five minutes was more than enough for this use case, but I haven't tested with more data than this.

Contributor Author

I think 5 min is a good starting point


**Note:** SHS_SERVERS_<Replace with server name in config.yaml>_TIMEOUT

### 3: JVM Heap Exhaustion
Collaborator

Contributor Author

Thanks for the suggestion! Let me test this and add it as a troubleshooting step if it works.

…oting steps for timeouts on large applications

Signed-off-by: Andrew Kim <[email protected]>
@andrewdkim47 requested a review from nabuskey August 26, 2025 15:47
@andrewdkim47 marked this pull request as ready for review August 26, 2025 16:49
@nabuskey (Collaborator) left a comment

LGTM

@andrewdkim47 merged commit cdbb5da into main Aug 26, 2025
5 checks passed
@andrewdkim47 deleted the fix-timeouts branch August 26, 2025 17:05