Skip to content

Sift tools #74

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Apr 23, 2025
Merged

Sift tools #74

merged 25 commits into from
Apr 23, 2025

Conversation

ioanarm
Copy link
Contributor

@ioanarm ioanarm commented Apr 3, 2025

Adding Sift tools for

  • Create Investigation
  • List investigations with a limit parameter
  • Get investigation
  • Get analyses
  • Run ErrorPatternLogs Sift Check
  • Run SlowRequests Sift Check

Also there is cloud test added checking the get tools for sift.

Implementation Approach:

I initially created separate tools for basic Sift API operations (create/get investigations) to reduce code duplication. However, for specific checks like ErrorPatternLogs and SlowRequests, I created dedicated tools that handle the entire workflow internally.

Current Workflow Example:

When running ErrorPatternLogs in Sift, the process involves:

  1. Creating an investigation with specific tags to limit checks (eg: ErrorPatternLogs)
  2. Getting investigation details when complete
  3. Retrieving the ErrorPatternLog analysis using the analysis ID from investigation data

This can be done through:

  • Three separate tool calls when using the basic API tools
  • A single tool call with our dedicated run_error_pattern_logs tool (which makes three API calls internally)

Questions for Discussion:

Should we provide:

  1. Basic API wrappers and let the model chain them together?
  2. Higher-level tools that handle complete workflows?
  3. A combination of both approaches? This maybe gives a bit more freedom to the LLM to provide more info if needed.

I'm leaning toward removing the createInvestigation tool and keeping just the specialized Sift workflow tools, but I'm open to suggestions.

Note: I've also refactored cloud tests into a utils file for better organization.

@ioanarm ioanarm marked this pull request as ready for review April 4, 2025 12:53
@ioanarm ioanarm requested a review from a team as a code owner April 4, 2025 12:53
@annanay25
Copy link
Contributor

I agree with removing CreateInvestigation, in-fact should we restructure this to have a tool to Run ErrorPatternLogs Sift Check and wait for its results? If we don't do this we would have to add additional logic in the frontend to poll for results. Similarly for SlowRequests. We can just invoke and wait for results, users might be ok waiting for up to 20-30s.

@ioanarm ioanarm force-pushed the sift-tools branch 2 times, most recently from 264870e to c5235db Compare April 11, 2025 08:01
@ioanarm
Copy link
Contributor Author

ioanarm commented Apr 11, 2025

I agree with removing CreateInvestigation, in-fact should we restructure this to have a tool to Run ErrorPatternLogs Sift Check and wait for its results? If we don't do this we would have to add additional logic in the frontend to poll for results. Similarly for SlowRequests. We can just invoke and wait for results, users might be ok waiting for up to 20-30s.

Cool, I removed the tool. On the other comment see here

README.md Outdated
@@ -72,6 +79,12 @@ This is useful if you don't use certain functionality or if you don't want to ta
| `get_current_oncall_users` | OnCall | Get users currently on-call for a specific schedule |
| `list_oncall_teams` | OnCall | List teams from Grafana OnCall |
| `list_oncall_users` | OnCall | List users from Grafana OnCall |
| `create_investigation` | Sift | Create a new Sift investigation to analyze data from different datasources |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: need a docs update.

README.md Outdated
Comment on lines 83 to 86
| `get_investigation` | Sift | Retrieve an existing Sift investigation by its UUID |
| `get_analysis` | Sift | Retrieve a specific analysis from a Sift investigation |
| `list_investigations` | Sift | Retrieve a list of Sift investigations with an optional limit |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For get_investigation, get_analysis and list_investigations - I'm not fully convinced we need to expose that level of detail to the LLM, I'd like to see examples of where we would need this exposed to the LLM and whether we can build some more abstract tools that can help.

For ex: right now, if I wanted to find an investigation with a given labelset, say {cluster='prod-us-central-0', namespace='machine-learning'}, the LLM would need to fire 3 queries: list_investigations -> get_investigation -> get_analysis (x3) to retrieve all the data. This will require several tokens to be used in LLM calls when we send the entire list of investigations to it. Instead, can we build a higher abstraction that can take in a labelset and directly retrieve analysis for it without burning through as many tokens?

Curious if there's a use-case I'm missing or any other actions that can be enabled by the current API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's true. Adding the tools there is mainly saving time for the user in case there is anything already running in Sift.
There is an alert in namespace x --> Ask the assistant if there are any investigations for this namespace -->
List investigations tool used --> Listing all recent investigations.

If the user wants to grab data then they can request using the ID provided.

image

That's the main motivation behind exposing the tools. I do agree though that we need to have maybe more targeted tools. Let's discuss in the MCP call later.

Copy link
Collaborator

@sd2k sd2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments inline. I'm not against keeping the get/list Sift tools in but I think the other high-level tools should avoid mentioning Sift.

tools/sift.go Outdated
Comment on lines 353 to 298
// RunErrorPatternLogs is a tool for running an ErrorPatternLogs check
var RunErrorPatternLogs = mcpgrafana.MustTool(
"run_error_pattern_logs",
"Creates a Sift investigation with ErrorPatternLogs check, waits for it to complete, and returns the analysis results. This tool triggers an investigation with the ErrorPatternLogs check in the relevant Loki datasource. It investigates if the there are elevated errors rates in the logs compared to the last day's average and returns the error pattern found, if any.",
runErrorPatternLogs,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling is that these tools should be a bit less Sift-y in their names and descriptions. For example, this one could be named 'find_error_log_patterns' and the description would be closer to just the last sentence of the existing description, since an LLM is unlikely to know what Sift is, or what a ErrorPatternLogs check means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's true. I will refactor it a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the "sift" mentions from error pattern logs and slow requests check.
Sift mentions in general I think are good especially if we go further and add another layer to MCP explaining what Sift is and how to use it (a prompt maybe), same with OnCall etc. I haven't removed it from the rest of the tools as they are very specific, list investigations, get investigations etc. Open to discuss this though, maybe we can completely remove Sift mentions for now 🙄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't removed it from the rest of the tools as they are very specific, list investigations, get investigations etc. Open to discuss this though, maybe we can completely remove Sift mentions for now 🙄

Yeah I think that's fine 🙂 I would even consider changing those ones to 'List Sift investigations' so it matches the docs & UI and is clearly distinct from any informal kind of 'investigation'!

tools/sift.go Outdated
Comment on lines 17 to 92
// AnalysisStep represents a single step in the analysis process.
type AnalysisStep struct {
CreatedAt time.Time `json:"created" validate:"isdefault"`
// State that the Analysis is entering.
State string `json:"state"`
// The exit message of the step. Can be empty if the step was successful.
ExitMessage string `json:"exitMessage"`
// Runtime statistics for this step
Stats map[string]interface{} `json:"stats,omitempty"`
}

type AnalysisEvent struct {
StartTime time.Time `json:"startTime"`
EndTime time.Time `json:"endTime"`
Name string `json:"name"`
Description string `json:"description,omitempty"`
Details map[string]interface{} `json:"details"`
}

// Interesting: The analysis complete with results that indicate a probable cause for failure.
type AnalysisResult struct {
Successful bool `json:"successful"`
Interesting bool `json:"interesting"`
Message string `json:"message"`
MarkdownSummary string `json:"-" gorm:"-"`
Details map[string]interface{} `json:"details"`
Events []AnalysisEvent `json:"events,omitempty" gorm:"serializer:json"`
}

// An Analysis struct provides the status and results
// of running a specific type of check.
type Analysis struct {
ID uuid.UUID `json:"id" gorm:"primarykey;type:char(36)" validate:"isdefault"`
CreatedAt time.Time `json:"created" validate:"isdefault"`
UpdatedAt time.Time `json:"modified" validate:"isdefault"`

Status AnalysisStatus `json:"status" gorm:"default:pending;index:idx_analyses_stats,priority:100"`
StartedAt *time.Time `json:"started" validate:"isdefault"`

// Foreign key to the Investigation that created this Analysis.
InvestigationID uuid.UUID `json:"investigationId" gorm:"index:idx_analyses_stats,priority:10"`

// Name is the name of the check that this analysis represents.
Name string `json:"name"`
Title string `json:"title"`
Steps []AnalysisStep `json:"steps" gorm:"foreignKey:AnalysisID;constraint:OnDelete:CASCADE"`
Result AnalysisResult `json:"result" gorm:"embedded;embeddedPrefix:result_"`
}

type DatasourceConfig struct {
LokiDatasource DatasourceInfo `json:"lokiDatasource" gorm:"not null;embedded;embeddedPrefix:loki_"`
PrometheusDatasource DatasourceInfo `json:"prometheusDatasource" gorm:"not null;embedded;embeddedPrefix:prometheus_"`
TempoDatasource DatasourceInfo `json:"tempoDatasource" gorm:"not null;embedded;embeddedPrefix:tempo_"`
PyroscopeDatasource DatasourceInfo `json:"pyroscopeDatasource" gorm:"not null;embedded;embeddedPrefix:pyroscope_"`
}

type DatasourceInfo struct {
Uid string `json:"uid"`
}

// AnalysisMeta represents metadata about the analyses
type AnalysisMeta struct {
CountsByStage map[string]interface{} `json:"countsByStage"`
Items []Analysis `json:"items"`
}

type Investigation struct {
ID uuid.UUID `json:"id" gorm:"primarykey;type:char(36)" validate:"isdefault"`
CreatedAt time.Time `json:"created" gorm:"index" validate:"isdefault"`
UpdatedAt time.Time `json:"modified" validate:"isdefault"`

TenantID string `json:"tenantId" gorm:"index;not null;size:256"`

Datasources DatasourceConfig `json:"datasources" gorm:"embedded;embeddedPrefix:datasources_"`

Name string `json:"name"`
RequestData InvestigationRequest `json:"requestData" gorm:"not null;embedded;embeddedPrefix:request_"`

// TODO: Add this when we want to extract discovered inputs for later usage
// Inputs Inputs `json:"inputs" gorm:"serializer:json"`

// GrafanaURL is the Grafana URL to be used for datasource queries
// for this investigation.
GrafanaURL string `json:"grafanaUrl"`

// Status describes the state of the investigation (pending, running, failed, or finished).
Status InvestigationStatus `json:"status"`

// FailureReason is a short human-friendly string that explains the reason that the
// investigation failed.
FailureReason string `json:"failureReason,omitempty"`

// Analyses contains metadata about the investigation's analyses
Analyses AnalysisMeta `json:"analyses"`
}

type RequestData struct {
Labels map[string]string `json:"labels"`
Checks []string `json:"checks"`
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of these types, annotations etc are redundant here (e.g. the gorm tags aren't needed and some of the types are unused). Would be good to keep as many internals out of here as possible.

Copy link
Collaborator

@sd2k sd2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on naming / nits on exporting inline.

I'm trying to avoid exporting anything other than the actual tool definitions & handlers and the AddXTools functions where possible!

tools/sift.go Outdated
// FindErrorPatternLogs is a tool for running an ErrorPatternLogs check
var FindErrorPatternLogs = mcpgrafana.MustTool(
"find_error_pattern_logs",
"Creates an investigation to search for error patterns in logs, waits for it to complete, and returns the analysis results. This tool triggers an investigation in the relevant Loki datasource to determine if there are elevated error rates compared to the last day's average, and returns the error pattern found, if any.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we get rid of mentions of an 'investigation' here too, so it looks like the tool just opaquely finds error patterns in logs? (the implementation isn't super important imo)

Same with slow requests

@ioanarm ioanarm force-pushed the sift-tools branch 5 times, most recently from cc612e2 to db86474 Compare April 22, 2025 10:12
@ioanarm ioanarm requested a review from sd2k April 22, 2025 12:23
tools/sift.go Outdated

// FindErrorPatternLogsParams defines the parameters for running an ErrorPatternLogs check
type FindErrorPatternLogsParams struct {
Name string `json:"name" jsonschema:"required,description=The name of the investigation"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Name string `json:"name" jsonschema:"required,description=The name of the investigation"`
Name string `json:"name" jsonschema:"required,description=The name of the investigation"`

I wonder if we can omit this and generate something, or if it's better to just let the model provide some random name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does an ok job generating a name (see below). Maybe generating something to show the origin is better. Though we can add an investigation type for this. 🤔
image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it'd definitely be nice to include some info on origin/user here eventually. This is good for now though!

ioanarm and others added 3 commits April 23, 2025 13:33
Copy link
Collaborator

@sd2k sd2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice one

@ioanarm ioanarm merged commit 618d957 into main Apr 23, 2025
5 checks passed
@ioanarm ioanarm deleted the sift-tools branch April 23, 2025 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants