Sift tools #74
Conversation
I agree with removing CreateInvestigation, in fact should we restructure this to have a tool to
Force-pushed from 264870e to c5235db
Cool, I removed the tool. On the other comment, see here.
README.md
Outdated
@@ -72,6 +79,12 @@ This is useful if you don't use certain functionality or if you don't want to ta
| `get_current_oncall_users` | OnCall | Get users currently on-call for a specific schedule |
| `list_oncall_teams` | OnCall | List teams from Grafana OnCall |
| `list_oncall_users` | OnCall | List users from Grafana OnCall |
| `create_investigation` | Sift | Create a new Sift investigation to analyze data from different datasources |
nit: need a docs update.
README.md
Outdated
| `get_investigation` | Sift | Retrieve an existing Sift investigation by its UUID |
| `get_analysis` | Sift | Retrieve a specific analysis from a Sift investigation |
| `list_investigations` | Sift | Retrieve a list of Sift investigations with an optional limit |
For `get_investigation`, `get_analysis` and `list_investigations` - I'm not fully convinced we need to expose that level of detail to the LLM. I'd like to see examples of where we would need this exposed to the LLM, and whether we can build some more abstract tools that can help.

For example: right now, if I wanted to find an investigation with a given labelset, say `{cluster='prod-us-central-0', namespace='machine-learning'}`, the LLM would need to fire three queries - `list_investigations` -> `get_investigation` -> `get_analysis` (x3) - to retrieve all the data. This will require a lot of tokens in LLM calls when we send the entire list of investigations to it. Instead, can we build a higher abstraction that can take in a labelset and directly retrieve the analyses for it without burning through as many tokens?

Curious if there's a use-case I'm missing or any other actions that can be enabled by the current API.
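To make the suggestion concrete, a tool along these lines could collapse the three calls into one. This is only a sketch: the tool name, the `siftClientFromContext` helper, and the client methods (`listInvestigations`, `getAnalyses`) are hypothetical, and it assumes the investigation's request data carries the labelset.

```go
// Hypothetical higher-level tool: take a labelset, filter investigations
// server-side, and return only their analyses, so the LLM makes one call
// instead of list -> get -> get_analysis.
type FindAnalysesByLabelsParams struct {
	Labels map[string]string `json:"labels" jsonschema:"required,description=Label set that investigations must match"`
}

func findAnalysesByLabels(ctx context.Context, args FindAnalysesByLabelsParams) ([]Analysis, error) {
	client := siftClientFromContext(ctx) // hypothetical helper
	investigations, err := client.listInvestigations(ctx)
	if err != nil {
		return nil, fmt.Errorf("listing investigations: %w", err)
	}

	var analyses []Analysis
	for _, inv := range investigations {
		if !labelsMatch(inv.RequestData.Labels, args.Labels) {
			continue
		}
		a, err := client.getAnalyses(ctx, inv.ID) // hypothetical method
		if err != nil {
			return nil, fmt.Errorf("getting analyses for %s: %w", inv.ID, err)
		}
		analyses = append(analyses, a...)
	}
	return analyses, nil
}

// labelsMatch reports whether every wanted label is present with the same value.
func labelsMatch(got, want map[string]string) bool {
	for k, v := range want {
		if got[k] != v {
			return false
		}
	}
	return true
}

var FindAnalysesByLabels = mcpgrafana.MustTool(
	"find_analyses_by_labels",
	"Retrieve analyses from investigations whose labels match the given labelset.",
	findAnalysesByLabels,
)
```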
Yeah, that's true. Adding the tools there mainly saves time for the user in case there is anything already running in Sift:

There is an alert in namespace x --> ask the assistant if there are any investigations for this namespace --> the `list_investigations` tool is used --> all recent investigations are listed.

If the user wants to grab data, they can then request it using the ID provided.

That's the main motivation behind exposing the tools. I do agree, though, that we may need more targeted tools. Let's discuss in the MCP call later.
Left a couple of comments inline. I'm not against keeping the get/list Sift tools in but I think the other high-level tools should avoid mentioning Sift.
tools/sift.go
Outdated
// RunErrorPatternLogs is a tool for running an ErrorPatternLogs check
var RunErrorPatternLogs = mcpgrafana.MustTool(
	"run_error_pattern_logs",
	"Creates a Sift investigation with an ErrorPatternLogs check, waits for it to complete, and returns the analysis results. This tool triggers an investigation with the ErrorPatternLogs check in the relevant Loki datasource. It investigates whether there are elevated error rates in the logs compared to the last day's average and returns the error pattern found, if any.",
	runErrorPatternLogs,
)
My gut feeling is that these tools should be a bit less Sift-y in their names and descriptions. For example, this one could be named 'find_error_log_patterns', and the description would be closer to just the last sentence of the existing description, since an LLM is unlikely to know what Sift is or what an `ErrorPatternLogs` check means.
Yeah that's true. I will refactor it a bit.
Removed the "sift" mentions from error pattern logs and slow requests check.
Sift mentions in general I think are good, especially if we go further and add another layer to MCP explaining what Sift is and how to use it (a prompt, maybe); same with OnCall etc. I haven't removed them from the rest of the tools as they are very specific (list investigations, get investigations, etc.). Open to discussing this though; maybe we can completely remove Sift mentions for now 🙄
> I haven't removed them from the rest of the tools as they are very specific (list investigations, get investigations, etc.). Open to discussing this though; maybe we can completely remove Sift mentions for now 🙄
Yeah I think that's fine 🙂 I would even consider changing those ones to 'List Sift investigations' so it matches the docs & UI and is clearly distinct from any informal kind of 'investigation'!
tools/sift.go
Outdated
// AnalysisStep represents a single step in the analysis process.
type AnalysisStep struct {
	CreatedAt time.Time `json:"created" validate:"isdefault"`
	// State that the Analysis is entering.
	State string `json:"state"`
	// The exit message of the step. Can be empty if the step was successful.
	ExitMessage string `json:"exitMessage"`
	// Runtime statistics for this step
	Stats map[string]interface{} `json:"stats,omitempty"`
}

type AnalysisEvent struct {
	StartTime   time.Time              `json:"startTime"`
	EndTime     time.Time              `json:"endTime"`
	Name        string                 `json:"name"`
	Description string                 `json:"description,omitempty"`
	Details     map[string]interface{} `json:"details"`
}

// Interesting: The analysis complete with results that indicate a probable cause for failure.
type AnalysisResult struct {
	Successful      bool                   `json:"successful"`
	Interesting     bool                   `json:"interesting"`
	Message         string                 `json:"message"`
	MarkdownSummary string                 `json:"-" gorm:"-"`
	Details         map[string]interface{} `json:"details"`
	Events          []AnalysisEvent        `json:"events,omitempty" gorm:"serializer:json"`
}

// An Analysis struct provides the status and results
// of running a specific type of check.
type Analysis struct {
	ID        uuid.UUID `json:"id" gorm:"primarykey;type:char(36)" validate:"isdefault"`
	CreatedAt time.Time `json:"created" validate:"isdefault"`
	UpdatedAt time.Time `json:"modified" validate:"isdefault"`

	Status    AnalysisStatus `json:"status" gorm:"default:pending;index:idx_analyses_stats,priority:100"`
	StartedAt *time.Time     `json:"started" validate:"isdefault"`

	// Foreign key to the Investigation that created this Analysis.
	InvestigationID uuid.UUID `json:"investigationId" gorm:"index:idx_analyses_stats,priority:10"`

	// Name is the name of the check that this analysis represents.
	Name   string         `json:"name"`
	Title  string         `json:"title"`
	Steps  []AnalysisStep `json:"steps" gorm:"foreignKey:AnalysisID;constraint:OnDelete:CASCADE"`
	Result AnalysisResult `json:"result" gorm:"embedded;embeddedPrefix:result_"`
}

type DatasourceConfig struct {
	LokiDatasource       DatasourceInfo `json:"lokiDatasource" gorm:"not null;embedded;embeddedPrefix:loki_"`
	PrometheusDatasource DatasourceInfo `json:"prometheusDatasource" gorm:"not null;embedded;embeddedPrefix:prometheus_"`
	TempoDatasource      DatasourceInfo `json:"tempoDatasource" gorm:"not null;embedded;embeddedPrefix:tempo_"`
	PyroscopeDatasource  DatasourceInfo `json:"pyroscopeDatasource" gorm:"not null;embedded;embeddedPrefix:pyroscope_"`
}

type DatasourceInfo struct {
	Uid string `json:"uid"`
}

// AnalysisMeta represents metadata about the analyses
type AnalysisMeta struct {
	CountsByStage map[string]interface{} `json:"countsByStage"`
	Items         []Analysis             `json:"items"`
}

type Investigation struct {
	ID        uuid.UUID `json:"id" gorm:"primarykey;type:char(36)" validate:"isdefault"`
	CreatedAt time.Time `json:"created" gorm:"index" validate:"isdefault"`
	UpdatedAt time.Time `json:"modified" validate:"isdefault"`

	TenantID string `json:"tenantId" gorm:"index;not null;size:256"`

	Datasources DatasourceConfig `json:"datasources" gorm:"embedded;embeddedPrefix:datasources_"`

	Name        string               `json:"name"`
	RequestData InvestigationRequest `json:"requestData" gorm:"not null;embedded;embeddedPrefix:request_"`

	// TODO: Add this when we want to extract discovered inputs for later usage
	// Inputs Inputs `json:"inputs" gorm:"serializer:json"`

	// GrafanaURL is the Grafana URL to be used for datasource queries
	// for this investigation.
	GrafanaURL string `json:"grafanaUrl"`

	// Status describes the state of the investigation (pending, running, failed, or finished).
	Status InvestigationStatus `json:"status"`

	// FailureReason is a short human-friendly string that explains the reason that the
	// investigation failed.
	FailureReason string `json:"failureReason,omitempty"`

	// Analyses contains metadata about the investigation's analyses
	Analyses AnalysisMeta `json:"analyses"`
}

type RequestData struct {
	Labels map[string]string `json:"labels"`
	Checks []string          `json:"checks"`
}
I think some of these types, annotations etc. are redundant here (e.g. the `gorm` tags aren't needed and some of the types are unused). Would be good to keep as many internals out of here as possible.
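For illustration, a trimmed version might keep only the JSON tags the MCP server needs to decode Sift API responses. This is a sketch of the idea, not the PR's final shape.

```go
// Slimmed-down response types: gorm/validate tags dropped, keeping only
// what the tools need to surface results to the client.
type AnalysisResult struct {
	Successful  bool                   `json:"successful"`
	Interesting bool                   `json:"interesting"`
	Message     string                 `json:"message"`
	Details     map[string]interface{} `json:"details"`
	Events      []AnalysisEvent        `json:"events,omitempty"`
}

type Analysis struct {
	ID              uuid.UUID      `json:"id"`
	CreatedAt       time.Time      `json:"created"`
	UpdatedAt       time.Time      `json:"modified"`
	Status          AnalysisStatus `json:"status"`
	InvestigationID uuid.UUID      `json:"investigationId"`
	Name            string         `json:"name"`
	Title           string         `json:"title"`
	Steps           []AnalysisStep `json:"steps"`
	Result          AnalysisResult `json:"result"`
}
```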
A few comments on naming / nits on exporting inline.
I'm trying to avoid exporting anything other than the actual tool definitions & handlers and the `AddXTools` functions where possible!
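For reference, the registration pattern being described looks roughly like the sketch below; the exported tool names and the `Register`/`*server.MCPServer` signature are assumptions based on the other tool files, not this PR's exact code.

```go
// AddSiftTools registers the Sift tools with the MCP server. Everything
// else in tools/sift.go can stay unexported.
func AddSiftTools(mcp *server.MCPServer) {
	FindErrorPatternLogs.Register(mcp)
	FindSlowRequests.Register(mcp)
	GetInvestigation.Register(mcp)
	GetAnalysis.Register(mcp)
	ListInvestigations.Register(mcp)
}
```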
tools/sift.go
Outdated
// FindErrorPatternLogs is a tool for running an ErrorPatternLogs check
var FindErrorPatternLogs = mcpgrafana.MustTool(
	"find_error_pattern_logs",
	"Creates an investigation to search for error patterns in logs, waits for it to complete, and returns the analysis results. This tool triggers an investigation in the relevant Loki datasource to determine if there are elevated error rates compared to the last day's average, and returns the error pattern found, if any.",
Shall we get rid of mentions of an 'investigation' here too, so it looks like the tool just opaquely finds error patterns in logs? (the implementation isn't super important imo)
Same with slow requests
Force-pushed from cc612e2 to db86474
tools/sift.go
Outdated
// FindErrorPatternLogsParams defines the parameters for running an ErrorPatternLogs check
type FindErrorPatternLogsParams struct {
	Name string `json:"name" jsonschema:"required,description=The name of the investigation"`
Name string `json:"name" jsonschema:"required,description=The name of the investigation"`
I wonder if we can omit this and generate something, or if it's better to just let the model provide some random name?
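One possible direction (just a sketch, not what the PR settles on): make the field optional and generate a fallback name in the handler.

```go
// Sketch: optional name with a generated fallback. The schema description
// is illustrative only.
type FindErrorPatternLogsParams struct {
	Name string `json:"name,omitempty" jsonschema:"description=Optional name for the investigation; generated if omitted"`
}

// investigationName returns the provided name, or a timestamped default
// that is still easy to spot in the Sift UI.
func investigationName(provided string) string {
	if provided != "" {
		return provided
	}
	return fmt.Sprintf("ErrorPatternLogs %s", time.Now().UTC().Format(time.RFC3339))
}
```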
Yeah it'd definitely be nice to include some info on origin/user here eventually. This is good for now though!
Co-authored-by: Ben Sully <[email protected]>
LGTM! Nice one
Adding Sift tools for
There is also a cloud test added that checks the get tools for Sift.
Implementation Approach:
I initially created separate tools for basic Sift API operations (create/get investigations) to reduce code duplication. However, for specific checks like ErrorPatternLogs and SlowRequests, I created dedicated tools that handle the entire workflow internally.
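Roughly, "handles the entire workflow internally" means something like the sketch below. The `siftClient` methods named here (`createInvestigation`, `getInvestigation`, `getAnalyses`) are assumed stand-ins for the actual client code.

```go
// Sketch of the dedicated-tool workflow: create the investigation, poll it,
// then return only the analyses. Client method names are assumptions.
func runErrorPatternLogsWorkflow(ctx context.Context, client *siftClient, params FindErrorPatternLogsParams) ([]Analysis, error) {
	// 1. Create an investigation that runs only the ErrorPatternLogs check.
	inv, err := client.createInvestigation(ctx, params.Name, []string{"ErrorPatternLogs"})
	if err != nil {
		return nil, fmt.Errorf("creating investigation: %w", err)
	}

	// 2. Poll until the investigation finishes or fails.
	for inv.Status != "finished" && inv.Status != "failed" {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(5 * time.Second):
		}
		if inv, err = client.getInvestigation(ctx, inv.ID); err != nil {
			return nil, fmt.Errorf("polling investigation: %w", err)
		}
	}

	// 3. Return the analyses so the tool response carries only the results.
	return client.getAnalyses(ctx, inv.ID)
}
```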
Current Workflow Example:
When running ErrorPatternLogs in Sift, the process involves:
This can be done through:
the `run_error_pattern_logs` tool (which makes three API calls internally)

Questions for Discussion:
Should we provide:
I'm leaning toward removing the createInvestigation tool and keeping just the specialized Sift workflow tools, but I'm open to suggestions.
Note: I've also refactored cloud tests into a utils file for better organization.