Skip to content

Refactor Microsoft Auth endpoints and enhance token handling#32

Open
alfarahn wants to merge 12 commits into
mainfrom
feature/crawl-timing-hooks
Open

Refactor Microsoft Auth endpoints and enhance token handling#32
alfarahn wants to merge 12 commits into
mainfrom
feature/crawl-timing-hooks

Conversation

@alfarahn
Copy link
Copy Markdown
Collaborator

  • Updated MicrosoftAuthEndpoints to use a named HTTP client with a 30s timeout for token requests.
  • Changed SaveIdentity and UpdateGraphTier methods in PersistentIdentity to be asynchronous.
  • Enhanced SessionTokenStore to include retry logic for transient network errors when fetching tokens.
  • Introduced a new Ipv4HttpHandler to enforce IPv4-only connections and a 5s connect timeout for all HTTP clients.
  • Improved error handling and logging for token refresh operations.
  • Added ephemeral "cooling down" rows in the ChatView component to visually indicate 429/5xx throttling events.
  • Implemented timing hooks to capture and log performance metrics for tool calls and server interactions.
  • Updated dependencies in the project file for GitHub Copilot SDK.

Summary

Type of change

  • Bug fix
  • New feature / new tool
  • Refactor
  • Documentation
  • Build / CI
  • Security

Checklist

  • Builds locally (dotnet build)
  • Frontend builds (npm run build in src/Dashboard/frontend)
  • No new secrets committed
  • No DELETE methods enabled in HTTP helpers
  • Updated .github/copilot-instructions.md if behaviour or structure changed
  • Updated CHANGELOG.md under [Unreleased]

Related issues

Closes #

- Updated MicrosoftAuthEndpoints to use a named HTTP client with a 30s timeout for token requests.
- Changed SaveIdentity and UpdateGraphTier methods in PersistentIdentity to be asynchronous.
- Enhanced SessionTokenStore to include retry logic for transient network errors when fetching tokens.
- Introduced a new Ipv4HttpHandler to enforce IPv4-only connections and a 5s connect timeout for all HTTP clients.
- Improved error handling and logging for token refresh operations.
- Added ephemeral "cooling down" rows in the ChatView component to visually indicate 429/5xx throttling events.
- Implemented timing hooks to capture and log performance metrics for tool calls and server interactions.
- Updated dependencies in the project file for GitHub Copilot SDK.
Comment thread src/Dashboard/frontend/src/components/ChatView.vue Fixed
Comment thread src/Dashboard/frontend/src/components/ChatView.vue Fixed
var done = new TaskCompletionSource();
var cancelled = false;
var toolTracker = new ConcurrentDictionary<string, (string Name, DateTimeOffset StartTime, Activity? Activity)>();
var firstEventLogged = 0;
Comment thread src/Dashboard/Program.cs Outdated
id = record.UserId,
login = $"user-{record.UserId & 0xFFFF:X4}",
name = record.Name,
avatar = (string?)null,
// Pass null activity — Activity is not safe for concurrent SetTag
// writers, and these drilldowns run in parallel. Each call still
// gets its own ActivitySource span inside HttpHelper.
try { return (c, Breakdown: (object)await GetBreakdownForDay(token, subscriptionId, c.Date, groupBy, activity: null)); }
Comment thread src/Dashboard/AI/Tools/AnomalyTools.cs Outdated
// writers, and these drilldowns run in parallel. Each call still
// gets its own ActivitySource span inside HttpHelper.
try { return (c, Breakdown: (object)await GetBreakdownForDay(token, subscriptionId, c.Date, groupBy, activity: null)); }
catch (Exception ex) { return (c, Breakdown: (object)new { error = ex.Message }); }
Comment thread src/Dashboard/AI/Tools/IdleResourceTools.cs Fixed
Comment thread src/Dashboard/AI/ChatEndpoints.cs Outdated
await ctx.Response.WriteAsync($"data: {donePayload}\n\n");
await ctx.Response.Body.FlushAsync();
}
catch { }
Comment thread src/Dashboard/AI/ChatEndpoints.cs Outdated
Comment on lines +422 to +426
catch (Exception bgEx)
{
logger.LogWarning(bgEx, "Background title generation failed for session {Sid}", bgSessionId);
return null;
}
Comment thread src/Dashboard/AI/ChatEndpoints.cs Outdated
logger.LogInformation("timing phase=sdk.first_event ms={Ms:F0} user={User}", firstMs, userLogin);
await SafeEmit(JsonSerializer.Serialize(new { type = "timing", phase = "sdk.first_event", ms = Math.Round(firstMs, 1), extra = new { evt = evt.GetType().Name } }));
}
catch { }
Comment thread src/Dashboard/AI/ChatEndpoints.cs Outdated
logger.LogInformation("timing phase=sdk.first_event ms={Ms:F0} user={User}", firstMs, userLogin);
await SafeEmit(JsonSerializer.Serialize(new { type = "timing", phase = "sdk.first_event", ms = Math.Round(firstMs, 1), extra = new { evt = evt.GetType().Name } }));
}
catch { }
Comment thread src/Dashboard/AI/ChatEndpoints.cs Fixed
alfarahn and others added 10 commits May 19, 2026 14:56
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
… on local IDisposable'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@alfarahn
Copy link
Copy Markdown
Collaborator Author

@copilot how do you feel about this PR to main does it improve latency?

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI commented May 21, 2026

@copilot how do you feel about this PR to main does it improve latency?

Honest read: this PR should help tail latency more than median.

Wins:

  • Ipv4HttpHandler with a 5s connect timeout is the biggest one — on App Service Linux, outbound IPv6 to ARM/Graph/AOAI occasionally stalls for the full OS connect timeout (~21s+). Forcing IPv4 + capping connect at 5s removes that long tail.
  • Named HttpClient for the token endpoint avoids HttpClient re-creation and socket exhaustion under bursty auth refreshes.
  • Async SaveIdentity / UpdateGraphTier removes sync-over-async on the request path.
  • Proactive token refresh in SessionTokenStore (with retry) prevents the 401 → reauth round-trip that previously cost a full prompt.

Neutral / costs:

  • Retry on 429/5xx adds latency on failure paths (by design) — the "cooling down" UI row makes that visible instead of looking like a hang, which is the right tradeoff.
  • Timing hooks are pure observability; no runtime cost worth worrying about, but no speedup either.

Net: median p50 ~unchanged, p95/p99 should improve noticeably on cold token refreshes and any IPv6-flaky egress. I'd merge it. The one thing I'd watch post-merge is the App Insights dependencies p95 for login.microsoftonline.com and management.azure.com — that's where the IPv4 fix should show up.

await ctx.Response.WriteAsync($"data: {donePayload}\n\n");
await ctx.Response.Body.FlushAsync();
}
catch (OperationCanceledException) { }
await ctx.Response.Body.FlushAsync();
}
catch (OperationCanceledException) { }
catch (ObjectDisposedException) { }
// writers, and these drilldowns run in parallel. Each call still
// gets its own ActivitySource span inside HttpHelper.
try { return (c, Breakdown: (object)await GetBreakdownForDay(token, subscriptionId, c.Date, groupBy, activity: null)); }
catch (Exception ex) { return (c, Breakdown: new { error = ex.Message }); }
// writers across the 8 parallel queries. HttpHelper still emits
// its own per-call span.
try { return (p.Label, Result: await RunResourceGraphQuery(token, p.Kql, subs, activity: null)); }
catch (Exception ex) { return (p.Label, Result: new { error = "query exception", detail = ex.Message }); }
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants