Skip to content

Improve build reliability and observability#21

Open
mscottford wants to merge 29 commits intovladbatushkov:mainfrom
mscottford:improve-build-reliability
Open

Improve build reliability and observability#21
mscottford wants to merge 29 commits intovladbatushkov:mainfrom
mscottford:improve-build-reliability

Conversation

@mscottford
Copy link
Copy Markdown
Contributor

Summary

  • Fixes a crash in Extractor.GetInherits where CreateTypeNode() returning null for non-class/non-interface types (structs, delegates, enums) caused a NullReferenceException during triple grouping
  • Fixes concurrent console output corruption in SpectreConsoleProgress by routing all AnsiConsole writes through a ConcurrentQueue drained by the ticker thread, and fixes a FormatException in OnGroupingError where triple JSON was passed as a format string
  • Adds distinct console labels for each skip outcome: Timeout, Unreadable, Failed, Skipped
  • Switches fresh builds from reading the MsBuildPipeLogger pipe result to reading the binlog via manager.Analyze, which is version-independent and doesn't silently drop events on SDK/host version mismatch
  • Adds a 5-minute build timeout via BuildWithTimeoutAsync so hung builds don't block the pipeline
  • Adds a retry in TryAnalyzeBinlogAsync to handle the race where the MSBuild child process closes its stdout pipe before the BinaryLogger finishes flushing the binlog to disk
  • Fixes the cache-hit path which was missing a null check on manager.Analyze and never updating the .deps sidecar
  • Distinguishes "build log could not be read" (binlog exists but unreadable) from "build failed" (no binlog produced) in skip reasons
  • Adds FileAnalysisProgress — a structured, timestamped log written per run for post-run diagnosis
  • Adds CompositeAnalysisProgress — fans out all progress notifications to any number of implementations
  • Adds --no-cache flag to force a full rebuild without deleting the cache directory
  • Adds --build-log-dir option to configure where per-project MSBuild plain-text logs are written

Dependencies

Test plan

  • Run against a real solution and confirm no projects fail with Unreadable on a clean run
  • Run twice; confirm second run uses cache and completes faster
  • Run with --no-cache; confirm all projects rebuild
  • Confirm strazh-run-*.log is written to the build log directory with correct event lines
  • Run a solution containing structs/enums in base lists; confirm no NullReferenceException during grouping

🤖 Generated with Claude Code

mscottford and others added 29 commits April 10, 2026 13:57
- Targets net10.0 (up from net9.0)
- Buildalyzer/Workspaces 7.1.0 → 8.0.0
- Microsoft.CodeAnalysis/CSharp 4.13.0 → 5.3.0 (Analyzers pin removed)
- Neo4j.Driver 5.27.0 → 6.0.0
- System.CommandLine 2.0.0-beta4 → 2.0.5 (first stable release)
- Dockerfile base image updated to dotnet/sdk:10.0-alpine
- Neo4j image updated to 2026.03.1 with corrected env var names
- docker-compose.yml: removed obsolete version attribute, replaced
  hardcoded Windows volume path with relative ./SystemUnderTest path
- Code updated for breaking API changes in System.CommandLine 2.0
  (SetAction, Options.Add, Aliases.Add, Required, AllowMultipleArgumentsPerToken)
  and Neo4j.Driver 6.0 (IDriver.CloseAsync removed, use await using)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…from analysis

When AddToWorkspace(workspace, addProjectReferences: true) is called for a project,
Buildalyzer eagerly adds its referenced projects into the workspace. The previous
deduplication check silently dropped those projects on their own loop iteration,
causing them to receive no CONTAINS triple and no code analysis.

Adds a regression test using a two-project fixture solution where ProjectA references
ProjectB — ensuring both appear in the analysis context regardless of workspace
insertion order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Stage 1: Build projects in parallel using .AsParallel() in GetAnalysisContext;
  each p.Build() spawns an isolated MSBuild process so there is no shared state.
  AdhocWorkspace mutations remain sequential (not thread-safe).

- Stage 3: Analyze and insert all projects concurrently using Task.WhenAll with
  Task.Run so CPU-bound syntax tree walking runs across thread-pool threads.
  The delete step is extracted to DbManager.DeleteData and run once upfront
  before the parallel loop to avoid races.

- Stage 4: A SemaphoreSlim(4) gates concurrent Neo4j sessions to avoid
  exhausting the connection pool on large solutions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The workspace population step (AddToWorkspace) is the slowest sequential
phase on large solutions and previously emitted no output, causing a
silent pause between "Building projects - finished." and "done."

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l to complete

Previously the pipeline had two hard barriers: all projects had to finish
building before workspace loading began, and all projects had to finish
loading before analysis began.

The Build stage now launches all MSBuild design-time builds concurrently via
Task.WhenEach. As each build completes it flows immediately into the Load
stage (sequential AdhocWorkspace mutation) and then yields a
(Project, IAnalyzerResult) pair to the caller. The Analyze and Insert stages
pick up each pair as it arrives via IAsyncEnumerable, so analysis of early-
finishing projects begins while later projects are still building and loading.

The solution-order sort is removed — AddToWorkspace(addProjectReferences:true)
handles any insertion order correctly via the transitive-reference path
introduced in the previous fix.

GetAnalysisContext is updated to async Task<AnalysisContext> to consume the
stream; the regression test is updated accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AddToWorkspace(addProjectReferences: true) calls analyzer.Build() internally
for any referenced project not yet in the workspace. When Build and Load were
interleaved via Task.WhenEach, in-flight build tasks raced with AddToWorkspace's
internal builds for the same project, causing Buildalyzer's environment detection
to fail with InvalidOperationException: Could not find build environment.

Fix: wait for all builds to complete (Task.WhenAll) before starting the Load
stage. Build → Load streaming is lost, but Load → Analyze streaming is retained:
each project is yielded as soon as it is loaded, so analysis begins immediately
without waiting for all projects to finish loading.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously all projects were built in parallel with no limit, spawning one
dotnet process per project simultaneously. A SemaphoreSlim(Environment.ProcessorCount)
now gates the Build stage so at most one build per logical core runs at a time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
string.GetHashCode() is randomized per-process in .NET Core and later (hash
randomization was introduced to prevent hash-flooding DoS attacks). Using it
as the Neo4j pk meant every run produced a different pk for the same logical
node, defeating MERGE and silently creating duplicate nodes instead of
updating existing ones.

Symptom: projects shared between two solutions (e.g. via a git submodule)
appeared as two separate Project nodes in the graph rather than one node with
CONTAINS edges from both Solution nodes.

Fix: replace GetHashCode() with MD5 in all SetPrimaryKey overrides. MD5 is
stable across processes and runtimes and is appropriate here since this is a
content-addressing use case, not a security one.

Note: existing databases will have stale pks. Re-analyze with --delete to
clear and rebuild the graph with deterministic pks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nged

Introduces a --cache option (defaulting to the platform application data
folder) that stores a .binlog and a .deps sidecar per project. On subsequent
runs, source file and project file timestamps are compared against the binlog;
if nothing has changed the binlog is replayed instead of re-running MSBuild.
Transitive dependency invalidation is derived from the .deps sidecars written
by previous builds, so MSBuild's own evaluation is used rather than our own
project file parsing. Adds tests verifying that cached (cold and warm) builds
yield the same results as uncached builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs prevented project folder nodes from being correctly captured in
the graph for solution-based analysis:

1. AddToWorkspace returns null for project types Roslyn does not support
   (e.g. F# or native projects). The null was yielded into Analyze where
   Project.Name access threw NullReferenceException, silently dropping the
   entire task and all its triples. Unsupported project types are now
   logged and skipped cleanly.

2. Project folder nodes (e.g. FolderNode("SomeProject")) were created and
   connected to their ProjectNode via INCLUDED_IN, but had no edge pointing
   up to the solution's root folder. They appeared in the graph as orphans
   unreachable from the solution folder via INCLUDED_IN traversal. A new
   TripleIncludedIn(projectRootNode, solutionRootNode) triple now connects
   each project folder to the solution's root folder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without a uniqueness constraint on the pk property, each MERGE operation
performs a full label scan — performance degrades linearly as the database
grows. EnsureIndexes creates a constraint (and its implicit B-tree index)
for all 8 node labels before any data operations run. The constraint is
idempotent (IF NOT EXISTS), so re-running is always safe.

Also removes the per-triple Console.WriteLine calls from InsertData that
were producing excessive noise in the output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces IAnalysisProgress as an event bus between Analyzer and the
terminal UI, removing all AnsiConsole/Console calls from Analyzer.cs.
SpectreConsoleProgress implements the interface using AnsiConsole.Live
with a custom IRenderable that sorts in-flight tasks by most-recently-
updated so stage changes always float to the top of the visible area.
Skipped projects (unsupported types such as Android/MAUI) are now
removed from the live display immediately. The summary line is pinned
near the bottom of the terminal via padding lines computed from the
terminal height.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CreateTypeNode() returns null for non-class/non-interface type kinds
(structs, delegates, enums, unresolvable types). Without the null guard,
new TripleOfType(classNode, null) creates a triple whose NodeB is null,
which throws NullReferenceException during triple.ToString() in the
grouping phase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Accepts any number of IAnalysisProgress instances and fans out every
notification to all of them. WrapAsync builds a nested chain so the
first implementation is the outermost wrapper, ensuring its UI context
(e.g. Spectre.Console live display) is active when inner implementations
run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writes one timestamped, machine-readable line per pipeline event so that
build failures, cache hits, stage timings, and grouping errors can be
diagnosed after a run without re-running or relying solely on the live
console output.

Format: {ISO-8601-UTC} {EVENT} {key=value ...}
String values are double-quoted; numbers and booleans are unquoted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Direct AnsiConsole.MarkupLine calls from concurrent project tasks
interleaved with ctx.Refresh() in the ticker, corrupting Spectre.Console's
cursor position and crashing the ticker. Route all console writes through
a ConcurrentQueue<Action> that the ticker drains before each Refresh(), so
all output originates from a single thread.

Also fixes a FormatException in OnGroupingError where AnsiConsole.Write
treated the triple JSON string as a format template, causing { characters
to throw. Changed to AnsiConsole.Write(new Text(...)) which bypasses
format processing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously every skipped project showed [yellow]Skipped[/] regardless of
why it was skipped. Now each distinct outcome has its own label:
  Timeout    — build timed out (yellow)
  Unreadable — binlog exists but could not be parsed (yellow)
  Failed     — MSBuild reported a build failure (red)
  Skipped    — other reasons such as unsupported project type (yellow)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces an Options record to replace the positional constructor
parameters, making it easier to add new options without breaking call
sites. Adds BuildLogDirectory (defaults to the platform data folder's
logs/ subdirectory) and NoCache (forces all projects to rebuild even
when a valid cached binlog exists).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When reading from a cached binlog, manager.Analyze could return null
(e.g. corrupted or truncated binlog from a previous interrupted build).
Without the null check the project silently received no triples.

The deps sidecar was also never updated on a cache hit, so a project
whose build previously timed out kept carrying stale dependency
information into subsequent staleness-propagation passes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The MsBuildPipeLogger can silently drop events when the host MSBuild
version doesn't match the SDK's, leaving p.Build() returning null even
after a successful build. Switching to manager.Analyze(binlogPath) reads
the binlog written natively by MSBuild, which is version-independent.

The child process closes its stdout pipe before the BinaryLogger finishes
flushing the binlog to disk. TryAnalyzeBinlogAsync retries once after
500ms to cover the window where the file exists but hasn't been fully
committed, eliminating an intermittent false "build failed" report.

BuildWithTimeoutAsync enforces a 5-minute limit so a hung build (e.g.
an Android project waiting on SDK tools absent from the environment)
doesn't block the rest of the pipeline indefinitely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously both cases reported "build failed", making it impossible to
tell from the log whether MSBuild actually failed or whether the binlog
was written but then couldn't be parsed. Now reports:
  "build failed"              — no binlog file produced after build
  "build log could not be read" — binlog exists but manager.Analyze yields no result

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a buildLogDirectory parameter to StreamProjectsAsync and passes it
through to CreateBuildOptions, which adds /fileLoggerParameters to the
MSBuild invocation so each project gets a plain-text build log alongside
the binary log. The directory is created on demand if it doesn't exist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When noCache is true, marks every project as needing a rebuild regardless
of whether its cached binlog is still fresh. This lets users force a clean
build pass to refresh stale cache entries without deleting the cache
directory manually.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Exposes the NoCache and BuildLogDirectory settings added to AnalyzerConfig
as first-class CLI flags. Also refactors BuildKnowledgeGraph to accept an
AnalyzerConfig.Options record instead of six positional parameters, making
it easier to thread new options through without changing every call site.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Threads the two new AnalyzerConfig properties through the Analyze →
StreamProjectsAsync call so that --no-cache and --build-log-dir actually
take effect during a run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Every analysis run now writes a structured, timestamped log to the build
log directory alongside the per-project MSBuild logs. The log captures
build starts, cache hits, stage transitions, grouping errors, and final
outcomes so failures can be diagnosed post-run without re-running or
relying on the live terminal output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant