RFC 0001: Weaver for Apache Thrift Tooling Platform (`thriftfmt`, `thriftlint`, `thriftls`, VS Code Extension)

Status: Accepted
Authors: Dmytro Shteflyuk
Created: 2026-02-23
Target release: Beta (date TBD)

Summary

This RFC proposes Weaver for Apache Thrift, a standalone tooling project for Apache Thrift IDL editing and formatting, consisting of:

thriftfmt: a stable, lossless-aware formatter for .thrift files
thriftlint: a diagnostics-oriented linter for .thrift files
thriftls: an LSP server for editor integrations
a VS Code extension with syntax highlighting and LSP integration

The project will be implemented primarily in Go and designed around a reusable syntax/formatting engine. Parsing will use a tree-sitter grammar (for incremental, error-tolerant parsing suitable for LSP) plus a custom lossless lexer/token-trivia layer (for formatter fidelity).

The public product name is Weaver for Apache Thrift. Repository and module identifiers remain thrift-weaver.

Motivation

The current Apache Thrift C++ compiler frontend is optimized for semantic compilation and code generation, not source-preserving formatting:

whitespace and regular comments are discarded early
some syntax is normalized into semantic representations
top-level declarations are stored in typed collections rather than source order

These are good compiler design choices, but they are poor foundations for a modern formatter/LSP stack.

Building a dedicated tooling project allows:

lossless parsing/trivia preservation for formatting
error-tolerant incremental parsing for editors
a cleaner Go-based developer experience
independent release cadence from the Apache Thrift compiler

Goals

Provide a deterministic, idempotent Apache Thrift formatter (thriftfmt)
Provide an Apache Thrift linter CLI (thriftlint) that reuses parser diagnostics and lint rules
Provide baseline structural lint rules for duplicate explicit field IDs, duplicate field names, and other deprecated/unsafe constructs detectable within a single document
Provide bounded single-document semantic diagnostics for locally resolvable type and service constraints without requiring workspace indexing
Provide a production-quality LSP server (thriftls) for editors
Provide a VS Code extension with syntax highlighting and LSP client integration
Preserve comments and syntax fidelity where formatter policy permits
Support invalid/incomplete code in editor workflows
Validate formatted output compatibility against the official Apache Thrift compiler in CI

Non-Goals (Initial Scope)

Replacing the official Apache Thrift compiler
Whole-program semantic type checking, include-graph resolution, and cross-file indexing in v1
Go-to-definition and rename in v1 (both depend on cross-file indexing)
A perfect source-preserving rewriter (formatter may normalize whitespace and selected style choices)
Embedding formatter/LSP into the existing thrift binary

High-Level Architecture

The platform is a shared engine with CLI and LSP frontends, plus a VS Code client.

                    +----------------------+
                    |    VS Code Plugin    |
                    | TextMate + LSP client|
                    +----------+-----------+
                               |
                               | JSON-RPC (LSP)
                               v
                    +----------------------+
                    |       thriftls       |
                    |  LSP transport/API   |
                    +----------+-----------+
                               |
                    +----------v-----------+
                    |  Shared Go Engine    |
                    | lexer + tokens       |
                    | tree-sitter parser   |
                    | CST wrappers         |
                    | diagnostics          |
                    | formatter            |
                    +----------+-----------+
                               ^
                               |
                    +----------+-----------+
                    |       thriftfmt      |
                    |  CLI (check/write)   |
                    +----------------------+

Core Technical Decisions

1. Language and Runtime

Implementation language: Go
Parser runtime: embedded tree-sitter wasm executed in-process via wazero
Rationale:
- rapid iteration and testing
- straightforward CLI/LSP packaging
- strong ecosystem for tooling and CI

2. Parsing Strategy

Use tree-sitter for syntax parsing and incremental updates
Add a custom lossless lexer for trivia and exact token lexemes

Rationale:

tree-sitter gives incremental/error-tolerant parsing and node spans
custom lexer gives formatter-grade trivia preservation and lexeme fidelity
hybrid approach reduces risk versus hand-rolling a fully incremental parser

Normative v1 decision:

tree-sitter is the structural parser only; the custom lexer is the token/trivia source of truth for formatting.
All formatter output decisions must be derived from:
- CST structure (node kinds + spans)
- lossless token/trivia spans
- formatter policy
No formatter logic may depend on tree-sitter tokenization internals.

3. Syntax Representation

Internal primary representation for formatting/LSP: CST-oriented syntax tree + lossless token stream
No semantic AST required for v1 formatter/LSP

4. Formatter Strategy

Deterministic pretty-printer with doc-algebra style layout
Preserve comments and token lexemes where policy allows
Regenerate whitespace/indentation
Support full-document and range formatting

5. LSP Strategy

Snapshot-based document model keyed by URI+version
Full reparse on change in v1 (designed to allow incremental optimization later)
Error-tolerant parsing and partial diagnostics for malformed code

Normative v1 decisions:

LSP text sync mode: Incremental (textDocument/didChange with ranged edits)
Internal parse mode: full reparse from reconstructed document text after each accepted change
Formatting on invalid syntax:
- textDocument/formatting and rangeFormatting may return an LSP error (RequestFailed) when formatting is unsafe
- diagnostics continue to be published asynchronously via publishDiagnostics

Repository Layout (Monorepo)

Proposed repository root (new project, separate from Apache Thrift repo):

thrift-weaver/
  README.md
  LICENSE
  go.mod
  go.sum
  .github/
    workflows/
      ci.yml
      release.yml
  docs/
    architecture.md
    formatting-style.md
    release.md
    rfcs/
      0001-thrift-tooling-platform.md
  cmd/
    thriftfmt/
      main.go
    thriftlint/
      main.go
    thriftls/
      main.go
  internal/
    text/
      line_index.go
      positions.go
      edits.go
    lexer/
      token.go
      trivia.go
      lexer.go
      lexer_test.go
    syntax/
      kinds.go
      parse.go
      diagnostics.go
      cst.go
      query.go
      treesitter/
        parser.go
        language.go
        node.go
    format/
      doc.go
      printer.go
      comments.go
      policy.go
      format.go
      range_format.go
      format_test.go
    lsp/
      server.go
      handlers.go
      transport_stdio.go
      snapshots.go
      workspace.go
      capabilities.go
      diagnostics.go
      formatting.go
      symbols.go
      folding.go
      semantic_tokens.go
    testutil/
      corpus.go
      goldens.go
      thrift_oracle.go
  grammar/
    tree-sitter-thrift/
      grammar.js
      src/
      queries/
        highlights.scm
        folds.scm
        symbols.scm
  editors/
    vscode/
      package.json
      src/
        extension.ts
        client.ts
        config.ts
      syntaxes/
        thrift.tmLanguage.json
      language-configuration.json
      scripts/
        package-binaries.ts
      README.md
      CHANGELOG.md
  testdata/
    corpus/
      valid/
      invalid/
      editor/
    format/
      input/
      expected/
    lsp/
      scenarios/
  scripts/
    bootstrap.sh
    generate-tree-sitter.sh
    sync-thrift-corpus.sh

Module Boundaries and Responsibilities

`internal/text`

Purpose:

Line index and offset math
Byte offset <-> UTF-8 line/column
Byte offset <-> LSP UTF-16 positions
Text edit utilities and diff helpers

Constraints:

This package is the only place that understands LSP UTF-16 conversions.
Parser/formatter APIs should use byte offsets internally.
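The byte-offset to UTF-16 conversion described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the function name ByteOffsetToUTF16 is hypothetical, and a real internal/text would presumably use a precomputed line index rather than scanning from the start of the buffer on every call.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// UTF16Position mirrors the LSP position shape: 0-based line and
// 0-based column counted in UTF-16 code units.
type UTF16Position struct {
	Line      int
	Character int
}

// ByteOffsetToUTF16 (hypothetical name) converts a byte offset in src to a
// UTF-16 position. Offsets outside [0, len(src)] are clamped.
func ByteOffsetToUTF16(src []byte, offset int) UTF16Position {
	if offset < 0 {
		offset = 0
	}
	if offset > len(src) {
		offset = len(src)
	}
	var pos UTF16Position
	for i := 0; i < offset; {
		r, size := utf8.DecodeRune(src[i:])
		i += size
		switch {
		case r == '\n':
			pos.Line++
			pos.Character = 0
		case r >= 0x10000:
			pos.Character += 2 // encoded as a surrogate pair in UTF-16
		default:
			pos.Character++
		}
	}
	return pos
}

func main() {
	src := []byte("a😀b\nc")
	fmt.Println(ByteOffsetToUTF16(src, 5)) // position just after the emoji
	fmt.Println(ByteOffsetToUTF16(src, 7)) // start of the second line
}
```

Keeping this logic in one package means the parser and formatter never need to reason about surrogate pairs.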

`internal/lexer`

Purpose:

Produce a lossless token stream with trivia and raw spans
Provide stable token kinds independent of tree-sitter internals

Key responsibilities:

Exact lexeme slicing from source
Comment classification (//, #, /* */, /** */)
Whitespace/newline trivia capture
Robust handling of malformed strings/comments (emit error tokens + diagnostics)

v1 decision:

Use a leading-trivia-only storage model unless a concrete formatter bug requires trailing trivia.
Trailing fields in example APIs below are illustrative and may be omitted from implementation.

`internal/syntax`

Purpose:

Wrap tree-sitter parse tree with project-specific CST API
Merge tree nodes with token stream
Produce diagnostics and syntax queries for editor features

Key responsibilities:

Parse source into Tree (CST root + token stream + diagnostics)
Provide node iteration/query helpers
Support parse recovery and error nodes
Track stable spans for range formatting and editor features

Critical invariant:

Every non-synthetic CST node span must map to a contiguous source byte range.
FirstToken / LastToken must reference tokens whose spans are within the node span.
Error/recovery nodes must still preserve source order in Children.

`internal/format`

Purpose:

Format source using CST + token/trivia model
Return full-file output and precise edits

Key responsibilities:

Doc-algebra builder/printer
Comment placement and preservation rules
Full and range formatting
Idempotence guarantees

Critical invariant:

Formatter must never emit text outside the input document's declared encoding assumptions (UTF-8 bytes in, UTF-8 bytes out).

`internal/lsp`

Purpose:

LSP server implementation over shared engine
Snapshot lifecycle and request routing

Key responsibilities:

document lifecycle (didOpen, didChange, didClose)
diagnostics publishing
formatting handlers
symbols/folds/selection ranges
semantic tokens
cancellation and version consistency
structured logging/trace hooks for debugging and support

v1 concurrency model:

requests may be handled concurrently across different documents
operations for the same document must resolve against a single immutable snapshot version
stale formatting requests (older version than current snapshot) may return ContentModified

`editors/vscode`

Purpose:

VS Code client and packaging
Syntax highlighting (TextMate baseline plus semantic-token overlay from thriftls)
Launch managed-install or user-provided thriftls

Key responsibilities:

register language and grammar
spawn server
configure transport and settings
surface logs/errors to users

Data Structures (Go API-Level)

This section defines the core data model for the engine.

Source and Positioning

package text

type ByteOffset int

type Span struct {
    Start ByteOffset // inclusive
    End   ByteOffset // exclusive
}

type Point struct {
    Line   int // 0-based
    Column int // byte column
}

type Range struct {
    Start Point
    End   Point
}

// LSP-facing UTF-16 position/range, kept at edges only.
type UTF16Position struct {
    Line      int
    Character int
}

type UTF16Range struct {
    Start UTF16Position
    End   UTF16Position
}

Token and Trivia Model

package lexer

type TokenKind uint16
type TriviaKind uint8

const (
    TriviaWhitespace TriviaKind = iota
    TriviaNewline
    TriviaLineComment
    TriviaHashComment
    TriviaBlockComment
    TriviaDocComment
)

type Trivia struct {
    Kind TriviaKind
    Span text.Span
}

type Token struct {
    Kind    TokenKind
    Span    text.Span
    Leading []Trivia
    Flags   TokenFlags // e.g. malformed, synthesized, recovered
}

type TokenFlags uint8

Notes:

Token text is recovered via source[token.Span.Start:token.Span.End].
Trivia also points into source via spans; no duplicated strings by default.
A leading-trivia-only model is acceptable in v1 if comment placement remains stable.
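To illustrate why a leading-trivia-only model can still be lossless, the sketch below uses a toy lexer (whitespace is trivia, everything else is a token) and rebuilds the source from spans alone. The names lexToy and Reconstruct are hypothetical, and the zero-width EOF token that owns trailing trivia is an assumption of this sketch rather than something the RFC specifies.

```go
package main

import "fmt"

type Span struct{ Start, End int }

type Trivia struct{ Span Span }

type Token struct {
	Span    Span
	Leading []Trivia
}

func isSpace(b byte) bool { return b == ' ' || b == '\t' || b == '\n' || b == '\r' }

// lexToy splits src into whitespace trivia and non-whitespace tokens.
// Trivia is attached to the token that follows it.
func lexToy(src []byte) []Token {
	var toks []Token
	var pending []Trivia
	for i := 0; i < len(src); {
		start := i
		space := isSpace(src[i])
		for i < len(src) && isSpace(src[i]) == space {
			i++
		}
		if space {
			pending = append(pending, Trivia{Span{start, i}})
			continue
		}
		toks = append(toks, Token{Span: Span{start, i}, Leading: pending})
		pending = nil
	}
	// Assumption: a zero-width EOF token owns any trailing trivia.
	return append(toks, Token{Span: Span{len(src), len(src)}, Leading: pending})
}

// Reconstruct rebuilds the original bytes purely from token and trivia spans.
func Reconstruct(src []byte, toks []Token) []byte {
	out := make([]byte, 0, len(src))
	for _, t := range toks {
		for _, tr := range t.Leading {
			out = append(out, src[tr.Span.Start:tr.Span.End]...)
		}
		out = append(out, src[t.Span.Start:t.Span.End]...)
	}
	return out
}

func main() {
	src := []byte("struct  Foo {\n\ti32 id\n}\n")
	toks := lexToy(src)
	fmt.Println(string(Reconstruct(src, toks)) == string(src))
}
```

A round-trip check of this kind is a cheap property to assert over the whole test corpus.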

Syntax Tree (CST Wrapper)

package syntax

type NodeKind uint16
type NodeID uint32

const NoNode NodeID = 0
// Real node IDs are 1-based. NodeID is not required to equal the slice index.

type ChildRef struct {
    IsToken bool
    Index   uint32 // token index or node index
}

type Node struct {
    ID         NodeID
    Kind       NodeKind
    Span       text.Span
    FirstToken uint32 // inclusive token index
    LastToken  uint32 // inclusive token index
    Parent     NodeID // NoNode for root
    Children   []ChildRef // original source order
    Flags      NodeFlags  // error/recovered/synthetic
}

type NodeFlags uint8

type Tree struct {
    URI         string
    Version     int32
    Source      []byte
    Tokens      []lexer.Token
    Nodes       []Node
    Root        NodeID
    Diagnostics []Diagnostic
    LineIndex   *text.LineIndex
}

Design notes:

Tree is immutable after parse.
Nodes are stored in slices for cache locality and stable indexing.
Parent pointers enable quick ancestor widening for range formatting.
Children preserve exact syntax order, even for malformed or recovered regions.

Diagnostics

package syntax

type Severity uint8

const (
    SeverityError Severity = iota + 1
    SeverityWarning
    SeverityInfo
)

type DiagnosticCode string

type Diagnostic struct {
    Code        DiagnosticCode
    Message     string
    Severity    Severity
    Span        text.Span
    Related     []RelatedDiagnostic
    Source      string // "lexer", "parser", "formatter"
    Recoverable bool
}

type RelatedDiagnostic struct {
    Message string
    Span    text.Span
}

Formatter Result Types

package format

type Options struct {
    LineWidth           int
    Indent              string // default: "  "
    MaxBlankLines       int
    PreserveCommentCols bool // v2, experimental
}

type Result struct {
    Output      []byte
    Changed     bool
    Diagnostics []syntax.Diagnostic
}

type RangeResult struct {
    Edits       []text.ByteEdit
    Diagnostics []syntax.Diagnostic
}

text.ByteEdit (referenced above) is defined as:

package text

type ByteEdit struct {
    Span    Span
    NewText []byte
}
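One way byte edits of this shape might be applied is shown below. This is a sketch under stated assumptions: ApplyEdits is a hypothetical helper, edits are taken to refer to offsets in the original buffer, and overlapping or out-of-bounds edits are rejected rather than merged.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

type Span struct{ Start, End int }

type ByteEdit struct {
	Span    Span
	NewText []byte
}

// ApplyEdits returns a copy of src with all edits applied. Edit spans refer
// to offsets in the original src and must be in bounds and non-overlapping.
func ApplyEdits(src []byte, edits []ByteEdit) ([]byte, error) {
	sorted := append([]ByteEdit(nil), edits...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Span.Start < sorted[j].Span.Start
	})
	out := make([]byte, 0, len(src))
	cursor := 0
	for _, e := range sorted {
		if e.Span.Start < cursor || e.Span.End < e.Span.Start || e.Span.End > len(src) {
			return nil, errors.New("edit is out of bounds or overlaps a previous edit")
		}
		out = append(out, src[cursor:e.Span.Start]...) // untouched bytes
		out = append(out, e.NewText...)                // replacement
		cursor = e.Span.End
	}
	return append(out, src[cursor:]...), nil
}

func main() {
	src := []byte("struct Foo{}")
	out, err := ApplyEdits(src, []ByteEdit{
		{Span: Span{10, 10}, NewText: []byte(" ")},   // insert a space before "{"
		{Span: Span{0, 6}, NewText: []byte("union")}, // replace "struct"
	})
	fmt.Println(string(out), err)
}
```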

LSP Snapshot Model

package lsp

type Snapshot struct {
    URI       string
    Version   int32
    Tree      *syntax.Tree
    UpdatedAt time.Time
}

type DocumentStore interface {
    Get(uri string) (*Snapshot, bool)
    Put(snapshot *Snapshot)
    Delete(uri string)
}
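A minimal in-memory store satisfying this interface could look like the sketch below. The MemoryStore type, the rule that Put ignores snapshots older than the stored one, and the IsStale helper are assumptions for illustration; the Snapshot here carries plain text in place of *syntax.Tree so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

type Snapshot struct {
	URI     string
	Version int32
	Text    string // stand-in for *syntax.Tree in this sketch
}

// MemoryStore is a hypothetical DocumentStore backed by a guarded map.
type MemoryStore struct {
	mu   sync.RWMutex
	docs map[string]*Snapshot
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{docs: make(map[string]*Snapshot)}
}

func (s *MemoryStore) Get(uri string) (*Snapshot, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	snap, ok := s.docs[uri]
	return snap, ok
}

// Put stores the snapshot unless a newer version is already present
// (an assumption of this sketch, to tolerate out-of-order parses).
func (s *MemoryStore) Put(snap *Snapshot) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.docs[snap.URI]; ok && cur.Version > snap.Version {
		return
	}
	s.docs[snap.URI] = snap
}

func (s *MemoryStore) Delete(uri string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.docs, uri)
}

// IsStale reports whether a request made against requestVersion is older
// than the current snapshot, or the document is no longer open.
func (s *MemoryStore) IsStale(uri string, requestVersion int32) bool {
	cur, ok := s.Get(uri)
	return !ok || requestVersion < cur.Version
}

func main() {
	store := NewMemoryStore()
	store.Put(&Snapshot{URI: "file:///a.thrift", Version: 1, Text: "struct A {}"})
	store.Put(&Snapshot{URI: "file:///a.thrift", Version: 3, Text: "struct A { 1: i32 id }"})
	store.Put(&Snapshot{URI: "file:///a.thrift", Version: 2, Text: "late arrival"})
	snap, _ := store.Get("file:///a.thrift")
	fmt.Println(snap.Version, store.IsStale("file:///a.thrift", 2))
}
```

Because snapshots are replaced rather than mutated, a handler that already holds a *Snapshot keeps a consistent view even while newer versions arrive.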

Engine APIs (Go, Internal-First)

The examples below define the intended engine/package contracts for implementation. v1 does not commit to a public/stable Go library API; packages remain internal until post-beta.

Parsing APIs

package syntax

type ParseOptions struct {
    URI            string
    Version        int32
    IncludeQueries bool // parse tree-sitter query metadata if needed
}

func Parse(ctx context.Context, src []byte, opts ParseOptions) (*Tree, error)

// Future incremental API; v1 may parse from scratch.
func Reparse(ctx context.Context, old *Tree, src []byte, opts ParseOptions) (*Tree, error)

Behavior:

Returns a Tree even if syntax errors exist (best-effort), unless parsing infrastructure fails catastrophically.
Parser errors appear in Tree.Diagnostics.
error is reserved for internal failures (cancellation, parser initialization, invariant violations).
Reparse is an optimization API. It must remain behaviorally equivalent to Parse for the same input bytes and options.

Formatting APIs

package format

func Document(ctx context.Context, tree *syntax.Tree, opts Options) (Result, error)

func Range(ctx context.Context, tree *syntax.Tree, r text.Span, opts Options) (RangeResult, error)

// Convenience wrapper for CLI paths.
func Source(ctx context.Context, src []byte, uri string, opts Options) (Result, error)

Behavior:

Document may refuse to format if parse errors exceed a safety threshold (configurable policy).
Range widens to the nearest format-safe ancestor (declaration/block/list node).
Both functions are deterministic and idempotent given the same tree/options.

Formatting refusal contract:

Refusal due to unsafe syntax is not a process error.
Engine API will return a typed error (e.g., ErrUnsafeToFormat) for unsafe formatting requests.
LSP/CLI layers map ErrUnsafeToFormat to protocol/UX behavior (LSP RequestFailed, CLI exit code 2) while continuing to surface diagnostics from parsing.
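The refusal contract above might be wired up roughly as follows. This is a sketch, not the final API: ExitCode, LSPErrorCode, and the errStale sentinel are hypothetical names. The numeric LSP codes are the ContentModified and RequestFailed values from the LSP specification, and the exit codes follow the thriftfmt exit-code table in this RFC.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnsafeToFormat is the typed refusal error named in this RFC.
var ErrUnsafeToFormat = errors.New("format: unsafe to format")

// errStale is a hypothetical sentinel for requests against an old snapshot.
var errStale = errors.New("lsp: stale document version")

// Error codes as defined by JSON-RPC and the LSP specification.
const (
	lspInternalError   = -32603
	lspContentModified = -32801
	lspRequestFailed   = -32803
)

// ExitCode maps a formatting outcome to a thriftfmt exit code.
func ExitCode(err error, check, changed bool) int {
	switch {
	case errors.Is(err, ErrUnsafeToFormat):
		return 2 // syntax errors prevented formatting
	case err != nil:
		return 3 // internal error
	case check && changed:
		return 1 // formatting changes required in --check
	default:
		return 0
	}
}

// LSPErrorCode maps an engine error to an LSP response error code.
// It returns 0 when there is no error.
func LSPErrorCode(err error) int {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, ErrUnsafeToFormat):
		return lspRequestFailed
	case errors.Is(err, errStale):
		return lspContentModified
	default:
		return lspInternalError
	}
}

func main() {
	// Wrapping keeps the typed error detectable via errors.Is.
	err := fmt.Errorf("range 10:20: %w", ErrUnsafeToFormat)
	fmt.Println(ExitCode(err, false, false), LSPErrorCode(err))
}
```

Using errors.Is rather than direct comparison lets lower layers add context to the refusal without breaking the mapping.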

LSP Server APIs (Internal)

package lsp

type ServerOptions struct {
    Logf          func(string, ...any)
    FormatOptions format.Options
}

type Server struct {
    // internal state
}

func NewServer(opts ServerOptions) *Server
func (s *Server) RunStdio(ctx context.Context) error

Request handlers (internal signatures):

func (s *Server) DidOpen(ctx context.Context, p DidOpenParams) error
func (s *Server) DidChange(ctx context.Context, p DidChangeParams) error
func (s *Server) DidClose(ctx context.Context, p DidCloseParams) error
func (s *Server) Formatting(ctx context.Context, p DocumentFormattingParams) ([]TextEdit, error)
func (s *Server) RangeFormatting(ctx context.Context, p DocumentRangeFormattingParams) ([]TextEdit, error)
func (s *Server) DocumentSymbol(ctx context.Context, p DocumentSymbolParams) ([]DocumentSymbol, error)
func (s *Server) FoldingRange(ctx context.Context, p FoldingRangeParams) ([]FoldingRange, error)
func (s *Server) SelectionRange(ctx context.Context, p SelectionRangeParams) ([]SelectionRange, error)
func (s *Server) SemanticTokensFull(ctx context.Context, p SemanticTokensParams) (*SemanticTokens, error) // phase 2

LSP protocol contract (normative v1):

initialize advertises incremental sync, document/range formatting, document symbols, folding ranges, and selection ranges.
initialize must not advertise unsupported methods behind placeholders.
shutdown is graceful and idempotent; exit terminates process.
textDocument/formatting and textDocument/rangeFormatting:
- return RequestFailed when formatting is unsafe (ErrUnsafeToFormat)
- return ContentModified when request version is stale relative to current snapshot
Unknown methods return standard JSON-RPC method-not-found behavior.
Server must remain responsive under cancellation and treat cancellation as non-fatal.

Formatter Design

Formatting Policy (v1)

The formatter will:

normalize indentation
normalize horizontal spacing
normalize blank line counts
preserve comments
preserve declaration and member order
preserve token lexemes where possible:
- string quote style and escapes
- hex/decimal literal spelling
- deprecated spellings (async, byte) unless an explicit normalize option is added

The formatter will not (v1):

reorder imports/includes/namespaces
rewrite deprecated syntax
enforce semantic style (e.g. field ids ordering)

Default Style Profile (v1)

These defaults are normative for the first implementation and for golden tests unless changed by a future RFC:

LineWidth = 100
Indent = "  " (two spaces)
MaxBlankLines = 2
top-level declarations separated by one blank line
members (fields/functions/enum values) formatted one per line
preserve existing separator lexeme when syntactically equivalent (, vs ;) in v1
preserve literal spellings and comment text
invalid-code formatting in LSP defaults to fail-closed (RequestFailed) unless formatting is provably safe

If a syntax construct cannot be formatted without choosing a canonical separator, choose semicolon for declarations and comma for list/map/annotation items, and document the exception in tests.

Doc-Algebra Model

Internal printer primitives:

Text
Line (hard break)
SoftLine (space or line)
Indent
Group
Concat
IfBreak (optional in v2)

This enables:

stable wrapping at configurable width
consistent nested formatting (types, annotations, const literals)
reuse across full/range formatting
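A deliberately small printer over the primitives listed above can show how Group and SoftLine interact. This is a simplified sketch, not the planned implementation: the Render and enumDoc names are hypothetical, IfBreak is omitted, and a group is laid out flat whenever its own flat width fits in the remaining columns (content following the group is ignored, which a production printer would need to account for).

```go
package main

import (
	"fmt"
	"strings"
)

type Doc any

type (
	Text     string
	Line     struct{}          // hard break
	SoftLine struct{}          // space when flat, newline when broken
	Indent   struct{ Body Doc }
	Group    struct{ Body Doc }
	Concat   []Doc
)

// flatWidth returns the width of d printed on one line, or -1 if d
// contains a hard break and therefore can never be flat.
func flatWidth(d Doc) int {
	switch v := d.(type) {
	case Text:
		return len(v)
	case SoftLine:
		return 1
	case Line:
		return -1
	case Indent:
		return flatWidth(v.Body)
	case Group:
		return flatWidth(v.Body)
	case Concat:
		total := 0
		for _, c := range v {
			w := flatWidth(c)
			if w < 0 {
				return -1
			}
			total += w
		}
		return total
	}
	return 0
}

type printer struct {
	sb         strings.Builder
	width, col int
	indentUnit string
}

func (p *printer) newline(depth int) {
	pad := strings.Repeat(p.indentUnit, depth)
	p.sb.WriteString("\n" + pad)
	p.col = len(pad)
}

func (p *printer) print(d Doc, depth int, flat bool) {
	switch v := d.(type) {
	case Text:
		p.sb.WriteString(string(v))
		p.col += len(v)
	case Line:
		p.newline(depth)
	case SoftLine:
		if flat {
			p.sb.WriteByte(' ')
			p.col++
		} else {
			p.newline(depth)
		}
	case Indent:
		p.print(v.Body, depth+1, flat)
	case Group:
		w := flatWidth(v.Body)
		p.print(v.Body, depth, w >= 0 && p.col+w <= p.width)
	case Concat:
		for _, c := range v {
			p.print(c, depth, flat)
		}
	}
}

// Render lays out d for the given line width and indent unit.
func Render(d Doc, width int, indent string) string {
	p := &printer{width: width, indentUnit: indent}
	p.print(d, 0, false)
	return p.sb.String()
}

// enumDoc builds a small example document for an enum body.
func enumDoc() Doc {
	return Group{Concat{
		Text("enum Color {"),
		Indent{Concat{SoftLine{}, Text("RED,"), SoftLine{}, Text("GREEN")}},
		SoftLine{}, Text("}"),
	}}
}

func main() {
	fmt.Println(Render(enumDoc(), 100, "  "))
	fmt.Println(Render(enumDoc(), 20, "  "))
}
```

The same document value renders on one line at width 100 and one member per line at width 20, which is the property that makes the builder reusable across full and range formatting.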

Comment Handling

Comment fidelity is a formatter-critical requirement.

Policy:

comments are lexed as trivia with spans
formatter emits comments at token boundaries based on trivia ownership
blank-line preservation is conservative (cap at MaxBlankLines)
no comment text rewriting in v1

Edge cases to support:

comments between type and identifier
trailing comments on fields and enum values
doc comments preceding declarations and members
comments inside const maps/lists

Source Text and Newline Policy (v1)

Normative rules:

Input bytes are treated as UTF-8 for parsing/formatting.
UTF-8 BOM at file start is preserved if present.
Invalid UTF-8 bytes:
- parser/lexer may emit diagnostics
- formatter must refuse (ErrUnsafeToFormat) rather than rewrite bytes
Newline style:
- preserve dominant file newline style (LF or CRLF) for formatter-emitted line breaks
- mixed newline input may be normalized to the dominant style and should emit a diagnostic (non-fatal if formatting is otherwise safe)
Formatter must not introduce NUL bytes.
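The source-text checks above are cheap to compute up front. The sketch below is one possible shape: the SourceInfo type and Inspect function are hypothetical, and choosing LF when CRLF and LF counts are tied is an assumption, since the RFC does not define a tie-break.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// SourceInfo summarizes the properties the newline/encoding policy cares about.
type SourceInfo struct {
	HasBOM    bool   // UTF-8 BOM at file start (to be preserved)
	ValidUTF8 bool   // false means the formatter should refuse
	Newline   string // dominant style: "\n" or "\r\n"
	Mixed     bool   // both styles present; worth a diagnostic
}

// Inspect scans src once per property and reports the summary.
func Inspect(src []byte) SourceInfo {
	crlf := bytes.Count(src, []byte("\r\n"))
	lf := bytes.Count(src, []byte("\n")) - crlf // bare LF only
	info := SourceInfo{
		HasBOM:    bytes.HasPrefix(src, utf8BOM),
		ValidUTF8: utf8.Valid(src),
		Newline:   "\n", // assumption: LF wins ties and empty files
		Mixed:     crlf > 0 && lf > 0,
	}
	if crlf > lf {
		info.Newline = "\r\n"
	}
	return info
}

func main() {
	fmt.Printf("%+v\n", Inspect([]byte("a\r\nb\r\nc\n")))
	fmt.Printf("%+v\n", Inspect([]byte{0xFF, '\n'}))
}
```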

Parsing and Tree-Sitter Integration

Grammar Scope

The tree-sitter grammar must support:

current Apache Thrift syntax
common deprecated syntax forms tolerated in practice (as parseable nodes/tokens)
error recovery around top-level declarations and container/literal boundaries

Query Files

grammar/tree-sitter-thrift/queries/ will include:

highlights.scm for syntax highlighting (future semantic overlay optional)
folds.scm for folding ranges
symbols.scm for declarations (services, structs, enums, typedefs, consts)

WASM Build Strategy

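tree-sitter integration introduces C code at build time: the generated parser and the tree-sitter core runtime are C sources that are compiled into the embedded wasm artifact.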

Plan:

vendor/generated parser C sources in repo
vendor the tree-sitter core C runtime sources used for wasm artifact generation
build embedded parser wasm artifacts and ship pure-Go (CGO_ENABLED=0) binaries
test builds on macOS/Linux/Windows in CI before extension packaging work starts

Risk mitigation:

lock tree-sitter runtime/parser versions
add a dedicated parser build smoke test in CI

Windows ARM64 note:

Building windows/arm64 is straightforward for the shipped binaries because runtime execution is pure Go and does not require cgo.
Grammar wasm generation still needs the pinned wasm toolchain in development/CI, but not in end-user environments.

Parser/Lexer Alignment Invariants (Must-Have)

Because parsing is hybrid (tree-sitter + custom lexer), alignment rules must be explicit:

all CST node spans are in byte offsets over the same source buffer used by the lexer
lexer token spans must form a monotonically increasing sequence ending at EOF
formatter lookup from CST node -> covering token range must be deterministic
any span mismatch between lexer and parser is a parser bug and should surface as an internal diagnostic/test failure

Implementation note:

create a small conformance test suite that asserts CST node spans align with expected token boundaries for representative grammar forms (declarations, nested containers, comments, malformed inputs)
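The alignment invariants could be asserted with checks along these lines. This is a sketch with hypothetical names (CheckTokens, CheckNode) and simplified types. It assumes gaps between token spans are legal because they hold trivia, and that the token stream ends with a token whose span ends at EOF.

```go
package main

import (
	"errors"
	"fmt"
)

type Span struct{ Start, End int }

type Token struct{ Span Span }

type Node struct {
	Span                  Span
	FirstToken, LastToken int // inclusive token indexes
}

// CheckTokens verifies that token spans are well formed, non-overlapping,
// in increasing order, and that the final token ends at srcLen.
func CheckTokens(toks []Token, srcLen int) error {
	prevEnd := 0
	for i, t := range toks {
		if t.Span.Start < prevEnd || t.Span.End < t.Span.Start {
			return fmt.Errorf("token %d: span %v is out of order", i, t.Span)
		}
		prevEnd = t.Span.End
	}
	if len(toks) == 0 || prevEnd != srcLen {
		return errors.New("token stream does not end at EOF")
	}
	return nil
}

// CheckNode verifies that the node's first and last tokens lie within
// the node's own span.
func CheckNode(n Node, toks []Token) error {
	if n.FirstToken < 0 || n.LastToken >= len(toks) || n.FirstToken > n.LastToken {
		return errors.New("node token range is invalid")
	}
	first, last := toks[n.FirstToken].Span, toks[n.LastToken].Span
	if first.Start < n.Span.Start || last.End > n.Span.End {
		return errors.New("node tokens fall outside the node span")
	}
	return nil
}

func main() {
	// Tokens for the source "a  b": two words plus a zero-width EOF token.
	toks := []Token{{Span{0, 1}}, {Span{3, 4}}, {Span{4, 4}}}
	fmt.Println(CheckTokens(toks, 4))
	fmt.Println(CheckNode(Node{Span: Span{0, 4}, FirstToken: 0, LastToken: 1}, toks))
	fmt.Println(CheckNode(Node{Span: Span{0, 2}, FirstToken: 0, LastToken: 1}, toks))
}
```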

LSP Feature Set and Phasing

v1 (MVP)

initialize
shutdown, exit
textDocument/didOpen
textDocument/didChange
textDocument/didClose
textDocument/publishDiagnostics
textDocument/formatting
textDocument/rangeFormatting
textDocument/documentSymbol
textDocument/foldingRange
textDocument/selectionRange
textDocument/semanticTokens/full
workspace/didChangeConfiguration (configuration reload only; no complex workspace features)

v2

textDocument/onTypeFormatting (optional)
richer diagnostics and quick fixes (e.g., deprecated syntax hints)

Deferred (post-v2)

go-to-definition
references
rename
code actions requiring cross-file indexing

VS Code Extension Design

v1 Responsibilities

Register thrift language
Provide TextMate syntax highlighting (syntaxes/thrift.tmLanguage.json)
Start thriftls via vscode-languageclient
Manage thriftls installation/version selection (managed install tool flow) or use user-provided path
Route formatting requests to LSP
Expose settings:
- thrift.server.path
- thrift.server.args
- thrift.format.lineWidth
- thrift.trace.server

Non-goal in v1:

Implementing language semantics in the extension. All parsing/formatting/diagnostics logic lives in thriftls.

Binary Packaging Strategy

v1 decision (managed install):

Do not bundle thriftls binaries inside the .vsix by default.
Publish per-platform thriftls binaries as release artifacts.
VS Code extension downloads/installs the matching thriftls binary on demand (or via explicit command), similar to established Go tool installation flows.
Store managed binaries in extension-managed storage/cache.
Allow override via user-specified external path (thrift.server.path).
Optional in v1 if CI/toolchain is ready: Windows arm64 artifact publication.

Managed install contract (normative v1):

Extension downloads thriftls only from a trusted release manifest URL or user-configured override endpoint.
Manifest must include:
- manifest schema version
- tool version
- platform/arch tuple
- download URL
- SHA-256 checksum
- file size (bytes)
Default managed manifest/download endpoints must use HTTPS; non-HTTPS endpoints are allowed only via explicit user override for development or air-gapped mirrors.
Extension verifies checksum before install and rejects mismatches.
Install/update is atomic:
- download to temp file
- verify checksum
- replace managed binary via atomic rename where supported
- preserve last-known-good binary for rollback on failed update
Archive extraction (if used) must reject path traversal entries and unexpected file layouts.
Extension must clearly surface offline/download/verification errors and allow manual thrift.server.path fallback.
Artifact signing/provenance verification (e.g., signatures/attestations) is recommended and may be added before beta if release automation is ready; v1 minimum requirement is checksum verification.
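The extension itself would implement this in TypeScript. Purely to illustrate the size and checksum steps of the contract (for example as they might appear in a Go-based CI smoke check of the release manifest), a sketch follows. The ManifestEntry field names are hypothetical and do not define the manifest schema.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
)

// ManifestEntry is a hypothetical in-memory form of one manifest row.
type ManifestEntry struct {
	Version  string
	Platform string // e.g. "linux/amd64"
	URL      string
	SHA256   string // lowercase or uppercase hex digest
	Size     int64  // bytes
}

// Checksum returns the hex-encoded SHA-256 digest of data.
func Checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// VerifyArtifact checks the downloaded bytes against the manifest entry.
// It must succeed before the binary is moved into its final location.
func VerifyArtifact(data []byte, e ManifestEntry) error {
	if int64(len(data)) != e.Size {
		return errors.New("artifact size does not match manifest")
	}
	want, err := hex.DecodeString(e.SHA256)
	if err != nil || len(want) != sha256.Size {
		return errors.New("manifest checksum is not a valid SHA-256 hex digest")
	}
	got := sha256.Sum256(data)
	if subtle.ConstantTimeCompare(got[:], want) != 1 {
		return errors.New("artifact checksum does not match manifest")
	}
	return nil
}

func main() {
	data := []byte("pretend this is a thriftls binary")
	entry := ManifestEntry{
		Version:  "0.0.0-example",
		Platform: "linux/amd64",
		SHA256:   Checksum(data),
		Size:     int64(len(data)),
	}
	fmt.Println(VerifyArtifact(data, entry))
	fmt.Println(VerifyArtifact(append(data, '!'), entry))
}
```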

Tradeoffs:

Managed install keeps .vsix small and aligns with established Go tooling UX
Requires robust download/version/checksum handling in extension
External path still provides enterprise/offline escape hatch

Semantic Highlighting Strategy

v1: TextMate baseline plus textDocument/semanticTokens/full from thriftls
v2: expand semantic-token quality/coverage as needed; keep TextMate as fallback

CLI Design (`thriftfmt`)

Commands and Flags

Primary usage:

thriftfmt path/to/file.thrift
thriftfmt --write path/to/file.thrift
thriftfmt --check path/to/file.thrift
thriftfmt --stdin --assume-filename foo.thrift
thriftfmt --line-width 100

Flags:

--write, -w: write result in-place
--check: non-zero exit if changes would be made
--stdin: read source from stdin
--stdout: explicit stdout (default if no -w)
--assume-filename: URI/name for diagnostics and parser context
--line-width: max width
--range start:end: optional in the v1 CLI (range formatting is required in the engine API but not in the CLI)
- v1 syntax (if implemented): byte offsets, half-open [start,end), zero-based (e.g. --range 120:240)
- future line/column syntax, if added, must use a distinct flag to avoid ambiguity
--debug-tokens
--debug-cst
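If the --range flag is implemented with the byte-offset syntax described above, parsing might look like this sketch. ParseRange is a hypothetical helper, and rejecting ranges that extend past the end of the source is an assumption rather than a stated rule.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Span is a half-open byte range [Start, End), zero-based.
type Span struct{ Start, End int }

// ParseRange parses "start:end" into a Span and validates it against
// the length of the source being formatted.
func ParseRange(arg string, srcLen int) (Span, error) {
	startStr, endStr, ok := strings.Cut(arg, ":")
	if !ok {
		return Span{}, errors.New("range must have the form start:end")
	}
	start, err := strconv.Atoi(startStr)
	if err != nil {
		return Span{}, fmt.Errorf("invalid range start: %w", err)
	}
	end, err := strconv.Atoi(endStr)
	if err != nil {
		return Span{}, fmt.Errorf("invalid range end: %w", err)
	}
	if start < 0 || end < start || end > srcLen {
		return Span{}, errors.New("range is out of bounds")
	}
	return Span{Start: start, End: end}, nil
}

func main() {
	fmt.Println(ParseRange("120:240", 500))
	fmt.Println(ParseRange("240:120", 500))
}
```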

Exit Codes

0: success; no changes (or write success)
1: formatting changes required in --check
2: syntax errors prevented formatting
3: internal error

Input/output conflict rules (normative):

--write and --stdin may not be used together
--check and --write may not be used together
formatting multiple files in one invocation is deferred unless explicitly added later
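The conflict rules above can be enforced right after flag parsing. The sketch below uses the standard library flag package, which accepts both -flag and --flag spellings. The Config type and ParseArgs function are hypothetical, only a subset of the flags is registered, and requiring exactly one file when --stdin is absent is an assumption.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// Config holds the subset of thriftfmt flags used in this sketch.
type Config struct {
	Write, Check, Stdin bool
	LineWidth           int
	Files               []string
}

// ParseArgs parses command-line arguments and applies the conflict rules.
func ParseArgs(args []string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("thriftfmt", flag.ContinueOnError)
	fs.BoolVar(&c.Write, "write", false, "write result in-place")
	fs.BoolVar(&c.Write, "w", false, "shorthand for --write")
	fs.BoolVar(&c.Check, "check", false, "non-zero exit if changes would be made")
	fs.BoolVar(&c.Stdin, "stdin", false, "read source from stdin")
	fs.IntVar(&c.LineWidth, "line-width", 100, "max width")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	c.Files = fs.Args()
	switch {
	case c.Write && c.Stdin:
		return c, errors.New("--write and --stdin may not be used together")
	case c.Check && c.Write:
		return c, errors.New("--check and --write may not be used together")
	case len(c.Files) > 1:
		return c, errors.New("formatting multiple files in one invocation is not supported")
	case !c.Stdin && len(c.Files) == 0:
		return c, errors.New("no input file given (use --stdin to read from stdin)")
	}
	return c, nil
}

func main() {
	fmt.Println(ParseArgs([]string{"--check", "service.thrift"}))
	fmt.Println(ParseArgs([]string{"--write", "--stdin"}))
}
```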

Error Handling and Recovery Policy

CLI

By default, refuse formatting if syntax tree is too broken to ensure safe output
Emit syntax diagnostics to stderr
Return exit code 2

LSP

Always attempt parse and publish diagnostics
Formatting handlers may:
- return no edits when already formatted
- return LSP RequestFailed / ContentModified when unsafe or stale
Never crash on malformed input

Safety Threshold for Formatting

Formatter may refuse when:

unterminated block/string causes tokenization desync
root tree is mostly recovery/error nodes
selected range cannot be widened to a format-safe ancestor

Exact thresholds should be documented in docs/formatting-style.md and covered by tests.

Minimum v1 threshold policy (to avoid implementation ambiguity):

full-document formatting is allowed if lexer reaches EOF and root parse tree exists, even with recoverable parse diagnostics, unless unterminated string/block comment prevents reliable tokenization
range formatting requires a format-safe ancestor with fully bounded token coverage
if refusal occurs, diagnostics must indicate the blocking region when possible

Performance Targets (Beta)

Targets are for local editor interaction and CI formatting runs.

Parse + diagnostics for typical files (<2k LOC): p95 <50 ms on reference hardware (warm)
Full document format for typical files: p95 <100 ms on reference hardware (warm)
didChange handling and diagnostic refresh: perceived as responsive under normal typing (debounce allowed); target p95 <75 ms parse+diagnostics on typical files after debounce
No unbounded memory growth across repeated open/change/close cycles in LSP session

These targets are non-binding for v1 overall, but meeting them is required for beta sign-off.

Measurement rules (required for beta sign-off):

Publish benchmark corpus definitions (at least: small, typical, large Apache Thrift files; malformed-file set).
Record hardware/OS baseline for reported numbers in CI or release notes.
Report p50/p95 latency for parse and format benchmarks.
Track steady-state RSS (or equivalent process memory metric) during repeated LSP open/change/close test loops.
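For reporting p50/p95, one simple and reproducible definition is the nearest-rank percentile. The sketch below assumes latency samples in whole milliseconds and a hypothetical Percentile helper; the RFC does not prescribe a percentile method, so this choice is only an example.

```go
package main

import (
	"fmt"
	"sort"
)

// Percentile returns the nearest-rank p-th percentile (1 <= p <= 100) of the
// samples, or 0 when there are no samples. The input slice is not modified.
func Percentile(samplesMs []int, p int) int {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]int(nil), samplesMs...)
	sort.Ints(sorted)
	// Integer ceiling of p*n/100 avoids floating-point rounding surprises.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	samples := []int{12, 9, 48, 11, 10, 13, 95, 14, 10, 12}
	fmt.Println("p50:", Percentile(samples, 50), "ms")
	fmt.Println("p95:", Percentile(samples, 95), "ms")
}
```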

Testing Strategy

1. Unit Tests

lexer tokenization and trivia capture
UTF-16 position mapping
parser node wrappers and queries
formatter doc-printer behavior
LSP handler utilities

2. Golden Tests

input.thrift -> expected.thrift
idempotence: fmt(fmt(x)) == fmt(x)
comment preservation fixtures
malformed syntax recovery fixtures
range-format widening fixtures
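The idempotence property fmt(fmt(x)) == fmt(x) can be checked with a small helper that works for any formatter function. In the sketch below, CheckIdempotent and FormatFunc are hypothetical names, and the two toy formatters exist only to show a passing and a failing case.

```go
package main

import (
	"bytes"
	"fmt"
)

// FormatFunc is any function that formats a whole document.
type FormatFunc func(src []byte) ([]byte, error)

// CheckIdempotent formats src twice and reports an error if the second
// pass changes the output of the first.
func CheckIdempotent(format FormatFunc, src []byte) error {
	once, err := format(src)
	if err != nil {
		return fmt.Errorf("first pass: %w", err)
	}
	twice, err := format(once)
	if err != nil {
		return fmt.Errorf("second pass: %w", err)
	}
	if !bytes.Equal(once, twice) {
		return fmt.Errorf("not idempotent: first pass %q, second pass %q", once, twice)
	}
	return nil
}

// trimTrailing is a toy formatter that strips trailing spaces and tabs
// from every line. It is idempotent.
func trimTrailing(src []byte) ([]byte, error) {
	lines := bytes.Split(src, []byte("\n"))
	for i, l := range lines {
		lines[i] = bytes.TrimRight(l, " \t")
	}
	return bytes.Join(lines, []byte("\n")), nil
}

// appendSpace is a deliberately broken toy formatter: every pass grows
// the output, so it can never be idempotent.
func appendSpace(src []byte) ([]byte, error) {
	return append(append([]byte(nil), src...), ' '), nil
}

func main() {
	src := []byte("struct Foo {  \n}\t\n")
	fmt.Println(CheckIdempotent(trimTrailing, src))
	fmt.Println(CheckIdempotent(appendSpace, src))
}
```

Running such a check over every golden fixture and corpus file catches layout rules whose output the formatter itself would reformat.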

3. Corpus Tests

Parse large sets of real-world .thrift files
Include compiler fixtures and custom edge-case corpus

4. Compatibility Oracle Tests

Validate formatted output parses with the official Apache Thrift compiler (thrift)
CI job should fail if formatter emits syntax not accepted by official compiler

Version pinning requirement:

CI must pin the oracle compiler version (container image or released binary) to avoid silent behavior drift.
A separate scheduled job may run against latest upstream for early-warning compatibility signals.

5. LSP Integration Tests

didOpen diagnostics
versioned didChange ordering
formatting and range formatting responses
cancellation handling
UTF-16 edit correctness
initialize capability advertisement matches implemented handlers
formatting request failure semantics (RequestFailed, ContentModified) are covered by integration tests

6. VS Code Smoke Tests

extension activation
server launch
diagnostics visible
formatting command works
syntax highlighting grammar loads
managed thriftls install/update flow works against test manifest
checksum verification failure is surfaced and blocks activation of managed binary

7. Fuzz / Robustness

fuzz lexer and parser for panics/crashes
fuzz formatting on arbitrary token streams/trees (best effort)

CI / Release Plan

CI Required Jobs

go test ./...
golangci-lint (or equivalent)
parser generation drift check (tree-sitter generated files committed and up to date)
corpus parse tests
golden formatter tests
compatibility oracle tests (with thrift compiler installed in job)
VS Code extension build smoke test
cross-platform binary build smoke (at least compile)
release manifest/checksum generation and verification smoke (for managed thriftls install flow)

Recommended additions (required before beta):

race detector run for LSP/document-store packages (go test -race on supported CI runners)
VS Code extension integration smoke against a packaged .vsix

Release Artifacts

thriftfmt binaries (macOS/Linux/Windows)
thriftlint binaries (macOS/Linux/Windows)
thriftls binaries (macOS/Linux/Windows)
thriftls release manifest (machine-readable platform matrix + checksums)
checksums file (SHA-256) for published binaries/artifacts
VS Code extension package (.vsix)
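
The checksums file can be verified with a few lines of standard-library code. The sketch below assumes the file uses the common `sha256sum` layout (`<hex digest>  <file name>` per line). The RFC does not fix that format, and the artifact name in the usage note is hypothetical.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// parseChecksums reads "<hex>  <name>" lines into a name -> digest map.
func parseChecksums(text string) (map[string]string, error) {
	sums := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 || len(fields[0]) != sha256.Size*2 {
			return nil, fmt.Errorf("malformed checksum line: %q", line)
		}
		// sha256sum marks binary-mode entries with a leading "*".
		sums[strings.TrimPrefix(fields[1], "*")] = strings.ToLower(fields[0])
	}
	return sums, sc.Err()
}

// verifyArtifact checks data against the digest recorded for name.
func verifyArtifact(sums map[string]string, name string, data []byte) error {
	want, ok := sums[name]
	if !ok {
		return fmt.Errorf("no checksum recorded for %s", name)
	}
	got := sha256.Sum256(data)
	if hex.EncodeToString(got[:]) != want {
		return fmt.Errorf("checksum mismatch for %s", name)
	}
	return nil
}
```

A release smoke job could call `verifyArtifact(sums, "thriftls-linux-amd64", data)` for every published file and fail on the first error, which also catches artifacts that are missing from the checksums file.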

Versioning

SemVer across CLI, LSP server, and VS Code extension
v1 uses a shared repo version; the VS Code extension version tracks the repo release version
Release versions are proposed by a bot-managed release PR derived from merged Conventional Commit-style PR titles
The release PR is the source of truth for version bumps; merging it creates the vX.Y.Z tag that starts the publish workflow
VS Code extension user-facing notes are maintained in editors/vscode/CHANGELOG.md under Unreleased and rolled into the released version during release preparation
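
To illustrate how merged PR titles can determine the proposed version, the sketch below maps titles to a SemVer bump using the usual Conventional Commits convention (`!` marks a breaking change, `feat` is minor, `fix` is patch). This mapping is an assumption. The actual release bot may apply different rules, for example before 1.0.

```go
package main

import (
	"regexp"
)

type bump int

const (
	bumpNone bump = iota
	bumpPatch
	bumpMinor
	bumpMajor
)

// titleRE matches "type(scope)!: subject"; the scope and "!" are optional.
var titleRE = regexp.MustCompile(`^([a-z]+)(\([^)]*\))?(!)?: .+`)

// classify returns the bump implied by a single PR title.
func classify(title string) bump {
	m := titleRE.FindStringSubmatch(title)
	if m == nil {
		return bumpNone // not a Conventional Commit-style title
	}
	switch {
	case m[3] == "!":
		return bumpMajor
	case m[1] == "feat":
		return bumpMinor
	case m[1] == "fix":
		return bumpPatch
	}
	return bumpNone
}

// releaseBump returns the largest bump implied by any merged title.
func releaseBump(titles []string) bump {
	highest := bumpNone
	for _, t := range titles {
		if b := classify(t); b > highest {
			highest = b
		}
	}
	return highest
}
```

With this mapping, a release containing `fix: ...` and `feat(lsp): ...` titles would propose a minor bump, and titles such as `docs: ...` alone would propose none.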

Milestones and Acceptance Criteria

M0: Foundation

Scope:

repo scaffold
CI skeleton
test harness
RFC + architecture docs

Acceptance criteria:

repository structure exists and builds
CI runs lint + unit test placeholders
golden test harness can execute sample fixtures
tree-sitter parser generation script stubbed and documented
chosen Go tree-sitter binding and version are pinned in repo docs/build files

M1: Parsing MVP

Scope:

lossless lexer
tree-sitter grammar v1
CST wrapper
syntax diagnostics

Acceptance criteria:

parser returns Tree with tokens and nodes for valid fixtures
parser returns recoverable diagnostics for invalid fixtures
no panics on corpus parse test
node spans map correctly to source bytes and LSP positions (tests)
at least top-level declarations and members are represented in CST query APIs
parser/lexer alignment invariants are enforced by dedicated tests

M2: Formatter MVP

Scope:

full-document formatter
comment preservation
CLI thriftfmt

Acceptance criteria:

supports includes/namespaces/typedefs/enums/consts/structs/exceptions/services
formatter is idempotent on formatter corpus
comments are preserved in output (golden tests)
--check exit codes behave as specified
formatted output parses with official thrift compiler across corpus subset
formatter refusal behavior (ErrUnsafeToFormat or equivalent) is finalized and tested
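
The idempotency criterion can be checked mechanically: formatting already-formatted output must change nothing. A minimal sketch of such a check follows. `formatFunc` is an assumed shape for the formatter entry point, and `toyFormat` is a placeholder so the example is self-contained. Neither is the real `internal/format` API.

```go
package main

import (
	"fmt"
	"strings"
)

// formatFunc is the shape assumed here for the formatter entry point.
type formatFunc func(src string) (string, error)

// checkIdempotent formats src twice and reports any difference
// between the first and second pass.
func checkIdempotent(format formatFunc, src string) error {
	once, err := format(src)
	if err != nil {
		return fmt.Errorf("first pass: %w", err)
	}
	twice, err := format(once)
	if err != nil {
		return fmt.Errorf("second pass: %w", err)
	}
	if once != twice {
		return fmt.Errorf("not idempotent:\nfirst:  %q\nsecond: %q", once, twice)
	}
	return nil
}

// toyFormat is a placeholder formatter: it strips trailing whitespace
// from the input and from each line, then ends with a single newline.
func toyFormat(src string) (string, error) {
	lines := strings.Split(strings.TrimRight(src, " \t\r\n"), "\n")
	for i, l := range lines {
		lines[i] = strings.TrimRight(l, " \t\r")
	}
	return strings.Join(lines, "\n") + "\n", nil
}
```

Running the check over every file in the formatter corpus, rather than only over golden fixtures, helps surface inputs where a second pass still moves comments or blank lines.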

M3: LSP MVP

Scope:

diagnostics + formatting + range formatting
document symbols/folding/selection ranges

Acceptance criteria:

thriftls handles LSP open/change/close lifecycle without crashes
diagnostics update on edits for valid and invalid files
textDocument/formatting and rangeFormatting return valid edits
range formatting widens to safe ancestors and is covered by tests
document symbols and folding ranges are returned for core declarations
formatting request failure semantics (RequestFailed, ContentModified) are covered by integration tests
initialize advertises only implemented v1 capabilities
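
The "widens to safe ancestors" criterion can be pictured as a search for the smallest enclosing node that the formatter is allowed to reprint on its own. The sketch below uses a toy `node` type with a hypothetical `safe` flag. The real CST wrapper and the rules for what counts as safe are defined elsewhere in this RFC and are not reproduced here.

```go
package main

// node is a toy CST node covering the byte span [start, end).
type node struct {
	kind       string
	start, end int
	safe       bool // hypothetical: true if the node can be formatted in isolation
	children   []*node
}

// widen returns the smallest safe node whose span fully contains
// [from, to), or nil if no such node exists.
func widen(n *node, from, to int) *node {
	if n == nil || from < n.start || to > n.end {
		return nil
	}
	// Prefer the deepest candidate: try children before the node itself.
	for _, c := range n.children {
		if got := widen(c, from, to); got != nil {
			return got
		}
	}
	if n.safe {
		return n
	}
	return nil
}
```

A `nil` result corresponds to the case where the server cannot find a safe region and should return no edits for the request.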

M4: VS Code Extension MVP

Scope:

syntax highlighting
LSP client integration
server binary management/install flow (managed install)

Acceptance criteria:

opening .thrift file activates extension
syntax highlighting works with TextMate grammar
diagnostics and formatting work via extension-managed thriftls install (or configured external path)
managed install validates manifest/checksum and preserves last-known-good binary on failed update
offline/download/verification failures produce actionable user-facing errors and do not corrupt existing managed binary
extension works on macOS/Linux/Windows in smoke tests
user can override server path via settings

M5: Hardening and Beta

Scope:

performance tuning
crash hardening and release automation

Acceptance criteria:

beta performance targets met on representative corpora
no known crashers from fuzz/corpus suites
release pipeline produces signed/publishable artifacts (or documented unsigned process)
user documentation covers install, format, and VS Code setup

Decision Log and Remaining Questions

Project hosting and governance:
- Resolved for v1: start in github.com/kpumuk/thrift-weaver and evaluate upstreaming later.
tree-sitter distribution policy:
- Superseded by RFC 0002: ship embedded wasm parser artifacts and pure-Go (CGO_ENABLED=0) binaries; no cgo parser backend remains.
- Follow-up: keep wasm artifact drift and runtime ABI checks green in CI.
Formatter v1 style strictness:
- Resolved for v1: preserve separator lexemes and deprecated spellings by default; canonicalize whitespace/indentation only.
Library API stability:
- Resolved for v1: keep implementation packages internal until post-beta; no public/stable Go library API commitment in v1.
Invalid-code formatting policy in editors:
- Resolved for v1: fail closed (no edits + explicit error) unless formatting is provably safe.
Release orchestration:
- Resolved for v1: use release PR automation driven by Conventional Commit-style PR titles, then publish from the created vX.Y.Z tag.
VS Code changelog ownership:
- Resolved for v1: keep a curated in-repo changelog for extension-user-visible changes and roll Unreleased into the released version during release preparation.

Remaining non-blocking question (can be decided in M3/M4):

Linux managed binary compatibility policy for VS Code extension:
- Resolved direction: follow managed install/distribution patterns rather than bundling.
- Remaining detail (M3/M4): define Linux binary baseline(s) and fallback guidance (glibc floor and/or alternate artifacts).

No M0-blocking open questions remain.

Immediate Implementation Decisions (M0, Resolved)

These are narrower than the questions above and directly blocked scaffolding work; all are now resolved:

Resolved: repository home/module path starts at github.com/kpumuk/thrift-weaver
Superseded by RFC 0002: use embedded tree-sitter wasm with wazero; keep tree-sitter core/runtime sources vendored for wasm generation
Resolved: use RFC v1 default style profile and preserve separators/deprecated spellings
Resolved: LSP invalid-format behavior defaults to fail-closed
Resolved: Windows arm64 artifact publication is supported in the pure-Go release matrix
Resolved: release automation reads Conventional Commit-style PR titles from squash merges; individual local commits remain unconstrained
Resolved: the shared repo version also updates editors/vscode/package.json and the root package-lock metadata fields during release preparation

Alternatives Considered

A. Extend Existing C++ Compiler Frontend

Rejected for this project scope because:

frontend discards trivia and normalizes syntax too early for formatter needs
significant refactor would be required
editor/LSP incremental parsing remains unsolved

B. Handwritten Go Parser (No `tree-sitter`)

Deferred (possible future alternative) because the trade-off does not currently favor it:

benefit: simpler pure-Go distribution
cost: higher risk and implementation time for error recovery and incremental/LSP-friendly behavior

C. Formatter Only (No LSP)

Rejected because editor integration is a primary requirement and affects parser architecture choices from day one.

Rollout Plan

Implement engine and CLI first (thriftfmt) to stabilize formatting semantics.
Add thriftls on top of same engine.
Ship VS Code extension with managed thriftls install (plus external-path fallback).
Iterate on editor features (semantic tokens, code actions, navigation).

Appendix: Initial Implementation Order (Detailed)

internal/text (line index, UTF-16 conversions, byte edits)
internal/lexer (lossless tokens/trivia + tests)
grammar/tree-sitter-thrift skeleton + parser generation pipeline
internal/syntax parse wrapper + diagnostics + CST queries
internal/format doc printer + declaration formatting
cmd/thriftfmt + golden tests + compiler compatibility CI
internal/lsp core server + formatting/diagnostics handlers
editors/vscode extension with TextMate + managed-install thriftls
Hardening, performance, release automation
