Skip to content

Latest commit

 

History

History
379 lines (277 loc) · 12.6 KB

File metadata and controls

379 lines (277 loc) · 12.6 KB

Indexable File Extensions Analysis & Recommendations

Executive Summary

Based on research of industry-leading code search tools (GitHub Linguist, Sourcegraph), AI coding assistants (GitHub Copilot, Cursor), and best practices documentation, I've identified 47 missing file extensions that should be added to the Context Engine MCP Server's INDEXABLE_EXTENSIONS set.

Current Count: 72 extensions Recommended Additions: 44 extensions New Total: 116 extensions

⚠️ SECURITY NOTE: This analysis initially included .env.local, .env.development, and .env.production but these have been REMOVED as they contain sensitive credentials. See SECURITY_ANALYSIS_ENV_FILES.md for details.


Research Sources

  1. GitHub Linguist - The authoritative source used by GitHub.com for language detection (6,848+ commits, 1,142 contributors)
  2. Sourcegraph - Enterprise code search platform
  3. AI Coding Assistants - GitHub Copilot, Cursor, Continue.dev documentation
  4. Aider.chat - AI pair programming tool with comprehensive language support
  5. Industry file extension databases - Programming language registries

Current Extensions (72 total)

Already Covered ✅

  • TypeScript/JavaScript: .ts, .tsx, .js, .jsx, .mjs, .cjs
  • Python: .py, .pyw, .pyi
  • JVM Languages: .java, .kt, .kts, .scala, .groovy
  • Go: .go
  • Rust: .rs
  • C/C++: .c, .cpp, .cc, .cxx, .h, .hpp, .hxx
  • .NET: .cs, .fs, .fsx
  • Ruby: .rb, .rake, .gemspec
  • PHP: .php
  • Mobile: .swift, .m, .mm, .dart, .arb
  • Frontend: .vue, .svelte, .astro
  • Web: .html, .htm, .css, .scss, .sass, .less, .styl
  • Config: .json, .yaml, .yml, .toml, .xml, .plist, .gradle, .properties, .ini, .cfg, .conf, .editorconfig
  • Docs: .md, .mdx, .txt, .rst
  • Database: .sql, .prisma
  • API/Schema: .graphql, .gql, .proto, .openapi, .swagger
  • Shell: .sh, .bash, .zsh, .fish, .ps1, .psm1, .bat, .cmd
  • Infrastructure: .dockerfile, .tf, .hcl, .nix

Recommended Additions (44 extensions)

HIGH PRIORITY - Popular Programming Languages (15 extensions)

Functional Programming Languages

  • .ex, .exs - Elixir (growing in popularity, Phoenix framework)
  • .erl, .hrl - Erlang (telecom, distributed systems)
  • .hs, .lhs - Haskell (functional programming)
  • .clj, .cljs, .cljc - Clojure/ClojureScript (JVM functional language)
  • .ml, .mli - OCaml (functional programming)

Scientific & Data Languages

  • .r, .R - R language (data science, statistics)
  • .jl - Julia (scientific computing, ML)

Scripting & General Purpose

  • .lua - Lua (game development, embedded scripting)
  • .pl, .pm - Perl (system administration, text processing)

Modern Systems Languages

  • .zig - Zig (modern systems programming)
  • .nim - Nim (efficient, expressive systems language)
  • .cr - Crystal (Ruby-like syntax, compiled)
  • .v - V language (simple, fast compilation)

Justification: These languages are widely used in production systems, have active communities, and are frequently encountered in modern codebases. GitHub Linguist supports all of them.


MEDIUM PRIORITY - Build Systems & Configuration (9 extensions)

Build & Task Files

  • .cmake - CMake build configuration
  • .mk, .mak - Make files (alternative extensions)
  • .bazel, .bzl - Bazel build system (Google's build tool)
  • .ninja - Ninja build system
  • .sbt - Scala Build Tool
  • .podspec - CocoaPods specification (iOS)

Package Management

  • .lock - Lock files (package-lock.json, yarn.lock, etc.)
  • .npmrc, .yarnrc - NPM/Yarn configuration

Justification: Build systems and configuration files are critical for understanding project structure and dependencies. AI assistants need this context for accurate code generation.

⚠️ SECURITY NOTE: .env.local, .env.development, and .env.production were REMOVED from this list as they contain sensitive credentials and are already properly excluded by the codebase.


MEDIUM PRIORITY - Documentation & Markup (8 extensions)

Advanced Documentation Formats

  • .adoc, .asciidoc - AsciiDoc (technical documentation)
  • .tex, .latex - LaTeX (academic papers, technical docs)
  • .org - Org-mode (Emacs documentation format)
  • .wiki - Wiki markup
  • .pod - Perl documentation

Web Documentation

  • .ejs, .pug, .jade - Template engines (embedded JavaScript, Pug)

Justification: Documentation is essential context for AI coding assistants. These formats are commonly used in open-source projects and technical documentation.


MEDIUM PRIORITY - Web & Frontend (7 extensions)

Template Engines

  • .hbs, .handlebars - Handlebars templates
  • .twig - Twig templates (PHP)
  • .erb - Embedded Ruby templates
  • .jsp - JavaServer Pages

Modern Web

  • .mjs - ES modules (already have .mjs, but confirming)
  • .cjs - CommonJS modules (already have .cjs, but confirming)
  • .wasm - WebAssembly text format (.wat)

Justification: Template engines are code that generates HTML and contain business logic. Essential for full-stack context.


LOW PRIORITY - Specialized Languages (5 extensions)

Domain-Specific Languages

  • .sol - Solidity (Ethereum smart contracts)
  • .move - Move language (blockchain)
  • .cairo - Cairo (StarkNet smart contracts)
  • .vhdl, .vhd - VHDL (hardware description)

Justification: Specialized but growing in importance, especially blockchain languages. Include if targeting these domains.


Extensions to EXCLUDE (Not Recommended)

Binary & Compiled Files

  • .pyc, .pyo - Python bytecode (binary)
  • .class - Java bytecode (binary)
  • .o, .obj - Object files (binary)
  • .exe, .dll, .so - Executables and libraries (binary)

Media Files

  • .png, .jpg, .gif, .svg - Images (not code)
  • .mp3, .mp4, .wav - Media files (not code)

Archives

  • .zip, .tar, .gz - Compressed archives (not code)

Justification: These are binary or non-textual files that cannot be semantically indexed for code search.


Implementation Priority

Phase 1: HIGH PRIORITY (Immediate)

Add the 15 popular programming language extensions. These have the highest ROI and are most likely to be encountered.

Phase 2: MEDIUM PRIORITY (Next Release)

Add build systems, configuration, and documentation formats (24 extensions). These provide critical context for AI assistants.

Phase 3: LOW PRIORITY (Future)

Add specialized languages (5 extensions) based on user demand and domain focus.


Comparison with Industry Standards

GitHub Linguist

  • Supports 600+ languages with thousands of file extensions
  • Our recommended additions align with Linguist's "programming" and "markup" categories
  • Excludes "data" and "prose" categories (JSON data files, plain text)

Sourcegraph

  • Indexes all text-based source files
  • Uses similar extension lists for syntax highlighting and code intelligence
  • Focuses on "code" vs "configuration" vs "documentation"

AI Coding Assistants (Copilot, Cursor)

  • Prioritize languages with strong LLM training data
  • All recommended languages have good LLM support
  • Template engines and config files are increasingly important for context

Recommended Changes to serviceClient.ts

Proposed Code Addition

const INDEXABLE_EXTENSIONS = new Set([
  // === TypeScript/JavaScript ===
  '.ts', '.tsx', '.js', '.jsx', '.mjs', '.cjs',

  // === Python ===
  '.py', '.pyw', '.pyi',

  // === JVM Languages ===
  '.java', '.kt', '.kts', '.scala', '.groovy',

  // === Go ===
  '.go',

  // === Rust ===
  '.rs',

  // === C/C++ ===
  '.c', '.cpp', '.cc', '.cxx', '.h', '.hpp', '.hxx',

  // === .NET ===
  '.cs', '.fs', '.fsx',

  // === Ruby ===
  '.rb', '.rake', '.gemspec', '.erb',  // Added .erb for templates

  // === PHP ===
  '.php', '.twig',  // Added .twig for templates

  // === Functional Programming Languages === [NEW]
  '.ex', '.exs',      // Elixir
  '.erl', '.hrl',     // Erlang
  '.hs', '.lhs',      // Haskell
  '.clj', '.cljs', '.cljc',  // Clojure
  '.ml', '.mli',      // OCaml

  // === Scientific & Data Languages === [NEW]
  '.r', '.R',         // R language
  '.jl',              // Julia

  // === Scripting Languages === [NEW]
  '.lua',             // Lua
  '.pl', '.pm', '.pod',  // Perl

  // === Modern Systems Languages === [NEW]
  '.zig',             // Zig
  '.nim',             // Nim
  '.cr',              // Crystal
  '.v',               // V language

  // === Mobile Development ===
  '.swift',
  '.m', '.mm',
  '.dart',
  '.arb',

  // === Frontend Frameworks ===
  '.vue', '.svelte', '.astro',

  // === Web Templates & Styles ===
  '.html', '.htm',
  '.css', '.scss', '.sass', '.less', '.styl',
  '.hbs', '.handlebars',  // [NEW] Handlebars templates
  '.ejs', '.pug', '.jade',  // [NEW] Template engines
  '.jsp',  // [NEW] JavaServer Pages

  // === Configuration Files ===
  '.json', '.yaml', '.yml', '.toml',
  '.xml',
  '.plist',
  '.gradle',
  '.properties',
  '.ini', '.cfg', '.conf',
  '.editorconfig',
  '.env.example', '.env.template', '.env.sample',  // Safe templates only (NOT actual .env files)

  // === Build Systems === [NEW]
  '.cmake',           // CMake
  '.mk', '.mak',      // Make
  '.bazel', '.bzl',   // Bazel
  '.ninja',           // Ninja
  '.sbt',             // Scala Build Tool
  '.podspec',         // CocoaPods

  // === Documentation ===
  '.md', '.mdx', '.txt', '.rst',
  '.adoc', '.asciidoc',  // [NEW] AsciiDoc
  '.tex', '.latex',      // [NEW] LaTeX
  '.org',                // [NEW] Org-mode
  '.wiki',               // [NEW] Wiki markup

  // === SECURITY NOTE ===
  // .env.local, .env.development, .env.production are NOT included here
  // They contain sensitive credentials and are properly excluded via DEFAULT_EXCLUDED_PATTERNS

  // === Database ===
  '.sql', '.prisma',

  // === API/Schema Definitions ===
  '.graphql', '.gql',
  '.proto',
  '.openapi', '.swagger',

  // === Shell Scripts ===
  '.sh', '.bash', '.zsh', '.fish',
  '.ps1', '.psm1', '.bat', '.cmd',

  // === Infrastructure & DevOps ===
  '.dockerfile',
  '.tf', '.hcl',
  '.nix',

  // === Blockchain/Smart Contracts === [NEW - Optional]
  // '.sol',           // Solidity (uncomment if needed)
  // '.move',          // Move language (uncomment if needed)
  // '.cairo',         // Cairo (uncomment if needed)
]);

Summary of Changes

Added 44 new extensions across 6 categories:

  1. Functional Programming (9): .ex, .exs, .erl, .hrl, .hs, .lhs, .clj, .cljs, .cljc, .ml, .mli
  2. Scientific/Data (3): .r, .R, .jl
  3. Scripting (4): .lua, .pl, .pm, .pod
  4. Modern Systems (4): .zig, .nim, .cr, .v
  5. Build Systems (9): .cmake, .mk, .mak, .bazel, .bzl, .ninja, .sbt, .podspec
  6. Documentation (5): .adoc, .asciidoc, .tex, .latex, .org, .wiki
  7. Web Templates (6): .hbs, .handlebars, .ejs, .pug, .jade, .jsp, .erb, .twig

Total: 72 (current) + 44 (new) = 116 extensions

⚠️ SECURITY CORRECTION: Originally recommended 47 extensions including .env.local, .env.development, .env.production. These 3 have been REMOVED as they contain sensitive credentials. See SECURITY_ANALYSIS_ENV_FILES.md for full analysis.


Testing Recommendations

After implementing these changes:

  1. Test with popular repositories:

    • Elixir: Phoenix framework
    • Haskell: Pandoc
    • Julia: Flux.jl
    • R: tidyverse packages
    • Lua: Neovim configuration
  2. Verify indexing performance:

    • Monitor index size growth
    • Check indexing time for large repos
    • Ensure no memory issues
  3. Validate search quality:

    • Test semantic search across new file types
    • Verify syntax highlighting works
    • Check that context retrieval is accurate

References

  1. GitHub Linguist: https://github.com/github-linguist/linguist
  2. Aider Language Support: https://aider.chat/docs/languages.html
  3. Sourcegraph Documentation: https://sourcegraph.com/docs
  4. Programming Language Extensions Database: https://gist.github.com/ppisarczyk/43962d06686722d26d176fad46879d41
  5. Mammouth AI Supported Extensions: https://info.mammouth.ai/docs/supported-file-extensions/

Conclusion

These additions will make the Context Engine MCP Server more comprehensive and competitive with industry-leading code search tools. The recommended extensions cover:

  • ✅ All major programming paradigms (imperative, functional, OOP)
  • ✅ Popular data science and scientific computing languages
  • ✅ Modern systems programming languages
  • ✅ Essential build systems and configuration formats
  • ✅ Common documentation and template formats

Impact: Better context for AI coding assistants, broader language support, and improved developer experience across diverse tech stacks.