11# Largefile MCP Server
22
3- An MCP server that enables AI assistants to work with large files that exceed context limits.
3+ Navigate, search, and edit large codebases, logs, and data files that exceed AI context limits.
44
5- [![CI](https://img.shields.io/github/actions/workflow/status/peteretelej/largefile/ci.yml?branch=main&logo=github)](https://github.com/peteretelej/largefile/actions/workflows/ci.yml) [![codecov](https://codecov.io/gh/peteretelej/largefile/branch/main/graph/badge.svg)](https://codecov.io/gh/peteretelej/largefile) [![PyPI version](https://img.shields.io/pypi/v/largefile.svg)](https://pypi.org/project/largefile/) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
5+ [![CI](https://img.shields.io/github/actions/workflow/status/peteretelej/largefile/ci.yml?branch=main&logo=github)](https://github.com/peteretelej/largefile/actions/workflows/ci.yml) [![codecov](https://codecov.io/gh/peteretelej/largefile/branch/main/graph/badge.svg)](https://codecov.io/gh/peteretelej/largefile) [![PyPI version](https://img.shields.io/pypi/v/largefile.svg)](https://pypi.org/project/largefile/) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
66
7- Navigate, search, and edit files of any size without loading entire content into memory. Largefile provides targeted access to specific lines, patterns, and sections while maintaining file integrity using research-backed search/replace editing instead of error-prone line-based operations.
7+ ## Why Largefile?
88
9- Perfect for working with large codebases, generated files, logs, and datasets that would otherwise be inaccessible due to context window limitations.
10-
11- ## MCP Tools
12-
13- Five tools that work together for progressive file exploration:
14-
15- | Tool | Purpose |
16- | ------| ---------|
17- | ** ` get_overview ` ** | File structure with Tree-sitter semantic analysis, binary detection, and long line stats |
18- | ** ` search_content ` ** | Pattern search with fuzzy, regex, count-only, and invert matching modes |
19- | ** ` read_content ` ** | Targeted reading by offset/limit, pattern, tail mode, or head mode |
20- | ** ` edit_content ` ** | Batch search/replace editing with automatic backups and preview mode |
21- | ** ` revert_edit ` ** | Recover from bad edits by reverting to previous backup states |
9+ - **Go beyond context limits** - Read, search, and edit files too large to fit in AI context windows
10+ - **Semantic code navigation** - Tree-sitter extracts functions/classes for Python, JS/TS, Rust, Go
11+ - **Fewer LLM errors** - Search/replace editing eliminates line number mistakes common with line-based edits
12+ - **Smart search** - Fuzzy matching, regex, case-insensitive, inverted, and count-only modes
13+ - **No size limits** - Handles multi-GB files via tiered memory strategy (RAM → mmap → streaming)
2214
2315## Quick Start
2416
25- ** Prerequisite:** Install [ uv] ( https://docs.astral.sh/uv/getting-started/installation/ ) (an extremely fast Python package manager) which provides the ` uvx ` command.
26-
27- Add to your MCP configuration:
17+ **Prerequisite:** Install [uv](https://docs.astral.sh/uv/getting-started/installation/), which provides the `uvx` command.
2818
2919``` json
3020{
@@ -37,275 +27,110 @@ Add to your MCP configuration:
3727}
3828```
3929
40- ## Usage
30+ ## Tools
4131
42- Your AI Assistant / LLM can now work with very large files that exceed its context limits. Here are some common workflows:
32+ | Tool | Use For |
33+ | ---------------- | ------------------------------------------------------ |
34+ | `get_overview` | File structure and semantic outline before diving in |
35+ | `search_content` | Finding patterns, counting occurrences, regex matching |
36+ | `read_content` | Reading specific sections; tail/head modes for logs |
37+ | `edit_content` | Safe search/replace with automatic backups |
38+ | `revert_edit` | Recovering from bad edits |
4339
44- ### Analyzing Large Code Files
40+ ## When to Use Largefile
4541
46- ** AI Question :** _ "Can you analyze this large Django models file and tell me about the class structure and any potential issues? It's a large file so use largefile." _
42+ **Use when:**
4743
48- ** AI Assistant workflow:**
49-
50- 1 . Gets file overview to understand structure
51- 2 . Searches for classes and their methods
52- 3 . Looks for code issues like TODOs or long functions
53-
54- ``` python
55- # AI gets file structure
56- overview = get_overview(" /path/to/django-models.py" )
57- # Returns: 2,847 lines, 15 classes, semantic outline with Tree-sitter
44+ - File exceeds ~1000 lines or 100KB (supports multi-GB files)
45+ - Navigating large codebases with semantic structure
46+ - Analyzing log files (especially recent entries with tail mode)
47+ - Making search/replace edits across large files
48+ - Counting occurrences without loading full content
5849
59- # AI searches for all class definitions
60- classes = search_content(" /path/to/django-models.py" , " class " , max_results = 20 )
61- # Returns: Model classes with line numbers and context
50+ **Don't use for:**
6251
63- # AI examines specific class implementation
64- model_code = read_content(" /path/to/django-models.py" , pattern = " class User" , mode = " semantic" )
65- # Returns: Complete class definition with all methods
66- ```
52+ - Small files that fit in context (AI doesn't need help with those)
53+ - Binary files (images, executables, compressed)
6754
68- ### Working with Documentation
55+ ## Usage Examples
6956
70- ** AI Question: ** _ "Find all the installation methods mentioned in this README file and update the pip install to use uv instead." _
57+ ### Large Codebase Navigation
7158
72- ** AI Assistant workflow:**
59+ ``` python
60+ # Get semantic structure of a large Python file
61+ overview = get_overview("/path/to/large_module.py")
62+ # Returns: 2,847 lines, 15 classes, function outline via Tree-sitter
7363
74- 1 . Search for installation patterns
75- 2 . Read the installation section
76- 3 . Replace pip commands with uv equivalents
64+ # Find all class definitions
65+ classes = search_content("/path/to/large_module.py", "class ", fuzzy=False)
7766
78- ``` python
79- # AI finds installation instructions
80- install_sections = search_content(" /path/to/readme.md" , " install" , fuzzy = True , context_lines = 3 )
81-
82- # AI reads the installation section
83- install_content = read_content(" /path/to/readme.md" , pattern = " ## Installation" , mode = " semantic" )
84-
85- # AI replaces pip with uv
86- edit_result = edit_content(
87- " /path/to/readme.md" ,
88- changes = [{" search" : " pip install anthropic" , " replace" : " uv add anthropic" }],
89- preview = True
90- )
67+ # Read complete class with semantic chunking
68+ code = read_content("/path/to/large_module.py", pattern="class UserModel", mode="semantic")
9169```
9270
93- ### Debugging Large Log Files
94-
95- ** AI Question:** _ "Check this production log file for any critical errors in the last few thousand lines and show me the context around them. Use largefile mcp."_
96-
97- ** AI Assistant workflow:**
98-
99- 1 . Get log file overview
100- 2 . Read the last N lines efficiently with tail mode
101- 3 . Search for error patterns in recent entries
71+ ### Batch Refactoring
10272
10373``` python
104- # AI gets log file overview
105- overview = get_overview(" /path/to/production.log" )
106- # Returns: 150,000 lines, 2.1GB file size
107-
108- # AI reads the last 1000 lines efficiently (no need to know total line count)
109- recent = read_content(" /path/to/production.log" , limit = 1000 , mode = " tail" )
110- # Returns: Last 1000 lines without loading entire file
111-
112- # AI counts errors efficiently
113- error_count = search_content(" /path/to/production.log" , " ERROR" , count_only = True , fuzzy = False )
114- # Returns: {"count": 47, ...} without loading all content
74+ # Preview rename across file
75+ preview = edit_content("/path/to/api.py", changes=[
76+     {"search": "process_data", "replace": "transform_data"},
77+     {"search": "old_endpoint", "replace": "new_endpoint"}
78+ ], preview=True)
11579
116- # AI searches for critical errors with context
117- errors = search_content("/path/to/production.log", "CRITICAL", fuzzy=False, max_results=10)
80+ # Apply changes (creates automatic backup)
81+ result = edit_content("/path/to/api.py", changes=[...], preview=False)
11882
119- # AI examines context around each error
120- for error in errors["results"]:
121-     context = read_content("/path/to/production.log", offset=error["line_number"], limit=20)
122-     # Shows surrounding log entries for debugging
83+ # Undo if needed
84+ revert_edit("/path/to/api.py")
12385```
12486
125- ### Refactoring Code
126-
127- ** AI Question:** _ "I need to rename the function ` process_data ` to ` transform_data ` throughout this large codebase file. Can you help me do this safely?"_
128-
129- ** AI Assistant workflow:**
130-
131- 1 . Find all occurrences of the function
132- 2 . Preview changes to ensure accuracy
133- 3 . Apply changes with automatic backup
87+ ### Log Analysis
13488
13589``` python
136- # AI finds all usages
137- usages = search_content(" /path/to/codebase.py" , " process_data" , fuzzy = False , max_results = 50 )
138-
139- # AI previews the changes
140- preview = edit_content(
141- " /path/to/codebase.py" ,
142- changes = [{" search" : " process_data" , " replace" : " transform_data" }],
143- preview = True
144- )
145-
146- # AI applies changes after confirmation
147- result = edit_content(
148- " /path/to/codebase.py" ,
149- changes = [{" search" : " process_data" , " replace" : " transform_data" }],
150- preview = False
151- )
152- # Creates automatic backup before changes
153- ```
90+ # Get log file overview
91+ overview = get_overview("/var/log/app.log")
92+ # Returns: 150,000 lines, 2.1GB
15493
155- ### Batch Editing Multiple Patterns
94+ # Read last 500 lines efficiently
95+ recent = read_content("/var/log/app.log", limit=500, mode="tail")
15696
157- ** AI Question:** _ "Update all the deprecated API calls in this file - there are several different ones to change."_
97+ # Count errors without loading content
98+ error_count = search_content("/var/log/app.log", "ERROR", count_only=True, fuzzy=False)
15899
159- ** AI Assistant workflow:**
160-
161- 1 . Identify all deprecated patterns
162- 2 . Apply multiple changes atomically in one call
163-
164- ``` python
165- # AI applies multiple changes in a single atomic operation
166- result = edit_content(
167- " /path/to/api_client.py" ,
168- changes = [
169- {" search" : " client.get_user(" , " replace" : " client.fetch_user(" },
170- {" search" : " client.post_data(" , " replace" : " client.send_data(" },
171- {" search" : " client.delete_item(" , " replace" : " client.remove_item(" },
172- ],
173- preview = True
174- )
175- # Returns per-change results with success/failure status
176- # All changes applied atomically - partial success is reported
100+ # Find errors with regex
101+ errors = search_content("/var/log/app.log", r"ERROR.*timeout", regex=True)
177102```
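
The tail read in the example above avoids loading the whole log. One way such a read can work is sketched below, as an illustration only: `tail_lines` and its `block_size` parameter are hypothetical names, and this is not taken from Largefile's source. The sketch reads fixed-size blocks backwards from the end of the file until enough newlines have been buffered.

```python
import os


def tail_lines(path: str, limit: int, block_size: int = 64 * 1024) -> list[str]:
    """Return the last `limit` lines of a file (hypothetical helper, not Largefile's API)."""
    if limit <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Read earlier blocks until more than `limit` newlines are buffered,
        # so the oldest line we keep is guaranteed to be complete.
        while pos > 0 and data.count(b"\n") <= limit:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    # Keep only the last `limit` lines; any partial first line is discarded.
    return [line.decode("utf-8", errors="replace") for line in data.splitlines()[-limit:]]
```

Only the trailing blocks are read, so the cost depends on the number of lines requested rather than on the file size.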
178103
179- ### Recovering from Bad Edits
180-
181- ** AI Question:** _ "That last edit broke something. Can you undo it?"_
182-
183- ** AI Assistant workflow:**
184-
185- 1 . List available backups
186- 2 . Revert to previous state (current state is preserved as new backup)
187-
188- ``` python
189- # AI reverts to the most recent backup
190- result = revert_edit(" /path/to/broken_file.py" )
191- # Current state saved as backup, file restored to previous version
192-
193- # Or revert to a specific backup by ID
194- result = revert_edit(" /path/to/broken_file.py" , backup_id = " 20240115_143022" )
195- # Returns: available_backups list for reference
196- ```
197-
198- ### Advanced Search with Regex
199-
200- ** AI Question:** _ "Find all IP addresses in this server log file."_
201-
202- ** AI Assistant workflow:**
203-
204- ``` python
205- # AI uses regex mode to find IP address patterns
206- results = search_content(
207-     "/path/to/server.log",
208-     r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",
209-     regex=True,
210-     fuzzy=False,
211-     max_results=50
212- )
213-
214- # AI finds non-INFO lines (invert mode like grep -v)
215- non_info = search_content("/path/to/app.log", "INFO", invert=True, fuzzy=False)
216- ```
217-
218- ### Exploring API Documentation
219-
220- ** AI Question:** _ "What are all the available methods in this large API documentation file and can you show me examples of authentication?"_
221-
222- ** AI Assistant workflow:**
223-
224- 1 . Get document structure overview
225- 2 . Search for method definitions and auth patterns
226- 3 . Extract relevant code examples
227-
228- ``` python
229- # AI analyzes document structure
230- overview = get_overview(" /path/to/api-docs.md" )
231- # Returns: Section outline, headings, suggested search patterns
104+ ## Supported Languages
232105
233- # AI finds API methods
234- methods = search_content(" /path/to/api-docs.md" , " ###" , max_results = 30 )
235- # Returns: All method headings with context
106+ Tree-sitter semantic analysis for: **Python**, **JavaScript/JSX**, **TypeScript/TSX**, **Rust**, **Go**
236107
237- # AI searches for authentication examples
238- auth_examples = search_content(" /path/to/api-docs.md" , " auth" , fuzzy = True , context_lines = 5 )
239-
240- # AI reads complete authentication section
241- auth_section = read_content(" /path/to/api-docs.md" , pattern = " ## Authentication" , mode = " semantic" )
242- ```
108+ Other file types use text-based analysis with graceful fallback.
243109
244110## File Size Handling
245111
246- - ** Small files (<50MB)** : Memory loading with Tree-sitter AST caching
247- - ** Medium files (50-500MB)** : Memory-mapped access
248- - ** Large files (>500MB)** : Streaming processing
249- - ** Long lines (>1000 chars)** : Automatic truncation for display
250-
251- ## Supported Languages
252-
253- Tree-sitter semantic analysis for:
254-
255- - Python (.py)
256- - JavaScript/JSX (.js, .jsx)
257- - TypeScript/TSX (.ts, .tsx)
258- - Rust (.rs)
259- - Go (.go)
260-
261- Files without Tree-sitter support use text-based analysis with graceful degradation.
112+ | Size | Strategy |
113+ | -------- | --------------------------------------- |
114+ | < 50MB | Full memory loading with AST caching |
115+ | 50-500MB | Memory-mapped access |
116+ | > 500MB | Streaming (tail/head modes recommended) |
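
The size tiers in the table above can be read as a simple threshold check. The sketch below is a minimal illustration under assumptions: `choose_strategy` and the returned labels are hypothetical names, and the handling of sizes exactly at 50MB or 500MB is a guess based on the table, not Largefile's actual code.

```python
def choose_strategy(size_bytes: int, memory_threshold_mb: int = 50,
                    mmap_threshold_mb: int = 500) -> str:
    """Pick an access tier from a file size (hypothetical helper)."""
    mb = 1024 * 1024
    if size_bytes < memory_threshold_mb * mb:
        return "memory"      # < 50MB: full memory loading
    if size_bytes <= mmap_threshold_mb * mb:
        return "mmap"        # 50-500MB: memory-mapped access
    return "streaming"       # > 500MB: streaming
```

The two defaults mirror the `LARGEFILE_MEMORY_THRESHOLD_MB` and `LARGEFILE_MMAP_THRESHOLD_MB` values listed under Configuration.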
262117
263118## Configuration
264119
265- Configure via environment variables :
120+ Environment variables for tuning :
266121
267122``` bash
268- # File processing thresholds
269- LARGEFILE_MEMORY_THRESHOLD_MB=50 # Memory loading limit
270- LARGEFILE_MMAP_THRESHOLD_MB=500 # Memory mapping limit
271-
272- # Search settings
273- LARGEFILE_FUZZY_THRESHOLD=0.8 # Fuzzy match sensitivity (0.0-1.0)
274- LARGEFILE_MAX_SEARCH_RESULTS=20 # Result limit per search
275- LARGEFILE_CONTEXT_LINES=3 # Context lines around matches
276-
277- # Error recovery
278- LARGEFILE_SIMILAR_MATCH_LIMIT=3 # Similar matches shown on edit failure
279- LARGEFILE_SIMILAR_MATCH_THRESHOLD=0.6 # Min similarity for suggestions
280-
281- # Backup management
282- LARGEFILE_BACKUP_DIR=" ~/.largefile/backups" # Backup location
283- LARGEFILE_MAX_BACKUPS=10 # Backups retained per file
284-
285- # Batch editing
286- LARGEFILE_MAX_BATCH_CHANGES=50 # Max changes per batch call
287-
288- # Performance
289- LARGEFILE_ENABLE_TREE_SITTER=true # Semantic features
123+ LARGEFILE_MEMORY_THRESHOLD_MB=50 # RAM loading limit
124+ LARGEFILE_MMAP_THRESHOLD_MB=500 # Memory mapping limit
125+ LARGEFILE_FUZZY_THRESHOLD=0.8 # Match sensitivity (0.0-1.0)
126+ LARGEFILE_MAX_SEARCH_RESULTS=20 # Results per search
127+ LARGEFILE_BACKUP_DIR=~/.largefile/backups
290128```
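
As a rough sketch of how a process might consume these variables, the snippet below reads them with fallbacks. `load_config` and the dictionary keys are hypothetical, and treating the values shown above as defaults is an assumption rather than something confirmed from Largefile's source.

```python
import os


def load_config(env=None) -> dict:
    """Read LARGEFILE_* settings from a mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # Fallback values copied from the block above; assumed to be defaults.
        "memory_threshold_mb": int(env.get("LARGEFILE_MEMORY_THRESHOLD_MB", "50")),
        "mmap_threshold_mb": int(env.get("LARGEFILE_MMAP_THRESHOLD_MB", "500")),
        "fuzzy_threshold": float(env.get("LARGEFILE_FUZZY_THRESHOLD", "0.8")),
        "max_search_results": int(env.get("LARGEFILE_MAX_SEARCH_RESULTS", "20")),
        # Expand "~" so the backup path is usable as a real directory.
        "backup_dir": os.path.expanduser(
            env.get("LARGEFILE_BACKUP_DIR", "~/.largefile/backups")
        ),
    }
```

Passing a plain dictionary instead of `os.environ` makes it easy to check overrides without touching the real environment.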
291129
292- ## Key Features
293-
294- - ** Search/replace editing** - Eliminates LLM line number errors with fuzzy matching
295- - ** Batch operations** - Apply multiple changes atomically in one call
296- - ** Regex & invert search** - Powerful pattern matching with grep-like features
297- - ** Count-only mode** - Efficiently count matches without loading content
298- - ** Smart error recovery** - Failed edits show similar matches with suggestions
299- - ** Backup & revert** - Automatic backups with full revert capability
300- - ** Tail & head modes** - Read file endings/beginnings without full scan
301- - ** Binary detection** - Warns when files appear binary
302- - ** Semantic awareness** - Tree-sitter integration for code structure
303- - ** Memory efficient** - Handles files of any size via tiered access strategy
304-
305130## Documentation
306131
307132- [API Reference](docs/API.md) - Detailed tool documentation
308- - [Configuration Guide](docs/configuration.md) - Environment variables and tuning
309- - [Examples](docs/examples.md) - Real-world usage examples and workflows
310- - [Design Document](docs/design.md) - Architecture and implementation details
311- - [Contributing](docs/CONTRIBUTING.md) - Development setup and guidelines
133+ - [Configuration Guide](docs/configuration.md) - All environment variables
134+ - [Examples](docs/examples.md) - More workflow examples
135+ - [Design Document](docs/design.md) - Architecture details
136+ - [Contributing](docs/CONTRIBUTING.md) - Development setup