Infinite-Context Chat Storage Specification

Version: 1.1
Date: November 2025
License: MIT

Abstract

A specification for managing unlimited-length chat conversations with Large Language Models using graph database storage, vector embeddings, and intelligent context retrieval. This system enables conversations to persist indefinitely without performance degradation or context loss.

Table of Contents

  1. Overview
  2. Architecture
  3. Data Model
  4. Index Strategies
  5. Tool Interface
  6. Implementation Guide
  7. Configuration
  8. Performance Considerations

1. Overview

1.1 Problem Statement

LLM context windows are finite resources. Traditional chat implementations send entire conversation histories with each request, leading to:

  • Linear growth in token costs
  • Hard limits on conversation length
  • Wasted tokens on irrelevant historical messages
  • Forced conversation splits or truncation

1.2 Solution

Store all messages in a graph database with vector embeddings, presenting the LLM with a navigable index of available context. The LLM retrieves only what it needs through tool calls.

1.3 Key Principles

  • Retrievable State: Messages remain retrievable and editable after they are stored
  • Unlimited Storage: No artificial limits on message count
  • Selective Loading: LLM decides what context to retrieve
  • Adaptive Indexing: Different strategies for different scales

2. Architecture

2.1 Components

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Message   │────▶│   Embedding  │────▶│   Vector    │
│   Input     │     │   Model      │     │   Index     │
└─────────────┘     └──────────────┘     └─────────────┘
                            │
                            ▼
                    ┌──────────────┐
                    │  Neo4j Graph │
                    │   Database   │
                    └──────────────┘
                            │
                            ▼
                    ┌──────────────┐
                    │    Index     │
                    │   Builder    │
                    └──────────────┘
                            │
                            ▼
                    ┌──────────────┐
                    │      LLM     │
                    │  + Tools     │
                    └──────────────┘

2.2 Flow

  1. User sends message
  2. System generates embedding and stores in graph
  3. System searches for relevant historical context
  4. Index builder creates navigable context map
  5. LLM receives index + recent messages + current message
  6. LLM uses tools to retrieve specific messages as needed
  7. LLM generates response

3. Data Model

3.1 Message Node

CREATE (m:Message {
  id: String,                 // ULID or UUID
  content: String,            // Full message text
  snippet: String,            // Preview text
  role: String,               // 'user' | 'assistant' | 'system'
  timestamp: DateTime,        // ISO-8601
  parent_id: String,          // Parent message ID (nullable)
  embedding: List<Float>,     // Vector (1536 dimensions)
  token_count: Integer,       // Approximate tokens
  metadata: String,           // Extensible metadata (JSON-encoded; Neo4j properties cannot hold maps)
  edited: Boolean,            // Edit flag
  deleted: Boolean,           // Soft delete flag
  edit_history: List<String>, // Previous versions (one JSON-encoded record each)
  is_chunk: Boolean,          // True if part of chunked content
  chunk_index: Integer,       // Position in sequence (nullable)
  chunk_parent_id: String     // Original message ID (nullable)
})

3.2 Tool Call Node

CREATE (t:ToolCall {
  id: String,                 // ULID or UUID
  tool_name: String,          // Tool identifier
  arguments: String,          // JSON arguments
  result: String,             // JSON result
  timestamp: DateTime,        // ISO-8601
  message_id: String,         // Triggering message
  embedding: List<Float>,     // Vector embedding (required)
  token_count: Integer,       // Approximate tokens
  is_chunk: Boolean,          // True if result was chunked
  chunk_index: Integer,       // Position in sequence (nullable)
  chunk_parent_id: String     // Original tool call ID (nullable)
})

3.3 Content Chunking Strategy

For messages or tool results exceeding a configured threshold (default: 4000 tokens), implementations should chunk the content:

Chunking Rules:

  • Each chunk maintains the same role, timestamp, and parent relationships
  • Chunks are linked via chunk_parent_id to the logical message/tool call
  • chunk_index indicates position (0-based)
  • Each chunk receives its own embedding for granular retrieval
  • Snippet is generated from first chunk only

Retrieval Behavior:

  • Tools can retrieve individual chunks or all chunks of a parent
  • Index presents chunks as separate searchable units
  • LLM decides whether to load full content or specific chunks

Example Chunking:

Original: 10,000 token assistant response
Becomes:
  - Chunk 0: Tokens 0-3999 (chunk_parent_id: original_id)
  - Chunk 1: Tokens 4000-7999 (chunk_parent_id: original_id)
  - Chunk 2: Tokens 8000-9999 (chunk_parent_id: original_id)
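
The storage code in section 6 calls `chunkContent` and `estimateTokens` without defining them. The following is a minimal sketch of what they might look like, assuming one whitespace-separated word stands in for one token (a real implementation would use the embedding model's tokenizer). The `overlap` parameter and the `reassembleChunks` helper are hypothetical additions for illustration.

```typescript
// Assumption: one whitespace-separated word stands in for one token.
// A production system would count tokens with the model's own tokenizer.
function tokenize(text: string): string[] {
  return text.split(/\s+/).filter((word) => word.length > 0);
}

function estimateTokens(text: string): number {
  return tokenize(text).length;
}

// Split text into chunks of at most `threshold` tokens. Consecutive chunks
// share `overlap` tokens (hypothetical parameter, cf. chunk_overlap in 7.2).
function chunkContent(text: string, threshold: number, overlap = 0): string[] {
  if (threshold <= 0 || overlap < 0 || overlap >= threshold) {
    throw new RangeError("expected threshold > 0 and 0 <= overlap < threshold");
  }
  const tokens = tokenize(text);
  const step = threshold - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + threshold).join(" "));
    if (start + threshold >= tokens.length) break; // final chunk emitted
  }
  return chunks;
}

// Rebuild the token sequence by dropping the overlap from every later chunk.
// Expects chunks in chunk_index order, as produced by chunkContent.
function reassembleChunks(chunks: string[], overlap = 0): string {
  return chunks
    .map((chunk, i) =>
      i === 0 ? chunk : tokenize(chunk).slice(overlap).join(" ")
    )
    .join(" ");
}
```

Under the word-per-token assumption, a 10,000-word input with a threshold of 4000 and no overlap yields three chunks of 4000, 4000, and 2000 words, matching the example above. Whitespace is normalized to single spaces in the process.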

3.4 Relationships

// Message chain
(child:Message)-[:REPLIES_TO]->(parent:Message)

// Chunk relationships
(chunk:Message)-[:CHUNK_OF]->(parent:Message)

// Tool calls
(tool:ToolCall)-[:CALLED_BY]->(message:Message)

// Tool chunk relationships
(chunk:ToolCall)-[:CHUNK_OF]->(parent:ToolCall)

// Topic clustering (optional, implementation-defined)
(message:Message)-[:BELONGS_TO]->(topic:Topic)
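
Thread navigation amounts to following REPLIES_TO edges from a message back toward the start of the conversation. As an illustration only, here is an in-memory sketch of that walk. The `ThreadNode` type and the reading of `depth` as "maximum number of messages returned, including the starting one" are assumptions; in Neo4j the same walk would be a variable-length path match.

```typescript
interface ThreadNode {
  id: string;
  parentId: string | null; // mirrors parent_id / REPLIES_TO
}

// Walk from a message up through its ancestors, oldest first.
function getConversationThread(
  nodes: Map<string, ThreadNode>,
  messageId: string,
  depth: number = Infinity
): ThreadNode[] {
  const thread: ThreadNode[] = [];
  const visited = new Set<string>(); // guards against accidental cycles
  let current = nodes.get(messageId);
  while (current !== undefined && thread.length < depth && !visited.has(current.id)) {
    thread.push(current);
    visited.add(current.id);
    current = current.parentId === null ? undefined : nodes.get(current.parentId);
  }
  return thread.reverse();
}
```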

3.5 Indexes

// Unique constraints
CREATE CONSTRAINT message_id_unique
FOR (m:Message) REQUIRE m.id IS UNIQUE;

CREATE CONSTRAINT toolcall_id_unique
FOR (t:ToolCall) REQUIRE t.id IS UNIQUE;

// Performance indexes
CREATE INDEX message_timestamp
FOR (m:Message) ON (m.timestamp);

CREATE INDEX message_role
FOR (m:Message) ON (m.role);

CREATE INDEX message_chunk_parent
FOR (m:Message) ON (m.chunk_parent_id);

CREATE INDEX toolcall_timestamp
FOR (t:ToolCall) ON (t.timestamp);

CREATE INDEX toolcall_chunk_parent
FOR (t:ToolCall) ON (t.chunk_parent_id);

// Vector indexes
CREATE VECTOR INDEX message_embeddings
FOR (m:Message) ON (m.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
  }
};

CREATE VECTOR INDEX toolcall_embeddings
FOR (t:ToolCall) ON (t.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
  }
};
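
For intuition about what the vector indexes rank by, here is a brute-force, in-memory sketch of cosine-similarity search. It is not how the database executes the query (a vector index searches approximately rather than scanning every node), and the scores a database reports may be scaled differently from the raw cosine value. The `Embedded` type and `topK` helper are illustrative names.

```typescript
interface Embedded {
  id: string;
  embedding: number[];
}

// Raw cosine similarity in [-1, 1]; returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item against the query and keep the k best matches.
function topK(query: number[], items: Embedded[], k: number) {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```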

4. Index Strategies

4.1 Index Configuration

interface IndexConfig {
  snippetLength: number; // Characters per snippet (default: 100)
  snippetStrategy: "first" | "semantic_core" | "summary";
  maxIndexTokens: number; // Token budget for index (default: 10000)
  indexStrategy:
    | "adaptive"
    | "recent_plus_relevant"
    | "clustered"
    | "hierarchical";
  clusteringThreshold: number; // Similarity threshold (0.0-1.0)
  minClusterSize: number; // Minimum cluster size
  recentWindowSize: number; // Always-included recent messages
  chunkThreshold: number; // Token count to trigger chunking (default: 4000)
  includeToolCalls: boolean; // Include tool calls in search (default: true)
}

4.2 Adaptive Strategy

The system automatically selects the appropriate index format based on result count:

Result Count   Strategy       Description
0-50           Full           Complete messages shown
51-500         Snippet        Recent full + historical snippets
501-5000       Clustered      Semantic groups with summaries
5000+          Hierarchical   Multi-level navigation structure
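
The selection rule can be sketched as a simple threshold function. This sketch treats each upper bound as inclusive, so exactly 5000 results still selects the clustered format (an assumption, since the table lists 5000 in both of its last two rows). The `IndexFormat` type and function name are illustrative.

```typescript
type IndexFormat = "full" | "snippet" | "clustered" | "hierarchical";

// Pick an index format from the number of search results.
function selectIndexFormat(resultCount: number): IndexFormat {
  if (resultCount <= 50) return "full";
  if (resultCount <= 500) return "snippet";
  if (resultCount <= 5000) return "clustered";
  return "hierarchical";
}
```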

4.3 Index Format Examples

Small Result Set (0-50 messages)

Recent Conversation:
[user]: What's our API authentication strategy?
[assistant]: We're using OAuth2 with JWT tokens...

Historical Context:
[d8f3a2b1] Discussion about API rate limiting...
[7c4e9f2d] OAuth2 implementation details...

Medium Result Set (51-500 messages)

Recent Conversation:
[last 10 messages shown in full]

Historical Matches (retrieve with get_message_by_id):
- d8f3a2b1 [2024-10-15] "We need to implement rate limiting for..."
- 7c4e9f2d [2024-10-12] "OAuth2 configuration should include..."
[... up to token limit]
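
One way the "up to token limit" cutoff might be implemented is to add snippet lines until the index budget is spent. The sketch below assumes roughly four characters per token (a rule of thumb, not a tokenizer) and uses illustrative names (`SnippetEntry`, `buildSnippetIndex`). It always emits the header line, even if the budget is smaller than the header itself.

```typescript
interface SnippetEntry {
  id: string;
  timestamp: string; // ISO-8601
  snippet: string;
}

// Assumption: about 4 characters per token for English text.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Render snippet lines in the given order until maxIndexTokens is reached.
function buildSnippetIndex(entries: SnippetEntry[], maxIndexTokens: number): string {
  const header = "Historical Matches (retrieve with get_message_by_id):";
  const lines = [header];
  let used = approxTokens(header);
  for (const entry of entries) {
    const date = entry.timestamp.slice(0, 10); // YYYY-MM-DD
    const line = `- ${entry.id} [${date}] "${entry.snippet}..."`;
    const cost = approxTokens(line);
    if (used + cost > maxIndexTokens) break; // budget exhausted
    lines.push(line);
    used += cost;
  }
  return lines.join("\n");
}
```

Entries are expected to arrive already sorted by relevance, so the cutoff drops the weakest matches first.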

Large Result Set (501-5000 messages)

Recent Conversation:
[last 5 messages]

Message Clusters:
Cluster: "API Authentication" (47 messages)
Period: 2024-03-15 to 2024-10-22
Summary: Decisions on OAuth2, JWT tokens, API key support
Sample messages:
  - "OAuth2 implementation with refresh tokens..."
  - "API key fallback for legacy clients..."
Retrieve with: get_cluster("api_auth_cluster_id")

Cluster: "Database Design" (89 messages)
[...]

Huge Result Set (5000+ messages)

Temporal Overview:
  Today: 5 messages
  This Week: 45 messages
  This Month: 423 messages
  Older: 8,234 messages

Topic Overview (This Week):
  "API Development": 12 messages
  "Bug Fixes": 18 messages
  "Architecture": 15 messages

Navigation:
  - get_period_messages('today')
  - get_topic_messages('API Development')
  - vector_search(query, limit)

5. Tool Interface

5.1 Core Tools

interface ChatTools {
  // Single message retrieval
  get_message_by_id(id: string): Message;

  // Bulk retrieval
  get_messages_by_ids(ids: string[]): Message[];

  // Chunk-aware retrieval
  get_message_with_chunks(id: string): Message[];

  // Semantic search (includes tool calls if configured)
  vector_search(query: string, limit?: number): SearchResult[];

  // Cluster/group retrieval
  get_cluster(cluster_id: string, limit?: number): Message[];

  // Temporal retrieval
  get_period_messages(
    period: "today" | "this_week" | "this_month" | string,
    limit?: number
  ): Message[];

  // Thread navigation
  get_conversation_thread(message_id: string, depth?: number): Message[];

  // Tool call retrieval
  get_tool_call(id: string): ToolCall;
  get_tool_calls_by_message(message_id: string): ToolCall[];

  // Combined search and retrieve
  search_and_retrieve(query: string, auto_limit: number): Message[];
}
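
The named periods accepted by `get_period_messages` have to be resolved to a concrete start time before querying on `timestamp`. A minimal sketch follows, assuming UTC boundaries and weeks that start on Monday (neither is stated by the specification). The `NamedPeriod` type and `periodStart` function are illustrative, and free-form period strings are left out.

```typescript
type NamedPeriod = "today" | "this_week" | "this_month";

// Resolve a named period to its inclusive start instant (UTC).
function periodStart(period: NamedPeriod, now: Date = new Date()): Date {
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth();
  const day = now.getUTCDate();
  switch (period) {
    case "today":
      return new Date(Date.UTC(year, month, day));
    case "this_week": {
      // getUTCDay(): Sunday = 0 ... Saturday = 6; shift so Monday = 0
      const daysSinceMonday = (now.getUTCDay() + 6) % 7;
      return new Date(Date.UTC(year, month, day - daysSinceMonday));
    }
    case "this_month":
      return new Date(Date.UTC(year, month, 1));
  }
}
```

The resulting date can be passed to a query as an ISO-8601 string, for example in a condition such as `m.timestamp >= datetime($start)`.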

5.2 Tool Response Format

interface Message {
  id: string;
  content: string;
  role: "user" | "assistant" | "system";
  timestamp: string;
  parentId: string | null;
  metadata?: Record<string, any>;
  isChunk?: boolean;
  chunkIndex?: number;
  chunkParentId?: string;
}

interface ToolCall {
  id: string;
  toolName: string;
  arguments: string;
  result: string;
  timestamp: string;
  messageId: string;
  isChunk?: boolean;
  chunkIndex?: number;
  chunkParentId?: string;
}

interface SearchResult {
  id: string;
  snippet: string;
  timestamp: string;
  score: number;
  type: "message" | "tool_call";
  isChunk?: boolean;
}
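
The stored node properties use snake_case (section 3.1) while the tool response types use camelCase, so a mapping step is needed somewhere. The sketch below shows one possible mapper. `MessageNodeProps` and `toMessage` are illustrative names, and the sketch assumes the driver's DateTime value has already been converted to an ISO-8601 string.

```typescript
type Role = "user" | "assistant" | "system";

interface Message {
  id: string;
  content: string;
  role: Role;
  timestamp: string;
  parentId: string | null;
  isChunk?: boolean;
  chunkIndex?: number;
  chunkParentId?: string;
}

// Node properties as stored in the graph (assumed already plain JS values).
interface MessageNodeProps {
  id: string;
  content: string;
  role: string;
  timestamp: string;
  parent_id?: string | null;
  is_chunk?: boolean | null;
  chunk_index?: number | null;
  chunk_parent_id?: string | null;
}

function isRole(value: string): value is Role {
  return value === "user" || value === "assistant" || value === "system";
}

// Convert stored properties into the tool response shape.
function toMessage(props: MessageNodeProps): Message {
  if (!isRole(props.role)) throw new Error(`unknown role: ${props.role}`);
  const message: Message = {
    id: props.id,
    content: props.content,
    role: props.role,
    timestamp: props.timestamp,
    parentId: props.parent_id ?? null,
  };
  if (props.is_chunk) {
    // Chunk fields are only attached to chunk nodes
    message.isChunk = true;
    if (props.chunk_index != null) message.chunkIndex = props.chunk_index;
    if (props.chunk_parent_id != null) message.chunkParentId = props.chunk_parent_id;
  }
  return message;
}
```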

6. Implementation Guide

6.1 Message Storage

async function storeMessage(
  content: string,
  role: "user" | "assistant" | "system",
  parentId: string | null = null,
  config: IndexConfig
): Promise<string> {
  const tokenCount = estimateTokens(content);

  // Check if chunking is needed
  if (tokenCount > config.chunkThreshold) {
    return await storeChunkedMessage(content, role, parentId, config);
  }

  const id = ulid();
  const timestamp = new Date().toISOString();
  const snippet = content.slice(0, config.snippetLength);
  const embedding = await generateEmbedding(content);

  await neo4j.run(
    `
    CREATE (m:Message {
      id: $id,
      content: $content,
      snippet: $snippet,
      role: $role,
      timestamp: datetime($timestamp),
      parent_id: $parentId,
      embedding: $embedding,
      token_count: $tokenCount,
      edited: false,
      is_chunk: false
    })
  `,
    { id, content, snippet, role, timestamp, parentId, embedding, tokenCount }
  );

  if (parentId) {
    await neo4j.run(
      `
      MATCH (child:Message {id: $childId})
      MATCH (parent:Message {id: $parentId})
      CREATE (child)-[:REPLIES_TO]->(parent)
    `,
      { childId: id, parentId }
    );
  }

  return id;
}

async function storeChunkedMessage(
  content: string,
  role: "user" | "assistant" | "system",
  parentId: string | null,
  config: IndexConfig
): Promise<string> {
  const parentMessageId = ulid();
  const chunks = chunkContent(content, config.chunkThreshold);
  const timestamp = new Date().toISOString();
  const snippet = chunks[0].slice(0, config.snippetLength);

  for (let i = 0; i < chunks.length; i++) {
    const chunkId = ulid();
    const chunkContent = chunks[i];
    const embedding = await generateEmbedding(chunkContent);
    const tokenCount = estimateTokens(chunkContent);

    await neo4j.run(
      `
      CREATE (m:Message {
        id: $id,
        content: $content,
        snippet: $snippet,
        role: $role,
        timestamp: datetime($timestamp),
        parent_id: $parentId,
        embedding: $embedding,
        token_count: $tokenCount,
        edited: false,
        is_chunk: true,
        chunk_index: $chunkIndex,
        chunk_parent_id: $chunkParentId
      })
    `,
      {
        id: chunkId,
        content: chunkContent,
        snippet: i === 0 ? snippet : "",
        role,
        timestamp,
        parentId,
        embedding,
        tokenCount,
        chunkIndex: i,
        chunkParentId: parentMessageId,
      }
    );

    // Link only the first chunk to the message being replied to
    if (i === 0 && parentId) {
      await neo4j.run(
        `
        MATCH (child:Message {id: $childId})
        MATCH (parent:Message {id: $parentId})
        CREATE (child)-[:REPLIES_TO]->(parent)
      `,
        { childId: chunkId, parentId }
      );
    }
  }

  return parentMessageId;
}

6.2 Tool Call Storage

async function storeToolCall(
  toolName: string,
  args: any, // `arguments` is not a valid parameter name in strict mode
  result: any,
  messageId: string,
  config: IndexConfig
): Promise<string> {
  const resultString = JSON.stringify(result);
  const argsString = JSON.stringify(args);
  const tokenCount = estimateTokens(resultString);

  // Check if chunking is needed
  if (tokenCount > config.chunkThreshold) {
    return await storeChunkedToolCall(
      toolName,
      args,
      resultString,
      messageId,
      config
    );
  }

  const id = ulid();
  const timestamp = new Date().toISOString();
  const embedding = await generateEmbedding(
    `${toolName}: ${argsString} -> ${resultString}`
  );

  await neo4j.run(
    `
    CREATE (t:ToolCall {
      id: $id,
      tool_name: $toolName,
      arguments: $arguments,
      result: $result,
      timestamp: datetime($timestamp),
      message_id: $messageId,
      embedding: $embedding,
      token_count: $tokenCount,
      is_chunk: false
    })
  `,
    {
      id,
      toolName,
      arguments: argsString,
      result: resultString,
      timestamp,
      messageId,
      embedding,
      tokenCount,
    }
  );

  await neo4j.run(
    `
    MATCH (t:ToolCall {id: $toolId})
    MATCH (m:Message {id: $messageId})
    CREATE (t)-[:CALLED_BY]->(m)
  `,
    { toolId: id, messageId }
  );

  return id;
}

async function storeChunkedToolCall(
  toolName: string,
  args: any, // `arguments` is not a valid parameter name in strict mode
  result: string,
  messageId: string,
  config: IndexConfig
): Promise<string> {
  const parentToolCallId = ulid();
  const chunks = chunkContent(result, config.chunkThreshold);
  const timestamp = new Date().toISOString();
  const argsString = JSON.stringify(args);

  for (let i = 0; i < chunks.length; i++) {
    const chunkId = ulid();
    const chunkText = chunks[i]; // named to avoid shadowing chunkContent()
    const embedding = await generateEmbedding(
      `${toolName} [chunk ${i}]: ${chunkText}`
    );
    const tokenCount = estimateTokens(chunkText);

    await neo4j.run(
      `
      CREATE (t:ToolCall {
        id: $id,
        tool_name: $toolName,
        arguments: $arguments,
        result: $result,
        timestamp: datetime($timestamp),
        message_id: $messageId,
        embedding: $embedding,
        token_count: $tokenCount,
        is_chunk: true,
        chunk_index: $chunkIndex,
        chunk_parent_id: $chunkParentId
      })
    `,
      {
        id: chunkId,
        toolName,
        arguments: argsString,
        result: chunkText,
        timestamp,
        messageId,
        embedding,
        tokenCount,
        chunkIndex: i,
        chunkParentId: parentToolCallId,
      }
    );

    // Link first chunk to message
    if (i === 0) {
      await neo4j.run(
        `
        MATCH (t:ToolCall {id: $toolId})
        MATCH (m:Message {id: $messageId})
        CREATE (t)-[:CALLED_BY]->(m)
      `,
        { toolId: chunkId, messageId }
      );
    }
  }

  return parentToolCallId;
}

6.3 Context Preparation

async function prepareContext(
  currentMessage: string,
  config: IndexConfig
): Promise<string> {
  // Search for relevant messages and tool calls
  const searchResults = await vectorSearch(
    currentMessage,
    config.includeToolCalls
  );

  // Get recent messages
  const recentMessages = await getRecentMessages(config.recentWindowSize);

  // Build appropriate index
  const index = await buildAdaptiveIndex(searchResults, recentMessages, config);

  return index;
}
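
Before the index is built, it can help to drop search hits that are already shown in the recent window and to collapse duplicate IDs, so the token budget is not spent twice on the same message. The helper below is a sketch of that step under those assumptions. `SearchHit` and `selectHistoricalHits` are illustrative names, not part of the interface in section 5.

```typescript
interface SearchHit {
  id: string;
  score: number;
}

// Keep the best-scoring hit per ID, excluding messages already in the
// recent window, and return at most `limit` hits ordered by score.
function selectHistoricalHits(
  hits: SearchHit[],
  recentIds: string[],
  limit: number
): SearchHit[] {
  const excluded = new Set(recentIds);
  const best = new Map<string, SearchHit>();
  for (const hit of hits) {
    if (excluded.has(hit.id)) continue;
    const current = best.get(hit.id);
    if (current === undefined || hit.score > current.score) {
      best.set(hit.id, hit);
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```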

6.4 Message Updates

async function updateMessage(
  id: string,
  newContent: string,
  config: IndexConfig
): Promise<void> {
  const timestamp = new Date().toISOString();

  // Check if this is a chunk or full message
  const existing = await getMessage(id);

  if (existing.isChunk) {
    throw new Error(
      "Cannot edit individual chunks. Edit parent message instead."
    );
  }

  // Re-chunk if necessary
  const tokenCount = estimateTokens(newContent);
  if (tokenCount > config.chunkThreshold) {
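    // Delete old chunks if they exist (DETACH also removes their relationships;
    // a plain DELETE fails on nodes that still have relationships)
    await neo4j.run(
      `
      MATCH (chunk:Message {chunk_parent_id: $id})
      DETACH DELETE chunk
    `,
      { id }
    );

    // Create new chunked version. Note: storeChunkedMessage generates a
    // fresh parent ID, so the new chunks are not keyed by the original `id`.
    await storeChunkedMessage(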
      newContent,
      existing.role,
      existing.parentId,
      config
    );
    return;
  }

  const embedding = await generateEmbedding(newContent);

  await neo4j.run(
    `
    MATCH (m:Message {id: $id})
    SET m.content = $newContent,
        m.embedding = $embedding,
        m.token_count = $tokenCount,
        m.edited = true,
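        m.edit_history = coalesce(m.edit_history, []) + [$editRecord]
  `,
    {
      id,
      newContent,
      embedding,
      tokenCount,
      // Neo4j properties cannot hold maps, so the record is stored as JSON.
      // coalesce() covers messages whose edit_history was never initialized.
      editRecord: JSON.stringify({
        timestamp,
        previousContent: existing.content,
      }),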
    }
  );
}

7. Configuration

7.1 Environment Variables

# Database
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password

# Embeddings
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSIONS=1536

# Index Configuration
INDEX_SNIPPET_LENGTH=100
INDEX_MAX_TOKENS=10000
INDEX_RECENT_WINDOW=10
INDEX_CLUSTERING_THRESHOLD=0.85
INDEX_CHUNK_THRESHOLD=4000
INDEX_INCLUDE_TOOL_CALLS=true

7.2 Performance Tuning

performance:
  batch_embedding_size: 100 # Embed multiple messages at once
  cache_embeddings: true # Cache frequently accessed
  vector_search_timeout: 5000 # Milliseconds
  index_build_timeout: 3000 # Milliseconds

chunking:
  chunk_threshold: 4000 # Tokens before chunking
  chunk_overlap: 200 # Token overlap between chunks

fallbacks:
  on_vector_timeout: recent_only # Fall back to recent messages
  on_index_timeout: simple # Use simple format
  max_retries: 3 # Tool call retries
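
The timeout and fallback settings describe a "race the work against a timer" pattern. A minimal sketch of that pattern follows. `withTimeout` is a hypothetical helper, and how the fallback value is produced (for example, loading recent messages only) is left to the caller.

```typescript
// Resolve with the work's result, or with fallback() if `ms` elapses first.
// Rejections from the work are passed through unchanged.
function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  fallback: () => T
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      try {
        resolve(fallback());
      } catch (error) {
        reject(error);
      }
    }, ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value); // no effect if the timer already resolved
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```

With the settings above, a vector search would be wrapped with `ms = 5000` and a fallback that returns the recent-messages-only context. Note that this does not cancel the underlying work; it only stops waiting for it.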

7.3 Background Processing Recommendations

Implementations should consider background processes to improve retrieval quality over time:

Relationship Building:

  • Compute semantic similarity between messages in batches
  • Build topic clusters using algorithms like DBSCAN or hierarchical clustering
  • Create temporal summaries for time periods

Index Optimization:

  • Pre-compute frequently accessed message clusters
  • Build materialized views for common query patterns
  • Generate topic embeddings from message clusters

Maintenance:

  • Periodic re-embedding of edited messages
  • Cleanup of orphaned chunks
  • Compression of old embeddings

Example Background Tasks:

// Run nightly
async function buildSemanticClusters() {
  const messages = await getRecentMessages(1000);
  const clusters = await clusterBySimilarity(messages, 0.85);

  for (const cluster of clusters) {
    await createTopicNode(cluster);
  }
}

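// Run weekly
async function optimizeVectorIndex(sampleEmbedding: number[]) {
  // $embedding must be bound, so pass a representative query embedding
  return await neo4j.run(
    `
    CALL db.index.vector.queryNodes(
      'message_embeddings',
      10,
      $embedding
    ) YIELD node, score
    RETURN node.id AS id, score
    // Analyze query patterns and optimize
  `,
    { embedding: sampleEmbedding }
  );
}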

These background processes are implementation-specific and should be tailored to usage patterns and scale.
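
The `clusterBySimilarity` helper used in the nightly task is not defined by this specification. As one possible starting point, the sketch below uses a greedy single-pass ("leader") clustering rather than DBSCAN or hierarchical clustering: each message joins the most similar existing cluster if its cosine similarity to that cluster's centroid meets the threshold, and otherwise starts a new cluster. All type and function names are illustrative, and results depend on input order.

```typescript
interface EmbeddedMessage {
  id: string;
  embedding: number[];
}

interface Cluster {
  centroid: number[]; // running mean of member embeddings
  memberIds: string[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Greedy leader clustering: a simplification, not DBSCAN.
function clusterBySimilarity(messages: EmbeddedMessage[], threshold: number): Cluster[] {
  const clusters: Cluster[] = [];
  for (const message of messages) {
    let best: Cluster | undefined;
    let bestScore = threshold;
    for (const cluster of clusters) {
      const score = cosine(message.embedding, cluster.centroid);
      if (score >= bestScore) {
        best = cluster;
        bestScore = score;
      }
    }
    if (best === undefined) {
      clusters.push({ centroid: message.embedding.slice(), memberIds: [message.id] });
      continue;
    }
    // Update the centroid as a running mean, then record the new member
    const n = best.memberIds.length;
    best.centroid = best.centroid.map((c, i) => (c * n + message.embedding[i]) / (n + 1));
    best.memberIds.push(message.id);
  }
  return clusters;
}
```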

8. Performance Considerations

8.1 Bottlenecks

Component   Bottleneck                      Mitigation
Storage     Embedding size (6KB/msg)        Compression, dimensionality reduction
Search      Vector similarity computation   Approximate nearest neighbors (ANN)
Index       Token presentation limit        Adaptive strategies, clustering, chunking
Retrieval   Sequential tool calls           Batch retrieval, predictive loading
Chunking    Embedding generation overhead   Batch processing, async workflows
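
The 6KB figure follows from 1536 dimensions at 4 bytes per component (1536 × 4 = 6144 bytes). The sketch below makes that arithmetic explicit. The 4-byte default is an assumption about how components are stored; with 8-byte floats the per-embedding size doubles. Function names are illustrative.

```typescript
// Bytes needed for one embedding vector.
function embeddingBytes(dimensions: number, bytesPerComponent = 4): number {
  return dimensions * bytesPerComponent;
}

// Raw embedding storage for a whole history, in GiB (excludes index overhead).
function totalEmbeddingGiB(
  messageCount: number,
  dimensions = 1536,
  bytesPerComponent = 4
): number {
  const totalBytes = messageCount * embeddingBytes(dimensions, bytesPerComponent);
  return totalBytes / (1024 * 1024 * 1024);
}
```

Under these assumptions, one million stored embeddings occupy roughly 5.7 GiB before any compression or index overhead.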

8.2 Scaling Characteristics

  • Storage: O(n) - Linear with message count
  • Vector Search: Approximately O(log n) with an approximate nearest-neighbor (ANN) index
  • Index Building: O(k) where k = result count
  • Context Window Usage: O(1) - Index size is bounded by maxIndexTokens regardless of history length
  • Chunking: O(n/c) chunks per item, where n = token count and c = chunk size (reduces memory per retrieval)

8.3 Optimization Strategies

  1. Hierarchical Clustering: Pre-compute message clusters during quiet periods
  2. Embedding Cache: Cache embeddings for frequently accessed messages
  3. Progressive Loading: Start with minimal context, expand as needed
  4. Temporal Partitioning: Separate hot (recent) and cold (old) storage
  5. Chunk-Aware Retrieval: Load only relevant chunks instead of full messages
  6. Tool Call Indexing: Separately searchable tool results for debugging/analysis
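
As an example of the embedding cache strategy, here is a small least-recently-used (LRU) cache sketch. It relies on the insertion-order guarantee of `Map`. The class name, the choice of LRU eviction, and the use of message IDs as keys are assumptions; the specification does not prescribe a cache policy.

```typescript
class EmbeddingCache {
  private readonly entries = new Map<string, number[]>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new RangeError("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: string): number[] | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert to mark the entry as most recently used
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: number[]): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}
```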

Example Usage

TypeScript/Node.js

import { InfiniteChatStorage } from "./infinite-chat";

const chat = new InfiniteChatStorage({
  neo4jUri: process.env.NEO4J_URI,
  neo4jAuth: {
    user: process.env.NEO4J_USER,
    password: process.env.NEO4J_PASSWORD,
  },
  openaiKey: process.env.OPENAI_API_KEY,
  indexConfig: {
    snippetLength: 100,
    maxIndexTokens: 10000,
    recentWindowSize: 10,
    chunkThreshold: 4000,
    includeToolCalls: true,
  },
});

// Store a message (automatically chunks if needed)
const messageId = await chat.storeMessage(
  "What's our API authentication strategy?",
  "user",
  parentId
);

// Store a tool call with result (automatically chunks if needed)
const toolCallId = await chat.storeToolCall(
  "web_search",
  { query: "OAuth2 best practices" },
  largeSearchResult,
  messageId
);

// Prepare context for LLM
const context = await chat.prepareContext(
  "Tell me about our authentication decisions"
);

// Retrieve specific message with all chunks
const message = await chat.getMessageWithChunks(messageId);

Python

import asyncio
import os

from infinite_chat import InfiniteChatStorage

chat = InfiniteChatStorage(
    neo4j_uri="bolt://localhost:7687",
    neo4j_auth=("neo4j", "password"),
    openai_key=os.getenv("OPENAI_API_KEY"),
    index_config={
        "snippet_length": 100,
        "max_index_tokens": 10000,
        "recent_window_size": 10,
        "chunk_threshold": 4000,
        "include_tool_calls": True,
    },
)


async def main(parent_id, large_search_result):
    # Store a message (automatically chunks if needed)
    message_id = await chat.store_message(
        content="What's our API authentication strategy?",
        role="user",
        parent_id=parent_id,
    )

    # Store a tool call with result
    tool_call_id = await chat.store_tool_call(
        tool_name="web_search",
        arguments={"query": "OAuth2 best practices"},
        result=large_search_result,
        message_id=message_id,
    )

    # Prepare context for LLM
    context = await chat.prepare_context(
        "Tell me about our authentication decisions"
    )

    # Retrieve specific message with all chunks
    message = await chat.get_message_with_chunks(message_id)


# `await` is only valid inside a coroutine, so run the example via asyncio.
# The argument values here are placeholders.
asyncio.run(main(parent_id=None, large_search_result={}))

Contributing

This is an open specification. Contributions, implementations, and improvements are welcome. Please submit issues and pull requests to the repository.

License

MIT License - See LICENSE file for details.