Status: Active Last Updated: 2026-03-13
This document describes the Elasticsearch storage architecture, the ILM (Index Lifecycle Management) strategy, and the scaling and operations plan for the eigenflux_server project.
- Automated Lifecycle Management: Automatic three-stage transition from Hot to Warm to Cold data
- Storage Cost Optimization: Reduce storage and memory costs through force_merge, replica reduction, and read-only archiving
- Query Performance Guarantee: Hot data gets high priority and fast responses; cold data gets low priority to save resources
- Business Code Transparency: Rollover is implemented through an index alias, so business code does not need to be aware of underlying index changes
- Vector Search Support: `dense_vector` field dimensions are inferred automatically from `EMBEDDING_DIMENSIONS` or `EMBEDDING_MODEL`, supporting semantic search based on cosine similarity
- Flexible Configuration: Shard and replica counts are configured dynamically via environment variables, adapting to different deployment environments
- Elasticsearch 8.11.0
- ILM (Index Lifecycle Management)
- Composable Index Templates
- Index Aliases with Rollover
- Dense Vector Search (kNN)
Checked Items:
- Package import completeness
- Type matching
- Function signature consistency
- Constant reference correctness
Conclusion: All code compiles successfully, no syntax errors.
Key Checkpoints:
- `IndexMapping` in `pkg/es/ilm.go` correctly references the definition in `mapping.go`
- `es.ReadIndexPattern` and `es.IndexName` constant references are correct in `rpc/sort/dal/*.go`
- Calls to the official ES Go client comply with the v8 API
Advantages:
- Idempotency Guarantee: `upsertILMPolicy` and `upsertIndexTemplate` use PUT operations, so repeated execution does not error
- Safe Bootstrap: `bootstrapIfNeeded` does not force-delete an existing old index, avoiding data loss
- Clear Error Handling: Each step has explicit error returns and logging
- Phased Initialization: Policy → Template → Index, in a sensible order
- Environment Variable Configuration: Shard and replica counts are set via the `ES_SHARDS` and `ES_REPLICAS` environment variables, adapting to different deployment environments
- Enhanced Idempotency: Bootstrap checks whether the initial index `items-000001` already exists, avoiding duplicate creation
Implemented Optimizations:
- ✅ Dynamic Replica Configuration: Configured via the `ES_REPLICAS` environment variable; default 0 (single node), multi-node environments can set it to 1
- ✅ Bootstrap Idempotency Enhancement: Added an `items-000001` existence check, avoiding duplicate creation in edge cases
- ✅ Warm Phase Optimization: `force_merge` + `replicas=0`, reducing segment count and replica memory usage
Advantages:
- Write Transparency: All write operations (`IndexItem`, `BulkIndexItems`) use `es.IndexName` (the alias), so writes automatically route to the new index after Rollover
- Read All: All read operations (`SearchItems`, `SearchSimilarItems`) use `es.ReadIndexPattern` (`items-*`), querying across all backing indices
- Delete Compatibility: `DeleteItem` uses the `delete_by_query` API, supporting multi-index scenarios
Key Fix:
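- The original `DeleteItem` used the single-document DELETE API; once Rollover leaves the alias pointing to multiple indices, that call either fails or only reaches the current write index, so documents in older indices could not be deleted
- It now uses `delete_by_query`, matching the `id` field exactly through a `term` query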
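The fixed request body is small enough to show in full. A minimal sketch using only the standard library; the `deleteByIDQuery` helper name is hypothetical, and how the DAL actually sends the request to `POST items-*/_delete_by_query` (client method, `refresh` or `conflicts` options) is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deleteByIDQuery builds the body for POST items-*/_delete_by_query.
// A term query matches the keyword-typed id field exactly, so the same
// body works no matter which backing index holds the document.
func deleteByIDQuery(id string) ([]byte, error) {
	body := map[string]any{
		"query": map[string]any{
			"term": map[string]any{"id": id},
		},
	}
	return json.Marshal(body)
}

func main() {
	b, err := deleteByIDQuery("item-42")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b)) // {"query":{"term":{"id":"item-42"}}}
}
```

Because `id` is a `keyword` field, the `term` query is not analyzed, so only the exact ID matches.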
Advantages:
- Cross-Index Vector Search: Uses `ReadIndexPattern`, so similarity deduplication can search the full historical data
- Cosine Similarity: The `dense_vector` field is configured with `cosine` similarity, suitable for semantic vectors
- Dimension Consistency: Index dimensions must match the current embedding model's output dimensions; switching models requires an index rebuild or migration
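The dimension-consistency rule can be enforced before a query ever reaches ES. This is a minimal sketch: `buildKNNQuery` is a hypothetical helper and the `k`/`num_candidates` values in the example are illustrative; the body follows the ES 8.x top-level `knn` search option (`field`, `query_vector`, `k`, `num_candidates`), sent to `items-*/_search`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// knnQuery mirrors the top-level "knn" section of an ES 8.x _search body.
type knnQuery struct {
	Field         string    `json:"field"`
	QueryVector   []float32 `json:"query_vector"`
	K             int       `json:"k"`
	NumCandidates int       `json:"num_candidates"`
}

// buildKNNQuery rejects vectors whose length differs from the mapping's dims
// (e.g. after an embedding model switch without an index rebuild).
func buildKNNQuery(vec []float32, indexDims, k, numCandidates int) ([]byte, error) {
	if len(vec) != indexDims {
		return nil, fmt.Errorf("embedding has %d dims, index expects %d", len(vec), indexDims)
	}
	// ES rejects num_candidates below k or above 10,000.
	if numCandidates < k || numCandidates > 10000 {
		return nil, fmt.Errorf("num_candidates must be in [k, 10000], got %d", numCandidates)
	}
	return json.Marshal(map[string]any{
		"knn": knnQuery{Field: "embedding", QueryVector: vec, K: k, NumCandidates: numCandidates},
	})
}

func main() {
	body, err := buildKNNQuery([]float32{0.1, 0.2, 0.3}, 3, 10, 100)
	fmt.Println(string(body), err)
	_, err = buildKNNQuery([]float32{0.1, 0.2}, 3, 10, 100)
	fmt.Println(err) // embedding has 2 dims, index expects 3
}
```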
Notes:
- Vector (HNSW) indices need to stay in memory (the OS filesystem cache) for fast kNN search, and Warm/Cold phase indices still occupy that memory because `items-*` queries reach them
- ES 8.x does not support dynamically disabling indexing on a field, so memory usage can only be reduced through `replicas=0` + `force_merge`

```
"properties": {
  "id": {"type": "keyword"},                                    // Exact match
  "raw_content": {"type": "text"},                              // Full-text search
  "summary": {"type": "text"},
  "keywords": {"type": "text", "analyzer": "keyword_analyzer"}, // Comma-separated keywords
  "domains": {"type": "text", "analyzer": "keyword_analyzer"},
  "embedding": {"type": "dense_vector", "dims": EMBEDDING_DIMENSIONS, "similarity": "cosine"},
  "created_at": {"type": "date"},
  "updated_at": {"type": "date"},
  ...
}
```

Design Assessment:
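- `id` field: `keyword` type, supports exact matching and `term` queries; correct
- `keywords` and `domains`: use `keyword_analyzer` (lowercase + keyword tokenizer), which makes matching case-insensitive; because the keyword tokenizer emits the whole string as a single term, a match has to cover the full comma-separated value rather than one tag inside it
- `embedding` field: `dense_vector` with dimensions determined by the current embedding configuration and `cosine` similarity, meeting semantic search needs
- Time fields: `date` type, supporting range queries and sorting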
Potential Optimization:
- `keywords` and `domains` could be changed to a `keyword` array type, avoiding the overhead of parsing comma-separated strings
  - Current design: `"AI,Machine Learning,NLP"` → requires `strings.Split` in the application layer
  - Optimization: `["AI", "Machine Learning", "NLP"]` → native ES array support
  - Impact: Requires modifying the DAL layer and database schema
```
"settings": {
  "number_of_shards": getEnvInt("ES_SHARDS", 1),
  "number_of_replicas": getEnvInt("ES_REPLICAS", 0),
  "refresh_interval": "30s",
  ...
}
```

Design Assessment:
- `number_of_shards`:
  - Default: 1 (suitable for a single node or small data, < 50GB per index)
  - Scalability: Each backing index is independent after Rollover, so concurrency can be increased by adding nodes
  - Configuration: Adjust via the `ES_SHARDS` environment variable (production recommendation: `shards = node count`)
- `number_of_replicas`:
  - Default: 0 (suitable for single-node deployments or test environments)
  - Risk: With no replicas, a node failure makes its shards unavailable, and their data is lost if the node's disk cannot be recovered
  - Configuration: Adjust via the `ES_REPLICAS` environment variable (production recommendation: 1, which requires at least 2 nodes)
- `refresh_interval: 30s`:
  - Advantage: Reduces refresh overhead and improves write throughput
  - Disadvantage: Documents can take up to 30 seconds after a write to become searchable
  - Use case: Non-real-time query scenarios (such as feed streams)
  - Recommendation: Change to `1s` (the ES default) for high real-time requirements
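| Phase | Time Range | Rollover Condition | Main Operations | Index Priority (`set_priority`) |
|---|---|---|---|---|
| Hot | 0-7 days | max_age: 7d OR max_size: 20gb | Accept writes, real-time queries, vector search | 100 (highest) |
| Warm | 7-90 days | - | force_merge (merge segments), replicas=0, readonly | 50 (medium) |
| Cold | 90+ days | - | replicas=0, readonly, low priority | 0 (lowest) |

Warm and Cold `min_age` values are measured from the index's rollover time, not its creation time. `set_priority` controls the order in which indices are recovered after a node restart (higher first); it does not reserve CPU or memory for queries.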
Objective: High-performance writes and queries
Configuration:
```json
{
  "actions": {
    "rollover": {
      "max_age": "7d",
      "max_size": "20gb"
    },
    "set_priority": {"priority": 100}
  }
}
```

Behavior:
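- The write alias `items` points to the current Hot index (`is_write_index: true`)
- When a Rollover condition is met, ILM automatically creates a new index (e.g., `items-000002`) and switches the write alias to it
- The old index stops receiving writes and moves to the Warm phase once the Warm `min_age` of 7d, counted from the rollover, has elapsed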
Rollover Trigger Conditions:
- Time: 7 days after index creation
- Size: Index size reaches 20GB
- Either condition triggers
Objective: Reduce storage and memory costs while keeping data queryable
Configuration:
```json
{
  "min_age": "7d",
  "actions": {
    "forcemerge": {"max_num_segments": 1},
    "allocate": {"number_of_replicas": 0},
    "readonly": {},
    "set_priority": {"priority": 50}
  }
}
```

Behavior:
- Force Merge: Merges all segments into 1, reducing file count and memory usage
- Replica Reduction: `replicas=0`, freeing replica storage and memory
- Read-only Protection: Prevents accidental writes
- Priority Reduction: Index priority drops to 50, so after a node restart the Hot indices (priority 100) are recovered first
Notes:
- Force Merge is CPU-intensive, recommended during off-peak hours
- Vector indices still occupy memory (ES 8.x can't dynamically disable field indexing)
Objective: Minimize resource usage and archive historical data
Configuration:
```json
{
  "min_age": "90d",
  "actions": {
    "allocate": {"number_of_replicas": 0},
    "readonly": {},
    "set_priority": {"priority": 0}
  }
}
```

Behavior:
- Read-only, no replicas, lowest priority
- Lower query performance, but still queryable via the `items-*` pattern
Optional Extensions:
- Searchable Snapshot: Snapshot index to object storage (S3/GCS), further reducing local storage costs
- Requires configuring Snapshot Repository (not implemented in current code)
```
Write alias:   items         → Points to the current Hot index (is_write_index: true)
Read pattern:  items-*       → Matches all backing indices
Actual index:  items-000001  → Initial index
               items-000002  → 2nd index after Rollover
               items-000003  → 3rd index after Rollover
               ...
```
| Operation Type | Index/Alias Used | Description |
|---|---|---|
| Write | `items` | Alias automatically routes to the current Hot index |
| Read | `items-*` | Queries across all backing indices |
| Delete | `items-*` | Uses `delete_by_query` for cross-index deletion |
```
Initial state:
  items (alias) → items-000001 (is_write_index: true)

After Rollover is triggered:
  items (alias) → items-000002 (is_write_index: true)
                  items-000001 (is_write_index: false; enters the Warm phase 7d after rollover)

When querying:
  items-* matches items-000001 and items-000002 and returns merged results
```
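When a rollover does not name its target, ES derives the new index name by incrementing the number after the last `-` and zero-padding it to six digits, which is how `items-000001` becomes `items-000002`. A minimal sketch of that naming rule plus the create-index body (`PUT items-000001`) that binds the write alias; `nextRolloverIndex` and `bootstrapBody` are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// nextRolloverIndex mimics ES's default naming: the number after the last '-'
// is incremented and zero-padded to six digits.
func nextRolloverIndex(name string) (string, error) {
	i := strings.LastIndex(name, "-")
	if i < 0 {
		return "", fmt.Errorf("%q does not end with -<number>", name)
	}
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return "", fmt.Errorf("%q does not end with -<number>", name)
	}
	return fmt.Sprintf("%s-%06d", name[:i], n+1), nil
}

// bootstrapBody is the body for PUT items-000001: it makes the new index the
// write index behind the given alias.
func bootstrapBody(alias string) ([]byte, error) {
	return json.Marshal(map[string]any{
		"aliases": map[string]any{
			alias: map[string]bool{"is_write_index": true},
		},
	})
}

func main() {
	next, _ := nextRolloverIndex("items-000001")
	fmt.Println(next) // items-000002
	b, _ := bootstrapBody("items")
	fmt.Println(string(b)) // {"aliases":{"items":{"is_write_index":true}}}
}
```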
| Environment Variable | Default | Description | Use Case |
|---|---|---|---|
| `ES_SHARDS` | 1 | Primary shards per index | Single node: 1; Multi-node: node count |
| `ES_REPLICAS` | 0 | Replicas per index | Single node: 0; Multi-node: 1 |
Single Node Deployment (Dev/Test Environment):
```bash
ES_SHARDS=1
ES_REPLICAS=0
```

Multi-Node Deployment (Production, 3 nodes):
```bash
ES_SHARDS=3
ES_REPLICAS=1
```

Notes:
- Configuration changes require a service restart, and the new config only applies to newly created indices
- Existing indices are not updated automatically: adjust their replica count manually; a new shard count only takes effect on the index created by the next Rollover
Steps:
- Set the environment variables:
  ```bash
  export ES_REPLICAS=1
  export ES_SHARDS=3 # Assuming 3 nodes
  ```
- Add ES nodes (at least 2 nodes)
- Restart the service; new indices automatically apply the new config
- Manually update the replica count of existing indices (the `items-*` pattern also re-adds replicas to Warm/Cold indices that ILM set to 0; list specific indices if that is not intended):
  ```bash
  curl -X PUT "localhost:9200/items-*/_settings" -H 'Content-Type: application/json' -d'
  {
    "index": { "number_of_replicas": 1 }
  }'
  ```
Scenario: A single index exceeds 50GB and query performance degrades
Solution:
- Set the environment variable:
  ```bash
  export ES_SHARDS=3
  ```
- Restart the service; indices created by subsequent Rollovers automatically apply the new config
- The primary shard count of an existing index cannot be changed in place (ES limitation); use the Split API (requires a write-blocked source index and a target shard count that is a multiple of the current one) or Reindex into a new index
Recommendations:
- Initial shard count = node count (e.g., 3 nodes → 3 shards)
- Keep each shard at 20-50GB
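The two recommendations combine into simple arithmetic: enough shards to keep each one within the target size, and never fewer than the node count. A minimal sketch, assuming a hypothetical `recommendedShards` helper and an illustrative 30GB target inside the 20-50GB range:

```go
package main

import "fmt"

// recommendedShards returns max(nodes, ceil(expectedGB / targetShardGB)),
// with a floor of one shard.
func recommendedShards(expectedGB, targetShardGB, nodes int) int {
	if targetShardGB <= 0 {
		targetShardGB = 30 // illustrative default inside the 20-50GB range
	}
	n := (expectedGB + targetShardGB - 1) / targetShardGB // ceiling division
	if n < nodes {
		n = nodes // spread each index across every node
	}
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(recommendedShards(20, 30, 3))  // 3
	fmt.Println(recommendedShards(120, 30, 3)) // 4
}
```

With the current `max_size: 20gb` rollover condition, each backing index holds roughly 20GB of primary data at most, so the node count is usually the deciding term.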
Temporary Solution:
- Manually delete Cold phase indices:
  ```bash
  curl -X DELETE "localhost:9200/items-000001"
  ```
- Adjust Rollover conditions (e.g., `max_size: 10gb`)
Long-term Solution:
- Configure Snapshot Repository, enable Searchable Snapshot
- Cold phase indices automatically snapshot to object storage
Cause: Vector indices resident in memory
Optimization Solutions:
- Warm phase `force_merge` + `replicas=0`, reducing segment count and replica memory
- Reduce the number of Hot indices (shorten the Rollover cycle)
- Upgrade to nodes with more memory
Current Configuration:
- `refresh_interval: 30s`: fewer refreshes, higher write throughput
- `number_of_replicas: 0` (default): no replica writes, less network overhead

Further Optimization:
- Bulk writes: use `BulkIndexItems` (already implemented)
- Add `index.translog.durability: async` (risk: writes acknowledged since the last translog fsync, every 5s by default, can be lost if a node crashes)
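Current Configuration:
- Hot index `priority: 100`: recovered first after a node restart
- Warm index `force_merge`: fewer segments to search

Further Optimization:
- Use the `_routing` parameter to route documents with the same `author_id` to the same shard (searches must pass the same routing value to benefit)
- Query cache: `index.queries.cache.enabled` defaults to `true`; make sure it has not been disabled and that repeated exact-match conditions sit in `filter` context, which is what the query cache stores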
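Current Configuration:
- Dynamic-dimension `dense_vector` with `cosine` similarity
- kNN query across all `items-*` indices

Optimization Solutions:
- Limit the vector search range to Hot + Warm indices. Index expressions support `*` wildcards and `-` exclusions but not character ranges such as `[2-9]`, so exclude old indices explicitly (e.g., `items-*,-items-000001`) or pass a comma-separated list
- Use the `num_candidates` parameter to control how many candidates each shard considers (at least `k`, at most 10,000); lower values are faster, higher values improve recall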
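Maintaining such an exclusion list by hand gets tedious as indices accumulate. A minimal sketch, assuming the phase of each index has already been parsed from `GET items-*/_ilm/explain` (whose per-index `phase` field reports `hot`, `warm`, or `cold`) into a map; the `knnTargets` helper is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// knnTargets returns a comma-separated index expression for the _search path
// that keeps the wildcard but excludes every index currently in the cold phase.
func knnTargets(pattern string, phaseByIndex map[string]string) string {
	var excluded []string
	for idx, phase := range phaseByIndex {
		if phase == "cold" {
			excluded = append(excluded, "-"+idx)
		}
	}
	sort.Strings(excluded) // deterministic output (map iteration order is random)
	return strings.Join(append([]string{pattern}, excluded...), ",")
}

func main() {
	phases := map[string]string{"items-000001": "cold", "items-000002": "warm", "items-000003": "hot"}
	fmt.Println(knnTargets("items-*", phases)) // items-*,-items-000001
}
```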
| Metric | Description | Alert Threshold |
|---|---|---|
| `indices.count` | Index count | > 100 (check Rollover frequency) |
| `indices.store.size` | Total storage size | > 80% disk capacity |
| `jvm.mem.heap_used_percent` | JVM heap memory usage | > 85% |
| `indices.search.query_time_in_millis` | Query latency (cumulative counter; derive from deltas against `query_total`) | > 1000ms (P99) |
| `indices.indexing.index_time_in_millis` | Write latency (cumulative counter; derive from deltas against `index_total`) | > 500ms (P99) |
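Because `query_time_in_millis` and `index_time_in_millis` are cumulative counters (returned by `GET _nodes/stats/indices` alongside `query_total` and `index_total`), a latency figure has to be derived from the difference between two samples. A minimal sketch with a hypothetical `counters` type; note that this yields an average per operation, not a P99, which needs latency histograms collected on the client or APM side.

```go
package main

import "fmt"

// counters holds one sample of a cumulative ES stats pair, e.g.
// indices.search.{query_total, query_time_in_millis}.
type counters struct {
	Total  int64
	TimeMs int64
}

// avgLatencyMs returns the mean latency over the interval between two samples.
// It returns 0 when no operations happened, or when a counter reset (e.g. a
// node restart) made a delta negative.
func avgLatencyMs(prev, cur counters) float64 {
	dOps := cur.Total - prev.Total
	dMs := cur.TimeMs - prev.TimeMs
	if dOps <= 0 || dMs < 0 {
		return 0
	}
	return float64(dMs) / float64(dOps)
}

func main() {
	prev := counters{Total: 10000, TimeMs: 250000}
	cur := counters{Total: 10500, TimeMs: 262500}
	fmt.Println(avgLatencyMs(prev, cur)) // 25
}
```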
```bash
# View ILM policy
curl "localhost:9200/_ilm/policy/items-policy?pretty"
# View index ILM status
curl "localhost:9200/items-*/_ilm/explain?pretty"
# View alias bindings
curl "localhost:9200/_cat/aliases/items?v"
# View all backing indices
curl "localhost:9200/_cat/indices/items-*?v&s=index"
```

Check:
```bash
curl "localhost:9200/items-*/_ilm/explain?pretty" | grep -A 5 "phase"
```

Possible Causes:
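- ILM policy not bound to the index (`index.lifecycle.name`, and for alias-based rollover `index.lifecycle.rollover_alias`, not set)
- Rollover conditions not met yet (age < 7d and size < 20GB)
- ILM not running (check with `GET _ilm/status`; start it with `POST _ilm/start`)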
Solution:
```bash
# Manually trigger Rollover
curl -X POST "localhost:9200/items/_rollover?pretty"
```

Cause: A single-document API (GET/DELETE) was used on the alias while the alias points to multiple indices
Solution:
- GET → switch to the Search API
- DELETE → switch to the `delete_by_query` API (already fixed)
Cause: Only the `items` alias (current Hot index) was queried, so historical indices were not searched
Solution:
- Ensure queries use the `items-*` pattern (already fixed)
Scenario: An existing `items` index (a concrete index, not an alias) needs to be migrated to ILM management
Steps:
- Back up data (optional; requires a registered snapshot repository, here `my_backup`):
  ```bash
  curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
  ```
- Delete the old index:
  ```bash
  curl -X DELETE "localhost:9200/items"
  ```
- Restart the service:
  ```bash
  ./scripts/local/start_local.sh
  ```
  - The service automatically runs `SetupILM` on startup
  - It creates `items-000001` and binds the `items` alias
- Verify:
  ```bash
  curl "localhost:9200/_cat/aliases/items?v"
  curl "localhost:9200/_cat/indices/items-*?v"
  ```
Scenario: Old data needs to be migrated into the new ILM-managed index
Steps:
```bash
curl -X POST "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "items_backup"
  },
  "dest": {
    "index": "items"
  }
}
'
```

Because `items` is an alias with a write index, reindexed documents land in the current write index.

- Automated Lifecycle Management: No manual intervention; indices roll over and move to cheaper phases automatically
- Cost Optimization: Warm/Cold phases reduce storage and memory costs through force_merge, replica reduction
- Business Transparency: Read-write separation through aliases, business code doesn't need modification
- Strong Scalability: Supports horizontal scaling (more nodes, and more shards for newly created indices) and vertical scaling (larger nodes)
- Vector Search Support: Dynamic dimension dense_vector, supports semantic similarity search
- Flexible Configuration: Dynamic configuration via environment variables, adapts to different deployment environments
- ✅ Dynamic Replica Configuration: Configured via the `ES_REPLICAS` environment variable; default 0 (single node), multi-node environments can set it to 1
- ✅ Dynamic Shard Configuration: Configured via the `ES_SHARDS` environment variable; default 1
- ✅ Bootstrap Idempotency Enhancement: Added an `items-000001` existence check, avoiding duplicate creation in edge cases
- ✅ Warm Phase Optimization: `force_merge` + `replicas=0`, reducing segment count and replica memory usage
- Configuration Effective Timing: Environment variable changes require a service restart, and the new config only applies to newly created indices
- Single Node Limitation: `replicas=0` provides no data redundancy; production should use at least 2 nodes + `replicas=1`
- Vector Memory Usage: Warm/Cold phase vector indices still occupy memory; HNSW vector data is served from the OS page cache outside the JVM heap, so monitor available node RAM as well as JVM heap usage
- Refresh Delay: `refresh_interval: 30s` means documents can take up to 30 seconds to become visible after a write; high real-time scenarios need a shorter interval
- Rollover Frequency: Currently 7 days or 20GB; adjust based on actual data volume
- Searchable Snapshot: Configure a Snapshot Repository so Cold phase indices are snapshotted to object storage
- Keyword Field Optimization: Change `keywords` and `domains` to a `keyword` array type
- Query Cache: `index.queries.cache.enabled` is already `true` by default and only caches filter-context clauses; moving repeated exact-match conditions into `filter` clauses is what lets them benefit
- Monitoring and Alerting: Integrate Prometheus + Grafana to monitor ILM status and performance metrics
Document Version: v1.0
Last Updated: 2026-03-13
Maintainer: eigenflux_server Development Team