137 changes: 137 additions & 0 deletions docs/migration/elasticsearch-to-doris.md
---
{
"title": "Elasticsearch to Doris",
"language": "en",
"description": "Comprehensive guide to migrating data from Elasticsearch to Apache Doris"
}
---

This guide covers migrating data from Elasticsearch to Apache Doris. Doris can serve as a powerful alternative to Elasticsearch for log analytics, full-text search, and general OLAP workloads, often with better performance and lower operational complexity.

## Why Migrate from Elasticsearch to Doris?

| Aspect | Elasticsearch | Apache Doris |
|--------|---------------|--------------|
| Query Language | DSL (JSON-based) | Standard SQL |
| JOINs | Limited | Full SQL JOINs |
| Storage Efficiency | Higher storage usage | Columnar compression |
| Operational Complexity | Complex cluster management | Simpler operations |
| Full-text Search | Native inverted index | Inverted index support |
| Real-time Analytics | Good | Excellent |

## Considerations

1. **Full-text Search**: Doris supports [Inverted Index](../table-design/index/inverted-index/overview.md) for full-text search capabilities similar to Elasticsearch.

2. **Index to Table Mapping**: Each Elasticsearch index typically maps to a Doris table.

3. **Nested Documents**: Elasticsearch nested types map to Doris [VARIANT](../data-operate/import/complex-types/variant.md) type for flexible schema handling.

4. **Array Handling**: Elasticsearch doesn't have explicit array types. To read arrays correctly via the ES Catalog, configure array field metadata in the ES index mapping using `_meta.doris.array_fields`.

5. **Date Types**: Elasticsearch dates can have multiple formats. Ensure consistent date handling when migrating — use explicit casting to DATETIME.

6. **_id Field**: To preserve Elasticsearch document `_id`, enable `mapping_es_id` in the ES Catalog configuration.

7. **Performance**: For better ES Catalog read performance, enable `enable_docvalue_scan` so scans read columnar doc values instead of parsing `_source`. Note that `text` fields don't support doc values and will fall back to `_source`.
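
The considerations above can be illustrated with a hedged sketch of a Doris target table for a log index; the table and column names are illustrative, not from the source:

```sql
-- Hypothetical Doris table for an Elasticsearch log index: a VARIANT
-- column absorbs nested/object fields, and an inverted index on the
-- message column provides ES-style full-text search.
CREATE TABLE app_logs (
    ts      DATETIME NOT NULL,
    level   VARCHAR(16),
    message STRING,
    payload VARIANT,
    INDEX idx_message (message) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY (ts)
DISTRIBUTED BY RANDOM BUCKETS 10
PROPERTIES ("replication_num" = "1");
```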

## Data Type Mapping

| Elasticsearch Type | Doris Type | Notes |
|--------------------|------------|-------|
| null | NULL | |
| boolean | BOOLEAN | |
| byte | TINYINT | |
| short | SMALLINT | |
| integer | INT | |
| long | BIGINT | |
| unsigned_long | LARGEINT | |
| float | FLOAT | |
| half_float | FLOAT | |
| double | DOUBLE | |
| scaled_float | DOUBLE | |
| keyword | STRING | |
| text | STRING | Consider inverted index in Doris |
| date | DATE or DATETIME | See Date Types consideration above |
| ip | STRING | |
| nested | VARIANT | See [VARIANT type](../data-operate/import/complex-types/variant.md) for flexible schema |
| object | VARIANT | See [VARIANT type](../data-operate/import/complex-types/variant.md) |
| flattened | VARIANT | Supported since Doris 3.1.4, 4.0.3 |
| geo_point | STRING | Store as "lat,lon" string |
| geo_shape | STRING | Store as GeoJSON string |
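
As a worked example of the mapping above, a hypothetical ES index with `ts` (date), `client_ip` (ip), `tags` (keyword), and `detail` (object) fields could translate to the following illustrative DDL:

```sql
CREATE TABLE es_migrated (
    ts        DATETIME,  -- ES date    -> DATETIME (cast explicitly on load)
    client_ip STRING,    -- ES ip      -> STRING
    tags      STRING,    -- ES keyword -> STRING
    detail    VARIANT    -- ES object  -> VARIANT
)
DUPLICATE KEY (ts)
DISTRIBUTED BY RANDOM BUCKETS 10
PROPERTIES ("replication_num" = "1");
```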

## Migration Options

### Option 1: ES Catalog (Direct Query and Migration)

The [ES Catalog](../lakehouse/catalogs/es-catalog.md) provides direct access to Elasticsearch data from Doris, enabling both querying and migration.

**Prerequisites**: Elasticsearch 5.x or higher; network connectivity between Doris FE/BE nodes and Elasticsearch.
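
A minimal sketch of this option, assuming a hypothetical ES host, credentials, index, and target table:

```sql
-- Connect Doris to Elasticsearch (host and credentials are placeholders)
CREATE CATALOG es_source PROPERTIES (
    "type" = "es",
    "hosts" = "http://es-host:9200",
    "user" = "elastic",
    "password" = "password",
    "enable_docvalue_scan" = "true",  -- faster columnar reads where supported
    "mapping_es_id" = "true"          -- expose the ES _id field
);

-- One-shot migration: ES indexes appear as tables under default_db.
-- Cast date fields explicitly, per the Date Types consideration above.
INSERT INTO target_db.app_logs
SELECT CAST(ts AS DATETIME), level, message
FROM es_source.default_db.es_log_index;
```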

### Option 2: Logstash Pipeline

Use Logstash to read from Elasticsearch and write to Doris via HTTP (Stream Load). This approach gives you transformation capabilities during migration.

### Option 3: Custom Script with Scroll API

For more control, use a custom script with Elasticsearch Scroll API to read data and load it into Doris via Stream Load.

## Full-text Search in Doris

Doris's [Inverted Index](../table-design/index/inverted-index/overview.md) provides full-text search capabilities similar to Elasticsearch.

### DSL to SQL Conversion Reference

| Elasticsearch DSL | Doris SQL |
|-------------------|-----------|
| `{"match": {"title": "doris"}}` | `WHERE title MATCH 'doris'` |
| `{"match_phrase": {"content": "real time"}}` | `WHERE content MATCH_PHRASE 'real time'` |
| `{"term": {"status": "active"}}` | `WHERE status = 'active'` |
| `{"terms": {"tag": ["a", "b"]}}` | `WHERE tag IN ('a', 'b')` |
| `{"range": {"price": {"gte": 10}}}` | `WHERE price >= 10` |
| `{"bool": {"must": [...]}}` | `WHERE ... AND ...` |
| `{"bool": {"should": [...]}}` | `WHERE ... OR ...` |
| `{"exists": {"field": "email"}}` | `WHERE email IS NOT NULL` |
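
Combining several rows of the table, a compound `bool` query translates as follows (the `app_logs` table and its fields are hypothetical):

```sql
-- ES DSL:
-- {"bool": {"must": [
--     {"match": {"message": "timeout"}},
--     {"range": {"ts": {"gte": "2024-01-01"}}}
--   ],
--   "must_not": [{"term": {"level": "DEBUG"}}]}}
--
-- Equivalent Doris SQL:
SELECT *
FROM app_logs
WHERE message MATCH 'timeout'
  AND ts >= '2024-01-01'
  AND level != 'DEBUG';
```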

## Feature Compatibility

### VARIANT Type vs ES Dynamic Mapping

The Doris [VARIANT](../data-operate/import/complex-types/variant.md) type provides functionality comparable to Elasticsearch Dynamic Mapping for flexible schema handling.

| Feature | Doris VARIANT | ES Dynamic Mapping | Status |
|---------|--------------|-------------------|--------|
| Dynamic schema inference | Auto-infer JSON field types | Dynamic Mapping | Compatible |
| Predefined field types | `MATCH_NAME 'field': type` | Explicit Mapping | Compatible |
| Pattern-based type matching | `MATCH_NAME_GLOB 'pattern*': type` | dynamic_templates | Compatible |
| Field index configuration | `INDEX ... PROPERTIES("field_pattern"=...)` | Mapping + Index Settings | Compatible |
| Custom analyzer | `CREATE INVERTED INDEX ANALYZER` | Custom Analyzer | Compatible |
| Sub-column count limit | `variant_max_subcolumns_count` | `mapping.total_fields.limit` | Compatible |
| Sparse column optimization | `variant_enable_typed_paths_to_sparse` | N/A | Doris-specific |
| Nested array objects | Flattened handling | Nested Type | Partial |
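
A minimal sketch of dynamic schema inference with VARIANT (table and JSON contents are illustrative):

```sql
-- A VARIANT column accepts arbitrary JSON, like ES dynamic mapping
CREATE TABLE events (
    id  BIGINT NOT NULL,
    doc VARIANT
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

INSERT INTO events VALUES (1, '{"user": {"name": "alice"}, "status": 200}');

-- Sub-columns are inferred automatically and queried with bracket syntax
SELECT doc['user']['name']
FROM events
WHERE CAST(doc['status'] AS INT) = 200;
```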

### Search Function vs ES Query String

The Doris `search()` function provides a Lucene-compatible query string syntax similar to Elasticsearch's `query_string`.

| Feature | Doris search() | ES query_string | Status |
|---------|---------------|----------------|--------|
| Query string syntax | Lucene mode | query_string query | Compatible |
| Multi-field search | `fields` parameter | multi_match / fields | Supported |
| best_fields mode | Supported | Supported | Supported |
| cross_fields mode | Supported | Supported | Supported |
| VARIANT sub-column search | `variant.field:term` | Object/Nested search | Supported |
| Boolean queries | AND/OR/NOT | AND/OR/NOT | Supported |
| Phrase queries | `"exact phrase"` | `"exact phrase"` | Supported |
| Wildcards | `*`, `?` | `*`, `?` | Supported |
| Regular expressions | `/pattern/` | `/pattern/` | Supported |
| Relevance scoring | Disabled | BM25 | Not supported |
| Fuzzy queries | Not supported | `term~2` | Not supported |
| Range queries | Not supported | `[a TO z]` | Not supported |
| Proximity queries | Not supported | `"foo bar"~5` | Not supported |
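
A hedged sketch of `search()` usage over an inverted-indexed column (the `app_logs` table and its fields are hypothetical):

```sql
-- Lucene-style query string with boolean operators
SELECT ts, message
FROM app_logs
WHERE search('message:timeout AND level:ERROR');

-- Phrase and wildcard forms from the table above
SELECT count(*) FROM app_logs WHERE search('message:"connection refused"');
SELECT count(*) FROM app_logs WHERE search('message:time*');
```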

## Next Steps

- [Inverted Index](../table-design/index/inverted-index/overview.md) - Full-text search in Doris
- [ES Catalog](../lakehouse/catalogs/es-catalog.md) - Complete ES Catalog reference
- [Log Storage Analysis](../log-storage-analysis.md) - Optimizing log analytics in Doris
150 changes: 150 additions & 0 deletions docs/migration/mysql-to-doris.md
---
{
"title": "MySQL to Doris",
"language": "en",
"description": "Comprehensive guide to migrating data from MySQL to Apache Doris"
}
---

This guide covers migrating data from MySQL to Apache Doris. MySQL is one of the most common migration sources, and Doris provides excellent compatibility with the MySQL protocol, making migration straightforward.

## Considerations

1. **Protocol Compatibility**: Doris is compatible with the MySQL wire protocol, so existing MySQL clients and tools work with Doris.

2. **Real-time Requirements**: If you need real-time synchronization, Flink CDC supports automatic table creation and schema changes.

3. **Full Database Sync**: The Flink Doris Connector supports synchronizing entire MySQL databases including DDL operations.

4. **Auto Increment Columns**: MySQL AUTO_INCREMENT columns can map to Doris's auto-increment feature. When migrating, you can preserve original IDs by explicitly specifying column values.

5. **ENUM and SET Types**: MySQL ENUM and SET types are migrated as STRING in Doris.

6. **Binary Data**: Binary data (BLOB, BINARY) is typically stored as STRING. Consider using HEX encoding for binary data during migration.

7. **Large Table Performance**: For tables with billions of rows, consider increasing Flink parallelism, tuning Doris write buffer, and using batch mode for initial load.

## Data Type Mapping

| MySQL Type | Doris Type | Notes |
|------------|------------|-------|
| BOOLEAN / TINYINT(1) | BOOLEAN | |
| TINYINT | TINYINT | |
| SMALLINT | SMALLINT | |
| MEDIUMINT | INT | |
| INT / INTEGER | INT | |
| BIGINT | BIGINT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DECIMAL(P, S) | DECIMAL(P, S) | |
| DATE | DATE | |
| DATETIME | DATETIME | |
| TIMESTAMP | DATETIME | Stored as UTC, converted on read |
| TIME | STRING | Doris does not support TIME type |
| YEAR | INT | |
| CHAR(N) | CHAR(N) | |
| VARCHAR(N) | VARCHAR(N) | |
| TEXT / MEDIUMTEXT / LONGTEXT | STRING | |
| BINARY / VARBINARY | STRING | |
| BLOB / MEDIUMBLOB / LONGBLOB | STRING | |
| JSON | VARIANT | See [VARIANT type](../data-operate/import/complex-types/variant.md) |
| ENUM | STRING | |
| SET | STRING | |
| BIT | BOOLEAN / BIGINT | BIT(1) maps to BOOLEAN |
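
As a worked example, a hypothetical MySQL `orders` table converted with the mapping above (names, key model, and properties are illustrative):

```sql
-- MySQL source:
--   CREATE TABLE orders (
--       id      BIGINT AUTO_INCREMENT PRIMARY KEY,
--       status  ENUM('new','paid','shipped'),
--       created TIMESTAMP,
--       note    TEXT
--   );

-- Doris target, applying the mapping above
CREATE TABLE orders (
    id      BIGINT NOT NULL AUTO_INCREMENT,  -- maps to Doris auto-increment
    status  STRING,                          -- ENUM      -> STRING
    created DATETIME,                        -- TIMESTAMP -> DATETIME
    note    STRING                           -- TEXT      -> STRING
)
UNIQUE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES ("replication_num" = "1");
```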

## Migration Options

### Option 1: Flink CDC (Real-time Sync)

Flink CDC captures MySQL binlog changes and streams them to Doris. This method is suited for:

- Real-time data synchronization
- Full database migration with automatic table creation
- Continuous sync with schema evolution support

**Prerequisites**: MySQL 5.7+ or 8.0+ with binlog enabled; Flink 1.15+ with Flink CDC 3.x and Flink Doris Connector.

For detailed setup, see the [Flink Doris Connector](../ecosystem/flink-doris-connector.md) documentation.
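
A minimal Flink SQL sketch of this pipeline; hostnames, credentials, and table names are placeholders, and connector option names follow Flink CDC and the Flink Doris Connector but may vary across versions:

```sql
-- Source: capture the MySQL binlog for one table
CREATE TABLE mysql_orders (
    id      BIGINT,
    status  STRING,
    created TIMESTAMP(3),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql-host',
    'port' = '3306',
    'username' = 'root',
    'password' = 'password',
    'database-name' = 'source_db',
    'table-name' = 'orders'
);

-- Sink: the corresponding Doris table
CREATE TABLE doris_orders (
    id      BIGINT,
    status  STRING,
    created TIMESTAMP(3)
) WITH (
    'connector' = 'doris',
    'fenodes' = 'doris-fe:8030',
    'table.identifier' = 'target_db.orders',
    'username' = 'root',
    'password' = 'password'
);

-- Continuous full + incremental sync
INSERT INTO doris_orders SELECT * FROM mysql_orders;
```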

### Option 2: JDBC Catalog

The [JDBC Catalog](../lakehouse/catalogs/jdbc-catalog.md) allows direct querying and batch migration from MySQL. This is the simplest approach for one-time or periodic batch migrations.
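
A hedged sketch of a one-time batch migration via the JDBC Catalog; the host, credentials, driver version, and table names are placeholders:

```sql
-- JDBC catalog pointing at MySQL
CREATE CATALOG mysql_source PROPERTIES (
    "type" = "jdbc",
    "jdbc_url" = "jdbc:mysql://mysql-host:3306/source_db",
    "user" = "root",
    "password" = "password",
    "driver_url" = "mysql-connector-j-8.0.31.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);

-- Batch migration in one statement
INSERT INTO target_db.orders
SELECT * FROM mysql_source.source_db.orders;
```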

### Option 3: Streaming Job (Built-in CDC Sync)

Doris's built-in [Streaming Job](../data-operate/import/streaming-job/streaming-job-multi-table.md) can directly synchronize full and incremental data from MySQL to Doris without external tools like Flink. It uses CDC under the hood to read MySQL binlog and automatically creates target tables (UNIQUE KEY model) with primary keys matching the source.

This option is suited for:

- Real-time multi-table sync without deploying a Flink cluster
- Environments where you prefer Doris-native features over external tools
- Full + incremental migration with a single SQL command

**Prerequisites**: MySQL with binlog enabled (`binlog_format = ROW`); MySQL JDBC driver deployed to Doris.

#### Step 1: Enable MySQL Binlog

Ensure `my.cnf` contains:

```ini
[mysqld]
log-bin = mysql-bin
binlog_format = ROW
server-id = 1
```

#### Step 2: Create Streaming Job

```sql
CREATE JOB mysql_sync
ON STREAMING
FROM MYSQL (
"jdbc_url" = "jdbc:mysql://mysql-host:3306",
"driver_url" = "mysql-connector-j-8.0.31.jar",
"driver_class" = "com.mysql.cj.jdbc.Driver",
"user" = "root",
"password" = "password",
"database" = "source_db",
"include_tables" = "orders,customers,products",
"offset" = "initial"
)
TO DATABASE target_db (
"table.create.properties.replication_num" = "3"
)
```

Key parameters:

| Parameter | Description |
|-----------|-------------|
| `include_tables` | Comma-separated list of tables to sync |
| `offset` | `initial` for full + incremental; `latest` for incremental only |
| `snapshot_split_size` | Row count per split during full sync (default: 8096) |
| `snapshot_parallelism` | Parallelism during full sync phase (default: 1) |

#### Step 3: Monitor Sync Status

```sql
-- Check job status
SELECT * FROM jobs(type='insert') WHERE ExecuteType = 'STREAMING';

-- Check task history
SELECT * FROM tasks(type='insert') WHERE jobName = 'mysql_sync';

-- Pause / Resume / Drop
PAUSE JOB WHERE jobName = 'mysql_sync';
RESUME JOB WHERE jobName = 'mysql_sync';
DROP JOB WHERE jobName = 'mysql_sync';
```

For detailed reference, see the [Streaming Job Multi-Table Sync](../data-operate/import/streaming-job/streaming-job-multi-table.md) documentation.

### Option 4: DataX

[DataX](https://github.com/alibaba/DataX) is a widely-used data synchronization tool that supports MySQL to Doris migration via the `mysqlreader` and `doriswriter` plugins.

## Next Steps

- [Flink Doris Connector](../ecosystem/flink-doris-connector.md) - Detailed connector documentation
- [Loading Data](../data-operate/import/load-manual.md) - Alternative import methods
- [Data Model](../table-design/data-model/overview.md) - Choose the right table model