Roadmap 2024
Roadmap 2023
Roadmap 2022
Apache Doris 2025 Roadmap
In 2025, Apache Doris will focus on lakehouse and semi-structured data analysis while continuing to optimize core areas such as query execution, storage, and the query optimizer. The goal is to further improve performance, stability, and ecosystem compatibility for more complex scenarios and larger-scale data processing. Doris will also strengthen cloud-native capabilities and security, and explore AI integration scenarios, including vector search, AI training data management, and AI-assisted system monitoring and operations, to give users a more comprehensive, efficient, and secure modern data analysis platform.
Lakehouse
1. Performance and Stability
- IO Optimization
- Parquet/ORC lazy materialization for complex data types: Improve query performance for complex data types.
- Optimize Scan task scheduling: Reduce long-tail latency for small queries.
- Support dynamic partition pruning: Optimize query efficiency for partitioned tables.
- Optimize Data Cache small file issues: Resolve performance problems caused by too many small files.
- Metadata Optimization
- Metadata cache sharing within single query: Improve query performance, reduce redundant metadata loading.
- Optimize Hive, Iceberg, Paimon metadata access performance: Improve metadata access and Plan performance.
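As a rough illustration of the dynamic partition pruning item above, the sketch below shows the general idea in plain Python: evaluate the filter on the small (dimension) side of a join first, then scan only the partitions of the large (fact) side whose partition value can still match. This is a conceptual sketch, not Doris code; the function name `dynamic_partition_prune`, the data layout, and the sample tables are all hypothetical.

```python
def dynamic_partition_prune(fact_partitions, dim_rows, dim_filter, join_key):
    """Return only the fact partitions that can match the filtered dimension rows.

    fact_partitions: dict mapping a partition value to the rows in that partition
    dim_rows:        list of dicts representing the dimension table
    dim_filter:      predicate applied to each dimension row
    join_key:        dimension column that joins against the partition value
    """
    # Step 1: run the dimension-side filter and collect the surviving join keys.
    live_keys = {row[join_key] for row in dim_rows if dim_filter(row)}
    # Step 2: keep only partitions whose value appears among those keys;
    # every other partition is skipped without being read.
    return {part: rows for part, rows in fact_partitions.items() if part in live_keys}


# Hypothetical sample data: a fact table partitioned by month.
fact_partitions = {
    "2025-01": [("2025-01", 120)],
    "2025-02": [("2025-02", 80)],
    "2025-03": [("2025-03", 95)],
}
dim_rows = [
    {"month": "2025-01", "is_promo": False},
    {"month": "2025-02", "is_promo": True},
    {"month": "2025-03", "is_promo": False},
]

pruned = dynamic_partition_prune(
    fact_partitions, dim_rows, lambda r: r["is_promo"], "month"
)
```

Here the set of partitions to scan is only known at run time, after the dimension filter has been evaluated, which is what distinguishes this from static pruning based on literals in the query text.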
2. Open Table Format
- Iceberg
- Support Iceberg branch/tag access and management.
- Support more Iceberg system tables.
- Support Iceberg Update/Delete: Enhance write operation support for Iceberg tables.
- Support Iceberg small file compaction and Snapshot management.
- Support AWS S3Tables.
- Support Snowflake Iceberg table engine.
- Support Databricks Uniform DeltaLake table engine.
- Paimon
- Support Paimon data write-back: Implement write support for Paimon data.
- Support Paimon snapshot read: Support historical data queries based on snapshots.
- Support more Paimon system tables.
- Hive
- Support multi-Kerberos environment.
- Support multiple Hadoop configuration file management.
- Support Hive 4 transaction table.
- Doris
- Support Doris Catalog: Provide federated queries across multiple Doris clusters.
- Delta Lake/Hudi
- Optimize ecosystem compatibility with Iceberg.
- Catalog
- Support Unity Catalog.
- Support Apache Polaris.
- Support Apache Gravitino.
3. Code Refactoring
- Optimize and unify data source property names: Improve data source configuration consistency.
- JDBC Catalog pluginization: Enhance JDBC Catalog extensibility.
- File system pluginization: Improve file system pluggability.
Semi-structured and Log Analysis
1. Inverted Index Enhancement
- Support more tokenizers
- Chinese ik tokenizer
- Unicode icu tokenizer
- High-performance simple tokenizer for log scenarios
- Support custom dictionary and management for tokenizers.
- Support incremental index building in disaggregated storage mode.
- Further optimize inverted index space usage.
- Enhance index observability, including write and query performance metrics.
2. VARIANT Data Type Enhancement
- Support 10,000 sub-columns in the compute-storage decoupled architecture.
- Sparse columns support more sparse sub-columns.
- Support complex structure expansion for objects nested in JSON arrays.
- Support specifying sub-column types.
- Support building indexes for specified fields.
3. Log and Observability Ecosystem Improvement
- Output plugin supports writing to multiple tables
- filebeat
- logstash
- Observability ecosystem integration
- OpenTelemetry
- Jaeger
- Support more log collector plugins
- ilogtail
- vector
Query Engine
1. Query Performance Optimization
- Dynamic algorithm detection and adjustment for data skew: Optimize query execution, improve performance in big data scenarios.
- ARM architecture tuning and optimization: Support more hardware architectures, improve operational efficiency.
- Adaptive concurrency: Dynamically adjust parallel task numbers based on system load and resources, improve stability in query queue and spill scenarios.
- More general top-N and global lazy materialization capabilities.
- Global dictionary.
2. Resource Management
- Unified resource management framework providing resource auditing and observability for query, load, compaction, and schema change.
- Provide real-time resource monitoring system tables and metrics.
- Unify resource control logic such as Workload Group Policy, Spill Disk, and Query Breaker.
- Smarter scheduling algorithm to allocate resources among multiple queries in a single workload group, reducing interference between large and small queries.
3. Vector Search
- Support vector search
4. Function Compatibility
- Enhance function compatibility with ClickHouse.
- Enhance function compatibility with Presto.
5. ETL
- Combine workload groups with spill disk to dynamically reduce concurrency and limit per-query resource usage, avoiding query cancellation during resource shortages.
- Enhance spill disk stability, for example supporting 5 concurrent TPC-DS 10 TB jobs on a 48-core, 192 GB memory cluster.
- Provide real-time metrics for the spill stage.
- Enhance mixed-load memory management.
Storage and Security
1. Compute-Storage Decoupling
- Optimize cold reads on object storage: Improve cold data read performance.
- More user-friendly Cache strategies: Optimize Cache strategy configuration and usage.
- More user-friendly read-write separation.
- Support more cloud vendor authentication methods: Enhance security in cloud environments.
- IAM Role authentication
2. Security
- Support storage encryption: Enhance data storage security.
- Improve HTTP interface security, including HTTPS support and interface authentication.
3. ETL Enhancement
- Support temporary tables: Enhance data processing capabilities in ETL scenarios.
- Support write-write conflict handling in multi-statement transactions: Improve transaction operation reliability.
4. Disaster Recovery and High Availability
- Support backup and recovery in the compute-storage decoupled architecture.
- Cross-cluster replication (CCR)
- Feature completeness: Ensure production environment stability through thorough chaos testing.
- Support disaggregated storage: Improve CCR adaptation in cloud-native architecture.
- Support primary-secondary switchover: Enhance high availability capabilities.
5. Real-time Data Streaming
- Support Binlog for incremental computation: Support real-time data streaming scenarios.
Query Optimizer
1. Asynchronous Materialized Views
- Data lake table format (Iceberg, Paimon, Hudi) partition incremental build: Improve materialized view build efficiency.
- Enhance observability using monitoring information and system tables: Improve operational capabilities.
- Data lineage information interface: Provide data lineage tracking capabilities.
- Logical view and materialized view interconversion: Improve view management flexibility.
- Automatic materialized views: Implement intelligent management of materialized views.
2. Feature Enhancement
- Recursive CTE: Support recursive queries.
- Filter aggregation (FILTER Clause): Improve SQL feature standard compatibility.
- Pivot and Unpivot: Support data pivoting and unpivoting operations.
- More reasonable implicit type conversion rules: Optimize type conversion logic.
- Standard SQL compatibility improvement: Enhance standard SQL support.
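For readers unfamiliar with two of the planned SQL features above, the following sketch shows what a `FILTER` clause and a recursive CTE look like in standard SQL. It runs against SQLite through Python's built-in `sqlite3` module purely for illustration (the `FILTER` clause needs SQLite 3.30 or later); it says nothing about Doris's eventual syntax or behavior, and the `orders` and `emp` tables are hypothetical.

```python
import sqlite3


def run_examples():
    """Run a FILTER-clause aggregation and a recursive CTE on in-memory SQLite."""
    conn = sqlite3.connect(":memory:")

    # Hypothetical order data for the FILTER example.
    conn.execute("CREATE TABLE orders (region TEXT, status TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            ("east", "paid", 10),
            ("east", "refunded", 4),
            ("west", "paid", 7),
            ("west", "paid", 3),
        ],
    )
    # FILTER restricts the rows fed to each aggregate independently,
    # replacing the more verbose SUM(CASE WHEN ... END) idiom.
    filtered = conn.execute(
        """
        SELECT region,
               SUM(amount) FILTER (WHERE status = 'paid')     AS paid_total,
               COUNT(*)    FILTER (WHERE status = 'refunded') AS refund_count
        FROM orders
        GROUP BY region
        ORDER BY region
        """
    ).fetchall()

    # Hypothetical employee hierarchy for the recursive CTE example.
    conn.execute("CREATE TABLE emp (id INTEGER, manager_id INTEGER)")
    conn.executemany(
        "INSERT INTO emp VALUES (?, ?)", [(1, None), (2, 1), (3, 2), (4, 1)]
    )
    # The anchor member selects the root; the recursive member walks down
    # one management level per iteration until no new rows are produced.
    hierarchy = conn.execute(
        """
        WITH RECURSIVE chain(id, depth) AS (
            SELECT id, 0 FROM emp WHERE manager_id IS NULL
            UNION ALL
            SELECT e.id, c.depth + 1
            FROM emp e JOIN chain c ON e.manager_id = c.id
        )
        SELECT id, depth FROM chain ORDER BY id
        """
    ).fetchall()

    conn.close()
    return filtered, hierarchy


filtered, hierarchy = run_examples()
```

With this sample data, `filtered` holds one row per region with the paid total and refund count, and `hierarchy` lists each employee with its depth below the root.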
3. Execution Optimization
- Compression materialization: Optimize storage space utilization.
- Global lazy materialization: Improve query performance.
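The idea behind lazy materialization for a top-N query can be sketched in a few lines of Python: read only the ordering column to decide which rows survive, then fetch the remaining columns for just those rows. This is a simplified single-node illustration of the general technique, not Doris's implementation; the function `topn_lazy` and the column-oriented `dict` layout are hypothetical.

```python
import heapq


def topn_lazy(columns, order_by, n):
    """Return the n rows with the smallest order_by value, materializing late.

    columns:  dict mapping column name to a list of values (columnar layout)
    order_by: name of the column to sort on
    n:        number of rows to return
    """
    keys = columns[order_by]
    # Phase 1: touch only the sort column and remember (value, row_id) pairs.
    top = heapq.nsmallest(n, ((value, row_id) for row_id, value in enumerate(keys)))
    row_ids = [row_id for _, row_id in top]
    # Phase 2: read the other columns for the surviving row ids only.
    return [{name: col[i] for name, col in columns.items()} for i in row_ids]


# Hypothetical log table with a narrow sort column and a wide payload column.
logs = {
    "ts": [5, 1, 4, 2],
    "msg": ["e", "a", "d", "b"],
}
result = topn_lazy(logs, "ts", 2)
```

The saving comes from phase 2: wide columns such as `msg` are read for `n` rows instead of for the whole table.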
4. Plan Quality Enhancement
- History-based optimization (HBO) support.
- Enhance optimization rules such as constant propagation and NULL propagation.
- Enhance optimization rules utilizing data characteristics.
- Data skew adaptive optimization.
- Common subplan extraction.
- Cost-based CTE materialization selection.
- Cost-based aggregation stage selection.
- Runtime Filter wait time adaptation.
- Enhance Shuffle algorithm selection for distributed plans.
- Adaptive parallelism control.
5. Plan Management
- Execution plan fixing: Support plan controllability.
- Execution plan evolution: Improve plan flexibility and intelligence.
6. Framework Optimization
- Small query scenario planning performance optimization: Improve small query execution efficiency.
- Old optimizer code removal: Simplify code maintenance.
7. Operations Enhancement
- Statistics status collection monitoring and system tables: Improve statistics observability.
- Planning time monitoring and system tables: Enhance query planning diagnostic capabilities.
- Enrich query-related information in audit logs: Improve audit capabilities.
- Error message categorization and content optimization: Improve error message readability and diagnostic capabilities.