4.0.0 Release Note #56823


We are delighted to announce the official release of Apache Doris 4.0. We welcome everyone to download, use, and test it.

This release centers on three core scenarios: AI-driven analytics, search enhancement, and ETL improvement. It introduces key features such as vector indexes and AI functions, enhances search capabilities, optimizes the stability and resource utilization of offline computing, and improves query performance and data quality through a number of underlying changes. All these efforts aim to build a more efficient, flexible, and unified enterprise-level data analysis platform for users.

During the R&D of version 4.0, more than 200 contributors submitted over 9,000 optimizations and fixes to Apache Doris. We would like to express our sincere gratitude to all contributors who participated in the R&D, testing and requirement feedback of this version.

1. Deep Integration of AI Capabilities

As LLM and vector search technologies are rapidly adopted and deepened in enterprise scenarios, Doris 4.0 focuses on strengthening native AI support. By leveraging vector index technology to efficiently integrate an enterprise's structured and unstructured data, Doris breaks through the functional boundaries of traditional databases and evolves into a core "AI Analytics Hub" for enterprises, providing stable and efficient underlying data support for intelligent decision-making and innovative business practices.

Vector Index

The vector index feature is officially introduced, supporting efficient storage and retrieval of high-dimensional vector data (such as text embeddings, image features, etc.). Combined with Doris's native SQL analytics capabilities, users can directly perform integrated analysis of "structured data query + vector similarity search" within the database, eliminating the need for cross-system integration and significantly reducing the development and deployment costs of AI applications (e.g., intelligent recommendation, semantic search, image retrieval).

  • l2_distance_approximate(): Uses the HNSW index to approximate similarity calculation based on Euclidean Distance (L2). The smaller the value, the higher the similarity.
  • inner_product_approximate(): Uses the HNSW index to approximate similarity calculation based on Inner Product. The larger the value, the higher the similarity.

Example

-- 1) Create the table
CREATE TABLE doc_store (
  id BIGINT,
  title STRING,
  tags ARRAY<STRING>,
  embedding ARRAY<FLOAT> NOT NULL,
  INDEX idx_vec (embedding) USING ANN PROPERTIES (
      "index_type"  = "hnsw",
      "metric_type" = "l2_distance",
      "dim"         = "768",
      "quantizer"   = "flat" -- options:flat / sq8 / sq4
  ),
  INDEX idx_title (title) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES("replication_num"="1");

-- 2) TopN retrieval
SELECT id, l2_distance_approximate(embedding, [...]) AS dist
FROM doc_store
ORDER BY dist ASC
LIMIT 10;

-- 3) Filter first, then TopN
SELECT id, title,
       l2_distance_approximate(embedding, [...]) AS dist
FROM doc_store
WHERE title MATCH_ANY 'music'                -- filter using full text index
  AND array_contains(tags, 'recommendation') -- filter using structured filter
ORDER BY dist ASC
LIMIT 5;

-- 4) Range query
SELECT COUNT(*)
FROM doc_store
WHERE l2_distance_approximate(embedding, [...]) <= 0.35;

Parameters

  • index_type: Required. Currently supports hnsw (Hierarchical Navigable Small Worlds).
  • metric_type: Required. Options are l2_distance (Euclidean Distance) or inner_product.
  • dim: Required. A positive integer that must be strictly consistent with the dimension of the imported vectors.
  • max_degree: Optional. Default value is 32. Controls the out-degree (parameter M in the HNSW algorithm) of nodes in the HNSW graph.
  • ef_construction: Optional. Default value is 40. Specifies the length of the candidate queue during the index construction phase (parameter efConstruction in the HNSW algorithm).
  • quantizer: Optional. Options are flat (default), sq8 (8-bit scalar quantization), or sq4 (4-bit scalar quantization). Quantization significantly reduces memory usage: an SQ8 index is approximately 1/3 the size of a FLAT index, enabling higher storage capacity and lower costs at the expense of a slight loss in recall.
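
As a sketch of how these optional tuning knobs combine, the table from the example above could be created with explicit HNSW and quantization settings (values are illustrative; property names as listed above):

CREATE TABLE doc_store_tuned (
  id BIGINT,
  embedding ARRAY<FLOAT> NOT NULL,
  INDEX idx_vec (embedding) USING ANN PROPERTIES (
      "index_type"      = "hnsw",
      "metric_type"     = "l2_distance",
      "dim"             = "768",
      "max_degree"      = "48",   -- higher out-degree: better recall, more memory
      "ef_construction" = "200",  -- longer build-time candidate queue: better graph, slower build
      "quantizer"       = "sq8"   -- ~1/3 the memory of flat, at a slight recall cost
  )
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES("replication_num"="1");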

Notes

  • Doris uses the "pre-filtering" mechanism by default: it first applies predicate filtering using precisely locatable indexes (e.g., inverted indexes), then performs ANN (Approximate Nearest Neighbor) TopN retrieval on the remaining dataset. This ensures the interpretability of results and stability of recall rates.
  • If the SQL query contains predicates that cannot be precisely located via secondary indexes (e.g., ROUND(id) > 100 where no secondary index like an inverted index exists for the id column), the system will fall back to exact brute-force search to preserve the semantics and correctness of pre-filtering.
  • The vector column must be a non-nullable array (e.g., ARRAY<FLOAT> NOT NULL), and the dimension of imported vectors must match the dim parameter of the index.
  • Currently, ANN retrieval only supports the Duplicate Key table model.

AI Functions

Doris 4.0 introduces AI functions, which allow data analysts to call large language models (LLMs) for text processing directly via simple SQL statements. Tasks such as extracting key information, classifying the sentiment of reviews, or generating concise text summaries can now be completed entirely within the database:

  • AI_CLASSIFY: Extracts a single label string (from a given set of labels) that has the highest matching degree with the text content.
  • AI_EXTRACT: Extracts information related to each specified label based on the text content.
  • AI_FILTER: Judges whether the statement in the text holds and returns a boolean value (true/false).
  • AI_FIXGRAMMAR: Fixes grammatical and spelling errors in the text.
  • AI_GENERATE: Generates content based on the input parameters.
  • AI_MASK: Replaces sensitive information in the original text with [MASKED] according to specified labels (for data desensitization).
  • AI_SENTIMENT: Analyzes the sentiment tendency of the text and returns one of the following values: positive, negative, neutral, or mixed.
  • AI_SIMILARITY: Evaluates the semantic similarity between two texts and returns a floating-point number between 0 and 10 (a higher value indicates greater semantic similarity).
  • AI_SUMMARIZE: Generates a concise summary of the text.
  • AI_TRANSLATE: Translates the text into a specified language.
  • AI_AGG: Performs cross-row aggregate analysis on multiple text entries.

Currently supported LLM providers include: OpenAI, Anthropic, Gemini, DeepSeek, Local (local deployment), MoonShot, MiniMax, Zhipu, Qwen, and Baichuan.
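
As a minimal illustration of the scalar functions (assuming an LLM resource has already been configured as described in the documentation; argument shapes follow the descriptions above):

-- Sentiment of a single review: returns positive / negative / neutral / mixed
SELECT AI_SENTIMENT('The new release is fast and stable, I love it.');

-- Semantic similarity between two phrases: returns a score between 0 and 10
SELECT AI_SIMILARITY('vector database', 'ANN search engine');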
For example, to simulate a resume screening scenario, you can use the following SQL to filter resumes at the semantic level. The example below creates candidates' resumes and job requirements for recruitment:

CREATE TABLE candidate_profiles (
    candidate_id INT,
    name         VARCHAR(50),
    self_intro   VARCHAR(500)
)
DUPLICATE KEY(candidate_id)
DISTRIBUTED BY HASH(candidate_id) BUCKETS 1
PROPERTIES (
    "replication_num" = "1"
);

CREATE TABLE job_requirements (
    job_id   INT,
    title    VARCHAR(100),
    jd_text  VARCHAR(500)
)
DUPLICATE KEY(job_id)
DISTRIBUTED BY HASH(job_id) BUCKETS 1
PROPERTIES (
    "replication_num" = "1"
);

INSERT INTO candidate_profiles VALUES
(1, 'Alice', 'I am a senior backend engineer with 7 years of experience in Java, Spring Cloud and high-concurrency systems.'),
(2, 'Bob',   'Frontend developer focusing on React, TypeScript and performance optimization for e-commerce sites.'),
(3, 'Cathy', 'Data scientist specializing in NLP, large language models and recommendation systems.');

INSERT INTO job_requirements VALUES
(101, 'Backend Engineer', 'Looking for a senior backend engineer with deep Java expertise and experience designing distributed systems.'),
(102, 'ML Engineer',      'Seeking a data scientist or ML engineer familiar with NLP and large language models.');

You can use AI_FILTER to perform semantic matching between job requirements and candidates' profiles, thereby filtering out suitable candidates:

SELECT
    c.candidate_id, c.name,
    j.job_id, j.title
FROM candidate_profiles AS c
JOIN job_requirements AS j
WHERE AI_FILTER(CONCAT('Does the following candidate self-introduction match the job description?', 
                'Job: ', j.jd_text, ' Candidate: ', c.self_intro));
+--------------+-------+--------+------------------+
| candidate_id | name  | job_id | title            |
+--------------+-------+--------+------------------+
|            3 | Cathy |    102 | ML Engineer      |
|            1 | Alice |    101 | Backend Engineer |
+--------------+-------+--------+------------------+

2. Upgrade of Search Capabilities, Empowering Full-Text Retrieval

To address the diverse needs of enterprise-level search scenarios, this version enhances the search capability and delivers a more accurate and flexible text retrieval experience:

New Full-Text Retrieval DSL

Core Highlights

  • One function for full-text retrieval: Complex text retrieval operators are integrated into the unified SEARCH() entry. Its syntax is close to Elasticsearch Query String, which greatly reduces the complexity of SQL concatenation and migration costs.
  • Multi-condition index pushdown: Complex retrieval conditions are directly pushed down to the inverted index for execution, avoiding the repeated overhead of "parsing once and concatenating once" and significantly improving performance.

Syntax Features Supported in the Current Version

  • Term Query: field:value
  • ANY / ALL Multi-Value Matching: field:ANY(v1 v2 ...) / field:ALL(v1 v2 ...)
  • Boolean Combination: AND / OR / NOT with bracket grouping
  • Multi-Field Search: Perform boolean combination on multiple fields within a single search() function

Syntax Features to Be Supported in Future Versions (via Continuous Iteration)

  • Phrases
  • Prefixes
  • Wildcards
  • Regular Expressions
  • Ranges
  • Lists

Example of Usage

-- Term query
SELECT * FROM docs WHERE search('title:apache');

-- ANY multi-value matching
SELECT * FROM docs WHERE search('tags:ANY(java python golang)');

-- ALL multi-value matching
SELECT * FROM docs WHERE search('tags:ALL(machine learning)');

-- Boolean combination with bracket grouping
SELECT * FROM docs
WHERE search('(title:Doris OR content:database) AND NOT category:archived');

-- Combining search() with ordinary SQL predicates
SELECT * FROM docs
WHERE search('title:apache') AND publish_date >= '2025-01-01';

Text Retrieval Scoring

To better support hybrid retrieval scenarios, Doris 4.0 introduces the industry-leading BM25 relevance scoring algorithm, replacing the traditional TF-IDF algorithm. BM25 dynamically adjusts term frequency weights based on document length, significantly improving result relevance and retrieval accuracy in long-text and multi-field retrieval scenarios (such as log analysis and document retrieval).

Example of Usage

SELECT *, score() as score 
FROM search_demo 
WHERE content MATCH_ANY 'search query' 
ORDER BY score DESC 
LIMIT 10;

Supported Index Types

  • Tokenized Index: Supports predefined tokenizers and custom tokenizers.
  • Non-Tokenized Index: Indexes that do not perform tokenization (whole-text indexing).

Supported Text Retrieval Operators

  • MATCH_ANY
  • MATCH_ALL
  • MATCH_PHRASE
  • MATCH_PHRASE_PREFIX
  • SEARCH

Notes

  • Score Range: BM25 scores have no fixed upper or lower bounds; the relative magnitude of scores is more meaningful than their absolute values.
  • Empty Query: If the query term does not exist in the dataset, a score of 0 will be returned.
  • Impact of Document Length: Shorter documents typically receive higher scores when they contain the query term.
  • Number of Query Terms: For multi-term queries, the total score is the combination (sum) of the scores of individual terms.

New Built-in Tokenizers

ICU (International Components for Unicode) Tokenizer

Applicable Scenarios: Internationalized texts with complex writing systems, especially suitable for multilingual mixed documents.
Tokenization Examples:

SELECT TOKENIZE('مرحبا بالعالم Hello 世界', '"parser"="icu"');
-- Result: ["مرحبا", "بالعالم", "Hello", "世界"]

SELECT TOKENIZE('มนไมเปนไปตามความตองการ', '"parser"="icu"');
-- Result: ["มน", "ไมเปน", "ไป", "ตาม", "ความ", "ตองการ"]

IK Tokenizer

Applicable Scenarios: Chinese text processing requiring high-quality tokenization.

  • ik_smart: Smart mode – fewer, longer tokens with concentrated semantics, suitable for precise search.
  • ik_max_word: Maximum granularity mode – more short tokens with comprehensive coverage, suitable for recall-oriented search.

Tokenization Examples:

SELECT TOKENIZE('中华人民共和国国歌', '"parser"="ik","parser_mode"="ik_smart"');
-- Result: ["中华人民共和国", "国歌"]

SELECT TOKENIZE('中华人民共和国国歌', '"parser"="ik","parser_mode"="ik_max_word"');
-- Result: ["中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民", "共和国", "共和", "国歌"]

Basic Tokenizer

Applicable Scenarios: Simple scenarios or scenarios requiring extremely high performance; can be used as an alternative to Unicode tokenizers in log processing scenarios.
Tokenization Examples:

-- Tokenization of English text
SELECT TOKENIZE('Hello World! This is a test.', '"parser"="basic"');
-- Result: ["hello", "world", "this", "is", "a", "test"]

-- Tokenization of Chinese text
SELECT TOKENIZE('你好世界', '"parser"="basic"');
-- Result: ["你", "好", "世", "界"]

-- Tokenization of multilingual text
SELECT TOKENIZE('Hello你好World世界', '"parser"="basic"');
-- Result: ["hello", "你", "好", "world", "世", "界"]

-- Text containing numbers and special characters
SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0', '"parser"="basic"');
-- Result: ["get", "images", "hm", "bg", "jpg", "http", "1", "0"]

-- Processing long numeric sequences
SELECT TOKENIZE('12345678901234567890', '"parser"="basic"');
-- Result: ["12345678901234567890"]

New Custom Tokenization Capabilities

  • Pipeline Combination: Build custom text processing workflows through chained configuration of a char filter, a tokenizer, and multiple token filters.
  • Component Reusability: Commonly used tokenizers and filters can be shared across multiple analyzers, reducing redundant definitions and lowering maintenance costs.

With this custom tokenization mechanism, users can flexibly combine char filters, tokenizers, and token filters to build tokenization workflows tailored to different fields, meeting specialized text retrieval needs in complex scenarios.

3. ETL Improvement

An increasing number of users are migrating offline tasks, such as ETL data processing and multi-table materialized view building, to Doris. To address pain points in offline batch processing scenarios (e.g., high resource consumption and frequent task out-of-memory errors), Doris 4.0 introduces the Spill to Disk feature. When the memory usage of an offline computing task exceeds the threshold, this feature automatically writes part of the intermediate data to disk, preventing task failures caused by insufficient memory and significantly improving the stability and fault tolerance of large-scale offline tasks.
Currently, the operators that support spilling to disk include:

  • Hash Join operator
  • Aggregation operator
  • Sort operator
  • CTE operator

BE Configuration Items

spill_storage_root_path=/mnt/disk1/spilltest/doris/be/storage;/mnt/disk2/doris-spill;/mnt/disk3/doris-spill
spill_storage_limit=100%

  • spill_storage_root_path: Storage path for intermediate result files spilled to disk. By default, it is the same as storage_root_path.
  • spill_storage_limit: Disk space limit for spilled files. It can be configured as a specific size (e.g., 100G, 1T) or a percentage; the default value is 20%. If spill_storage_root_path is set to a dedicated disk, this parameter can be configured as 100%. Its main purpose is to prevent spilled files from occupying excessive disk space, which would disrupt normal data storage.

FE Session Variables

set enable_spill=true;
set exec_mem_limit = 10g;
set query_timeout = 3600;

  • enable_spill: Indicates whether a query enables spilling to disk. Disabled by default; if enabled, query spilling will be triggered when memory is tight.
  • exec_mem_limit: Specifies the maximum memory size that a single query can use.
  • query_timeout: When spilling to disk is enabled, query duration may increase significantly, so this parameter needs to be adjusted accordingly.

Testing

To verify the stability of the Spill to Disk feature, we conducted tests using the TPC-DS 10TB standard dataset. The test environment was configured with 3 BE servers (each with 16-core CPU and 64GB memory), resulting in a BE memory-to-data size ratio of 1:52.
Test results show that the total execution time was 28102.386 seconds, and all 99 queries in the TPC-DS benchmark were successfully completed—verifying the basic stability of the Spill to Disk feature.
For detailed data, please refer to: https://doris.apache.org/docs/dev/admin-manual/workload-management/spill-disk

4. End-to-End Data Quality Assurance

Data accuracy is a core prerequisite and key foundation for enterprise decision-making. To further strengthen this foundation, Doris 4.0 systematically reviews and standardizes the behavior of numerous functions, and establishes an end-to-end verification mechanism spanning data import through computation and analysis. This fully ensures the accuracy and reliability of data processing results, providing solid data support for enterprise decision-making.

Note: This standardization work changes some behaviors compared to previous versions. Please read the relevant documentation carefully before upgrading.

CAST Function

CAST is one of the most logically complex functions in SQL; its core role is to convert values between data types. This process requires handling a large number of detailed format rules and edge cases, as well as precise mapping of type semantics, making it highly error-prone in practice.

Particularly in data import scenarios, the essence of the operation is a CAST process that converts external strings to internal database types. Thus, the behavior of CAST directly determines the accuracy and stability of import logic.

Meanwhile, we anticipate that future databases will be extensively operated by AI systems. For AI to interact with databases effectively, clear definitions of database behaviors are essential. To this end, we have introduced BNF (Backus-Naur Form) grammars: by defining behaviors through BNF, we aim to provide clear operational guidelines for developers and AI agents.

For example, the CAST operation for the DATE type alone already covers dozens of format combination scenarios via BNF (see: https://doris.apache.org/zh-CN/docs/dev/sql-manual/basic-element/sql-data-types/conversion/datetime-conversion). During the testing phase, we also derived millions of test cases based on these rules to ensure the correctness of results.

Strict Mode, Non-Strict Mode, and TRY_CAST

In Doris 4.0, three mechanisms for CAST operations have been added: Strict Mode and Non-Strict Mode (controlled by the enable_strict_cast session variable), and the TRY_CAST function. These mechanisms better handle various scenarios during data type conversion, as detailed below:

Strict Mode
The system performs strict verification on the format, type, and value range of input data in accordance with predefined BNF (Backus-Naur Form) syntax rules. If the data fails to meet the rules (e.g., a string is passed for a field that requires a "numeric type," or a date does not conform to the "YYYY-MM-DD" standard), the system will directly terminate the data processing workflow and throw a clear error (including the specific non-compliant field and reason). This prevents invalid data from entering storage or computation processes.
This "zero-tolerance" verification logic is highly consistent with the strict data validation behavior of PostgreSQL. It ensures data accuracy and consistency from the source, making it indispensable for scenarios with extremely high requirements for data reliability—such as transaction reconciliation in the financial industry, bill accounting in finance, and information registration in government systems. Invalid data (e.g., negative transaction amounts, incorrect bill date formats) in these scenarios could lead to financial losses, compliance risks, or business process disruptions.

Non-Strict Mode
The system also verifies data against BNF rules but adopts a "fault-tolerant" processing logic: if the data is non-compliant, the workflow will not be terminated and no error will be thrown. Instead, the invalid data is automatically converted to a NULL value before the SQL continues execution (e.g., converting the string "xyz" to a numeric NULL). This ensures the SQL task completes normally, prioritizing the continuity and execution efficiency of business processes.
This mode is more suitable for scenarios where "data integrity requirements are low, but SQL execution success rate is critical"—such as log data processing, user behavior data cleaning, and ad-hoc data analysis. In these scenarios, data volumes are large and sources are complex (e.g., APP logs may contain fields with garbled formats due to device anomalies). If the entire SQL task is interrupted by a small amount of invalid data, processing efficiency will be significantly reduced; meanwhile, a small number of NULL values have minimal impact on overall analysis results (e.g., statistics on active users or click-through rates).
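
A minimal sketch of the contrast between the two modes, using the enable_strict_cast session variable (the exact error wording is illustrative):

SET enable_strict_cast = true;
SELECT CAST('xyz' AS INT);   -- fails: 'xyz' does not satisfy the BNF rules for INT

SET enable_strict_cast = false;
SELECT CAST('xyz' AS INT);   -- returns NULL; the statement completes normally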

TRY_CAST Function
The enable_strict_cast parameter controls the behavior of all CAST operations at the statement level. However, a scenario may arise where a single SQL contains multiple CAST functions: some require Strict Mode, while others require Non-Strict Mode. To address this, the TRY_CAST function is introduced.
The TRY_CAST function converts an expression to a specified data type. If the conversion succeeds, it returns the converted value; if it fails, it returns NULL. Its syntax is TRY_CAST(source_expr AS target_type), where source_expr is the expression to be converted, and target_type is the target data type.
For example:

  • TRY_CAST('123' AS INT) returns 123;
  • TRY_CAST('abc' AS INT) returns NULL.

The TRY_CAST function provides a more flexible approach to type conversion. In scenarios where strict conversion success is not required, this function can be used to avoid errors caused by conversion failures.
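
As a sketch of mixing the two semantics within one statement (table and column names are hypothetical):

SET enable_strict_cast = true;

SELECT
    CAST(amount_str AS DECIMAL(18, 2)) AS amount,      -- strict: a malformed row fails the query
    TRY_CAST(legacy_code AS INT)       AS legacy_code  -- tolerant: malformed values become NULL
FROM staging_orders;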

Floating Number Calculations

Doris supports two floating-point data types: FLOAT and DOUBLE. However, the uncertain behavior of INF (infinity) and NAN (not a number) has historically led to potential errors in operations like ORDER BY or GROUP BY. In this version, the behavior of these values has been standardized and clearly defined.

Arithmetic Operations
Doris floating-point numbers support common arithmetic operations such as addition, subtraction, multiplication, and division.
It is important to note that Doris does not fully comply with the IEEE 754 standard when handling floating-point division by zero. Instead, Doris references PostgreSQL’s implementation: when dividing by zero, no special values (e.g., INF) are generated; instead, the result returned is SQL NULL.

Comparison Operations
Floating-point comparisons defined by the IEEE standard have important differences from typical integer comparisons. For example:

  • Negative zero and positive zero are considered equal.
  • Any NaN (Not a Number) value is not equal to any other value, including itself.
  • All finite floating-point numbers are strictly less than +∞ and strictly greater than -∞.

To ensure consistency and predictability of results, Doris handles NaN differently from the IEEE standard. In Doris:

  • NaN is considered greater than all other values (including Infinity).
  • NaN is equal to NaN.

Example

mysql> select * from sort_float order by d;
+------+-----------+
| id   | d         |
+------+-----------+
|    5 | -Infinity |
|    2 |      -123 |
|    1 |       123 |
|    4 |  Infinity |
|    8 |       NaN |
|    9 |       NaN |
+------+-----------+

mysql> select 
    cast('Nan' as double) = cast('Nan' as double) , 
    cast('Nan' as double) > cast('Inf' as double) , 
    cast('Nan' as double) > cast('123456.789' as double);
+-----------------------------------------------+-----------------------------------------------+------------------------------------------------------+
| cast('Nan' as double) = cast('Nan' as double) | cast('Nan' as double) > cast('Inf' as double) | cast('Nan' as double) > cast('123456.789' as double) |
+-----------------------------------------------+-----------------------------------------------+------------------------------------------------------+
|                                             1 |                                             1 |                                                    1 |
+-----------------------------------------------+-----------------------------------------------+------------------------------------------------------+

Date Functions

This optimization focuses on two key areas: date functions and time zone support, further enhancing the accuracy and applicability of data processing:

Unified Date Overflow Behavior

The behavior of numerous date functions in overflow scenarios (e.g., dates earlier than 0000-01-01 or later than 9999-12-31) has been standardized. Previously, different functions handled overflow inconsistently; now, all relevant functions uniformly return errors when date overflow is triggered, preventing data calculation deviations caused by abnormal results.
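
For example, pushing a date past the supported range now fails uniformly instead of producing an abnormal result (the error wording is illustrative):

SELECT DATE_ADD('9999-12-31', INTERVAL 1 DAY);
-- ERROR: datetime overflow: result is outside the supported range [0000-01-01, 9999-12-31]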

Expanded Date Function Support

The parameter signatures of some date-type functions have been upgraded from int32 to int64. This adjustment breaks the date range limitations of the original int32 type, enabling related functions to support date calculations over a wider span.

Improved Time Zone Support Documentation

Based on Doris' actual time zone management logic (see official documentation: https://doris.apache.org/zh-CN/docs/dev/admin-manual/cluster-management/time-zone), the time zone support content has been supplemented and clarified more precisely. This includes detailed explanations of the roles and modification methods of two core parameters (system_time_zone and time_zone), as well as the specific impact of time zones on date functions (e.g., FROM_UNIXTIME, UNIX_TIMESTAMP) and data import conversions. This provides users with clearer guidance for configuring and using time zone features.
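
For instance, the time_zone session variable directly changes the output of FROM_UNIXTIME:

SET time_zone = 'Asia/Shanghai';
SELECT FROM_UNIXTIME(0);   -- 1970-01-01 08:00:00

SET time_zone = 'UTC';
SELECT FROM_UNIXTIME(0);   -- 1970-01-01 00:00:00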
To build a truly Agent-Friendly database ecosystem and help large models understand Doris more accurately and deeply, we have systematically improved Doris' SQL Reference. This includes refining core content such as data type definitions, function definitions, and data transformation rules, laying a clear and reliable technical foundation for collaborative interaction between AI and databases.

This work has received support from many community members, whose wisdom and efforts have injected key vitality into the project. We warmly welcome more community colleagues to join us, collaborate in building a stronger ecosystem, and create greater value through collective efforts in technical exploration and ecological improvement.

5. Performance Optimization

TopN

The query pattern SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N is a typical TopN query, widely used in high-frequency scenarios such as log analysis, vector retrieval, and data exploration. Since such queries carry no filtering conditions, traditional execution requires a full-table scan and sort on large datasets, causing excessive unnecessary data reads and severe read amplification. The bottleneck is especially pronounced under high concurrency or with large stored datasets, making these queries an urgent optimization target.

To address this pain point, we have introduced the "Lazy Materialization" optimization mechanism, which splits TopN queries into two phases for efficient execution:

    1. First Phase: Read only the sort column (columnA) and the primary key/row identifier used to locate rows; quickly determine the target rows that satisfy LIMIT N through sorting.
    2. Second Phase: Precisely read all column data of the target rows based on the row identifiers.

In version 4.0, we have further extended this capability:

  • Support for TopN lazy materialization in multi-table joins.
  • Support for TopN lazy materialization in external table queries.

In these two new scenarios, this solution significantly reduces the amount of unnecessary column reading, fundamentally mitigating read amplification. In scenarios involving wide tables with small LIMIT values, the execution efficiency of TopN queries has been improved by dozens of times.
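
As an illustration of the newly supported join case (table and column names are hypothetical), a query such as the following now reads only the sort column and row identifiers during the sort phase, then fetches the remaining columns for just the 10 result rows:

SELECT *
FROM fact_events e
JOIN dim_users u ON e.user_id = u.user_id
ORDER BY e.event_time DESC
LIMIT 10;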

SQL Cache

SQL Cache was a feature provided in earlier versions of Doris, but its usage was restricted by many conditions and thus disabled by default. In this version, we have systematically addressed issues that could affect the correctness of SQL Cache results, such as:

  • Changes to query permissions for catalogs, databases, tables, or columns;
  • Modifications to session variables;
  • Presence of non-deterministic functions that cannot be simplified via constant folding rules.

After ensuring the correctness of SQL Cache results, this feature is now enabled by default.

We have also systematically optimized the SQL parsing performance of the query optimizer, improving it by 100x. For example, consider the following SQL, where big_view is a large view containing nested views, with a total of 163 joins and 17 unions:

SELECT *, now() as etl_time from big_view;

The SQL parsing time for this query has been reduced from 400ms to 2ms. This optimization not only benefits SQL Cache but also significantly improves Doris’ performance in high-concurrency query scenarios.

JSON Performance Optimization

JSON is a common storage format for semi-structured data. In version 4.0, we have also upgraded JSONB:

Added Support for Decimal Type

This completes the mapping between JSON Number types and Doris internal types (which previously covered Int8/Int16/Int32/Int64/Int128/Float/Double). It further meets storage and processing needs in high-precision numeric scenarios, avoiding the precision loss that type adaptation previously caused for large or high-precision values during JSON conversion, and it also facilitates the derivation of variant types.
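
A minimal illustration of the intent (the exact rendering of the value may vary):

-- With Decimal support, a high-precision number in JSON can be kept as a Decimal
-- internally instead of being narrowed to Double and losing digits.
SELECT CAST('{"price": 1234567890.123456789}' AS JSON);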

Systematic Performance Optimization for JSONB-Related Functions

Full-scale performance improvements have been made to JSONB-related functions (e.g., json_extract series, json_exists_path, json_type). After optimization, the execution efficiency of these functions has generally increased by over 30%, significantly accelerating the processing speed of high-frequency operations such as JSON field extraction, type judgment, and path validation. This provides stronger support for efficient analysis of semi-structured data.
For detailed information about related features, please refer to the official Doris documentation: https://doris.apache.org/zh-CN/docs/dev/sql-manual/basic-element/sql-data-types/semi-structured/JSON

6. More User-Friendly Resource Management

Version 4.0 optimizes the usage mechanism of workload groups:

  • Unified the definition methods for soft limits and hard limits of CPU and memory. There is no longer a need to enable soft/hard limit functions individually via various configuration items.
  • Simultaneous support for coexisting soft limits and hard limits in the same workload group.

This optimization not only simplifies parameter configuration but also enhances the flexibility of workload groups, enabling more precise fulfillment of diverse resource management needs.

CPU Resource Configuration

MIN_CPU_PERCENT and MAX_CPU_PERCENT define the minimum and maximum guaranteed CPU resources for all requests in a Workload Group when CPU contention occurs:

  • MAX_CPU_PERCENT (Maximum CPU Percentage): The upper limit of CPU bandwidth for the group. Regardless of the current CPU usage, the CPU utilization of the Workload Group will never exceed this value.
  • MIN_CPU_PERCENT (Minimum CPU Percentage): The reserved CPU bandwidth for the group. During contention, other groups cannot use this portion of bandwidth; however, when resources are idle, the group can use bandwidth exceeding MIN_CPU_PERCENT.

Example: Suppose a company’s sales and marketing departments share the same Doris instance. The sales department’s workload is CPU-intensive and includes high-priority queries; the marketing department’s workload is also CPU-intensive but with lower-priority queries. By creating separate Workload Groups for each department:

  • Allocate 40% MIN_CPU_PERCENT to the sales Workload Group.
  • Allocate 30% MAX_CPU_PERCENT to the marketing Workload Group.

This configuration ensures the sales workload gets the required CPU resources while preventing the marketing workload from impacting the sales department's CPU needs.
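
A sketch of the corresponding DDL, assuming lowercase property names matching the parameters above (group names are hypothetical; see the workload-group documentation for the exact syntax):

CREATE WORKLOAD GROUP sales_wg
PROPERTIES (
    "min_cpu_percent" = "40"   -- CPU bandwidth reserved for sales under contention
);

CREATE WORKLOAD GROUP marketing_wg
PROPERTIES (
    "max_cpu_percent" = "30"   -- hard ceiling for marketing, even when capacity is idle
);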

Memory Resource Configuration

MIN_MEMORY_PERCENT and MAX_MEMORY_PERCENT represent the minimum and maximum memory that a Workload Group can use:

  • MAX_MEMORY_PERCENT: When requests run in the group, their memory usage will never exceed this percentage of the total memory. If exceeded, the query will trigger spilling to disk or be terminated (killed).
  • MIN_MEMORY_PERCENT: The minimum memory reserved for the group. When resources are idle, the group can use memory exceeding MIN_MEMORY_PERCENT; however, when memory is insufficient, the system will allocate memory based on MIN_MEMORY_PERCENT. It may kill some queries to reduce the group’s memory usage to MIN_MEMORY_PERCENT, ensuring other Workload Groups have sufficient memory.

Integration with Spill to Disk

In this version, we have integrated the memory management capability of Workload Groups with the Spill to Disk feature. Users can not only control spilling by setting memory sizes for individual queries but also implement dynamic spilling via the slot mechanism of Workload Groups. The following policies are implemented for Workload Groups:

  • none (default): Disables the policy. Queries use as much memory as possible, but once the Workload Group’s memory limit is reached, spilling is triggered (no selection based on query size).
  • fixed: Memory available per query = workload group’s mem_limit * query_slot_count / max_concurrency. This policy allocates fixed memory to each query based on concurrency.
  • dynamic: Memory available per query = workload group’s mem_limit * query_slot_count / sum(running query slots). This primarily addresses unused slots in fixed mode; essentially, it triggers spilling for large queries first.

Both fixed and dynamic set hard limits for queries: exceeding the limit triggers spilling or killing, and they override the static memory allocation parameters set by users. Therefore, when configuring slot_memory_policy, ensure max_concurrency for the Workload Group is properly set to avoid memory shortages.
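
For example, under the fixed policy the per-query memory is a straightforward division. A sketch (group name and values are hypothetical; property names as described above):

ALTER WORKLOAD GROUP etl_wg
PROPERTIES (
    "slot_memory_policy" = "fixed",
    "max_concurrency"    = "4"
);
-- With the group's mem_limit at 40% of BE memory and max_concurrency = 4,
-- a query holding one slot may use 40% * 1 / 4 = 10% of BE memory before
-- spilling to disk or being killed.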

Summary

With the deep integration of AI and search capabilities, improved stability of offline computing, and dual optimizations for performance and usability, this new version of Apache Doris further expands the application boundaries of the database. It better supports enterprises’ full-scenario data analysis needs—from traditional BI analysis to AI intelligent retrieval, and from real-time queries to large-scale offline batch processing. Whether for real-time reports and user behavior analysis in industries such as internet, finance, and retail, or document retrieval and large-scale data batch processing in government and healthcare sectors, the new version of Doris provides users with a more efficient and reliable data analysis experience.
Currently, the new version of Apache Doris is available for download. Detailed feature documents and upgrade guides can be found on the official website (https://doris.apache.org/). Community users are welcome to experience and provide feedback!

Acknowledgments

Once again, we would like to express our most sincere gratitude to all contributors who participated in the R&D, testing, and requirement feedback for this version:
Pxl, walter, Gabriel, Mingyu Chen (Rayner), Mryange, morrySnow, zhangdong, lihangyu, zhangstar333, hui lai, Calvin Kirs, deardeng, Dongyang Li, Kaijie Chen, Xinyi Zou, minghong, meiyi, James / Jibing-Li, seawinde, abmdocrt, Yongqiang YANG, Sun Chenyang, wangbo, starocean999, Socrates / 苏小刚,Gavin Chou, 924060929, HappenLee, yiguolei, daidai, Lei Zhang, zhengyu, zy-kkk, zclllyybb /zclllhhjj, bobhan1, amory, zhiqiang, Jerry Hu, Xin Liao, Siyang Tang, LiBinfeng, Tiewei Fang, Luwei, huanghaibin, Qi Chen, TengJianPing, 谢健,Lightman, zhannngchen, koarz, xy720, kkop, HHoflittlefish777, xzj7019, Ashin Gau, lw112, plat1ko, shuke, yagagagaga, shee, zgxme, qiye, zfr95, slothever, Xujian Duan, Yulei-Yang, Jack, Kang, Lijia Liu, linrrarity, Petrichor, Thearas, Uniqueyou, dwdwqfwe, Refrain, catpineapple, smiletan, wudi, caiconghui, camby, zhangyuan, jakevin, Chester, Mingxi, Rijesh Kunhi Parambattu, admiring_xm, zxealous, XLPE, chunping, sparrow, xueweizhang, Adonis Ling, Jiwen liu, KassieZ, Liu Zhenlong, MoanasDaddyXu, Peyz, 神技圈子,133tosakarin, FreeOnePlus, Ryan19929, Yixuan Wang, htyoung, smallx, Butao Zhang, Ceng, GentleCold, GoGoWen, HonestManXin, Liqf, Luzhijing, Shuo Wang, Wen Zhenghu, Xr Ling, Zhiguo Wu, Zijie Lu, feifeifeimoon, heguanhui, toms, wudongliang, yangshijie, yongjinhou, yulihua, zhangm365, Amos Bird, AndersZ, Ganlin Zhao, Jeffrey, John Zhang, M1saka, SWEI, XueYuhai, Yao-MR, York Cao, caoliang-web, echo-dundun, huanghg1994, lide, lzy, nivane, nsivarajan, py023, vlt, wlong, zhaorongsheng, AlexYue, Arjun Anandkumar, Arnout Engelen, Benjaminwei, DayuanX, DogDu, DuRipeng, Emmanuel Ferdman, Fangyuan Deng, Guangdong Liu, HB, He xueyu, Hongkun Xu, Hu Yanjun, JinYang, KeeProMise, Muhammet Sakarya, Nya~, On-Work-Song, Shane, Stalary, StarryVerse, TsukiokaKogane, Udit Chaudhary, Xin Li, XnY-wei, Xu Chen, XuJianxu, XuPengfei, Yijia Su, ZhenchaoXu, cat-with-cat, elon-X, gnehil, hongyu guo, ivin, jw-sun, koi, liuzhenwei, msridhar78, noixcn, nsn_huang, peterylh, shouchengShen, spaces-x, wangchuang, wangjing, wangqt, wubiao, xuchenhao, xyf, yanmingfu, yi wang, z404289981, zjj, zzwwhh, İsmail Tosun, 赵硕
