@stephen-shelby stephen-shelby commented Dec 24, 2025

Why I'm doing:

Currently, StarRocks only supports default values for primitive types (INT, VARCHAR, etc.). Complex types like ARRAY, MAP, and STRUCT do not support default values, which limits user flexibility when defining table schemas with nested data structures.

What I'm doing:

This PR adds support for default values for complex types (ARRAY, MAP, STRUCT) in StarRocks.

1. Syntax

StarRocks uses the following syntax for complex type default values:

  • ARRAY: Use square brackets []

    ARRAY<INT> DEFAULT [1, 2, 3]
    ARRAY<VARCHAR(20)> DEFAULT ['a', 'b', 'c']
  • MAP: Use map{}

    MAP<VARCHAR(20), INT> DEFAULT map{'key1': 100, 'key2': 200}
    MAP<INT, VARCHAR(20)> DEFAULT map{1: 'value1', 2: 'value2'}
  • STRUCT: Use row()

    STRUCT<id INT, name VARCHAR(20)> DEFAULT row(1, 'alice')
    STRUCT<a INT, b VARCHAR(20), c DOUBLE> DEFAULT row(10, 'test', 3.14)

CREATE TABLE Example:

CREATE TABLE complex_defaults (
    id BIGINT,
    -- Simple types with defaults
    simple_int INT DEFAULT '0',
    simple_str VARCHAR(20) DEFAULT 'default',
    simple_bool BOOLEAN DEFAULT '1',
    simple_date DATE DEFAULT '2026-01-01',
    
    -- ARRAY types with various types
    arr_int ARRAY<INT> DEFAULT [1, 2, 3],
    arr_str ARRAY<VARCHAR(20)> DEFAULT ['a', 'b', 'c'],
    arr_bool ARRAY<BOOLEAN> DEFAULT [1, 0, 1],
    arr_date ARRAY<DATE> DEFAULT ['2024-01-01', '2024-06-01', '2024-12-31'],
    arr_datetime ARRAY<DATETIME> DEFAULT ['2024-01-01 10:00:00', '2024-12-31 23:59:59'],
    arr_empty ARRAY<INT> DEFAULT [],
    
    -- MAP types with various value types
    map_str_int MAP<VARCHAR(20), INT> DEFAULT map{'count': 100, 'total': 200},
    map_str_bool MAP<VARCHAR(20), BOOLEAN> DEFAULT map{'active': 1, 'deleted': 0},
    map_str_date MAP<VARCHAR(20), DATE> DEFAULT map{'start': '2024-01-01', 'end': '2024-12-31'},
    map_str_datetime MAP<VARCHAR(20), DATETIME> DEFAULT map{'created': '2024-01-01 10:00:00', 'updated': '2024-06-01 15:30:00'},
    map_empty MAP<VARCHAR(20), INT> DEFAULT map{},
    
    -- STRUCT types with mixed field types
    struct_simple STRUCT<id INT, name VARCHAR(20), active BOOLEAN> DEFAULT row(1, 'alice', 1),
    struct_multi STRUCT<
        id INT, 
        name VARCHAR(20), 
        created_date DATE, 
        updated_time DATETIME,
        is_valid BOOLEAN
    > DEFAULT row(10, 'test', '2024-01-01', '2024-01-01 12:00:00', 1),
    
    -- Nested types with various types
    arr_nested ARRAY<STRUCT<id INT, active BOOLEAN, tags ARRAY<VARCHAR(20)>>> 
        DEFAULT [row(1, 1, ['tag1', 'tag2']), row(2, 0, ['tag3'])],
    map_nested MAP<INT, STRUCT<name VARCHAR(20), join_date DATE, is_member BOOLEAN>> 
        DEFAULT map{1: row('alice', '2024-01-01', 1), 2: row('bob', '2024-06-01', 0)},
    struct_nested STRUCT<
        items ARRAY<INT>, 
        flags ARRAY<BOOLEAN>,
        dates MAP<VARCHAR(20), DATE>,
        status STRUCT<active BOOLEAN, created DATE>
    > DEFAULT row([1, 2, 3], [1, 0], map{'start': '2024-01-01', 'end': '2024-12-31'}, row(1, '2024-01-01')),
    
    -- Empty nested collections
    struct_with_empty STRUCT<
        id INT, 
        active BOOLEAN,
        tags ARRAY<VARCHAR(20)>, 
        attrs MAP<VARCHAR(20), INT>
    > DEFAULT row(0, 0, [], map{})
) properties("replication_num" = "1");

ALTER TABLE Examples:

-- Simple ARRAY
ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT [1, 2, 3];

-- Simple MAP
ALTER TABLE t ADD COLUMN config MAP<VARCHAR(20), INT> DEFAULT map{'timeout': 30, 'retry': 3};

-- Simple STRUCT
ALTER TABLE t ADD COLUMN user_info STRUCT<id INT, name VARCHAR(20), age INT> DEFAULT row(1, 'default_user', 18);

-- Nested types
ALTER TABLE t ADD COLUMN nested_arr ARRAY<ARRAY<INT>> DEFAULT [[1, 2], [3, 4]];
ALTER TABLE t ADD COLUMN nested_map MAP<INT, STRUCT<id INT, name VARCHAR(20)>> 
    DEFAULT map{1: row(1, 'alice'), 2: row(2, 'bob')};
ALTER TABLE t ADD COLUMN nested_struct STRUCT<items ARRAY<INT>, metadata MAP<VARCHAR(20), INT>> 
    DEFAULT row([1, 2, 3], map{'count': 3, 'version': 1});

-- Empty collections
ALTER TABLE t ADD COLUMN empty_arr ARRAY<INT> DEFAULT [];
ALTER TABLE t ADD COLUMN empty_map MAP<VARCHAR(20), INT> DEFAULT map{};

2. Limitations

  1. Literal Values Only: Default expressions must contain only literal values. Function calls (including CAST) and other non-literal expressions are NOT supported.

    -- ✅ Valid: literal values
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT [1, 2, 3];
    
    -- ❌ Invalid: CAST function not allowed
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT CAST('[1,2,3]' AS ARRAY<INT>);
    
    -- ❌ Invalid: function calls not allowed
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT array_concat([1, 2], [3, 4]);
  2. Fast Schema Evolution Required: Complex type default values are only supported when the table property fast_schema_evolution is true (the default for new tables).

    -- ✅ Works: fast_schema_evolution is true by default
    CREATE TABLE t (id INT) PROPERTIES("fast_schema_evolution" = "true");
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT [1, 2, 3];  -- Success
    
    -- ❌ Fails: fast_schema_evolution is false
    CREATE TABLE t (id INT) PROPERTIES("fast_schema_evolution" = "false");
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT [1, 2, 3];  -- Error
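
The literal-only rule amounts to a recursive walk over the parsed default expression: every node must be a literal, or an ARRAY/MAP/STRUCT constructor whose children are themselves literal-only. A minimal Python sketch of that check (the tuple-based node encoding here is invented for illustration; the actual FE validation operates on Expr classes such as ArrayExpr, MapExpr, and FunctionCallExpr):

```python
# Illustrative sketch only: node shapes are hypothetical stand-ins
# for the FE expression classes, not StarRocks code.

def is_literal_only(node):
    """Return True if the expression tree contains only literals and
    ARRAY/MAP/STRUCT constructors over literal-only children."""
    kind = node[0]
    if kind == "literal":                       # e.g. ("literal", 1)
        return True
    if kind == "array":                         # ("array", [child, ...])
        return all(is_literal_only(c) for c in node[1])
    if kind == "map":                           # ("map", [(key, value), ...])
        return all(is_literal_only(k) and is_literal_only(v)
                   for k, v in node[1])
    if kind == "row":                           # ("row", [field, ...])
        return all(is_literal_only(c) for c in node[1])
    return False                                # CAST, function calls, etc.
```

For example, DEFAULT [1, 2, 3] parses into an array node over three literals and passes, while DEFAULT array_concat([1, 2], [3, 4]) introduces a function-call node and is rejected.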

3. Implementation Details

3.1 Frontend (FE)

  • Expression Parsing: FE parses complex type literals ([], map{}, row()) into expression trees (ArrayExpr, MapExpr, FunctionCallExpr)
  • Expression Analysis: The parsed expression goes through analyze() phase
    • Type inference and validation
    • May automatically add CAST expressions for type compatibility
  • Validation:
    • User input must contain only literal values (no explicit CAST, no functions)
    • After analysis, internal CAST expressions are allowed (added by analyzer)
    • Validates type compatibility between default value and column type
    • Checks fast_schema_evolution property before allowing complex type default values
  • Persistence:
    • Column.defaultExpr: Stores the analyzed expression (may contain CAST)
      • Serialized to SQL string via Expr.toSql() and persisted to FE metadata (EditLog/Image)
      • Example: Original [1, 2, 3] → after analysis → [CAST(1 AS BIGINT), CAST(2 AS BIGINT), CAST(3 AS BIGINT)]
      • Re-parsed after FE restart via gsonPostProcess() to reconstruct the Expr object
    • Column.defaultValue: Transient field, NOT persisted in FE
  • SHOW CREATE TABLE Handling: To display SQL consistent with the user's original input:
    • SHOW CREATE TABLE / DESC use validateComplexTypeDefaultExpressionForm() to strip auto-added CAST expressions
    • This ensures displayed SQL matches what the user originally wrote
    • Example: Internally stored [CAST(1 AS BIGINT), ...] → displayed as [1, 2, 3]
  • Thrift Serialization: When sending schema to BE, FE serializes the analyzed expression tree (with CASTs) to Thrift format:
    • TColumn.default_expr: The complete analyzed expression tree serialized as TExpr (Thrift format)
    • FE sends the full expression tree structure, not a JSON string
  • Schema Propagation Paths: FE sends schema to BE through multiple paths:
    • DDL operations: via TCreateTabletReq, TAlterTabletReqV2
    • DML operations (INSERT/UPDATE): via TOlapTableSchemaParam in OlapTableSink
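
The round trip between analysis and display described above can be sketched as follows: the analyzer wraps literals in CASTs for type compatibility, and SHOW CREATE walks the tree to unwrap them before rendering. A simplified Python sketch (the tuple-based nodes and helper names are hypothetical, not the actual validateComplexTypeDefaultExpressionForm() implementation):

```python
# Illustrative sketch only: node encoding and helpers are invented
# to show the shape of the CAST-stripping round trip.

def strip_casts(node):
    """Recursively remove CAST wrappers added by the analyzer,
    recovering the user's original literal form."""
    kind = node[0]
    if kind == "cast":                      # ("cast", "BIGINT", child)
        return strip_casts(node[2])
    if kind == "array":
        return ("array", [strip_casts(c) for c in node[1]])
    if kind == "map":
        return ("map", [(strip_casts(k), strip_casts(v)) for k, v in node[1]])
    if kind == "row":
        return ("row", [strip_casts(c) for c in node[1]])
    return node                             # literals pass through

def to_sql(node):
    """Render a stripped tree back to DEFAULT-clause syntax."""
    kind = node[0]
    if kind == "literal":
        v = node[1]
        return repr(v) if isinstance(v, str) else str(v)
    if kind == "array":
        return "[" + ", ".join(to_sql(c) for c in node[1]) + "]"
    if kind == "map":
        return "map{" + ", ".join(f"{to_sql(k)}: {to_sql(v)}"
                                  for k, v in node[1]) + "}"
    if kind == "row":
        return "row(" + ", ".join(to_sql(c) for c in node[1]) + ")"
    raise ValueError(kind)
```

So an internally stored [CAST(1 AS BIGINT), CAST(2 AS BIGINT), CAST(3 AS BIGINT)] renders back as the original [1, 2, 3].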

3.2 Backend (BE)

  • Expression to JSON Conversion: BE receives TExpr (Thrift expression tree) from FE via TColumn.default_expr
    • preprocess_default_expr_for_tcolumns() function converts TExpr to JSON string
    • This conversion happens when BE first receives the schema (DDL/DML paths)
    • Example: TExpr (ArrayExpr with [1,2,3]) → evaluates to JSON string "[1,2,3]"
    • The JSON string is stored in TColumn.default_value for later use
  • Protobuf Persistence: BE persists the JSON string in TabletSchemaPB.ColumnPB.default_value
    • Stored in tablet metadata files on disk
    • Ensures BE can provide default values after restart without re-evaluating expressions
    • Only the JSON string is persisted, not the original TExpr
  • Runtime JSON Parsing: When providing default values during query execution:
    1. DefaultValueColumnIterator::init() reads JSON string from _default_value (loaded from TabletSchema)
    2. Parses the JSON string with VelocyPack: JsonValue::parse_json_or_string() → vpack::Slice
    3. Converts VPack slice to Datum using VPackToDatumCaster
  • VPackToDatumCaster: Recursively converts VPack JSON slices to Datum objects:
    • Template-based implementation (cast_primitive<TYPE>(), cast_decimal<DecimalType>()) eliminates code duplication
    • Dedicated methods for each complex type (cast_array(), cast_map(), cast_struct())
    • Handles nested structures with arbitrary depth
    • Preserves field order for MAP and STRUCT using sequential iteration (ObjectIterator(slice, true))
  • DefaultValueColumnIterator: Provides default Datum values during query execution:
    • Segment scan: When reading old data files that don't contain newly added columns
    • Partial updates: Both column mode (rowset_column_update_state.cpp) and row mode (tablet_updates.cpp)
    • Schema change: When materialized columns require default values
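
The VPack-to-Datum step above is essentially a type-directed recursion over the parsed JSON. A much-simplified Python sketch of the same shape (Python's json module stands in for VelocyPack, plain lists/tuples for Datum, and the type-descriptor tuples are invented for illustration):

```python
import json
from datetime import date

# Illustrative sketch only: mirrors the shape of cast_primitive /
# cast_array / cast_map / cast_struct, not the actual BE code.

def json_to_value(node, ty):
    """Recursively cast a parsed JSON node to a typed value."""
    kind = ty[0]
    if kind == "int":
        return int(node)
    if kind == "varchar":
        return str(node)
    if kind == "date":
        return date.fromisoformat(node)
    if kind == "array":                      # ("array", elem_type)
        return [json_to_value(e, ty[1]) for e in node]
    if kind == "map":                        # ("map", key_type, value_type)
        # dicts preserve insertion order, analogous to the sequential
        # ObjectIterator(slice, true) used to keep MAP/STRUCT field order
        return [(json_to_value(k, ty[1]), json_to_value(v, ty[2]))
                for k, v in node.items()]
    if kind == "struct":                     # ("struct", [(name, field_ty), ...])
        return tuple(json_to_value(node[name], ft) for name, ft in ty[1])
    raise ValueError(f"unsupported type: {kind}")
```

For instance, the persisted JSON string '[{"id": 1, "tags": ["a", "b"]}]' for an ARRAY&lt;STRUCT&lt;id INT, tags ARRAY&lt;VARCHAR&gt;&gt;&gt; column recurses through array → struct → array before producing the typed default.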

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Enables DEFAULT values for complex types (ARRAY/MAP/STRUCT) across the stack.

  • FE: allow DEFAULT complex literals via grammar; analyze/validate literal-only expressions (no funcs/CAST), require fast_schema_evolution=true; persist analyzed Expr in Column.defaultExpr; serialize as TExpr (new TColumn.default_expr); InsertPlanner uses expr; SHOW CREATE strips analyzer-added CASTs; re-validation on load.
  • BE: evaluate TExpr default expressions to JSON (convert_default_expr_to_json_string) and inject into TColumn.default_value via preprocess_default_expr_for_tcolumns; integrate in create-table, schema change, scanners; extend JSON casting (cast_type_to_json_str with ordered structs) and add robust VPack→Datum for arrays/maps/structs; DefaultValueColumnIterator now parses JSON defaults for complex types; minor JSON converter support for DATE/DATETIME.
  • Wire-ups: agent task, lake/schema_change, scan paths call preprocessing when columns provided.
  • Tests: new FE unit tests and end-to-end SQL suites covering correctness, nesting, display, and fast-schema-evolution guardrails.

Written by Cursor Bugbot for commit 35f6f2d.

@stephen-shelby stephen-shelby requested review from a team as code owners December 24, 2025 06:09
@wanpengfei-git wanpengfei-git requested a review from a team December 24, 2025 06:09
@wanpengfei-git wanpengfei-git requested a review from a team December 24, 2025 07:03
@stephen-shelby stephen-shelby changed the title from "[Enhancement] support default value for array/map/struct type column" to "[Enhancement] Support default values for complex types (ARRAY, MAP, STRUCT)" on Dec 24, 2025
@github-actions

[BE Incremental Coverage Report]

fail : 84 / 121 (69.42%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 src/exec/olap_meta_scanner.cpp 0 5 00.00% [62, 63, 64, 65, 67]
🔵 src/agent/agent_task.cpp 0 10 00.00% [221, 226, 227, 228, 229, 230, 641, 642, 643, 646]
🔵 src/exec/pipeline/scan/olap_chunk_source.cpp 0 1 00.00% [44]
🔵 src/exprs/cast_expr_json.cpp 8 13 61.54% [47, 48, 130, 174, 228]
🔵 src/storage/rowset/default_value_column_iterator.cpp 58 74 78.38% [51, 64, 75, 83, 92, 125, 126, 127, 131, 132, 133, 134, 136, 143, 148, 149]
🔵 src/storage/lake/schema_change.cpp 6 6 100.00% []
🔵 src/storage/schema_change.cpp 3 3 100.00% []
🔵 src/exec/tablet_info.cpp 7 7 100.00% []
🔵 src/storage/metadata_util.cpp 2 2 100.00% []

@alvin-celerdata

@cursor review

[[maybe_unused]] auto ret = dst->append_strings(slices);
} else {
dst->append_value_multiple_times(_mem_value, to_read);
_current_rowid = range.end();

Row counter not updated for certain column types

In the next_batch(const SparseRange<>& range, ...) function, _current_rowid is not updated when the column type is TYPE_OBJECT, TYPE_HLL, or TYPE_PERCENTILE. The change moved _current_rowid = range.end(); from outside the if-else block (where it was always executed) to inside only the else branch. This means the row counter won't advance for these column types, potentially causing incorrect iteration behavior during segment scans. The first next_batch(size_t* n, ...) overload correctly updates _current_rowid after the if-else for all types, showing this is inconsistent.


if (!expr.getType().equals(targetColumn.getType())) {
expr = new CastExpr(targetColumn.getType(), expr);
}
scalarOperator = SqlToScalarOperatorTranslator.translate(expr);

Inconsistent null check leads to potential NPE

The hasExprObject() method in DefaultExpr returns true when either exprObject != null or serializedExpr != null. However, obtainExpr() only returns a non-null value when exprObject != null. If deserialization fails in gsonPostProcess() (which only logs a warning), exprObject remains null while serializedExpr is still set. This creates an inconsistent state where hasExprObject() returns true but obtainExpr() returns null, causing a NullPointerException at line 679 when calling expr.getType().



@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 178 / 214 (83.18%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/sql/InsertPlanner.java 2 7 28.57% [685, 686, 687, 689, 690]
🔵 com/starrocks/catalog/Column.java 30 40 75.00% [688, 800, 831, 963, 964, 965, 966, 967, 968, 1036]
🔵 com/starrocks/catalog/DefaultExpr.java 61 76 80.26% [90, 91, 104, 105, 106, 107, 146, 223, 229, 230, 231, 232, 244, 245, 247]
🔵 com/starrocks/sql/analyzer/ColumnDefAnalyzer.java 53 59 89.83% [337, 338, 339, 344, 367, 368]
🔵 com/starrocks/alter/SchemaChangeHandler.java 6 6 100.00% []
🔵 com/starrocks/sql/analyzer/CreateTableAnalyzer.java 17 17 100.00% []
🔵 com/starrocks/sql/analyzer/AnalyzerUtils.java 9 9 100.00% []

@alvin-celerdata

@cursor review

LOG(WARNING) << "Failed to preprocess default_expr in OlapTableSchemaParam::init: "
<< preprocess_status.to_string();
columns_copy = t_index.column_param.columns;
}

Error handling fallback code unreachable due to always-OK return

The preprocess_default_expr_for_tcolumns function always returns Status::OK() even when individual column conversions fail (it only logs errors internally). The call site in tablet_info.cpp checks for failure and has fallback logic to reset columns_copy to the original columns, but this fallback code is unreachable. When some columns fail preprocessing, the code proceeds with partially-preprocessed columns instead of falling back to the originals as apparently intended. This could lead to inconsistent column configurations where some columns have converted defaults and others don't.

