@stephen-shelby stephen-shelby commented Dec 24, 2025

Why I'm doing:

Currently, StarRocks only supports default values for primitive types (INT, VARCHAR, etc.). Complex types like ARRAY, MAP, and STRUCT do not support default values, which limits user flexibility when defining table schemas with nested data structures.

What I'm doing:

This PR adds support for default values for complex types (ARRAY, MAP, STRUCT) in StarRocks.

1. Syntax

StarRocks uses the following syntax for complex type default values:

  • ARRAY: Use square brackets []

    ARRAY<INT> DEFAULT [1, 2, 3]
    ARRAY<VARCHAR(20)> DEFAULT ['a', 'b', 'c']
  • MAP: Use map{}

    MAP<VARCHAR(20), INT> DEFAULT map{'key1': 100, 'key2': 200}
    MAP<INT, VARCHAR(20)> DEFAULT map{1: 'value1', 2: 'value2'}
  • STRUCT: Use row()

    STRUCT<id INT, name VARCHAR(20)> DEFAULT row(1, 'alice')
    STRUCT<a INT, b VARCHAR(20), c DOUBLE> DEFAULT row(10, 'test', 3.14)

CREATE TABLE Example:

CREATE TABLE complex_defaults (
    id BIGINT,
    -- Simple types with defaults
    simple_int INT DEFAULT '0',
    simple_str VARCHAR(20) DEFAULT 'default',
    simple_bool BOOLEAN DEFAULT '1',
    simple_date DATE DEFAULT '2026-01-01',
    
    -- ARRAY types with various types
    arr_int ARRAY<INT> DEFAULT [1, 2, 3],
    arr_str ARRAY<VARCHAR(20)> DEFAULT ['a', 'b', 'c'],
    arr_bool ARRAY<BOOLEAN> DEFAULT [1, 0, 1],
    arr_date ARRAY<DATE> DEFAULT ['2024-01-01', '2024-06-01', '2024-12-31'],
    arr_datetime ARRAY<DATETIME> DEFAULT ['2024-01-01 10:00:00', '2024-12-31 23:59:59'],
    arr_empty ARRAY<INT> DEFAULT [],
    
    -- MAP types with various value types
    map_str_int MAP<VARCHAR(20), INT> DEFAULT map{'count': 100, 'total': 200},
    map_str_bool MAP<VARCHAR(20), BOOLEAN> DEFAULT map{'active': 1, 'deleted': 0},
    map_str_date MAP<VARCHAR(20), DATE> DEFAULT map{'start': '2024-01-01', 'end': '2024-12-31'},
    map_str_datetime MAP<VARCHAR(20), DATETIME> DEFAULT map{'created': '2024-01-01 10:00:00', 'updated': '2024-06-01 15:30:00'},
    map_empty MAP<VARCHAR(20), INT> DEFAULT map{},
    
    -- STRUCT types with mixed field types
    struct_simple STRUCT<id INT, name VARCHAR(20), active BOOLEAN> DEFAULT row(1, 'alice', 1),
    struct_multi STRUCT<
        id INT, 
        name VARCHAR(20), 
        created_date DATE, 
        updated_time DATETIME,
        is_valid BOOLEAN
    > DEFAULT row(10, 'test', '2024-01-01', '2024-01-01 12:00:00', 1),
    
    -- Nested types with various types
    arr_nested ARRAY<STRUCT<id INT, active BOOLEAN, tags ARRAY<VARCHAR(20)>>> 
        DEFAULT [row(1, 1, ['tag1', 'tag2']), row(2, 0, ['tag3'])],
    map_nested MAP<INT, STRUCT<name VARCHAR(20), join_date DATE, is_member BOOLEAN>> 
        DEFAULT map{1: row('alice', '2024-01-01', 1), 2: row('bob', '2024-06-01', 0)},
    struct_nested STRUCT<
        items ARRAY<INT>, 
        flags ARRAY<BOOLEAN>,
        dates MAP<VARCHAR(20), DATE>,
        status STRUCT<active BOOLEAN, created DATE>
    > DEFAULT row([1, 2, 3], [1, 0], map{'start': '2024-01-01', 'end': '2024-12-31'}, row(1, '2024-01-01')),
    
    -- Empty nested collections
    struct_with_empty STRUCT<
        id INT, 
        active BOOLEAN,
        tags ARRAY<VARCHAR(20)>, 
        attrs MAP<VARCHAR(20), INT>
    > DEFAULT row(0, 0, [], map{})
) properties("replication_num" = "1");

ALTER TABLE Examples:

-- Simple ARRAY
ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT [1, 2, 3];

-- Simple MAP
ALTER TABLE t ADD COLUMN config MAP<VARCHAR(20), INT> DEFAULT map{'timeout': 30, 'retry': 3};

-- Simple STRUCT
ALTER TABLE t ADD COLUMN user_info STRUCT<id INT, name VARCHAR(20), age INT> DEFAULT row(1, 'default_user', 18);

-- Nested types
ALTER TABLE t ADD COLUMN nested_arr ARRAY<ARRAY<INT>> DEFAULT [[1, 2], [3, 4]];
ALTER TABLE t ADD COLUMN nested_map MAP<INT, STRUCT<id INT, name VARCHAR(20)>> 
    DEFAULT map{1: row(1, 'alice'), 2: row(2, 'bob')};
ALTER TABLE t ADD COLUMN nested_struct STRUCT<items ARRAY<INT>, metadata MAP<VARCHAR(20), INT>> 
    DEFAULT row([1, 2, 3], map{'count': 3, 'version': 1});

-- Empty collections
ALTER TABLE t ADD COLUMN empty_arr ARRAY<INT> DEFAULT [];
ALTER TABLE t ADD COLUMN empty_map MAP<VARCHAR(20), INT> DEFAULT map{};

2. Limitations

  1. Literal Values Only: Default expressions must contain only literal values. Function calls (including CAST) and other non-literal expressions are NOT supported.

    -- ✅ Valid: literal values
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT [1, 2, 3];
    
    -- ❌ Invalid: CAST function not allowed
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT CAST('[1,2,3]' AS ARRAY<INT>);
    
    -- ❌ Invalid: function calls not allowed
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT array_concat([1, 2], [3, 4]);
  2. Fast Schema Evolution Required: Complex type default values are only supported when the table property fast_schema_evolution is true (the default for new tables).

    -- ✅ Works: fast_schema_evolution is true by default
    CREATE TABLE t (id INT) PROPERTIES("fast_schema_evolution" = "true");
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT [1, 2, 3];  -- Success
    
    -- ❌ Fails: fast_schema_evolution is false
    CREATE TABLE t (id INT) PROPERTIES("fast_schema_evolution" = "false");
    ALTER TABLE t ADD COLUMN arr ARRAY<INT> DEFAULT [1, 2, 3];  -- Error
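
The literal-only rule amounts to a recursive walk over the parsed default expression: every node must be a literal, or an ARRAY/MAP/STRUCT constructor whose children are themselves literal-only. A minimal Python sketch of that check (the tuple-based node encoding here is invented for illustration; the actual FE validation operates on Expr classes such as ArrayExpr, MapExpr, and FunctionCallExpr):

```python
# Illustrative sketch only: node shapes are hypothetical stand-ins
# for the FE expression classes, not StarRocks code.

def is_literal_only(node):
    """Return True if the expression tree contains only literals and
    ARRAY/MAP/STRUCT constructors over literal-only children."""
    kind = node[0]
    if kind == "literal":                       # e.g. ("literal", 1)
        return True
    if kind == "array":                         # ("array", [child, ...])
        return all(is_literal_only(c) for c in node[1])
    if kind == "map":                           # ("map", [(key, value), ...])
        return all(is_literal_only(k) and is_literal_only(v)
                   for k, v in node[1])
    if kind == "row":                           # ("row", [field, ...])
        return all(is_literal_only(c) for c in node[1])
    return False                                # CAST, function calls, etc.
```

For example, DEFAULT [1, 2, 3] parses into an array node over three literals and passes, while DEFAULT array_concat([1, 2], [3, 4]) introduces a function-call node and is rejected.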

3. Implementation Details

3.1 Frontend (FE)

  • Expression Parsing: FE parses complex type literals ([], map{}, row()) into expression trees (ArrayExpr, MapExpr, FunctionCallExpr)
  • Expression Analysis: The parsed expression goes through analyze() phase
    • Type inference and validation
    • May automatically add CAST expressions for type compatibility
  • Validation:
    • User input must contain only literal values (no explicit CAST, no functions)
    • After analysis, internal CAST expressions are allowed (added by analyzer)
    • Validates type compatibility between default value and column type
    • Checks fast_schema_evolution property before allowing complex type default values
  • Persistence:
    • Column.defaultExpr: Stores the analyzed expression (may contain CAST)
      • Serialized to SQL string via Expr.toSql() and persisted to FE metadata (EditLog/Image)
      • Example: Original [1, 2, 3] → after analysis → [CAST(1 AS BIGINT), CAST(2 AS BIGINT), CAST(3 AS BIGINT)]
      • Re-parsed after FE restart via gsonPostProcess() to reconstruct the Expr object
    • Column.defaultValue: Transient field, NOT persisted in FE
  • SHOW CREATE TABLE Handling: To display SQL consistent with the user's original input:
    • SHOW CREATE TABLE / DESC use validateComplexTypeDefaultExpressionForm() to strip auto-added CAST expressions
    • This ensures displayed SQL matches what the user originally wrote
    • Example: Internally stored [CAST(1 AS BIGINT), ...] → displayed as [1, 2, 3]
  • Thrift Serialization: When sending schema to BE, FE serializes the analyzed expression tree (with CASTs) to Thrift format:
    • TColumn.default_expr: The complete analyzed expression tree serialized as TExpr (Thrift format)
    • FE sends the full expression tree structure, not a JSON string
  • Schema Propagation Paths: FE sends schema to BE through multiple paths:
    • DDL operations: via TCreateTabletReq, TAlterTabletReqV2
    • DML operations (INSERT/UPDATE): via TOlapTableSchemaParam in OlapTableSink
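
The round trip between analysis and display described above can be sketched as follows: the analyzer wraps literals in CASTs for type compatibility, and SHOW CREATE walks the tree to unwrap them before rendering. A simplified Python sketch (the tuple-based nodes and helper names are hypothetical, not the actual validateComplexTypeDefaultExpressionForm() implementation):

```python
# Illustrative sketch only: node encoding and helpers are invented
# to show the shape of the CAST-stripping round trip.

def strip_casts(node):
    """Recursively remove CAST wrappers added by the analyzer,
    recovering the user's original literal form."""
    kind = node[0]
    if kind == "cast":                      # ("cast", "BIGINT", child)
        return strip_casts(node[2])
    if kind == "array":
        return ("array", [strip_casts(c) for c in node[1]])
    if kind == "map":
        return ("map", [(strip_casts(k), strip_casts(v)) for k, v in node[1]])
    if kind == "row":
        return ("row", [strip_casts(c) for c in node[1]])
    return node                             # literals pass through

def to_sql(node):
    """Render a stripped tree back to DEFAULT-clause syntax."""
    kind = node[0]
    if kind == "literal":
        v = node[1]
        return repr(v) if isinstance(v, str) else str(v)
    if kind == "array":
        return "[" + ", ".join(to_sql(c) for c in node[1]) + "]"
    if kind == "map":
        return "map{" + ", ".join(f"{to_sql(k)}: {to_sql(v)}"
                                  for k, v in node[1]) + "}"
    if kind == "row":
        return "row(" + ", ".join(to_sql(c) for c in node[1]) + ")"
    raise ValueError(kind)
```

So an internally stored [CAST(1 AS BIGINT), CAST(2 AS BIGINT), CAST(3 AS BIGINT)] renders back as the original [1, 2, 3].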

3.2 Backend (BE)

  • Expression to JSON Conversion: BE receives TExpr (Thrift expression tree) from FE via TColumn.default_expr
    • preprocess_default_expr_for_tcolumns() function converts TExpr to JSON string
    • This conversion happens when BE first receives the schema (DDL/DML paths)
    • Example: TExpr (ArrayExpr with [1,2,3]) → evaluates to JSON string "[1,2,3]"
    • The JSON string is stored in TColumn.default_value for later use
  • Protobuf Persistence: BE persists the JSON string in TabletSchemaPB.ColumnPB.default_value
    • Stored in tablet metadata files on disk
    • Ensures BE can provide default values after restart without re-evaluating expressions
    • Only the JSON string is persisted, not the original TExpr
  • Runtime JSON Parsing: When providing default values during query execution:
    1. DefaultValueColumnIterator::init() reads JSON string from _default_value (loaded from TabletSchema)
    2. Parses the JSON string with VelocyPack: JsonValue::parse_json_or_string() → vpack::Slice
    3. Converts VPack slice to Datum using VPackToDatumCaster
  • VPackToDatumCaster: Recursively converts VPack JSON slices to Datum objects:
    • Template-based implementation (cast_primitive<TYPE>(), cast_decimal<DecimalType>()) eliminates code duplication
    • Dedicated methods for each complex type (cast_array(), cast_map(), cast_struct())
    • Handles nested structures with arbitrary depth
    • Preserves field order for MAP and STRUCT using sequential iteration (ObjectIterator(slice, true))
  • DefaultValueColumnIterator: Provides default Datum values during query execution:
    • Segment scan: When reading old data files that don't contain newly added columns
    • Partial updates: Both column mode (rowset_column_update_state.cpp) and row mode (tablet_updates.cpp)
    • Schema change: When materialized columns require default values
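
The VPack-to-Datum step above is essentially a type-directed recursion over the parsed JSON. A much-simplified Python sketch of the same shape (Python's json module stands in for VelocyPack, plain lists/tuples for Datum, and the type-descriptor tuples are invented for illustration):

```python
import json
from datetime import date

# Illustrative sketch only: mirrors the shape of cast_primitive /
# cast_array / cast_map / cast_struct, not the actual BE code.

def json_to_value(node, ty):
    """Recursively cast a parsed JSON node to a typed value."""
    kind = ty[0]
    if kind == "int":
        return int(node)
    if kind == "varchar":
        return str(node)
    if kind == "date":
        return date.fromisoformat(node)
    if kind == "array":                      # ("array", elem_type)
        return [json_to_value(e, ty[1]) for e in node]
    if kind == "map":                        # ("map", key_type, value_type)
        # dicts preserve insertion order, analogous to the sequential
        # ObjectIterator(slice, true) used to keep MAP/STRUCT field order
        return [(json_to_value(k, ty[1]), json_to_value(v, ty[2]))
                for k, v in node.items()]
    if kind == "struct":                     # ("struct", [(name, field_ty), ...])
        return tuple(json_to_value(node[name], ft) for name, ft in ty[1])
    raise ValueError(f"unsupported type: {kind}")
```

For instance, the persisted JSON string '[{"id": 1, "tags": ["a", "b"]}]' for an ARRAY&lt;STRUCT&lt;id INT, tags ARRAY&lt;VARCHAR&gt;&gt;&gt; column recurses through array → struct → array before producing the typed default.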

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Enables DEFAULT values for complex types (ARRAY/MAP/STRUCT) across the stack.

  • FE: allow DEFAULT complex literals via grammar; analyze/validate literal-only expressions (no funcs/CAST), require fast_schema_evolution=true; persist analyzed Expr in Column.defaultExpr; serialize as TExpr (new TColumn.default_expr); InsertPlanner uses expr; SHOW CREATE strips analyzer-added CASTs; re-validation on load.
  • BE: evaluate TExpr default expressions to JSON (convert_default_expr_to_json_string) and inject into TColumn.default_value via preprocess_default_expr_for_tcolumns; integrate in create-table, schema change, scanners; extend JSON casting (cast_type_to_json_str with ordered structs) and add robust VPack→Datum for arrays/maps/structs; DefaultValueColumnIterator now parses JSON defaults for complex types; minor JSON converter support for DATE/DATETIME.
  • Wire-ups: agent task, lake/schema_change, scan paths call preprocessing when columns provided.
  • Tests: new FE unit tests and end-to-end SQL suites covering correctness, nesting, display, and fast-schema-evolution guardrails.

Written by Cursor Bugbot for commit 35f6f2d.

@stephen-shelby stephen-shelby requested review from a team as code owners December 24, 2025 06:09
@wanpengfei-git wanpengfei-git requested a review from a team December 24, 2025 06:09
@wanpengfei-git wanpengfei-git requested a review from a team December 24, 2025 07:03
@stephen-shelby stephen-shelby changed the title from "[Enhancement] support default value for array/map/struct type column" to "[Enhancement] Support default values for complex types (ARRAY, MAP, STRUCT)" on Dec 24, 2025
@github-actions

[BE Incremental Coverage Report]

fail : 84 / 121 (69.42%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 src/exec/olap_meta_scanner.cpp 0 5 00.00% [62, 63, 64, 65, 67]
🔵 src/agent/agent_task.cpp 0 10 00.00% [221, 226, 227, 228, 229, 230, 641, 642, 643, 646]
🔵 src/exec/pipeline/scan/olap_chunk_source.cpp 0 1 00.00% [44]
🔵 src/exprs/cast_expr_json.cpp 8 13 61.54% [47, 48, 130, 174, 228]
🔵 src/storage/rowset/default_value_column_iterator.cpp 58 74 78.38% [51, 64, 75, 83, 92, 125, 126, 127, 131, 132, 133, 134, 136, 143, 148, 149]
🔵 src/storage/lake/schema_change.cpp 6 6 100.00% []
🔵 src/storage/schema_change.cpp 3 3 100.00% []
🔵 src/exec/tablet_info.cpp 7 7 100.00% []
🔵 src/storage/metadata_util.cpp 2 2 100.00% []

@alvin-celerdata

@cursor review

[[maybe_unused]] auto ret = dst->append_strings(slices);
} else {
dst->append_value_multiple_times(_mem_value, to_read);
_current_rowid = range.end();

Row counter not updated for certain column types

In the next_batch(const SparseRange<>& range, ...) function, _current_rowid is not updated when the column type is TYPE_OBJECT, TYPE_HLL, or TYPE_PERCENTILE. The change moved _current_rowid = range.end(); from outside the if-else block (where it was always executed) to inside only the else branch. This means the row counter won't advance for these column types, potentially causing incorrect iteration behavior during segment scans. The first next_batch(size_t* n, ...) overload correctly updates _current_rowid after the if-else for all types, showing this is inconsistent.


if (!expr.getType().equals(targetColumn.getType())) {
expr = new CastExpr(targetColumn.getType(), expr);
}
scalarOperator = SqlToScalarOperatorTranslator.translate(expr);

Inconsistent null check leads to potential NPE

The hasExprObject() method in DefaultExpr returns true when either exprObject != null or serializedExpr != null. However, obtainExpr() only returns a non-null value when exprObject != null. If deserialization fails in gsonPostProcess() (which only logs a warning), exprObject remains null while serializedExpr is still set. This creates an inconsistent state where hasExprObject() returns true but obtainExpr() returns null, causing a NullPointerException at line 679 when calling expr.getType().



@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 178 / 214 (83.18%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/sql/InsertPlanner.java 2 7 28.57% [685, 686, 687, 689, 690]
🔵 com/starrocks/catalog/Column.java 30 40 75.00% [688, 800, 831, 963, 964, 965, 966, 967, 968, 1036]
🔵 com/starrocks/catalog/DefaultExpr.java 61 76 80.26% [90, 91, 104, 105, 106, 107, 146, 223, 229, 230, 231, 232, 244, 245, 247]
🔵 com/starrocks/sql/analyzer/ColumnDefAnalyzer.java 53 59 89.83% [337, 338, 339, 344, 367, 368]
🔵 com/starrocks/alter/SchemaChangeHandler.java 6 6 100.00% []
🔵 com/starrocks/sql/analyzer/CreateTableAnalyzer.java 17 17 100.00% []
🔵 com/starrocks/sql/analyzer/AnalyzerUtils.java 9 9 100.00% []

@alvin-celerdata

@cursor review

LOG(WARNING) << "Failed to preprocess default_expr in OlapTableSchemaParam::init: "
<< preprocess_status.to_string();
columns_copy = t_index.column_param.columns;
}

Error handling fallback code unreachable due to always-OK return

The preprocess_default_expr_for_tcolumns function always returns Status::OK() even when individual column conversions fail (it only logs errors internally). The call site in tablet_info.cpp checks for failure and has fallback logic to reset columns_copy to the original columns, but this fallback code is unreachable. When some columns fail preprocessing, the code proceeds with partially-preprocessed columns instead of falling back to the originals as apparently intended. This could lead to inconsistent column configurations where some columns have converted defaults and others don't.

