[SPARK-56460][CORE] define configs in text files #53488

Open
cloud-fan wants to merge 6 commits into apache:master from cloud-fan:conf

Conversation


@cloud-fan cloud-fan commented Dec 16, 2025

What changes were proposed in this pull request?

This PR adds a new module common/config which introduces a framework to define configs in prototext files. When the Spark application starts, we load config definitions from all prototext files and put them in a map with the config name as the key and the proto binary of the config definition as the value. These proto-backed configs are also registered in ConfigEntry.knownConfigs via a wrapper, ProtoBackedConfigEntry, together with existing Spark configs.
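As a rough illustration of that loading flow, here is a minimal sketch. ConfigRegistrySketch and its members are invented names for illustration only; the real framework stores the serialized proto ConfigEntry message and wraps it in ProtoBackedConfigEntry.

```scala
// Hypothetical stand-in for the registry described above. Definitions
// loaded from prototext files are kept in a map keyed by config name;
// in the real framework the value is the serialized proto message.
object ConfigRegistrySketch {
  private val definitions =
    scala.collection.mutable.Map.empty[String, Array[Byte]]

  def register(key: String, protoBinary: Array[Byte]): Unit = {
    // Reject duplicates so two textproto files cannot define the same key.
    require(!definitions.contains(key), s"Duplicate config key: $key")
    definitions(key) = protoBinary
  }

  def contains(key: String): Boolean = definitions.contains(key)

  // Listing all configs is a map traversal, with no class loading involved.
  def allKeys: Set[String] = definitions.keySet.toSet
}
```

Because every definition lives in the map, enumerating configs no longer requires loading the Scala objects that used to register them.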

The proto schema defines each config entry with: key, value type, default value, scope (cluster vs session), mutability (static vs dynamic, i.e. whether the config can be changed after system initialization), visibility (public vs internal), binding policy (session/persisted/not_applicable for SQL views, UDFs, and procedures), documentation, and version. Enum values follow proto3 naming conventions with type-name prefixes (e.g. SCOPE_CLUSTER, VALUE_TYPE_BOOL, VISIBILITY_PUBLIC, BINDING_POLICY_SESSION, MUTABILITY_STATIC) to avoid namespace collisions. The BindingPolicy field maps to the existing Scala ConfigBindingPolicy enum. The Mutability field is used by SQLConf.isStaticConfigKey to determine whether a config can be changed at runtime.
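As a hedged illustration only, a definition in this scheme might look like the textproto entry below. The field names follow the description above, but the exact message structure and file layout are defined by the schema in the PR's common/config module, not here.

```textproto
# Hypothetical entry illustrating the fields described above.
config_entry {
  key: "spark.sql.optimizer.datasourceV2ExprFolding"
  value_type: VALUE_TYPE_BOOL
  default_value: "true"
  scope: SCOPE_SESSION
  mutability: MUTABILITY_DYNAMIC
  visibility: VISIBILITY_PUBLIC
  binding_policy: BINDING_POLICY_SESSION
  doc: "Whether to fold foldable DSv2 expressions before pushdown."
  version: "4.2.0"
}
```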

The new framework will co-exist with the existing config framework, until we migrate all existing configs to this new framework.

For simple configs that are only accessed in one place, we can hardcode the config name at the access site, with a new API SQLConf#getConfByKeyStrict to avoid typos. Example in this PR: spark.sql.optimizer.datasourceV2ExprFolding.
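The strict-lookup idea can be sketched as below. getConfByKeyStrict is the PR's API name, but this implementation is an illustrative stand-in, not the actual code.

```scala
// Illustrative sketch of a strict lookup: unlike a plain get-by-key,
// it rejects keys that were never defined, so a typo in a hardcoded
// config name fails fast instead of silently returning nothing.
class StrictConfSketch(known: Set[String], settings: Map[String, String]) {
  def getConfByKeyStrict(key: String): Option[String] = {
    require(known.contains(key), s"Config is not defined: $key")
    settings.get(key)
  }
}
```

With this shape, a misspelled key throws at the call site rather than being treated as an unset config.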

For configs that are accessed in multiple places where we want to avoid hardcoding the config name, or configs that need custom validation, we can still have an entry in object SQLConf to reference the config definition. Examples in this PR: spark.sql.optimizer.maxIterations and spark.sql.shuffledHashJoinFactor.

Why are the changes needed?

Defining configs in various Scala objects is a bad practice:

  • Configs are registered when the Scala objects are loaded. To list all configs we must know all these Scala objects and load them.
  • It's hard to audit configs, as they are spread all over the codebase.
  • We will hit a JVM limitation sooner or later, as defining configs in a Scala object is basically doing heavy work in the constructor, and a single method is limited to 64 KB of bytecode.

Does this PR introduce any user-facing change?

No, it's developer-facing.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

@cloud-fan cloud-fan marked this pull request as draft December 16, 2025 17:10
@cloud-fan cloud-fan force-pushed the conf branch 2 times, most recently from 5f0f5ea to a59ee0a Compare December 17, 2025 15:14
Member

@szehon-ho szehon-ho left a comment


Interesting, curious about the reason for prototext?

@cloud-fan
Contributor Author

@szehon-ho The reason for prototext:

  • It generates the Java classes for the config definition, which saves time
  • We need to keep all config definitions in memory, and protobuf should be more memory-efficient
  • The proto schema file of the config definition is a good place for documentation

# See the License for the specific language governing permissions and
# limitations under the License.

# SQL configuration entries (CLUSTER scope)
Member


Do you mean APPLICATION scope? I think in the context of Spark, "cluster" usually refers to something like a K8s cluster, a Spark standalone cluster, a YARN cluster, etc.

Contributor Author


The physical cluster is managed by the resource manager and is not within the context of Spark configs. I chose CLUSTER because it's a more general word for a distributed engine, and it refers to the driver + executors cluster launched for the application.

Member

@pan3793 pan3793 Dec 19, 2025


My major concern is that APPLICATION is a clear concept with no ambiguity in the context of Spark; for example, there are configs like spark.app.id and spark.app.name. CLUSTER, on the other hand, may introduce additional ambiguity: in the Spark Standalone docs, "cluster" means the "Spark Standalone cluster", so using CLUSTER introduces unnecessary confusion.

Contributor Author


I'll think about it more. My concern is that, with many managed Spark services out there, most users do not have the concept of an application. They either submit the Spark application as scheduled jobs, or use a notebook directly. With Spark Connect, the concept of a Spark application becomes even less obvious. Sessions and the Spark cluster under the hood are more general concepts now, IMO.

Member


QQ: how do we support createWithDefault, enumConf, and fallbackConf in this textproto file?

For example

  val STORE_ASSIGNMENT_POLICY =
    buildConf("spark.sql.storeAssignmentPolicy")
      ...
      .enumConf(StoreAssignmentPolicy)
      .createWithDefault(StoreAssignmentPolicy.ANSI)

  val ANSI_ENABLED = buildConf(SqlApiConfHelper.ANSI_ENABLED_KEY)
    ...
    .createWithDefault(!sys.env.get("SPARK_ANSI_SQL_MODE").contains("false"))

Contributor Author


See https://github.com/apache/spark/pull/53488/files#diff-82e9b261f86d3cbd94a23abe94aadf45531e0b29603082336784e30ea46c9574R108-R116

We can override certain fields of the config definition on the Scala side, and default values can be supported as well.

For enum confs, one benefit is that it's type-safe: the getConf API returns the enum instance directly. This means we need a ConfigEntry on the Scala side, but we can still define the enum in protobuf and expose the protobuf enum.
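To illustrate that type-safety point, the sketch below uses invented names (StoreAssignmentPolicySketch, EnumConfSketch); in the PR the enum could be protobuf-generated and the wrapper would be a real ConfigEntry.

```scala
// Hypothetical sketch: the enum itself could be generated from protobuf,
// while a thin Scala wrapper keeps getConf type-safe by parsing the
// stored string into an enum instance.
object StoreAssignmentPolicySketch extends Enumeration {
  val ANSI, LEGACY, STRICT = Value
}

class EnumConfSketch(settings: Map[String, String]) {
  def getEnumConf(
      key: String,
      default: StoreAssignmentPolicySketch.Value): StoreAssignmentPolicySketch.Value = {
    // withName throws on values outside the enum, enforcing validity.
    settings.get(key).map(StoreAssignmentPolicySketch.withName).getOrElse(default)
  }
}
```

Callers get an enum value back, never a raw string, so invalid settings fail at lookup time rather than deep inside the optimizer.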

@mihailotim-db
Contributor

mihailotim-db commented Dec 22, 2025

Nice idea. I've recently encountered a number of issues with config discovery in Spark. Most notably, when configs are in objects that are for some reason not loaded (e.g. Hive confs in sql/core), or the object is loaded but the conf is defined as a lazy val. Will the latter also be supported?

@mridulm
Contributor

mridulm commented Dec 23, 2025

Very promising!
A few high level thoughts:

  • Is this specific only to SQL? The current formulation appears to assume yes. If yes:
    • Should it be under sql/config instead?
    • If no (preferable), handling for core and common/* modules would be something to consider.
      • cluster vs session could get impacted by this as well (how to specify/differentiate from common/core configs).
  • Since we are relooking at configs, one 'feature' to consider would be enforcement of cluster defaults, which users cannot override (like security configs, event file location, etc).
  • It is not clear what the use case for test_default is.

@cloud-fan
Contributor Author

Hi @mridulm

  • This is not for SQL config only. I mentioned this in the README.md: Currently, SQL configs go in sql.textproto. New categories can have their own files.
  • As of today, I don't think users can override Spark core configs or static SQL configs via public APIs. So this "feature" is already there.
  • Quite a few boolean configs define their default values as Utils.isTesting. To avoid having to add Scala entries (e.g. the ones in SQLConf) for them, test_default is added to handle such use cases with the new text-file-based framework.
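The test_default resolution could behave like this sketch. ConfigDefSketch and resolveDefault are hypothetical names; the real lookup lives inside the new framework and the testing flag would come from Utils.isTesting.

```scala
// Illustrative default resolution: under tests, prefer test_default when
// present; otherwise fall back to the regular default value.
case class ConfigDefSketch(
    key: String,
    defaultValue: String,
    testDefault: Option[String] = None)

def resolveDefault(entry: ConfigDefSketch, isTesting: Boolean): String =
  if (isTesting) entry.testDefault.getOrElse(entry.defaultValue)
  else entry.defaultValue
```

This keeps the Utils.isTesting special-casing in the framework itself, so such configs need no Scala entry at all.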

@mridulm
Contributor

mridulm commented Dec 30, 2025

  • This is not for SQL config only. I mentioned this in the README.md: Currently, SQL configs go in sql.textproto. New categories can have their own files.

This is great !

  • As of today, I don't think users can override spark core configs or static sql configs via public APIs. So this "feature" is already there.

Users can override Spark core configs, both at submission time and before session creation (via the SparkConf passed into SparkContext).

  • Quite some boolean configs define their default values as Utils.isTesting. To avoid having to add Scala entries (e.g. the ones in SQLConf) for them, test_default is added to handle such use cases with the new text file based framework.

Sounds good !

@cloud-fan
Contributor Author

Yeah, people can set Spark core configs before SparkContext is instantiated; maybe I misunderstood the word "override" in your previous comment. Providing configs before creating SparkContext doesn't sound like "override". Maybe you mean overriding the config property files?

@mridulm
Contributor

mridulm commented Jan 24, 2026

To put it differently: the ability to enforce deployer configs, which users cannot override (for example, for security), similar to final configuration in Hadoop.

@cloud-fan
Contributor Author

cloud-fan commented Jan 27, 2026

@mridulm maybe we can extend the config visibility: for now it's only PUBLIC and INTERNAL. Maybe we can add one more: PRIVATE. People could only set PRIVATE configs in the config file, and they couldn't be overridden by other config files or the SparkContext/SparkSession config APIs. Is that what you need?

# Conflicts:
#	common/utils/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
#	project/SparkBuild.scala
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
- Prefix all proto enum values following industry standard (e.g. BOOL -> VALUE_TYPE_BOOL,
  CLUSTER -> SCOPE_CLUSTER, PUBLIC -> VISIBILITY_PUBLIC) to avoid namespace collisions
- Add BindingPolicy enum (SESSION, PERSISTED, NOT_APPLICABLE) for SQL views/UDFs/procedures
- Add required binding_policy field to ConfigEntry message
- Reorder message fields: key, value_type, default_value, test_default, scope, visibility,
  binding_policy, doc, version
- Wire binding_policy through ProtoBackedConfigEntry to Scala ConfigBindingPolicy
- Update all textproto files with prefixed enum values, binding_policy, and matching field order
- Add validation and test for missing binding_policy

Co-authored-by: Isaac
@cloud-fan
Contributor Author

@mridulm Thanks for the suggestion! I think the Hadoop final config feature is about specifying a config value (the deployer marks a specific value as final in the config file), not about defining a config (the config author decides at schema level whether it can be locked). So it's orthogonal to what this PR does — this PR focuses on config definitions/schema.

Supporting Hadoop-style final configs for Spark would be a separate effort, likely involving changes to SparkConf/SQLConf to honor a "finalized keys" set and reject overrides from runtime APIs (SET, spark.conf.set(), etc.). Happy to discuss that in a separate JIRA if you're interested!

@cloud-fan cloud-fan changed the title [DO NOT MERGE] define configs in text files [SPARK-56460][CORE] define configs in text files Apr 13, 2026
@cloud-fan cloud-fan marked this pull request as ready for review April 13, 2026 14:41
- Add ProtoBackedOptionalConfigEntry for configs without defaults,
  mirroring the existing OptionalConfigEntry pattern
- Add ProtoBackedBase marker trait for both proto-backed entry types
- Fix README examples: correct enum names, add binding_policy field,
  fix "validated at load time" claim and grammar
- Remove migrated configs from binding-policy exceptions file
- Restore set -e in dev/mima with targeted error handling
- Remove config_check.yml CI workflow (not ready to enforce yet)
- Use ConfigRegistry.allKeys() for proto config lookups instead of
  scanning knownConfigs
- Add TODO on getConfigEntries() for future simplification

Co-authored-by: Isaac
…isuse

- Add Mutability enum (STATIC/DYNAMIC) as a separate concern from Scope
  (CLUSTER/SESSION). isStaticConfigKey now checks mutability instead of scope.
- Add require check in SQLConf.register to reject proto-defined keys,
  preventing duplicates in getConfigEntries().
- Add test validation that CLUSTER↔STATIC and SESSION↔DYNAMIC with a TODO
  to relax once the framework supports other combinations.

Co-authored-by: Isaac