[SPARK-56460][CORE] define configs in text files #53488

Open
cloud-fan wants to merge 6 commits into apache:master from cloud-fan:conf

Conversation


@cloud-fan cloud-fan commented Dec 16, 2025

What changes were proposed in this pull request?

This PR adds a new module common/config which introduces a framework to define configs in prototext files. When the Spark application starts, we load config definitions from all prototext files and put them in a map with the config name as the key and the proto binary of the config definition as the value. These proto-backed configs are also registered in ConfigEntry.knownConfigs via a wrapper, ProtoBackedConfigEntry, together with existing Spark configs.
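As a rough illustration of that loading flow, here is a minimal sketch. ConfigRegistrySketch and its members are invented names for illustration only; the real framework stores the serialized proto ConfigEntry message and wraps it in ProtoBackedConfigEntry.

```scala
// Hypothetical stand-in for the registry described above. Definitions
// loaded from prototext files are kept in a map keyed by config name;
// in the real framework the value is the serialized proto message.
object ConfigRegistrySketch {
  private val definitions =
    scala.collection.mutable.Map.empty[String, Array[Byte]]

  def register(key: String, protoBinary: Array[Byte]): Unit = {
    // Reject duplicates so two textproto files cannot define the same key.
    require(!definitions.contains(key), s"Duplicate config key: $key")
    definitions(key) = protoBinary
  }

  def contains(key: String): Boolean = definitions.contains(key)

  // Listing all configs is a map traversal, with no class loading involved.
  def allKeys: Set[String] = definitions.keySet.toSet
}
```

Because every definition lives in the map, enumerating configs no longer requires loading the Scala objects that used to register them.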

The proto schema defines each config entry with: key, value type, default value, scope (cluster vs session), mutability (static vs dynamic, i.e. whether the config can be changed after system initialization), visibility (public vs internal), binding policy (session/persisted/not_applicable for SQL views, UDFs, and procedures), documentation, and version. Enum values follow proto3 naming conventions with type-name prefixes (e.g. SCOPE_CLUSTER, VALUE_TYPE_BOOL, VISIBILITY_PUBLIC, BINDING_POLICY_SESSION, MUTABILITY_STATIC) to avoid namespace collisions. The BindingPolicy field maps to the existing Scala ConfigBindingPolicy enum. The Mutability field is used by SQLConf.isStaticConfigKey to determine whether a config can be changed at runtime.
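As a hedged illustration only, a definition in this scheme might look like the textproto entry below. The field names follow the description above, but the exact message structure and file layout are defined by the schema in the PR's common/config module, not here.

```textproto
# Hypothetical entry illustrating the fields described above.
config_entry {
  key: "spark.sql.optimizer.datasourceV2ExprFolding"
  value_type: VALUE_TYPE_BOOL
  default_value: "true"
  scope: SCOPE_SESSION
  mutability: MUTABILITY_DYNAMIC
  visibility: VISIBILITY_PUBLIC
  binding_policy: BINDING_POLICY_SESSION
  doc: "Whether to fold foldable DSv2 expressions before pushdown."
  version: "4.2.0"
}
```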

The new framework will co-exist with the existing config framework, until we migrate all existing configs to this new framework.

For simple configs that are only accessed in one place, we can hardcode the config name at the access site, with a new API SQLConf#getConfByKeyStrict to avoid typos. Example in this PR: spark.sql.optimizer.datasourceV2ExprFolding.
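The strict-lookup idea can be sketched as below. getConfByKeyStrict is the PR's API name, but this implementation is an illustrative stand-in, not the actual code.

```scala
// Illustrative sketch of a strict lookup: unlike a plain get-by-key,
// it rejects keys that were never defined, so a typo in a hardcoded
// config name fails fast instead of silently returning nothing.
class StrictConfSketch(known: Set[String], settings: Map[String, String]) {
  def getConfByKeyStrict(key: String): Option[String] = {
    require(known.contains(key), s"Config is not defined: $key")
    settings.get(key)
  }
}
```

With this shape, a misspelled key throws at the call site rather than being treated as an unset config.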

For configs that are accessed in multiple places where we want to avoid hardcoding the config name, or configs that need custom validation, we can still have an entry in object SQLConf to reference the config definition. Examples in this PR: spark.sql.optimizer.maxIterations and spark.sql.shuffledHashJoinFactor.

Why are the changes needed?

Defining configs in various Scala objects is a bad practice:

  • Configs are registered when the Scala objects are loaded. To list all configs we must know all these Scala objects and load them.
  • It's hard to audit configs, as they are spread all over the codebase.
  • We will hit a JVM limitation sooner or later, as defining configs in a Scala object is basically doing heavy work in the constructor, and a single method is limited to 64 KB of bytecode.

Does this PR introduce any user-facing change?

No, it's developer-facing.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

@cloud-fan cloud-fan marked this pull request as draft December 16, 2025 17:10
@cloud-fan cloud-fan force-pushed the conf branch 2 times, most recently from 5f0f5ea to a59ee0a Compare December 17, 2025 15:14
Member

@szehon-ho szehon-ho left a comment


Interesting, curious about the reason for prototext?

@cloud-fan
Contributor Author

@szehon-ho The reason for prototext:

  • It generates the Java classes for the config definition, which saves time
  • We need to keep all config definitions in memory, and protobuf should be more memory-efficient
  • The proto schema file of the config definition is a good place for documentation

# See the License for the specific language governing permissions and
# limitations under the License.

# SQL configuration entries (CLUSTER scope)
Member


Do you mean APPLICATION scope? I think in the context of Spark, "cluster" usually refers to something like a K8s cluster, a Spark standalone cluster, a YARN cluster, etc.

Contributor Author


The physical cluster is managed by the resource manager and is not within the context of Spark configs. I chose CLUSTER because it's a more general word for a distributed engine, and it refers to the driver + executors cluster launched for the application.

Member

@pan3793 pan3793 Dec 19, 2025


My major concern is that APPLICATION is a clear concept with no ambiguity in the context of Spark; for example, there are configs like spark.app.id and spark.app.name. CLUSTER, on the other hand, may introduce additional ambiguity: in the Spark Standalone docs, "cluster" means the "Spark Standalone cluster", so using CLUSTER introduces unnecessary confusion.

Contributor Author


I'll think about it more. My concern is that, with many managed Spark services out there, most users do not have the concept of an application. They either submit the Spark application as scheduled jobs, or use a notebook directly. With Spark Connect, the concept of a Spark application becomes even less obvious. Sessions and the Spark cluster under the hood are more general concepts now, IMO.

Member


QQ: how do we support createWithDefault, enumConf, and fallbackConf in this textproto file?

For example

  val STORE_ASSIGNMENT_POLICY =
    buildConf("spark.sql.storeAssignmentPolicy")
      ...
      .enumConf(StoreAssignmentPolicy)
      .createWithDefault(StoreAssignmentPolicy.ANSI)

  val ANSI_ENABLED = buildConf(SqlApiConfHelper.ANSI_ENABLED_KEY)
    ...
    .createWithDefault(!sys.env.get("SPARK_ANSI_SQL_MODE").contains("false"))

Contributor Author


See https://github.com/apache/spark/pull/53488/files#diff-82e9b261f86d3cbd94a23abe94aadf45531e0b29603082336784e30ea46c9574R108-R116

We can override certain fields of the config definition on the Scala side, and default values can be supported as well.

For enum confs, one benefit is that it's type-safe: the getConf API returns the enum instance directly. This means we need a ConfigEntry on the Scala side, but we can still define the enum in protobuf and expose the protobuf enum.
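To illustrate that type-safety point, the sketch below uses invented names (StoreAssignmentPolicySketch, EnumConfSketch); in the PR the enum could be protobuf-generated and the wrapper would be a real ConfigEntry.

```scala
// Hypothetical sketch: the enum itself could be generated from protobuf,
// while a thin Scala wrapper keeps getConf type-safe by parsing the
// stored string into an enum instance.
object StoreAssignmentPolicySketch extends Enumeration {
  val ANSI, LEGACY, STRICT = Value
}

class EnumConfSketch(settings: Map[String, String]) {
  def getEnumConf(
      key: String,
      default: StoreAssignmentPolicySketch.Value): StoreAssignmentPolicySketch.Value = {
    // withName throws on values outside the enum, enforcing validity.
    settings.get(key).map(StoreAssignmentPolicySketch.withName).getOrElse(default)
  }
}
```

Callers get an enum value back, never a raw string, so invalid settings fail at lookup time rather than deep inside the optimizer.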

@mihailotim-db
Contributor

mihailotim-db commented Dec 22, 2025

Nice idea. I've recently encountered a number of issues with config discovery in Spark. Most notably, when configs are in objects that are for some reason not loaded (e.g. Hive confs in sql/core), or the object is loaded but the conf is defined as a lazy val. Will the latter also be supported?

@mridulm
Contributor

mridulm commented Dec 23, 2025

Very promising!
A few high level thoughts:

  • Is this specific only to SQL? The current formulation appears to assume yes. If yes:
    • Should it be under sql/config instead?
    • If no (preferable), handling for core and common/* modules would be something to consider.
      • cluster vs session could get impacted by this as well (how to specify/differentiate from common/core configs).
  • Since we are relooking at configs, one 'feature' to consider would be enforcement of cluster defaults, which users cannot override (like security configs, event file location, etc).
  • It is not clear what the use case for test_default is.

@cloud-fan
Contributor Author

Hi @mridulm

  • This is not for SQL config only. I mentioned this in the README.md: Currently, SQL configs go in sql.textproto. New categories can have their own files.
  • As of today, I don't think users can override Spark core configs or static SQL configs via public APIs. So this "feature" is already there.
  • Quite a few boolean configs define their default values as Utils.isTesting. To avoid having to add Scala entries (e.g. the ones in SQLConf) for them, test_default is added to handle such use cases with the new text-file-based framework.
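The test_default resolution could behave like this sketch. ConfigDefSketch and resolveDefault are hypothetical names; the real lookup lives inside the new framework and the testing flag would come from Utils.isTesting.

```scala
// Illustrative default resolution: under tests, prefer test_default when
// present; otherwise fall back to the regular default value.
case class ConfigDefSketch(
    key: String,
    defaultValue: String,
    testDefault: Option[String] = None)

def resolveDefault(entry: ConfigDefSketch, isTesting: Boolean): String =
  if (isTesting) entry.testDefault.getOrElse(entry.defaultValue)
  else entry.defaultValue
```

This keeps the Utils.isTesting special-casing in the framework itself, so such configs need no Scala entry at all.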

@mridulm
Contributor

mridulm commented Dec 30, 2025

  • This is not for SQL config only. I mentioned this in the README.md: Currently, SQL configs go in sql.textproto. New categories can have their own files.

This is great !

  • As of today, I don't think users can override spark core configs or static sql configs via public APIs. So this "feature" is already there.

Users can override Spark core configs, both at submission time and before session creation (via the SparkConf passed into SparkContext).

  • Quite some boolean configs define their default values as Utils.isTesting. To avoid having to add Scala entries (e.g. the ones in SQLConf) for them, test_default is added to handle such use cases with the new text file based framework.

Sounds good !

@cloud-fan
Contributor Author

Yeah, people can set Spark core configs before SparkContext is instantiated; maybe I misunderstood the word "override" in your previous comment. Providing configs before creating SparkContext doesn't sound like "override". Maybe you mean overriding the config property files?

@mridulm
Contributor

mridulm commented Jan 24, 2026

To put it differently: the ability to enforce deployer configs, which users cannot override (for example, for security), similar to final configuration in Hadoop.

@cloud-fan
Contributor Author

cloud-fan commented Jan 27, 2026

@mridulm maybe we can extend the config visibility: for now it's only PUBLIC and INTERNAL. Maybe we can add one more: PRIVATE. People could only set PRIVATE configs in the config file, and they couldn't be overridden by other config files or the SparkContext/SparkSession config APIs. Is that what you need?

# Conflicts:
#	common/utils/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
#	project/SparkBuild.scala
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
- Prefix all proto enum values following industry standard (e.g. BOOL -> VALUE_TYPE_BOOL,
  CLUSTER -> SCOPE_CLUSTER, PUBLIC -> VISIBILITY_PUBLIC) to avoid namespace collisions
- Add BindingPolicy enum (SESSION, PERSISTED, NOT_APPLICABLE) for SQL views/UDFs/procedures
- Add required binding_policy field to ConfigEntry message
- Reorder message fields: key, value_type, default_value, test_default, scope, visibility,
  binding_policy, doc, version
- Wire binding_policy through ProtoBackedConfigEntry to Scala ConfigBindingPolicy
- Update all textproto files with prefixed enum values, binding_policy, and matching field order
- Add validation and test for missing binding_policy

Co-authored-by: Isaac
@cloud-fan
Contributor Author

@mridulm Thanks for the suggestion! I think the Hadoop final config feature is about specifying a config value (the deployer marks a specific value as final in the config file), not about defining a config (the config author decides at schema level whether it can be locked). So it's orthogonal to what this PR does — this PR focuses on config definitions/schema.

Supporting Hadoop-style final configs for Spark would be a separate effort, likely involving changes to SparkConf/SQLConf to honor a "finalized keys" set and reject overrides from runtime APIs (SET, spark.conf.set(), etc.). Happy to discuss that in a separate JIRA if you're interested!

@cloud-fan cloud-fan changed the title [DO NOT MERGE] define configs in text files [SPARK-56460][CORE] define configs in text files Apr 13, 2026
@cloud-fan cloud-fan marked this pull request as ready for review April 13, 2026 14:41
- Add ProtoBackedOptionalConfigEntry for configs without defaults,
  mirroring the existing OptionalConfigEntry pattern
- Add ProtoBackedBase marker trait for both proto-backed entry types
- Fix README examples: correct enum names, add binding_policy field,
  fix "validated at load time" claim and grammar
- Remove migrated configs from binding-policy exceptions file
- Restore set -e in dev/mima with targeted error handling
- Remove config_check.yml CI workflow (not ready to enforce yet)
- Use ConfigRegistry.allKeys() for proto config lookups instead of
  scanning knownConfigs
- Add TODO on getConfigEntries() for future simplification

Co-authored-by: Isaac
…isuse

- Add Mutability enum (STATIC/DYNAMIC) as a separate concern from Scope
  (CLUSTER/SESSION). isStaticConfigKey now checks mutability instead of scope.
- Add require check in SQLConf.register to reject proto-defined keys,
  preventing duplicates in getConfigEntries().
- Add test validation that CLUSTER↔STATIC and SESSION↔DYNAMIC with a TODO
  to relax once the framework supports other combinations.

Co-authored-by: Isaac