[SPARK-56460][CORE] Define configs in text files #53488
cloud-fan wants to merge 6 commits into apache:master
Conversation
szehon-ho left a comment
Interesting, I'm curious: what's the reason for prototext?
@szehon-ho The reason for prototext:
# SQL configuration entries (CLUSTER scope)
Do you mean APPLICATION scope? I think in the context of Spark, "cluster" usually refers to something like a K8s cluster, a Spark standalone cluster, a YARN cluster, etc.
The physical cluster is managed by the resource manager and is not within the context of Spark configs. I chose CLUSTER because it's a more general word for a distributed engine, and here it refers to the driver + executors cluster launched for the application.
My major concern is that APPLICATION is a clear concept with no ambiguity in the context of Spark; for example, there are configs like spark.app.id and spark.app.name. CLUSTER, on the other hand, may introduce additional ambiguity: in the Spark Standalone docs, "cluster" does mean the "Spark Standalone cluster", so using CLUSTER introduces unnecessary confusion.
I'll think about it more. My concern is that, with many managed Spark services out there, most users do not have the concept of application. They either submit the spark application as scheduled jobs, or use notebook directly. With Spark Connect, the concept of Spark application becomes even less obvious. Sessions and the Spark cluster under the hood are more general concepts now IMO.
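To make the naming question concrete, here is a minimal sketch of how the scope enum under discussion might look in the proto schema. Only the CLUSTER/SESSION pair and the `SCOPE_` prefix convention come from this PR; the `UNSPECIFIED` zero value and the exact numbering are assumptions.

```protobuf
// Hypothetical sketch of the Scope enum being debated in this thread.
enum Scope {
  SCOPE_UNSPECIFIED = 0;
  // Applies to the whole driver + executors cluster launched for the
  // application; fixed for its lifetime.
  SCOPE_CLUSTER = 1;
  // Applies per session and can differ across sessions.
  SCOPE_SESSION = 2;
}
```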
QQ: how to support createWithDefault, enumConf, fallbackConf in this textproto file?
For example
val STORE_ASSIGNMENT_POLICY =
buildConf("spark.sql.storeAssignmentPolicy")
...
.enumConf(StoreAssignmentPolicy)
.createWithDefault(StoreAssignmentPolicy.ANSI)
val ANSI_ENABLED = buildConf(SqlApiConfHelper.ANSI_ENABLED_KEY)
...
.createWithDefault(!sys.env.get("SPARK_ANSI_SQL_MODE").contains("false"))
We can override certain fields of the config definition on the Scala side, and default values can be supported as well.
For enum confs, one benefit is type safety: the getConf API returns the enum instance directly. This means we still need a ConfigEntry on the Scala side, but we can define the enum in protobuf and expose the protobuf enum.
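A hypothetical sketch of what the protobuf side of an enum conf could look like under this approach. The value names mirror Spark's StoreAssignmentPolicy (ANSI/LEGACY/STRICT); the enum layout, prefixing, and numbering here are assumptions, not taken from this PR.

```protobuf
// Hypothetical: define the enum in protobuf so the Scala-side ConfigEntry
// can expose it type-safely, as described in the comment above.
enum StoreAssignmentPolicy {
  STORE_ASSIGNMENT_POLICY_UNSPECIFIED = 0;
  STORE_ASSIGNMENT_POLICY_ANSI = 1;
  STORE_ASSIGNMENT_POLICY_LEGACY = 2;
  STORE_ASSIGNMENT_POLICY_STRICT = 3;
}
```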
Nice idea. I've recently encountered a number of issues with config discovery in Spark. Most notably, when configs are in objects that are for some reason not loaded (e.g. Hive confs in sql/core), or the object is loaded but the conf is defined as …
Very promising!
Hi @mridulm
This is great!
Users can override Spark core configs: both at time of submission, and before session creation (via the …
Sounds good!

Yeah, people can set Spark core configs before …
To put it differently: the ability to enforce deployer configs, which users cannot override; for example, for security, similar to …
@mridulm maybe we can extend the config visibility: for now it's only PUBLIC and INTERNAL. Maybe we can add one more: PRIVATE. People can only set PRIVATE configs in the config file, and they can't be overridden by other config files or the SparkContext/SparkSession config APIs. Is that what you need?
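As a sketch of this proposal, the Visibility enum might be extended like so. Only VISIBILITY_PUBLIC and VISIBILITY_INTERNAL exist in this PR; VISIBILITY_PRIVATE and the numbering here are hypothetical.

```protobuf
// Hypothetical extension of the Visibility enum discussed above;
// VISIBILITY_PRIVATE is a proposal, not part of this PR.
enum Visibility {
  VISIBILITY_UNSPECIFIED = 0;
  VISIBILITY_PUBLIC = 1;
  VISIBILITY_INTERNAL = 2;
  // Settable only via the deployer's config file; cannot be overridden by
  // other config files or the SparkContext/SparkSession config APIs.
  VISIBILITY_PRIVATE = 3;
}
```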
# Conflicts:
#   common/utils/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
#   project/SparkBuild.scala
#   sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
- Prefix all proto enum values following industry standard (e.g. BOOL -> VALUE_TYPE_BOOL, CLUSTER -> SCOPE_CLUSTER, PUBLIC -> VISIBILITY_PUBLIC) to avoid namespace collisions
- Add BindingPolicy enum (SESSION, PERSISTED, NOT_APPLICABLE) for SQL views/UDFs/procedures
- Add required binding_policy field to ConfigEntry message
- Reorder message fields: key, value_type, default_value, test_default, scope, visibility, binding_policy, doc, version
- Wire binding_policy through ProtoBackedConfigEntry to Scala ConfigBindingPolicy
- Update all textproto files with prefixed enum values, binding_policy, and matching field order
- Add validation and test for missing binding_policy

Co-authored-by: Isaac
@mridulm Thanks for the suggestion! I think the Hadoop … Supporting Hadoop-style …
- Add ProtoBackedOptionalConfigEntry for configs without defaults, mirroring the existing OptionalConfigEntry pattern
- Add ProtoBackedBase marker trait for both proto-backed entry types
- Fix README examples: correct enum names, add binding_policy field, fix "validated at load time" claim and grammar
- Remove migrated configs from binding-policy exceptions file
- Restore set -e in dev/mima with targeted error handling
- Remove config_check.yml CI workflow (not ready to enforce yet)
- Use ConfigRegistry.allKeys() for proto config lookups instead of scanning knownConfigs
- Add TODO on getConfigEntries() for future simplification

Co-authored-by: Isaac
…isuse

- Add Mutability enum (STATIC/DYNAMIC) as a separate concern from Scope (CLUSTER/SESSION). isStaticConfigKey now checks mutability instead of scope.
- Add require check in SQLConf.register to reject proto-defined keys, preventing duplicates in getConfigEntries().
- Add test validation that CLUSTER↔STATIC and SESSION↔DYNAMIC hold, with a TODO to relax once the framework supports other combinations.

Co-authored-by: Isaac
Co-authored-by: Isaac
What changes were proposed in this pull request?
This PR adds a new module common/config which introduces a framework to define configs in prototext files. When the Spark application starts, we load config definitions from all prototext files and put them in a map with the config name as the key and the proto binary of the config definition as the value. These proto-backed configs are also registered in ConfigEntry.knownConfigs via a wrapper ProtoBackedConfigEntry, together with existing Spark configs.

The proto schema defines each config entry with: key, value type, default value, scope (cluster vs session), mutability (static vs dynamic, i.e. whether the config can be changed after system initialization), visibility (public vs internal), binding policy (session/persisted/not_applicable for SQL views, UDFs, and procedures), documentation, and version. Enum values follow proto3 naming conventions with type-name prefixes (e.g. SCOPE_CLUSTER, VALUE_TYPE_BOOL, VISIBILITY_PUBLIC, BINDING_POLICY_SESSION, MUTABILITY_STATIC) to avoid namespace collisions. The BindingPolicy field maps to the existing Scala ConfigBindingPolicy enum. The Mutability field is used by SQLConf.isStaticConfigKey to determine whether a config can be changed at runtime.

The new framework will co-exist with the existing config framework until we migrate all existing configs to this new framework.
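For illustration, a single entry in one of the prototext files might look like the following. The field and enum names follow the schema description above, and the key is one of the configs migrated in this PR; the message layout, default value, doc text, and version are assumptions for the sketch.

```textproto
# Hypothetical prototext entry; field values other than the key and enum
# naming convention are illustrative, not taken from this PR.
entry {
  key: "spark.sql.optimizer.datasourceV2ExprFolding"
  value_type: VALUE_TYPE_BOOL
  default_value: "true"
  scope: SCOPE_SESSION
  mutability: MUTABILITY_DYNAMIC
  visibility: VISIBILITY_INTERNAL
  binding_policy: BINDING_POLICY_SESSION
  doc: "Whether to fold data source V2 expressions during optimization."
  version: "4.1.0"
}
```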
For simple configs that are only accessed in one place, we can hardcode the config name at the place that accesses it, with a new API SQLConf#getConfByKeyStrict to avoid typos. Example in this PR: spark.sql.optimizer.datasourceV2ExprFolding

For configs that are accessed in multiple places where we want to avoid hardcoding the config name, or configs that need custom validation, we can still have an entry in object SQLConf to reference the config definition. Examples in this PR: spark.sql.optimizer.maxIterations and spark.sql.shuffledHashJoinFactor.

Why are the changes needed?
Defining configs in various Scala objects is a bad practice: …
Does this PR introduce any user-facing change?
No, it's developer-facing.
How was this patch tested?
New tests.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code