Skip to content

Commit 0b57f11

Browse files
committed
clean up more comments
1 parent 58757cb commit 0b57f11

12 files changed

Lines changed: 74 additions & 76 deletions

File tree

spark/src/main/java/org/apache/comet/codegen/CometBatchKernel.java

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -44,17 +44,17 @@ protected CometBatchKernel(Object[] references) {
4444
* Deterministic expressions leave this as a no-op.
4545
*
4646
* <p>The caller invokes this before the first {@code process} call of each partition. The
47-
* generated subclass is not thread-safe across concurrent {@code process} calls; the dispatcher
47+
* generated subclass is not thread-safe across concurrent {@code process} calls. The dispatcher
4848
* allocates one per partition and serializes calls.
4949
*/
5050
public void init(int partitionIndex) {}
5151

5252
/**
5353
* Process one batch.
5454
*
55-
* @param inputs Arrow input vectors; length and concrete classes match the schema the kernel was
56-
* compiled against
57-
* @param output Arrow output vector; caller allocates to the expression's {@code dataType}
55+
* @param inputs Arrow input vectors. Length and concrete classes match the schema the kernel was
56+
* compiled against.
57+
* @param output Arrow output vector. Caller allocates to the expression's {@code dataType}.
5858
* @param numRows number of rows in this batch
5959
*/
6060
public abstract void process(ValueVector[] inputs, FieldVector output, int numRows);

spark/src/main/scala/org/apache/comet/codegen/CometBatchKernelCodegen.scala

Lines changed: 19 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -35,18 +35,18 @@ import org.apache.comet.shims.CometExprTraitShim
3535
* fuses Arrow input reads, Spark expression evaluation, and Arrow output writes into one
3636
* Janino-compiled method per `(expression, schema)` pair.
3737
*
38-
* The kernel compiles any bound Catalyst expression; the tree need not be rooted at a `ScalaUDF`.
38+
* The kernel compiles any bound Catalyst expression. The tree need not be rooted at a `ScalaUDF`.
3939
* Today's only consumer is [[org.apache.comet.udf.codegen.CometScalaUDFCodegen]].
4040
*
41-
* Constraints: one output vector per kernel; per-row scalar evaluation only (aggregate, window,
41+
* Constraints: one output vector per kernel, per-row scalar evaluation only (aggregate, window,
4242
* generator are rejected by [[canHandle]]).
4343
*
4444
* Input- and output-side emission live in [[CometBatchKernelCodegenInput]] and
4545
* [[CometBatchKernelCodegenOutput]]. This file owns the [[ArrowColumnSpec]] vocabulary, the
4646
* [[canHandle]] / [[allocateOutput]] / [[compile]] / [[generateSource]] entry points, and
4747
* cross-cutting kernel-shape decisions (NullIntolerant short-circuit, CSE variant).
4848
*
49-
* The generated kernel is the `InternalRow` that Spark's `BoundReference.genCode` reads from; see
49+
* The generated kernel is the `InternalRow` that Spark's `BoundReference.genCode` reads from. See
5050
* [[generateSource]] for how the wiring is set up.
5151
*/
5252
object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
@@ -128,7 +128,7 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
128128
s"spark.sql.codegen.maxFields=$maxFields)")
129129
}
130130
// HOFs are `CodegenFallback` but admitted: `CodegenFallback.doGenCode` emits one
131-
// `((Expression) references[N]).eval(row)` call site per HOF; the kernel dispatches to the
131+
// `((Expression) references[N]).eval(row)` call site per HOF. The kernel dispatches to the
132132
// HOF's interpreted `eval`, which mutates `NamedLambdaVariable.value` per element and reads
133133
// the input array through the kernel's typed Arrow getters. Per-task `boundExpr` isolation
134134
// in `CometScalaUDFCodegen.kernelCache` prevents concurrent partitions from racing on the
@@ -140,8 +140,8 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
140140
//
141141
// `ExecSubqueryExpression` (`ScalarSubquery`, `InSubqueryExec`) is accepted: the surrounding
142142
// Comet operator's inherited `SparkPlan.waitForSubqueries` populates the subquery's
143-
// `result` field before evaluation; the closure serializer captures that value into the
144-
// arg-0 bytes; the dispatcher keys its compile cache on those bytes, so distinct subquery
143+
// `result` field before evaluation. The closure serializer captures that value into the
144+
// arg-0 bytes, and the dispatcher keys its compile cache on those bytes, so distinct subquery
145145
// results produce distinct cache entries.
146146
//
147147
// `Unevaluable`: rejected by default. `isCodegenInertUnevaluable` exempts version-specific
@@ -207,7 +207,7 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
207207
.mkString(","))
208208
// ScalaUDF embeds stateful `ExpressionEncoder` serializers via `ctx.addReferenceObj` that
209209
// reuse internal `UnsafeRow` / `byte[]` buffers per `apply`. Each kernel instance needs its
210-
// own copy; the closure regenerates the references array per call so the dispatcher can hand
210+
// own copy. The closure regenerates the references array per call so the dispatcher can hand
211211
// a fresh array to every kernel it allocates from this `CompiledKernel`.
212212
val freshReferences: () => Array[Any] = () =>
213213
generateSource(boundExpr, inputSchema).references
@@ -245,12 +245,12 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
245245
// empty string here.
246246
//
247247
// TODO(method-size): perRowBody is inlined inside process's for-loop and not split.
248-
// Sufficiently deep trees can exceed Janino's 64KB method size; wrap in
248+
// Sufficiently deep trees can exceed Janino's 64KB method size. Wrap in
249249
// ctx.splitExpressionsWithCurrentInputs when hit.
250250
val (concreteOutClass, outputSetup, perRowBody) = {
251251
// Class-field CSE. `generateExpressions` runs `subexpressionElimination` under the hood,
252252
// populating `ctx.subexprFunctions` with per-row helper calls that write common subtree
253-
// results into `addMutableState` fields; the returned `ExprCode` references those fields.
253+
// results into `addMutableState` fields. The returned `ExprCode` references those fields.
254254
// `subexprFunctionsCode` is the concatenated helper invocation block, spliced into the
255255
// per-row body by `defaultBody`.
256256
val ev = if (SQLConf.get.subexpressionEliminationEnabled) {
@@ -338,7 +338,7 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
338338
* skipped on null rows. Otherwise the standard shape: run `ev.code`, then `setNull` or write
339339
* based on `ev.isNull`.
340340
*
341-
* `subExprsCode` is the CSE helper-invocation block; it must run before `ev.code`. Inside the
341+
* `subExprsCode` is the CSE helper-invocation block. It must run before `ev.code`. Inside the
342342
* short-circuit it lives in the else branch so null rows skip CSE too.
343343
*/
344344
private def defaultBody(
@@ -418,8 +418,8 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
418418

419419
/**
420420
* Array column: an Arrow `ListVector` wrapping a child spec. `elementSparkType` lets the
421-
* nested-class emitter pick the right read template; the child carries the Arrow vector class.
422-
* Nested arrays compose recursively.
421+
* nested-class emitter pick the right read template, and the child carries the Arrow vector
422+
* class. Nested arrays compose recursively.
423423
*/
424424
final case class ArrayColumnSpec(
425425
nullable: Boolean,
@@ -449,8 +449,9 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
449449

450450
/**
451451
* Map column: an Arrow `MapVector` (subclass of `ListVector`) whose data vector is a
452-
* `StructVector` with key at child 0 and value at child 1. Nullable keys/values are carried in
453-
* the child specs. Nested keys and values compose recursively.
452+
* `StructVector` with key at child 0 and value at child 1. Nested keys and values compose
453+
* recursively. The child specs' `nullable` field is unused on the read path. Output-side null
454+
* guards for map values come from `MapType.valueContainsNull` on the Spark `DataType`.
454455
*/
455456
final case class MapColumnSpec(
456457
nullable: Boolean,
@@ -463,9 +464,9 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
463464
}
464465

465466
/**
466-
* Compiled kernel handle. `factory` is a Spark-generated stateless class safe to share across
467-
* partitions; `freshReferences` regenerates the references array per kernel allocation because
468-
* `ScalaUDF` embeds stateful `ExpressionEncoder` serializers that cannot be shared.
467+
* Compiled kernel handle. `freshReferences` regenerates the references array per kernel
468+
* allocation because `ScalaUDF` embeds stateful `ExpressionEncoder` serializers that cannot be
469+
* shared.
469470
*/
470471
final case class CompiledKernel(factory: GeneratedClass, freshReferences: () => Array[Any]) {
471472
def newInstance(): CometBatchKernel =
@@ -474,7 +475,7 @@ object CometBatchKernelCodegen extends Logging with CometExprTraitShim {
474475

475476
/**
476477
* Output of [[generateSource]]. Tests inspect `body` to assert the shape of the generated
477-
* source; see `CometCodegenSourceSuite`.
478+
* source. See `CometCodegenSourceSuite`.
478479
*/
479480
final case class GeneratedSource(body: String, code: CodeAndComment, references: Array[Any])
480481

spark/src/main/scala/org/apache/comet/codegen/CometBatchKernelCodegenInput.scala

Lines changed: 3 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -88,15 +88,12 @@ private[codegen] object CometBatchKernelCodegenInput {
8888
}
8989

9090
/**
91-
* Emit typed-getter overrides. Each switches on column ordinal; with the inlined constant
91+
* Emit typed-getter overrides. Each switches on column ordinal. With the inlined constant
9292
* ordinal from `BoundReference.genCode`, JIT folds the switch to one branch.
9393
*
9494
* `decimalTypeByOrdinal` lets the decimal getter specialize per ordinal: when only a
9595
* `DecimalType(precision <= 18)` `BoundReference` reads the ordinal, the case skips the
9696
* `BigDecimal` allocation and reads the unscaled long directly.
97-
*
98-
* TODO(unsafe-readers): primitive `v.get(i)` performs a bounds check that is redundant given `i
99-
* in [0, numRows)`.
10097
*/
10198
def emitTypedGetters(
10299
inputSchema: Seq[ArrowColumnSpec],
@@ -679,8 +676,8 @@ private[codegen] object CometBatchKernelCodegenInput {
679676

680677
/**
681678
* Emit one `InputStruct_${path}` nested class. Constructor takes `rowIdx` and stores it in a
682-
* `final` field. Scalar getters switch on field ordinal; complex getters allocate fresh inner
683-
* views (offsets computed for array/map children; rowIdx passed through for struct children).
679+
* `final` field. Scalar getters switch on field ordinal. Complex getters allocate fresh inner
680+
* views (offsets computed for array/map children, rowIdx passed through for struct children).
684681
*/
685682
private def emitStructClass(path: String, spec: StructColumnSpec): String = {
686683
val baseClassName = classOf[CometInternalRow].getName

spark/src/main/scala/org/apache/comet/codegen/CometBatchKernelCodegenOutput.scala

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -73,7 +73,7 @@ private[codegen] object CometBatchKernelCodegenOutput {
7373
* Complex top-level types route through a [[RenamedListVector]] / [[RenamedMapVector]] /
7474
* [[RenamedStructVector]] (see those for the runtime-vs-export naming gap).
7575
*
76-
* `estimatedBytes` pre-sizes the data buffer for variable-length scalar outputs; ignored for
76+
* `estimatedBytes` pre-sizes the data buffer for variable-length scalar outputs. Ignored for
7777
* other root types, and not propagated into nested var-width children (their `allocateNew` runs
7878
* through the parent's `allocateNew`, which resets child buffers).
7979
*
@@ -184,7 +184,7 @@ private[codegen] object CometBatchKernelCodegenOutput {
184184
* typed child-vector casts and whose `perRow` writes `source` into `targetVec` at `idx`.
185185
* `targetVec` is assumed pre-cast to the right Arrow class (root prelude or a parent's setup).
186186
*
187-
* Scalars emit `perRow` only; complex types emit both. Inner setup bubbles up so deep child
187+
* Scalars emit `perRow` only. Complex types emit both. Inner setup bubbles up so deep child
188188
* casts land at the batch prelude.
189189
*/
190190
private def emitWrite(
@@ -239,7 +239,7 @@ private[codegen] object CometBatchKernelCodegenOutput {
239239
// write each into the `ListVector`'s child, bracket with `startNewValue`/`endValue`. The
240240
// element write recurses through `emitWrite` on the child vector so any supported scalar
241241
// becomes a valid element. Nested complex types compose. `targetVec` is a `ListVector` at
242-
// the call site; only its data vector needs casting (in setup).
242+
// the call site, and only its data vector needs casting (in setup).
243243
//
244244
// NullableElementElision: when `containsNull == false` drop the `isNullAt` guard at
245245
// source level rather than relying on JIT folding.
@@ -274,7 +274,7 @@ private[codegen] object CometBatchKernelCodegenOutput {
274274
OutputEmit(setup, perRow)
275275
case st: StructType =>
276276
// Spark's `doGenCode` for StructType produces an `InternalRow`. Typed child-vector casts
277-
// hoist to setup; the per-row body references the hoisted names.
277+
// hoist to setup, and the per-row body references the hoisted names.
278278
//
279279
// For non-nullable fields, drop the `row.isNullAt($fi)` guard at source level so HotSpot
280280
// emits a straight write path per field rather than a branch.
@@ -311,7 +311,7 @@ private[codegen] object CometBatchKernelCodegenOutput {
311311
// entries struct and the key/value children hoist to setup.
312312
//
313313
// Per-row: read keyArray/valueArray, open via `startNewValue(idx)`, write each pair into
314-
// the entries struct (key always non-null per Spark/Arrow invariant; value guarded on
314+
// the entries struct (key always non-null per Spark/Arrow invariant, value guarded on
315315
// `valueContainsNull`), close via `endValue(idx, n)`.
316316
val entriesVar = ctx.freshName("outMapEntries")
317317
val keyVar = ctx.freshName("outMapKey")

spark/src/main/scala/org/apache/comet/codegen/CometInternalRow.scala

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@ import org.apache.comet.shims.CometInternalRowShim
2828

2929
/**
3030
* Throwing-default `InternalRow` base for the codegen kernel. Subclasses override only the
31-
* getters their input shape needs; centralizing the throws absorbs forward-compat breakage when
31+
* getters their input shape needs. Centralizing the throws absorbs forward-compat breakage when
3232
* Spark adds abstract methods.
3333
*
3434
* Two consumers: the compiled kernel (`ctx.INPUT_ROW = "row"` aliases `this`) and per-column

spark/src/main/scala/org/apache/comet/serde/CometScalaUDF.scala

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -62,7 +62,7 @@ object CometScalaUDF extends CometExpressionSerde[ScalaUDF] {
6262
val attrs = expr.collect { case a: AttributeReference => a }.distinct
6363
val boundExpr = BindReferences.bindReference(expr, AttributeSeq(attrs))
6464

65-
// Gate at plan time; surface the reason via withInfo rather than crashing Janino at execute.
65+
// Gate at plan time. Surface the reason via withInfo rather than crashing Janino at execute.
6666
CometBatchKernelCodegen.canHandle(boundExpr) match {
6767
case Some(reason) =>
6868
withInfo(expr, reason)

spark/src/main/scala/org/apache/comet/udf/codegen/CometScalaUDFCodegen.scala

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,7 @@ import org.apache.comet.udf.CometUDF
6969
*
7070
* `evaluate` runs under `this.synchronized` because DataFusion operators like `HashJoinExec`
7171
* pipeline build/probe via `OnceAsync` (`tokio::spawn`), so multiple Tokio worker threads can
72-
* call back into one task's dispatcher; the kernel's per-batch instance fields would race
72+
* call back into one task's dispatcher. The kernel's per-batch instance fields would race
7373
* otherwise.
7474
*
7575
* TODO(udf-codegen-pool): if intra-task UDF parallelism shows up as a bottleneck, replace the
@@ -98,7 +98,7 @@ class CometScalaUDFCodegen extends CometUDF with Logging {
9898
"CometScalaUDFCodegen requires non-null serialized expression bytes at arg 0")
9999
val bytes = exprVec.get(0)
100100

101-
// TODO(dict-encoded): kernels assume materialized inputs; dict-encoded vectors would fail the
101+
// TODO(dict-encoded): kernels assume materialized inputs. Dict-encoded vectors would fail the
102102
// cast in `specFor` below. Fix is to materialize at the dispatcher (via
103103
// `CDataDictionaryProvider`) or widen `emitTypedGetters` with a dict-index + lookup path.
104104

@@ -193,7 +193,7 @@ class CometScalaUDFCodegen extends CometUDF with Logging {
193193
*/
194194
private def specFor(v: ValueVector): ArrowColumnSpec = v match {
195195
case map: MapVector =>
196-
// MapVector extends ListVector; match it first.
196+
// MapVector extends ListVector, match it first.
197197
val struct = map.getDataVector.asInstanceOf[StructVector]
198198
val keyVec = struct.getChildByOrdinal(0).asInstanceOf[ValueVector]
199199
val valueVec = struct.getChildByOrdinal(1).asInstanceOf[ValueVector]
@@ -253,7 +253,7 @@ class CometScalaUDFCodegen extends CometUDF with Logging {
253253
object CometScalaUDFCodegen {
254254

255255
// JVM-wide counters across all per-task instances. Compile work is deduped JVM-wide via
256-
// `CodeGenerator.compile`'s source cache; these track this dispatcher's per-task cache activity.
256+
// `CodeGenerator.compile`'s source cache. These track this dispatcher's per-task cache activity.
257257
private val compileCount = new AtomicLong(0)
258258
private val cacheHitCount = new AtomicLong(0)
259259

spark/src/main/spark-4.x/org/apache/comet/shims/CometExprTraitShim.scala

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@ import org.apache.spark.sql.catalyst.expressions.{Expression, ResolvedCollation}
2323

2424
/**
2525
* Spark 4.x replaced the `NullIntolerant` marker trait with a boolean method on `Expression` and
26-
* added a `stateful` boolean. Neither exists as a trait in 4.x; this shim routes the checks
26+
* added a `stateful` boolean. Neither exists as a trait in 4.x. This shim routes the checks
2727
* through the method form.
2828
*/
2929
trait CometExprTraitShim {

spark/src/test/scala/org/apache/comet/CometCodegenFuzzSuite.scala

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -202,7 +202,7 @@ class CometCodegenFuzzSuite
202202

203203
/**
204204
* Element-level fuzz for nested array reads: `ArrayMax.doGenCode` walks every element of every
205-
* row, calling the kernel's nested element getter the path the unsafe-getter optimization
205+
* row, calling the kernel's nested element getter, the path the unsafe-getter optimization
206206
* touches and which the cardinality probe deliberately skips.
207207
*/
208208
test("array_max element fuzz: every Array<primitive> column") {
@@ -283,7 +283,7 @@ class CometCodegenFuzzSuite
283283

284284
/**
285285
* Element-level fuzz for `Array<Struct<...>>`. `array_distinct` is a non-HOF unary expression
286-
* that hashes each element to dedupe; struct hashing is field-wise, so the kernel emits element
286+
* that hashes each element to dedupe. Struct hashing is field-wise, so the kernel emits element
287287
* reads on each struct's fields. `cardinality` consumes the result without materialization.
288288
* Asserts the optimizer keeps `ArrayDistinct` so the coverage isn't vacuously folded.
289289
*/
@@ -316,8 +316,8 @@ class CometCodegenFuzzSuite
316316
}
317317

318318
/**
319-
* Top-level Array / Map cardinality probe. Struct → drill into each scalar child via
320-
* `GetStructField`; nested Array / Map sub-fields also get the cardinality probe (depth bound:
319+
* Top-level Array / Map produces a cardinality probe. Struct drills into each scalar child via
320+
* `GetStructField`. Nested Array / Map sub-fields also get the cardinality probe (depth bound:
321321
* deeper struct-of-struct nesting is skipped to keep the sweep finite).
322322
*/
323323
private def probeComplexColumn(field: StructField, viewName: String): Unit = {
@@ -355,7 +355,7 @@ class CometCodegenFuzzSuite
355355
val intDigits = precision - scale
356356
// `BigInt.apply(bits, rng)` samples uniformly on `[0, 2^bits - 1]`; bound to the decimal's
357357
// integer-part range (10^intDigits - 1) so the result fits the schema. `BigInteger.bitLength`
358-
// would overshoot slightly; min with the exact max is cheap insurance.
358+
// would overshoot slightly. Min with the exact max is cheap insurance.
359359
val intMax = BigInt(10).pow(intDigits) - 1
360360
val bits = math.max(intMax.bitLength, 1)
361361
(0 until RowCount).map { _ =>

0 commit comments

Comments
 (0)