The function transform_to_schema has a flag, fill_unspecified_columns_with_nulls, that fills columns with nulls in the resulting DataSet when they appear neither in the data nor in the transformations. However, this does not work for fields inside structs.
Example:
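from pyspark.sql import SparkSession
from pyspark.sql import types as T
import typedspark as TS

# Reuse the active session if there is one, otherwise start a local one.
spark = SparkSession.builder.getOrCreate()

class TestStructType(TS.Schema):
    f1: TS.Column[T.StringType]
    f2: TS.Column[T.StringType]

class TestSchema(TS.Schema):
    a: TS.Column[TS.StructType[TestStructType]]

# The struct column "a" only contains f1; f2 is missing from the data.
df = spark.createDataFrame([({"f1": "a"},)], "struct<a:struct<f1:string>>")

ds = TS.transform_to_schema(
    df,
    TestSchema,
    fill_unspecified_columns_with_nulls=True,
)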
Expected behaviour:
The resulting dataset should have the f2 field populated with nulls.
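To make the expected semantics concrete, here is a minimal pure-Python sketch of recursive null filling on nested dictionaries. It is not typedspark's implementation: the function fill_missing_with_nulls and the dictionary-based schema description are hypothetical names made up for this illustration.

```python
from typing import Any, Dict

def fill_missing_with_nulls(row: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of row holding every field named in schema.

    schema maps a field name to None (a leaf field) or to a nested dict
    (a struct). Both the function and this schema format are hypothetical.
    """
    filled = {}
    for name, sub_schema in schema.items():
        value = row.get(name)  # None when the field is absent from the data
        if sub_schema is not None and value is not None:
            # Struct field that is present: recurse so its missing fields get nulls too.
            filled[name] = fill_missing_with_nulls(value, sub_schema)
        else:
            filled[name] = value
    return filled

# Mirrors TestSchema: a struct column "a" with fields f1 and f2.
test_schema = {"a": {"f1": None, "f2": None}}
row = {"a": {"f1": "a"}}

result = fill_missing_with_nulls(row, test_schema)
print(result)  # {'a': {'f1': 'a', 'f2': None}}
```

In this sketch the recursion only descends into structs that are present in the data; a struct that is missing entirely is set to null as a whole.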
Actual behaviour:
Error:
TypeError: Schema TestSchema.a contains the following columns not present in data: {'f2'}