Process non-nullable scala type before udf #1471
base: main
Changes from 2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,7 +24,7 @@ declare -r SPARK_VERSION=${SPARK_VERSION:-3.3.1} | |
| declare -r LOCAL_PATH=$(cd -- "$(dirname -- "${DOCKER_FOLDER}")" &>/dev/null && pwd) | ||
| # ===============================[properties keys]================================= | ||
| declare -r SOURCE_KEY="source.path" | ||
| declare -r CHECKPOINT_KEY="checkpoint" | ||
| declare -r CHECKPOINT_KEY="checkpoint.path" | ||
| # ===============================[spark driver/executor resource]================== | ||
| declare -r RESOURCES_CONFIGS="${RESOURCES_CONFIGS:-"--conf spark.driver.memory=4g --conf spark.executor.memory=4g"}" | ||
| # ===================================[functions]=================================== | ||
|
|
@@ -89,7 +89,7 @@ function runContainer() { | |
|
|
||
| if [[ "$master" == "spark:"* ]] || [[ "$master" == "local"* ]]; then | ||
| docker run -d --init \ | ||
| --name "csv-kafka-${source_name}" \ | ||
| --name "csv-kafka${source_name}" \ | ||
|
Contributor
Is this being removed here?
Collaborator
Author
No, I accidentally deleted it while investigating the bug above. It has been restored. |
||
| $network_config \ | ||
| -v "$propertiesPath":"$propertiesPath":ro \ | ||
| -v "$jar_path":/tmp/astraea-etl.jar:ro \ | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -70,7 +70,14 @@ class DataFrameProcessor(dataFrame: DataFrame) { | |
| .withColumn( | ||
| "value", | ||
| defaultConverter( | ||
| map(cols.flatMap(c => List(lit(c.name), col(c.name))): _*) | ||
| map( | ||
| cols.flatMap(c => | ||
| List( | ||
| lit(c.name), | ||
| when(col(c.name).isNotNull, col(c.name)).otherwise(lit(null)) | ||
|
Contributor
Maybe we could drop the null fields entirely. A null means the value is absent, so filtering them out might also improve performance a little.
Collaborator
Author
This is the only way I could think of to drop the null fields. It does not look very elegant, but I could not find another approach. I will revise it if I find a cleaner one. |
||
| ) | ||
| ): _* | ||
| ) | ||
| ) | ||
| ) | ||
| .withColumn( | ||
|
|
@@ -171,10 +178,6 @@ object DataFrameProcessor { | |
|
|
||
| private def schema(columns: Seq[DataColumn]): StructType = | ||
| StructType(columns.map { col => | ||
| if (col.dataType != DataType.StringType) | ||
|
Contributor
Are non-string types supported now?
Collaborator
Author
Yes, they are supported according to my testing so far. Nulls can be handled while the data is still a column, but once a value is converted back to certain Scala types inside a udf, null handling is no longer supported. |
||
| throw new IllegalArgumentException( | ||
| "Sorry, only string type is currently supported.Because a problem(astraea #1286) has led to the need to wrap the non-nullable type." | ||
| ) | ||
| StructField(col.name, col.dataType.sparkType) | ||
| }) | ||
| } | ||
|
|
||
Why add `.path`? If the goal is consistent naming, the variable names used in Metadata need to change as well.
Mainly because when the shell script searches for checkpoint, it also matches the comment above, so I simply switched to one consistent name.
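The collision described in this reply can be reproduced with a minimal sketch. The properties file contents, the comment wording, and the use of `grep` are assumptions for illustration only; the script's actual lookup code is not shown in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical properties file; the real file's contents are not shown in this PR.
props="$(mktemp)"
cat >"$props" <<'EOF'
# The checkpoint directory used by the job (comment line)
source.path=/tmp/source
checkpoint.path=/tmp/checkpoint
EOF

# Searching for the bare word also counts the comment line.
loose_matches="$(grep -c "checkpoint" "$props")"

# Searching for the full key as a fixed string only counts the property line.
strict_matches="$(grep -cF "checkpoint.path" "$props")"
checkpoint_value="$(grep -F "checkpoint.path" "$props" | cut -d "=" -f 2)"

echo "loose=$loose_matches strict=$strict_matches value=$checkpoint_value"
rm -f "$props"
```

With this sample file, the bare key matches two lines while `checkpoint.path` matches only the property line, which is the reason given for the rename.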