-
Notifications
You must be signed in to change notification settings - Fork 854
fix: Add null-safety checks in OpenAIEmbedding transform method #2455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hey @smamindl 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guild on building from source. |
0f86e2d to
ab84a9b
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds null-safety checks to the transform method in OpenAIEmbedding to handle cases where the OpenAI API response may contain null data. The changes introduce defensive checks using Spark's when().otherwise() pattern to prevent runtime errors when accessing the embedding data and converting it to a vector.
- Adds necessary imports (
litandwhen) for null-safety checks - Implements null checks for the
datafield before accessing array elements - Implements null checks for the embedding array before converting to vector
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| val embeddingArrayCol = element_at(responseCol.getField("data"), 1).getField("embedding") | ||
| val vectorCol = array_to_vector(embeddingArrayCol) | ||
| val dataField = responseCol.getField("data") | ||
| val embeddingArrayCol = when(dataField.isNotNull, |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The null check on dataField.isNotNull doesn't handle the case where data is an empty array. If the API returns an empty array for data, element_at(dataField, 1) will return null, and calling .getField("embedding") on null will cause a runtime error. Consider adding an additional check for array size using size(dataField) > 0 or wrapping the element_at(...).getField("embedding") in a try-catch, or better yet, combine the checks: when(dataField.isNotNull && size(dataField) > 0, ...).
| val embeddingArrayCol = when(dataField.isNotNull, | |
| val embeddingArrayCol = when(dataField.isNotNull && size(dataField) > 0, |
| override def transform(dataset: Dataset[_]): DataFrame = { | ||
| val parsed = super.transform(dataset) | ||
| val responseCol = col(getOutputCol) | ||
| val embeddingArrayCol = element_at(responseCol.getField("data"), 1).getField("embedding") | ||
| val vectorCol = array_to_vector(embeddingArrayCol) | ||
| val dataField = responseCol.getField("data") | ||
| val embeddingArrayCol = when(dataField.isNotNull, | ||
| element_at(dataField, 1).getField("embedding") | ||
| ).otherwise(lit(null)) | ||
| val vectorCol = when(embeddingArrayCol.isNotNull, | ||
| array_to_vector(embeddingArrayCol) | ||
| ).otherwise(lit(null).cast(VectorType)) |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new null-safety logic lacks test coverage. Since the test suite in OpenAIEmbeddingsSuite.scala has comprehensive tests for other scenarios, consider adding tests that verify the behavior when the API returns null data or empty arrays. This will ensure the null-safety checks work correctly and prevent regressions.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #2455 +/- ##
==========================================
+ Coverage 84.34% 84.37% +0.03%
==========================================
Files 335 335
Lines 17723 17728 +5
Branches 1616 1612 -4
==========================================
+ Hits 14948 14958 +10
+ Misses 2775 2770 -5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| import scala.language.existentials | ||
| import org.apache.spark.sql.functions.{col, element_at, struct} | ||
| import org.apache.spark.sql.functions.{col, element_at, lit, struct, when} | ||
| import spray.json.DefaultJsonProtocol._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smamindl please add a test in OpenAIEmbeddingSuite.scala and potentially in OpenAIPromptsuite.scala so this null case is caught if it happens in future
ranadeepsingh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests
|
Fixed in |
Related Issues/PRs
Related to : 2444
#xxx
What changes are proposed in this pull request?
This PR adds null-safety checks to the transform method in OpenAIEmbedding to handle cases where the OpenAI API response may contain null data.
How is this patch tested?
Does this PR change any dependencies?
Does this PR add a new feature? If so, have you added samples on website?
website/docs/documentationfolder.Make sure you choose the correct class
estimators/transformersand namespace.DocTablepoints to correct API link.yarn run startto make sure the website renders correctly.<!--pytest-codeblocks:cont-->before each python code blocks to enable auto-tests for python samples.WebsiteSamplesTestsjob pass in the pipeline.