Problem
The String API is intended as a fallback and as an incremental migration path to the type-safe API.
However, the compiler plugin currently does not learn schema information from string-based column access. As a result:
- code like `"full_name"<String>()` cannot be replaced with `full_name`
- users cannot progressively move from the String API to the typed API
- the String API and compiler plugin workflows are disconnected
This blocks the intended usage pattern where users start with the String API and gradually adopt type-safe access.
Expected
Allow the compiler plugin to infer schema information from string-based column access.
Proposed solution
Introduce an operation (e.g. `require { ... }`) that:
- asserts the presence and type of columns accessed via the String API
- updates schema information for the compiler plugin
- enables further usage of typed accessors

Example:

```kotlin
df.require { "full_name"() }
df.full_name // becomes available after require
```
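To make the intended runtime contract concrete, here is a minimal sketch of the checking semantics `require` could have, written against a toy column store rather than the real `DataFrame` type. The `ToyFrame` class and its `require` member are hypothetical stand-ins; the actual operation would live in the DataFrame API and would additionally feed the asserted schema to the compiler plugin.

```kotlin
// Toy stand-in for DataFrame: a map from column name to its values.
class ToyFrame(val columns: Map<String, List<Any?>>) {
    // Asserts that the named column exists and that all of its values have
    // type T, mirroring "asserts presence and type of columns accessed via
    // String API". Throws if the assertion fails, so code after a successful
    // require can safely rely on the column.
    inline fun <reified T> require(name: String): ToyFrame {
        val col = columns[name] ?: error("Column '$name' not found")
        check(col.all { it is T }) { "Column '$name' is not ${T::class.simpleName}" }
        return this
    }
}

fun main() {
    val df = ToyFrame(mapOf("full_name" to listOf("Kotlin/dataframe", "Kotlin/kotlinx.coroutines")))
    df.require<String>("full_name") // succeeds: the column exists and holds Strings
    // df.require<Int>("full_name") // would throw: wrong element type
}
```

Because `require` throws on failure, any typed access after it cannot observe a missing or mistyped column, which is exactly the guarantee the plugin needs in order to expose typed accessors.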
Acceptance criteria
- The compiler plugin can learn column types from string-based access
- Typed accessors become available after the schema is inferred
- The String API remains usable as a fallback
- Example pipeline using incremental migration compiles successfully
- Documentation explains migration path from String API to typed API
Motivation
The String API is a key fallback and onboarding mechanism.
Without schema inference:
- users are forced to fully define a `@DataSchema` upfront
- incremental migration is not possible
- the compiler plugin loses a major part of its usability
This is critical for enabling real-world adoption and should be addressed before 1.0.
With the compiler plugin, ideally we want to know the schema as early as possible so we can help transform it. However, forcing users to generate a `DataSchema` for all columns creates an entry barrier that is undesirable for simple pipelines. This is the motivation for the String API support added earlier. With it, our sample pipeline incrementally evolves the schema:
```kotlin
val repos = DataFrame
    .readCsv("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv")

repos
    .add("name") { "full_name"<String>().substringAfterLast("/") }
    .filter { name.lowercase().contains("kotlin") }

val reposUpdated = repos
    .renameToCamelCase()
    .rename { "stargazersCount"<Int>() }.into("stars")
    .filter { stars > 50 }
    .convert { "topics"<String>() }.with {
        val inner = it.removeSurrounding("[", "]")
        if (inner.isEmpty()) emptyList() else inner.split(',').map(String::trim)
    }
    .add("topicCount") { topics.size }
    .add("kind") { getKind("fullName"(), topics) }

reposUpdated.writeCsv("jetbrains_repositories_new.csv")
```
The only thing lacking here is a convenient way to tell the plugin that there is a `full_name: String` column, so that we could further replace:

```kotlin
.add("name") { "full_name"<String>().substringAfterLast("/") } // => .add("name") { full_name.substringAfterLast("/") }
.add("kind") { getKind("fullName"(), topics) }                 // => .add("kind") { getKind(fullName, topics) }
```
Proposed solution: a `require` operation, as an addition to `cast` and `convertTo`
```kotlin
val repos = DataFrame
    .readCsv("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv")
    .require { "full_name"<String>() }

repos.full_name // now we can call full_name, because otherwise require would have failed
```
The main difference: `require` adds new schema information, rather than substituting it wholesale as `cast` and `convertTo` do.
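The add-vs-substitute distinction can be sketched with schemas modeled as plain maps from column name to type name. This is only an illustration of the intended plugin bookkeeping, not the plugin's actual data structures; `substitute` and `merge` are hypothetical names.

```kotlin
// Toy schema model: column name -> type name.
typealias Schema = Map<String, String>

// cast/convertTo-style: the declared schema replaces whatever was known before,
// so previously inferred columns are forgotten.
fun substitute(known: Schema, declared: Schema): Schema = declared

// require-style: the declared columns are merged into what is already known,
// so previously inferred columns survive.
fun merge(known: Schema, declared: Schema): Schema = known + declared

fun main() {
    val known = mapOf("topics" to "String")
    val declared = mapOf("full_name" to "String")
    println(substitute(known, declared)) // only full_name remains known
    println(merge(known, declared))      // both topics and full_name are known
}
```

This is why `require` composes with incremental migration: each call can assert one more column without discarding what the plugin has already learned from earlier operations.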