
Unblock String API workflows by allowing schema inference from string-based column access #1808

@koperagen

Description

Problem

The String API is intended as a fallback and an incremental migration path to the type-safe API.

However, the compiler plugin currently does not learn schema information from string-based column access. As a result:

  • code like "full_name"<String>() cannot be replaced with full_name
  • users cannot progressively move from String API to typed API
  • String API and compiler plugin workflows are disconnected

This blocks the intended usage pattern where users start with String API and gradually adopt type-safe access.

Expected

Allow the compiler plugin to infer schema information from string-based column access.

Proposed solution

Introduce an operation (e.g. require { ... }) that:

  • asserts presence and type of columns accessed via String API
  • updates schema information for the compiler plugin
  • enables further usage of typed accessors

Example:

df.require { "full_name"() }
df.full_name // becomes available after require
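To make the intended semantics concrete, here is a minimal runtime sketch of what require asserts, using a plain Map-backed stand-in for DataFrame (the names SimpleFrame and requireColumn are hypothetical, not library API). The real operation would additionally feed the discovered column type to the compiler plugin; this sketch only models the fail-fast assertion that justifies typed access afterwards.

```kotlin
// Stand-in for a dataframe: column name -> list of values (assumption, not the library type).
class SimpleFrame(val columns: Map<String, List<Any?>>) {
    // Asserts that `name` exists and every value is a T; returns the typed column.
    // This mirrors require's contract: if it returns normally, typed access is safe.
    inline fun <reified T> requireColumn(name: String): List<T> {
        val col = columns[name] ?: error("Column '$name' not found")
        return col.map {
            it as? T ?: error("Column '$name' is not of type ${T::class.simpleName}")
        }
    }
}

fun demo(): String {
    val df = SimpleFrame(mapOf("full_name" to listOf("Kotlin/dataframe", "Kotlin/kotlinx.coroutines")))
    // After the assertion succeeds, the column can be used with its asserted type.
    val fullNames = df.requireColumn<String>("full_name")
    return fullNames.first().substringAfterLast("/")
}
```

In the compiler-plugin setting the same check would happen once at the require call site, after which full_name participates in the inferred schema for the rest of the pipeline.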

Acceptance criteria

  • Compiler plugin can learn column types from string-based access
  • Typed accessors become available after schema is inferred
  • String API remains usable as fallback
  • Example pipeline using incremental migration compiles successfully
  • Documentation explains migration path from String API to typed API

Motivation

String API is a key fallback and onboarding mechanism.

Without schema inference:

  • users are forced to fully define @DataSchema upfront
  • incremental migration is not possible
  • compiler plugin loses a major part of its usability

This is critical for enabling real-world adoption and should be addressed before 1.0.

With the compiler plugin, ideally we want to know the schema as early as possible so the plugin can help transform it. However, forcing users to generate a DataSchema for all columns creates an entry barrier that is undesirable for simple pipelines. This was the motivation for the String API support added earlier. With it, our sample pipeline incrementally evolves the schema:

val repos = DataFrame
    .readCsv("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv")

repos
    .add("name") { "full_name"<String>().substringAfterLast("/") }
    .filter { name.lowercase().contains("kotlin") }

val reposUpdated = repos
    .renameToCamelCase()
    .rename { "stargazersCount"<Int>() }.into("stars")
    .filter { stars > 50 }
    .convert { "topics"<String>() }.with {
        val inner = it.removeSurrounding("[", "]")
        if (inner.isEmpty()) emptyList() else inner.split(',').map(String::trim)
    }
    .add("topicCount") { topics.size }
    .add("kind") { getKind("fullName"(), topics) }

reposUpdated.writeCsv("jetbrains_repositories_new.csv")

The only thing lacking here is a convenient way to tell the plugin that there is a full_name: String column, so that we could further replace

.add("name") { "full_name"<String>().substringAfterLast("/") } => .add("name") { full_name.substringAfterLast("/") }
.add("kind") { getKind("fullName"(), topics) } => .add("kind") { getKind(fullName, topics) }

Proposed solution: a require operation, as an addition to cast and convertTo

val repos = DataFrame
    .readCsv("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv")
    .require { "full_name"<String>() }

repos.full_name // now we can call full_name because otherwise require would've failed

The main difference: require adds new schema information rather than substituting it, as cast and convertTo do.
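That additive-vs-substitutive distinction can be sketched on a toy schema model (a map from column name to type name; the helper names substitute and requireMerge are illustrative, not library API). cast and convertTo replace everything the plugin knew, while require only adds the asserted columns to the existing knowledge.

```kotlin
// Toy schema model: column name -> type name (assumption for illustration only).
typealias Schema = Map<String, String>

// cast/convertTo-like: the declared schema fully replaces what was known before.
fun substitute(known: Schema, declared: Schema): Schema = declared

// require-like: asserted columns are merged into what is already known.
fun requireMerge(known: Schema, asserted: Schema): Schema = known + asserted
```

For example, with known = {stars: Int}, asserting {full_name: String} via substitute leaves only full_name, whereas requireMerge keeps both stars and full_name available to typed accessors.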

Metadata

Labels: API (if it touches our API), Compiler plugin (anything related to the DataFrame Compiler Plugin), enhancement (new feature or request)
