
Conversation

@FBruzzesi
Owner

@FBruzzesi FBruzzesi commented Dec 16, 2025

Description

  • Introduce a new Field data structure to represent schema fields with enhanced nullability and metadata handling
  • Update to_arrow() to include nullability and metadata

Possible follow-ups for the Field class:

  • allow for a description via "anyschema/description", via library-dependent descriptions, or via variable docstrings
  • add methods/properties in AnySchema to get names, dtypes, nullability, uniqueness

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🧪 Test improvement
  • 🔧 Maintenance/Refactoring
  • ⚡ Performance improvement
  • 🏗️ Build/CI improvement

Checklist

  • My code follows the project's style guidelines (ruff)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@FBruzzesi FBruzzesi added the enhancement New feature or request label Dec 16, 2025
@FBruzzesi
Owner Author

Typical ping to: @danielgafni @camriddell @dangotbanned

It's a large diff but split as follows:

  • docs: <200 lines (start from here)
  • codebase: <250 lines (maybe take a look at this)
  • tests: remaining ~900 lines (feel free to skip)

dtype: DType
nullable: bool
unique: bool
metadata: dict[str, Any]
Contributor

Do you have a use-case in mind for Field.metadata being mutable after construction?

If not, I think it could simplify things for hashing to use something like types.MappingProxyType:

import copy
import types
from narwhals.dtypes import DType
from typing import Any
from collections.abc import Mapping


class Field:
    name: str
    dtype: DType
    nullable: bool
    unique: bool
    metadata: Mapping[str, Any]

    def __init__(
        self,
        name: str,
        dtype: DType,
        *,
        nullable: bool = False,
        unique: bool = False,
        metadata: Mapping[str, Any] | None = None,
    ) -> None:
        self.name = name
        self.dtype = dtype
        self.nullable = nullable
        self.unique = unique
        self.metadata = types.MappingProxyType(
            copy.deepcopy(metadata) if metadata is not None else {}
        )

It's not quite immutability, but it's a bit less dynamic than dict
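
For illustration, attempting to mutate the proxy afterwards fails (a hypothetical usage of the sketch above, with a narwhals dtype instance):

import narwhals as nw

f = Field("id", nw.Int64(), metadata={"anyschema/unique": True})
f.metadata["anyschema/unique"] = False  # TypeError: 'mappingproxy' object does not support item assignment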

Comment on lines 15 to +18
Currently supported special metadata keys:

* `"anyschema/nullable"`: Specifies whether the field can contain null values.
* `"anyschema/unique"`: Specifies whether all values in the field must be unique.
Contributor

Is it too late to change this convention?

Since / in a key prevents it from being a valid Python identifier, the regular class-based TypedDict syntax isn't possible:
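
A minimal sketch of the functional-syntax fallback such keys would force (key names illustrative):

from typing_extensions import TypedDict

# non-identifier keys like "anyschema/nullable" only work in the functional form
Metadata = TypedDict(
    "Metadata",
    {"anyschema/nullable": bool, "anyschema/unique": bool},
    total=False,
)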


I think these keys would be typically added to library-specific field specifications by users ad-hoc, so it should be fine.

Owner Author


@dangotbanned I was about to bring up the functional syntax myself, but you are already aware of it. Do you have any proposal for something which is a valid Python identifier?

Contributor

@camriddell camriddell Dec 16, 2025


Don't want to derail this, but is there a specific reason for keys like "anyschema/nullable" instead of a nested dict {"anyschema": {"nullable": ..., "unique": ...}}? The latter should be easier to handle/verify in code: in the former case we need to iterate across all keys and find the ones that begin with "anyschema/", whereas in the latter case you can just check for the existence of "anyschema" in the metadata.
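
A minimal sketch of the two lookups being compared (hypothetical metadata contents):

# prefix convention: scan every key for the "anyschema/" prefix
flat = {"anyschema/nullable": True, "other-tool/key": 1}
anyschema_flat = {
    key.removeprefix("anyschema/"): value
    for key, value in flat.items()
    if key.startswith("anyschema/")
}

# nested convention: a single lookup
nested = {"anyschema": {"nullable": True}, "other-tool/key": 1}
anyschema_nested = nested.get("anyschema", {})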

Contributor

@dangotbanned dangotbanned Dec 16, 2025


@dangotbanned I was about to bring up the functional syntax myself, but you are already aware of it. Do you have any proposal for something which is a valid Python identifier?

Something like @camriddell's suggestion (#76 (comment)) could work quite nicely.

Here's what that, and another option which just uses __anyschema_<key>__, look like:

from typing_extensions import TypedDict


class MetadataDunder(TypedDict, total=False):
    __anyschema_nullable__: bool
    __anyschema_unique__: bool
    __anyschema_time_zone__: str
    __anyschema_time_unit__: str


class MetadataNested(TypedDict, total=False):
    nullable: bool
    unique: bool
    time_zone: str
    time_unit: str


class MetadataInsideMe(TypedDict):
    __anyschema_metadata__: MetadataNested   # I like this one the most (ignoring the names of the `TypedDict`s)


I believe the nested dict is good as well! Probably better than anyschema/<key>

Owner Author


To keep the diff a bit smaller, I will follow up on this in a dedicated PR.


@danielgafni danielgafni left a comment


In general I think it makes a lot of sense; the PR is already solid.

Some topics for discussion:

  1. Currently, we expect nullable and unique to always be set, even though sometimes we actually have no information about them initially, for example with Narwhals or Polars columns. We default them to False, but I feel like that may be the wrong thing to do. You might want to explore allowing them to be bool | None as well; see the sketch after this list. I'm not sure if this is the way, but it's something to consider. You could add a few convenience properties like known_nullability or something like this.

  2. Is there a reason why this isn't a @dataclass? It would simplify some things. You can still have a custom __init__ if needed.
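
A minimal sketch of the tri-state idea from point 1 (the class shape and property name are illustrative, not the PR's API):

from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    nullable: bool | None = None  # None means "unknown", not "not nullable"

    @property
    def known_nullability(self) -> bool:
        # True only when nullability was explicitly declared
        return self.nullable is not None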

and self.metadata == other.metadata
)

def __hash__(self) -> int:


At this point, why don't we just use a frozen @dataclass?

Owner Author


I see the appeal of that, yet we might end up creating a lot of these fields and the overhead can be noticeable. The container is so simple that I don't think it would end up saving that much typing after all.

I am sure that dataclasses overhead is getting better (as in, it's getting lower with each Python version), but still.


@danielgafni danielgafni Dec 16, 2025


IMO I wouldn't worry about performance at this stage.

  1. I'm sure it won't be noticeable, especially in data workflows.

  2. Even if it happens to be noticeable, it's better to take care of this once more important problems like establishing a good API are solved. Using @dataclass is an easy way to get started with a good API, and you should be able to change this part later on without user-facing changes.

Contributor


I'm +1 on using a frozen dataclass here, it would reduce the boilerplate code by a ton. If this appears to be a bottleneck in the future then we can worry about optimization at that point.
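
For reference, a minimal sketch of the frozen-dataclass shape under discussion (the definition here is an assumption, not the PR's code):

from dataclasses import dataclass, field
from typing import Any

from narwhals.dtypes import DType


@dataclass(frozen=True)
class Field:
    name: str
    dtype: DType
    nullable: bool = False
    unique: bool = False
    # note: the auto-generated __hash__ fails at call time on a dict-valued
    # metadata; a hashable mapping (or a custom __hash__) is still needed
    metadata: dict[str, Any] = field(default_factory=dict)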

Owner Author


Addressed in f822f75

I also made a benchmarking script. I will share it tomorrow

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As promised:

Instance Creation (µs per instance)
--------------------------------------------------------------------------------
   Current (Manual)                       0.266 µs  (baseline)
   Frozen DC (default_factory)            0.537 µs  (+101.9%, +0.271 µs)
   ...  <other attempts>

All other ops are in the same ballpark; creation is the one with a (relatively) big difference.

I think it's ok to use it and revisit later. I don't expect a user to rely on the fact that this class is a dataclass
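
Not the actual script, but a minimal sketch of this kind of micro-benchmark, with a hypothetical stand-in class:

import timeit


class ManualField:  # hypothetical stand-in for the manual implementation
    __slots__ = ("name", "nullable")

    def __init__(self, name: str, *, nullable: bool = False) -> None:
        self.name = name
        self.nullable = nullable


n = 100_000
timings = timeit.repeat("ManualField('a', nullable=True)", globals=globals(), number=n, repeat=5)
print(f"{min(timings) / n * 1e6:.3f} µs per instance")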

Contributor

@dangotbanned dangotbanned Dec 17, 2025


You'd still need to do a bit of legwork for argument defaults, but my @dataclass_transform thing in (narwhals-dev/narwhals#2572) will get you the other features from @dataclass you're using IIUC


Feel free to steal anything of interest 😄

There's probably not much code left after removing the general hashing stuff, which I imagine you don't need here anyway

Owner Author

@FBruzzesi FBruzzesi Dec 17, 2025


Thanks @dangotbanned - I have tried a few approaches by now, and once one adds layers of protection for complete immutability, creation performance ends up in the same ballpark as dataclasses.


@FBruzzesi
Owner Author

FBruzzesi commented Dec 16, 2025

Thanks @danielgafni @dangotbanned for your reviews, they made some very good points I will address 🙏🏼


P.S.: I am thinking of renaming the class to AnyField just to embrace the library name 😂

@FBruzzesi
Owner Author

FBruzzesi commented Dec 16, 2025

@danielgafni

  1. Currently, we expect nullable and unique to always be set, even though sometimes we actually have no information about them initially, for example with Narwhals or Polars columns. We default them to False, but I feel like that may be the wrong thing to do. You might want to explore allowing them to be bool | None as well. I'm not sure if this is the way, but it's something to consider. You could add a few convenience properties like known_nullability or something like this.

I don't think it shows in the commits, but the fact that I changed the nullability default while developing was already a sign to me that something was off, and this comment is a second hint in that direction.

My reasoning was as follows:

  • For nullable to be True, one can always pass T | None or Optional[T]
  • However, this means that if a field is not declared like that, the opposite is assumed to hold

I was thinking in the following direction. Suppose you have a pydantic model like the following

from pydantic import BaseModel

class User(BaseModel):
    name: str
    email: str | None

Now if you parse/validate a list of users and want to convert them to a dataframe, then the name column would NOT be nullable, while the email one can be:

import polars as pl
import pyarrow as pa

users = [
    User(name="francesco", email="not-my-email@fake.com"),
    User(name="daniel", email=None),
    # User(name=None, email="not-my-email@fake.com") -> however this would raise
]

pl.DataFrame([user.model_dump() for user in users])
shape: (2, 2)
┌───────────┬───────────────────────┐
│ name      ┆ email                 │
│ ---       ┆ ---                   │
│ str       ┆ str                   │
╞═══════════╪═══════════════════════╡
│ francesco ┆ not-my-email@fake.com │
│ daniel    ┆ null                  │
└───────────┴───────────────────────┘

pa.Table.from_pylist([user.model_dump() for user in users])
pyarrow.Table
name: string
email: string
----
name: [["francesco","daniel"]]
email: [["not-my-email@fake.com",null]]

I know polars does not distinguish between them, nor does pyarrow unless explicitly passed. Here is where I think it gets interesting: as of this PR it's possible to pass this information to pyarrow:

from anyschema import AnySchema

pa.Table.from_pylist([user.model_dump() for user in users], schema=AnySchema(User).to_arrow())
pyarrow.Table
name: string not null
email: string
----
name: [["francesco","daniel"]]
email: [["not-my-email@fake.com",null]]

and this was not possible before!

Important

Ok, I hear you: all of this needs to be documented somehow and not stay only in my head.

  2. Is there a reason why this isn't a @dataclass? It would simplify some things. You can still have a custom __init__ if needed.

Replied in #76 (comment) on the dataclasses topic

@danielgafni

I see, I'll think about it again tomorrow but it seems like it makes sense.

One more thing: you might want to add a few public methods for Field transformations, e.g. Field.with_attributes or Field.with_metadata, if you are going to change it to a frozen @dataclass. I've noticed you already had to copy .metadata in some places; this may be a little code smell.

@camriddell
Contributor

I see, I'll think about it again tomorrow but it seems like it makes sense.

One more thing: you might want to add a few public methods for Field transformations, e.g. Field.with_attributes or Field.with_metadata, if you are going to change it to a frozen @dataclass. I've noticed you already had to copy .metadata in some places; this may be a little code smell.

If we use a dataclass, we can implement this via dataclasses.replace, or just not implement our own public method to slim down the API surface area (less code written is less code one needs to maintain).
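
A minimal sketch of that route, assuming a frozen Field dataclass like the one sketched earlier:

import dataclasses

import narwhals as nw

field_ = Field("email", nw.String(), nullable=True)
# replace returns a new instance with the given attributes overridden; field_ is untouched
updated = dataclasses.replace(field_, metadata={"anyschema/unique": True})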

@FBruzzesi FBruzzesi changed the title feat: Add Field data structure feat: Add AnyField data structure Dec 16, 2025
@FBruzzesi FBruzzesi marked this pull request as draft December 16, 2025 22:42
@FBruzzesi
Owner Author

Thanks everyone! Converted back to draft and will finish it tomorrow ✨

@FBruzzesi FBruzzesi marked this pull request as ready for review December 17, 2025 19:48
@FBruzzesi
Owner Author

@danielgafni in the last commit (cc21b5b) I (and Claude) added the example of requiring explicit nullability

@danielgafni

danielgafni commented Dec 18, 2025

Alright, this seems ok in isolation but I can't stop thinking about SQLModel. This gets us back to my comment here.

An SQLModel class may represent both a Python-side schema and a DB-side schema. These may be different, e.g. a server-generated column may be Optional in Python. How would you separate these nullability properties, and schemas in general? And then there is this whole DB-Python type mismatch problem, which is an even bigger can of worms than nullability.

Or did you decide to only handle Python-side schemas and not storage schemas?


On the other hand, SQLModel may be... not the best way to do things in general. Perhaps it's trying to do too much. Perhaps we really just need to have 2 separate models for storage and Python schemas.

I'll probably be happy if anyschema could take an SQLModel class and transform it into a DB schema or a dataframe schema with some extra arguments / metadata passed from me.

@FBruzzesi
Owner Author

FBruzzesi commented Dec 18, 2025

Alright, this seems ok in isolation but I can't stop thinking about SQLModel. This gets us back to my comment here.

I'll probably be happy if anyschema could take an SQLModel class and transform it into a DB schema or a dataframe schema with some extra arguments / metadata passed from me.

SQLModel is probably a weird mix that I want to tackle eventually. I will create a dedicated issue for it.

I don't have too much direct experience with it (only hackathons and hobby projects), so I hope the following statements are accurate.

An SQLModel class may represent both a Python-side schema and a DB-side schema. These may be different, e.g. a server-generated column may be Optional in Python. How would you separate these nullability properties, and schemas in general?

Wouldn't this be the same as SQLAlchemy? Annotating a SQLAlchemy column as Optional[T] or T | None marks it as nullable; for SQLModel it flags it as required=False. When translating from the specification to a dataframe schema, especially pyarrow, a nullable column could be omitted when writing to the DB (I can share an example to better explain what I mean). But I see what you mean:

  • In dataframe land, such column can have nulls
  • In DB it cannot since it's automatically populated (e.g. SQLModel primary_key's)

A pattern I see in the SQLModel docs is to distinguish between a class for the field spec and a class for a table (source):

from sqlmodel import Field, SQLModel


class HeroBase(SQLModel):
    name: str = Field(index=True)
    secret_name: str
    age: int | None = Field(default=None, index=True)


class Hero(HeroBase, table=True):
    id: int | None = Field(default=None, primary_key=True)

As of now I am more interested in translating to dataframe schemas, but as discussed in our call, I don't want to limit the project. Does this answer

Or did you decide to only handle Python-side schemas and not storage schemas?

as well?

Regarding:

And then there is this whole DB-Python type mismatch problem, which is an even bigger can of worms than nullability.

I am having a discussion with my brain about whether I should expand/diverge from the narwhals datatypes. There are cases I would like to cover, but on the other hand I am afraid of stumbling upon an infinite variety of datatypes.

But here, looping back on the dataframe validator idea (cc: @camriddell), I am thinking of adding an AnyField.to_validator method (and/or AnySchema.to_validator), something along the following lines (API to be improved, please don't judge the sketch below):

def to_validator(self) -> list[nw.Expr]:
    validators = []
    col = nw.col(self.name)
    if not self.nullable:
        validators.append(col.is_null().any().alias(f"{self.name}__has_nulls"))
    if self.unique:
        validators.append((col.n_unique() != col.len()).alias(f"{self.name}__not_unique"))
    return validators
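
For example, running such validators with narwhals over a polars frame would look like this (expressions inlined from the sketch above, for a hypothetical non-nullable, unique "email" field):

import narwhals as nw
import polars as pl

col = nw.col("email")
validators = [
    col.is_null().any().alias("email__has_nulls"),
    (col.n_unique() != col.len()).alias("email__not_unique"),
]

df = nw.from_native(pl.DataFrame({"email": ["a@b.c", None]}))
print(df.select(validators).to_native())  # email__has_nulls: true, email__not_unique: false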

Now, if we had more dtypes (e.g. UUIDs, URLs, String(min_length, max_length)) we could have other validators. Are UUID and String with max length all we are missing from SQLAlchemy Datatype Objects? Edit: TIL PyArrow has a uuid dtype


But yeah the dtypes can get messy and the validators idea even more so. It's probably a bit out of scope of the original idea I had for the library.

@danielgafni

danielgafni commented Dec 18, 2025

I definitely don't want to move away from Narwhals types.

I was referring to cases where both DB and Python types are Narwhals types, but due to e.g. storage restrictions they differ.

For example, PostgreSQL doesn't really support Structs or Maps, best you can do is to store such structured Narwhals columns as JSON type.

Ideally I want to be able to express this with Anyschema: it should pick the JSON type from sqlalchemy, but the DataFrame type should be a proper nw.Struct (which we'll probably have to provide manually).

@FBruzzesi
Owner Author

For example, PostgreSQL doesn't really support Structs or Maps, best you can do is to store such structured Narwhals columns as JSON type.

Ideally I want to be able to express this with Anyschema: it should pick the JSON type from sqlalchemy, but the DataFrame type should be a proper nw.Struct (which we'll probably have to provide manually).

Next in scope is to allow easy custom mapping of fields with the metadata value under "anyschema/dtype" (or similar keyword). So in SQLAlchemy you will be able to do:

import narwhals as nw
from sqlalchemy import JSON, Column

columns = [
    Column(
        "json_type_1",
        JSON(),
        info={"anyschema/dtype": nw.Struct(...)},
    ),
    Column(
        "json_type_2",
        JSON(),
        info={"anyschema/dtype": nw.Struct(...)},  # a different struct from the one above
    ),
]

If a dtype should always be remapped to a given narwhals type, then you can write a custom parser. I want to improve the ergonomics of adding a parser to the pipeline, but this is already possible.

@danielgafni

danielgafni commented Dec 19, 2025

Dagster recently added support for a special metadata key, provided exactly via json_schema_extra.

And apparently they did not go with a dagster/ spec because it may break OpenAPI generators and other tooling typically working with these JSON schemas.

It seems like the best way is to use a nested dict, and maybe have it start with x- if there is a chance it's going to be used by OpenAPI generators:

json_schema_extra={
    "x-anyschema": {  # plain "anyschema" is fine as well
        "nullable": True
    }
}

This doesn't break normal json schema tooling.
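
For illustration, a hypothetical pydantic field carrying that key (the x-anyschema name follows the suggestion above):

from pydantic import BaseModel, Field

class User(BaseModel):
    email: str | None = Field(default=None, json_schema_extra={"x-anyschema": {"nullable": True}})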


I suggest having a chat with Claude about this. Apparently it's not as straightforward as it seems.

@FBruzzesi
Owner Author

Thanks everyone for the feedback 🚀 merging now 😇

@FBruzzesi FBruzzesi merged commit 6c7e8f8 into main Dec 21, 2025
11 of 12 checks passed
@FBruzzesi FBruzzesi deleted the feat/field-data-structure branch December 21, 2025 19:13