feat: Add AnyField data structure
#76
Conversation
Typical ping to: @danielgafni @camriddell @dangotbanned It's a large diff but split as follows:
anyschema/_anyschema.py (Outdated)

```python
dtype: DType
nullable: bool
unique: bool
metadata: dict[str, Any]
```
Do you have a use-case in mind for Field.metadata being mutable after construction?
If not, I think it could simplify things for hashing to use something like types.MappingProxyType:
```python
import copy
import types
from collections.abc import Mapping
from typing import Any

from narwhals.dtypes import DType


class Field:
    name: str
    dtype: DType
    nullable: bool
    unique: bool
    metadata: Mapping[str, Any]

    def __init__(
        self,
        name: str,
        dtype: DType,
        *,
        nullable: bool = False,
        unique: bool = False,
        metadata: Mapping[str, Any] | None = None,
    ) -> None:
        self.name = name
        self.dtype = dtype
        self.nullable = nullable
        self.unique = unique
        self.metadata = types.MappingProxyType(
            copy.deepcopy(metadata) if metadata is not None else {}
        )
```

It's not quite immutability, but it's a bit less dynamic than `dict`.
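As a quick illustration of the trade-off (not from the PR): a `MappingProxyType` is a read-only view, so lookups behave like a normal `dict` while item assignment raises `TypeError`:

```python
import types

# Read-only *view*: lookups behave like a dict, writes raise TypeError.
metadata = types.MappingProxyType({"anyschema/nullable": True})

print(metadata["anyschema/nullable"])  # -> True

try:
    metadata["anyschema/unique"] = False  # type: ignore[index]
except TypeError:
    print("mutation rejected")  # -> mutation rejected
```

It's "shallow" protection: nested mutable values inside the wrapped dict can still be mutated, which is why the snippet above also deep-copies the input.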
Currently supported special metadata keys:

* `"anyschema/nullable"`: Specifies whether the field can contain null values.
* `"anyschema/unique"`: Specifies whether all values in the field must be unique.
Is it too late to change this convention?
Since / in a key prevents it from being a valid python identifier, using regular TypedDict syntax isn't possible:
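For context, the functional `TypedDict` form is the usual workaround for such keys; a sketch (the `TypedDict` name here is made up, only the metadata keys come from the PR):

```python
from typing import TypedDict

# Functional syntax is required because "anyschema/nullable" is not a
# valid Python identifier, so the class-based syntax cannot declare it.
AnySchemaMetadata = TypedDict(
    "AnySchemaMetadata",
    {"anyschema/nullable": bool, "anyschema/unique": bool},
    total=False,
)

meta: AnySchemaMetadata = {"anyschema/nullable": True}
```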
I think these keys would be typically added to library-specific field specifications by users ad-hoc, so it should be fine.
@dangotbanned I was about to pull the functional syntax myself but you are already aware of it. Do you have any proposal for something which is a valid python identifier?
Don't want to de-rail this, but is there a specific reason for keys like "anyschema/nullable" instead of using a nested dict {"anyschema": {"nullable": ..., "unique": ...}} The latter should be easier to handle/verify in code. In the former case we need to iterate across all keys and find the ones that begin with "anyschema/", whereas in the latter case you can just check for the existence of "anyschema" in the metadata.
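A small sketch of that difference, with hypothetical helper names: the flat convention scans every key for the prefix, while the nested one is a single lookup:

```python
from typing import Any


def extract_flat(metadata: dict[str, Any]) -> dict[str, Any]:
    # Flat convention: scan all keys for the "anyschema/" prefix.
    prefix = "anyschema/"
    return {
        k.removeprefix(prefix): v
        for k, v in metadata.items()
        if k.startswith(prefix)
    }


def extract_nested(metadata: dict[str, Any]) -> dict[str, Any]:
    # Nested convention: a single constant-time lookup.
    return metadata.get("anyschema", {})


flat = {"anyschema/nullable": True, "other": 1}
nested = {"anyschema": {"nullable": True}, "other": 1}
assert extract_flat(flat) == extract_nested(nested) == {"nullable": True}
```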
> @dangotbanned I was about to pull the functional syntax myself but you are already aware of it. Do you have any proposal for something which is a valid python identifier?
Something like @camriddell's suggestion (#76 (comment)) could work quite nicely.
Here's what that, and another option which just uses `__anyschema_<key>__`, look like:

```python
from typing_extensions import TypedDict


class MetadataDunder(TypedDict, total=False):
    __anyschema_nullable__: bool
    __anyschema_unique__: bool
    __anyschema_time_zone__: str
    __anyschema_time_unit__: str


class MetadataNested(TypedDict, total=False):
    nullable: bool
    unique: bool
    time_zone: str
    time_unit: str


class MetadataInsideMe(TypedDict):
    __anyschema_metadata__: MetadataNested  # I like this one the most (ignoring the names of the `TypedDict`s)
```
I believe the nested dict is good as well! Probably better than `anyschema/<key>`.
To keep the diff a bit smaller, I will follow up on this on a dedicated PR
danielgafni
left a comment
In general, I think it makes a lot of sense; the PR is already solid.
Some topics for discussion:

- Currently, we expect `nullable` and `unique` to always be set, even though sometimes we actually have no information about them initially, for example with Narwhals or Polars columns. We default them to `False`, but I feel like it may be the wrong thing to do. You might want to explore allowing them to be `bool | None` as well. I'm not sure if this is the way, but it's something to consider. You could add a few convenience properties like `known_nullability` or something like this.
- Is there a reason why this isn't a `@dataclass`? It would simplify some things. You can still have a custom `__init__` if needed.
```python
    and self.metadata == other.metadata
)

def __hash__(self) -> int:
```
At this point, why don't we just use a frozen `@dataclass`?
I see the appeal of that, yet we might end up creating a lot of these fields and the overhead could be noticeable. The container is so simple that I don't think it would end up saving that much typing after all.
I am sure that dataclasses overhead is getting better (as in, it's getting lower with each Python version), but still.
IMO I wouldn't worry about performance at this stage.

- I'm sure it won't be noticeable, especially in data workflows.
- Even if it happens to be noticeable, it's better to take care of this once more important problems, like establishing a good API, are solved. Using `@dataclass` is an easy way to get started with a good API, and you should be able to change this part later on without user-facing changes.
I'm +1 on using a frozen dataclass here; it would reduce the boilerplate code by a ton. If this appears to be a bottleneck in the future, then we can worry about optimization at that point.
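One detail worth flagging for the frozen-dataclass route: dataclasses only generate a usable `__hash__` when every field is hashable, and a `dict` metadata field is not, so a custom `__hash__` is still needed. A sketch under that assumption (with `dtype` simplified to `str`):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str  # stand-in for the narwhals dtype in this sketch
    nullable: bool = False
    unique: bool = False
    metadata: dict[str, Any] = field(default_factory=dict)

    def __hash__(self) -> int:
        # dict isn't hashable, so hash a sorted tuple view of it instead;
        # this explicit __hash__ overrides the dataclass-generated one.
        return hash(
            (self.name, self.dtype, self.nullable, self.unique,
             tuple(sorted(self.metadata.items())))
        )


a = Field("id", "Int64", metadata={"anyschema/unique": True})
b = Field("id", "Int64", metadata={"anyschema/unique": True})
assert a == b and hash(a) == hash(b)
```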
Addressed in f822f75
I also made a benchmarking script. I will share it tomorrow
As promised:

```
Instance Creation (µs per instance)
--------------------------------------------------------------------------------
Current (Manual)               0.266 µs (baseline)
Frozen DC (default_factory)    0.537 µs (+101.9%, +0.271 µs)
... <other attempts>
```

All other ops are in the same ballpark; creation is the one with a (relatively) big difference.
I think it's ok to use it and revisit later. I don't expect a user to rely on the fact that this class is a dataclass.
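The script itself wasn't shared in this thread; a minimal `timeit` sketch of how such a creation-time comparison can be reproduced (class names are illustrative):

```python
import timeit
from dataclasses import dataclass, field


class ManualField:
    # "Current (Manual)": plain __init__, no dataclass machinery.
    def __init__(self, name, dtype, nullable=False):
        self.name = name
        self.dtype = dtype
        self.nullable = nullable


@dataclass(frozen=True)
class FrozenField:
    # "Frozen DC (default_factory)" variant from the table above.
    name: str
    dtype: str
    nullable: bool = False
    metadata: dict = field(default_factory=dict)


n = 100_000
manual = timeit.timeit(lambda: ManualField("a", "Int64"), number=n) / n
frozen = timeit.timeit(lambda: FrozenField("a", "Int64"), number=n) / n
print(f"manual: {manual * 1e6:.3f} µs, frozen dataclass: {frozen * 1e6:.3f} µs")
```

Absolute numbers will vary by machine and Python version; only the relative gap is meaningful.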
You'd still need to do a bit of legwork for argument defaults, but my @dataclass_transform thing in (narwhals-dev/narwhals#2572) will get you the other features from @dataclass you're using IIUC
Feel free to steal anything of interest 😄
- https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/narwhals/_plan/_immutable.py
- https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/narwhals/_plan/_meta.py
- https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/tests/plan/immutable_test.py
There's probably not much code left after removing the general hashing stuff, which I imagine you don't need here anyway
Thanks @dangotbanned - I tried a few approaches by now, and once one adds layers of protection for complete immutability, creation performance is in the same ballpark as with dataclasses.
Thanks @danielgafni @dangotbanned for your reviews, they made some very good points I will address 🙏🏼 P.S.: I am thinking to rename the class to `AnyField`.
I don't think it shows in the commits, but the fact that I changed the nullability default while developing was already a sign that something was off to me, and this comment is a second hint towards that. My reasoning was as follows:
I was thinking in the following direction. Suppose you have a pydantic model like the following:

```python
from pydantic import BaseModel


class User(BaseModel):
    name: str
    email: str | None
```

Now if you parse/validate a list of users and want to convert to a dataframe:

```python
import polars as pl
import pyarrow as pa

users = [
    User(name="francesco", email="not-my-email@fake.com"),
    User(name="daniel", email=None),
    # User(name=None, email="[email protected]") -> however this would raise
]

pl.DataFrame([user.model_dump() for user in users])
```
```
shape: (2, 2)
┌───────────┬───────────────────────┐
│ name      ┆ email                 │
│ ---       ┆ ---                   │
│ str       ┆ str                   │
╞═══════════╪═══════════════════════╡
│ francesco ┆ not-my-email@fake.com │
│ daniel    ┆ null                  │
└───────────┴───────────────────────┘
```
```python
pa.Table.from_pylist([user.model_dump() for user in users])
```
```
pyarrow.Table
name: string
email: string
----
name: [["francesco","daniel"]]
email: [["not-my-email@fake.com",null]]
```

I know polars does not distinguish between them, nor does pyarrow if a schema is not explicitly passed. Here is where I think it gets interesting: as of this PR it's possible to pass this information to pyarrow:

```python
from anyschema import AnySchema

pa.Table.from_pylist([user.model_dump() for user in users], schema=AnySchema(User).to_arrow())
```
```
pyarrow.Table
name: string not null
email: string
----
name: [["francesco","daniel"]]
email: [["not-my-email@fake.com",null]]
```

and this was not possible before!

Important: Ok I hear you, all of this needs to be documented somehow and not stay only in my head.
Replied in #76 (comment) on the dataclasses topic.
I see, I'll think about it again tomorrow but it seems like it makes sense. One more thing: you might want to add a few public methods for …
If we use a dataclass, we can implement this via `dataclasses.replace`, or just not implement our own public method to slim down the API surface area (less code written is less code one needs to maintain).
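For reference, `dataclasses.replace` covers the "copy with one attribute changed" use-case with no extra API surface. A sketch with a simplified `Field`:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str  # simplified stand-in for the narwhals dtype
    nullable: bool = False


f = Field("email", "String", nullable=False)
g = replace(f, nullable=True)  # new instance; the original stays untouched

assert f.nullable is False
assert g.nullable is True and g.name == "email"
```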
Changed the title: "Field data structure" → "AnyField data structure"
Thanks everyone! Converted back to draft and will finish it tomorrow ✨
@danielgafni in the last commit (cc21b5b) I (and Claude) added the example of requiring explicit nullability.
Alright, this seems ok in isolation, but I can't stop thinking about SQLModel. This gets us back to my comment here. An SQLModel class may represent both the Python-side schema and the DB-side schema. These may be different, e.g. a server-generated column may be … Or did you decide to only handle Python-side schemas and not storage schemas?

On the other hand, SQLModel may be... not the best way to do things in general. Perhaps it's trying to do too much. Perhaps we really just need to have 2 separate models for storage and Python schemas. I'll probably be happy if …
SQLModel is probably a weird mix that I want to tackle eventually. I will create a dedicated issue for it. I don't have too much direct experience with it (only hackathons and hobby projects), so I hope the following statements are accurate.

Wouldn't this be the same as SQLAlchemy? Annotating a SQLAlchemy column as …

A pattern I see in the SQLModel docs is to distinguish between a class for the field spec and a class for a table (source):

```python
class HeroBase(SQLModel):
    name: str = Field(index=True)
    secret_name: str
    age: int | None = Field(default=None, index=True)


class Hero(HeroBase, table=True):
    id: int | None = Field(default=None, primary_key=True)
```

As of now I am more interested in translating to dataframe schemas, but as discussed in our call, I don't want to limit the project. Does this answer to … as well?

Regarding:
I am having a discussion with my brain about whether I should expand/diverge from the narwhals datatypes. There are cases I would like to cover, but on the other hand I am afraid of stumbling upon an infinite variety of datatypes. But here, looping back on the dataframe validator idea (cc: @camriddell), I am thinking to add a `to_validator` method:

```python
import narwhals as nw
from narwhals import Expr


def to_validator(self) -> list[Expr]:
    validators = []
    col = nw.col(self.name)
    if not self.nullable:
        validators.append(col.is_null().any().alias(f"{self.name}__has_nulls"))
    if self.unique:
        validators.append((col.n_unique() != col.len()).alias(f"{self.name}__not_unique"))
    return validators
```

Now with … if we had more dtypes (e.g. …). But yeah, the dtypes can get messy and the validators idea even more so. It's probably a bit out of scope of the original idea I had for the library.
I definitely don't want to move away from Narwhals types. I was referring to cases where both the DB and Python types are Narwhals types, but due to e.g. storage restrictions they differ. For example, PostgreSQL doesn't really support Structs or Maps; the best you can do is to store such structured Narwhals columns as the JSON type. Ideally I want to be able to express this with Anyschema: it should pick the JSON type from sqlalchemy, but the DataFrame type should be a proper `nw.Struct` (which we'll probably have to provide manually).
Next in scope is to allow easy custom mapping of fields with the metadata value under `"anyschema/dtype"`:

```python
Column(
    "json_type_1",
    JSON(),
    info={"anyschema/dtype": nw.Struct(...)},
),
Column(
    "json_type_2",
    JSON(),
    info={"anyschema/dtype": nw.Struct(...)},  # different struct from above
),
```

If a dtype should always be remapped to a given narwhals type, then you can write a custom parser. I want to improve the ergonomics of adding a parser in the pipeline, but this is already possible.
Dagster recently added support for a special metadata key provided exactly via … And apparently they did not go with …

It seems like the best way is to use a nested dict, and maybe have it start with `x-`:

```
json_schema_extra={
    "x-anyschema": {  # "anyschema" is fine as well
        "nullable": true
    }
}
```

This doesn't break normal JSON Schema tooling. I suggest having a chat with Claude about this. Apparently it's not as straightforward as it seems.
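A sketch of reading such a nested key back out of a JSON-schema dict, assuming the `x-anyschema` spelling under discussion (plain dict handling, no Pydantic dependency; the helper name is made up):

```python
from typing import Any


def anyschema_extras(json_schema: dict[str, Any]) -> dict[str, Any]:
    # One lookup; unknown "x-..." keys are ignored by normal JSON Schema tooling.
    return json_schema.get("x-anyschema", {})


schema_fragment = {
    "type": "string",
    "x-anyschema": {"nullable": True},
}
assert anyschema_extras(schema_fragment) == {"nullable": True}
assert anyschema_extras({"type": "integer"}) == {}
```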
Thanks everyone for the feedback 🚀 merging now 😇
Description

- `Field` data structure to represent schema fields with enhanced nullability and metadata handling
- `to_arrow()` to include nullability and metadata

Possible follow up for the `Field` class:

- `"anyschema/description"`, via library-dependent `description` or variable docstrings
- `AnySchema` to get names, dtypes, nullability, uniqueness

Type of Change

Related Issues

Changes Made

Checklist