feat: Add AnyField data structure
#76
Conversation
Typical ping to: @danielgafni @camriddell @dangotbanned It's a large diff but split as follows:
anyschema/_anyschema.py (Outdated)

```python
dtype: DType
nullable: bool
unique: bool
metadata: dict[str, Any]
```
Do you have a use-case in mind for Field.metadata being mutable after construction?
If not, I think it could simplify things for hashing to use something like types.MappingProxyType:
```python
import copy
import types
from collections.abc import Mapping
from typing import Any

from narwhals.dtypes import DType


class Field:
    name: str
    dtype: DType
    nullable: bool
    unique: bool
    metadata: Mapping[str, Any]

    def __init__(
        self,
        name: str,
        dtype: DType,
        *,
        nullable: bool = False,
        unique: bool = False,
        metadata: Mapping[str, Any] | None = None,
    ) -> None:
        self.name = name
        self.dtype = dtype
        self.nullable = nullable
        self.unique = unique
        self.metadata = types.MappingProxyType(
            copy.deepcopy(metadata) if metadata is not None else {}
        )
```

It's not quite immutability, but it's a bit less dynamic than `dict`.
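As a quick illustration of the trade-off (not from the PR): a `MappingProxyType` is a read-only view, so lookups behave like a normal `dict` while item assignment raises `TypeError`:

```python
import types

# Read-only *view*: lookups behave like a dict, writes raise TypeError.
metadata = types.MappingProxyType({"anyschema/nullable": True})

print(metadata["anyschema/nullable"])  # -> True

try:
    metadata["anyschema/unique"] = False  # type: ignore[index]
except TypeError:
    print("mutation rejected")  # -> mutation rejected
```

It's "shallow" protection: nested mutable values inside the wrapped dict can still be mutated, which is why the snippet above also deep-copies the input.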
Currently supported special metadata keys:

* `"anyschema/nullable"`: Specifies whether the field can contain null values.
* `"anyschema/unique"`: Specifies whether all values in the field must be unique.
Is it too late to change this convention?
Since / in a key prevents it from being a valid python identifier, using regular TypedDict syntax isn't possible:
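For context, the functional `TypedDict` form is the usual workaround for such keys; a sketch (the `TypedDict` name here is made up, only the metadata keys come from the PR):

```python
from typing import TypedDict

# Functional syntax is required because "anyschema/nullable" is not a
# valid Python identifier, so the class-based syntax cannot declare it.
AnySchemaMetadata = TypedDict(
    "AnySchemaMetadata",
    {"anyschema/nullable": bool, "anyschema/unique": bool},
    total=False,
)

meta: AnySchemaMetadata = {"anyschema/nullable": True}
```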
I think these keys would be typically added to library-specific field specifications by users ad-hoc, so it should be fine.
@dangotbanned I was about to pull the functional syntax myself but you are already aware of it. Do you have any proposal for something which is a valid python identifier?
Don't want to de-rail this, but is there a specific reason for keys like "anyschema/nullable" instead of using a nested dict {"anyschema": {"nullable": ..., "unique": ...}} The latter should be easier to handle/verify in code. In the former case we need to iterate across all keys and find the ones that begin with "anyschema/", whereas in the latter case you can just check for the existence of "anyschema" in the metadata.
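A small sketch of that difference, with hypothetical helper names: the flat convention scans every key for the prefix, while the nested one is a single lookup:

```python
from typing import Any


def extract_flat(metadata: dict[str, Any]) -> dict[str, Any]:
    # Flat convention: scan all keys for the "anyschema/" prefix.
    prefix = "anyschema/"
    return {
        k.removeprefix(prefix): v
        for k, v in metadata.items()
        if k.startswith(prefix)
    }


def extract_nested(metadata: dict[str, Any]) -> dict[str, Any]:
    # Nested convention: a single constant-time lookup.
    return metadata.get("anyschema", {})


flat = {"anyschema/nullable": True, "other": 1}
nested = {"anyschema": {"nullable": True}, "other": 1}
assert extract_flat(flat) == extract_nested(nested) == {"nullable": True}
```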
> @dangotbanned I was about to pull the functional syntax myself but you are already aware of it. Do you have any proposal for something which is a valid python identifier?
Something like @camriddell's suggestion (#76 (comment)) could work quite nicely.
Here's what that, and another option which just uses `__anyschema_<key>__`, look like:

```python
from typing_extensions import TypedDict


class MetadataDunder(TypedDict, total=False):
    __anyschema_nullable__: bool
    __anyschema_unique__: bool
    __anyschema_time_zone__: str
    __anyschema_time_unit__: str


class MetadataNested(TypedDict, total=False):
    nullable: bool
    unique: bool
    time_zone: str
    time_unit: str


class MetadataInsideMe(TypedDict):
    __anyschema_metadata__: MetadataNested  # I like this one the most (ignoring the names of the `TypedDict`s)
```
I believe the nested dict is good as well! Probably better than `anyschema/<key>`.
To keep the diff a bit smaller, I will follow up on this on a dedicated PR
danielgafni
left a comment
In general, I think it makes a lot of sense; the PR is already solid.
Some topics for discussion:

- Currently, we expect `nullable` and `unique` to always be set, even though sometimes we actually have no information about them initially, for example with Narwhals or Polars columns. We default them to `False`, but I feel like it may be the wrong thing to do. You might want to explore allowing them to be `bool | None` as well. I'm not sure if this is the way, but it's something to consider. You could add a few convenience properties like `known_nullability` or something like this.
- Is there a reason why this isn't a `@dataclass`? It would simplify some things. You can still have a custom `__init__` if needed.
```python
    and self.metadata == other.metadata
)

def __hash__(self) -> int:
```
At this point, why don't we just use a frozen `@dataclass`?
I see the appeal of that, yet we might end up creating a lot of these fields and the overhead could be noticeable. The container is so simple that I don't think it would end up saving that much typing after all.
I am sure that dataclasses overhead is getting better (as in, it's getting lower with each Python version), but still.
IMO I wouldn't worry about performance at this stage.

- I'm sure it won't be noticeable, especially in data workflows.
- Even if it happens to be noticeable, it's better to take care of this once more important problems, like establishing a good API, are solved. Using `@dataclass` is an easy way to get started with a good API, and you should be able to change this part later on without user-facing changes.
I'm +1 on using a frozen dataclass here; it would reduce the boilerplate code by a ton. If this appears to be a bottleneck in the future, then we can worry about optimization at that point.
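One detail worth flagging for the frozen-dataclass route: dataclasses only generate a usable `__hash__` when every field is hashable, and a `dict` metadata field is not, so a custom `__hash__` is still needed. A sketch under that assumption (with `dtype` simplified to `str`):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str  # stand-in for the narwhals dtype in this sketch
    nullable: bool = False
    unique: bool = False
    metadata: dict[str, Any] = field(default_factory=dict)

    def __hash__(self) -> int:
        # dict isn't hashable, so hash a sorted tuple view of it instead;
        # this explicit __hash__ overrides the dataclass-generated one.
        return hash(
            (self.name, self.dtype, self.nullable, self.unique,
             tuple(sorted(self.metadata.items())))
        )


a = Field("id", "Int64", metadata={"anyschema/unique": True})
b = Field("id", "Int64", metadata={"anyschema/unique": True})
assert a == b and hash(a) == hash(b)
```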
Addressed in f822f75
I also made a benchmarking script. I will share it tomorrow
As promised:

```
Instance Creation (µs per instance)
--------------------------------------------------------------------------------
Current (Manual)               0.266 µs (baseline)
Frozen DC (default_factory)    0.537 µs (+101.9%, +0.271 µs)
... <other attempts>
```

All other ops are in the same ballpark; creation is the one with a (relatively) big difference.
I think it's ok to use it and revisit later. I don't expect a user to rely on the fact that this class is a dataclass.
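The script itself wasn't shared in this thread; a minimal `timeit` sketch of how such a creation-time comparison can be reproduced (class names are illustrative):

```python
import timeit
from dataclasses import dataclass, field


class ManualField:
    # "Current (Manual)": plain __init__, no dataclass machinery.
    def __init__(self, name, dtype, nullable=False):
        self.name = name
        self.dtype = dtype
        self.nullable = nullable


@dataclass(frozen=True)
class FrozenField:
    # "Frozen DC (default_factory)" variant from the table above.
    name: str
    dtype: str
    nullable: bool = False
    metadata: dict = field(default_factory=dict)


n = 100_000
manual = timeit.timeit(lambda: ManualField("a", "Int64"), number=n) / n
frozen = timeit.timeit(lambda: FrozenField("a", "Int64"), number=n) / n
print(f"manual: {manual * 1e6:.3f} µs, frozen dataclass: {frozen * 1e6:.3f} µs")
```

Absolute numbers will vary by machine and Python version; only the relative gap is meaningful.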
You'd still need to do a bit of legwork for argument defaults, but my @dataclass_transform thing in (narwhals-dev/narwhals#2572) will get you the other features from @dataclass you're using IIUC
Feel free to steal anything of interest 😄
- https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/narwhals/_plan/_immutable.py
- https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/narwhals/_plan/_meta.py
- https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/tests/plan/immutable_test.py
There's probably not much code left after removing the general hashing stuff, which I imagine you don't need here anyway
Thanks @dangotbanned - I tried a few approaches by now, and once one adds layers of protection for complete immutability, creation performance is in the same ballpark as with dataclasses.
Thanks @danielgafni @dangotbanned for your reviews, they made some very good points I will address 🙏🏼 P.S.: I am thinking to rename the class to `AnyField`.
I don't think it shows in the commits, but the fact that I changed the nullability default while developing was already a sign that something was off to me, and this comment is a second hint towards that. My reasoning was as follows:
I was thinking in the following direction. Suppose you have a pydantic model like the following:

```python
from pydantic import BaseModel


class User(BaseModel):
    name: str
    email: str | None
```

Now if you parse/validate a list of users and want to convert to a dataframe:

```python
import polars as pl
import pyarrow as pa

users = [
    User(name="francesco", email="not-my-email@fake.com"),
    User(name="daniel", email=None),
    # User(name=None, email="[email protected]") -> however this would raise
]

pl.DataFrame([user.model_dump() for user in users])
```
```
shape: (2, 2)
┌───────────┬───────────────────────┐
│ name      ┆ email                 │
│ ---       ┆ ---                   │
│ str       ┆ str                   │
╞═══════════╪═══════════════════════╡
│ francesco ┆ not-my-email@fake.com │
│ daniel    ┆ null                  │
└───────────┴───────────────────────┘
```
```python
pa.Table.from_pylist([user.model_dump() for user in users])
```
```
pyarrow.Table
name: string
email: string
----
name: [["francesco","daniel"]]
email: [["not-my-email@fake.com",null]]
```

I know polars does not distinguish between them, nor does pyarrow if a schema is not explicitly passed. Here is where I think it gets interesting: as of this PR it's possible to pass this information to pyarrow:

```python
from anyschema import AnySchema

pa.Table.from_pylist([user.model_dump() for user in users], schema=AnySchema(User).to_arrow())
```
```
pyarrow.Table
name: string not null
email: string
----
name: [["francesco","daniel"]]
email: [["not-my-email@fake.com",null]]
```

and this was not possible before!

Important: Ok I hear you, all of this needs to be documented somehow and not stay only in my head.
Replied in #76 (comment) on the dataclasses topic.
I see, I'll think about it again tomorrow but it seems like it makes sense. One more thing: you might want to add a few public methods for …
If we use a dataclass, we can implement this via `dataclasses.replace`, or just not implement our own public method to slim down the API surface area (less code written is less code one needs to maintain).
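For reference, `dataclasses.replace` covers the "copy with one attribute changed" use-case with no extra API surface. A sketch with a simplified `Field`:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str  # simplified stand-in for the narwhals dtype
    nullable: bool = False


f = Field("email", "String", nullable=False)
g = replace(f, nullable=True)  # new instance; the original stays untouched

assert f.nullable is False
assert g.nullable is True and g.name == "email"
```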
Changed the title: "Field data structure" → "AnyField data structure"
Thanks everyone! Converted back to draft and will finish it tomorrow ✨
@danielgafni in the last commit (cc21b5b) I (and Claude) added the example of requiring explicit nullability.
Alright, this seems ok in isolation, but I can't stop thinking about SQLModel. This gets us back to my comment here. An SQLModel class may represent both the Python-side schema and the DB-side schema. These may be different, e.g. a server-generated column may be … Or did you decide to only handle Python-side schemas and not storage schemas?

On the other hand, SQLModel may be... not the best way to do things in general. Perhaps it's trying to do too much. Perhaps we really just need to have 2 separate models for storage and Python schemas. I'll probably be happy if …
SQLModel is probably a weird mix that I want to tackle eventually. I will create a dedicated issue for it. I don't have too much direct experience with it (only hackathons and hobby projects), so I hope the following statements are accurate.

Wouldn't this be the same as SQLAlchemy? Annotating a SQLAlchemy column as …

A pattern I see in the SQLModel docs is to distinguish between a class for the field spec and a class for a table (source):

```python
class HeroBase(SQLModel):
    name: str = Field(index=True)
    secret_name: str
    age: int | None = Field(default=None, index=True)


class Hero(HeroBase, table=True):
    id: int | None = Field(default=None, primary_key=True)
```

As of now I am more interested in translating to dataframe schemas, but as discussed in our call, I don't want to limit the project. Does this answer to … as well?

Regarding:
I am having a discussion with my brain about whether I should expand/diverge from the narwhals datatypes. There are cases I would like to cover, but on the other hand I am afraid of stumbling upon an infinite variety of datatypes. But here, looping back on the dataframe validator idea (cc: @camriddell), I am thinking to add a `to_validator` method:

```python
import narwhals as nw
from narwhals import Expr


def to_validator(self) -> list[Expr]:
    validators = []
    col = nw.col(self.name)
    if not self.nullable:
        validators.append(col.is_null().any().alias(f"{self.name}__has_nulls"))
    if self.unique:
        validators.append((col.n_unique() != col.len()).alias(f"{self.name}__not_unique"))
    return validators
```

Now with … if we had more dtypes (e.g. …). But yeah, the dtypes can get messy and the validators idea even more so. It's probably a bit out of scope of the original idea I had for the library.
I definitely don't want to move away from Narwhals types. I was referring to cases where both the DB and Python types are Narwhals types, but due to e.g. storage restrictions they differ. For example, PostgreSQL doesn't really support Structs or Maps; the best you can do is to store such structured Narwhals columns as the JSON type. Ideally I want to be able to express this with Anyschema: it should pick the JSON type from sqlalchemy, but the DataFrame type should be a proper `nw.Struct` (which we'll probably have to provide manually).
Next in scope is to allow easy custom mapping of fields with the metadata value under `"anyschema/dtype"`:

```python
Column(
    "json_type_1",
    JSON(),
    info={"anyschema/dtype": nw.Struct(...)},
),
Column(
    "json_type_2",
    JSON(),
    info={"anyschema/dtype": nw.Struct(...)},  # different struct from above
),
```

If a dtype should always be remapped to a given narwhals type, then you can write a custom parser. I want to improve the ergonomics of adding a parser in the pipeline, but this is already possible.
Dagster recently added support for a special metadata key provided exactly via … And apparently they did not go with …

It seems like the best way is to use a nested dict, and maybe have it start with `x-`:

```
json_schema_extra={
    "x-anyschema": {  # "anyschema" is fine as well
        "nullable": true
    }
}
```

This doesn't break normal JSON Schema tooling. I suggest having a chat with Claude about this. Apparently it's not as straightforward as it seems.
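A sketch of reading such a nested key back out of a JSON-schema dict, assuming the `x-anyschema` spelling under discussion (plain dict handling, no Pydantic dependency; the helper name is made up):

```python
from typing import Any


def anyschema_extras(json_schema: dict[str, Any]) -> dict[str, Any]:
    # One lookup; unknown "x-..." keys are ignored by normal JSON Schema tooling.
    return json_schema.get("x-anyschema", {})


schema_fragment = {
    "type": "string",
    "x-anyschema": {"nullable": True},
}
assert anyschema_extras(schema_fragment) == {"nullable": True}
assert anyschema_extras({"type": "integer"}) == {}
```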
Thanks everyone for the feedback 🚀 merging now 😇
Description

- `Field` data structure to represent schema fields with enhanced nullability and metadata handling
- `to_arrow()` to include nullability and metadata

Possible follow up for the `Field` class:

- `"anyschema/description"`, via library-dependent `description` or variable docstrings
- `AnySchema` to get names, dtypes, nullability, uniqueness

Type of Change

Related Issues

Changes Made

Checklist