
Commit 8af9b3b

release v1

1 parent 3f611e6 commit 8af9b3b

File tree: 2 files changed, +11 −16 lines

README.md

Lines changed: 10 additions & 15 deletions
````diff
@@ -10,17 +10,12 @@ If you're unsatisfied that documenting your data models has remained an aftertho
 
 ## Installation
 
-Install latest release, or copy [`affinity.py`](https://raw.githubusercontent.com/liquidcarbon/affinity/main/affinity.py) into your project. It's only one file.
+Install with any flavor of `pip install affinity`, or copy [`affinity.py`](https://raw.githubusercontent.com/liquidcarbon/affinity/main/affinity.py) into your project. It's only one file.
 
-```
-pip install git+https://github.com/liquidcarbon/affinity.git@latest
-```
-
-The name `affinity` on PyPI is taken by some project from 2006; once my [pending request](https://github.com/pypi/support/issues/5148) to claim the name comes through, it will be published to PyPI.
 
 ## Usage
 
-Now all your data models can be declared as python classes.
+Now all your data models can be concisely declared as python classes.
 
 ```python
 import affinity as af
````
````diff
@@ -34,7 +29,7 @@ class SensorData(af.Dataset):
     exp_id = af.ScalarI32("FK to `experiment`")
     LOCATION = af.Location(folder="s3://mybucket/affinity", file="raw.parquet", partition_by=["channel"])
 
-# this working concept covers the following:
+# how to use affinity Datasets:
 data = SensorData()           # ✅ empty dataset
 data = SensorData(**fields)   # ✅ build manually
 data = SensorData.build(...)  # ✅ build from a source (dataframes, DuckDB)
````
````diff
@@ -51,17 +46,15 @@ data.partition() # ✅ get formatted paths and partitioned da
 
 The `af.Dataset` is Affinity's `BaseModel`, the base class that defines the behavior of children data classes:
 - concise class declaration sets the expected dtypes and descriptions for each attribute (column)
-- class attributes can be represented by any array (default: `pd.Series` because it handles nullable integers well; available: numpy, polars, arrow)
-- class instances can be constructed from any scalars or iterables
-- class instances can be cast into any dataframe flavor, and all their methods are available
+- class attributes can be represented by any array (defaults to `pd.Series` because it handles nullable integers well)
+- class instances can be constructed from scalars, vectors/iterables, or other datasets
 - type hints for scalar and vector data
 
 ![image](https://github.com/user-attachments/assets/613cf6a5-7db8-465d-bb6d-3072e1b7888b)
 
 
 ## Detailed example: Parquet Round-Trip
 
-Affinity makes class declaration as concise as possible.
 All you need to create a data class are typed classes and comments explaining what the fields mean.
 
 #### 1. Declare class
````
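The `af.Dataset` behavior described in the bullets above (typed class attributes carrying comments, collected and instantiated by a base class) follows a common declarative pattern in Python. A minimal stdlib-only sketch of that pattern; the names `Field`/`Dataset` here are hypothetical and this is not affinity's actual implementation:

```python
# Sketch of a declarative "dataset" base class: typed columns with
# human-readable comments, collected from the class body by the base class.
# NOT affinity's implementation; a generic illustration of the pattern.

class Field:
    """A typed column with a comment describing what it means."""
    def __init__(self, dtype, comment=""):
        self.dtype = dtype
        self.comment = comment

class Dataset:
    @classmethod
    def fields(cls):
        # Collect declared Field attributes in definition order.
        return {k: v for k, v in vars(cls).items() if isinstance(v, Field)}

    def __init__(self, **data):
        # Cast each provided iterable to the declared dtype.
        for name, field in self.fields().items():
            setattr(self, name, [field.dtype(v) for v in data.get(name, [])])

class IsotopeData(Dataset):
    symbol = Field(str, "atomic symbol")
    z = Field(int, "atomic number")
    mass = Field(float, "isotope mass, Da")

d = IsotopeData(symbol=["H", "D"], z=[1, 1], mass=[1.008, 2.014])
print(list(IsotopeData.fields()))  # ['symbol', 'z', 'mass']
print(d.mass)                      # [1.008, 2.014]
```

The base class owning field collection and construction is what keeps each subclass declaration down to one line per column.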
````diff
@@ -96,7 +89,7 @@ The class attributes are instantiated Vector objects of zero length. Using the
 
 #### 2. Build class instance from querying a CSV
 
-To build the dataset, we use the `IsotopeData.build()` method with the `query` argument. We use DuckDB [FROM-first syntax](https://duckdb.org/docs/sql/query_syntax/from.html#from-first-syntax) with the `rename=True` keyword argument. The fields in the query result will be assigned the names and types provided in the class definition. With `rename=False` (the default), the source columns must be named exactly as the class attributes. When safe type casting is not possible, an error is raised; an element with z=128 would not fit this dataset.
+To build the dataset, we use the `IsotopeData.build()` method with the `query` argument. We use DuckDB [FROM-first syntax](https://duckdb.org/docs/sql/query_syntax/from.html#from-first-syntax) with the `rename=True` keyword argument. The fields in the query result will be assigned the names and types provided in the class definition. With `rename=False` (the default), the source columns must be named exactly as the class attributes. When safe type casting is not possible, an error is raised; an element with z=128 would not fit this dataset. Good thing there isn't one (not even as a Wikipedia article)!
 
 ```python
 url = "https://raw.githubusercontent.com/liquidcarbon/chembiodata/main/isotopes.csv"
````
````diff
@@ -157,6 +150,9 @@ pf.schema_arrow
 # ' + 146
 ```
 
+> [!TIP]
+> Though in all examples here the comment field is a string, Arrow allows non-string data in Parquet metadata (some caveats apply). If you're packaging multidimensional vectors, check out "test_objects_as_metadata" in the [test file](https://github.com/liquidcarbon/affinity/blob/main/test_affinity.py).
+
 #### 5. Inspect metadata using DuckDB
 
 DuckDB provides several functions for [querying Parquet metadata](https://duckdb.org/docs/data/parquet/metadata.html). We're specifically interested in key-value metadata, where both keys and values are of type `BLOB`. It can be decoded on the fly using `SELECT DECODE(key), DECODE(value) FROM parquet_kv_metadata(...)`, or like so:
````
````diff
@@ -230,8 +226,7 @@ filepaths[:3], datasets[:3]
 # abundance = [0.0759, 0.9241]])
 ```
 
-
-
+If you work with AWS Athena, also check out the `kwargs_for_create_athena_table` method, available on all Datasets.
 
 
 ## Motivation
````

pyproject.toml

Lines changed: 1 addition & 1 deletion
````diff
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "affinity"
-version = "0.9.0"
+version = "1.0.0"
 description = "Module for creating well-documented datasets, with types and annotations."
 authors = [
     { name = "Alex Kislukhin" }
````
