README.md: 10 additions & 15 deletions
@@ -10,17 +10,12 @@ If you're unsatisfied that documenting your data models has remained an aftertho
## Installation
-Install latest release, or copy [`affinity.py`](https://raw.githubusercontent.com/liquidcarbon/affinity/main/affinity.py) into your project. It's only one file.
+Install with any flavor of `pip install affinity`, or copy [`affinity.py`](https://raw.githubusercontent.com/liquidcarbon/affinity/main/affinity.py) into your project. It's only one file.
The name `affinity` on PyPI is taken by some project from 2006; once my [pending request](https://github.com/pypi/support/issues/5148) to claim the name comes through, it will be published to PyPI.
## Usage
-Now all your data models can be declared as python classes.
+Now all your data models can be concisely declared as python classes.
Affinity makes class declaration as concise as possible.
All you need to create a data class are typed classes and comments explaining what the fields mean.
#### 1. Declare class
@@ -96,7 +89,7 @@ The class attributes are instantiated Vector objects of zero length. Using the
#### 2. Build class instance from querying a CSV
-To build the dataset, we use `IsotopeData.build()` method with `query` argument. We use DuckDB [FROM-first syntax](https://duckdb.org/docs/sql/query_syntax/from.html#from-first-syntax), with `rename=True` keyword argument. The fields in the query result will be assigned names and types provided in the class definition. With `rename=False` (default), the source columns must be named exactly as class attributes. When safe type casting is not possible, an error will be raised; element with z=128 would not fit this dataset.
+To build the dataset, we use `IsotopeData.build()` method with `query` argument. We use DuckDB [FROM-first syntax](https://duckdb.org/docs/sql/query_syntax/from.html#from-first-syntax), with `rename=True` keyword argument. The fields in the query result will be assigned names and types provided in the class definition. With `rename=False` (default), the source columns must be named exactly as class attributes. When safe type casting is not possible, an error will be raised; element with z=128 would not fit this dataset. Good thing there isn't one (not even as a Wikipedia article)!
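The z=128 remark suggests that the atomic number field is stored as a signed 8-bit integer. The class declaration is not shown in this diff, so treat that as an assumption. Under it, here is a minimal standard-library sketch of why such a value cannot be cast safely. The helper name `fits_int8` is made up for illustration and is not part of affinity.

```python
import struct


def fits_int8(value: int) -> bool:
    """Return True if value can be stored as a signed 8-bit integer."""
    try:
        # "b" is the struct format code for a signed char, i.e. int8.
        struct.pack("b", value)
        return True
    except struct.error:
        # Raised when value falls outside -128..127.
        return False


# Assumption: z is an int8 column, as the z=128 remark implies.
print(fits_int8(118))  # highest atomic number known today
print(fits_int8(128))  # one past the int8 maximum of 127
```

A library that casts safely would raise in the second case rather than silently wrapping the value around to a negative number.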
> Though in all examples here the comment field is a string, Arrow allows non-string data in Parquet metadata (some caveats apply). If you're packaging multidimensional vectors, check out "test_objects_as_metadata" in the [test file](https://github.com/liquidcarbon/affinity/blob/main/test_affinity.py).
+
#### 5. Inspect metadata using DuckDB
DuckDB provides several functions for [querying Parquet metadata](https://duckdb.org/docs/data/parquet/metadata.html). We're specifically interested in key-value metadata, where both keys and values are of type `BLOB`. It can be decoded on the fly using `SELECT DECODE(key), DECODE(value) FROM parquet_kv_metadata(...)`, or like so:
@@ -230,8 +226,7 @@ filepaths[:3], datasets[:3]
# abundance = [0.0759, 0.9241]])
```
-
-
+If you work with AWS Athena, also check out `kwargs_for_create_athena_table` method available on all Datasets.