Commit ea291c2

Docs 1809: lazy dataframe documentation and example notebook (#1815)
#### Reference Issues/PRs

Closes #1809

#### What does this implement or fix?

Adds documentation and an example notebook for the new `LazyDataFrame` API for processing operations. Also updates the existing example notebooks to use `LazyDataFrame` instead of `QueryBuilder`, and updates the `QueryBuilder` demo notebook to note that the `LazyDataFrame` API is preferred. Finally, tweaks the docstrings for `LazyDataFrame` and `LazyDataFrameCollection` to improve rendering and remove duplication of the `QueryBuilder` docstrings.
1 parent fe0516c commit ea291c2

18 files changed

+3386
-2661
lines changed

docs/mkdocs/docs/api/index.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ The API is structured into the following components:
1010

1111
* [**Arctic**](arctic.md): Arctic is the primary API used for accessing and manipulating ArcticDB libraries.
1212
* [**Library**](library.md): The Library API enables reading and manipulating symbols inside ArcticDB libraries.
13-
* [**Query Builder**](query_builder.md): The QueryBuilder API enables the specification of complex queries, utilised in the Library API.
13+
* [**DataFrame Processing Operations API**](processing.md): Details the advanced DataFrame processing operations available within ArcticDB.
1414

1515
Most of the code snippets in the API docs require importing `arcticdb` as `adb`:
1616

docs/mkdocs/docs/api/processing.md

+12
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
DataFrame Processing Operations API
2+
===================================
3+
4+
::: arcticdb.LazyDataFrame
5+
options:
6+
inherited_members: false
7+
8+
::: arcticdb.LazyDataFrameCollection
9+
options:
10+
inherited_members: false
11+
12+
::: arcticdb.QueryBuilder

docs/mkdocs/docs/api/query_builder.md

-4
This file was deleted.

docs/mkdocs/docs/error_messages.md

+12-12
Original file line numberDiff line numberDiff line change
@@ -43,10 +43,10 @@ For legacy reasons, the terms `symbol`, `stream`, and `stream ID` are used inter
4343
| Error Code | Cause | Resolution |
4444
|------------|------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
4545
| 4000 | The number, type, or name of the columns has been changed. | Ensure that the type and order of the columns has not changed when appending or updating the previous version. This restriction only applies when `Dynamic Schema` is disabled - if you require the columns sets to change, please enable the `Dynamic Schema` option on your library. |
46-
| 4001 | The specified column does not exist. | Please specify a valid column - use the `get_description` method to see all of the columns associated with a given symbol. |
47-
| 4002 | The requested operation is not supported with the type of column provided. | Certain operations are not supported over all column types e.g. arithmetic with the `QueryBuilder` over string columns - use the `get_description` method to see all of the columns associated with a given symbol, along with their types. |
48-
| 4003 | The requested operation is not supported with the index type of the symbol provided. | Certain operations are not supported over all index types e.g. column statistics generation with a string index - use the `get_description` method to see the index(es) associated with a given symbol, along with their types. |
49-
| 4004 | The requested operation is not supported with pickled data. | Certain operations are not supported with pickled data e.g. `date_range` filtering. If such operations are required, you must ensure that the data is of a normalizable type, such that it can be written using the `write` method, and does not require the `write_pickle` method. |
46+
| 4001 | The specified column does not exist. | Please specify a valid column - use the `get_description` method to see all of the columns associated with a given symbol. |
47+
| 4002 | The requested operation is not supported with the type of column provided. | Certain operations are not supported over all column types e.g. arithmetic in the processing pipeline over string columns - use the `get_description` method to see all of the columns associated with a given symbol, along with their types. |
48+
| 4003 | The requested operation is not supported with the index type of the symbol provided. | Certain operations are not supported over all index types e.g. column statistics generation with a string index - use the `get_description` method to see the index(es) associated with a given symbol, along with their types. |
49+
| 4004 | The requested operation is not supported with pickled data. | Certain operations are not supported with pickled data e.g. `date_range` filtering. If such operations are required, you must ensure that the data is of a normalizable type, such that it can be written using the `write` method, and does not require the `write_pickle` method. |
5050

5151

5252
### Storage Errors
@@ -92,7 +92,7 @@ For legacy reasons, the terms `symbol`, `stream`, and `stream ID` are used inter
9292

9393
These errors relate to data being pickled, which limits the operations available. Internally, pickled symbols are stored as opaque, serialised binary blobs in the [data layer](technical/on_disk_storage.md#data-layer). No index or column information is maintained in this serialised object which is in contrast to non-pickled data, where this information is stored in the [index layer](technical/on_disk_storage.md#index-layer).
9494

95-
Furthermore, it is not possible to partially read/update/append the data using the ArcticDB API or use the QueryBuilder with pickled symbols.
95+
Furthermore, it is not possible to partially read/update/append the data using the ArcticDB API or use the processing pipeline with pickled symbols.
9696

9797
All of these errors are of type `arcticdb.exceptions.ArcticException`.
9898

@@ -135,20 +135,20 @@ All of these errors are of type `arcticdb.exceptions.ArcticException`.
135135
| Non-contiguous rows, range search on unsorted data?... | `read` method called with the optional `date_range` argument specified, and the symbol has a timestamp index, but it is not sorted. | To use the `date_range` argument to `read`, the user must ensure the data is sorted on the index at write time. |
136136
| Delete in range will not work as expected with a non-timeseries index | `delete_data_in_range` method called, but the symbol does not have a timestamp index. | None, the `delete_data_in_range` method does not make sense without a timestamp index. |
137137

138-
### QueryBuilder errors
138+
### Processing pipeline errors
139139

140-
Due to the client-only nature of ArcticDB, it is not possible to know if a `QueryBuilder` provided to `read` makes sense for the given symbol without interacting with the storage. In particular, we do not know:
140+
Due to the client-only nature of ArcticDB, it is not possible to know if a processing operation applied to a `LazyDataFrame`, or provided to `read` with a `QueryBuilder` object, makes sense for the given symbol without interacting with the storage. In particular, we do not know:
141141

142142
* Whether a specified column exists
143143
* What the type of the data held in a specified column is if it does exist
144144

145145
All of these errors are of type `arcticdb.exceptions.ArcticException`.
146146

147-
| Error messages | Cause | Resolution |
148-
|:--------------|:-------|:-----------|
149-
| Unexpected column name | A column name was specified with the `QueryBuilder` that does not exist for this symbol, and the library has dynamic schema disabled. | None of the supported `QueryBuilder` operations (filtering, projections, group-bys and aggregations) make sense with non-existent columns. |
150-
| Non-numeric type provided to binary operation: <typename\> | Error messages like this imply that an operation that ArcticDB does not support was provided in the `QueryBuilder` argument e.g. adding two string columns together. | The `get_description` method can be used to inspect the types of the columns. A full list of supported operations are provided in the `QueryBuilder` [API documentation](api/query_builder.md). |
151-
| Cannot compare <typename 1\> to <typename 2\> (possible categorical?) | If `get_description` indicates that a column is of categorical type, and this categorical is being used to store string values, then comparisons to other strings will fail with an error message like this one. | Categorical support in ArcticDB is [extremely limited](faq.md#does-arcticdb-support-categorical-data), but may be added in the future. |
147+
| Error messages | Cause | Resolution |
148+
|:--------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
149+
| Unexpected column name | A column name was specified for a processing operation that does not exist for this symbol, and the library has dynamic schema disabled. | Use `get_description` to ensure that column names provided in processing operations exist for the symbol. |
150+
| Non-numeric type provided to binary operation: <typename\> | Error messages like this imply that an operation that ArcticDB does not support was provided in a processing operation e.g. adding two string columns together. | The `get_description` method can be used to inspect the types of the columns. A full list of supported operations is provided in the `QueryBuilder` [API documentation](api/processing.md). |
151+
| Cannot compare <typename 1\> to <typename 2\> (possible categorical?) | If `get_description` indicates that a column is of categorical type, and this categorical is being used to store string values, then comparisons to other strings will fail with an error message like this one. | Categorical support in ArcticDB is [extremely limited](faq.md#does-arcticdb-support-categorical-data), but may be added in the future. |
152152

153153
### Encoding errors
154154

docs/mkdocs/docs/faq.md

+4-2
Original file line numberDiff line numberDiff line change
@@ -105,7 +105,7 @@ Note that this is a library configuration option that is off by default, see [`h
105105

106106
ArcticDB is primarily focused on filtering and transferring data from storage through to memory - at which point Pandas, NumPy, or other standard analytical packages can be utilised for analytics.
107107

108-
That said, ArcticDB does offer a limited set of analytical functions that are executed inside the C++ storage engine offering significant performance benefits over Pandas. For more information, see the documentation for the *QueryBuilder* class.
108+
That said, ArcticDB does offer a limited set of analytical functions that are executed inside the C++ storage engine offering significant performance benefits over Pandas. For more information, see the [documentation](api/processing.md) for the `LazyDataFrame`, `LazyDataFrameCollection`, and `QueryBuilder` classes.
109109

110110
### *What does Pickling mean?*
111111

@@ -172,7 +172,9 @@ Please see the [Runtime Configuration](runtime_config.md#versionstorenumcputhrea
172172

173173
### Does ArcticDB support categorical data?
174174

175-
ArcticDB currently offers extremely limited support for categorical data. Series and DataFrames with categorical columns can be provided to the `write` and `write_batch` methods, and will then behave as expected on `read`. However, `append` and `update` are not yet supported with categorical data, and will raise an exception if attempted. The `QueryBuilder` is also not supported with categorical data, and will either raise an exception, or give incorrect results, depending on the exact operations requested.
175+
ArcticDB currently offers extremely limited support for categorical data. Series and DataFrames with categorical columns can be provided to the `write` and `write_batch` methods, and will then behave as expected on `read`.
176+
However, `append` and `update` are not yet supported with categorical data, and will raise an exception if attempted.
177+
Analytics such as filtering using the `LazyDataFrame` or `QueryBuilder` classes is also not supported with categorical data, and will either raise an exception, or give incorrect results, depending on the exact operations requested.
176178

177179
### How does ArcticDB handle `NaN`?
178180

docs/mkdocs/docs/index.md

+11-4
Original file line numberDiff line numberDiff line change
@@ -213,18 +213,25 @@ _output (the rows in the date range and columns requested)_
213213
2000-01-01 13:00:00 18 8
214214
```
215215

216-
#### Filtering
216+
#### Filtering and Analytics
217217

218-
ArcticDB uses a Pandas-_like_ syntax to describe how to filter data. For more details including the limitations, please view the docstring ([`help(QueryBuilder)`](api/query_builder)).
218+
ArcticDB supports many common DataFrame analytics operations, including filtering, projections, group-bys, aggregations, and resampling. The most intuitive way to access these operations is via the [`LazyDataFrame`](api/processing.md#arcticdb.LazyDataFrame) API, which should feel familiar to experienced users of Pandas or Polars.
219219

220-
!!! info "ArcticDB Filtering Philosophy & Restrictions"
220+
The legacy [`QueryBuilder`](api/processing.md#arcticdb.QueryBuilder) class can also be created directly and passed into `read` calls with the same effect.
221+
222+
!!! info "ArcticDB Analytics Philosophy"
221223

222224
In most cases this is more memory efficient and performant than the equivalent Pandas operation as the processing is within the C++ storage engine and parallelized over multiple threads of execution.
223225

224226
```python
227+
import arcticdb as adb
225228
_range = (df.index[5], df.index[8])
226229
_cols = ['COL_30', 'COL_31']
227-
import arcticdb as adb
230+
# Using lazy evaluation
231+
lazy_df = library.read('test_frame', date_range=_range, columns=_cols, lazy=True)
232+
lazy_df = lazy_df[(lazy_df["COL_30"] > 10) & (lazy_df["COL_31"] < 40)]
233+
df = lazy_df.collect().data
234+
# Using the legacy QueryBuilder class gives the same result
228235
q = adb.QueryBuilder()
229236
q = q[(q["COL_30"] > 10) & (q["COL_31"] < 40)]
230237
library.read('test_frame', date_range=_range, columns=_cols, query_builder=q).data
